This project classifies essays as either student-written (label `0`) or LLM-generated (label `1`) using a binary neural-network classifier. The model is trained on pre-processed GloVe 100D embeddings and achieves strong performance metrics despite a severely imbalanced dataset.
Given a corpus of essays, determine whether each was written by a student or generated by a large language model (LLM). The key challenges are:
- Severe class imbalance: 1,375 student-written vs. only 3 LLM-generated essays in `train_essays.csv`.
- Text variability, requiring standardized vectorization.
- Model optimization within the computational constraints of Google Colab.
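The standardized vectorization step above (averaging GloVe vectors per essay) can be sketched as follows. This is an illustration, not the project's script: the `essay_to_vector` name and the toy 3-D embedding dictionary are hypothetical, and the real project uses the 100D GloVe vocabulary.

```python
import numpy as np

def essay_to_vector(essay, embeddings, dim=100):
    """Average the GloVe vectors of an essay's tokens.

    `embeddings` maps token -> np.ndarray of shape (dim,);
    tokens missing from the vocabulary are skipped.
    """
    tokens = essay.lower().split()
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)  # all-OOV essay falls back to a zero vector
    return np.mean(vecs, axis=0)

# Toy 3-D "embeddings" just to illustrate the shape handling.
toy = {"the": np.array([1.0, 0.0, 0.0]), "cat": np.array([0.0, 1.0, 0.0])}
vec = essay_to_vector("The cat sat", toy, dim=3)  # "sat" is OOV and skipped
```

Each essay, regardless of length, is reduced to a single fixed-size vector, which is what lets a plain dense network consume the text.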
- Input Layer: 100D averaged GloVe embeddings per essay.
- Hidden Layer: 64 nodes with ReLU activation.
- Dropout Layer: 0.2 (to prevent overfitting).
- Output Layer: Single sigmoid-activated node.
- Loss Function: Binary Cross-Entropy.
- Optimizer: Adam.
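The architecture above can be sketched in Keras. Layer sizes, activations, loss, and optimizer follow the description in this README; everything else (e.g. the learning-rate value shown) is one of the tested configurations, not a claim about the exact project script.

```python
import tensorflow as tf

# Sketch of the architecture described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),             # averaged GloVe vector
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer
    tf.keras.layers.Dropout(0.2),                    # regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(label == 1)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

With a 100D input, a 64-node hidden layer, and a single output node, the model has only 100·64 + 64 + 64 + 1 = 6,529 parameters, which keeps training cheap enough for Google Colab.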
- Hyperparameters Tested:
  - Learning rates: `0.001`, `0.0001`
  - Batch sizes: `32`, `64`
  - Dropout rates: `0.2`, `0.3`
- Best Model Configuration: `learning_rate=0.001`, `batch_size=32`, `dropout_rate=0.2` or `0.3`
- Validation Strategy: Early stopping (patience=3) and threshold optimization (raising the decision threshold from 0.5 to 0.7) boosted performance metrics.
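The threshold-optimization step can be sketched as a search over candidate thresholds for the one that maximizes validation F1. The `best_threshold` helper below is hypothetical; the candidate set mirrors the 0.5-to-0.7 range mentioned above.

```python
import numpy as np

def best_threshold(y_true, y_prob, candidates=(0.5, 0.6, 0.7)):
    """Pick the decision threshold that maximizes F1 on validation data."""
    def f1(t):
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        if tp == 0:
            return 0.0
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)
    return max(candidates, key=f1)
```

Raising the threshold trades recall for precision, which matters here because a default 0.5 cutoff over-predicts the minority class far less reliably than the probabilities suggest.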
| Metric | Value |
|---|---|
| F1-Score | 0.996 |
| Accuracy | 0.997 |
| Threshold | 0.7 |
Note: High results should be interpreted cautiously due to class imbalance.
The following files are required to train and run this project:
| Filename | Description |
|---|---|
| `train_essays.csv` | Contains 1,378 labeled essays for training. |
| `train_prompts.csv` | Includes prompts associated with training essays. |
| `test_essays.csv` | Test set for generating final predictions. |
| `sample_submissions.csv` | Format for the final submission file. |
Ensure these files are placed in the root directory or update file paths accordingly.
- Clone this repo and place the necessary CSV files in the root folder.
- Run `cpsc444_final_tf.py` in Google Colab or locally.
- Adjust model thresholds and evaluate using validation metrics.
- Increase the LLM-generated sample size.
- Add more layers or increase hidden node count (e.g., 128 nodes).
- Explore more robust resampling and embedding techniques.
- Consider using transformer-based models (e.g., BERT) for richer representations.
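The resampling idea above can be sketched as random oversampling of the minority class until the classes balance. This is a hedged illustration, not part of the project: the `oversample_minority` helper is hypothetical, and more robust options (e.g. SMOTE or class weights) exist.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until all classes match
    the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.where(y == c)[0]
        extra = rng.choice(c_idx, size=n_max - n, replace=True)
        idx.extend(c_idx)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]
```

With this dataset's 1,375-vs-3 split, plain duplication would repeat each LLM essay hundreds of times, which is why collecting more LLM-generated samples remains the first-listed improvement.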
Tytus Felbor – Final Project for CPSC 444 with Prof. Hu.
Disclaimer: This project operates under limited computational resources and with a highly imbalanced dataset. Results are promising but may not generalize.