This project classifies essays as either student-written (label `0`) or LLM-generated (label `1`) using a binary neural-network classifier. The model is trained on pre-processed GloVe 100D embeddings and achieves strong performance metrics despite a severely imbalanced dataset.
Given a corpus of essays, determine whether each was written by a student or generated by a large language model (LLM). The key challenges are:
- Severe class imbalance: 1,375 student-written vs. only 3 LLM-generated essays in `train_essays.csv`.
- Text variability, requiring standardized vectorization.
- Model optimization within the computational constraints of Google Colab.
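The standardized vectorization step above (averaging GloVe vectors per essay) can be sketched as follows. This is an illustration, not the project's script: the `essay_to_vector` name and the toy 3-D embedding dictionary are hypothetical, and the real project uses the 100D GloVe vocabulary.

```python
import numpy as np

def essay_to_vector(essay, embeddings, dim=100):
    """Average the GloVe vectors of an essay's tokens.

    `embeddings` maps token -> np.ndarray of shape (dim,);
    tokens missing from the vocabulary are skipped.
    """
    tokens = essay.lower().split()
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)  # all-OOV essay falls back to a zero vector
    return np.mean(vecs, axis=0)

# Toy 3-D "embeddings" just to illustrate the shape handling.
toy = {"the": np.array([1.0, 0.0, 0.0]), "cat": np.array([0.0, 1.0, 0.0])}
vec = essay_to_vector("The cat sat", toy, dim=3)  # "sat" is OOV and skipped
```

Each essay, regardless of length, is reduced to a single fixed-size vector, which is what lets a plain dense network consume the text.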
- Input Layer: 100D averaged GloVe embeddings per essay.
- Hidden Layer: 64 nodes with ReLU activation.
- Dropout Layer: 0.2 (to prevent overfitting).
- Output Layer: Single sigmoid-activated node.
- Loss Function: Binary Cross-Entropy.
- Optimizer: Adam.
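The architecture above can be sketched in Keras. Layer sizes, activations, loss, and optimizer follow the description in this README; everything else (e.g. the learning-rate value shown) is one of the tested configurations, not a claim about the exact project script.

```python
import tensorflow as tf

# Sketch of the architecture described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100,)),             # averaged GloVe vector
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer
    tf.keras.layers.Dropout(0.2),                    # regularization
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(label == 1)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

With a 100D input, a 64-node hidden layer, and a single output node, the model has only 100·64 + 64 + 64 + 1 = 6,529 parameters, which keeps training cheap enough for Google Colab.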
- Hyperparameters Tested:
  - Learning rates: `0.001`, `0.0001`
  - Batch sizes: `32`, `64`
  - Dropout rates: `0.2`, `0.3`
- Best Model Configuration: `learning_rate=0.001`, `batch_size=32`, `dropout_rate=0.2` or `0.3`
- Validation Strategy: Early stopping (patience=3) and threshold optimization (raising the decision threshold from 0.5 to 0.7) boosted performance metrics.
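The threshold-optimization step can be sketched as a search over candidate thresholds for the one that maximizes validation F1. The `best_threshold` helper below is hypothetical; the candidate set mirrors the 0.5-to-0.7 range mentioned above.

```python
import numpy as np

def best_threshold(y_true, y_prob, candidates=(0.5, 0.6, 0.7)):
    """Pick the decision threshold that maximizes F1 on validation data."""
    def f1(t):
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        if tp == 0:
            return 0.0
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        return 2 * prec * rec / (prec + rec)
    return max(candidates, key=f1)
```

Raising the threshold trades recall for precision, which matters here because a default 0.5 cutoff over-predicts the minority class far less reliably than the probabilities suggest.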
| Metric | Value |
|---|---|
| F1-Score | 0.996 |
| Accuracy | 0.997 |
| Threshold | 0.7 |
Note: High results should be interpreted cautiously due to class imbalance.
The following files are required to train and run this project:
| Filename | Description |
|---|---|
| `train_essays.csv` | Contains 1,378 labeled essays for training. |
| `train_prompts.csv` | Includes prompts associated with training essays. |
| `test_essays.csv` | Test set for generating final predictions. |
| `sample_submissions.csv` | Format for the final submission file. |
Ensure these files are placed in the root directory or update file paths accordingly.
- Clone this repo and place the necessary CSV files in the root folder.
- Run `cpsc444_final_tf.py` in Google Colab or locally.
- Adjust model thresholds and evaluate using validation metrics.
- Increase the LLM-generated sample size.
- Add more layers or increase hidden node count (e.g., 128 nodes).
- Explore more robust resampling and embedding techniques.
- Consider using transformer-based models (e.g., BERT) for richer representations.
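The resampling idea above can be sketched as random oversampling of the minority class until the classes balance. This is a hedged illustration, not part of the project: the `oversample_minority` helper is hypothetical, and more robust options (e.g. SMOTE or class weights) exist.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until all classes match
    the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.where(y == c)[0]
        extra = rng.choice(c_idx, size=n_max - n, replace=True)
        idx.extend(c_idx)
        idx.extend(extra)
    idx = np.array(idx)
    return X[idx], y[idx]
```

With this dataset's 1,375-vs-3 split, plain duplication would repeat each LLM essay hundreds of times, which is why collecting more LLM-generated samples remains the first-listed improvement.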
Tytus Felbor – Final Project for CPSC 444 with Prof. Hu.
Disclaimer: This project operates under limited computational resources and with a highly imbalanced dataset. Results are promising but may not generalize.