TFelbor/essay-classification-model-python

CPSC444 : A.I. Final Project - Essay Classification: Student vs. LLM-Generated

This project classifies essays as either student-written (label 0) or LLM-generated (label 1) using a small feed-forward neural network for binary classification. The model is trained on pre-processed 100-dimensional GloVe embeddings and achieves strong metrics despite a severely imbalanced dataset.

🧠 Project Overview

Problem Statement

Given a corpus of essays, determine whether each essay was written by a student or generated by a large language model (LLM). The key challenges are:

  • Severe class imbalance (1,375 student-written vs. only 3 LLM-generated essays in train_essays.csv).
  • Text variability, requiring standardized vectorization.
  • Model optimization within the computational constraints of Google Colab.
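The standardized vectorization step can be sketched as follows: each essay is reduced to the average of its tokens' GloVe vectors. This is a minimal NumPy sketch; the helper names (`load_glove`, `essay_to_vector`) are illustrative, not taken from the project code.

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe .txt file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def essay_to_vector(text, vectors, dim=100):
    """Average the embeddings of known tokens; zeros if none match."""
    tokens = text.lower().split()
    hits = [vectors[t] for t in tokens if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)
```

Tokens missing from the GloVe vocabulary are simply skipped, and an all-zero vector stands in for essays with no known tokens, so every essay maps to a fixed-length 100D input.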

🏗️ Architecture

  • Input Layer: 100D averaged GloVe embeddings per essay.
  • Hidden Layer: 64 nodes with ReLU activation.
  • Dropout Layer: 0.2 (to prevent overfitting).
  • Output Layer: Single sigmoid-activated node.
  • Loss Function: Binary Cross-Entropy.
  • Optimizer: Adam.
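The architecture above can be sketched in Keras. This is a reconstruction from the bullet list, not the project's actual cpsc444_final_tf.py:

```python
import tensorflow as tf

def build_model(embedding_dim: int = 100, dropout_rate: float = 0.2,
                learning_rate: float = 0.001) -> tf.keras.Model:
    """100D embedding in -> 64 ReLU units -> dropout -> sigmoid out."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(embedding_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The single sigmoid output produces a probability of the essay being LLM-generated, which the decision threshold (see Results) then converts to a 0/1 label.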

⚙️ Training & Hyperparameter Tuning

  • Hyperparameters Tested:

    • Learning rates: 0.001, 0.0001
    • Batch sizes: 32, 64
    • Dropout rates: 0.2, 0.3
  • Best Model Configurations:

    • learning_rate=0.001, batch_size=32, dropout_rate=0.2 or 0.3
  • Validation Strategy: Early stopping (patience=3) and decision-threshold tuning (raising the cutoff from 0.5 to 0.7) boosted validation metrics.
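The tuning described above amounts to an eight-point grid search. A standard-library sketch of enumerating those configurations (the grid values come from the list above; the helper name is illustrative):

```python
from itertools import product

# Hyperparameter grid from the tuning runs described above.
grid = {
    "learning_rate": [0.001, 0.0001],
    "batch_size": [32, 64],
    "dropout_rate": [0.2, 0.3],
}

def expand_grid(grid):
    """Enumerate every hyperparameter combination as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]

# Each of these 8 configs would be trained with early stopping
# (patience=3) and the best kept by validation performance.
configs = expand_grid(grid)
```

With only three tunable knobs, exhaustive enumeration stays cheap enough for Google Colab's compute limits.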

📊 Results

Metric      Value
---------   -----
F1-Score    0.996
Accuracy    0.997
Threshold   0.7

Note: These near-perfect scores should be interpreted cautiously: with only 3 LLM-generated training essays, a model can score highly while learning very little about the minority class.
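The threshold optimization behind the 0.7 cutoff can be sketched with NumPy: sweep candidate cutoffs over the predicted probabilities and keep the one with the best F1. The helpers `f1_score` and `best_threshold` are illustrative, not the project's code:

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 from 0/1 label arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def best_threshold(y_true, probs, candidates=np.arange(0.5, 0.75, 0.05)):
    """Return the (threshold, F1) pair that maximizes F1 on validation data."""
    scores = [(t, f1_score(y_true, (probs >= t).astype(int)))
              for t in candidates]
    return max(scores, key=lambda s: s[1])
```

Raising the cutoff trades recall for precision, which matters when the positive class is this rare.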

📁 Data Files

The following files are required to train and run this project:

Filename                Description
train_essays.csv        1,378 labeled essays for training.
train_prompts.csv       Prompts associated with the training essays.
test_essays.csv         Test set for generating final predictions.
sample_submissions.csv  Format for the final submission file.

Ensure these files are placed in the root directory or update file paths accordingly.
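A minimal sketch of reading training labels with the standard library. The column names (`id`, `prompt_id`, `text`, `generated`) are assumptions about the CSV layout and are not confirmed by this repository; check the actual header row of train_essays.csv before relying on them.

```python
import csv
import io

# Toy stand-in for train_essays.csv; real column names may differ (assumption).
sample = io.StringIO(
    "id,prompt_id,text,generated\n"
    "0,0,Some student essay.,0\n"
    "1,0,An essay written by an LLM.,1\n"
)
rows = list(csv.DictReader(sample))
labels = [int(row["generated"]) for row in rows]   # 0 = student, 1 = LLM
```

Replacing the `StringIO` with `open("train_essays.csv", encoding="utf-8")` reads the real file from the root directory.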

🚀 How to Run

  1. Clone this repo and place the necessary CSV files in the root folder.
  2. Run cpsc444_final_tf.py in Google Colab or locally.
  3. Adjust model thresholds and evaluate using validation metrics.

🔧 Future Improvements

  • Increase the LLM-generated sample size.
  • Add more layers or increase hidden node count (e.g., 128 nodes).
  • Explore more robust resampling and embedding techniques.
  • Consider using transformer-based models (e.g., BERT) for richer representations.
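One common way to address the imbalance noted above, short of collecting more LLM samples, is inverse-frequency class weighting. This is a sketch; the helper is illustrative and not part of the project:

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency weights, suitable for Keras model.fit(class_weight=...)."""
    labels = np.asarray(labels)
    n = len(labels)
    classes, counts = np.unique(labels, return_counts=True)
    return {int(c): n / (len(classes) * cnt)
            for c, cnt in zip(classes, counts)}
```

With 1,375 negatives and 3 positives, the minority class would receive a weight hundreds of times larger than the majority class, pushing the loss to pay attention to the rare LLM-generated examples.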

📑 Author

Tytus Felbor – Final Project for CPSC 444 with Prof. Hu.


Disclaimer: This project operates under limited computational resources and with a highly imbalanced dataset. Results are promising but may not generalize.
