
Automatic Essay Grading

Comprehensive instruction tuning, evaluation, and optimization of large language models (LLMs) for automated essay grading, built on careful data analysis, preprocessing, and feature engineering. We conducted 39 experiments on Mistral, Qwen2.5, and SmolLM2 to assess their performance across real and synthetic datasets, evaluating both the predicted scores and the generated rationales.


Abstract

This work studies the performance of several models on an automated essay grading task, exploring what instruction tuning can achieve and which factors most strongly affect performance, including model-specific factors such as architecture, size, and the number of input tokens seen during training. The results were promising despite the relatively small dataset and limited computational resources; efficient optimization techniques made it possible to run the full set of experiments and reach the desired results.

Introduction

When we want to solve a problem, the first step is to understand it. The subsequent steps then become clear, one after another, until we reach the question that every AI practitioner eventually faces: which model should we use? How do we know it is the best choice, on what basis do we choose it, and how do we account for our constraints while choosing? It is common to find a model that performed well on a task similar to ours, test it immediately on our own task, and discover that its performance is weak. That raises further questions: is the data the reason, and, most importantly, which model should we pick instead? The answer is simple and everyone knows it, but it requires passion and effort; in short, it is experimentation. We ran 39 documented experiments to answer this question. The number is modest because the experiments were constrained by our computational resources, but within those limits we were able to explore the question thoroughly and answer it.

Dataset

The dataset includes EngSAF, a collection of real short-answer responses from engineering exams, and Books, a synthetic dataset of essay-style entries generated from classic literature. Both are designed to support automated grading research.

EngSAF

The EngSAF dataset, in its raw and unprocessed form, consists of approximately 5,800 short-answer responses collected from real-life engineering examinations administered at a reputed academic institute. These responses are spread across 119 unique questions drawn from a wide range of engineering disciplines, making the dataset both diverse and domain-specific. Each data point includes a student’s answer and an associated human-annotated score, serving as a benchmark for evaluating automated grading models.

The dataset is divided into three primary subsets: 70% is allocated for training, 16% is reserved for evaluation on unseen answers (UA), and 14% is dedicated to evaluating performance on entirely new questions (UQ). At this stage, it is important to note that the dataset is considered in its original state; no preprocessing, transformation, or filtering has yet been applied. All subsequent improvements and refinements to the data will be described in later sections. This dataset is known as EngSAF version 1.0 and was introduced in the paper titled "I understand why I got this grade": Automatic Short Answer Grading (ASAG) with Feedback, authored by Aggarwal et al., and set to appear in the proceedings of AIED 2025. The dataset is released strictly for academic and research purposes; any commercial use or redistribution without explicit permission is prohibited. Researchers are also urged to avoid publicly disclosing any sensitive content that may be contained in the dataset.

For more details, the paper can be accessed at: https://arxiv.org/abs/2407.12818.

Books

The Books dataset is a synthetic collection of essay-style data points generated using public domain literature and large language model prompting. The dataset comprises a total of 300 entries and is built from six classic books. Four of these (The Life of James Watt, The Life of Julius Caesar, The Moonstone, and North and South) were used during the training phase, while the remaining two (The Life of Napoleon and Sense and Sensibility) were held out for benchmarking purposes. Each book contributed exactly 50 entries, leading to a structured split of 200 training samples and 100 benchmark samples.

All entries were generated using Le Chat Mistral, a model developed by Mistral AI. A carefully crafted prompt was used to ensure each generated entry included a question, a reference answer written by an expert, a student answer meant to simulate a real-world response, a mark scheme outlining the grading criteria, a score between 1 and 4, and a rationale explaining why the score was assigned. The prompt enforced strict quality control: no duplicate questions or answers were allowed, all required fields had to be present, and the scoring range was strictly limited to valid values. The final output was formatted as CSV files to maintain consistency and ensure compatibility with downstream processing.
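
The prompt's quality constraints can also be checked automatically once the CSV files are generated. Below is a minimal validation sketch; the column names and file path are assumptions about the CSV layout, not the exact schema used.

```python
# Validation sketch for generated Books entries (illustrative; column names are assumed).
import pandas as pd

REQUIRED = ["question", "reference_answer", "student_answer", "mark_scheme", "score", "rationale"]

def validate_books(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)

    # Every required field must exist and be non-empty.
    missing = [c for c in REQUIRED if c not in df.columns]
    assert not missing, f"missing columns: {missing}"
    assert not df[REQUIRED].isna().any().any(), "empty fields found"

    # Scores must stay within the valid 1-4 range.
    assert df["score"].between(1, 4).all(), "score outside 1-4 range"

    # No duplicate questions or student answers are allowed.
    assert not df["question"].duplicated().any(), "duplicate questions"
    assert not df["student_answer"].duplicated().any(), "duplicate student answers"
    return df

# Example: validate_books("books_train.csv")  # hypothetical file name
```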

For more details, the metadata can be accessed at: metadata.

Methodology

We followed a systematic process beginning with thorough data analysis and preprocessing, followed by model training, structured output postprocessing, and a comprehensive evaluation using quantitative and qualitative methods.

Exploratory Data Analysis

We began our data exploration by focusing exclusively on the EngSAF dataset, since it comprises real-world data that requires rigorous inspection; the Books dataset, being synthetic and small, was analyzed manually using tools like Excel. Our goal was to investigate the cleanliness, structure, and readiness of the EngSAF data for downstream tasks, while also ensuring there was no semantic leakage between training and evaluation subsets.

The initial analysis revealed that the training split contained a small number of missing values, specifically in the Question_id and Student Answer columns. Given that these were minimal (just 12 rows), we dropped them outright. Furthermore, we decided to eliminate the Question_id column altogether, as it held no value for our modeling goals. To bring consistency and alignment with our pipeline, we standardized the column names: "Question" became question, "Student Answer" was renamed student_answer, "Correct Answer" became reference_answer, "output_label" was renamed score, and "feedback" was renamed rationale. Although the mark_scheme column was not included in the original CSV, it was documented in the dataset’s official resources, and we added it later during feature engineering.
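
In code, these cleanup steps amount to a few pandas operations. The sketch below assumes the raw column names match the description above; the file path is a placeholder.

```python
# Sketch of the EngSAF cleanup described above; column names follow the description,
# and the CSV path is hypothetical.
import pandas as pd

df = pd.read_csv("engsaf_train.csv")

# Drop the handful of rows with missing values, then the unused Question_id column.
df = df.dropna(subset=["Question_id", "Student Answer"]).drop(columns=["Question_id"])

# Standardize column names to match the rest of the pipeline.
df = df.rename(columns={
    "Question": "question",
    "Student Answer": "student_answer",
    "Correct Answer": "reference_answer",
    "output_label": "score",
    "feedback": "rationale",
})

# Remove fully duplicated records.
df = df.drop_duplicates().reset_index(drop=True)
```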

In terms of duplication, we discovered a few records that were fully duplicated; these were removed. After cleaning, the training data comprised 3,662 entries with 106 unique questions, 3,516 unique student answers, and 3,614 unique feedback texts. The distribution of the score labels was nearly balanced, with a slight bias toward label 2 and the lowest representation in label 0. Importantly, student answers and feedback lengths varied substantially, introducing valuable diversity into the dataset. A similar set of operations was performed on the unseen_answers split, which contained 980 entries across 103 unique questions. We found 954 unique student answers and 963 unique feedback entries, and we followed the same cleaning and renaming steps. The same applied to the unseen_question set, which was already cleaner; it had no missing or fully duplicated rows, and included 765 samples with 12 unique questions, 751 unique student answers, and 765 unique feedbacks.

We then turned our attention to semantic similarities and potential data leakage between these subsets. We wanted to quantify how many evaluation entries were semantically similar to training entries, which could artificially inflate performance metrics. To achieve this, we concatenated the values of each row into a single string and embedded them using the all-MiniLM-L6-v2 sentence-transformer, a fast and effective model for short text sequences. We indexed these embeddings using FAISS and searched for semantic overlaps above a 90% similarity threshold. This threshold was selected after iterative testing, starting from 80%, where we observed too many false positives due to repeated questions and correct answers. The 90% cutoff better reflected genuine semantic overlap without penalizing natural repetitions. The results were revealing. In the unseen_answers split, 909 of the 976 entries showed high semantic similarity to training data, leaving only 67 non-leaking samples. In contrast, the unseen_question split had only 17 overlapping samples, leaving 748 that were clean. The val set contained 376 overlapping entries and just 29 non-leakages, making it unsuitable for evaluation. All leakage indices were saved in a leakage_indices.json file for downstream processing.
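
The leakage check can be reproduced with a short script. The sketch below follows the description above (MiniLM embeddings, FAISS inner-product search, 0.90 cosine threshold); the row variables are placeholders standing in for the cleaned splits.

```python
# Leakage-detection sketch: embed concatenated rows and flag evaluation entries whose
# nearest training neighbour exceeds 90% cosine similarity.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder rows; in practice these come from the cleaned training/evaluation splits.
train_rows = [{"question": "Define entropy.", "student_answer": "A measure of disorder."}]
eval_rows = [{"question": "Define entropy.", "student_answer": "Disorder of a system."}]

def embed(rows):
    texts = [" ".join(str(v) for v in row.values()) for row in rows]
    return np.asarray(model.encode(texts, normalize_embeddings=True), dtype="float32")

train_vecs = embed(train_rows)
index = faiss.IndexFlatIP(train_vecs.shape[1])   # inner product == cosine for unit vectors
index.add(train_vecs)

scores, _ = index.search(embed(eval_rows), 1)    # similarity to nearest training neighbour
leak_indices = [i for i, s in enumerate(scores[:, 0]) if s >= 0.90]
```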

Rather than discarding these leakages, we leveraged them during the resplitting phase. We isolated all non-leakage entries from unseen_answers, unseen_question, and val, merged them into a unified unseen set, and then split this into validation and test subsets with a 40-60 ratio, respectively. After filtering out duplicate records, we ended up with 844 unique, clean entries; the validation set was assigned 338 entries, and the test set received 506 entries. While a few student answers were repeated, each was associated with a unique feedback response, aligning with our expectation of varied responses to the same question. The remaining overlapping entries (previously filtered out due to leakage) were pooled and merged back with the training set. After removing a few full duplicates and dropping repeated student_answer values to enhance training diversity, we finalized a training set of 4,735 clean entries across 107 unique questions. This preprocessing ensured that our model would train on non-redundant, high-quality data without any semantic contamination of the evaluation splits.
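
A compact version of this resplitting logic is sketched below, assuming the leakage indices refer to row labels of the cleaned DataFrames; variable names are illustrative.

```python
# Resplit sketch: pool non-leaking rows from the three splits, dedupe, and cut 40% / 60%.
import pandas as pd

def resplit(frames: dict, leaks: dict, seed: int = 42):
    """frames: split name -> DataFrame; leaks: split name -> leaked row labels."""
    clean = [df.drop(index=leaks.get(name, [])) for name, df in frames.items()]
    unseen = pd.concat(clean).drop_duplicates().sample(frac=1.0, random_state=seed)
    n_val = int(0.4 * len(unseen))                 # 40% validation, 60% test
    return unseen.iloc[:n_val], unseen.iloc[n_val:]

# Example: val_split, test_split = resplit(
#     {"unseen_answers": ua, "unseen_question": uq, "val": val}, leakage_indices)
```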

Finally, we checked the text lengths across splits. The longest sentence in the training set contained 481 words, while the longest in the evaluation set had 361. Given that our static prompt template added only 34 words, and considering the models used support much longer sequences, truncation was not required. This confirmed that we could safely proceed without modifying the data length, preserving the full context of each answer during training and inference.

Preprocessing

The preprocessing stage implemented all the data cleaning strategies identified during exploratory data analysis while preparing the datasets for model training and evaluation. Building on the EDA findings, we developed a comprehensive preprocessing pipeline using a custom Preprocessor class that handles prompt formulation, dataset reformatting, and tokenization.

For the EngSAF dataset, we first addressed the quality issues identified during EDA. This included removing the unnecessary Question_id column, dropping rows with missing values (only 12 instances), and standardizing column names to more descriptive formats (question, student_answer, etc.). We eliminated full duplicate records while preserving cases where only the student answers or feedback were duplicated, as these represented legitimate variations in responses to the same question. The dataset was then split according to our leakage-aware strategy, combining non-leaky records into validation and test sets while merging appropriate leaky records with the training data. This resulted in a final training set of 4,735 clean records.

The preprocessing pipeline included sophisticated prompt engineering to ensure consistent formatting across all examples. Each data point was transformed into a structured chat format containing system instructions, user content (with the question and answers), and assistant responses (with scores and rationales). The system message established the model's role as a "precise grading assistant," while the user content provided all necessary grading context, including the question, reference answer, student answer, and mark scheme. Assistant responses were formatted as JSON-like strings containing both scores and rationales for easy parsing.
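
The sketch below illustrates this chat structure; the exact wording of the system message and the field labels is an assumption based on the description, not the verbatim template.

```python
# Illustrative chat-formatting helper; wording and field labels are assumed.
import json

def build_chat(example: dict) -> list:
    user = (
        f"Question: {example['question']}\n"
        f"Reference answer: {example['reference_answer']}\n"
        f"Student answer: {example['student_answer']}\n"
        f"Mark scheme: {example['mark_scheme']}\n"
        "Return a JSON object with the fields 'score' and 'rationale'."
    )
    assistant = json.dumps({"score": example["score"], "rationale": example["rationale"]})
    return [
        {"role": "system", "content": "You are a precise grading assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]
```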

Tokenization was handled carefully to maintain compatibility with different model architectures. The preprocessing class included functionality to count tokens across the entire dataset while excluding special tokens, helping us monitor our token budget. For training, we applied standard tokenization with truncation where necessary, though our earlier length analysis showed this would rarely be needed. For inference, we added generation prompts and configured tensor outputs appropriately for the target device. The entire preprocessing pipeline was designed to be model-agnostic, allowing easy switching between different architectures while maintaining consistent input formats.
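
With Hugging Face tokenizers, the token-budget counting and inference-time prompting described above can be sketched roughly as follows; the model name is one of the three used here, chosen only for illustration.

```python
# Tokenization sketch: count tokens without special tokens for budgeting, and build
# inference inputs with a generation prompt. Details are illustrative, not the exact code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

def count_tokens(chat: list) -> int:
    text = tokenizer.apply_chat_template(chat, tokenize=False)
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

def encode_for_inference(chat: list):
    prompt = [m for m in chat if m["role"] != "assistant"]   # drop the gold answer
    return tokenizer.apply_chat_template(
        prompt, add_generation_prompt=True, return_tensors="pt"
    )
```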

Modeling

The modeling approach for this study was carefully designed to evaluate the performance of different large language models (LLMs) on the automated essay grading task. We selected three distinct model architectures: Mistral, Qwen2.5, and SmolLM2, to represent a range of model sizes and design philosophies. Each model was instruction-tuned on both Books and EngSAF datasets with variant sizes, with hyperparameters optimized to balance computational efficiency and performance. The experiments were conducted on GPU-accelerated hardware, leveraging techniques like gradient checkpointing, flash attention, and mixed-precision training to maximize resource utilization.
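
A rough picture of such a setup is given below; the model choice and hyperparameters are illustrative assumptions, not the exact configuration of any of the 39 experiments.

```python
# Illustrative fine-tuning setup (assumed hyperparameters, not the actual run configs).
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # flash attention, where the GPU supports it
)
model.gradient_checkpointing_enable()           # trade extra compute for lower memory

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,                                  # mixed-precision training
    logging_steps=10,
)
# The tokenized train/validation splits are then passed, together with these arguments,
# to a Trainer (or an SFT-style trainer) to run the instruction tuning.
```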

Postprocessing

The postprocessing phase was centered on enforcing a consistent and machine-readable output format, with particular emphasis on ensuring every model response could be reliably converted into JSON. This was a critical design choice that enabled seamless extraction of the key fields, "score" and "rationale", across both training and evaluation pipelines. Without strict adherence to this structure, downstream processes such as metric computation, model comparison, and error analysis would have been compromised.

To achieve this, we implemented a dedicated Postprocessor class. Its core component, the extract method, handled two primary scenarios. When working with structured chat histories, it directly located the assistant's response by role, applied lightweight JSON repair logic, and parsed the result into a Python dictionary. This ensured that each response, regardless of model origin, could be accessed with uniform syntax and no additional parsing logic. When working with raw string outputs, the method performed targeted substring extraction to isolate the JSON-like block, followed by the same repair and parse steps. This dual-path approach made the pipeline resilient to variations in output formatting, while guaranteeing that every valid entry could be transformed into structured data.
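
A minimal sketch of this dual-path extraction is shown below; the repair logic (trimming to the outermost braces and removing trailing commas) is an illustrative stand-in for the lightweight JSON repair described above.

```python
# Dual-path extraction sketch: handle chat histories and raw strings uniformly.
import json
import re

def _repair_and_parse(text: str) -> dict:
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    candidate = re.sub(r",\s*}", "}", text[start:end + 1])   # drop trailing commas
    return json.loads(candidate)

def extract(output) -> dict:
    if isinstance(output, list):                              # structured chat history
        assistant = next(m["content"] for m in output if m["role"] == "assistant")
        return _repair_and_parse(assistant)
    return _repair_and_parse(str(output))                     # raw string output
```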

A secondary utility method, strip, was developed to remove assistant responses from full prompt sequences, retaining only system and user roles. This allowed us to regenerate clean prompts for re-evaluation or ablation studies without risking contamination from previous outputs. All postprocessing logic was implemented with one core principle in mind: every model output must be convertible to a well-structured JSON object. This constraint enabled consistent parsing, robust indexing, and deterministic behavior during evaluation. It also played a key role during training, where ground-truth rationales and scores had to be compared against model predictions with exact field-level granularity. Our focus on JSON compatibility was not optional or cosmetic; it was a structural necessity that underpinned the stability, reproducibility, and interpretability of the entire system.
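
The corresponding utility is only a few lines; the sketch below assumes the same chat-message structure used elsewhere in the pipeline.

```python
# Strip sketch: keep only system and user turns so prompts can be regenerated cleanly.
def strip(chat: list) -> list:
    return [m for m in chat if m["role"] in ("system", "user")]
```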

Evaluation

The evaluation methodology employed both quantitative metrics and qualitative analysis. For quantitative assessment, we computed accuracy, precision, recall, F1 score, root mean squared error (RMSE), and Cohen's kappa score (CKS) for the scoring task, while using BERTScore precision, recall, and F1 for rationale evaluation, all on a held-out test set of 100 samples. Qualitative examination of the models' outputs revealed cases where most models correctly identified the key aspects of student answers but sometimes failed to align their scoring with the rubric criteria.
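
The score and rationale metrics can be computed with standard libraries, roughly as sketched below; macro averaging for precision/recall/F1 and the English BERTScore setting are assumptions.

```python
# Metric sketch: classification metrics for scores, BERTScore for rationales.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             mean_squared_error, precision_score, recall_score)
from bert_score import score as bert_score

def score_metrics(y_true, y_pred) -> dict:
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "cks": cohen_kappa_score(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }

def rationale_metrics(candidates, references) -> dict:
    p, r, f = bert_score(candidates, references, lang="en")
    return {"precision": p.mean().item(), "recall": r.mean().item(), "f1": f.mean().item()}
```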

Results

Our experiments show Mistral achieves the highest accuracy in score prediction, while Qwen2.5 outperforms others in generating quality rationales. Tables 1 and 2 summarize the top-performing models for score and rationale evaluations, respectively, across key metrics.

Table 1: Top-performing model per type based on score evaluation:

| Model | F1 | Precision | Recall | Accuracy | CKS | RMSE |
|---|---|---|---|---|---|---|
| Mistral-7b-instruct-v0.2-bnb-4bit-EngSaf-231K-tokens | 0.642 | 0.683 | 0.633 | 0.65 | 0.457 | 0.707 |
| Qwen2.5-3B-Instruct-EngSaf-628K | 0.6141 | 0.6415 | 0.6046 | 0.62 | 0.4123 | 0.6633 |
| SmolLM2-1.7B-Instruct-EngSaf-429K | 0.3614 | 0.4496 | 0.3939 | 0.4 | 0.0789 | 1.0392 |

Table 2: Top-performing model per type based on rationale evaluation:

| Model | F1 | Precision | Recall |
|---|---|---|---|
| Mistral-7b-instruct-v0.2-bnb-4bit-EngSaf-231K-tokens | 0.633 | 0.638 | 0.633 |
| Qwen2.5-3B-Instruct-EngSaf-628K | 0.6438 | 0.653 | 0.6382 |
| SmolLM2-1.7B-Instruct-EngSaf-429K | 0.6335 | 0.6381 | 0.6333 |

Conclusion

This study shows that with careful instruction tuning, preprocessing, and evaluation, LLMs can deliver strong performance on automated essay grading, even with limited data and compute. Across 39 experiments using models like Mistral, Qwen2.5, and SmolLM2, we observed that score accuracy and rationale quality were more influenced by prompt design and data integrity than by model size alone. Leakage-aware dataset splitting and embedding-based similarity checks proved critical in ensuring fair evaluation. Mistral achieved the highest score accuracy, while Qwen2.5 led in rationale quality. Our preprocessing and postprocessing pipelines enabled structured, interpretable outputs across most models. Systematic testing was critical to uncovering the real levers of grading accuracy and rationale quality.
