This release contains all artifacts related to the paper, organized into three main categories:
Code and data for creating prompt variations used in the paper.
- code/: Code for generating dataset variations (llm_summary, paragraph_sampling, sentence_block_masking; an illustrative masking sketch appears after this list)
- datasets/: Generated source datasets (the datasets actually used in the experiments)
- config/: Configuration files (YAML) for dataset generation
- scripts/: Scripts to run dataset generation
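To illustrate the kind of variation these scripts produce, below is a minimal sketch of sentence-block masking, assuming a naive period-based sentence split. The function and parameter names (mask_sentence_block, mask_token, block_size) are hypothetical and do not correspond to the code shipped in code/.

```python
import random

def mask_sentence_block(prompt: str, block_size: int = 2,
                        mask_token: str = "<MASKED>",
                        seed: int = 0) -> str:
    """Replace a contiguous block of sentences in a prompt with a mask token.

    Illustrative sketch only; the actual pipeline in code/ may split
    sentences and choose blocks differently.
    """
    rng = random.Random(seed)
    # Naive split on periods; a real pipeline may use a proper sentence tokenizer.
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    if len(sentences) <= block_size:
        return prompt
    start = rng.randrange(len(sentences) - block_size + 1)
    kept = sentences[:start] + [mask_token] + sentences[start + block_size:]
    return ". ".join(kept) + "."

# Hypothetical usage
print(mask_sentence_block(
    "Compute the dot product of two vectors. The vectors have equal length. "
    "Return a single float.", block_size=1))
```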
Code for generating LLM outputs from datasets and evaluating them.
- code-generation/: Scripts for generating LLM code completions
- evaluation-code/: Scripts for evaluating generated code (Python evaluation, pass@k; see the pass@k sketch below)
- scripts/: Orchestration scripts for generate-and-evaluate pipelines
- sample-outputs/: Sample generated outputs (the full set is too large to include in this release)
Note: the ParEval framework should be cloned as a git submodule (see the Setup instructions below)
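The evaluation scripts report pass@k. For reference, here is a minimal sketch of the standard unbiased pass@k estimator used in HumanEval-style evaluations; the per-problem counts in the example are hypothetical, and the implementation in evaluation-code/ may differ in detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (of which c pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts: problem id -> (n generations, c passing)
results = {"problem_01": (20, 5), "problem_02": (20, 0)}
k = 10
print(sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results))
```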
Analysis notebooks, final results, and visualizations.
- paper.pdf: The paper
- notebooks/: Jupyter notebooks for analysis and plotting
- results/: Final results (CSV files, PDF plots, analysis outputs)
- requirements.txt: Python dependencies
If you use this code or data, please cite:
@misc{zi2025scoreprobingimpactprompt,
  title={More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation},
  author={Yangtian Zi and Harshitha Menon and Arjun Guha},
  year={2025},
  eprint={2508.03678},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.03678},
}