This release contains all artifacts related to the paper, organized into three main categories:
Code and data for creating prompt variations used in the paper.
- code/: Code for generating dataset variations (llm_summary, paragraph_sampling, sentence_block_masking; an illustrative masking sketch appears after this list)
- datasets/: Generated source datasets (the datasets actually used in the experiments)
- config/: Configuration files (YAML) for dataset generation
- scripts/: Scripts to run dataset generation
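To illustrate the kind of variation these scripts produce, below is a minimal sketch of sentence-block masking, assuming a naive period-based sentence split. The function and parameter names (mask_sentence_block, mask_token, block_size) are hypothetical and do not correspond to the code shipped in code/.

```python
import random

def mask_sentence_block(prompt: str, block_size: int = 2,
                        mask_token: str = "<MASKED>",
                        seed: int = 0) -> str:
    """Replace a contiguous block of sentences in a prompt with a mask token.

    Illustrative sketch only; the actual pipeline in code/ may split
    sentences and choose blocks differently.
    """
    rng = random.Random(seed)
    # Naive split on periods; a real pipeline may use a proper sentence tokenizer.
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    if len(sentences) <= block_size:
        return prompt
    start = rng.randrange(len(sentences) - block_size + 1)
    kept = sentences[:start] + [mask_token] + sentences[start + block_size:]
    return ". ".join(kept) + "."

# Hypothetical usage
print(mask_sentence_block(
    "Compute the dot product of two vectors. The vectors have equal length. "
    "Return a single float.", block_size=1))
```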
Code for generating LLM outputs from datasets and evaluating them.
- code-generation/: Scripts for generating LLM code completions
- evaluation-code/: Scripts for evaluating generated code (Python evaluation, pass@k; see the pass@k sketch below)
- scripts/: Orchestration scripts for generate-and-evaluate pipelines
- sample-outputs/: Sample generated outputs (the full set is too large to include in this release)
Note: the ParEval framework should be cloned as a git submodule (see the Setup instructions below)
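The evaluation scripts report pass@k. For reference, here is a minimal sketch of the standard unbiased pass@k estimator used in HumanEval-style evaluations; the per-problem counts in the example are hypothetical, and the implementation in evaluation-code/ may differ in detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (of which c pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts: problem id -> (n generations, c passing)
results = {"problem_01": (20, 5), "problem_02": (20, 0)}
k = 10
print(sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results))
```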
Analysis notebooks, final results, and visualizations.
- paper.pdf: The paper
- notebooks/: Jupyter notebooks for analysis and plotting
- results/: Final results (CSV files, PDF plots, analysis outputs)
- requirements.txt: Python dependencies
If you use this code or data, please cite:
@misc{zi2025scoreprobingimpactprompt,
  title={More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation},
  author={Yangtian Zi and Harshitha Menon and Arjun Guha},
  year={2025},
  eprint={2508.03678},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.03678},
}