Merged
5 changes: 1 addition & 4 deletions .gitignore
@@ -1,5 +1,2 @@
data/nextcoder-synthetic.jsonl
notebook.ipynb
git-credential-manager
models
*.ipynb
*.parquet
11 changes: 11 additions & 0 deletions README.md
@@ -78,6 +78,17 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

<img src="assets/spider-plot.png" width=400></img>


| Model | MMLU | GSM8K | HumanEval+ | MBPP+ |
|-------|------|-------|------------|-------|
| Qwen2.5-Coder-7B-Instruct | 53.0 | 83.40 | 85.4 | 72.5 |
| NextCoder-7B | 54.5 | 81.65 | 84.8 | 72.0 |
| Qwen2.5-Coder-32B-Instruct | 71.9 | 93.71 | 87.2 | 76.7 |
| NextCoder-32B | 72.7 | 92.65 | 85.9 | 76.4 |

*General-purpose benchmark scores stay close between the base Qwen2.5-Coder models and their NextCoder versions*
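To make the comparison in the table concrete, the per-benchmark differences can be computed directly. The snippet below is a minimal sketch: the scores are copied from the table, while the names `SCORES`, `PAIRS`, and `score_deltas` are illustrative and not part of the repository.

```python
# Scores copied from the table above, in the order MMLU, GSM8K, HumanEval+, MBPP+.
BENCHMARKS = ["MMLU", "GSM8K", "HumanEval+", "MBPP+"]
SCORES = {
    "Qwen2.5-Coder-7B-Instruct": [53.0, 83.40, 85.4, 72.5],
    "NextCoder-7B": [54.5, 81.65, 84.8, 72.0],
    "Qwen2.5-Coder-32B-Instruct": [71.9, 93.71, 87.2, 76.7],
    "NextCoder-32B": [72.7, 92.65, 85.9, 76.4],
}
# (base model, NextCoder counterpart) pairs to compare.
PAIRS = [
    ("Qwen2.5-Coder-7B-Instruct", "NextCoder-7B"),
    ("Qwen2.5-Coder-32B-Instruct", "NextCoder-32B"),
]


def score_deltas(base, tuned):
    """Return tuned-minus-base score per benchmark, rounded to 2 decimals."""
    return {
        name: round(t - b, 2)
        for name, b, t in zip(BENCHMARKS, SCORES[base], SCORES[tuned])
    }


for base, tuned in PAIRS:
    print(f"{tuned} vs {base}: {score_deltas(base, tuned)}")
```

On these numbers, every difference is within 2 points in either direction.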


**Detailed evaluations and ablations can be found in our paper**

## Contributing
1 change: 1 addition & 0 deletions data/README.md
@@ -5,6 +5,7 @@
- `config` contains the yaml file to map prompts to their corresponding location
- `utils.py` contains the helper code to extract and parse data from LLM responses
- `data_pipeline.py` contains the main source code for generating synthetic data according to the pipeline explained in our paper.
- `commitpackft_subset.csv` contains the `repo` and `commit` fields of the samples used in training; these can be used to look up the corresponding samples in the original commitpackft dataset
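
The lookup described for `commitpackft_subset.csv` could be sketched as follows. This is a hedged illustration, not the repository's own code. The helper names `load_subset_keys` and `select_samples` are hypothetical. The record layout is also an assumption: each commitpackft record is taken to expose a `commit` hash and a comma-separated `repos` string, so check the field names against the dataset you actually load.

```python
import csv
import io


def load_subset_keys(csv_file):
    """Read (repo, commit) pairs from an open CSV file with `repo` and `commit` columns."""
    reader = csv.DictReader(csv_file)
    return {(row["repo"], row["commit"]) for row in reader}


def select_samples(records, keys):
    """Keep records whose commit hash and any listed repo match a subset key.

    Assumes each record is a dict with `commit` and a comma-separated `repos`
    field; adjust the field names if your copy of the dataset differs.
    """
    selected = []
    for record in records:
        repos = [name.strip() for name in record["repos"].split(",")]
        if any((repo, record["commit"]) in keys for repo in repos):
            selected.append(record)
    return selected


# Toy stand-ins for the real CSV and dataset, for illustration only.
toy_csv = io.StringIO("repo,commit\norg/project,abc123\n")
toy_records = [
    {"commit": "abc123", "repos": "org/project,fork/project"},
    {"commit": "def456", "repos": "other/repo"},
]

keys = load_subset_keys(toy_csv)
matched = select_samples(toy_records, keys)
print(len(matched), "matching sample(s)")
```

With the real files, the `io.StringIO` stand-in would be replaced by `open("commitpackft_subset.csv", newline="")` and `toy_records` by an iterable over the original dataset.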

# Usage
- Make sure the proper packages are installed via the `environment.yaml` file provided at root folder