microsoft · adityakanade · May 6, 2025
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,2 @@
-data/nextcoder-synthetic.jsonl
-notebook.ipynb
-git-credential-manager
-models
+*.ipynb
 *.parquet
diff --git a/README.md b/README.md
@@ -78,6 +78,17 @@ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
 
 <img src="assets/spider-plot.png" width=400></img>
 
+
+| Model | MMLU | GSM8K | HumanEval+ | MBPP+ |
+|-------|------|-------|------------|-------|
+| Qwen2.5-Coder-7B-Instruct | 53.0 | 83.40 | 85.4 | 72.5 |
+| NextCoder-7B | 54.5 | 81.65 | 84.8 | 72.0 |
+| Qwen2.5-Coder-32B-Instruct | 71.9 | 93.71 | 87.2 | 76.7 |
+| NextCoder-32B | 72.7 | 92.65 | 85.9 | 76.4 |
+
+*The NextCoder models largely retain the generalization ability of their base models across these benchmarks.*
+
+
 **A detailed evaluation and ablations can be found in our paper**
 
 ## Contributing

diff --git a/data/README.md b/data/README.md
@@ -5,6 +5,7 @@
 - `config` contains the yaml file to map prompts to their corresponding location
 - `utils.py` contains the helper code to extract and parse data from LLM responses
 - `data_pipeline.py` contains the main source code for generating synthetic data according to the pipeline explained in our paper.
+- `commitpackft_subset.csv` contains the `repo` and `commit` fields of the samples used in training; these can be used to look up the corresponding samples in the original CommitPackFT dataset
 
 # Usage
 - Make sure the proper packages are installed via the `environment.yaml` file provided at root folder
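
Note on the `commitpackft_subset.csv` entry added above: the lookup it describes could be sketched roughly as follows. This is only an illustration, not code from the repository. The helper names `load_subset_keys` and `filter_samples` are hypothetical, and the in-memory records stand in for CommitPackFT rows. The only thing taken from the diff is that the CSV has `repo` and `commit` columns; the field names on the dataset side are passed as parameters because they may differ from the CSV's.

```python
import csv
import io


def load_subset_keys(csv_file):
    """Read (repo, commit) pairs from a CSV file object with `repo` and `commit` columns."""
    reader = csv.DictReader(csv_file)
    return {(row["repo"], row["commit"]) for row in reader}


def filter_samples(samples, keys, repo_field="repo", commit_field="commit"):
    """Yield the samples whose (repo, commit) pair appears in `keys`.

    `repo_field` and `commit_field` are assumptions about how the source
    dataset names these fields; adjust them to match the actual schema.
    """
    for sample in samples:
        if (sample[repo_field], sample[commit_field]) in keys:
            yield sample


# Stand-in for data/commitpackft_subset.csv (made-up rows for illustration).
subset_csv = io.StringIO("repo,commit\norg/a,abc123\norg/b,def456\n")

# Stand-in for rows of the original dataset (made-up rows for illustration).
records = [
    {"repo": "org/a", "commit": "abc123", "new_contents": "print('a')"},
    {"repo": "org/c", "commit": "zzz999", "new_contents": "print('c')"},
    {"repo": "org/b", "commit": "def456", "new_contents": "print('b')"},
]

keys = load_subset_keys(subset_csv)
matched = list(filter_samples(records, keys))
```

Building a set of `(repo, commit)` tuples keeps each membership check constant time, which matters if the original dataset is streamed rather than loaded fully into memory.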