Data, code, and models used for the paper "The Good, the Bad, and the Missing: Neural Code Generation for Machine Learning Tasks"
The raw datasets we constructed for the different ML tasks and the non-ML tasks are stored as JSON files in the raw_dataset folder:
split_of_training = ['train', 'dev', 'test']
split_of_tasks = ['data', 'model', 'eval', 'noml', 'uni']
raw_dataset/[split_of_training]_[split_of_tasks].json
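For reference, the naming scheme expands to 15 files (train_data.json, dev_model.json, ..., test_uni.json). Below is a quick Python sketch to check that they are all in place; it is illustrative only and assumes each file is a JSON array of examples:

```python
import json
from itertools import product
from pathlib import Path

splits_of_training = ['train', 'dev', 'test']
splits_of_tasks = ['data', 'model', 'eval', 'noml', 'uni']

# The naming scheme expands to 15 files: train_data.json, dev_model.json, ...
for training, task in product(splits_of_training, splits_of_tasks):
    path = Path('raw_dataset') / f'{training}_{task}.json'
    if path.exists():
        examples = json.loads(path.read_text())  # assumes a JSON array
        print(f'{path}: {len(examples)} examples')
    else:
        print(f'missing: {path}')
```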
This is the link to download the model snapshots: link.
git clone https://github.com/pcyin/tranX.git
cd tranX
- save the JSON files to a directory in the repository, e.g. datasets/conala/data/conala/
- binarize the JSON files using datasets/conala/dataset.py:
python dataset.py
- save the trained model files to a directory in the repository, e.g. saved_models/
model files: tranx_train[split_of_tasks].bin
- run the testing script to get the decoded results (adjust the paths to your corresponding locations):
bash scripts/conala/decode.sh [path_of_binarized_data] [path_of_saved_models]
- the decoded file will be saved as "decodes/conala/$(basename ${model_file}).$(basename ${decode_file}).decode"
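To know which decode file to look for, the bash substitution above can be mirrored in Python; the model and decode file names below are hypothetical placeholders, not files shipped with this package:

```python
import os

# Hypothetical example inputs; substitute your actual paths.
model_file = 'saved_models/tranx_traindata.bin'
decode_file = 'datasets/conala/data/conala/test_data.bin'

# Mirrors the bash expression:
# decodes/conala/$(basename ${model_file}).$(basename ${decode_file}).decode
decode_path = os.path.join(
    'decodes/conala',
    f'{os.path.basename(model_file)}.{os.path.basename(decode_file)}.decode',
)
print(decode_path)
# -> decodes/conala/tranx_traindata.bin.test_data.bin.decode
```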
git clone https://github.com/neulab/external-knowledge-codegen.git
cd external-knowledge-codegen
- save the JSON files to a directory in the repository, e.g. datasets/conala/
- binarize the JSON file with the following command:
python dataset.py --pretrain data/conala/conala-mined.jsonl --topk 100000 --include_api apidocs/processed/distsmpl/snippet_15k/goldmine_snippet_count100k_topk1_temp2.jsonl
- save the trained model files to a directory in the repository, e.g. saved_models/
pretrained model files: ek_juice_pretrain_[split_of_tasks].bin
finetuned model files: ek_finetune.[split_of_tasks].bin
- run the testing script to get the decoded results:
bash scripts/conala/test.sh [path_of_saved_models]
- the decoded file will be saved as "decodes/conala/$(basename $1).test.decode"
git clone https://github.com/DeepLearnXMU/CG-RL.git
cd CG-RL
- save the JSON files to a directory in the repository, e.g. datasets/conala/
- binarize the JSON file using datasets/conala/dataset.py from the tranX repository:
python dataset.py
- save the trained model files to a directory in the repository, e.g. saved_models/
pretrained model files: CG_RL.juice.pretrain.[split_of_tasks].bin
finetuned model files: CG_RL.train_rl.[split_of_tasks].bin
- run the testing script to get the decoded results:
bash scripts/conala/test.sh
- the decoded file will be saved as "decodes/conala/$(basename $1).test.decode"
git clone https://github.com/sdpmas/TreeCodeGen.git
cd TreeCodeGen
- save the JSON files to a directory in the repository
- build a file with the NL intents of the training set: refer to datasets/conala/retrive_src.py. The file will be saved as src.txt in data/conala (see the sketch after these steps)
- parse those NL intents and build a vocabulary: refer to https://github.com/nxphi47/tree_transformer for details on setting up the parser, then run convert_ln.py
- build the train/dev/test datasets: run datasets/conala/dataset_hie.py
- save the trained model files to a directory in the repository, e.g. saved_models/
model files: tree_best.[split_of_tasks]_.bin
- run the testing script to get the decoded results:
bash scripts/conala/test.sh
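The intent-extraction step above amounts to something like the following. This is a minimal sketch, not the repository's retrive_src.py itself; the input path and the 'rewritten_intent'/'intent' field names are assumptions carried over from the CoNaLa format:

```python
import json

# Collect the NL intents of the training set into one file,
# one intent per line (a sketch of what retrive_src.py produces).
with open('datasets/conala/train_data.json') as f:  # hypothetical path
    examples = json.load(f)

with open('data/conala/src.txt', 'w') as out:
    for ex in examples:
        # Prefer the human-rewritten intent when present (CoNaLa convention).
        intent = ex.get('rewritten_intent') or ex.get('intent', '')
        out.write(intent.replace('\n', ' ').strip() + '\n')
```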
git clone https://github.com/BorealisAI/code-gen-TAE.git
cd code-gen-TAE
pip install -r requirements.txt
- save the JSON files to a directory in the repository, e.g. datasets/conala/
- binarize the JSON file using datasets/conala/dataset.py from the tranX repository:
python dataset.py
- save the trained model files to a directory in the repository, e.g. saved_models/
model files: tae_[split_of_tasks]_resume.pth (a sketch for inspecting these checkpoints follows the command below)
- run the testing script to get the decoded results:
python3 train.py --dataset_name conala --save_dir saved_models/ --copy_bt --no_encoder_update --monolingual_ratio 0.5 --epochs 20 --just_evaluate --seed 4
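If you want to confirm a checkpoint before evaluating, the .pth files can be opened with PyTorch. A minimal sketch; which keys appear depends on how the TAE training loop saved its state, which is not documented here:

```python
import torch

# Hypothetical file name; substitute one of the tae_*_resume.pth files.
ckpt = torch.load('saved_models/tae_data_resume.pth', map_location='cpu')

# A resumable checkpoint is typically a dict of states (model weights,
# optimizer state, epoch counter, ...); print the top-level keys to check.
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
```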
git clone https://github.com/microsoft/PyCodeGPT.git
cd PyCodeGPT
pip install -r requirements.txt
cd cert
pip install -r requirements.txt
- save the JSON files to a directory in the repository
- binarize the JSON files:
bash run_encode_domain.sh
- save the trained model files to a directory in the repository, e.g. saved_models/
model files (12 files): pcgpt_[split_of_tasks]_*
- run the testing script to get the decoded results:
cd ..
python eval_human_eval.py \
  --model_name_or_path [path_to_model_files] \
  --output_dir [output_dir] \
  --num_completions [number_of_generations] \
  --temperature 0.6 \
  --top_p 0.95 \
  --max_new_tokens 100 \
  --gpu_device 0
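For a quick sanity check outside the evaluation harness, a PyCodeGPT-style checkpoint can usually be queried directly with Hugging Face transformers using the same sampling settings. A minimal sketch, assuming the fine-tuned checkpoints load as ordinary causal LMs; the checkpoint directory and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = 'saved_models/pcgpt_data'  # hypothetical checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

prompt = '# split a dataframe df into train and test sets\n'
inputs = tokenizer(prompt, return_tensors='pt')
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,        # sampling, matching the script's settings
        temperature=0.6,
        top_p=0.95,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```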
├── keywords.py (keywords used to parse ML APIs from the libraries)
├── make_dataset.py (script to construct the datasets for the different ML tasks)
├── make_non_m.py (script to construct the non-ML dataset)
└── utils.py (script to clean and split the dataset)
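utils.py handles cleaning and splitting; the split itself amounts to something like the sketch below. This is an illustration of the technique, not the script's actual code, and the input path, split ratios, and random seed are assumptions:

```python
import json
import random

def split_dataset(examples, seed=42, dev_frac=0.05, test_frac=0.05):
    """Shuffle and split a list of examples into train/dev/test."""
    rng = random.Random(seed)
    examples = examples[:]
    rng.shuffle(examples)
    n = len(examples)
    n_dev, n_test = int(n * dev_frac), int(n * test_frac)
    dev, test = examples[:n_dev], examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test

# Example: split one task's cleaned examples and write the three JSON files.
with open('cleaned_data.json') as f:  # hypothetical input file
    examples = json.load(f)
for name, part in zip(['train', 'dev', 'test'], split_dataset(examples)):
    with open(f'raw_dataset/{name}_data.json', 'w') as out:
        json.dump(part, out)
```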