LLM Prompt-based Parsing (llmpp)
- A library for fine-tuning and evaluating LLMs on prompt-based dependency parsing tasks.
This repository contains implementations for training, inference, and evaluation of the proposed method in the paper "Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs", accepted to SyntaxFest 2025.
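For illustration, the tabular output format asks the model to emit one token per line with its head index and relation, in the style of CoNLL-U columns. The rendering below is hypothetical; the exact columns are defined by the prompt templates in `templates/`.

```
1	The	2	det
2	cat	3	nsubj
3	sleeps	0	root
```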
- Ubuntu 22.04
- CUDA 12.1
- Python 3.11 or later
- `requirements.txt` for the base environment
- Create a venv and install the requirements:

```sh
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
- Download the base model from Hugging Face:

```sh
huggingface-cli download google/gemma-2-9b --local-dir models/gemma-2-9b --token {your_hf_token}
```
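To sanity-check the download before training, the tokenizer can be loaded locally; a minimal sketch, assuming the `transformers` package is installed in the venv:

```python
# Sanity-check the downloaded weights by loading the tokenizer locally.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/gemma-2-9b")
print(tok("The cat sleeps.").input_ids)  # prints token ids if the files are intact
```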
See `download_ud.sh` and `llmpp/conllu_to_prompt.py`.
You need to accept the license for each UD dataset.
```sh
./download_ud.sh UD_English-EWT en_ewt r2.15
for s in train dev test
do
  python -m llmpp.conllu_to_prompt templates/en-turn1-step3.toml data/en_ewt-r2.15/en_ewt-ud-$s.conllu LANGUAGE English
done
```
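The conversion writes prompt data as JSON Lines. A quick way to inspect the result (a sketch; the output path is assumed here and the field names depend on the template):

```python
# Peek at the first converted record; field names depend on the template.
import json

path = "data/en_ewt-r2.15/en-turn1-step3.train.jsonl"  # assumed output location
with open(path) as f:
    record = json.loads(f.readline())
print(list(record.keys()))
```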
See `llmpp/sft_trl.py`.

Using default settings:

```sh
CUDA_VISIBLE_DEVICES=0 python -m llmpp.sft_trl --c config/gemma-2-9b.yaml
```
Specifying the model and training data paths:

```sh
CUDA_VISIBLE_DEVICES=0 python -m llmpp.sft_trl \
  --c config/gemma-2-9b.yaml \
  --m models/gemma-2-2b \
  --t data/ja_gsd-r2.15/en-turn1-step3.train.jsonl
```
Multi-GPU training with Accelerate and DeepSpeed ZeRO-3:

```sh
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
  --main_process_port 29500 \
  --config_file config/accelerate_deepspeed_zero3_4gpu.yaml \
  -m llmpp.sft_trl \
  --c config/gemma-2-9b.yaml
```
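Under the hood, `llmpp/sft_trl.py` presumably wraps TRL's `SFTTrainer`; a stripped-down sketch of that flow, where the output directory and dataset field handling are assumptions rather than llmpp's actual configuration:

```python
# Hedged sketch of an SFT run with TRL; llmpp's script reads these settings
# from YAML, and the dataset field names depend on the prompt template.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train = load_dataset(
    "json",
    data_files="data/en_ewt-r2.15/en-turn1-step3.train.jsonl",
    split="train",
)
trainer = SFTTrainer(
    model="models/gemma-2-9b",
    train_dataset=train,
    args=SFTConfig(output_dir="models/sft-sketch", num_train_epochs=2),
)
trainer.train()
```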
The standard hyper-parameters are:
- Epochs: 2
- Batch size: 9 (default)
- LR multiplier: 1.8 (default)
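The "LR multiplier" wording matches OpenAI's fine-tuning hyper-parameters; assuming these defaults refer to fine-tuning `gpt-4o-mini` (the `ft:` checkpoint used below), such a job could be launched as in this sketch with the official `openai` client:

```python
# Hypothetical OpenAI fine-tuning job using the hyper-parameters above;
# the training file id is a placeholder for an uploaded .jsonl.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file="file-XXXXXXXX",  # placeholder
    hyperparameters={
        "n_epochs": 2,
        "batch_size": 9,
        "learning_rate_multiplier": 1.8,
    },
)
print(job.id)
```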
Activate the vLLM environment and run inference:

```sh
source venv_vllm/bin/activate
CUDA_VISIBLE_DEVICES=0 python -m llmpp.completion_vllm \
  --m models/gemma-2-9b_en_ewt-r2.15_en-turn1-step3.train/ \
  --i data/en_ewt-r2.15/en-turn1-step3.test.jsonl \
  --rsr
```
For the above example, the output directory is `models/gemma-2-9b_en_ewt-r2.15_en-turn1-step3.train/data/en_ewt-r2.15/en-turn1-step3.test.jsonl/` and the output files are:
- Inference results: `completion.jsonl`
- Log: `completion.jsonl.log`
The option `--rsr` is required only for gemma-2 models.
If you want to avoid `bos_token` duplication, add `-ct '^\{\{-? bos_token \}\}\n'` to the CLI options.
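For reference, `llmpp/completion_vllm.py` presumably builds on vLLM's offline API along these lines (a sketch under that assumption; the prompts would come from the input `.jsonl`):

```python
# Hedged sketch of offline generation with vLLM; not llmpp's actual code.
from vllm import LLM, SamplingParams

llm = LLM(model="models/gemma-2-9b_en_ewt-r2.15_en-turn1-step3.train/")
params = SamplingParams(temperature=0.0, max_tokens=1024)  # greedy decoding
outputs = llm.generate(["<prompt text from the input jsonl>"], params)
print(outputs[0].outputs[0].text)
```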
See `llmpp/completion_openai.py`.

```sh
python -m llmpp.completion_openai -m ft:gpt-4o-mini-2024-07-18:your-account:someprefix:somedigest -i data/en_ewt-r2.15/en-turn1-step3.test.jsonl
```
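The script presumably issues chat completions against the fine-tuned model; an equivalent one-off call with the official client would look like this (a sketch; the message content comes from the input `.jsonl`):

```python
# Hedged sketch of querying the fine-tuned OpenAI model directly.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-account:someprefix:somedigest",
    messages=[{"role": "user", "content": "<prompt text from the input jsonl>"}],
    temperature=0.0,
)
print(resp.choices[0].message.content)
```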
See `llmpp/eval.py`.

```sh
python -m llmpp.eval models/gemma-2-9b_en_ewt-r2.15_en-turn1-step3.train/data/en_ewt-r2.15/en-turn1-step3.test.jsonl/completion.jsonl
```

For the above example, the output files will be placed in the same directory as the input file:
- Evaluation result: `completion.eval.json`
- Evaluation report: `completion.eval.report`
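The core parsing metrics are the unlabeled and labeled attachment scores (UAS/LAS): the fraction of tokens whose predicted head (and, for LAS, also relation) matches the gold tree. A self-contained sketch of that computation, independent of llmpp's own `eval.py`:

```python
# Hedged sketch of UAS/LAS over per-token (head, deprel) pairs.
def attachment_scores(gold, pred):
    """gold/pred: lists of (head_index, deprel) per token."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = [(2, "det"), (3, "nsubj"), (0, "root")]
pred = [(2, "det"), (3, "obj"), (0, "root")]
print(attachment_scores(gold, pred))  # (1.0, 0.666...)
```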
To aggregate multiple evaluation results into a single TSV:

```sh
./aggregate_eval_json_to_tsv.sh models/*_en_ewt-r2.15_en-turn1-step3.train/data/en_ewt-r2.15/en-turn1-step3.test.jsonl/completion.eval.json
```
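If you prefer Python over the shell script, the same collection step can be sketched with `glob` and `json`; no assumptions are made here about the metric key names inside `completion.eval.json`:

```python
# Hedged sketch: collect every completion.eval.json into tab-separated rows.
import glob
import json

for path in sorted(glob.glob("models/*/data/*/*/completion.eval.json")):
    with open(path) as f:
        print(path, json.dumps(json.load(f)), sep="\t")
```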
If you use this work, please cite:

```bibtex
@misc{matsuda2025llmparsing,
  title={Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs},
  author={Hiroshi Matsuda and Chunpeng Ma and Masayuki Asahara},
  year={2025},
  eprint={2506.09983},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.09983},
}
```