
SynthKGQA

SynthKGQA is a framework to generate large, high-quality, synthetic Knowledge Graph Question Answering datasets from any KG, using LLMs.

This repository contains the code to reproduce the results of the paper Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs.

Setup

python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -U dgl -f https://data.dgl.ai/wheels/torch-2.4/repo.html
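
To verify the environment, the core dependencies can be imported directly (a quick sanity check; assumes the DGL wheel above matched your installed PyTorch 2.4 build):

python3 -c "import torch, dgl; print(torch.__version__, dgl.__version__)"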

SynthKGQA: KGQA dataset generation

For the required data format to store the KG, see this preprocessing notebook.

Steps 1-2: candidate generation

python3 synth_kgqa/generate.py --kg-path <path to KG directory> --num-samples <number of questions to generate> --num-edges <number of edges in answer subgraphs> --save-path <path to output>

Run the above to generate a set of question-answer graph pairs based on the provided knowledge graph. See synth_kgqa/parse.py for additional parameters and synth_kgqa/llm.py for the supported LLM APIs. If you use OpenAI models, set the OPENAI_KEY environment variable.
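
As an illustration, a hypothetical invocation (the paths and sizes below are placeholders, not values from the paper) could look like:

OPENAI_KEY=<your key> python3 synth_kgqa/generate.py --kg-path data/my_kg --num-samples 1000 --num-edges 4 --save-path outputs/my_kg_qa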

Steps 3-4: candidate validation, classification and augmentation

python3 synth_kgqa/process_qa.py --kg-path <path to KG directory> --qa-path <save-path of generate.py>

The final data will be stored in <save-path>/processed_qa.json.
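
As a quick sanity check, the processed output can be inspected with standard JSON tooling (a minimal sketch, assuming the file stores a list of QA records; the exact fields are defined by process_qa.py):

import json

with open("<save-path>/processed_qa.json") as f:
    qa = json.load(f)

print(len(qa))       # number of validated question-answer pairs
print(qa[0].keys())  # fields available on the first record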

To construct question-specific subgraphs for the generated questions (Appendix C.1) and the shortest paths between seed and answer nodes, run

python3 synth_kgqa/compute_neighs_and_sp.py --kg-path <path to KG directory> --dataset <path to output of process_qa.py>

The output will be stored in <save-path>/<dataset>_scores.pkl. This computation can be expensive on larger KGs; see optional arguments in compute_neighs_and_sp.py for parallel processing of the dataset.
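
The resulting pickle can be loaded directly in Python (a minimal sketch; the structure of the stored object is defined by compute_neighs_and_sp.py and is not assumed here):

import pickle

with open("<save-path>/<dataset>_scores.pkl", "rb") as f:
    scores = pickle.load(f)

print(type(scores))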

GTSQA

Using the SynthKGQA framework and the ogbl-wikikg2 KG (CC-0 license), we constructed GTSQA, a KGQA dataset containing 30,144 training questions and 1,622 test questions. It is made available under the Creative Commons 4.0 license, and can be found on Hugging Face: https://huggingface.co/datasets/Graphcore/GTSQA

from datasets import load_dataset

gtsqa = load_dataset("Graphcore/GTSQA")

More details and statistics on the dataset are available in the paper.

See this notebook for the preprocessing steps of ogbl-wikikg2 and this notebook for the final post-processing steps applied to collate the data.

The question-specific subgraphs of ogbl-wikikg2 generated with synth_kgqa/compute_neighs_and_sp.py can also be downloaded with the dataset, by using the alternative config:

gtsqa = load_dataset("Graphcore/GTSQA", name="gtsqa-with-graphs")
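
Once loaded, the splits can be inspected with the usual datasets API (the "train"/"test" split names below are an assumption, matching the question counts above):

print(gtsqa)              # expected: ~30k training questions and ~1.6k test questions
print(gtsqa["train"][0])  # inspect the fields of a single question-answer example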

KG-RAG benchmarks

The code to benchmark the KG-RAG models evaluated in the paper on GTSQA is available in benchmarks/.

We also provide a notebook to reproduce the analysis and generate the figures from the paper.

How to cite

When referring to this work, please cite our paper.

@misc{cattaneo2025,
      title={Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs}, 
      author={Alberto Cattaneo and Carlo Luschi and Daniel Justus},
      year={2025},
      eprint={2511.04473},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.04473}, 
}

License

Copyright (c) 2025 Graphcore Ltd. Licensed under the MIT License.

The included code is released under the MIT license (see details of the license).

See NOTICE.md for further details.
