LLM4SDR is a novel approach that leverages Large Language Models (LLMs) to fully automate the construction of open-source software defect repositories. It systematically addresses key challenges in repository construction through three main phases:
Data Preparation: LLM4SDR uses LLMs to generate high-quality commit descriptions by synthesizing information from commit messages, issue reports, pull requests, and related comments. This keeps the resulting descriptions accurate and informative, even when the original commit messages are incomplete or ambiguous.
Defect Patch Identification: To detect defect-related (bug-fixing) patches, LLM4SDR employs a Random Forest (RF) model that uses diverse features, including code diff metrics and analyses generated by LLMs and the static analysis tool Semgrep. Combining these sources improves precision and recall in patch detection.
Critical Variable Identification: LLM4SDR identifies variables related to software defects by combining a patch-based technique with LLM-driven refinement. The LLM filters and augments candidate variables to produce a final set of critical variables that directly contribute to defect introduction and repair.
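The critical variable phase can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the code shipped in this repository: `candidate_variables`, `refine_with_llm`, the keyword list, and the `ask_llm` callable are all hypothetical names, and the regex-based extraction stands in for whatever patch-based technique LLM4SDR actually uses.

```python
import re
from typing import Callable, List

# Hypothetical keyword filter; a real tool would use a language-aware parser.
_KEYWORDS = {
    "if", "else", "for", "while", "return", "int", "char", "void",
    "long", "unsigned", "const", "static", "struct", "sizeof", "NULL",
}
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def candidate_variables(diff: str) -> List[str]:
    """Collect identifiers from the added and removed lines of a unified diff."""
    seen = {}
    for line in diff.splitlines():
        # Skip the "---"/"+++" file headers, then keep only changed lines.
        if line.startswith(("+++", "---")):
            continue
        if not line.startswith(("+", "-")):
            continue
        body = line[1:]
        for match in _IDENT.finditer(body):
            name = match.group(0)
            # Treat "name(" as a call and drop it along with language keywords.
            is_call = body[match.end():].lstrip().startswith("(")
            if name in _KEYWORDS or is_call:
                continue
            seen.setdefault(name, None)  # dict preserves first-seen order
    return list(seen)


def refine_with_llm(
    candidates: List[str],
    diff: str,
    ask_llm: Callable[[List[str], str], List[str]],
) -> List[str]:
    """Let an LLM (passed in as a callable) filter and augment the candidates."""
    refined = ask_llm(candidates, diff)
    return list(dict.fromkeys(refined))  # de-duplicate, keep order


if __name__ == "__main__":
    sample = "\n".join([
        "--- a/buf.c",
        "+++ b/buf.c",
        "@@ -1,3 +1,3 @@",
        " int copy(char *dst, char *src, int len) {",
        "-    memcpy(dst, src, len);",
        "+    if (len < cap) memcpy(dst, src, len);",
        " }",
    ])
    # Stub standing in for a real model call.
    stub = lambda names, _diff: [n for n in names if n in {"len", "cap"}]
    print(refine_with_llm(candidate_variables(sample), sample, stub))
```

Passing the LLM in as a callable keeps the sketch independent of any particular model or API client; swapping in a real call only requires a function with the same signature.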
run_llm_message.py: Leverages LLMs to integrate information from multiple sources and generate detailed commit descriptions.
llm_analyzer.py: Analyzes commits using an LLM.
train.py: Trains a classifier model.
keyvar_extractor_llm.py: Uses LLMs to assist in extracting key variables.
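To illustrate what "integrating information from multiple sources" might look like for the commit description step, here is a small sketch that assembles one prompt from the available inputs. It is not taken from run_llm_message.py: the `CommitContext` class, the `build_prompt` function, and the prompt wording are assumptions made for illustration, and the actual model call is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CommitContext:
    """Hypothetical container for the text sources tied to one commit."""
    commit_message: str
    issue_report: str = ""
    pull_request: str = ""
    comments: List[str] = field(default_factory=list)


def build_prompt(ctx: CommitContext) -> str:
    """Merge the non-empty sources into a single prompt string."""
    sections = [
        ("Commit message", ctx.commit_message),
        ("Issue report", ctx.issue_report),
        ("Pull request", ctx.pull_request),
        ("Comments", "\n".join(ctx.comments)),
    ]
    parts = ["Write a detailed description of what this commit changes and why."]
    for title, body in sections:
        # Missing sources are skipped so the model is not shown empty sections.
        if body.strip():
            parts.append(f"## {title}\n{body.strip()}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    ctx = CommitContext(
        commit_message="fix crash",
        comments=["Crash happens when len exceeds the buffer size."],
    )
    print(build_prompt(ctx))
```

Skipping empty sections matters here because the original commit message is often the only source available, and the description step is meant to work even in that case.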