# Artifacts for "Mining Repositories to Understand User and Developer Challenges with Static Analysis Tools"
This artifact repository contains the code, data, and results for the paper "Mining Repositories to Understand User and Developer Challenges with Static Analysis Tools".
- `data/`: Contains the data used in the project.
  - `data.zip`: Zip file containing all issues used in the project.
  - `RQ2/`: Contains the full list of original and refined topics.
    - `RQ2_Topic_List.pdf`: Contains the full list of the refined topics.
    - `RQ2_Raw_Topic_Bugs.csv`: Contains the original topics for bugs.
    - `RQ2_Raw_Topic_Questions.csv`: Contains the original topics for questions.
    - `RQ2_Raw_Topic_Enhancements.csv`: Contains the original topics for enhancements.
  - `RQ3/`: Contains the results of the manual analysis performed on the 60 issues.
- `code/`: Contains the code of the project.
  - `analysis/`: Contains the various analyses performed in the paper.
    - `subject_tools/`: Contains the code to provide the general statistics of the subject tools.
    - `common_properties/`: Contains the code to generate the common properties of issues (RQ1).
    - `catiss_classification/`: Contains the code to classify issues into bugs, questions, and enhancements using CatISS.
    - `interest_groups/`: Contains the code to generate the interest groups (RQ1).
    - `figures/`: Contains the code to generate the figures in the paper.
  - `topic_modeling/`: Contains the code for topic modeling using BERTopic (RQ2).
  - `download_data/`: Contains scripts to download the data from GitHub and BitBucket.
  - `utils/`: Contains utility functions and constants used throughout the code.
The evaluation of this artifact does not require specific hardware. The recommended software versions are:
- Python 3.9.6 (the version the project was developed on)
- pip 25.1.1 (the version the project was developed on)
The full dataset of all issues used in this project is located in the `data/data.zip` file of this repository. Details regarding each of the files inside the zipped data file are provided below:
- `issues_metadata.csv`: This file contains the raw metadata of the issues collected from GitHub or BitBucket. It includes information such as issue ID, title, body, and labels.
- `pull_requests_metadata.csv`: This file contains the raw metadata of the pull requests collected from GitHub or BitBucket. It includes information such as pull request ID and linked commits.
- `commits_metadata.csv`: This file contains the raw metadata of the commits collected from GitHub or BitBucket. It includes information such as commit ID and number of files changed.
- `issues_properties.csv`: This file contains the properties of the issues, which are generated by the analysis notebooks described below. All columns with the prefix `prop:` are properties. Columns with the prefix `ig:` hold interest-group information: a boolean value indicating whether the corresponding issue is part of that interest group (see the loading sketch after this list).
- `repositories.csv`: This file contains the repositories used in the paper. It includes the repository name and host (GitHub or BitBucket).
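For orientation, here is a minimal sketch of loading these files with pandas; it assumes `data.zip` has been extracted in place into `data/` (the extraction layout is an assumption):

```python
import pandas as pd

# Paths assume data/data.zip has been extracted into data/.
issues = pd.read_csv("data/issues_metadata.csv")
props = pd.read_csv("data/issues_properties.csv")

# Split the columns by their documented prefixes.
prop_cols = [c for c in props.columns if c.startswith("prop:")]
ig_cols = [c for c in props.columns if c.startswith("ig:")]

print(f"{len(issues)} issues, {len(prop_cols)} properties, {len(ig_cols)} interest groups")
print(props[ig_cols].sum())  # issues per interest group (ig: columns are booleans)
```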
The topic modeling results for RQ2 are located in the `data/RQ2/` directory.
- `RQ2_Topic_List.pdf`: This file contains the full list of the refined topics, including the ID, Group, Topic, and # of Issues for each topic and each group, the categories that each topic comes from, and the Description of each topic group.
- `RQ2_Raw_Topic_Bugs.csv`: This file contains the original topics generated by the BERTopic model for bugs (a small inspection sketch follows this list). It includes the following fields:
  - ID: The ID of the topic.
  - Count: The number of issues in the topic.
  - Percentage: The percentage of related issues in the dataset.
  - Refined Topic: The refined topic label.
  - Representation: The most representative terms for each topic, computed using the class-based TF-IDF algorithm.
  - Maximal Marginal Relevance: The most diverse set of terms, selected using the Maximal Marginal Relevance algorithm.
  - Top Documents: The top 5 most relevant issues for each topic.
- `RQ2_Raw_Topic_Questions.csv`: This file contains the original topics generated by the BERTopic model for questions, with the same fields as `RQ2_Raw_Topic_Bugs.csv`.
- `RQ2_Raw_Topic_Enhancements.csv`: This file contains the original topics generated by the BERTopic model for enhancements, with the same fields as `RQ2_Raw_Topic_Bugs.csv`.
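A small sketch of inspecting the raw topic files with pandas; the column names are taken from the field list above, though their exact spelling in the CSVs is an assumption:

```python
import pandas as pd

topics = pd.read_csv("data/RQ2/RQ2_Raw_Topic_Bugs.csv")

# Largest topics first; in BERTopic output, topic ID -1 denotes the outlier topic.
largest = topics.sort_values("Count", ascending=False)
print(largest[["ID", "Count", "Percentage", "Refined Topic"]].head(10))
```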
The manual analysis results for RQ3 are located in the `data/RQ3/` directory.

- `RQ3_Manual_Analysis.csv`: This file contains the results of the manual analysis performed on the 60 issues. It includes the following fields:
  - ID: The ID of the issue.
  - Group: The topic group of the issue (e.g., Language Features).
  - Topic: The topic of the issue.
  - Final 5 Examples: The final 5 examples of the issue being analyzed.
  - Problem: The problem described in the issue.
  - Root Cause: The root cause of the problem described in the issue.
  - Response Type: The type of response provided in the issue (e.g., Workaround solution, Existing solution).
  - Response: The response provided in the issue.
The data is downloaded from GitHub and BitBucket by the scripts located in the `code/download_data/` directory. The scripts are as follows:

- `download_issues.py`: This script downloads issues and pull requests from GitHub and BitBucket and saves the results as JSON files in generated `raw_data/issues` and `raw_data/pull_requests` folders (a rough sketch of the GitHub side follows this list). These are temporary folders used to store the raw data before it is processed into the final dataset.
- `download_commits.py`: This script downloads commits from GitHub and BitBucket and saves the results as JSON files in a generated `raw_data/commits` folder. This is a temporary folder used to store the raw data before it is processed into the final dataset.
- `generate_dataset.py`: This script generates the CSV datasets from all the folders in the `raw_data` folder produced by `download_issues.py` and `download_commits.py`. It processes the raw JSON data and generates the `issues_metadata.csv`, `pull_requests_metadata.csv`, and `commits_metadata.csv` files, which can be found in `data/data.zip`.
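The scripts themselves hold the exact endpoints, authentication, and pagination logic; the following is only a rough sketch of the GitHub side, with the token handling and output file naming chosen for illustration:

```python
import json
import os
import requests

GITHUB_API = "https://api.github.com"
TOKEN = os.environ.get("GITHUB_TOKEN")  # hypothetical: the scripts may configure auth differently

def download_issues(owner: str, repo: str, out_dir: str = "raw_data/issues") -> None:
    """Page through a repository's issues and dump each page as JSON.

    Note: GitHub's issues endpoint also returns pull requests.
    """
    os.makedirs(out_dir, exist_ok=True)
    headers = {"Authorization": f"token {TOKEN}"} if TOKEN else {}
    page = 1
    while True:
        resp = requests.get(
            f"{GITHUB_API}/repos/{owner}/{repo}/issues",
            headers=headers,
            params={"state": "all", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        with open(os.path.join(out_dir, f"{owner}_{repo}_page{page}.json"), "w") as f:
            json.dump(batch, f)
        page += 1

# download_issues("pmd", "pmd")  # e.g., PMD, one of the subject tools
```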
The analyses performed in the paper are located in the `code/analysis/` directory. They answer RQ1 of the paper. The analysis code is divided into the following subdirectories:
- `subject_tools/`: Contains the code to provide the general statistics of the subject tools. This supports Section 3 of the paper (Research Questions and Study Objects). It identifies the number of stars, issues, and LoC of the subject tools (using `cloc`).
  - `subject_tools.ipynb`: This notebook contains the code to provide the general statistics of the subject tools. It collects the number of stars and issues from each repository.
  - `loc.ipynb`: This notebook contains the code to calculate the lines of code (LoC) of the subject tools using the `cloc` tool. It runs `cloc` on each repository and collects the LoC information (see the sketch after this list).
  - `set_tool_name.py`: This script is used to set the formatted names of the subject tools in the issue data.
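Since `loc.ipynb` relies on `cloc`, here is a minimal sketch of that kind of call using `cloc --json`; the checkout path is a placeholder:

```python
import json
import subprocess

def count_loc(repo_path: str) -> int:
    """Run cloc with JSON output and return the total lines of code."""
    result = subprocess.run(
        ["cloc", "--json", repo_path],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    return report["SUM"]["code"]  # cloc's aggregate entry across all languages

# print(count_loc("repos/pmd"))  # placeholder checkout path
```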
- `common_properties/`: Contains the code to generate the common properties of the issues (RQ1).
  - `generate_common_properties.ipynb`: Contains the code to generate the common properties of the issues. The properties are listed as follows:
    - state: whether the issue is open or closed
    - category: whether the issue is a bug, question, or enhancement
    - resolution time: the time taken to resolve the issue (if closed)
    - number of comments: the number of comments on the issue
    - number of unique users: the number of unique users who commented on the issue
    - number of files changed: the number of files changed in the issue (from linked pull requests and commits)
    - number of lines changed: the number of lines changed in the issue (from linked pull requests and commits)

    Each of these properties is calculated for each issue in the dataset. The results of this analysis are stored in the `issues_properties.csv` file in the zipped data file (an illustrative sketch follows this block).
  - `collect_specific_datapoints.ipynb`: Contains the code used to collect specific datapoints from the issues mentioned throughout the paper. This includes various distributions of the properties of the issues.
  - `investigation.ipynb`: Contains the code to investigate the specific cases of a large number of files and LoC changed in PMD and SootUp issues.
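As an illustration of how such properties can be derived from the metadata, here is a hedged sketch; the `created_at`, `closed_at`, and `state` column names and the `prop:` output names are assumptions about the schema:

```python
import pandas as pd

issues = pd.read_csv("data/issues_metadata.csv")

# Assumed schema: ISO-8601 timestamps and an open/closed state column.
created = pd.to_datetime(issues["created_at"])
closed = pd.to_datetime(issues["closed_at"])

props = pd.DataFrame({"prop:state": issues["state"]})
# Resolution time in days, defined only for closed issues.
props["prop:resolution time"] = (closed - created).dt.days.where(issues["state"] == "closed")
print(props.head())
```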
- `catiss_classification/`: While all of the other properties can be extracted directly from the datasets, the category property requires additional processing, as it is not directly provided in the metadata. This directory contains the code to classify issues into bugs, questions, and enhancements using CatISS.
  - `catiss_classification.ipynb`: This notebook contains the code to classify issues into bugs, questions, and enhancements using CatISS. It uses the CatISS model to classify the issues based on their title and body. The results are stored in the `issues_properties.csv` file in the zipped data file.
  - `predictions_analysis.ipynb`: This notebook contains the code to analyze the accuracy of the CatISS model predictions. It compares the predictions with the existing labels in the dataset and calculates the accuracy of the model (see the sketch below).
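A sketch of the accuracy comparison in the spirit of `predictions_analysis.ipynb`; the column names `prop:category` and `existing_label` are hypothetical:

```python
import pandas as pd

props = pd.read_csv("data/issues_properties.csv")

# Hypothetical columns: CatISS prediction vs. a label-derived ground truth.
labeled = props.dropna(subset=["existing_label"])
accuracy = (labeled["prop:category"] == labeled["existing_label"]).mean()
print(f"Accuracy on labeled issues: {accuracy:.2%}")
```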
- `interest_groups/`: Contains the code to generate the interest groups (RQ1).
  - `interest_groups.ipynb`: This notebook contains the code to generate the interest groups from the issues. The interest groups are: quick resolution, slow resolution, easy fix, hard fix, hot topic, and ignored. It defines the interest-group conditions and then applies them to the issues. The results are stored in the `issues_properties.csv` file in the zipped data file (an illustrative sketch follows).
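An illustrative sketch of applying interest-group conditions; the percentile cutoffs and property column names here are invented for illustration, and the notebook defines the actual conditions:

```python
import pandas as pd

props = pd.read_csv("data/issues_properties.csv")

# Invented thresholds for illustration only.
rt = props["prop:resolution time"]
comments = props["prop:number of comments"]

props["ig:quick resolution"] = rt <= rt.quantile(0.10)
props["ig:slow resolution"] = rt >= rt.quantile(0.90)
props["ig:hot topic"] = comments >= comments.quantile(0.90)
props["ig:ignored"] = comments == 0

print(props.filter(like="ig:").sum())  # issues per interest group
```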
The figures in the paper are generated using the code in the `code/analysis/figures/` directory. These notebooks compute the figures from the results of the analyses above. The figures are as follows:
- Figure 2: Distribution of properties of static analysis issues across all tools. (`common_properties_figures.ipynb`)
- Figure 3: Distribution of interest groups by tool. (`interest_groups_figures.ipynb`)
- Figure 4: Distribution of interest groups by topic groups. (`interest_groups_per_group_figures.ipynb`)
The topic modeling is performed using BERTopic and is located in the `code/topic_modeling/` directory. This answers RQ2 of the paper. The notebooks in this directory perform topic modeling independently on each of the three categories of issues: bugs (`cluster_bugs.ipynb`), questions (`cluster_questions.ipynb`), and enhancements (`cluster_enhancements.ipynb`). The results of the topic modeling are stored in the `issues_properties.csv` file in the zipped data file.
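A minimal BERTopic sketch in the spirit of these notebooks; the model configuration and the MMR diversity value are illustrative, not the notebooks' actual settings (class-based TF-IDF is BERTopic's default representation, and Maximal Marginal Relevance is available as a representation model):

```python
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance

def cluster_category(docs: list[str]) -> BERTopic:
    """Fit BERTopic on one category of issue texts (e.g., bugs)."""
    # c-TF-IDF is the default topic representation; MMR re-ranks terms for diversity.
    mmr = MaximalMarginalRelevance(diversity=0.3)  # diversity value is illustrative
    model = BERTopic(representation_model=mmr)
    topics, probs = model.fit_transform(docs)
    print(model.get_topic_info().head())  # topic IDs, counts, and top terms
    return model

# model = cluster_category(bug_texts)  # bug_texts: preprocessed bug-issue documents
```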
The utility functions and constants used throughout the code are located in the `code/utils/` directory. This includes functions to read, write, and process data, as well as constants used in the code.
- `constants.py`: Contains the constants used in the code, including stop words, formatted tool names, and other constants.
- `dataloader.py`: The dataloader is responsible for loading issue, pull request, commit, and repository data from the CSV files and providing it to the analysis code.
- `diamantopoulos_preprocessor.py`: Contains functions to preprocess the issue data using the steps of Diamantopoulos et al. This includes text cleaning, tokenization, and other preprocessing steps to prepare the data for analysis (a toy sketch follows).
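A toy sketch of this kind of cleaning-and-tokenization pipeline; it does not reproduce the exact steps of Diamantopoulos et al.:

```python
import re

def preprocess(text: str) -> list[str]:
    """Toy cleaning and tokenization; the real pipeline follows Diamantopoulos et al."""
    text = text.lower()
    text = re.sub(r"`[^`]*`", " ", text)       # drop inline code spans
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep alphabetic characters only
    return text.split()

print(preprocess("NullPointerException when running `cloc`, see https://example.com"))
```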