StarScout

This repository contains the source code of StarScout, a tool to find suspicious (and possibly faked) GitHub stars at scale. It also contains data and scripts to replicate results from the following paper:

Hao He, Haoqin Yang, Philipp Burckhardt, Alexandros Kapravelos, Bogdan Vasilescu, and Christian Kästner. 2026. Six Million (Suspected) Fake Stars on GitHub: A Growing Spiral of Popularity Contests, Spam, and Malware. In 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26), April 12–18, 2026, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3744916.3764531

@inproceedings{he2026six,
    title={Six Million (Suspected) Fake Stars on GitHub: A Growing Spiral of Popularity Contests, Spam, and Malware},
    author={He, Hao and Yang, Haoqin and Burckhardt, Philipp and Kapravelos, Alexandros and Vasilescu, Bogdan and K\"{a}stner, Christian},
    booktitle={2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE '26)},
    year={2026},
    doi={10.1145/3744916.3764531},
    url={https://doi.org/10.1145/3744916.3764531},
    publisher={{ACM}},
}

Setup

The scripts work with Python 3.12 and have only been tested on Ubuntu 22.04.

  1. Set up the Python environment. For example (using Anaconda):

    conda create -n fake-star python=3.12
    conda activate fake-star
    pip install -r requirements.txt
  2. Configure secrets in secrets.yaml:

    mongo_url: your_mongo_url
    github_tokens:
      - token: your_github_token
        name: your_github_username
      - token: your_github_token
        name: your_github_username
    bigquery_project: your_project_name
    bigquery_dataset: your_table_name
    google_cloud_bucket: your_google_cloud_bucket_name
    npm_follower_postgres: your_postgresql_that_stores_npm_follower_dataset
    virus_total_api_key: your_virus_total_api_key

    If you only want to replicate the Jupyter Notebooks, you only need to set up the MongoDB URL.

    If you only want to run the fake star detector, you only need to set up the MongoDB URL and the Google Cloud-related fields (remember to configure Google Cloud credentials). The remaining configurations are for experimental and research scripts.

Running the Detector

The detector employs two heuristics: a low-activity heuristic and a lockstep heuristic. Their parameters are defined in scripts/__init__.py. Notably, you may want to change END_DATE and COPYCATCH_DATE_CHUNKS to include the latest data. The CopyCatch algorithm for the lockstep heuristic works on half-year chunks as specified in COPYCATCH_DATE_CHUNKS, and a new chunk should be added manually every quarter (e.g., add ("240401", "241001") after Oct 2024).
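For orientation, the following is a hypothetical sketch of what these constants might look like in scripts/__init__.py. The constant names come from the text above; the dates and the window layout are illustrative assumptions, not the shipped defaults.

START_DATE = "190701"  # earliest month of star data to scan (illustrative value)
END_DATE = "250101"    # bump this to include the latest data

# CopyCatch runs on half-year windows; append a new (start, end) pair
# every quarter, e.g. ("240401", "241001") after Oct 2024.
COPYCATCH_DATE_CHUNKS = [
    ("230701", "240101"),
    ("231001", "240401"),
    ("240101", "240701"),
    ("240401", "241001"),
]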

To run the low-activity heuristic, use:

python -m scripts.dagster.simple_detector_bigquery

It will run the low-activity heuristic from scripts.START_DATE to scripts.END_DATE on Google BigQuery and write the results to MongoDB. Expect it to read >= 20TB of data ($6.25/TB on the default billing). The BigQuery queries won't take more than a few minutes, but the script also calls the GitHub API to collect additional information. Expect this part to be slower and to output a lot of error messages (because many of the fake star repositories have been deleted). Once the script finishes, you should see several CSV files in data/{END_DATE} and a new collection fake_stars.low_activity_stars in the MongoDB instance specified by scripts.MONGODB_URL.
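Once the run finishes, a few lines of pymongo are enough to sanity-check the results. This is a minimal sketch, assuming the connection string matches the mongo_url from your secrets.yaml; the collection name comes from the text above, and no particular document schema is assumed.

import pymongo

# Use the same connection string configured as mongo_url in secrets.yaml.
client = pymongo.MongoClient("mongodb://localhost:27017")
coll = client["fake_stars"]["low_activity_stars"]

# How many stars did the low-activity heuristic flag in this run?
print("flagged stars:", coll.estimated_document_count())

# Peek at one document to see the actual schema before writing queries.
print(coll.find_one())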

To run the lockstep heuristic, first use:

python -m scripts.copycatch.bigquery --run

Allow it about a week to finish all iterations and expect it to read >= 40TB of data. You can use nohup to run it as a background process. After a week, you can run the following command to collect the results into MongoDB and local CSV files:

# Write BigQuery Tables to Google Cloud Storage
# Then, export from Google Cloud Storage to MongoDB and local CSV files
python -m scripts.copycatch.bigquery --export

After the export finishes, you should see CSV files in the data/{END_DATE} folder.

The (Suspected) Fake Stars Dataset

StarScout was executed at three different time cutoffs: 240701, 241001, and 250101. The results from each run are available in the corresponding data/{240701, 241001, 250101} folders. Since the tool itself has evolved significantly between runs, merging data across the three cutoffs is tricky, so we provide a set of data loading utilities in the scripts/analysis/data.py module. The Jupyter Notebooks from the measurement study can also serve as examples of how to consume and repurpose the dataset from our study.
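If you prefer not to go through scripts/analysis/data.py, the per-cutoff CSV exports can also be loaded directly with pandas. This is only a sketch: the file name low_activity_stars.csv is a placeholder for whatever CSV files you find under each data/{cutoff} folder, and no particular column layout is assumed.

import pandas as pd

# Each StarScout run wrote its CSV exports into a folder named after the cutoff.
CUTOFFS = ["240701", "241001", "250101"]

frames = []
for cutoff in CUTOFFS:
    # Placeholder file name; check the data/{cutoff} folder for the real ones.
    df = pd.read_csv(f"data/{cutoff}/low_activity_stars.csv")
    df["cutoff"] = cutoff  # remember which run each row came from
    frames.append(df)

stars = pd.concat(frames, ignore_index=True)
print(stars.groupby("cutoff").size())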

Disclaimer. As we discuss in Sections 3.4 and 3.5 of our paper, the resulting dataset contains only repositories and users with suspected fake stars; individual repositories and users in the dataset may be false positives. The main purpose of the dataset is to support statistical analyses (which tolerate noise reasonably well), not to publicly shame individual repositories. If you intend to publish subsequent work based on our dataset, please be aware of this limitation and its ethical implications.

The Measurement Study

All the scripts and results from our measurement study can be found in the Jupyter Notebooks. The notebook names should be mostly self-explanatory. Note that some of the notebooks require a MongoDB database with the detailed detection results from StarScout in order to execute correctly. The dataset dump is available on Zenodo as gzip files and can be restored using the mongorestore tool like this:

mongorestore --gzip --db fake_stars [your-downloaded-file-path]/mongodb/fake_stars/
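After restoring, you can quickly confirm that the notebooks will find the data they expect. A minimal check, assuming the dump was restored into a local MongoDB instance as the fake_stars database (adjust the connection string to match your mongo_url):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
db = client["fake_stars"]

# List the restored collections and their sizes; the notebooks read from these.
for name in sorted(db.list_collection_names()):
    print(name, db[name].estimated_document_count())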
