Global Factor, Stock, and Firm data

This repo contains Python code to generate the global dataset of factor returns, stock returns, and firm characteristics from “Is there a Replication Crisis in Finance?” by Jensen, Kelly, and Pedersen (Journal of Finance, 2023).

Instructions

Prerequisites

Obtain your WRDS credentials.
Ensure you have uv installed on your system.

Steps

Clone the repo
- Clone the folder to your local machine by running the following command from your terminal:
```
git clone https://github.com/bkelly-lab/jkp-data.git
```
Input WRDS credentials
- To save your WRDS credentials, navigate to the jkp-data/ folder and run:
```
jkp connect
```
  Kindly follow the prompts.
  
  Note: If you need to change your password or credentials, run jkp connect --reset and then jkp connect
Run the script
- We run the code via a Slurm scheduler, but we also show how to run it in an interactive Python session.
- Before running the following commands, make sure you are in jkp-data/
- On a cluster with a Slurm scheduler, run:
```
sbatch slurm/submit_job_som_hpc.slurm
```
  to create the factor returns, stock returns, and firm characteristics.
  
  In an interactive session, run:
```
jkp build data/
```
  to create the stock returns and firm characteristics, and
```
jkp portfolio data/
```
  to create the factor returns.
IMPORTANT: When starting the code, you may be prompted to grant access to WRDS using two-factor authentication, for example via a Duo notification. You need to approve this request, as the program will otherwise fail. After a few seconds or minutes, you should see data being created in the output directory. If that is not the case, please check your internet connection or credentials.

When the code is finished, you can find the output in the processed/ subdirectory of your output directory (e.g. data/processed/). Please see the release notes (documentation/release_notes.html) for a description of the output files and a comparison between the output of the SAS/R codebase and the new Python codebase.

Notes

By default, output files are written in Parquet format. To output CSV files instead (with quoted strings to preserve leading zeros in identifiers like gvkey), run:
```
jkp portfolio data/ --output-format csv
```
By default, the end date for the data in the code is 2025-12-31, which you can change by editing the end_date assignment in src/jkp/data/config.py. For example, for May 6, 1992, use: END_DATE = date(1992, 5, 6).
Persistent WRDS Connection: If you're running on an HPC cluster with NAT IP rotation (such as Yale's Bouchet cluster), you may receive many MFA prompts during data download. This happens because each database query creates a new TCP connection, and the NAT gateway assigns a random outbound IP to each connection. WRDS sees these as connections from different locations and triggers MFA for each.

To avoid this, use the --persistent-connection flag, which maintains a single database connection throughout the download process:
```
# Interactive session
jkp build data/ --persistent-connection

# Slurm job (set environment variable)
sbatch --export=ALL,PERSISTENT_WRDS_CONNECTION=1 slurm/submit_job_som_hpc.slurm
```
This reduces MFA prompts from ~26 (one per table) to just 1 (at connection time).
To run the code, we utilize a high performance computing cluster, where we request 450 GB RAM and 128 CPU cores. Running the routine takes about 6 hours.
To understand the data, please refer to our documentation.
We distribute the global factor returns generated from this codebase at jkpfactors.com and the stock returns and firm characteristics at wrds-www.wharton.upenn.edu/pages/get-data/contributed-data-forms/global-factor-data/.
The original SAS/R codebase is still available at github.com/bkelly-lab/ReplicationCrisis, but we recommend using this new Python codebase for future work.

License

Code in this repository is released under the MIT License.

Data distributed in this repository is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).

See LICENSE and DATA_LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 325 Commits
.claude		.claude
.github		.github
documentation		documentation
slurm		slurm
src/jkp/data		src/jkp/data
tests		tests
.gitignore		.gitignore
.python-version		.python-version
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
DATA_LICENSE		DATA_LICENSE
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Global Factor, Stock, and Firm data

Instructions

Prerequisites

Steps

Notes

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Global Factor, Stock, and Firm data

Instructions

Prerequisites

Steps

Notes

License

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages