This repo contains Python code to generate the global dataset of factor returns, stock returns, and firm characteristics from “Is there a Replication Crisis in Finance?” by Jensen, Kelly, and Pedersen (Journal of Finance, 2023).
- Obtain your WRDS credentials.
- Ensure you have uv installed on your system.
- Clone the repo
- Clone the repository to your local machine by running the following command from your terminal:
git clone https://github.com/bkelly-lab/jkp-data.git
- Input WRDS credentials
- To save your WRDS credentials, navigate to the jkp-data/ folder and run:
jkp connect
Follow the prompts.
Note: If you need to change your password or credentials, run
jkp connect --reset
and then
jkp connect
- Run the script
- We run the code via a Slurm scheduler, but we also show how to run it in an interactive Python session.
- Before running the following commands, make sure you are in the jkp-data/ directory.
- On a cluster with a Slurm scheduler, run:
sbatch slurm/submit_job_som_hpc.slurm
to create the factor returns, stock returns, and firm characteristics.
In an interactive session, run:
jkp build data/
to create the stock returns and firm characteristics, and
jkp portfolio data/
to create the factor returns.
IMPORTANT: When starting the code, you may be prompted to grant access to WRDS using two-factor authentication, for example via a Duo notification. You need to approve this request, as the program will otherwise fail. After a few seconds or minutes, you should see data being created in the output directory. If that is not the case, please check your internet connection or credentials.
- When the code is finished, you can find the output in the processed/ subdirectory of your output directory (e.g. data/processed/).
Please see the release notes (documentation/release_notes.html) for a description of the output files and a comparison between the output of the SAS/R codebase and the new Python codebase.
- By default, output files are written in Parquet format. To output CSV files instead (with quoted strings to preserve leading zeros in identifiers like gvkey), run:
jkp portfolio data/ --output-format csv
- By default, the end date for the data in the code is 2025-12-31, which you can change by editing the end_date assignment in src/jkp/data/config.py. For example, for May 6, 1992, use:
END_DATE = date(1992, 5, 6)
- Persistent WRDS Connection: If you're running on an HPC cluster with NAT IP rotation (such as Yale's Bouchet cluster), you may receive many MFA prompts during data download. This happens because each database query creates a new TCP connection, and the NAT gateway assigns a random outbound IP to each connection. WRDS sees these as connections from different locations and triggers MFA for each.
To avoid this, use the --persistent-connection flag, which maintains a single database connection throughout the download process:
# Interactive session
jkp build data/ --persistent-connection
# Slurm job (set environment variable)
sbatch --export=ALL,PERSISTENT_WRDS_CONNECTION=1 slurm/submit_job_som_hpc.slurm
This reduces MFA prompts from ~26 (one per table) to just 1 (at connection time).
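The interactive commands above can also be chained from a small Python driver. The following is a minimal sketch, not part of the repository: the helper names jkp_commands and run_pipeline are hypothetical, only the commands and flags shown in this README are used, and it assumes the jkp CLI exits with a non-zero status when a step fails.

```python
import subprocess
from typing import List, Optional


def jkp_commands(
    output_dir: str = "data/",
    persistent_connection: bool = False,
    output_format: Optional[str] = None,
) -> List[List[str]]:
    """Return the two CLI invocations from this README as argument lists.

    Hypothetical helper: it only assembles commands, it does not run them.
    """
    build = ["jkp", "build", output_dir]
    if persistent_connection:
        # Flag shown above for `jkp build` (single WRDS connection, one MFA prompt).
        build.append("--persistent-connection")

    portfolio = ["jkp", "portfolio", output_dir]
    if output_format is not None:
        # Flag shown above for `jkp portfolio`, e.g. "csv".
        portfolio += ["--output-format", output_format]

    # Order matters: stock returns and characteristics first, then factor returns.
    return [build, portfolio]


def run_pipeline(**kwargs) -> None:
    """Run both steps in order and stop at the first failure."""
    for cmd in jkp_commands(**kwargs):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit status
        # (assumes the CLI reports failures that way).
        subprocess.run(cmd, check=True)
```

Calling run_pipeline(persistent_connection=True) from inside jkp-data/ would then run jkp build data/ --persistent-connection followed by jkp portfolio data/. You still need to approve the WRDS two-factor request when prompted.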
- To run the code, we use a high-performance computing cluster, where we request 450 GB of RAM and 128 CPU cores. Running the routine takes about 6 hours.
- To understand the data, please refer to our documentation.
- We distribute the global factor returns generated from this codebase at jkpfactors.com and the stock returns and firm characteristics at wrds-www.wharton.upenn.edu/pages/get-data/contributed-data-forms/global-factor-data/.
- The original SAS/R codebase is still available at github.com/bkelly-lab/ReplicationCrisis, but we recommend using this new Python codebase for future work.
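The CSV output option described earlier quotes strings so that identifiers such as gvkey keep their leading zeros. The sketch below illustrates the principle with Python's standard csv module. The sample row and the ret column are made up for illustration, and the exact quoting style used by the jkp CSV writer is an assumption here.

```python
import csv
import io

# Hypothetical sample row; the real files and columns are described in the documentation.
rows = [{"gvkey": "001234", "ret": 0.0125}]

# Write with every non-numeric field quoted, so the identifier is stored as "001234".
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["gvkey", "ret"], quoting=csv.QUOTE_NONNUMERIC)
writer.writeheader()
writer.writerows(rows)

# Read back: with QUOTE_NONNUMERIC, quoted fields stay strings and
# unquoted fields are converted to float.
buf.seek(0)
reader = csv.reader(buf, quoting=csv.QUOTE_NONNUMERIC)
header = next(reader)
record = next(reader)

# The same identifier without quotes is parsed as a number and loses its zeros.
unquoted = next(csv.reader(io.StringIO("001234,0.0125\n"), quoting=csv.QUOTE_NONNUMERIC))

print(record)    # gvkey is still the string '001234'
print(unquoted)  # gvkey has become the float 1234.0
```

Readers that infer column types regardless of quoting may still need gvkey declared as a string column explicitly.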
Code in this repository is released under the MIT License.
Data distributed in this repository is licensed under Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0).
See LICENSE and DATA_LICENSE for details.