This repo contains an implementation of a flexible, machine-learning-based Global Change Analysis Model (GCAM) emulator.
- Develop and train a neural network on data derived from the experiments defined in Woodard (2023) (commonly referenced as "exp 1 jr.").
- Compare and contrast the performance of the emulator and the 'core' GCAM model.
- Compare and contrast the sensitivity of the emulator and the 'core' GCAM model.
- Experiment with the sensitivity of emulator performance as a function of the amount of training data.
Researchers
Students
- Andrew Holmes
- Hidemi Mitani Shen
- Matt Jensen
- Sarah Coffland
- Logan Sizemore
- Seth Bassetti
- Brenn Nieva
Water Scarcity Modeling
- GCAM core repo
- GCAM build process
- GCAMDATA build process
- GCAM background paper
- GCAM post processing
- GCAM video - intro
- GCAM video - tutorial
To set up your environment using UV, follow these steps:
- First, ensure you have UV installed on your system.
curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone the repository to your local machine:
git clone https://github.com/JGCRI/gcam-emulator.git
cd gcam-emulator
- Create a UV environment using the pyproject.toml file provided in the repository:
uv sync
[optional]
uv sync --no-cache
- Use the newly created environment:
# ideal way
uv run -m ml_gcam --help
or
# optional way
source .venv/bin/activate
python3 -m ml_gcam --help
To set up your environment using Conda, follow these steps:
- First, ensure you have Conda installed on your system.
- Clone the repository to your local machine:
git clone https://github.com/JGCRI/gcam-emulator.git
cd gcam-emulator
- Create a Conda environment using the environment.yml file provided in the repository:
conda env create -f environment.yml
# or with the Makefile:
# make conda_update
- Activate the newly created environment:
conda activate ml_climate_gcam22
To set up your environment using venv, follow these steps:
- Clone the repository to your local machine:
git clone https://github.com/JGCRI/gcam-emulator.git
cd gcam-emulator
- Create a virtualenv environment using venv and the requirements.txt file provided in the repository:
# create the virtual environment
python3 -m venv .venv
# load the python environment
source .venv/bin/activate
pip install -r requirements.txt
python3 -m ml_gcam --help
# or reference it directly
.venv/bin/pip install -r requirements.txt
.venv/bin/python3 -m ml_gcam --help
- After activating the environment, you can proceed with the rest of the setup.
Configuration variables can be set in one of three ways via the python-configuration package and are used in the config module to modify the behavior of the entire training and running process:
- via the system environment
# prefix the key with 'ML_GCAM', then separate each nesting with two underscores:
os.environ['ML_GCAM__MODEL__HIDDEN_SIZE'] = '16'  # environment values are strings
config.reload()
assert int(config.model.hidden_size) == 16
- via the .env file in the project root directory
copy .env.example and change default values to match your preferences and system:
cp .env.example .env
Update before running:
- RESEARCH_DRIVE
- WORKSPACE_DRIVE
- REPO_ROOT
- WANDB_API_KEY
- TARGETS_PATH
- via .toml files in the ml_gcam/config directory
- Once modified, the config can be accessed via the config object in ml_gcam/__init__.py:
# from ml_gcam/training/train.py:
from ml_gcam import config
...
emulator = DNN(
in_size=len(config.data.input_keys),
hidden_size=int(config.model.hidden_size),
depth=int(config.model.depth),
n_heads=n_heads,
n_features=len(config.data.output_keys),
)
We use Weights & Biases (wandb) for logging. To use it, you will need to make an account: https://wandb.ai/site
Then set the appropriate config variables:
# .env:
WANDB_API_KEY="[replace with api key]"
ML_GCAM__WANDB__ENTITY="wandb username/entity"
ML_GCAM__WANDB__TAGS="test,perfect"
ML_GCAM__WANDB__GROUP="dd_mm_name_is_good"
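For orientation, the sketch below shows how these values would typically be passed to wandb.init; it is illustrative only, since the actual call is made inside the training module, and the entity, group, and tag values are the placeholders from above:

# illustrative sketch only: how the wandb config values map onto wandb.init
import wandb

run = wandb.init(
    entity="wandb username/entity",  # ML_GCAM__WANDB__ENTITY
    group="dd_mm_name_is_good",      # ML_GCAM__WANDB__GROUP
    tags=["test", "perfect"],        # ML_GCAM__WANDB__TAGS, comma-separated in .env
)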
| Sampling | Range | Training Scenarios | Validation Scenarios | Test Scenarios | New GCAM Samples? |
|---|---|---|---|---|---|
| Binary | x ∈ {0, 1} | 3274 | 409 | 409 | Yes |
| Random | x ∈ [0, 1] | 3209 | 401 | 400 | Yes |
| Finite Diff | x ∈ [0, 1] | 0 | 0 | 3650 | Yes |
| Input | Description | Key | Interpolated? |
|---|---|---|---|
| Energy demand | Demand, GDP, and population assumptions | energy | Yes |
| Fossil Fuel | Costs of oil, natural gas and coal | ff | Yes |
| Nuclear | Capital overnight cost | nuc | Yes |
| Solar Storage | Solar storage capital overnight cost | solarS | Yes |
| Solar Tech | Concentrating solar-thermal power and photovoltaic technology costs | solarT | Yes |
| Wind Storage | Wind storage capital overnight cost | windS | Yes |
| Wind Tech | Wind and wind offshore capital overnight cost | windT | Yes |
| Backups | Systems needed to backup solar and wind | back | Yes |
| Carbon Capture | Cost of CCS technologies | ccs | Yes |
| Electrification | Share of buildings, industries and transport using electricity | elec | No |
| Emissions | CO2 emission costs | emiss | No |
| Bioenergy | Potential costs to adoption of bioenergy | bio | No |
| Resource | Metric | Sector | Units | Query |
|---|---|---|---|---|
| energy | demand_electricity | building | EJ | elec_consumption_by_demand_sector |
| energy | demand_electricity | industry | EJ | elec_consumption_by_demand_sector |
| energy | demand_electricity | transport | EJ | elec_consumption_by_demand_sector |
| energy | demand_fuel | building | EJ | final_energy_consumption_by_sector_and_fuel |
| energy | demand_fuel | industry | EJ | final_energy_consumption_by_sector_and_fuel |
| energy | demand_fuel | transport | EJ | final_energy_consumption_by_sector_and_fuel |
| energy | price | coal | 1975$/GJ | final_energy_prices |
| energy | price | electricity | 1975$/GJ | final_energy_prices |
| energy | price | transport | 1975$/GJ | final_energy_prices |
| energy | supply_electricity | biomass | EJ | elec_gen_by_subsector |
| energy | supply_electricity | coal | EJ | elec_gen_by_subsector |
| energy | supply_electricity | gas | EJ | elec_gen_by_subsector |
| energy | supply_electricity | nuclear | EJ | elec_gen_by_subsector |
| energy | supply_electricity | oil | EJ | elec_gen_by_subsector |
| energy | supply_electricity | other | EJ | elec_gen_by_subsector |
| energy | supply_electricity | solar | EJ | elec_gen_by_subsector |
| energy | supply_electricity | wind | EJ | elec_gen_by_subsector |
| energy | supply_primary | biomass | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | coal | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | gas | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | nuclear | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | oil | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | other | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | solar | EJ | primary_energy_consumption_by_region |
| energy | supply_primary | wind | EJ | primary_energy_consumption_by_region |
| land | allocation | biomass | thousand km2 | aggregated_land_allocation |
| land | allocation | forest | thousand km2 | aggregated_land_allocation |
| land | allocation | grass | thousand km2 | aggregated_land_allocation |
| land | allocation | other | thousand km2 | aggregated_land_allocation |
| land | allocation | pasture | thousand km2 | aggregated_land_allocation |
| land | demand | feed | Mt | demand_balances_by_crop_commodity |
| land | demand | food | Mt | demand_balances_by_crop_commodity |
| land | price | biomass | 1975$/GJ | prices_by_sector |
| land | price | forest | 1975$/m3 | prices_by_sector |
| land | production | biomass | EJ | ag_production_by_crop_type |
| land | production | forest | billion m3 | ag_production_by_crop_type |
| land | production | grass | Mt | ag_production_by_crop_type |
| land | production | other | Mt | ag_production_by_crop_type |
| land | production | pasture | Mt | ag_production_by_crop_type |
| water | demand | crops | km3 | water_withdrawls_by_tech |
| water | demand | electricity | km3 | water_withdrawls_by_tech |
To generate the raw data used in the supervised learning task, runs of GCAM core with the specific set of configurations outlined in Woodard (2023) are required.
In this study, GCAM core was run 2^12 = 4096 times, once for each binary combination of the 12 input variables.
Reference configuration files are provided in the ml_gcam/core/configs/*.xml files.
This process will generate a set of BaseX database files, one for each configuration.
To complete the interpolation study, these configuration files need to be modified. Scripts for doing so are available:
> python -m ml_gcam --help | grep interpolate
core-interpolate:make-configs create configs from set of interpolated...
core-interpolate:make-inputs create paths containing interpolated inputs
core-interpolate:sample creates a metadata.csv with [--sample]...
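The three subcommands roughly follow the order sample → make-inputs → make-configs (sampling metadata first, then interpolated input paths, then the configs themselves); consult each command's --help for the exact flags:

> python -m ml_gcam core-interpolate:sample --help
> python -m ml_gcam core-interpolate:make-inputs --help
> python -m ml_gcam core-interpolate:make-configs --help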
GCAM core outputs, in the form of BaseX databases, can be extracted to .csv using either the ModelInterface tool or the Python gcamreader package.
Both tools rely on query .xml files to define the data to extract from the BaseX databases.
Reference query files are available in the ml_gcam/core/queries/*.xml files.
A typical run to extract a query using gcamreader looks something like this:
> python3 -m gcamreader local --help
Usage: python -m gcamreader local [OPTIONS]
query gcam scenario databases
Options:
-d, --database_path DIRECTORY path to database file (i.e. parent of *.basex
dir) [required]
-q, --query_path FILE path to xml with queries to run (i.e:
Main_queries.xml) [required]
-o, --output_path DIRECTORY path to output (i.e. where .csv files should
be created)
-f, --force BOOLEAN overwrite existing .csv in output path
--help Show this message and exit.
> python3 -m gcamreader local \
--query_path ml_gcam/core/queries/Main_queries.xml \
--database_path /path/to/.basex/parent \
--output_path data/query_outputs
...
Once extracted, individual queries are aggregated to generate the targets for the emulator. Scripts for doing so are available:
> python -m ml_gcam data:create-extracts --help
Usage: python -m ml_gcam data:create-extracts [OPTIONS]
aggregated raw extract csv from all experiments via sql templates
Options:
-e, --experiment [binary|interp_hypercube|interp_dgsm]
-q, --queries [agriculture_prices|electricity_supply|emissions_capture|energy_demand_share_electricity|energy_demand_share_primary|energy_prices|energy_supply_share_electricity|energy_supply_share_primary|land_demand|land_prices|land_supply_allocation|land_supply_production|water_demand|water_consumption]
-g, --gcamreader-outputs DIRECTORY
directory with the .csv outputs from
gcamreader [required]
--save_path DIRECTORY directory to save {experiment}.csv outputs
(/path/to/data/targets/)
-f, --force
--pretend / --no-pretend
--help Show this message and exit.
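For example, a run over the binary experiment might look like this (paths are illustrative):

> python -m ml_gcam data:create-extracts \
    --experiment binary \
    --gcamreader-outputs data/query_outputs \
    --save_path data/targets/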
To turn these extracts into an aggregated version for training, run the following command:
> python3 -m ml_gcam data:create-targets --help
Usage: python -m ml_gcam data:create-targets [OPTIONS]
create meta.scenarios table
Options:
-e, --experiment [binary|interp_hypercube|interp_dgsm]
--targets_path DIRECTORY /path/to/targets/
--scenarios_path FILE path to scenarios.csv
-s, --save_path FILE /path/to/targets/targets.parquet (output
partitioned by experiment and split)
-f, --force
--help Show this message and exit.
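A typical invocation (paths illustrative):

> python -m ml_gcam data:create-targets \
    --experiment binary \
    --targets_path /path/to/targets/ \
    --scenarios_path /path/to/scenarios.csv \
    --save_path /path/to/targets/targets.parquet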
This will create a parquet file named targets.parquet that is used by all downstream tasks for training and validation.
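To sanity-check the output, the parquet file can be opened directly; a minimal sketch, assuming pandas and pyarrow are installed and the path matches your --save_path:

# inspect the aggregated training targets
import pandas as pd

targets = pd.read_parquet("/path/to/targets/targets.parquet")
print(targets.head())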
To train the emulator, our project uses a command-line interface (CLI) to manage the various training loops for machine learning models built with PyTorch and Weights & Biases (wandb). Below, you'll find detailed instructions on how to use each command in the training module.
All training loop commands are executed as a module argument to Python and require specific options depending on the task:
python -m ml_gcam <command> [OPTIONS]
Commands Overview
- training:run: Main training loop.
- training:sample-size: Training loop for experimenting with different sample sizes.
- training:cartesian: Training loop for experimenting with combinations of different sampling strategies.
- training:sweep-init: Initial setup for hyperparameter sweeps.
- training:sweep-run: Execution of hyperparameter sweeps.
python -m ml_gcam --no-wandb training:run \
--targets_path /path/to/targets.parquet \
--train_source binary \
--dev_source binary \
--checkpoint_path /path/to/checkpoint

> python -m ml_gcam training:run --help
Usage: python -m ml_gcam training:run [OPTIONS]
main training loop
Options:
-n, --normalization-strategy [z_score|min_max|robust]
how to handle normalization of target before
training [required]
-c, --checkpoint_path FILE /path/to/model/checkpoint/name/
-d, --dev_source [mixed|binary|interp_hypercube]
name of experiment(s) to use as dev_source
[default: interp_sobol; required]
-t, --train_source [mixed|interp_hypercube|binary]
name of experiment(s) to use as train_source
[default: binary; required]
--targets_path FILE /path/to/targets.parquet [default: /path/to
/targets.parquet; required]
--help Show this message and exit.

> python -m ml_gcam training:sample-size --help
Usage: python -m ml_gcam training:sample-size [OPTIONS]
training loop for sample size experiments
Options:
--targets_path FILE /path/to/targets.parquet [default: /path/to
/targets.parquet; required]
-t, --train_source [mixed|interp_hypercube|binary]
name of experiment(s) to use as train_source
[default: binary; required]
-d, --dev_source [mixed|interp_hypercube|binary]
name of experiment(s) to use as dev_source
[default: interp_sobol; required]
-c, --checkpoint_path DIRECTORY
/path/to/model/checkpoint/name/
-s, --splits FLOAT samples to use for training, in percent of
train_set [default: 0.01, 0.02, 0.05,
0.075, 0.1, 0.15, 0.2, 0.3, 0.5, 0.6, 0.65,
0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0;
required]
--help Show this message and exit.
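A typical sample-size run, mirroring the training:run example above (paths illustrative):

> python -m ml_gcam --no-wandb training:sample-size \
    --targets_path /path/to/targets.parquet \
    --train_source binary \
    --dev_source binary \
    --checkpoint_path /path/to/checkpoints/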
> python -m ml_gcam training:cartesian --help
Usage: python -m ml_gcam training:cartesian [OPTIONS]
configure a wandb hyperparameter sweep and make a sweep id
Options:
--targets_path FILE /path/to/targets.parquet [default: /path/to
/targets.parquet; required]
-t, --train_source [mixed|interp_hypercube|binary]
name of experiment(s) to use as train_source
[default: mixed, interp_hypercube,
binary; required]
-d, --dev_source [mixed|interp_hypercube|binary]
name of experiment(s) to use as dev_source
[default: mixed, interp_hypercube,
binary; required]
-c, --checkpoint_path DIRECTORY
/path/to/model/checkpoint/name/
--help Show this message and exit.

We picked r2 scores as the most important evaluation metric for these experiments. During training, some aggregate statistics are logged to the console, and to wandb if enabled. To generate these metrics from a saved model checkpoint, use the following commands:

> python -m ml_gcam evaluate:sample-size --help
Usage: python -m ml_gcam evaluate:sample-size [OPTIONS]
load and create r2 scores for training size sweep
Options:
-c, --checkpoint_path DIRECTORY
/path/to/model/checkpoints/for-sample-size-sweep/
[required]
-t, --train_source [mixed|binary|interp_hypercube]
[required]
-o, --save_path FILE path to save score.csv [required]
-f, --force forces the --save_path to overwrite existing
data
--targets_path FILE /path/to/targets.parquet [default: /path/to
/targets.parquet; required]
--help Show this message and exit.
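For example (paths illustrative):

> python -m ml_gcam evaluate:sample-size \
    --checkpoint_path /path/to/model/checkpoints/for-sample-size-sweep/ \
    --train_source binary \
    --save_path /path/to/sample-size/scores.csv \
    --targets_path /path/to/targets.parquet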
> python -m ml_gcam evaluate:cartesian --help
Usage: python -m ml_gcam evaluate:cartesian [OPTIONS]
load and create r2 scores for cartesian sweep
Options:
-c, --checkpoint_path DIRECTORY
/path/to/model/checkpoints/for-cartesian-sweep/
[required]
-o, --save_path FILE path to save score.csv [required]
-f, --force forces the --save_path to overwrite existing
data
--targets_path FILE /path/to/targets.parquet [default:
/path/to/targets.parquet; required]
--help Show this message and exit.
> python -m ml_gcam table:cartesian --help
Usage: python -m ml_gcam table:cartesian [OPTIONS]
display the cartesian product of training x dev sets
Options:
-f, --force forces the --save_path to overwrite existing
data
-o, --save_path FILE path to save figure.png [required]
-t, --table_format [latex|markdown]
-s, --score_path FILE /path/to/cartesian/scores.csv [required]
--help Show this message and exit.
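For example, to render the table as markdown (paths illustrative):

> python -m ml_gcam table:cartesian \
    --score_path /path/to/cartesian/scores.csv \
    --save_path /path/to/cartesian/table.md \
    --table_format markdown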
> python -m ml_gcam plot:sample-size --help
Usage: python -m ml_gcam plot:sample-size [OPTIONS]
plot r2 vs. training size from previously generated data
Options:
-f, --force forces the --save_path to overwrite existing
data
-o, --save_path FILE path to save figure.png [required]
-a, --aggregation [median|mean|above_0_9|above_0_95|overall]
-s, --score_path FILE /path/to/sample-size/scores.csv [required]
--help Show this message and exit.
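For example, plotting the median aggregation (paths illustrative):

> python -m ml_gcam plot:sample-size \
    --score_path /path/to/sample-size/scores.csv \
    --save_path /path/to/figures/sample_size.png \
    --aggregation median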