Update to use pandas v2.* #932

jpn-- · 2025-03-18T18:23:50Z

Addresses #794.

The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:

DataFrame Index objects are all one class with different datatypes, instead of being different classes (e.g. there is no more Int64Index class).
The read_csv function by default now interprets "None" as a missing value (i.e. NaN) instead of reading it as the literal string "None".
The groupby operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
A simple df.join() also potentially sorts the resulting rows differently unless an explicit sort argument is given.
Index objects no longer have an is_monotonic attribute; is_monotonic_increasing must be used instead.
The handling of dtypes appears to have improved in some instances: some operations that used to promote dtypes no longer do (e.g. variables that are originally int16 used to become int64 after some operations, and now they keep their original dtype).
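A few of the items above can be checked in a short script. This is only a sketch, written to behave the same under pandas 1.x and 2.x; the CSV text and the na_values list are made up for illustration and are not the values used in this PR.

```python
import io

import pandas as pd

# There is one Index class, distinguished by dtype, so test the dtype
# rather than testing for a class such as Int64Index.
idx = pd.Index([1, 2, 3])
is_int_index = pd.api.types.is_integer_dtype(idx.dtype)

# is_monotonic_increasing is available in both pandas 1.x and 2.x.
ordered = idx.is_monotonic_increasing

# Pinning the NA markers explicitly keeps "None" as a plain string,
# whatever the default NA list of the installed pandas version is.
csv_text = "name,value\nNone,1\n,2\nx,3\n"
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])
```

Here only the empty field in the name column becomes NaN; the "None" entry survives as text.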

This pull request includes several changes across multiple files to address these pandas changes. The most important are modifications to sorting operations, improved error handling in logging, and a new fast_eval function for DataFrame evaluations, introduced because the regular pandas.eval is significantly slower under pandas 2.x.

Data Handling Improvements:

activitysim/abm/models/disaggregate_accessibility.py: Added sorting operations to ensure data consistency in expand_template_zones, create_proto_pop, and merge_persons methods. [1] [2] [3]
activitysim/abm/models/school_escorting.py: Reset index for escort_bundles to maintain data integrity.
activitysim/abm/models/util/school_escort_tours_trips.py: Reset index for create_chauf_escort_trips and create_escortee_trips methods to ensure proper data handling. [1] [2]
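The general pattern behind these changes, as a minimal sketch: make row order explicit after a join instead of relying on pandas defaults, and reset the index where the old labels carry no meaning. The frames and column names below are invented for illustration and do not come from the ActivitySim code.

```python
import pandas as pd

left = pd.DataFrame({"x": [10, 20, 30]}, index=pd.Index([3, 1, 2], name="id"))
right = pd.DataFrame({"y": ["a", "b", "c"]}, index=pd.Index([1, 2, 3], name="id"))

# Sort explicitly after the join so the row order does not depend on
# the pandas version's default join behavior.
joined = left.join(right).sort_index()

# After stacking frames the index labels repeat; reset_index(drop=True)
# replaces them with a clean 0..n-1 range.
stacked = pd.concat([joined, joined]).reset_index(drop=True)
```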

Error Handling Enhancements:

activitysim/abm/models/input_checker.py: Added try-except blocks to improve error logging for dataframe and element-wise validators. [1] [2]
activitysim/core/test/_tools.py: Enhanced error reporting in progressive_checkpoint_test by including exception details.
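As a rough illustration of the try-except logging pattern described here (not the actual input_checker code), a validator can be wrapped so that a failure is logged with the exception type and message rather than surfacing as a bare traceback. The names run_validator and no_negatives are hypothetical.

```python
import logging

logger = logging.getLogger("input_checker_sketch")


def run_validator(name, validator, data):
    """Run one validator; on failure, log the exception details and return a message."""
    try:
        validator(data)
    except Exception as err:
        message = f"{name}: {type(err).__name__}: {err}"
        logger.error("validator failed: %s", message)
        return message
    return None


def no_negatives(values):
    # Hypothetical validator used only for this example.
    if any(v < 0 for v in values):
        raise ValueError("negative value found")


results = [
    run_validator("no_negatives", no_negatives, [1, -2]),
    run_validator("no_negatives", no_negatives, [1, 2]),
]
errors = [msg for msg in results if msg is not None]
```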

Evaluation Process Optimization:

Introduced fast_eval function in activitysim/core/fast_eval.py to optimize DataFrame evaluations by handling special characters in column names and improving performance.
Updated references to df.eval in activitysim/core/interaction_simulate.py and activitysim/core/simulate.py to use fast_eval for better performance and consistency. [1] [2] [3] [4] [5]
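To give a feel for the idea, here is a hypothetical stand-in that evaluates an expression against DataFrame columns without going through pandas.eval. It is not the code in activitysim/core/fast_eval.py; the function name sketch_eval, the use of Python's built-in eval, and the df['...'] escape hatch for awkward column names are all assumptions made for this example. Python's eval runs arbitrary code, so this is only reasonable for trusted expression files.

```python
import numpy as np
import pandas as pd


def sketch_eval(df: pd.DataFrame, expr: str) -> pd.Series:
    """Hypothetical stand-in for fast_eval; not the PR's implementation."""
    # Expose only columns whose names are valid Python identifiers.
    namespace = {
        col: df[col]
        for col in df.columns
        if isinstance(col, str) and col.isidentifier()
    }
    # Columns with spaces or other special characters are reachable as df["..."].
    namespace["df"] = df
    result = eval(expr, {"np": np, "pd": pd}, namespace)
    # Always return a Series aligned to the input index, even for scalars.
    if isinstance(result, pd.Series):
        return result
    return pd.Series(result, index=df.index)


frame = pd.DataFrame({"income": [10.0, 20.0], "hh size": [1, 4]})
per_person = sketch_eval(frame, "income / df['hh size']")
constant = sketch_eval(frame, "1")
```

Wrapping non-Series results in a pd.Series mirrors the "wrap raw fast_eval in pd.Series" commit in spirit, though how the PR does this is not shown here.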

Miscellaneous Changes:

activitysim/cli/create.py: Replaced pkg_resources with importlib.resources for resource handling to modernize the codebase. [1] [2] [3]
activitysim/core/los.py: Improved handling of data types to prevent overflow in get_mazpairs method.
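The overflow risk mentioned for los.py can be illustrated with plain NumPy. This sketch assumes a combined origin-destination key of the form o * n_zones + d, which is a common construction but is not confirmed as what get_mazpairs does; the values are invented.

```python
import numpy as np

o = np.array([300], dtype=np.int16)   # origin zone ids
d = np.array([7], dtype=np.int16)     # destination zone ids
n_zones = np.int16(400)

# Arithmetic stays in int16, and 300 * 400 + 7 does not fit, so it wraps.
narrow = o * n_zones + d

# Widening before the arithmetic gives the intended key.
wide = o.astype(np.int64) * int(n_zones) + d
```

Now that small dtypes are preserved more often under pandas 2.x, this kind of explicit widening matters in places where the old automatic promotion to int64 used to hide the problem.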

# Conflicts: # conda-environments/activitysim-dev.yml # conda-environments/github-actions-tests.yml

# Conflicts: # .github/workflows/core_tests.yml # activitysim/abm/models/trip_departure_choice.py # activitysim/abm/models/vehicle_allocation.py # activitysim/examples/prototype_mtc_extended/test/prototype_mtc_extended_reference_pipeline.zip # conda-environments/activitysim-dev.yml # conda-environments/docbuild.yml # conda-environments/github-actions-tests.yml # pyproject.toml

# Conflicts: # conda-environments/docbuild.yml

jpn-- · 2025-03-19T14:30:05Z

The changes I have made in this new branch have greatly improved runtime performance while using pandas 2.x.

non-sharrow test timings for pandas 1.x:

58.60s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

first attempt non-sharrow test timings for pandas 2.x (#838):

148.50s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp

revised non-sharrow test timings for pandas 2.x (this PR, #932):

65.06s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
58.10s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
58.10s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
57.38s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

We can see that there is still a modest runtime cost to using pandas 2.x, on the order of 10% slower, but nowhere near the cost of the first attempt, which was roughly 140 to 180% slower (about 2.4x to 2.8x the pandas 1.x runtime). Achieving no runtime penalty appears to be possible, but it would require accessing non-public pandas functions which might break in the future, see here.

Note all of these runtime issues are exclusively non-sharrow, as sharrow evaluation completely bypasses the pandas.eval function that is the source of our problem.
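The slowdown figures can be recomputed directly from the test_mtc timings listed above:

```python
baseline = 53.71        # pandas 1.x, test_mtc
first_attempt = 148.50  # pandas 2.x, first attempt (#838), test_mtc
this_pr = 58.10         # pandas 2.x, this PR (#932), test_mtc


def pct_slower(new: float, old: float) -> float:
    """Percentage increase in runtime of `new` relative to `old`."""
    return (new / old - 1.0) * 100.0


first_attempt_pct = pct_slower(first_attempt, baseline)
this_pr_pct = pct_slower(this_pr, baseline)
```

For this single test the first attempt comes out at about 176% slower and this PR at about 8% slower.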

Copilot

Pull Request Overview

This PR updates the codebase for compatibility with pandas v2, addressing changes in DataFrame indexing, evaluation methods, and error handling while improving performance with a new fast_eval function.

Replaces several instances of DataFrame.eval with a custom fast_eval function to enhance performance and handle special characters in column names.
Introduces sorting and index reset adjustments across multiple functions to ensure data consistency.
Updates resource handling to use importlib.resources, and improves error logging in various modules.

Reviewed Changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 1 comment.

File	Description
activitysim/examples/placeholder_sandag/test/test_sandag.py	Added new test for local compute with updated configs.
activitysim/estimation/larch/simple_simulate.py	Replaced DataFrame.eval with fast_eval for evaluation.
activitysim/estimation/larch/scheduling.py	Switched to fast_eval to optimize evaluation.
activitysim/core/workflow/state.py	Enhanced error handling when creating pa.Table from DataFrames.
activitysim/core/util.py	Adjusted index type checking to align with pandas v2.
activitysim/core/test/_tools.py	Improved error reporting with exception details.
activitysim/core/simulate.py	Updated DataFrame evaluation to use fast_eval.
activitysim/core/los.py	Modified type conversions to prevent numeric overflow.
activitysim/core/interaction_simulate.py	Replaced df.eval with fast_eval for consistency and performance.
activitysim/core/fast_eval.py	Introduced fast_eval function to optimize DataFrame evaluations.
activitysim/core/assign.py	Updated CSV reading with explicit na_values for pandas v2 behavior.
activitysim/cli/create.py	Modernized resource handling using importlib.resources.
activitysim/abm/models/vehicle_allocation.py	Enforced correct dtype conversion for vehicle choices.
activitysim/abm/models/util/school_escort_tours_trips.py	Added reset_index(drop=True) to ensure consistent indexing.
activitysim/abm/models/trip_departure_choice.py	Updated monotonic index check to is_monotonic_increasing.
activitysim/abm/models/school_escorting.py	Reset index on escort_bundles to maintain data integrity.
activitysim/abm/models/input_checker.py	Enhanced error logging with exception details in validators.
activitysim/abm/models/disaggregate_accessibility.py	Added sorting after joins to ensure template consistency.
.github/workflows/core_tests.yml	Updated CI branch references to reflect pandas v2 changes.

Files not reviewed (1)

activitysim/examples/prototype_mtc_extended/configs/trip_mode_choice_annotate_trips_preprocessor.csv: Language not supported

Comments suppressed due to low confidence (1)

activitysim/cli/create.py:183

Using a context-managed path from importlib.resources.as_file may lead to unexpected behavior when used with glob.glob. Ensure that the returned path is valid for directory globbing in all environments.

for asset_path in glob.glob(str(pth)):
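A minimal sketch of the safe pattern: do the globbing while the as_file context is still open, since a path extracted to a temporary location is only guaranteed to exist inside it. The helper name list_package_assets is hypothetical, and the standard-library json package stands in for the ActivitySim examples package.

```python
import glob
import os
from importlib.resources import as_file, files


def list_package_assets(package: str, pattern: str) -> list:
    """Return sorted base names of package files matching a glob pattern."""
    with as_file(files(package)) as pth:
        # Glob while the context is open; for packages that are not plain
        # directories on disk, the path may be temporary.
        matches = glob.glob(os.path.join(str(pth), pattern))
        return sorted(os.path.basename(m) for m in matches)


names = list_package_assets("json", "*.py")
```

For a package installed as ordinary files on disk, as_file simply yields the real directory, so the glob behaves as it did with pkg_resources.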

JoeJimFlood · 2025-04-08T16:53:22Z

Have any runtime comparisons been done with the full sandag-abm3-example or anything larger than the 25-zone prototype_mtc example? I'm concerned about nonlinearity in the relationship between the size of a model and the runtime.

i-am-sijia · 2025-04-10T17:44:36Z

Hi @JoeJimFlood, that is a valid concern. As I am reviewing this PR, I can run the runtime test with the full-size SANDAG example.

i-am-sijia · 2025-04-14T20:04:05Z

I ran the full-size sandag-abm3-example with this PR (i.e., pandas 2.x) and the main branch (i.e., pandas 1.4) and would like to share some quick initial results. I do have some other comments, which I will post separately.

For both runs, I used:

Sharrow: False
multiprocess: True
num_processes: 5
explicit_chunk: 0.2 (for select components)

The run time is almost the same for the two runs, see below. Pandas 2.x runs faster for some components but not others, e.g., it's faster in non-mandatory tour scheduling but slower in mandatory tour scheduling, which could be just runtime noise. In total, pandas 2.x took ~5 mins longer, which is probably negligible. The total run time is comparable to the run time I reported during Phase 9A: ActivitySim/sandag-abm3-example#9 (comment). This PR does shorten the run time for pandas 2.x, as promised.

I have not yet checked whether the results of the two runs are the same; I will check that.

jpn-- · 2025-04-14T21:10:40Z

I see long sequences where one version is like ~10% faster, or ~10% slower, in sequential contiguous blocks across fairly disparate component types. This strongly suggests much of the runtime differences are external noise from other subprocesses or other issues (e.g. the server got too hot and throttled the compute for a couple minutes).

JoeJimFlood · 2025-04-15T19:11:52Z

Thanks for running that @i-am-sijia! The 1.5% runtime increase from using pandas v2 rather than v1 is encouraging to see.

i-am-sijia

I'm curious about the implications for dependency pinning and expression rules. With this PR, ActivitySim will use its own fast_eval() and rewritten versions of some internal pandas methods until pandas releases an official version (say pandas 3.0) that fixes the problem at hand. Is the plan for us to stay locked to pandas 2.2 and fast_eval.py until then? Otherwise we are adding overhead to maintain the compatibility of fast_eval() when we use pandas >2.2. In terms of expression rules, I saw the comment related to pd.Series in fast_eval.py and was wondering if we should proactively alert users about that.


i-am-sijia

Thank you for responding to my comments. I'll approve this PR.

jpn-- added 29 commits March 18, 2024 18:11

updates for pandas 2.2

7b850ca

pytables 3.9

002604d

input checker message failbacks

8819b8c

fix veh type categoricals

be5c024

restore original pandas read_csv NaNs

98bc2e4

is_monotonic_increasing

5beffda

fix disagg acc sorting

9b67fec

drop unused indexes

234a420

update pipeline ref

58003ed

temporarily disable sharrow in vehicle alloc

012e92e

fix dtype problem

c6975a4

ensure MAX index does not overflow

2a899e5

sort on join to preserve index ordering from old pandas

a752ea4

local compute test simplifies debugging

543b19a

Merge branch 'main' into depend-pandas-2

8ed8fb9

more robust conversion to pyarrow

50c9f6d

Merge branch 'main' into depend-pandas-2

a393dbd

Merge branch 'main' into depend-pandas-2

c06d737

# Conflicts: # conda-environments/activitysim-dev.yml # conda-environments/github-actions-tests.yml

rewrite df.eval to fast_eval

4f89ef6

change xarray pin

59872fc

fix zarr pin

5191684

update numpy and dask pins

8091bd5

wrap raw fast_eval in pd.Series

cf9fb21

don't skip sharrow in veh alloc

7becbca

rebuild ref pipeline

8be8e0d

Merge commit 'c59dc4cdf66e3f53816b00ca28fdbc2ca4fd0c8a' into pandas-2

804e780

# Conflicts: # conda-environments/docbuild.yml

make fast_eval more robust

b019c4b

revise external targets

501e249

jpn-- changed the title from "Pandas 2" to "Update to use pandas v2.*" Mar 19, 2025

jpn-- requested a review from i-am-sijia March 19, 2025 14:30

jpn-- mentioned this pull request Mar 19, 2025

Update to use pandas 2.x #838

Closed

prefer public API

a0b3c27

jpn-- requested a review from Copilot March 31, 2025 19:51

Copilot AI reviewed Mar 31, 2025

activitysim/core/fast_eval.py (resolved)

i-am-sijia requested changes Apr 21, 2025

activitysim/core/interaction_simulate.py (resolved)

conda-environments/activitysim-dev-base.yml (outdated, resolved)

.github/workflows/core_tests.yml (resolved)

activitysim/core/fast_eval.py (resolved)

jpn-- added 4 commits April 24, 2025 13:02

Merge branch 'main' into pandas-2

560db6b

Update activitysim-dev-base.yml

61a97b1

restore accidentally removed larch

add note about why fast_eval exists and how to undo it

4b4906e

Merge branch 'main' into pandas-2

d780be6

i-am-sijia approved these changes May 20, 2025

jpn-- merged commit 146c7ff into ActivitySim:main May 20, 2025
16 of 17 checks passed
