Update to use pandas v2.* #932

Merged
merged 34 commits into from
May 20, 2025

Conversation

jpn--
Member

@jpn-- jpn-- commented Mar 18, 2025

Addresses #794.

The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:

  • DataFrame Index objects are all one class with different datatypes, instead of being different classes (e.g. there is no more Int64Index class).
  • The read_csv function by default now interprets "None" as a missing value (i.e. NaN) instead of keeping it as the string "None".
  • The groupby operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
  • A simple df.join() also potentially sorts the resulting rows differently unless an explicit sort argument is given.
  • Index objects no longer have is_monotonic; checks need to use is_monotonic_increasing instead.
  • The handling of dtypes appears to have improved in some instances: dtypes that used to be promoted by some operations are now preserved (e.g. variables that are originally int16 used to become int64 after some operations and now they don't).
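A few of the changes listed above can be illustrated with a short sketch. This is not code from the PR; the variable names and the sample CSV are made up for illustration, and only public pandas calls are used.

```python
import io

import pandas as pd

# 1. Index classes: test the dtype rather than the class, because
#    pandas 2.x no longer has a separate Int64Index class.
idx = pd.Index([1, 2, 3])
index_is_integer = pd.api.types.is_integer_dtype(idx.dtype)

# 2. Monotonicity: is_monotonic_increasing is the spelling to use
#    now that is_monotonic is gone.
index_is_sorted = idx.is_monotonic_increasing

# 3. read_csv: to keep the literal string "None", turn off the default
#    NA markers and list only the markers that should count as missing.
csv_text = "name,value\nNone,1\n,2\n"  # illustrative data, not from ActivitySim
df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,  # do not apply the built-in NA strings
    na_values=[""],         # treat only empty fields as missing
)
```

With these arguments the first row keeps "None" as text while the empty field in the second row is still read as missing.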

This pull request includes several changes across multiple files to address these pandas changes. The most important changes include modifications to sorting operations, error handling in logging, and the introduction of a new fast_eval function to optimize DataFrame evaluations, because the regular pandas.eval shows significant performance degradation under pandas 2.x.

Data Handling Improvements:

Error Handling Enhancements:

Evaluation Process Optimization:

  • Introduced fast_eval function in activitysim/core/fast_eval.py to optimize DataFrame evaluations by handling special characters in column names and improving performance.
  • Updated references to df.eval in activitysim/core/interaction_simulate.py and activitysim/core/simulate.py to use fast_eval for better performance and consistency. [1] [2] [3] [4] [5]
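The general idea of evaluating an expression against DataFrame columns without going through pandas.eval can be sketched as follows. This is only an illustration and is not the fast_eval implementation in activitysim/core/fast_eval.py: the helper name sketch_eval is hypothetical, it uses Python's built-in eval, and it simply skips columns whose names are not valid identifiers instead of handling special characters.

```python
import pandas as pd


def sketch_eval(df: pd.DataFrame, expr: str):
    """Hypothetical helper: evaluate `expr` with each column bound to its name."""
    # Only columns whose names are valid Python identifiers can be
    # referenced directly in the expression text.
    namespace = {
        str(col): df[col] for col in df.columns if str(col).isidentifier()
    }
    # An empty __builtins__ keeps built-in functions out of the expression.
    # This is not a security sandbox; it only narrows the namespace.
    return eval(expr, {"__builtins__": {}}, namespace)


df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
result = sketch_eval(df, "a + b * 2")
```

The arithmetic is carried out by ordinary pandas Series operators, so the result is a Series aligned on the DataFrame's index.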

Miscellaneous Changes:

jpn-- added 29 commits March 18, 2024 18:11
# Conflicts:
#	conda-environments/activitysim-dev.yml
#	conda-environments/github-actions-tests.yml
# Conflicts:
#	.github/workflows/core_tests.yml
#	activitysim/abm/models/trip_departure_choice.py
#	activitysim/abm/models/vehicle_allocation.py
#	activitysim/examples/prototype_mtc_extended/test/prototype_mtc_extended_reference_pipeline.zip
#	conda-environments/activitysim-dev.yml
#	conda-environments/docbuild.yml
#	conda-environments/github-actions-tests.yml
#	pyproject.toml
# Conflicts:
#	conda-environments/docbuild.yml
@jpn-- jpn-- changed the title from "Pandas 2" to "Update to use pandas v2.*" Mar 19, 2025
@jpn--
Member Author

jpn-- commented Mar 19, 2025

The changes I have made in this new branch have greatly improved runtime performance while using pandas 2.x.

non-sharrow test timings for pandas 1.x:

58.60s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
53.71s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
53.66s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
53.23s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

first attempt non-sharrow test timings for pandas 2.x (#838):

148.50s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
148.14s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
147.83s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode
140.09s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp

revised non-sharrow test timings for pandas 2.x (this PR, #932):

65.06s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_mp
58.10s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_chunkless
58.10s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc
57.38s call     activitysim/examples/prototype_mtc/test/test_mtc.py::test_mtc_recode

We can see that there is still a modest runtime cost to using pandas 2.x, on the order of 10% slower, but nowhere near the cost of the first attempt, which was roughly 140% to 180% slower (about 2.4 to 2.8 times the pandas 1.x runtime). Achieving no runtime penalty appears to be possible, but it would require accessing non-public pandas functions which might break in the future; see here.
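For readers who want to check the relative costs, the timings quoted above can be turned into slowdown percentages with a few lines of Python. The numbers are copied from the three timing listings; the function and variable names are illustrative only.

```python
# Seconds per test: (pandas 1.x, first pandas 2.x attempt #838, this PR #932)
timings = {
    "test_mtc_mp": (58.60, 140.09, 65.06),
    "test_mtc": (53.71, 148.50, 58.10),
    "test_mtc_chunkless": (53.66, 148.14, 58.10),
    "test_mtc_recode": (53.23, 147.83, 57.38),
}


def relative_slowdown(baseline: float, candidate: float) -> float:
    """Fractional increase in runtime of `candidate` over `baseline`."""
    return candidate / baseline - 1.0


slowdowns = {
    name: (relative_slowdown(base, first), relative_slowdown(base, revised))
    for name, (base, first, revised) in timings.items()
}

for name, (first, revised) in slowdowns.items():
    print(f"{name}: first attempt {first:+.0%}, this PR {revised:+.0%}")
```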

Note all of these runtime issues are exclusively non-sharrow, as sharrow evaluation completely bypasses the pandas.eval function that is the source of our problem.

@jpn-- jpn-- requested a review from i-am-sijia March 19, 2025 14:30
@jpn-- jpn-- mentioned this pull request Mar 19, 2025
@jpn-- jpn-- requested a review from Copilot March 31, 2025 19:51
@Copilot Copilot AI left a comment

Pull Request Overview

This PR updates the codebase for compatibility with pandas v2, addressing changes in DataFrame indexing, evaluation methods, and error handling while improving performance with a new fast_eval function.

  • Replaces several instances of DataFrame.eval with a custom fast_eval function to enhance performance and handle special characters in column names.
  • Introduces sorting and index reset adjustments across multiple functions to ensure data consistency.
  • Updates resource handling to use importlib.resources, and improves error logging in various modules.
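The sorting and index reset adjustments mentioned above follow a common pattern, sketched here with made-up data rather than ActivitySim tables: sort explicitly after a join so that row order does not depend on the pandas version, and use reset_index(drop=True) when downstream code expects a plain 0..n-1 index.

```python
import pandas as pd

# Illustrative frames; the column names are not taken from ActivitySim.
left = pd.DataFrame({"x": [1, 2, 3]}, index=[3, 1, 2])
right = pd.DataFrame({"y": [10, 20, 30]}, index=[1, 2, 3])

# Sort explicitly after the join instead of relying on the default row order.
joined = left.join(right).sort_index()

# Discard the old index when only positional order matters downstream.
renumbered = joined.reset_index(drop=True)
```

Making the order explicit costs one extra call, and the output then stays the same whichever pandas version performs the join.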

Reviewed Changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| activitysim/examples/placeholder_sandag/test/test_sandag.py | Added new test for local compute with updated configs. |
| activitysim/estimation/larch/simple_simulate.py | Replaced DataFrame.eval with fast_eval for evaluation. |
| activitysim/estimation/larch/scheduling.py | Switched to fast_eval to optimize evaluation. |
| activitysim/core/workflow/state.py | Enhanced error handling when creating pa.Table from DataFrames. |
| activitysim/core/util.py | Adjusted index type checking to align with pandas v2. |
| activitysim/core/test/_tools.py | Improved error reporting with exception details. |
| activitysim/core/simulate.py | Updated DataFrame evaluation to use fast_eval. |
| activitysim/core/los.py | Modified type conversions to prevent numeric overflow. |
| activitysim/core/interaction_simulate.py | Replaced df.eval with fast_eval for consistency and performance. |
| activitysim/core/fast_eval.py | Introduced fast_eval function to optimize DataFrame evaluations. |
| activitysim/core/assign.py | Updated CSV reading with explicit na_values for pandas v2 behavior. |
| activitysim/cli/create.py | Modernized resource handling using importlib.resources. |
| activitysim/abm/models/vehicle_allocation.py | Enforced correct dtype conversion for vehicle choices. |
| activitysim/abm/models/util/school_escort_tours_trips.py | Added reset_index(drop=True) to ensure consistent indexing. |
| activitysim/abm/models/trip_departure_choice.py | Updated monotonic index check to is_monotonic_increasing. |
| activitysim/abm/models/school_escorting.py | Reset index on escort_bundles to maintain data integrity. |
| activitysim/abm/models/input_checker.py | Enhanced error logging with exception details in validators. |
| activitysim/abm/models/disaggregate_accessibility.py | Added sorting after joins to ensure template consistency. |
| .github/workflows/core_tests.yml | Updated CI branch references to reflect pandas v2 changes. |
Files not reviewed (1)
  • activitysim/examples/prototype_mtc_extended/configs/trip_mode_choice_annotate_trips_preprocessor.csv: Language not supported
Comments suppressed due to low confidence (1)

activitysim/cli/create.py:183

  • Using a context-managed path from importlib.resources.as_file may lead to unexpected behavior when used with glob.glob. Ensure that the returned path is valid for directory globbing in all environments.
for asset_path in glob.glob(str(pth)):
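The concern in this suppressed comment is that a path obtained from importlib.resources.as_file is only guaranteed to exist while its context manager is active. A minimal sketch of globbing inside that context follows. It uses the standard-library json package as a stand-in resource location, which is an assumption made purely so the example is self-contained; it is not the package or pattern used in activitysim/cli/create.py.

```python
import glob
import importlib.resources
import os

# Stand-in package for illustration; any importable package would work.
package_root = importlib.resources.files("json")

# Do the globbing while the context manager is open: for resources that are
# not plain files on disk, the path may be temporary and removed on exit.
with importlib.resources.as_file(package_root) as pth:
    matches = sorted(glob.glob(os.path.join(str(pth), "*.py")))

module_names = [os.path.basename(match) for match in matches]
```

Collecting the results into a list before leaving the with block means later code never touches a path that may already have been cleaned up.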

@JoeJimFlood
Contributor

Have any runtime comparisons been done with the full sandag-abm3-example or anything larger than the 25-zone prototype_mtc example? I'm concerned about nonlinearity in the relationship between the size of a model and the runtime.

@i-am-sijia
Contributor

> Have any runtime comparisons been done with the full sandag-abm3-example or anything larger than the 25-zone prototype_mtc example? I'm concerned about nonlinearity in the relationship between the size of a model and the runtime.

Hi @JoeJimFlood, that is a valid concern. As I am reviewing this PR, I can perform the run time test with the full size example SANDAG.

@i-am-sijia
Contributor

I ran the full-size sandag-abm3-example with this PR (i.e., pandas 2.x) and the main branch (i.e., pandas 1.4) and would like to share some quick initial reports. I do have some other comments which I will post separately.

For both runs, I used:

  • Sharrow: False
  • multiprocess: True
  • num_processes: 5
  • explicit_chunk: 0.2 (for select components)

The run time is almost the same for the two runs, see below. Pandas 2.x runs faster for some components but not the others, e.g., it's faster in non-mandatory tour scheduling but slower in mandatory tour scheduling, which could be just runtime noise. In total, pandas 2.x took ~5 mins longer which is probably negligible. The total run time is comparable to the run time I reported during Phase 9A: ActivitySim/sandag-abm3-example#9 (comment). This PR does shorten the run time for pandas 2.0 as it promised.

I have not yet checked whether the results of the two runs are the same; I will check that.

image

@jpn--
Member Author

jpn-- commented Apr 14, 2025

I see long sequences where one version is like ~10% faster, or ~10% slower, in sequential contiguous blocks across fairly disparate component types. This strongly suggests much of the runtime differences are external noise from other subprocesses or other issues (e.g. the server got too hot and throttled the compute for a couple minutes).

@JoeJimFlood
Contributor

Thanks for running that @i-am-sijia! The 1.5% increase in using Pandas v2 vs v1 is encouraging to see.

Contributor

@i-am-sijia i-am-sijia left a comment

I'm curious about the implications for dependency locking and expression rules. With this PR, ActivitySim will use its own fast_eval() and rewritten versions of some internal pandas methods until pandas releases an official version (say pandas 3.0) that fixes our problem at hand. Is the plan for us to stay locked to pandas 2.2 and fast_eval.py until then? Otherwise we are adding overhead to maintain the compatibility of fast_eval() when we use pandas >2.2. In terms of expression rules, I saw the comment related to pd.Series in fast_eval.py and was wondering if we should proactively alert users about that.

Contributor

@i-am-sijia i-am-sijia left a comment

Thank you for responding to my comments. I'll approve this PR.

@jpn-- jpn-- merged commit 146c7ff into ActivitySim:main May 20, 2025
16 of 17 checks passed

3 participants