Update to use pandas v2.* #932
# Conflicts: # conda-environments/activitysim-dev.yml # conda-environments/github-actions-tests.yml
# Conflicts: # .github/workflows/core_tests.yml # activitysim/abm/models/trip_departure_choice.py # activitysim/abm/models/vehicle_allocation.py # activitysim/examples/prototype_mtc_extended/test/prototype_mtc_extended_reference_pipeline.zip # conda-environments/activitysim-dev.yml # conda-environments/docbuild.yml # conda-environments/github-actions-tests.yml # pyproject.toml
# Conflicts: # conda-environments/docbuild.yml
The changes I have made in this new branch have greatly improved runtime performance while using pandas 2.x.
Non-sharrow test timings for pandas 1.x:
First attempt non-sharrow test timings for pandas 2.x (#838):
Revised non-sharrow test timings for pandas 2.x (this PR, #932):
We can see that there is still a modest runtime cost to using pandas 2.x, on the order of 10% slower, but nowhere near the cost of the first attempt, which was ~200% slower. Achieving no runtime penalty appears to be possible, but it would require accessing non-public pandas functions which might break in the future, see here. Note that all of these runtime issues are exclusively non-sharrow, as sharrow evaluation completely bypasses the
Pull Request Overview
This PR updates the codebase for compatibility with pandas v2, addressing changes in DataFrame indexing, evaluation methods, and error handling while improving performance with a new fast_eval function.
- Replaces several instances of DataFrame.eval with a custom fast_eval function to enhance performance and handle special characters in column names.
- Introduces sorting and index reset adjustments across multiple functions to ensure data consistency.
- Updates resource handling to use importlib.resources, and improves error logging in various modules.
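As a rough illustration of the special-character handling mentioned above, the sketch below shows one possible approach: alias awkward column names to identifier-safe ones before delegating to DataFrame.eval. The helper name eval_with_safe_names and the sample frame are hypothetical; this is not the fast_eval implementation from this PR, and it assumes the sanitized names do not collide or start with a digit.

```python
import re

import pandas as pd


def eval_with_safe_names(df: pd.DataFrame, expr: str) -> pd.Series:
    """Evaluate `expr` against `df` after sanitizing column names.

    Hypothetical helper for illustration only; it is NOT the fast_eval
    function added by this PR, and it ignores name collisions.
    """
    # Map each column to an identifier-safe alias, e.g. "trip count" -> "trip_count".
    aliases = {col: re.sub(r"\W", "_", str(col)) for col in df.columns}
    # Rewrite backtick-quoted column references in the expression to the aliases.
    for col, alias in aliases.items():
        expr = expr.replace(f"`{col}`", alias)
    # Evaluate against a renamed copy so the caller's frame is left untouched.
    return df.rename(columns=aliases).eval(expr)


trips = pd.DataFrame({"trip count": [1, 2, 3], "dist (mi)": [1.5, 2.0, 4.0]})
total = eval_with_safe_names(trips, "`trip count` * `dist (mi)`")
```

Renaming a copy keeps the original column labels intact for downstream code, at the cost of an extra shallow copy per call.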
Reviewed Changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
activitysim/examples/placeholder_sandag/test/test_sandag.py | Added new test for local compute with updated configs. |
activitysim/estimation/larch/simple_simulate.py | Replaced DataFrame.eval with fast_eval for evaluation. |
activitysim/estimation/larch/scheduling.py | Switched to fast_eval to optimize evaluation. |
activitysim/core/workflow/state.py | Enhanced error handling when creating pa.Table from DataFrames. |
activitysim/core/util.py | Adjusted index type checking to align with pandas v2. |
activitysim/core/test/_tools.py | Improved error reporting with exception details. |
activitysim/core/simulate.py | Updated DataFrame evaluation to use fast_eval. |
activitysim/core/los.py | Modified type conversions to prevent numeric overflow. |
activitysim/core/interaction_simulate.py | Replaced df.eval with fast_eval for consistency and performance. |
activitysim/core/fast_eval.py | Introduced fast_eval function to optimize DataFrame evaluations. |
activitysim/core/assign.py | Updated CSV reading with explicit na_values for pandas v2 behavior. |
activitysim/cli/create.py | Modernized resource handling using importlib.resources. |
activitysim/abm/models/vehicle_allocation.py | Enforced correct dtype conversion for vehicle choices. |
activitysim/abm/models/util/school_escort_tours_trips.py | Added reset_index(drop=True) to ensure consistent indexing. |
activitysim/abm/models/trip_departure_choice.py | Updated monotonic index check to is_monotonic_increasing. |
activitysim/abm/models/school_escorting.py | Reset index on escort_bundles to maintain data integrity. |
activitysim/abm/models/input_checker.py | Enhanced error logging with exception details in validators. |
activitysim/abm/models/disaggregate_accessibility.py | Added sorting after joins to ensure template consistency. |
.github/workflows/core_tests.yml | Updated CI branch references to reflect pandas v2 changes. |
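The sorting and reset_index(drop=True) adjustments listed in the table follow a common pattern, sketched below with made-up persons and households tables (the column names and values are illustrative assumptions, not taken from the ActivitySim code).

```python
import pandas as pd

persons = pd.DataFrame(
    {"household_id": [20, 10, 20]},
    index=pd.Index([3, 1, 2], name="person_id"),
)
households = pd.DataFrame(
    {"home_zone": [5, 7]},
    index=pd.Index([10, 20], name="household_id"),
)

# Join on the household key, then sort explicitly so the row order does not
# depend on how a given pandas version orders join results.
merged = persons.join(households, on="household_id").sort_index()

# When a filtered frame feeds positional logic, drop the stale index labels.
subset = merged[merged["home_zone"] == 7].reset_index(drop=True)
```

Sorting right after the join makes the output order a property of the code rather than of the installed pandas version.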
Files not reviewed (1)
- activitysim/examples/prototype_mtc_extended/configs/trip_mode_choice_annotate_trips_preprocessor.csv: Language not supported
Comments suppressed due to low confidence (1)
activitysim/cli/create.py:183
- Using a context-managed path from importlib.resources.as_file may lead to unexpected behavior when used with glob.glob. Ensure that the returned path is valid for directory globbing in all environments.
for asset_path in glob.glob(str(pth)):
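One way to address the concern above is to do the globbing while the as_file context is still open. The sketch below assumes Python 3.9+ and uses a throwaway package (the name demo_assets_pkg and the helper list_assets are hypothetical) rather than the actual create.py code.

```python
import glob
import importlib
import os
import sys
import tempfile
from importlib.resources import as_file, files


def list_assets(package: str, pattern: str) -> list:
    """Return sorted base names of files in `package` matching `pattern`."""
    with as_file(files(package)) as pkg_dir:
        # Glob while the context is open: for packages that are not plain
        # directories on disk, the extracted path may vanish afterwards.
        matches = glob.glob(os.path.join(str(pkg_dir), pattern))
        return sorted(os.path.basename(m) for m in matches)


# Demo with a temporary package (hypothetical name `demo_assets_pkg`).
with tempfile.TemporaryDirectory() as tmp:
    pkg = os.path.join(tmp, "demo_assets_pkg")
    os.makedirs(pkg)
    for name in ("__init__.py", "a.yaml", "b.yaml"):
        open(os.path.join(pkg, name), "w").close()
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        found = list_assets("demo_assets_pkg", "*.yaml")
    finally:
        sys.path.remove(tmp)
```

Returning plain file names from inside the context avoids handing out paths that may no longer exist once the context exits.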
Have any runtime comparisons been done with the full sandag-abm3-example or anything larger than the 25-zone prototype_mtc example? I'm concerned about nonlinearity in the relationship between the size of a model and the runtime.
Hi @JoeJimFlood, that is a valid concern. As I am reviewing this PR, I can perform the runtime test with the full-size SANDAG example.
I ran the full-size sandag-abm3-example with this PR (i.e., pandas 2.x) and the main branch (i.e., pandas 1.4) and would like to share some quick initial results. I do have some other comments, which I will post separately. For both runs, I used:
The run time is almost the same for the two runs, see below. Pandas 2.x runs faster for some components but not others, e.g., it's faster in non-mandatory tour scheduling but slower in mandatory tour scheduling, which could be just runtime noise. In total, pandas 2.x took ~5 mins longer, which is probably negligible. The total run time is comparable to the run time I reported during Phase 9A: ActivitySim/sandag-abm3-example#9 (comment). This PR does shorten the run time for pandas 2.0, as promised. I have not yet checked whether the results of the two runs are the same; I will check that.
I see long sequences where one version is ~10% faster, or ~10% slower, in sequential contiguous blocks across fairly disparate component types. This strongly suggests much of the runtime difference is external noise from other subprocesses or other issues (e.g. the server got too hot and throttled the compute for a couple of minutes).
Thanks for running that @i-am-sijia! The 1.5% increase in using Pandas v2 vs v1 is encouraging to see.
I'm curious about the implications for dependency locking and expression rules. With this PR, ActivitySim will use its own fast_eval() and some rewritten versions of internal pandas methods until pandas releases an official version (say pandas 3.0) that fixes our problem at hand. Is the plan for us to stay locked to pandas 2.2 and fast_eval.py until then? Otherwise we are adding overhead to maintain the compatibility of fast_eval() when we use pandas >2.2. In terms of expression rules, I saw the comment related to pd.Series in fast_eval.py and was wondering if we should proactively alert users about that.
Thank you for responding to my comments. I'll approve this PR.
Addresses #794.
The update from pandas 1.x to 2.x introduces a number of small but material changes that affect ActivitySim:
- Index objects are all one class with different datatypes, instead of being different classes (e.g. there is no more Int64Index class).
- The read_csv function by default now interprets "None" as a missing value (i.e. NaN) instead of being the Python object None.
- The groupby operation, when applied to categorical data, now sorts the categories in the result unless told not to (resulting in different order of rows in outputs for some operations).
- df.join() also potentially sorts the resulting rows differently unless an explicit sort argument is given.
- Index objects no longer can be checked as is_monotonic but instead need is_monotonic_increasing.
This pull request includes several changes across multiple files to address these pandas changes. The most important changes include modifications to sorting operations, error handling in logging, and the introduction of a new fast_eval function to optimize DataFrame evaluations, because the regular pandas.eval has some significant performance degradations.
Data Handling Improvements:
- activitysim/abm/models/disaggregate_accessibility.py: Added sorting operations to ensure data consistency in expand_template_zones, create_proto_pop, and merge_persons methods.
- activitysim/abm/models/school_escorting.py: Reset index for escort_bundles to maintain data integrity.
- activitysim/abm/models/util/school_escort_tours_trips.py: Reset index for create_chauf_escort_trips and create_escortee_trips methods to ensure proper data handling.
Error Handling Enhancements:
- activitysim/abm/models/input_checker.py: Added try-except blocks to improve error logging for dataframe and element-wise validators.
- activitysim/core/test/_tools.py: Enhanced error reporting in progressive_checkpoint_test by including exception details.
Evaluation Process Optimization:
- Introduced fast_eval function in activitysim/core/fast_eval.py to optimize DataFrame evaluations by handling special characters in column names and improving performance.
- Updated df.eval calls in activitysim/core/interaction_simulate.py and activitysim/core/simulate.py to use fast_eval for better performance and consistency.
Miscellaneous Changes:
- activitysim/cli/create.py: Replaced pkg_resources with importlib.resources for resource handling to modernize the codebase.
- activitysim/core/los.py: Improved handling of data types to prevent overflow in get_mazpairs method.
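A few of the pandas 2.x adjustments described in this PR can be sketched in a version-neutral way, as below. The sample data and the NA marker list are illustrative assumptions and may differ from what activitysim/core/assign.py and activitysim/core/util.py actually use.

```python
import io

import pandas as pd

idx = pd.Index([1, 2, 3])

# Check the dtype rather than the Index subclass (Int64Index is gone in 2.x).
is_int_index = pd.api.types.is_integer_dtype(idx.dtype)

# is_monotonic_increasing is available in both pandas 1.x and 2.x.
is_sorted = idx.is_monotonic_increasing

# Keep the literal text "None" from being parsed as missing by listing the
# NA markers explicitly; the marker list here is an illustrative assumption.
csv_text = "target,expression\nfoo,None\nbar,\n"
spec = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)
```

Spelling out na_values makes the parsing rule explicit in the code instead of inheriting whatever default list the installed pandas version ships with.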