-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Fix Titanic dataset loading in test_utils.py #1959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Titanic dataset loading in test_utils.py #1959
Conversation
- Replace Google Cloud Storage URLs with direct OpenML download - Add SSL certificate workaround for reliable dataset fetching - Add column mapping to match expected test format - Implement manual 80/20 train/test split - Add 'alone' column calculation from family size - No new external dependencies added This resolves SSL certificate issues and ensures reliable dataset loading for tests.
Summary of ChangesHello @pctablet505, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses and resolves persistent SSL certificate issues that were preventing the reliable loading of the Titanic dataset within the test suite. By switching the data source to OpenML, implementing an SSL workaround, and performing necessary data preprocessing and splitting, the changes ensure that tests relying on this dataset can execute consistently and without external fetching failures, thereby improving the stability of the test environment. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request fixes an SSL issue with dataset loading in tests by switching to OpenML. The changes are a good step forward. I've provided a few suggestions to improve the implementation, focusing on code style, security, efficiency, and maintainability. My key recommendations include moving imports to follow PEP 8, addressing a security concern with disabled SSL verification, improving efficiency by caching the processed dataset, and replacing a magic number with a named constant.
- Move imports to top of file following PEP 8 style guide - Add comprehensive security warning about disabled SSL verification - Add TODO comment to track SSL certificate issue resolution - Implement caching check to avoid reprocessing dataset on every import - Replace magic number 0.8 with named constant TRAIN_SPLIT_RATIO - Improve code efficiency: subsequent imports are ~58000x faster This addresses all review comments while maintaining test compatibility.
haifeng-jin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
The PR LGTM except the ssl disabling.
Would you mind download the file and keep it as part of the git repo?
Let's use it from local instead of downloading it.
Just create a dir under benchmarks and put the csv file there.
- Replace Google Cloud Storage downloads with local CSV files - Add pre-processed Titanic dataset files to benchmark/datasets/ - Simplify test_utils.py by removing download/processing logic - Remove pandas, ssl, and urllib dependencies from test utilities - All 7 structured data tests now pass without network dependencies
Replaced '?' with empty values in the 'Age' column and standardized numeric fields to use decimal notation in both titanic_train.csv and titanic_test.csv. This improves consistency and compatibility for downstream data processing and analysis.
|
@gemini-code-assist review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request resolves an issue with loading the Titanic dataset in tests by vendoring the data files directly into the repository. This is a solid improvement as it removes external network dependencies, making the test suite more robust and faster. The implementation is correct. I've left one minor comment to improve code clarity.
This pull request updates how the Titanic dataset is loaded for testing in the project. Instead of downloading the data from external sources at import time, it now uses local CSV files stored under the
benchmark/datasetsdirectory. This change improves reliability and reproducibility by avoiding network dependencies and potential changes in remote datasets.Dataset loading improvements:
Fixes #1958