Skip to content

Conversation

@pctablet505
Copy link
Contributor

@pctablet505 pctablet505 commented Nov 19, 2025

This pull request updates how the Titanic dataset is loaded for testing in the project. Instead of downloading the data from external sources at import time, it now uses local CSV files stored under the benchmark/datasets directory. This change improves reliability and reproducibility by avoiding network dependencies and potential changes in remote datasets.

Dataset loading improvements:

Fixes #1958

- Replace Google Cloud Storage URLs with direct OpenML download
- Add SSL certificate workaround for reliable dataset fetching
- Add column mapping to match expected test format
- Implement manual 80/20 train/test split
- Add 'alone' column calculation from family size
- No new external dependencies added

This resolves SSL certificate issues and ensures reliable dataset loading for tests.
@gemini-code-assist
Copy link

Summary of Changes

Hello @pctablet505, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves persistent SSL certificate issues that were preventing the reliable loading of the Titanic dataset within the test suite. By switching the data source to OpenML, implementing an SSL workaround, and performing necessary data preprocessing and splitting, the changes ensure that tests relying on this dataset can execute consistently and without external fetching failures, thereby improving the stability of the test environment.

Highlights

  • Dataset Source Change: The Titanic dataset download source has been switched from Google Cloud Storage (GCS) URLs to a direct OpenML download link to resolve previous fetching issues.
  • SSL Certificate Workaround: An SSL certificate workaround using urllib.request and ssl._create_unverified_context() has been implemented to ensure reliable dataset fetching, bypassing potential SSL verification problems.
  • Data Preprocessing and Column Mapping: The downloaded dataset now undergoes preprocessing, including renaming columns (e.g., 'pclass' to 'class', 'sibsp' to 'n_siblings_spouses') and calculating a new 'alone' column, to match the expected format for tests.
  • Manual Train/Test Split: A manual 80/20 train/test split is performed on the preprocessed dataset, and the resulting splits are saved as separate CSV files for consistent test execution.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes an SSL issue with dataset loading in tests by switching to OpenML. The changes are a good step forward. I've provided a few suggestions to improve the implementation, focusing on code style, security, efficiency, and maintainability. My key recommendations include moving imports to follow PEP 8, addressing a security concern with disabled SSL verification, improving efficiency by caching the processed dataset, and replacing a magic number with a named constant.

- Move imports to top of file following PEP 8 style guide
- Add comprehensive security warning about disabled SSL verification
- Add TODO comment to track SSL certificate issue resolution
- Implement caching check to avoid reprocessing dataset on every import
- Replace magic number 0.8 with named constant TRAIN_SPLIT_RATIO
- Improve code efficiency: subsequent imports are ~58000x faster

This addresses all review comments while maintaining test compatibility.
Copy link
Collaborator

@haifeng-jin haifeng-jin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
The PR LGTM except the ssl disabling.
Would you mind download the file and keep it as part of the git repo?
Let's use it from local instead of downloading it.
Just create a dir under benchmarks and put the csv file there.

- Replace Google Cloud Storage downloads with local CSV files
- Add pre-processed Titanic dataset files to benchmark/datasets/
- Simplify test_utils.py by removing download/processing logic
- Remove pandas, ssl, and urllib dependencies from test utilities
- All 7 structured data tests now pass without network dependencies
Replaced '?' with empty values in the 'Age' column and standardized numeric fields to use decimal notation in both titanic_train.csv and titanic_test.csv. This improves consistency and compatibility for downstream data processing and analysis.
@pctablet505
Copy link
Contributor Author

@gemini-code-assist review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request resolves an issue with loading the Titanic dataset in tests by vendoring the data files directly into the repository. This is a solid improvement as it removes external network dependencies, making the test suite more robust and faster. The implementation is correct. I've left one minor comment to improve code clarity.

@hertschuh hertschuh merged commit a2446cf into keras-team:master Nov 25, 2025
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test failing due to dataset not existing at defined url

3 participants