Fix Titanic dataset loading in test_utils.py #1959

pctablet505 · 2025-11-19T05:30:56Z

This pull request updates how the Titanic dataset is loaded for testing in the project. Instead of downloading the data from external sources at import time, it now uses local CSV files stored under the benchmark/datasets directory. This change improves reliability and reproducibility by avoiding network dependencies and potential changes in remote datasets.

Dataset loading improvements:

Fixes #1958

- Replace Google Cloud Storage URLs with direct OpenML download
- Add SSL certificate workaround for reliable dataset fetching
- Add column mapping to match expected test format
- Implement manual 80/20 train/test split
- Add 'alone' column calculation from family size
- No new external dependencies added

This resolves SSL certificate issues and ensures reliable dataset loading for tests.

gemini-code-assist · 2025-11-19T05:31:08Z

Summary of Changes

Hello @pctablet505, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves persistent SSL certificate issues that were preventing the reliable loading of the Titanic dataset within the test suite. By switching the data source to OpenML, implementing an SSL workaround, and performing necessary data preprocessing and splitting, the changes ensure that tests relying on this dataset can execute consistently and without external fetching failures, thereby improving the stability of the test environment.

Highlights

- Dataset Source Change: The Titanic dataset download source has been switched from Google Cloud Storage (GCS) URLs to a direct OpenML download link to resolve previous fetching issues.
- SSL Certificate Workaround: An SSL certificate workaround using urllib.request and ssl._create_unverified_context() has been implemented to ensure reliable dataset fetching, bypassing potential SSL verification problems.
- Data Preprocessing and Column Mapping: The downloaded dataset now undergoes preprocessing, including renaming columns (e.g., 'pclass' to 'class', 'sibsp' to 'n_siblings_spouses') and calculating a new 'alone' column, to match the expected format for tests.
- Manual Train/Test Split: A manual 80/20 train/test split is performed on the preprocessed dataset, and the resulting splits are saved as separate CSV files for consistent test execution.
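
The preprocessing and split described in the highlights can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the code from the PR: the helper names `preprocess` and `split_rows`, the `parch` column, the 'y'/'n' encoding of `alone`, and the unshuffled split are all assumptions. Only the two renames and the 80/20 ratio come from the summary above.

```python
# Renames named in the summary; any further mappings are not shown there.
COLUMN_RENAMES = {"pclass": "class", "sibsp": "n_siblings_spouses"}

TRAIN_SPLIT_RATIO = 0.8  # 80/20 train/test split


def preprocess(rows):
    """Rename columns and derive 'alone' from family size (assumed: sibsp + parch)."""
    processed = []
    for row in rows:
        new_row = {COLUMN_RENAMES.get(key, key): value for key, value in row.items()}
        family_size = int(new_row["n_siblings_spouses"]) + int(new_row["parch"])
        # Assumed encoding: 'y' when the passenger has no family aboard.
        new_row["alone"] = "y" if family_size == 0 else "n"
        processed.append(new_row)
    return processed


def split_rows(rows, ratio=TRAIN_SPLIT_RATIO):
    """Cut the rows in order; whether the PR shuffled first is not stated."""
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```

With ten input rows, `split_rows(preprocess(rows))` would return eight training rows and two test rows.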

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request fixes an SSL issue with dataset loading in tests by switching to OpenML. The changes are a good step forward. I've provided a few suggestions to improve the implementation, focusing on code style, security, efficiency, and maintainability. My key recommendations include moving imports to follow PEP 8, addressing a security concern with disabled SSL verification, improving efficiency by caching the processed dataset, and replacing a magic number with a named constant.
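
For context on the security concern, the difference between the two kinds of SSL context in Python's standard library can be inspected without any network access. `ssl._create_unverified_context()` is the private helper named in the summary of this PR. The comparison with `ssl.create_default_context()` and the `describe` helper are added here for illustration and are not taken from the PR.

```python
import ssl

# Context used by the workaround: certificate and hostname checks are off.
unverified = ssl._create_unverified_context()

# Default context: certificates and hostnames are verified.
verified = ssl.create_default_context()


def describe(ctx):
    """Return the two settings that decide whether a server is authenticated."""
    return {
        "verifies_certificate": ctx.verify_mode == ssl.CERT_REQUIRED,
        "checks_hostname": ctx.check_hostname,
    }


print("unverified:", describe(unverified))
print("verified:  ", describe(verified))
```

A download made through the unverified context accepts any certificate, which is why a reviewer might prefer removing the download altogether over documenting the risk.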

autokeras/test_utils.py

- Move imports to top of file following PEP 8 style guide
- Add comprehensive security warning about disabled SSL verification
- Add TODO comment to track SSL certificate issue resolution
- Implement caching check to avoid reprocessing dataset on every import
- Replace magic number 0.8 with named constant TRAIN_SPLIT_RATIO
- Improve code efficiency: subsequent imports are ~58000x faster

This addresses all review comments while maintaining test compatibility.
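
A caching check of the kind this commit describes might look like the sketch below. The function name `ensure_dataset` and the `build_fn` callback are hypothetical and stand in for whatever download and processing logic the commit actually used. Only the constant name `TRAIN_SPLIT_RATIO` and its value come from the commit message.

```python
import os

TRAIN_SPLIT_RATIO = 0.8  # named constant replacing the bare 0.8


def ensure_dataset(train_path, test_path, build_fn):
    """Run build_fn only when the processed files are missing.

    Returns True if the files were built, False if the cached copies were used.
    """
    if os.path.exists(train_path) and os.path.exists(test_path):
        return False
    build_fn(train_path, test_path)
    return True
```

Skipping the rebuild when both files already exist is what would make later imports cheap, since they reduce to two filesystem checks.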

haifeng-jin

Thanks for the PR!
The PR LGTM except for the SSL disabling.
Would you mind downloading the file and keeping it as part of the git repo?
Let's use it from local instead of downloading it.
Just create a dir under benchmarks and put the csv file there.

- Replace Google Cloud Storage downloads with local CSV files
- Add pre-processed Titanic dataset files to benchmark/datasets/
- Simplify test_utils.py by removing download/processing logic
- Remove pandas, ssl, and urllib dependencies from test utilities
- All 7 structured data tests now pass without network dependencies
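
Reading the vendored files needs nothing beyond the standard library. The sketch below assumes the constant names `TRAIN_CSV_PATH` and `TEST_CSV_PATH` and the helper `read_rows`, none of which are confirmed by the PR. The directory and file names are the ones mentioned in the commit messages.

```python
import csv
import os

# Assumed layout, relative to the repository root.
DATASET_DIR = os.path.join("benchmark", "datasets")
TRAIN_CSV_PATH = os.path.join(DATASET_DIR, "titanic_train.csv")
TEST_CSV_PATH = os.path.join(DATASET_DIR, "titanic_test.csv")


def read_rows(path):
    """Load a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```

Because the paths point at files committed to the repository, importing the test utilities no longer touches the network.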

Replaced '?' with empty values in the 'Age' column and standardized numeric fields to use decimal notation in both titanic_train.csv and titanic_test.csv. This improves consistency and compatibility for downstream data processing and analysis.
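
The cleanup described in this commit could be expressed as a small row transform. This is an illustrative sketch only: the helper `clean_row` and the choice of `age` and `fare` as the numeric columns are assumptions, and the commit may well have edited the files by other means.

```python
# Assumed set of numeric columns; the commit does not list them.
NUMERIC_COLUMNS = ("age", "fare")


def clean_row(row):
    """Blank out '?' placeholders and write numbers in decimal notation."""
    cleaned = dict(row)
    for column in NUMERIC_COLUMNS:
        value = cleaned.get(column, "")
        if value == "?":
            cleaned[column] = ""  # missing value becomes an empty field
        elif value != "":
            cleaned[column] = str(float(value))  # e.g. "29" becomes "29.0"
    return cleaned
```

An empty field is read as a missing value by most CSV parsers, whereas a literal '?' would force the whole column to be treated as text.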

pctablet505 · 2025-11-20T04:46:30Z

@gemini-code-assist review

gemini-code-assist

Code Review

This pull request resolves an issue with loading the Titanic dataset in tests by vendoring the data files directly into the repository. This is a solid improvement as it removes external network dependencies, making the test suite more robust and faster. The implementation is correct. I've left one minor comment to improve code clarity.

autokeras/test_utils.py

pctablet505 requested review from fchollet and haifeng-jin as code owners November 19, 2025 05:30

gemini-code-assist bot reviewed Nov 19, 2025

haifeng-jin reviewed Nov 19, 2025

pctablet505 added 4 commits November 20, 2025 09:39

Update test_utils.py

1e444df

Update test_utils.py

d7056d9

pctablet505 requested a review from haifeng-jin November 20, 2025 04:48

gemini-code-assist bot reviewed Nov 20, 2025

pctablet505 added 2 commits November 20, 2025 10:25

Update test_utils.py

2c77eeb

Update test_utils.py

28a8e70

hertschuh approved these changes Nov 25, 2025

hertschuh merged commit a2446cf into keras-team:master Nov 25, 2025
4 of 5 checks passed
