Conversation

ArjunJagdale (Contributor) commented Nov 16, 2025

This PR implements graceful truncation behavior for datasets with extremely wide schemas (i.e., thousands of columns), addressing #1172 and related discussions on improving the viewer’s robustness for modern AI-scale tabular datasets.

Previously, when the number of columns exceeded columns_max_number (default: 1000), several viewer steps, such as first-rows and the opt-in/out URL scan, would raise TooManyColumnsError. This made the viewer unusable for many large-scale datasets, even when a partial preview would have been perfectly acceptable.

Instead of failing, we now gracefully truncate the schema to the first columns_max_number columns and continue processing normally.

Implemented in:

libs/libcommon/src/libcommon/viewer_utils/rows.py

1. Replaces the hard error with truncation (sketched below)
2. Adds response["truncated_columns"] (list of dropped columns)
3. Sets response["truncated"] = True when applicable

services/worker/src/worker/job_runners/split/opt_in_out_urls_scan_from_streaming.py

1. Truncates image_url_columns instead of raising TooManyColumnsError
2. Emits a warning
3. Propagates truncation info to get_rows_or_raise
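
For illustration, the truncation step could look like the minimal sketch below. The helper name truncate_columns is hypothetical; the real changes live in the two files listed above.

```python
# Minimal sketch of graceful column truncation; truncate_columns is an
# illustrative name, not the PR's actual function.
import logging


def truncate_columns(
    column_names: list[str], columns_max_number: int
) -> tuple[list[str], list[str]]:
    """Keep the first columns_max_number columns; return (kept, dropped)."""
    if len(column_names) <= columns_max_number:
        return column_names, []
    logging.warning(
        "Schema has %d columns; truncating to the first %d instead of "
        "raising TooManyColumnsError.",
        len(column_names),
        columns_max_number,
    )
    return column_names[:columns_max_number], column_names[columns_max_number:]
```

The same pattern would apply to image_url_columns in the URL-scan job, with the truncation flag propagated to get_rows_or_raise.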

Commit: Log a warning and truncate image URL columns if they exceed the maximum allowed number.
ArjunJagdale (Contributor, Author) commented:

@severo I'd like your thoughts on this :)

ArjunJagdale changed the title from "Process part" to "Truncate Excess Columns Instead of Failing When Columns > columns_max_number" on Nov 16, 2025
severo (Collaborator) left a comment:

Good idea.

Can you also add tests for these cases?

```python
)

if columns_were_truncated:
    response["truncated_columns"] = truncated_columns
```
severo (Collaborator):

I think we don't need the list of missing columns in the response. Just a boolean, I guess.

```diff
  response = response_features_only
  response["rows"] = row_items
- response["truncated"] = (not rows_content.all_fetched) or truncated
+ response["truncated"] = (
```
severo (Collaborator):

I think we should keep this field for truncated rows (we could have named it truncated_rows to be more explicit; maybe we can add that field and deprecate truncated at some point?), and have another field for truncated columns (let's call it truncated_columns).
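
For illustration, a response under this proposal might look like the following sketch (field values are made up):

```python
# Hypothetical response shape with the proposed fields; values are
# illustrative only.
response = {
    "features": [...],           # possibly cut to columns_max_number entries
    "rows": [...],
    "truncated": True,           # legacy field, kept for backward compatibility
    "truncated_rows": True,      # proposed explicit name for row truncation
    "truncated_columns": False,  # True when columns were dropped
}
```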

severo (Collaborator):

Also, we need to apply this truncation to first_rows, I guess. And we should update the docs (the openapi spec in particular).

```python
    num_scanned_rows=num_scanned_rows,
    has_urls_columns=True,
    full_scan=rows_content.all_fetched,
    truncated_columns=truncated,
```
severo (Collaborator):

Better to pass a boolean here, not a list of column names.

ArjunJagdale (Contributor, Author) commented Nov 17, 2025

@severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well?

In opt_in_out_urls_scan_from_streaming.py, the truncated_columns=truncated is already passing a boolean value as you suggested.

Also, in rows.py:

```python
response["truncated"] = (
    (not rows_content.all_fetched)
    or truncated
    or columns_were_truncated
)

response["truncated_rows"] = (not rows_content.all_fetched) or truncated
response["truncated_columns"] = columns_were_truncated
```

Do I also need to update the type definition for SplitFirstRowsResponse to include the new truncated_rows and truncated_columns fields? If so, which file should I modify?

severo (Collaborator) commented Nov 17, 2025

We should keep "truncated" as it was before (only for truncated rows), otherwise we would report incorrectly in the dataset viewer for previously computed datasets.

SplitFirstRowsResponse: yes, in https://github.com/huggingface/dataset-viewer/blob/main/libs/libcommon/src/libcommon/dtos.py. And also update https://github.com/huggingface/dataset-viewer/blob/main/docs/source/openapi.json

> @severo Also, regarding first_rows.py - since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to be changed there as well?

Indeed.
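
For reference, the dtos.py change might look roughly like this minimal sketch; the existing fields of SplitFirstRowsResponse are assumptions based on context, and only the new boolean comes from this thread:

```python
# Hypothetical sketch of SplitFirstRowsResponse in libcommon's dtos.py.
# The existing fields are assumed; truncated keeps its original meaning
# (rows were truncated), and truncated_columns is the new field.
from typing import Any, TypedDict


class SplitFirstRowsResponse(TypedDict):
    dataset: str
    config: str
    split: str
    features: list[dict[str, Any]]
    rows: list[dict[str, Any]]
    truncated: bool          # unchanged: True when rows were truncated
    truncated_columns: bool  # new: True when the schema was cut to columns_max_number
```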

ArjunJagdale (Contributor, Author):

I will make the changes and let you know!

ArjunJagdale (Contributor, Author) commented Nov 17, 2025

@severo the changes are now applied in all four affected files.
The logic in libs/libcommon/viewer_utils/rows.py seems consistent with the new behavior, but let me know if you see anything else that should be adjusted.

ArjunJagdale requested a review from severo on November 20, 2025 at 18:54
severo (Collaborator) commented Nov 21, 2025

It's in good shape! As I mentioned before, can you add unit tests for the changes?
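
For instance, a unit test for the truncation behavior might look like this minimal pytest sketch (truncate_columns is the illustrative helper sketched in the PR description above, not the actual implementation):

```python
# Minimal pytest sketch; truncate_columns is an illustrative helper,
# inlined here so the test is self-contained.
def truncate_columns(column_names, columns_max_number):
    """Keep the first columns_max_number columns; return (kept, dropped)."""
    if len(column_names) <= columns_max_number:
        return column_names, []
    return column_names[:columns_max_number], column_names[columns_max_number:]


def test_wide_schema_is_truncated_not_rejected():
    columns = [f"col_{i}" for i in range(1500)]
    kept, dropped = truncate_columns(columns, columns_max_number=1000)
    assert len(kept) == 1000
    assert len(dropped) == 500


def test_narrow_schema_is_left_untouched():
    columns = ["a", "b", "c"]
    kept, dropped = truncate_columns(columns, columns_max_number=1000)
    assert kept == columns
    assert dropped == []
```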

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
