Truncate Excess Columns Instead of Failing When Columns > columns_max_number #3255
Conversation
Log a warning and truncate image URL columns if they exceed the maximum allowed number.
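The warn-and-truncate behavior the PR describes can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name `truncate_columns` and its signature are assumptions.

```python
import logging

def truncate_columns(columns: list[str], columns_max_number: int) -> tuple[list[str], bool]:
    """Return at most `columns_max_number` columns and whether any were dropped."""
    if len(columns) <= columns_max_number:
        return columns, False
    logging.warning(
        "%d columns exceed the maximum of %d; truncating the excess.",
        len(columns),
        columns_max_number,
    )
    return columns[:columns_max_number], True
```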
@severo would like your thoughts on this :)
severo left a comment
Good idea.
Can you also add the tests for these cases?
)
if columns_were_truncated:
    response["truncated_columns"] = truncated_columns
I think we don't need the list of missing columns in the response. Just a boolean, I guess.
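The two response shapes under discussion can be sketched side by side (function names are illustrative; the reviewer favors the boolean form):

```python
def with_column_list(response: dict, dropped: list[str]) -> dict:
    # current proposal: expose the names of the dropped columns
    response["truncated_columns"] = dropped
    return response

def with_boolean_flag(response: dict, dropped: list[str]) -> dict:
    # reviewer's suggestion: expose only whether truncation happened
    response["truncated_columns"] = len(dropped) > 0
    return response
```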
response = response_features_only
response["rows"] = row_items
response["truncated"] = (not rows_content.all_fetched) or truncated
response["truncated"] = (
I think we should keep this field for truncated rows (we could have named it truncated_rows to be more explicit; maybe we can add that field and deprecate truncated at some point?), and have another field for truncated columns (let's call it truncated_columns).
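The field layout proposed in this comment can be sketched as follows. The field names come from the discussion; the helper and its parameters are assumptions for illustration.

```python
def set_truncation_fields(
    response: dict, all_fetched: bool, rows_truncated: bool, columns_truncated: bool
) -> dict:
    rows_flag = (not all_fetched) or rows_truncated
    response["truncated"] = rows_flag                  # legacy field: rows only
    response["truncated_rows"] = rows_flag             # proposed explicit replacement
    response["truncated_columns"] = columns_truncated  # new, separate flag
    return response
```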
Also, we need to apply this truncation to first_rows, I guess. And we should update the docs (the openapi spec in particular)
num_scanned_rows=num_scanned_rows,
has_urls_columns=True,
full_scan=rows_content.all_fetched,
truncated_columns=truncated,
Better here: a boolean, not a list of column names.
@severo Also, regarding first_rows.py: since it calls create_first_rows_response(), will it automatically get these new fields, or does something need to change there as well? In opt_in_out_urls_scan_from_streaming.py, truncated_columns=truncated already passes a boolean, as you suggested. And in rows.py, do I also need to update the type definition for SplitFirstRowsResponse to include the new truncated_rows and truncated_columns fields? If so, which file should I modify?
We should keep "truncated" as it was before (only for truncated rows), otherwise we would report incorrectly in the dataset viewer for previously computed datasets. SplitFirstRowsResponse: yes, in https://github.com/huggingface/dataset-viewer/blob/main/libs/libcommon/src/libcommon/dtos.py. And also update https://github.com/huggingface/dataset-viewer/blob/main/docs/source/openapi.json
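A sketch of how the TypedDict in dtos.py might be extended. The surrounding fields here are assumptions about the real SplitFirstRowsResponse definition; only truncated_columns is the addition under discussion, and truncated keeps its original rows-only meaning.

```python
from typing import Any, TypedDict

class SplitFirstRowsResponse(TypedDict):
    # existing fields below are assumptions about the real dtos.py definition
    dataset: str
    config: str
    split: str
    features: list[Any]
    rows: list[Any]
    truncated: bool          # kept as before: rows truncation only
    truncated_columns: bool  # new field discussed in this PR
```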
Indeed.
I will do the changes and will let you know!
@severo the changes are now applied in all four affected files.
It's in good shape! As I mentioned before, can you add unit tests for the changes?
This PR implements graceful truncation behavior for datasets with extremely wide schemas (i.e., thousands of columns), addressing #1172 and related discussions on improving the viewer’s robustness for modern AI-scale tabular datasets.
Previously, when the number of columns exceeded columns_max_number (default: 1000), several viewer steps—such as first-rows and opt-in/out URL scan—would raise TooManyColumnsError.
This made the viewer unusable for many large-scale datasets, even when a partial preview would have been perfectly acceptable.
Instead of failing, we now gracefully truncate the schema to the first columns_max_number columns and continue processing normally.
Implemented in:
libs/libcommon/src/libcommon/viewer_utils/rows.py
1) Replaces the hard error with truncation
2) Adds response["truncated_columns"] (list of dropped columns)
3) Marks response["truncated"] = True when applicable
services/worker/src/worker/job_runners/split/opt_in_out_urls_scan_from_streaming.py
1) Truncates image_url_columns instead of raising TooManyColumnsError
2) Emits a warning
3) Propagates truncation info to get_rows_or_raise
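The overall flow in those files can be sketched end to end. This is a simplified illustration under assumed names (`features` as a column-name-to-type mapping, `columns_max_number` as the limit), not the PR's actual implementation.

```python
import logging

def build_preview(features: dict, columns_max_number: int) -> dict:
    """Truncate overly wide schemas instead of failing, and flag it in the response."""
    names = list(features)
    truncated = len(names) > columns_max_number
    if truncated:
        logging.warning(
            "Truncating %d columns to the first %d.", len(names), columns_max_number
        )
        names = names[:columns_max_number]
    return {
        "features": {name: features[name] for name in names},
        "truncated_columns": truncated,
    }
```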