-
Notifications
You must be signed in to change notification settings - Fork 191
Cloud IO Improvements #1226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Cloud IO Improvements #1226
Conversation
Signed-off-by: Praateek <[email protected]>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Signed-off-by: Praateek <[email protected]>
| # TODO : I think we can remove this | ||
| self.fs = get_fs(self.ids_to_remove_path, storage_options=self.read_kwargs.get("storage_options", {})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀
|
|
||
| # TODO generating filesystem for each task will be inefficient, we should benchmark pq.read_table # noqa: TD004 | ||
| fs = get_fs(paths[0], storage_options=read_kwargs.get("storage_options", {})) | ||
| # pop storage_options from read_kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a more explanatory comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iirc this was because pd.read_parquet(list_of_files) fails for cloud paths.. the hack was using filesystem=fs.. i think for now we shouldn't do this but rather pd.concat([pd.read_parquet(f) for f in files])
| # Add any additional kwargs, allowing them to override defaults | ||
| write_kwargs.update(self.write_kwargs) | ||
| df.to_parquet(file_path, **write_kwargs) | ||
| # Pop storage_options as we're directly passing the filesystem to the writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgotten why do we need filesystem here? would prefer to not pass filesystem
Signed-off-by: Praateek <[email protected]>
Description
Usage
# Add snippet demonstrating usageChecklist