@praateekmahajan
Contributor

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Praateek <[email protected]>
Comment on lines 61 to 62
# TODO : I think we can remove this
self.fs = get_fs(self.ids_to_remove_path, storage_options=self.read_kwargs.get("storage_options", {}))
Contributor

👀


# TODO generating filesystem for each task will be inefficient, we should benchmark pq.read_table # noqa: TD004
fs = get_fs(paths[0], storage_options=read_kwargs.get("storage_options", {}))
# pop storage_options from read_kwargs
Contributor

Can you add a more explanatory comment here?

Contributor Author

IIRC this was because pd.read_parquet(list_of_files) fails for cloud paths; the hack was to pass filesystem=fs. I think for now we shouldn't do this and should instead use pd.concat([pd.read_parquet(f) for f in files]).
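
A minimal sketch of the per-file approach suggested above, assuming read_kwargs may carry a storage_options entry as in the diff. The helper name read_parquet_files and the injectable reader argument are hypothetical and not part of this PR; the bucket paths in the usage note are placeholders.

```python
from typing import Any, Callable, Dict, Optional, Sequence

import pandas as pd


def read_parquet_files(
    paths: Sequence[str],
    read_kwargs: Optional[Dict[str, Any]] = None,
    reader: Callable[..., pd.DataFrame] = pd.read_parquet,
) -> pd.DataFrame:
    """Read each Parquet file on its own and concatenate the results.

    Hypothetical helper: avoids building a filesystem object by letting
    pandas resolve each path from storage_options.
    """
    if not paths:
        raise ValueError("paths must contain at least one file")

    # Copy so the caller's dict is not mutated.
    kwargs = dict(read_kwargs or {})
    # pandas raises for a non-empty storage_options on plain local paths,
    # and an empty dict carries no information, so normalise it to None.
    storage_options = kwargs.pop("storage_options", None) or None

    frames = [reader(path, storage_options=storage_options, **kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Usage would look like read_parquet_files(["s3://bucket/a.parquet", "s3://bucket/b.parquet"], {"storage_options": {"anon": True}}). Reading file by file costs one call per path, so it is worth benchmarking against the filesystem=fs variant before settling on it.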

# Add any additional kwargs, allowing them to override defaults
write_kwargs.update(self.write_kwargs)
df.to_parquet(file_path, **write_kwargs)
# Pop storage_options as we're directly passing the filesystem to the writer
Contributor Author

I've forgotten why we need filesystem here. I would prefer not to pass filesystem.
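
One possible shape for the writer without an explicit filesystem, as a sketch only: storage_options stays in write_kwargs and DataFrame.to_parquet resolves the target from the path. The function name write_parquet and the index=False default are assumptions for illustration, not the defaults used in this PR.

```python
from typing import Any, Dict, Optional

import pandas as pd


def write_parquet(
    df: pd.DataFrame,
    file_path: str,
    write_kwargs: Optional[Dict[str, Any]] = None,
) -> None:
    """Write df to file_path, letting pandas handle storage_options itself."""
    # Assumed default for illustration; user kwargs override it below.
    kwargs: Dict[str, Any] = {"index": False}
    kwargs.update(write_kwargs or {})

    # Drop an empty or missing storage_options so local paths keep working;
    # a non-empty one is forwarded untouched instead of being popped.
    if not kwargs.get("storage_options"):
        kwargs.pop("storage_options", None)

    df.to_parquet(file_path, **kwargs)
```

Whether this is sufficient depends on why filesystem was passed originally, which is still the open question in this thread.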

Signed-off-by: Praateek <[email protected]>
2 participants