-
Notifications
You must be signed in to change notification settings - Fork 107
Also strip 'wasb' and 'wasbs' protocol schemes #493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Can this be merged? It has been tested on both Azurite and Azure Blob Storage. |
Any update on this? When this will be merged |
@christophediprima @heman026 Could you elaborate on why this is needed? And what is this change blocking? I saw there's quite a few cross-linked GitHub issues on the Iceberg repository, but I'm not sure I'm able to fully untangle the line of reasoning for needing this change. My main hesitations right now are:
That being said if the motivation is strong for supporting this, I'm open to pulling it in. |
hey @kyleknap chiming in from the (py)iceberg side. We use both pyarrow and fsspec for filesystem operations. It would be great for the 2 to have feature parity regarding the supported protocols. Pyarrow AzureFileSystem added support for these parameters
This allows us to test using the different protocol against azurite. |
regarding your concerns above,
+1, i would recommend the approach pyarrow took. keep the same defaults but make it configurable.
seems weird to add a feature that is deprecated for removal, but for now id like to be feature parity with pyarrow fs Let me know what you think! |
i see that the
In order to support @christophediprima were you able to get |
Okay I tested this PR locally and confirm that it is sufficient to support |
Quick question here: |
Thanks @kevinjqliu for chiming in! Just replying back to some comments/questions
Would it be possible to just format a connection string when testing against azurite? That is what
Yeah instead of adding a
Yeah basically these protocols are treated as aliases of I'm not too familiar with PyIceberg and thank you for the patience here, but I guess what I'm trying to understand is who/what is setting the protocol to Also does PyArrow document which protocols it currently supports? I was trying to dive more into it to understand what the parity gaps were in terms of protocol support, but only found comments in the code like this which indicate Thanks! |
Caught up with @kyleknap offline. There are a couple of different protocol schemes for azure storage. Amongst those are abfs[s], az, and wasb[s]. These are written depending on the specific libraries used. For wasb[s], any client that uses https://hadoop.apache.org/docs/stable/hadoop-azure/wasb.html will write wasb[s] as the uri. Snowflake is also writing this scheme (see apache/iceberg-python#1606 and apache/iceberg#10127). Even though wasb[s] is marked for deprecation, its uri and scheme will still be out in the wild and the underlying storage files are still accessible. As a filesystem implementation for azure storage, adlfs should also support wasb[s]. The support here only means to allow parsing this scheme and its related uri. There is enough information in the wasb[s] uri to be used by the underlying adlfs to interact with storage. (see apache/iceberg#10127 (comment)) We should add wasb[s] support to pyarrow as well. |
This PR allowlists wasb[s] protocol schemes. and things do just work afterwards. I verified locally :) |
@kevinjqliu Thanks for the context and updating the thread! That all makes sense on how it fits together. It also sounds like Snowflake has updated its recent versions to use I'm going to spend some time to confirm that If we decide to proceed forward, we should also make sure to update this PR so that there are tests to make sure |
@martindurant after diving into the adlfs and other fsspec implementations, I think I get what you may be getting at here... It seems like adlfs is inconsistent with the other fsspec implementations, where most use the shared So, I'm wondering if it makes sense to just:
@martindurant Do you foresee any unintended side effects/gotchas in updating the Assuming this approach makes sense, I do like it in the sense that:
|
No, it should be fine. Be aware, that the tuple is used for prefix stripping as discussed here, but the dispatch mechanism (e.g., calling fsspec.filesystem() ) uses fsspec.registry as a lookup, so it is very likely, but not guaranteed, that one of the .protocol options are used to bootstrap a file operation. Still, .protocol is meant to be the authoritative one, better than hard-coding within a function. |
@martindurant Thanks for the confirmation!
Yep! PyIceberg does not use any of the fsspec dispatching methods; it imports and instantiates the @kevinjqliu @christophediprima I'm thinking we go forward with the approach I suggested here to unblock PyIceberg: #493 (comment). Let me know if either of you have the cycles to update the PR. Otherwise, I can send a PR later in the week with this approach. Let me know! |
Add support to wasb scheme.