
Also strip 'wasb' and 'wasbs' protocol schemes #493


Open: wants to merge 1 commit into main

Conversation

christophediprima

Add support for the wasb scheme.

@christophediprima (Author) commented Mar 10, 2025

Can this be merged? It has been tested on both Azurite and Azure Blob Storage.

@heman026 commented Aug 1, 2025

Any update on this? When will this be merged?

@martindurant (Member)

@kyleknap ?

@kyleknap (Collaborator) commented Aug 1, 2025

@christophediprima @heman026 Could you elaborate on why this is needed? And what is this change blocking? I saw there are quite a few cross-linked GitHub issues on the Iceberg repository, but I'm not sure I can fully untangle the line of reasoning for needing this change.

My main hesitations right now are:

  • Adlfs already accepts several schemes, and I'd like to avoid expanding the number of schemes accepted for simplicity and maintainability reasons.
  • Support for wasb and wasbs protocols seems to refer to the WASB driver: https://hadoop.apache.org/docs/stable/hadoop-azure/wasb.html which is on the path for deprecation and removal.

That being said, if the motivation for supporting this is strong, I'm open to pulling it in.

@kevinjqliu

Hey @kyleknap, chiming in from the (Py)Iceberg side. We use both PyArrow and fsspec for filesystem operations. It would be great for the two to have feature parity regarding the supported protocols.

PyArrow's AzureFileSystem added support for these parameters:

blob_storage_authority str, default None
hostname[:port] of the Blob Service. Defaults to .blob.core.windows.net. Useful for connecting to a local emulator, like Azurite.

blob_storage_scheme str, default None
Either http or https. Defaults to https. Useful for connecting to a local emulator, like Azurite.

dfs_storage_authority str, default None 
hostname[:port] of the Data Lake Gen 2 Service. Defaults to .dfs.core.windows.net. Useful for connecting to a local emulator, like Azurite. 

dfs_storage_scheme str, default None 
Either http or https. Defaults to https. Useful for connecting to a local emulator, like Azurite.

This allows us to test using different protocols against Azurite.
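
For illustration, here is a minimal sketch of pointing PyArrow's AzureFileSystem at a local Azurite emulator with those parameters (devstoreaccount1 and its key are Azurite's published development defaults, not secrets):

```python
# Sketch: connect PyArrow's AzureFileSystem to a local Azurite emulator.
# devstoreaccount1 / port 10000 are Azurite's well-known defaults.
from pyarrow.fs import AzureFileSystem

fs = AzureFileSystem(
    account_name="devstoreaccount1",
    account_key="Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",
    blob_storage_authority="127.0.0.1:10000",
    blob_storage_scheme="http",
)
```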

@kevinjqliu
Copy link

Regarding your concerns above:

> Adlfs already accepts several schemes, and I'd like to avoid expanding the number of schemes accepted for simplicity and maintainability reasons.

+1, I would recommend the approach PyArrow took: keep the same defaults but make it configurable.

> Support for wasb and wasbs protocols seems to refer to the WASB driver: https://hadoop.apache.org/docs/stable/hadoop-azure/wasb.html which is on the path for deprecation and removal.

It seems weird to add a feature that is deprecated for removal, but for now I'd like to have feature parity with the PyArrow filesystem.

Let me know what you think!

@kevinjqliu

I see that account_host acts similarly to the blob_storage_authority/dfs_storage_authority params from the PyArrow fs; it allows us to override the URL (account_url) of the blob storage account.

> account_host: str
> The storage account host. This string is the entire URL for the storage after the https://, i.e. "https://{account_host}". This parameter is only required for Azure clouds where account URLs do not end with "blob.core.windows.net". Note that the account_name parameter is still required.
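
For illustration, a hedged sketch of passing account_host (all values made up):

```python
# Sketch with made-up values: override the account host, e.g. for an
# Azure cloud whose account URLs do not end with blob.core.windows.net.
import adlfs

fs = adlfs.AzureBlobFileSystem(
    account_name="myaccount",
    account_host="myaccount.blob.core.chinacloudapi.cn",
    anon=True,
)
```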

In order to support wasb/wasbs, I think we'll also need to make the scheme configurable.

@christophediprima, were you able to get wasb working without configuring the scheme?

@kevinjqliu

Okay, I tested this PR locally and can confirm that it is sufficient to support wasb and wasbs.
The way _strip_protocol is implemented is odd: we first strip the "supported protocols" ("abfs://", "az://", "abfss://", "wasb://", "wasbs://"), then add abfs:// back, and finally use infer_storage_options to discard the protocol again.
But either way, the path is parsed correctly.
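
To make that flow concrete, here is a rough sketch of the sequence described above (infer_storage_options is fsspec's real helper; the rest is illustrative, not adlfs's actual code):

```python
from fsspec.utils import infer_storage_options

SCHEMES = ("abfs://", "az://", "abfss://", "wasb://", "wasbs://")

def strip_protocol_sketch(path: str) -> str:
    # 1. Strip any allow-listed scheme...
    for scheme in SCHEMES:
        if path.startswith(scheme):
            path = path[len(scheme):]
            break
    # 2. ...re-prefix with abfs:// so the URI parses uniformly...
    ops = infer_storage_options("abfs://" + path)
    # 3. ...then drop the protocol again, keeping container + key.
    container = ops.get("username") or ops.get("host") or ""
    return container + ops["path"]

# Both spellings resolve to the same container/key path:
assert (
    strip_protocol_sketch("wasbs://data@acct.blob.core.windows.net/t/f.parquet")
    == strip_protocol_sketch("abfss://data@acct.dfs.core.windows.net/t/f.parquet")
)
```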

@martindurant (Member)

Quick question here: _strip_protocol is supposed to take into account all of the members of the class's .protocol tuple. This goes to what @kevinjqliu says: perhaps there is a more obvious place to put the protocol options, and things will just work.

@kyleknap (Collaborator)

Thanks @kevinjqliu for chiming in! Just replying to some comments/questions:

> This allows us to test using different protocols against Azurite.

Would it be possible to just format a connection string when testing against Azurite? That is what adlfs does for testing, and it should allow you to set the endpoint and make the scheme http.
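
For reference, a connection string for Azurite can be formatted like this (a sketch using Azurite's published development account name/key, not adlfs's actual test fixture):

```python
import adlfs

# Azurite's well-known development credentials; BlobEndpoint sets both
# the endpoint and the http scheme.
AZURITE_CONN_STR = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)

fs = adlfs.AzureBlobFileSystem(connection_string=AZURITE_CONN_STR)
```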

> I see that account_host acts similarly to the blob_storage_authority/dfs_storage_authority params from the PyArrow fs; it allows us to override the URL (account_url) of the blob storage account.

Yeah, instead of adding a blob_storage_authority or dfs_storage_authority, I'd prefer we stick with the account_host parameter already provided, or lean into any suggestions @martindurant has on this. The only blocker that we've run into in trying to use Azurite without a connection string, to my knowledge, is being able to set the scheme to http (adlfs interfaces currently force HTTPS for non-connection-string usage). We'd be open to making HTTP vs. HTTPS configurable, given that needing it for use with Azurite is a solid use case.

> confirm that it is sufficient to support wasb and wasbs.
> The way _strip_protocol is implemented is odd: we first strip the "supported protocols" ("abfs://", "az://", "abfss://", "wasb://", "wasbs://"), then add abfs:// back, and finally use infer_storage_options to discard the protocol again.
> But either way, the path is parsed correctly.

Yeah, basically these protocols are treated as aliases of abfs://.

I'm not too familiar with PyIceberg, and thank you for the patience here, but I guess what I'm trying to understand is who/what is setting the protocol to wasb:// and wasbs://? Is it being set as part of the PyIceberg implementation, or is an upstream library that uses PyIceberg setting the protocol to wasb:// / wasbs://? Why can't abfs:// or another supported protocol (e.g., az://) be used instead?

Also, does PyArrow document which protocols it currently supports? I was trying to dive more into it to understand what the parity gaps were in terms of protocol support, but only found comments in the code like this, which indicate that only abfs:// and abfss:// are supported.

Thanks!

@kevinjqliu commented Aug 14, 2025

Caught up with @kyleknap offline.

There are a couple of different protocol schemes for Azure storage. Amongst those are abfs[s], az, and wasb[s]. Which one gets written depends on the specific library used. For wasb[s], any client that uses https://hadoop.apache.org/docs/stable/hadoop-azure/wasb.html will write wasb[s] as the URI scheme. Snowflake is also writing this scheme (see apache/iceberg-python#1606 and apache/iceberg#10127). Even though wasb[s] is marked for deprecation, its URIs will still be out in the wild, and the underlying storage files are still accessible.

As a filesystem implementation for Azure storage, adlfs should also support wasb[s]. Support here only means allowing this scheme and its related URIs to be parsed. There is enough information in a wasb[s] URI for adlfs to interact with storage. (see apache/iceberg#10127 (comment))

We should add wasb[s] support to pyarrow as well.

@kevinjqliu

> perhaps there is a more obvious place to put the protocol options, and things will just work.

This PR allowlists the wasb[s] protocol schemes, and things do just work afterwards. I verified locally :)

@kyleknap (Collaborator)

@kevinjqliu Thanks for the context and for updating the thread! That all makes sense on how it fits together. It also sounds like Snowflake has updated its recent versions to use abfs (https://docs.snowflake.com/en/release-notes/bcr-bundles/2025_03/bcr-1935), so it's really just older versions of Snowflake, and metadata files that have not been updated recently, that would still produce the old wasb protocol.

I'm going to spend some time confirming that the wasb protocol can be treated as an alias of the other protocols and exploring whether there are other alternatives to get this all unblocked. I'll update the thread.

If we decide to proceed, we should also make sure to update this PR so that there are tests confirming adlfs accepts wasb- and wasbs-based URIs, and add a changelog entry to note the support.

@kyleknap (Collaborator)

> Quick question here: _strip_protocol is supposed to take into account all of the members of the class's .protocol tuple. This goes to what @kevinjqliu says: perhaps there is a more obvious place to put the protocol options, and things will just work.

@martindurant after diving into the adlfs and other fsspec implementations, I think I get what you may be getting at here...

It seems like adlfs is inconsistent with the other fsspec implementations: most use the shared _strip_protocol(), which takes into account the class .protocol string/tuple, whereas adlfs does not leverage its protocol tuple at all and uses a hardcoded tuple instead.

So, I'm wondering if it makes sense to just:

  1. Hoist the current hardcoded tuple to the class's protocol property
  2. And then update adlfs's _strip_protocol() class method to use the class protocol tuple instead of the hardcoded one (see the sketch after this list).
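
A rough sketch of what that refactor could look like (illustrative only, not a final implementation):

```python
from fsspec import AbstractFileSystem

class AzureBlobFileSystemSketch(AbstractFileSystem):
    # 1. The hardcoded tuple becomes the class protocol attribute...
    protocol = ("abfs", "az", "abfss")

    @classmethod
    def _strip_protocol(cls, path):
        # 2. ...and _strip_protocol consults cls.protocol, so extending
        # the tuple (e.g. with "wasb", "wasbs") extends stripping too.
        for protocol in cls.protocol:
            prefix = f"{protocol}://"
            if path.startswith(prefix):
                path = path[len(prefix):]
                break
        return path  # the real method would continue parsing from here
```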

@martindurant Do you foresee any unintended side effects/gotchas in updating the protocol on the filesystem class here to ("abfs", "az", "abfss")? I don't think I have a strong enough grasp yet of how the protocol is used throughout fsspec to be aware of what side effects this may cause.

Assuming this approach makes sense, I do like it in the sense that:

  1. Downstream consumers of adlfs now have a hook to unblock themselves if adlfs does not support a particular protocol scheme in the future; they would just need to instantiate the file system and then reset the protocol member to include any additional protocols (sketched below).
  2. This approach could be used for wasb and wasbs to unblock PyIceberg while not giving more life to the deprecated WASB-style URIs by explicitly including them in adlfs's default protocols tuple.
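
Assuming that refactor lands, the hook from point 1 could look something like this (illustrative):

```python
import adlfs

# Illustrative: once _strip_protocol reads the class's protocol tuple,
# a downstream consumer could opt in to extra schemes, e.g. via a
# subclass, without changing adlfs's defaults.
class WasbAwareAzureBlobFileSystem(adlfs.AzureBlobFileSystem):
    protocol = ("abfs", "az", "abfss", "wasb", "wasbs")
```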

@martindurant (Member)

> Do you foresee any unintended side effects/gotchas in updating the protocol on the filesystem class here to ("abfs", "az", "abfss")?

No, it should be fine. Be aware that the tuple is used for prefix stripping as discussed here, but the dispatch mechanism (e.g., calling fsspec.filesystem()) uses fsspec.registry as a lookup, so it is very likely, but not guaranteed, that one of the .protocol options is used to bootstrap a file operation. Still, .protocol is meant to be the authoritative one, better than hard-coding within a function.
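
Concretely, the two paths differ like this (a hedged sketch; account values are illustrative):

```python
import adlfs
import fsspec

# Dispatch resolves the scheme through fsspec.registry...
fs_via_registry = fsspec.filesystem("abfs", account_name="acct", anon=True)

# ...while direct instantiation bypasses the registry; only .protocol
# matters for prefix stripping.
fs_direct = adlfs.AzureBlobFileSystem(account_name="acct", anon=True)
```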

@kyleknap (Collaborator)

@martindurant Thanks for the confirmation!

> but the dispatch mechanism (e.g., calling fsspec.filesystem()) uses fsspec.registry as a lookup, so it is very likely, but not guaranteed, that one of the .protocol options is used to bootstrap a file operation.

Yep! PyIceberg does not use any of the fsspec dispatching methods; it imports and instantiates the AzureBlobFileSystem directly. So, the registry won't need to be updated to use this approach.
I think abfss (note the extra s) will be the only value in the .protocol tuple that is not in the fsspec registry. I'd prefer we hold off on adding abfss to the registry, though, so that we avoid polluting the global fsspec registry with more Azure storage aliases and continue to direct users to the main ones, abfs and az.

@kevinjqliu @christophediprima I'm thinking we go forward with the approach I suggested here to unblock PyIceberg: #493 (comment). Let me know if either of you have the cycles to update the PR. Otherwise, I can send a PR later in the week with this approach. Let me know!
