PDEP-15: Reject PDEP-10 #58623

lithomas1 · 2024-05-07T21:56:25Z

Just something quick and dirty I threw up that we should talk about tomorrow at the dev call.

lithomas1 · 2024-05-07T23:04:22Z

/preview

github-actions · 2024-05-07T23:05:48Z

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58623/

lithomas1 · 2024-05-07T23:12:38Z

Ugh, looks like the bullets aren't rendering correctly.

WillAyd · 2024-05-07T23:17:35Z

Did I miss an official vote on rejecting this? I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback

lithomas1 · 2024-05-07T23:21:51Z

Nope, just opening since I said I would in the discussion issue.

We'll still need a formal vote - I'm just kicking off the discussion here.

WillAyd · 2024-05-07T23:18:40Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+
+2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.
+
+   For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics


PDEP 14 does not change performance or memory savings if you do not have pyarrow installed

added a note in parentheses at the end of that sentence.

Did you push this up? I don't see anything in parentheses.

The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

I updated the PDEP-15 text, and forgot to remove the PDEP-10 changes.

I've removed the PDEP-10 changes now.

WillAyd · 2024-05-07T23:22:43Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+The primary reasons for rejecting this PDEP are twofold:
+
+1) Requiring pyarrow as a dependency causes installation problems.
+   - Pyarrow does not fit or has a hard time fitting in space-constrained environments 


I think what we could learn from this process is what caused this to change our minds? These issues were discussed leading up to the acceptance of PDEP-10.

The way this is written I think reads more as "we discovered this after the fact" instead of "we decided that X amount of negative feedback on these points was enough to revert". I think there is some value to future PDEPs to set expectations around the latter

lithomas1 · 2024-05-08T15:29:27Z

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

xref #57073 (comment) for context

simonjayhawkins

Thanks @lithomas1 for making the updates needed to formally reject PDEP-10.

web/pandas/pdeps/0010-required-pyarrow-dependency.md

Dr-Irv · 2024-05-08T19:03:46Z

As discussed in dev meeting on 5/8/24, suggestion is to do a new PDEP that reverts PDEP-10, and keeps any parts we want to keep.

simonjayhawkins · 2024-05-08T19:08:35Z

I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback

I'm now leaning towards approving the rejection. My approval of the original PDEP was based solely on improvements to default inference for other dtypes. Despite some recent comments about this, no discussion/clarification has followed on this topic. I'd need to see some positive evidence that the original PDEP-10 authors still intend to support delivering the promised enhancements in this area. Now that the implications of using pd.NA as a default have been discussed in more depth, I suspect that any improved inference would need a couple of dtype variants.

lithomas1 · 2024-05-08T21:27:44Z

As discussed in dev meeting on 5/8/24, suggestion is to do a new PDEP that reverts PDEP-10, and keeps any parts we want to keep.

Yep, I'm planning on updating this current PR to do that, so if anyone has any objections or whatever, we can still discuss here.

WillAyd · 2024-05-18T14:29:34Z

Minor note - do we need to rename this PR? Right now PDEP-10 shows twice on the website

lithomas1 · 2024-05-18T21:16:17Z

Yeah, I'll probably change the name to PDEP-15 once I get around to moving this to a separate PDEP (probably tomorrow).

I was travelling the past week, so didn't really have time then.

WillAyd · 2024-05-20T12:41:57Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+The primary reasons for rejecting this PDEP are twofold:
+
+1) Requiring pyarrow as a dependency causes installation problems.
+   - Pyarrow does not fit or has a hard time fitting in space-constrained environments 


Within the context of recent conversation I don't think this comment about AWS is true. AWS distributes an official pandas image for lambda which already includes pyarrow, pandas, and NumPy. This is all required by their own "AWS SDK on pandas" library.

The issue more finely scoped I think is that the default wheel installation via pip into a lambda image exceeds the 256 MB limit. Either using the official AWS provided image or using miniconda should not exceed the space limits
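For anyone who wants to check the size claim against their own build, here is a minimal sketch that sums the files in an unpacked install directory and compares the total against the limit quoted above. The function names are hypothetical, and treating "256 MB" as 256 MiB is an assumption.

```python
from pathlib import Path

# Limit quoted in the comment above; reading "MB" as MiB is an assumption.
LAMBDA_LIMIT_BYTES = 256 * 1024 * 1024


def directory_size(root: Path) -> int:
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def fits_in_limit(root: Path, limit: int = LAMBDA_LIMIT_BYTES) -> bool:
    """Return True if the unpacked tree under root is within the limit."""
    return directory_size(root) <= limit
```

Pointing `fits_in_limit` at the target directory of a `pip install --target` run would show whether a given combination of wheels stays under the limit.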

WillAyd · 2024-05-20T12:46:35Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+
+2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.
+
+   For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics


Did you push this up? I don't see anything in parentheses.

The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

Dr-Irv

My main comment is that PDEP-10 should be minimally modified, and that PDEP-15 has all the discussion about why we did the rejection.

web/pandas/pdeps/0010-required-pyarrow-dependency.md

web/pandas/pdeps/0015-do-not-require-pyarrow.md

WillAyd · 2024-05-20T19:15:46Z

web/pandas/pdeps/0010-required-pyarrow-dependency.md

+   While both of these reasons are mentioned in the drawbacks section of this PDEP, at the time of the writing
+of the PDEP, we underestimated the impact this would have on users, and also downstream developers.
+
+2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.


I personally don't find this point very convincing. Saying "Many of the benefits" but then following it up with one bullet point seems to miss the mark - what are the other many benefits that we don't need pyarrow for? Without pyarrow users are forgoing:

High performance string operations

Direct string creation from I/O routines (i.e. no intermediate copies)

Zero copy data exchange through Arrow C Data Interface

Performant, memory efficient, and consistent NA handling

On the larger roadmap of pandas this moves us away from tighter Arrow integration, which means we move further away from Arrow compute algorithms / joins and the larger ecosystem of tools that includes streaming, query optimizers, planners, data engines, etc...

I think this argument in its current form is saying "we don't need a car because we have a horse and buggy"

In PDEP-10, there are 3 benefits listed

pyarrow strings (possible to provide users this benefit without making pyarrow required)

Nested datatypes (can't have this without arrow, but this is a bit niche)

Interoperability (the alternative is the dataframe interchange protocol, which is more widely adopted at the moment. Not sure about the zero-copy stuff for that, though. I think it also might be possible to implement Arrow C Data interface support without taking on a hard dep on pyarrow)

Also, the primary beneficiary of this is other dataframe libraries (as opposed to us).

So, IMO, this argument is accurate, in that most of the benefits in PDEP-10 can be made possible (for those users who have pyarrow installed) without making pyarrow required.

The future benefits of Arrow are very compelling, but decisions on making a dependency required should be based on immediate and not future benefits. Like I said before, it is easy to reconsider this decision in a year's time if those future benefits materialize.

If you think points 1 and 3 are possible without pyarrow then the alternatives for that should be laid out in this PDEP, at least at a super high level. I'm assuming point 1 refers to the nanoarrow POC I was sharing; point 3 requires reimplementing the conversions that pyarrow already has. (I personally don't think building either of those from scratch is a good long term solution but it can at least be discussed)

For point 2 how do you know those are niche applications? It's easy to dismiss things that don't exist today as not worthwhile, but I get the feeling that there could be plenty of use cases for the aggregate types, since they have a natural fit with many of the Python containers.

On interoperability the long term prospects for the dataframe interchange protocol seem dubious, and we have even discussed moving that out of pandas (see #56732).

Also, the primary beneficiary of this is other dataframe libraries (as opposed to us).

The Arrow interchange protocol can be used by any library that needs to work with Arrow data - there is no limit to it being used by other dataframe libraries. It provides a standardized API so that third parties don't need to hack into our internals, which is a direct benefit for us. It also works in two directions - we can be a consumer just as much as a producer.
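To make the "consumer as much as producer" point concrete, a rough sketch of how a consuming library might detect what an incoming object offers. `__arrow_c_stream__` is the method defined by the Arrow PyCapsule interface and `__dataframe__` is the entry point of the dataframe interchange protocol; the helper name and the ordering (Arrow first) are assumptions made for this example.

```python
def exchange_protocols(obj: object) -> list[str]:
    """List the data-exchange protocols an object advertises.

    The Arrow stream export is listed first; that preference is an
    assumption of this sketch, not a rule from either specification.
    """
    found = []
    # Arrow PyCapsule interface: stream-level export of Arrow data.
    if hasattr(obj, "__arrow_c_stream__"):
        found.append("arrow_c_stream")
    # Dataframe interchange protocol.
    if hasattr(obj, "__dataframe__"):
        found.append("dataframe_interchange")
    return found
```

A consumer written this way does not need to import pandas, or know anything about its internals, to accept a pandas object.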

Nested datatypes (can't have this without arrow, but this is a bit niche)

Also wanted to point out that arrow has a decimal128 and decimal256 type which is especially useful for financial calculations where floating point inaccuracies cannot be tolerated, and the arrow decimal types are an extremely significant improvement over using object.
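A small illustration of the floating point problem being referred to. It uses the standard library's `decimal` module as a stand-in, so it runs without pyarrow; Arrow's decimal128 is a fixed-precision type and is not identical to Python's `Decimal`, so treat this as showing the class of error only.

```python
from decimal import Decimal

prices = ["0.10", "0.20"]

float_total = sum(float(p) for p in prices)
decimal_total = sum(Decimal(p) for p in prices)

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
print(float_total == 0.3)                # False
# Decimal arithmetic keeps the values exact.
print(decimal_total == Decimal("0.30"))  # True
```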

Sure, will update and add a note in the PDEP when I get time again.

Co-authored-by: Irv Lustig <[email protected]>

Dr-Irv

I'm fine with this PDEP, although I'm unsure whether language such as "we, the core team" should appear in a PDEP.

github-actions · 2024-06-22T00:06:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

WillAyd · 2024-09-03T16:04:33Z

web/pandas/pdeps/0015-do-not-require-pyarrow.md

+     arrays to those dtypes by default, without forcing a pyarrow requirement on users,
+     as there is no Python/numpy equivalent for these dtypes).
+
+   - Interoperability


I think @MarcoGorelli's point was to remove the Interoperability section entirely, but if that's not true then I don't understand the point this is trying to make in its current form.

The beneficiary of the Arrow C Data interface is not just other dataframe libraries - a decent listing can be found here:

apache/arrow#39195 (comment)

For a direct benefit to pandas, it helps with I/O to boost performance, ensure proper data types, and reduce the amount of code burden. We have already seen this benefit with ADBC drivers, and from the link above, it looks like there is some near-term potential for it to help Excel I/O via fastexcel

Sure, will take it out the next go.

The fact that we don't require PyArrow might put us in a bind for downstream libraries that want interchange with pandas, but themselves probably aren't in a position to require PyArrow. In particular, this conversation is happening with seaborn:

mwaskom/seaborn#3782 (comment)

Somewhat unfortunately, this may mean that we are asked to put more maintenance work into the interchange protocol

github-actions · 2024-11-09T00:07:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

datapythonista · 2025-06-06T14:08:48Z

If there are no objections I'll open a vote on this shortly. While in general we've been assuming that we don't want to make pyarrow a required dependency as previously agreed, no formal vote has happened (other than in other PDEPs that assumed this one was already approved). While the discussion here could continue for longer, I think it had enough iterations to be ready for a vote.

CC @pandas-dev/pandas-core

MarcoGorelli

Just left a comment on the musllinux topic, which I think has since been addressed

I'm glad that the decision to require PyArrow was postponed, and that in the ~2 years since the decision was taken it gave a chance for several issues to be addressed:

pyodide support (pyarrow's supported now 🥳 https://pyodide.org/en/stable/console.html)
musllinux support
smaller conda package
many, many bugs have been fixed. In Narwhals we parametrise tests over all pandas dtype backends, and since pandas 2.2, the pyarrow dtypes seem really solid

It's true that this increases the load on PyPI, and that's something I'd still feel uneasy about (Thomas is right that there are worse offenders - has anyone seen PySpark 4.0? https://pypi.org/project/pyspark/#pyspark-4.0.0.tar.gz). Having said that, I did give up my right to vote in #61163, so I can't (and won't) vote on this.

At this point, it does seem harder to make the case against requiring PyArrow

MarcoGorelli · 2025-06-06T17:34:56Z

web/pandas/pdeps/0015-do-not-require-pyarrow.md

+     - While pyarrow has made great strides towards supporting most platforms that pandas is installable on
+       (e.g. the recent addition of pyodide support in pyarrow), we would still have to drop support for some
+       platforms like musllinux (the feature request is tracked [here](https://github.com/apache/arrow/issues/18036)) if pyarrow was to be required.


not sure if I'm reading apache/arrow#18036 (comment) correctly, but it looks like PyArrow v20 does indeed support musllinux? this was mentioned in #50511 (comment) too

simonjayhawkins · 2025-06-07T08:24:02Z

I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback

I'm now leaning towards approving the rejection. My approval of the original PDEP was based solely on improvements to default inference for other dtypes. Despite some recent comments about this, no discussion/clarification has followed on this topic. I'd need to see some positive evidence that the original PDEP-10 authors still intend to support delivering the promised enhancements in this area. Now that the implications of using pd.NA as a default have been discussed in more depth, I suspect that any improved inference would need a couple of dtype variants.

Thinking some more I would now probably vote against rejecting PDEP-10. PDEP-10 states "NumPy object dtype will be avoided as much as possible." and gives a list of new dtypes that would be implemented in the future. Without these being included in another approved PDEP, rejecting PDEP-10 would be effectively removing this agreed upon direction from our roadmap which I don't think provides clarity to the community. If the issue is just the timing of the pyArrow requirement, then an update to the timelines in PDEP-10 may suffice?

datapythonista · 2025-06-07T09:11:31Z

This PDEP only rejects PDEP-10, and doesn't discuss how to move forward, which will have to be decided later (for example via PDEP-16 if this one is approved).

So I think the sooner we vote on this, the sooner we can decide what exactly to do for pandas 3.0, and the sooner we can start working on it.

I started the voting at #61596. The deadline for voting is 15 days, until June 22nd (I guess if everybody who can vote casts their vote before then, we can accept or reject the PDEP earlier).

simonjayhawkins · 2025-06-07T12:13:19Z

in #61596 (comment) @rhshadrach wrote

A call for vote is in violation of PDEP-1.

After 30 days, with a note that there is at most 30 days remaining for discussion, and that a vote will be called for if no discussion occurs in the next 15 days.

After 45 days, with a note that there is at most 15 days remaining for discussion, and that a vote will be called for in 15 days.
...
After 30 discussion days, in case 15 days passed without any new unaddressed comments, the authors may close the discussion period preemptively, by sending an early reminder of 15 days remaining until the voting period starts.

The timeline requires a 15 day announcement to commence with voting. This has not occurred.
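The day counts quoted above can be turned into dates with a few lines of arithmetic. This is only a sketch of the quoted text, not an official reading of PDEP-1; the function names are hypothetical, and the example dates are the PR's opening date and the date of this comment.

```python
from datetime import date, timedelta


def discussion_milestones(opened: date) -> dict[str, date]:
    """Reminder and closing dates implied by the 30/45/60 day figures."""
    return {
        "first_reminder": opened + timedelta(days=30),
        "second_reminder": opened + timedelta(days=45),
        "discussion_closes": opened + timedelta(days=60),
    }


def earliest_vote(notice_sent: date) -> date:
    """A vote may start 15 days after the notice is sent."""
    return notice_sent + timedelta(days=15)


print(discussion_milestones(date(2024, 5, 7))["discussion_closes"])  # 2024-07-06
print(earliest_vote(date(2025, 6, 7)))                               # 2025-06-22
```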

I may be misunderstanding or our PDEP guidelines may need to be updated but these cases are for when we move to an early vote. The PDEP process states "A PDEP discussion will remain open for up to 60 days." We have far exceeded this requirement and so we are not calling an early vote?

also may need clarification but PDEP-1 also states

Invalid PDEP #
For submitted PDEPs that do not contain proper documentation, are out of scope, or are not useful to the community for any other reason...

going stale is IMO a good reason to declare this PDEP invalid? We need better guidelines in PDEP-1 about when a PDEP can be closed automatically to eliminate the situation we find ourselves in here and other PDEPs such as PDEP-16. Stale PDEPs do not provide clarity to the community on the direction of pandas and the intent of the development team.

datapythonista · 2025-06-07T12:27:13Z

I officially call the vote for this PDEP.

I don't fully agree this is needed, as this PDEP exceeded the discussion period by almost a year, but that gives people 15 days to finish the discussion. Note that the PDEP author is not an active pandas maintainer, and I'm not aware of anyone who is planning to lead the discussion or update the PDEP with the feedback. So, while it feels just like a waste of time, I'll reopen the voting issue in 15 days. I'll also be sending an email as stated in PDEP-1, but note that my vote on this PDEP is -1, so my intention here is not to get the PDEP accepted, but to make a decision regarding PDEP-10 (requiring PyArrow in pandas 3.0), which so far continues to be approved.

rhshadrach · 2025-06-07T14:35:30Z

@simonjayhawkins

I may be misunderstanding or our PDEP guidelines may need to be updated but these cases are for when we move to an early vote. The PDEP process states "A PDEP discussion will remain open for up to 60 days.". we have far exceeded this requirement and so we are not calling an early vote?

In all cases there needs to be a notice that there will be a vote, even in the case of early voting.

by sending an early reminder of 15 days remaining until the voting period starts.

I do not think this needs updating. We need to give proper notice that a vote will be taking place. Anyone with final thoughts on a PDEP needs time to voice them prior to the vote starting.

going stale is IMO a good reason to declare this PDEP invalid? We need better guidelines in PDEP-1 about when a PDEP can be closed automatically to eliminate the situation we find ourselves in here and other PDEPs such as PDEP-16.

Agreed that deciding not to put a PDEP up for a vote does not require a notice. That is not what happened here. I don't think we need better guidelines - the guidelines are clear on what needs to happen in order to have a vote. If that doesn't happen, then there is no vote. The author of the PDEP can close it, or we can close one due to inactivity as in any other PR. It can always be reopened, and if another core member wants to take up a PDEP someone else has backed away from they are free to do so.

datapythonista · 2025-06-07T14:48:03Z

I do not think this needs updating. We need to give proper notice that a vote will be taking place. Anyone with final thoughts on a PDEP needs time to voice them prior to the vote starting.

I find it a paradox that we acted like PDEP-10 was already rejected immediately just because of some informal discussions, and now we need to wait one month to have confirmation if people really wanted it. Anyway, Marco brought some very good points, maybe we do benefit from some extra discussion time, we'll see.

Dr-Irv · 2025-06-07T15:22:24Z

This is a bit of a mess at this point.

PDEP-10 states:

Starting in pandas 2.2, pandas raises a FutureWarning when PyArrow is not installed in the users environment when pandas is imported. This will ensure that only one warning is raised and users can easily silence it if necessary. This warning will point to the feedback issue.

The feedback issue is here: #54466

We then received a lot of complaints, and we held a vote on Feb. 20, 2024, at #57424 to remove the pyarrow warning, with the idea that PDEP-15 would be created. We then removed the warning in version 2.2.1 .

So if we are to require pyarrow in pandas 3.0, we really haven't given the community the proper warning from versions 2.2.1 through 2.3.

One option we have is to release a 2.3.1 with the warning put back in so that people get the appropriate warning. We could also create a new feedback issue to see if the community still feels the requirement is onerous. After a month or two of collecting feedback, we can then have the vote on PDEP-15.

Personally, I'm really uncomfortable requiring pyarrow for 3.0 because we haven't provided the warning for over a year, so if we do make the requirement without having the warning present for a reasonable amount of time, we may get some pretty negative feedback once a 3.0 is released that requires pyarrow.

datapythonista · 2025-06-07T15:47:51Z

That vote took 6 days to be decided based on the issue log. It was won by just one vote, with no previous time for discussion, and with an ad-hoc voting not in the governance, and not discussed in advance. Everybody seems ok with it (including myself).

I can't understand the double standard compared to adding 15 extra days of delay here for something that people have been discussing and thinking about for more than two years, and that just needs a final decision so things can be unblocked and devs and users can act accordingly.

Good point about the warning. Personally not too bothered about it. Most users won't even realize if pandas installed some extra dependencies or not. And users know in major versions things change (particularly from 2 to 3, if we think of Python or Gnome). In any case, PDEP-15 is about rejecting the previous PDEP-10. Whether it's accepted or not, another discussion will be required to see how to move forward.

rhshadrach · 2025-06-08T12:38:00Z

Going through the feedback issue, the concerns voiced by users that still remain are:

Package size
Users on systems that need to compile their own wheels

I too am concerned about making PyArrow a hard dependency. Some of that is based on not understanding how many users we will leave without an upgrade path, and some of it is based on what maintenance PyArrow itself will receive (I've heard the situation there is not great, but don't have a good understanding).

One option we have is to release a 2.3.1 with the warning put back in so that people get the appropriate warning.

Assuming we go forward with this, I'm on board with adding back the warning if we also add a config to disable it. Many complained about the inability to silence the warning globally. Being able to silence it from the system environment is also a need-to-have in my opinion.
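To make the ask concrete, a rough sketch of an import-time check that can be switched off from the system environment. The environment variable name and the function are hypothetical, and this is not how the pandas 2.2 warning was implemented.

```python
import importlib.util
import os
import warnings

# Hypothetical variable name, used only for this sketch.
SILENCE_ENV_VAR = "PANDAS_SILENCE_PYARROW_WARNING"


def warn_if_pyarrow_missing() -> bool:
    """Emit a FutureWarning if pyarrow is absent and the warning is not silenced.

    Returns True if a warning was emitted.
    """
    # Opt-out from the system environment, checked before anything else.
    if os.environ.get(SILENCE_ENV_VAR, "") == "1":
        return False
    # find_spec locates the package without importing it.
    if importlib.util.find_spec("pyarrow") is not None:
        return False
    warnings.warn(
        "pyarrow will become a required dependency in a future version.",
        FutureWarning,
        stacklevel=2,
    )
    return True
```

The standard library already allows `PYTHONWARNINGS=ignore::FutureWarning`, but that silences every FutureWarning rather than this one alone, which is why a dedicated switch is worth having.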

simonjayhawkins · 2025-06-08T16:07:57Z

Thanks @rhshadrach. In my mind this is not a binary issue. There are three options:

1. reject PDEP-10 and effectively remove the future intentions of improved Arrow interoperability and improvements to default inference of some new datatypes from the pandas roadmap. That's achieved by accepting this PDEP.
2. do not reject PDEP-10 and make pyArrow a required dependency in pandas 3.0. That's achieved by rejecting this PDEP.
3. postpone making pyArrow a required dependency until the next major release and continue development assuming that pyArrow will be a required dependency in the future. That's achieved by rejecting this PDEP and approving a revision to PDEP-10 instead.

Surely welcome suggestions from others for alternative options.

Now my rejection of this PDEP in the "invalid" vote is based on not wanting to remove some previously agreed roadmap items.

in #61596 (comment) I said I may vote differently if the motivation for rejecting PDEP-10 was to keep PyArrow an optional dependency indefinitely. I am not averse to that but would want to clarify the pandas roadmap for default inference of some new dtypes before agreeing to that. I would probably want to see some object fallback versions of these dtypes for default inference when PyArrow is not installed.
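As a rough picture of what "object fallback" inference could look like, here is a minimal sketch. The table, the labels, and the function names are illustrative assumptions and do not correspond to pandas dtype names or to any agreed design.

```python
import importlib.util
from typing import Optional

# Logical kind -> (backend when pyarrow is importable, fallback otherwise).
# The labels are illustrative only.
BACKENDS = {
    "string": ("pyarrow", "object"),
    "decimal": ("pyarrow", "object"),
    "list": ("pyarrow", "object"),
}


def has_pyarrow() -> bool:
    """Check for pyarrow without importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def infer_backend(kind: str, pyarrow_available: Optional[bool] = None) -> str:
    """Pick the backend for a logical kind, falling back when pyarrow is absent."""
    if pyarrow_available is None:
        pyarrow_available = has_pyarrow()
    preferred, fallback = BACKENDS[kind]
    return preferred if pyarrow_available else fallback
```

The point of the sketch is that the inferred logical type stays the same either way; only the storage behind it changes with the environment.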

datapythonista · 2025-06-08T16:23:03Z

I agree with you @simonjayhawkins, except for the third point. I think this PDEP leaves the door open to adding PyArrow in the future. Rejecting this in my opinion means PDEP-10 is what we want.

Otherwise we'll wait one month for the outcome here, and then more than another extra month to vote on a change to PDEP-10, and we may keep the uncertainty until 2026. I think this uncertainty is harming both pandas development and users.

Regarding the warning, can those who are in favor of adding back the warning clarify what the goal is? I think the previous warning had the goal of gathering feedback from users that don't follow pandas development. We got it, and it was useful. I don't think the new feedback is particularly useful, in that in my opinion we know the implications. Or would you just have the warning as a message so users know what's coming? I think we may be the first project I'm aware of that does this. Python broke everything in version 3 and I don't remember any warning in the interpreter. There are better ways to communicate with users.

Dr-Irv · 2025-06-09T14:14:52Z

Or would you just have the warning as a message so users know what's coming?

Yes, if we decide to require pyarrow, we need to put the warning back. Not necessarily to get feedback, but just so people know this will be happening.

I think we may be the first project I'm aware of that does this.

Not sure what "this" is, but we have always created warnings when we do things like this. What's odd in this case is that we had a warning, and removed it, but now there is no visibility to the community if we decide to reject PDEP-15 and require pyarrow.

There are three options:

1. reject PDEP-10 and effectively remove the future intentions of improved Arrow interoperability and improvements to default inference of some new datatypes from the pandas roadmap. That's achieved by accepting this PDEP.

2. do not reject PDEP-10 and make pyArrow a required dependency in pandas 3.0. That's achieved by rejecting this PDEP.

3. postpone making pyArrow a required dependency until the next major release and continue development assuming that pyArrow will be a required dependency in the future. That's achieved by rejecting this PDEP and approving a revision to PDEP-10 instead.

With respect to (2), if that is the decision, then I think we also need to decide whether we put the warning back in prior to 3.0. So maybe (2) gets split into 2 options - one with the warning and one without.

I like the options, and maybe we need to just vote on those 3 (or 4) options as opposed to an approve/reject on PDEP-15. Not sure what threshold to use when we have multiple options.

datapythonista · 2025-06-09T15:10:22Z

Not sure what "this" is

Sorry I wasn't clear. By "doing this" I meant showing a warning to users of one version because we will add an extra dependency to the next version. pandas didn't even add an extra dependency, so there wasn't the opportunity. And I don't know of any other Python library (or any software at all) that has shown a warning. No idea how common it is to add dependencies that significantly increase the installation size, or that don't work in all previously supported architectures (is this still the case?). So, while we can do whatever we want, I disagree that we must show a warning because that's a standard practice; it's not. And I personally don't find it useful: most users won't care about the new dependency, and if you care because you can't upgrade pandas, in my opinion it's better to know when you try to upgrade than when you are using a previous version and it's not a problem.

WillAyd · 2025-06-09T15:24:20Z

Yes, if we decide to require pyarrow, we need to put the warning back.

What I struggle with in this conversation is that we did decide to require pyarrow through the initial vote, but we have implicitly backtracked on that in implementation

I am guessing we all must have a different interpretation over the scope of PDEP-10. I think part of the team interpreted it as "pyarrow is required now, let's figure out the implementation details and any problems as we go," whereas the other part of the team may be coming from the mindset of "PDEP-10 did not clarify the scope of these implementation details or address challenge X, so we should revisit it."

There are probably other interpretations therein, but I think we collectively need to clarify how far-reaching the scope is of a PDEP like this; otherwise I think the value of our PDEP process is questionable.

I am particularly frustrated with PDEP-10 and some of the text in PDEP-15 because the goals we used to subvert PDEP-10 have never been goals of the project since I have been involved. Taking package size as an example, adding pyarrow of course increases the size versus the current state, but it would still be smaller than what pandas distributed for much of its history, when it alone was 100MB+.

Of course it's always nicer to have a smaller package, but I don't recall it being a huge issue when it was reduced however many years ago. We use Cython currently without much concern even though it creates bloated packages, so it's hard to understand what we are aiming for. These things feel like moving targets, and if the idea is a PDEP needs to address them in detail before being valid, then I worry that the PDEP process won't be suitable to clarify larger enhancements to the project, as that is an impossible ask

simonjayhawkins · 2025-06-09T15:46:33Z

Not sure what "this" is

Sorry I wasn't clear. By "doing this" I meant showing a warning to users of one version because we will add an extra dependency to the next version. pandas didn't even add an extra dependency, so there wasn't the opportunity. And I don't know of any other Python library (or any software at all) that has shown a warning. No idea how common it is to add dependencies that significantly increase the installation size, or that don't work in all previously supported architectures (is this still the case?). So, while we can do whatever we want, I disagree that we must show a warning because that's a standard practice; it's not. And I personally don't find it useful: most users won't care about the new dependency, and if you care because you can't upgrade pandas, in my opinion it's better to know when you try to upgrade than when you are using a previous version and it's not a problem.

While I fully agree that breaking changes in program behavior merit explicit warnings, the shift of a dependency from optional to required isn’t the same kind of risk. This change is fundamentally an infrastructure and packaging decision, not one that alters how pandas behaves once installed.

Warnings Should Address Behavioral Breaks, Not Packaging Concerns

The primary purpose of warnings (such as a FutureWarning) is to alert users to changes that may directly impact code execution. When a dependency switches from optional to required, the runtime behavior of pandas remains unchanged for users who have configured their environment correctly. Users who encounter installation issues or packaging challenges (for example, in constrained environments) are best served by clear installation documentation and targeted platform-specific guidance, not by a runtime warning that can easily be misinterpreted as a change in behavior.

Avoiding Noise and Warning Fatigue

Introducing a warning for an infrastructure change risks adding unnecessary noise. Our users already need to monitor warnings that signal genuine changes in functionality. Adding another warning for packaging difficulties may dilute the impact of more critical alerts. This nonessential message could lead to confusion about the source of runtime issues, as the warning wouldn’t pertain to any new logic or data handling behavior.

Infrastructure Difficulties Are Best Handled Outside Code

Installation and environment-specific limitations (e.g., issues in AWS Lambda build sizes) fall squarely under the domain of dependency management and release engineering. These challenges have established solutions (such as providing proper wheels, Docker images, or Lambda layers) and should be documented in installation guides. Expecting end users to troubleshoot these infrastructure details via runtime warnings diverts attention away from the core library functionality.

Ecosystem: Let the Packaging Tools Do Their Job

Modern package management tools (pip, conda, etc.) already signal missing dependencies through installation failures or clear error messages. The responsibility for addressing package size or compatibility concerns lies with the packaging and deployment ecosystem. Enhancing our installation documentation and offering platform-specific installation instructions is the more appropriate path for accommodating infrastructure challenges.

In summary, while it’s crucial to warn users about genuine breaking changes, issuing a warning for a dependency’s transition from optional to required misplaces the focus. Instead, we should concentrate on improving our installation guidance and rely on established packaging practices—keeping the runtime warnings for actual behavioral changes that directly affect the user’s code.

Dr-Irv · 2025-06-09T16:19:45Z

In summary, while it’s crucial to warn users about genuine breaking changes, issuing a warning for a dependency’s transition from optional to required misplaces the focus. Instead, we should concentrate on improving our installation guidance and rely on established packaging practices—keeping the runtime warnings for actual behavioral changes that directly affect the user’s code.

I see your point. But we did put in PDEP-10 that we would include a warning, and we did that in pandas 2.2 and removed it in 2.2.1. We created the warning because there was concern that requiring pyarrow would cause issues for part of the community. Those issues were identified via feedback. My concern is that some of the issues identified at that time may still exist. While the warning that we created doesn't fit into your categories, it did serve a useful purpose, and the question is whether we still need it to make sure that requiring pyarrow for 3.0 is still the right thing to do.

Now if we can address all of those issues brought up a year ago by better documentation of how to deal with the pyarrow requirement, then let's make sure that documentation gets written.

simonjayhawkins · 2025-06-09T16:36:37Z

and if the idea is a PDEP needs to address them in detail before being valid, then I worry that the PDEP process won't be suitable to clarify larger enhancements to the project, as that is an impossible ask

our project roadmap states

The roadmap is defined as a set of major enhancement proposals named PDEPs.

We use the roadmap to solicit project funding and so that the community has an idea of the future direction of pandas.

So I think that we need to be able to approve PDEPs which are presented as no more than a concept, without significant discussion on implementation details, naming or agreed timelines.

simonjayhawkins · 2025-06-10T11:47:28Z

Now if we can address all of those issues brought up a year ago by better documentation of how to deal with the pyarrow requirement, then let's make sure that documentation gets written.

IIUC from prior discussion the issue is only expected to affect a small proportion of the pandas userbase? And presumably mainly commercial users that in theory have the pockets to fund any work needed? So do you see the need for documentation as a blocker? Affected users can stay on pandas 2.x until the documentation is written?

simonjayhawkins · 2025-06-10T12:31:44Z

3. postpone making pyArrow a required dependency until the next major release and continue development assuming that pyArrow will be a required dependency in the future. That's achieved by rejecting this PDEP and approving a revision to PDEP-10 instead.

After further discussions here and elsewhere, I accept that providing object fallback versions, so that PyArrow can remain an optional dependency for now while still using PyArrow-backed arrays when PyArrow is installed, is a maintenance and development burden that the project does not currently have the resources for. PDEP-14 has shown that the work involved is not trivial, and our release cadence has suffered as a result.

With this in mind, I will not change my vote, but I accept that rejecting this PDEP means accepting PyArrow as a required dependency in the 3.0 release.

So I am happy to withdraw option 3 unless others favor that approach.

simonjayhawkins · 2025-06-11T10:11:41Z

@rhshadrach

AFAICT this PDEP has had no announcements on the pandas-dev list.

In your comment #61596 (comment) you seemed to miss out the first part of the paragraph from the PDEP. PDEP-1 states

To enable and encourage discussions on PDEPs, we follow a notification schedule. At each of the following steps, the pandas team, and the pandas-dev mailing list are notified via GitHub and E-mail:

Once a PDEP is ready for discussion.
...

So the official discussion period was never started? Closing the vote for a procedural irregularity is IMO unjustified.

PDEP-10: Change status to rejected

98eb85a

lithomas1 added the PDEP pandas enhancement proposal label May 7, 2024

WillAyd reviewed May 7, 2024

lithomas1 marked this pull request as ready for review May 8, 2024 15:09

lithomas1 requested a review from datapythonista as a code owner May 8, 2024 15:09

simonjayhawkins reviewed May 8, 2024

web/pandas/pdeps/0010-required-pyarrow-dependency.md (outdated)

lithomas1 changed the title ~~PDEP-10: Change status to rejected~~ PDEP-15: Change status to rejected May 19, 2024

Split out into new pdep

5e451db

WillAyd reviewed May 20, 2024

Dr-Irv changed the title ~~PDEP-15: Change status to rejected~~ PDEP-15: Change status of PDEP-10 to rejected May 20, 2024

Dr-Irv requested changes May 20, 2024

WillAyd reviewed May 20, 2024

lithomas1 force-pushed the reject-pdep10 branch from a5905d4 to 46a0cea Compare May 21, 2024 05:28

remove pdep-10 changes

2af5632

lithomas1 force-pushed the reject-pdep10 branch from 46a0cea to 2af5632 Compare May 21, 2024 05:29

lithomas1 and others added 2 commits May 20, 2024 22:30

Apply suggestions from code review

6e4efe5

Co-authored-by: Irv Lustig <[email protected]>

Apply suggestions from code review

45754bf

Co-authored-by: Irv Lustig <[email protected]>

Dr-Irv reviewed May 21, 2024

lithomas1 requested review from WillAyd and MarcoGorelli September 2, 2024 18:47

WillAyd reviewed Sep 3, 2024

asishm mentioned this pull request Oct 14, 2024

FEEDBACK: PyArrow as a required dependency and PyArrow backed strings #54466

Open

github-actions bot added the Stale label Nov 9, 2024

datapythonista removed the Stale label Jun 6, 2025

MarcoGorelli reviewed Jun 6, 2025

datapythonista mentioned this pull request Jun 7, 2025

VOTE: Voting issue for PDEP-15: Reject adding PyArrow as a required dependency #61596

Closed
