CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925

Draft: wants to merge 8 commits into base: main

jorisvandenbossche
Member

Addresses one of the remaining TODO items from #48998

Similar to #51082 and some follow-up PRs, this ensures that we also mark EAs as read-only, like we do for numpy arrays, when the user gets the underlying EA from a pandas object.
For that purpose, I added a _readonly attribute to the EA class that is False by default.
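
As a rough illustration of the idea (not the code in this PR), a class-level flag could gate in-place writes as sketched below. `ToyArray` and `get_array` are hypothetical names, and raising in `__setitem__` is an assumption about where the flag would be enforced:

```python
class ToyArray:
    """Hypothetical stand-in for an ExtensionArray; not pandas code."""

    # Class-level default, mirroring the `_readonly = False` line in the diff.
    _readonly = False

    def __init__(self, values):
        self._data = list(values)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        # Assumed enforcement point: refuse in-place writes when flagged.
        if self._readonly:
            raise ValueError("Cannot modify read-only array")
        self._data[key] = value

    def view(self):
        # Share the underlying data; the new object starts out writeable.
        new = type(self).__new__(type(self))
        new._data = self._data
        return new


def get_array(owner):
    """Hypothetical accessor: hand out a read-only view of the owner's array."""
    out = owner.view()
    out._readonly = True
    return out


internal = ToyArray([1, 2, 3])
exposed = get_array(internal)
internal[0] = 10  # the owning object can still write through its own array
```

Setting the flag on a view rather than on the original keeps the internal array writeable, which is the same pattern the numpy path uses.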

Still need to add more tests and fix a bunch of tests

@@ -269,6 +270,8 @@ class ExtensionArray:
    # strictly less than 2000 to be below Index.__pandas_priority__.
    __pandas_priority__ = 1000

    _readonly = False
Member

why not use arr.flags.writeable to be consistent with numpy?

Member Author

Because this was easier for a quick POC ;)
It would indeed keep usage more consistent, so that might be a reason to add a flags attribute: code that needs to work with both ndarray and EA could then use one code path. But I don't think we would ever add any of the other flags that numpy has, so I'm not sure it would be worth adding a nested attribute just for this.
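
To make the "one code path" point concrete, here is a minimal sketch of what calling code might need while the two spellings differ. `is_writeable` and `FakeEA` are hypothetical names; only `ndarray.flags.writeable` is an existing numpy API, and the `_readonly` fallback assumes the attribute proposed in this PR:

```python
import numpy as np


def is_writeable(arr):
    """Hypothetical helper: True unless the array is marked read-only."""
    flags = getattr(arr, "flags", None)
    if flags is not None:
        # numpy.ndarray path: arr.flags.writeable is an existing numpy API.
        return bool(flags.writeable)
    # Assumed EA path: fall back to the `_readonly` attribute from this PR.
    return not getattr(arr, "_readonly", False)


class FakeEA:
    """Toy object standing in for an ExtensionArray (not pandas code)."""

    _readonly = False


nd = np.arange(3)
nd.flags.writeable = False

ea = FakeEA()
ea._readonly = True
```

If EAs exposed `flags.writeable` as well, the fallback branch would not be needed.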

elif self._readonly and astype_is_view(self.dtype, result.dtype):
    # If the ExtensionArray is readonly, make the numpy array readonly too
    result = result.view()
    result.flags.writeable = False
Member

should this be done below the setting of na_value on L616?

Member Author

I don't think so, because in that case the result array is already a copy, so there is no need to take a read-only view.
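
The distinction can be shown with plain numpy (a minimal sketch; the variable names are illustrative and this is not the EA code path itself):

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])

# Zero-copy case: take a view first so that only the returned object is
# flagged read-only; the original array keeps its own writeable flag.
result = data.view()
result.flags.writeable = False

# Copy case: a fresh copy owns its memory, so mutating it cannot affect `data`.
copied = data.copy()
copied[0] = -1.0

data[1] = 20.0  # still allowed; `result` sees the change because it is a view
```

Only the view shares memory with the original, so only the view needs protecting.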

- pd.date_range("2000", periods=4).array,
- pd.timedelta_range("2000", periods=4).array,
+ pd.date_range("2000", periods=4).array.copy(),
+ pd.timedelta_range("2000", periods=4).array.copy(),
Member

._values?

Member Author

Yeah, it seems that my test updates are a bit of a mix of .array/values.copy() and _values. I will use _values more consistently.

@@ -969,6 +975,8 @@ def __getitem__(
        # _NestedSequence[Union[bool, int]]], ...]]"
        data_slice = self.to_dense()[key]  # type: ignore[index]
    elif isinstance(key, slice):
        if key == slice(None):
            return type(self)._simple_new(self.sp_values, self.sp_index, self.dtype)
Member

why is this special case needed?

Member Author

@jorisvandenbossche jorisvandenbossche Jul 24, 2025

To avoid arr[:] making a copy. I got there because the default EA.view() implementation uses that.

But I can add a comment to clarify. There is a comment just below about "# Avoid densifying when handling contiguous slices", but the current implementation does not actually avoid making a copy, because it translates the slice into integer indices. For the special case of a full slice, that should not even be needed.
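
A plain numpy analogy of the difference (a sketch only; this is not the SparseArray implementation): a basic slice returns a view, while indexing with an equivalent array of integer positions returns a copy.

```python
import numpy as np

values = np.arange(5)

full = values[:]                           # basic slice: a view on `values`
by_index = values[np.arange(len(values))]  # integer indices: a new copy
```

Here `np.shares_memory(values, full)` is true and `np.shares_memory(values, by_index)` is false, which is why translating a full slice into integer indices defeats a view-based `view()`.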

@jbrockmendel
Member

I get why .values and .array are made read-only, but why are we bothering with to_numpy?

@jorisvandenbossche
Member Author

That's a good question, I didn't really think about it deeply. But for the non-extension dtypes, we also did it for .values / __array__ and to_numpy() (#51082), so I followed along here.

I do think there is value in being consistent across those different ways to get a numpy array from the pandas object. So one could also ask, why not for to_numpy()? And compared to .values, to_numpy() actually gives you more control, with the ability to ask for a copy.
(In practice, the implementations of __array__ and to_numpy() also overlap quite a bit for the EAs.)

@jbrockmendel
Member

> So could also ask, why not for to_numpy()?

I don't feel strongly about this, but asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes. Without that, most of this is just boilerplate edits to __getitem__ methods.

The main reason I can think of to treat to_numpy differently from .array and .values is that it has an explicit copy keyword. With copy=False, the user ideally understands that they are getting a view on existing data.

@jorisvandenbossche
Member Author

> asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes.

Looking at the diff again, I think it is roughly 50/50 between to_numpy() and __array__. But to_numpy() also reuses the result from __array__ in some cases, so if we wanted to_numpy() to consistently not return read-only data, that would also require some changes in to_numpy(). So regarding the implementation, I'm not entirely sure this would be a lot simpler (but I didn't look in detail).

> The main reason i can think of to treat to_numpy different from .array and .values is that it has an explicit copy keyword. With copy=False, the user ideally understands that they are getting a view on existing data.

Yeah, we could potentially also change the default of copy from False to None, with the same meaning (i.e. avoid a copy if possible), so that if someone explicitly passes copy=False, we wouldn't set the readonly flag.
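
A minimal sketch of that idea on a plain ndarray, to make the three cases explicit. `to_numpy_sketch` is a hypothetical function, not a pandas API, and the real `to_numpy()` cannot always avoid a copy the way this toy version does:

```python
import numpy as np


def to_numpy_sketch(data, copy=None):
    """Hypothetical illustration of the proposed `copy` semantics."""
    if copy is True:
        return data.copy()  # fresh, writeable array
    result = data.view()    # zero-copy in this toy example
    if copy is None:
        # Default: no explicit opt-in to sharing, so protect the owner.
        result.flags.writeable = False
    # copy=False: the caller explicitly asked for a view; leave it writeable.
    return result


owner = np.arange(4)
default = to_numpy_sketch(owner)
explicit_view = to_numpy_sketch(owner, copy=False)
explicit_copy = to_numpy_sketch(owner, copy=True)
```

Under these assumed semantics, only the default call returns a read-only array. `copy=False` shares memory and stays writeable, and `copy=True` is writeable because it owns its data.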

From previous discussions (maybe #52823), I seem to remember that at some point we brought up whether it would be worth having a keyword to control this behaviour, i.e. a way to ask for a numpy array that is guaranteed to be mutable. Of course you could do to_numpy(copy=True), which also guarantees that, but that doesn't cover the case where you want to get the data zero-copy if possible and you know that mutating it is fine (for example because the holding dataframe or series is discarded after converting).
At the moment, the documentation (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays) suggests manually resetting the readonly flag:

arr = ser.to_numpy()
arr.flags.writeable = True

instead of adding a keyword like arr = ser.to_numpy(ensure_writable=True). But in theory copy=False could also cover that.

(but this is probably a discussion for #52823)
