CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy() #61925

jorisvandenbossche · 2025-07-22T22:51:04Z

Addresses one of the remaining TODO items from #48998

Similar to #51082 and some follow-up PRs, this ensures we also mark EAs as read-only, like we do for numpy arrays, when the user gets the underlying EA from a pandas object.
For that purpose, I added a _readonly attribute to the EA class that is False by default.

Still need to add more tests and fix a bunch of tests
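
As an illustration of the idea only, here is a minimal sketch of a class-level `_readonly` flag that blocks in-place writes once it is set. `ToyArray` and `get_readonly_array` are hypothetical names invented for this example; they are not the pandas `ExtensionArray` API, and the error message is an assumption.

```python
class ToyArray:
    """Hypothetical stand-in for an ExtensionArray; not the pandas class."""

    # Class-level default, mirroring the `_readonly = False` line in the diff.
    _readonly = False

    def __init__(self, values):
        self._data = list(values)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        # Assumed behaviour: refuse in-place writes once the flag is set.
        if self._readonly:
            raise ValueError("Cannot modify read-only array")
        self._data[key] = value


def get_readonly_array(arr):
    """Hypothetical accessor: hand out a flagged object sharing the same data."""
    out = ToyArray.__new__(ToyArray)
    out._data = arr._data  # shared buffer, no copy
    out._readonly = True   # instance attribute shadows the class default
    return out


owner = ToyArray([1, 2, 3])
exposed = get_readonly_array(owner)

owner[0] = 10  # the owning object stays writeable
try:
    exposed[0] = 99  # the exposed object refuses the write
except ValueError as exc:
    error_message = str(exc)
```

Because the flag lives on the exposed object rather than on the shared data, the owner can keep mutating its own buffer while user code holding the exposed object cannot.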

simonjayhawkins · 2025-07-23T10:16:18Z

pandas/core/arrays/base.py

@@ -269,6 +270,8 @@ class ExtensionArray:
    #  strictly less than 2000 to be below Index.__pandas_priority__.
    __pandas_priority__ = 1000

+    _readonly = False


why not use arr.flags.writeable to be consistent with numpy?

Because this was easier for a quick POC ;)
It would indeed keep usage more consistent, so that might be a reason to add a flags attribute: code that needs to work with both ndarray and EA could then use one code path. But I don't think we would ever add any of the other flags that numpy has, so I'm not sure it would be worth adding a nested attribute just for this.
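
To make the trade-off concrete, this is a sketch of what calling code has to do today when the two kinds of array expose writeability differently. `is_writeable` and `FakeEA` are hypothetical names for this example; only `ndarray.flags.writeable` is real numpy API, and the `_readonly` attribute is assumed to behave as described in this PR.

```python
import numpy as np


def is_writeable(arr):
    """Hypothetical helper: one entry point for ndarray and EA-like objects."""
    if isinstance(arr, np.ndarray):
        return bool(arr.flags.writeable)
    # Assumption: EA-like objects carry the `_readonly` attribute from this PR.
    return not getattr(arr, "_readonly", False)


class FakeEA:
    """Hypothetical EA-like object that only carries the flag."""

    _readonly = True


nd = np.arange(3)
nd_ro = nd.view()
nd_ro.flags.writeable = False

checks = (is_writeable(nd), is_writeable(nd_ro), is_writeable(FakeEA()))
```

With a `flags.writeable` attribute on EAs, the `isinstance` branch in such a helper would not be needed.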

pandas/_libs/ops.pyx

jbrockmendel · 2025-07-24T16:12:16Z

pandas/core/arrays/base.py

+        elif self._readonly and astype_is_view(self.dtype, result.dtype):
+            # If the ExtensionArray is readonly, make the numpy array readonly too
+            result = result.view()
+            result.flags.writeable = False


should this be done below the setting of na_value on L616?

I don't think so, because when na_value is set the result array is already a copy, so there is no need to take a read-only view.
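
The view-then-flag pattern in the diff can be checked with plain numpy. This is a small sketch with made-up data, showing that the flag is set on a new view and leaves the original array writeable, while a copy needs no protection.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])

# Take a new view first so the flag change does not touch `data` itself.
result = data.view()
result.flags.writeable = False

assert np.shares_memory(data, result)  # still zero-copy
assert data.flags.writeable            # the original is unaffected

try:
    result[0] = 0.0
except ValueError:
    write_blocked = True
else:
    write_blocked = False

# A copy owns its own buffer, so writing to it cannot affect `data`.
copied = data.copy()
copied[0] = -1.0
```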

jbrockmendel · 2025-07-24T16:15:57Z

pandas/tests/arrays/test_datetimelike.py

-        pd.date_range("2000", periods=4).array,
-        pd.timedelta_range("2000", periods=4).array,
+        pd.date_range("2000", periods=4).array.copy(),
+        pd.timedelta_range("2000", periods=4).array.copy(),


Yeah, it seems that my test updates are a mix of .array/values.copy() and _values. I will use _values more consistently.

jbrockmendel · 2025-07-24T18:23:49Z

pandas/core/arrays/sparse/array.py

@@ -969,6 +975,8 @@ def __getitem__(
            # _NestedSequence[Union[bool, int]]], ...]]"
            data_slice = self.to_dense()[key]  # type: ignore[index]
        elif isinstance(key, slice):
+            if key == slice(None):
+                return type(self)._simple_new(self.sp_values, self.sp_index, self.dtype)


why is this special case needed?

To avoid arr[:] making a copy. I got there because the default EA.view() implementation uses that.

But I can add a comment to clarify. There is a comment just below about "# Avoid densifying when handling contiguous slices", but the current implementation does not actually avoid a copy, because it translates the slice into integer indices. For the special case of a full slice, that should not even be needed.
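
A toy version of that fast path, to show the difference between the two branches. `ToySparse` is a hypothetical container written for this example; it only borrows the `sp_values` name and is not the pandas `SparseArray` implementation.

```python
class ToySparse:
    """Hypothetical container; not the pandas SparseArray."""

    def __init__(self, sp_values):
        self.sp_values = sp_values

    def __getitem__(self, key):
        if isinstance(key, slice):
            if key == slice(None):
                # Full slice: wrap the same buffer, no copy.
                return type(self)(self.sp_values)
            # Generic path: translating the slice into integer positions
            # builds a new buffer, i.e. a copy.
            indices = range(*key.indices(len(self.sp_values)))
            return type(self)([self.sp_values[i] for i in indices])
        return self.sp_values[key]


arr = ToySparse([1, 2, 3, 4])
full = arr[:]     # arr[:] passes slice(None, None, None), equal to slice(None)
part = arr[0:2]
```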

jbrockmendel · 2025-07-24T18:25:44Z

i get why .values and .array are made read-only, but why are we bothering with to_numpy?

jorisvandenbossche · 2025-07-24T22:13:37Z

That's a good question; I didn't really think about it deeply. For the non-extension dtypes, we also did it for .values / __array__ and to_numpy() (#51082), so I followed along here.

I do think there is value in being consistent across those different ways to get a numpy array from the pandas object. So one could also ask: why not for to_numpy()? And compared to .values, to_numpy() actually gives you more control, with the ability to ask for a copy.
(In practice, the implementations of __array__ and to_numpy() also overlap quite a bit for the EAs.)

jbrockmendel · 2025-07-25T14:35:14Z

> So could also ask, why not for to_numpy()?

I don't feel strongly about this, but asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes. Without that, most of this is just boilerplate edits to __getitem__ methods.

The main reason I can think of to treat to_numpy differently from .array and .values is that it has an explicit copy keyword. With copy=False, the user ideally understands that they are getting a view on existing data.

jorisvandenbossche · 2025-08-03T09:27:04Z

> asked in the first place because it seems most of the code complexity in this PR is driven by to_numpy changes.

Looking at the diff again, I think it is roughly 50/50 between to_numpy() and __array__. But to_numpy() also reuses the result from __array__ in some cases, so if we wanted to_numpy() to consistently not return readonly data, that would also require some changes in to_numpy(). So regarding the implementation, I'm not entirely sure this would be a lot simpler (but I didn't look in detail).

> The main reason I can think of to treat to_numpy differently from .array and .values is that it has an explicit copy keyword. With copy=False, the user ideally understands that they are getting a view on existing data.

Yeah, we could potentially also make the default of copy to be None instead of False, with the same meaning (i.e. avoid a copy if possible), and so then if someone explicitly passes copy=False, then we wouldn't set the readonly flag.

From previous discussions (maybe #52823), I seem to remember that at some point we brought up whether it would be worth having a keyword to control this behaviour, i.e. a way to ask for a numpy array that is guaranteed to be mutable. Of course you could do to_numpy(copy=True), which also guarantees that, but that doesn't cover the case where you want to get the data zero-copy if possible and you know that mutating it is fine (for example because the holding dataframe or series is discarded after converting).
At the moment, the documentation (https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html#read-only-numpy-arrays) suggests to manually reset the readonly flag:

arr = ser.to_numpy()
arr.flags.writeable = True

instead of adding a keyword like arr = ser.to_numpy(ensure_writable=True). But in theory copy=False could also cover that.

(but this is probably a discussion for #52823)
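
To spell out the keyword semantics floated above, here is a rough sketch using a plain numpy array. `to_numpy_sketch` is a hypothetical free function written for this example; it is not the pandas `to_numpy` method, and the `copy=None` default is only the idea under discussion, not existing behaviour.

```python
import numpy as np


def to_numpy_sketch(data, copy=None):
    """Hypothetical illustration of the proposed `copy` semantics."""
    if copy:
        # copy=True: a fresh, writeable array that owns its buffer.
        return data.copy()
    result = data.view()
    if copy is None:
        # Default: zero-copy, but protected against accidental mutation.
        result.flags.writeable = False
    # copy=False: the caller explicitly asked for a view, so leave it writeable.
    return result


source = np.arange(4)
default_view = to_numpy_sketch(source)
explicit_view = to_numpy_sketch(source, copy=False)
fresh_copy = to_numpy_sketch(source, copy=True)
```

Under this sketch, only an explicit copy=False hands out a writeable array that shares memory with the source, so that choice is always visible at the call site.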

CoW: add readonly flag to ExtensionArrays, return read-only EA/ndarray in .array/EA.to_numpy()

a9df51b

jorisvandenbossche added the Copy / view semantics label Jul 22, 2025

jorisvandenbossche mentioned this pull request Dec 11, 2023

Copy-on-Write (PDEP-7) follow-up overview issue #48998

Open

38 tasks

jorisvandenbossche added 5 commits July 23, 2025 01:16

cleanup

9cd6e4f

fixup attribute name in tests

c6f37d1

fix tests

8058d9a

more test fixes

91465ee

add tests for .array being readonly

856dc02

jorisvandenbossche mentioned this pull request Jul 23, 2025

TST[string]: update expecteds for using_string_dtype to fix xfails #61727

Merged

7 tasks

simonjayhawkins reviewed Jul 23, 2025

jorisvandenbossche requested a review from jbrockmendel July 23, 2025 22:04

jbrockmendel reviewed Jul 24, 2025

jbrockmendel reviewed Jul 24, 2025

Merge remote-tracking branch 'upstream/main' into cow-ea-readonly

828fadc

typing

ee1ed6e
