-
Notifications
You must be signed in to change notification settings - Fork 5
Refactor type handling in invoices #201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
e78e897 to
fd5a1ad
Compare
process_report/invoices/invoice.py
Outdated
| self.data[column_name] = value | ||
| self.data[column_name] = self.data[column_name].astype(column_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little puzzled about what's happening here: we're setting self.data[column_name], and then immediately setting it to another value. I guess this is to have access to the astype method, but is there way of accomplishing this that is mroe intuitive?
Looking through the docs a bit, I think we could write this as:
self.data[column_name] = pandas.Series(value, dtype=column_type)
My simple test:
>>> BALANCE_FIELD_TYPE = pandas.ArrowDtype(pyarrow.decimal128(21, 2))
>>> df = pandas.DataFrame()
>>> df['foo'] = pandas.Series([1,2,3], dtype=BALANCE_FIELD_TYPE)
>>> df
foo
0 1.00
1 2.00
2 3.00
| ### Invoice column types | ||
| BOOL_FIELD_TYPE = pandas.BooleanDtype() | ||
| STRING_FIELD_TYPE = pandas.StringDtype() | ||
| BALANCE_FIELD_TYPE = pandas.ArrowDtype(pyarrow.decimal128(21, 2)) | ||
| ### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be importing these from process_report.invoices.invoice rather than re-defining them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was that the tests should check if the invoicing code can handle input dataframes with specific datatypes. If I imported these types from process_report.invoices.invoice, then I wouldn't be testing the type compatibility. I imagined this would be useful in a case where if a future change accidentally or intentionally changed the typing of STRING_FIELD_TYPE or BALANCE_FIELD_TYPE, I would like my tests to show me there's typing incompatibility.
Previously, there was no consistent method to create new invoice columns with an explicit type. This has been the source of some confusion when debugging and writing test cases. As such, several changes have been added to reduce confusion over the typing of invoice columns: - A function (`_create_column`) has been added to `invoice.py` to provide a consistent way of creating new columns with an explicit type. - Many fields will now use Pandas' built-in extension types [1], such as `BooleanDtype`, which allows them to be nullable. This also enforces typing, as assigning incompatible values to these columns will raise a type error - Decimal precision has been increased to 21. This is to allow `int64` values to be converted to `pyarrow.decimal128(21, 2)`. This is useful for writing test cases (no longer have to write `[1.0, 2.0]`), and in case our invoices have balance fields that resemble integers - A new test case base class is made with a function to create a dataframe with standardized column types. This allows creating invoice dataframes that more closely resemble what our pipeline will see during execution - Removed redundant code in `_get_test_invoice()` of many test classes - `PIInvoice` will cast the dataframe to `StringDtype` to prevent TypeError with the new Pandas extension types [1] https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
Closes #191. More details in the commit message.
Currently, the test cases are failing because the PI-specific invoice injects an incompatible value in the invoice dataframe. After some thinking, I thought the PI invoice could do some refactoring in that spot as well. I'll put this as a draft for now while I work on refactor the PI invoice.