Description
Why do we need this change?
pandas performs poorly, and the API would be easier to use with fewer cascaded calls.
How can we modify it?
- `roll` + `to_numpy`:
This is a classic combination that has to be called almost every time. Users probably don't need to know what `roll` is, so we could keep only `to_numpy` (or `numpy`). These changes should not affect the use of `to_torch_data_loader`.
# API change
from bigdl.chronos.data import TSDataset
tsdata = TSDataset.from_pandas(..., lookback=48, horizon=1, with_split=False)
x, y = tsdata.to_numpy() # like to_torch_data_loader
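To make the proposal concrete, here is a minimal sketch of the windowing that a combined roll + to_numpy call would perform. The helper name `roll_to_numpy` is hypothetical and is not part of the Chronos API; it only assumes that the data is a (time, feature) array and that every feature is also a target.

```python
import numpy as np


def roll_to_numpy(values, lookback, horizon):
    """Hypothetical helper: slice a (time, feature) array into (x, y) windows."""
    values = np.asarray(values)
    if values.ndim == 1:
        values = values[:, None]  # treat a 1-D series as a single feature
    window = lookback + horizon
    n_samples = values.shape[0] - window + 1
    if n_samples <= 0:
        raise ValueError("series is shorter than lookback + horizon")
    # Index matrix: row i selects time steps i .. i + window - 1
    idx = np.arange(window)[None, :] + np.arange(n_samples)[:, None]
    rolled = values[idx]  # shape: (n_samples, window, n_features)
    # First `lookback` steps are the input, the remaining `horizon` steps the target
    return rolled[:, :lookback], rolled[:, lookback:]


series = np.arange(10, dtype=float)
x, y = roll_to_numpy(series, lookback=4, horizon=1)
print(x.shape, y.shape)
```

With this shape of API the user only supplies `lookback` and `horizon` once and never sees the intermediate rolled state.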
- Optimize some existing APIs:
Many of the cascaded calls are probably unnecessary, and some of them could become properties instead. The table below groups the existing methods by the framework they rely on, followed by a usage example.
Category | pandas | tsfresh | scikit-learn | other |
---|---|---|---|---|
Method | deduplicate/impute/resample | gen_dt_feature/gen_global_feature/gen_rolling_feature | scale/unscale/unscale_numpy | to_tf_dataset/to_numpy/to_torch_data_loader/to_pandas |
Advice | Change to attributes | No change | Calling scale changes the source data. Can we leave the original data unchanged, so that unscale and unscale_numpy are no longer needed? | Merge roll into these methods (except to_pandas/to_torch_data_loader) |
# Change pandas-related methods to attributes.
tsdata = TSDataset.from_pandas(..., impute=True, impute_mode="const",
                               const_num=0, deduplicate=True,
                               resample=True, interval='s', start_time=None,
                               end_time=None, merge_mode='mean', with_split=False)
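As a rough illustration of what those keyword arguments could trigger internally, the sketch below applies the three pandas-based steps in one function. The `preprocess` helper, its parameter defaults, and the order of the steps (deduplicate, then resample, then impute) are assumptions made for this example, not the actual Chronos implementation; only constant imputation is shown.

```python
import pandas as pd


def preprocess(df, dt_col, deduplicate=False, resample=False, interval="s",
               merge_mode="mean", impute=False, const_num=0):
    """Hypothetical helper mirroring the proposed from_pandas keyword arguments."""
    out = df.copy()  # leave the caller's DataFrame untouched
    if deduplicate:
        # Keep the last row seen for each timestamp
        out = out.drop_duplicates(subset=dt_col, keep="last")
    if resample:
        # Bucket rows by `interval` and merge each bucket with `merge_mode`
        out = out.set_index(dt_col).resample(interval).agg(merge_mode).reset_index()
    if impute:
        out = out.fillna(const_num)  # "const" imputation only
    return out


raw = pd.DataFrame({
    "dt": pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 00:00:00",
                          "2023-01-01 00:00:02"]),
    "value": [1.0, 3.0, 5.0],
})
clean = preprocess(raw, "dt", deduplicate=True, resample=True, impute=True)
print(clean)
```

Folding the steps into keyword arguments removes three chained calls while keeping each step optional.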
- We can use `Descriptor` and `Property` to manage properties and methods. For more information, please refer to Chronos: Make `TSDataset` more friendly #5656.
@property
def get_cycle_length(self):
    cycle_length = (...)
    return cycle_length

@get_cycle_length.setter
def get_cycle_length(self, value):
    # Check for illegal input
    if not isinstance(value, str):
        raise TypeError("the cycle_length mode must be a str")
    # Store the mode for the getter; the attribute name is a placeholder
    self._cycle_length_mode = value
# Usage
tsdataset.get_cycle_length = 'min' # Set the mode of cycle_length.
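The same validation can be written once as a descriptor and reused for several attributes. The sketch below is only an illustration: `ValidatedStr`, `ToyDataset`, the attribute name `cycle_length_mode`, and the set of allowed modes are all assumptions, not existing Chronos names.

```python
class ValidatedStr:
    """Data descriptor that only accepts strings from a fixed set."""

    def __init__(self, allowed):
        self.allowed = tuple(allowed)

    def __set_name__(self, owner, name):
        # Store the value on the instance under a private name
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, return the descriptor itself
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        if not isinstance(value, str):
            raise TypeError(f"expected str, got {type(value).__name__}")
        if value not in self.allowed:
            raise ValueError(f"{value!r} is not one of {self.allowed}")
        setattr(instance, self.private_name, value)


class ToyDataset:
    # Assumed set of modes, chosen for illustration only
    cycle_length_mode = ValidatedStr(("min", "max", "mean", "median", "mode"))

    def __init__(self, mode="mode"):
        self.cycle_length_mode = mode  # goes through ValidatedStr.__set__


toy = ToyDataset()
toy.cycle_length_mode = "min"
print(toy.cycle_length_mode)
```

A plain `property` is enough for a single attribute; a descriptor pays off when the same check applies to many attributes.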
- Because pandas performs poorly, we can add `polars` as a new backend. `polars` has good parallel performance and supports a lazy API.
tsdata = TSDataset.from_pandas(df, ..., use_polars=True)
`pandas` and `polars` performance comparison: https://h2oai.github.io/db-benchmark/
Differences between pandas and polars:
- `polars` does not have indexes.
- `groupby` can only return a single data column.