Chronos: Some new API suggestions for TSDataset #6054

Open
@liangs6212

Why do we need this change?

pandas performs poorly, and TSDataset would be easier to use with fewer cascade calls.

How can we modify it?

  1. roll+to_numpy:
    This is a classic combination that has to be called almost every time. Users probably don't need to know what roll is, so we could fold the rolling step into to_numpy (or numpy) and expose only that. Btw, these changes should not affect the use of to_torch_data_loader.
# API change
from bigdl.chronos.data import TSDataset
tsdata = TSDataset.from_pandas(..., lookback=48, horizon=1, with_split=False)
x, y = tsdata.to_numpy()  # like to_torch_data_loader
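To make the proposal concrete, here is a minimal sketch of what a merged roll + to_numpy step could compute. It is not the Chronos implementation: `roll_to_numpy` is a hypothetical helper, and it only assumes that `lookback` and `horizon` mean "past steps fed to the model" and "future steps to predict".

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def roll_to_numpy(values, lookback, horizon):
    """Hypothetical helper: slice a series into (x, y) sample windows."""
    values = np.asarray(values)
    if values.ndim == 1:
        values = values[:, None]  # treat a 1-D series as a single feature

    window = lookback + horizon
    # Shape after the call: (n_windows, n_features, window)
    windows = sliding_window_view(values, window, axis=0)
    # Reorder to (n_windows, window, n_features)
    windows = np.moveaxis(windows, -1, 1)

    x = windows[:, :lookback, :].copy()  # past steps
    y = windows[:, lookback:, :].copy()  # future steps
    return x, y


series = np.arange(10, dtype=float)
x, y = roll_to_numpy(series, lookback=3, horizon=1)
print(x.shape, y.shape)
```

With something like this behind `to_numpy`, the user only passes `lookback` and `horizon` once, at construction time.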
  2. Optimize some existing APIs:
    Many of the cascade calls may be unnecessary, so we can change some of them to properties. The methods are classified by underlying framework, with some usage given.
| Category | pandas | tsfresh | scikit-learn | other |
|---|---|---|---|---|
| Method | deduplicate / impute / resample | gen_dt_feature / gen_global_feature / gen_rolling_feature | scale / unscale / unscale_numpy | to_tf_dataset / to_numpy / to_torch_data_loader / to_pandas |
| Advice | Change to attributes | No change | Calling scale will change the source data. Can we leave the original data unchanged, so we don't need unscale and unscale_numpy either? | Merge roll (exclude to_pandas / to_torch_data_loader) |
# Change pandas-related methods to attributes.
tsdata = TSDataset.from_pandas(..., impute=True, impute_mode="const",
                               const_num=0, deduplicate=True,
                               resample=True, interval='s', start_time=None,
                               end_time=None, merge_mode='mean', with_split=False)
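As a rough illustration of what those constructor flags would have to do internally, the sketch below applies the three pandas steps in one function. `preprocess` and its signature are assumptions made for this example only; the parameter names simply mirror the proposal above.

```python
import pandas as pd


def preprocess(df, dt_col, deduplicate=False, resample=False, interval="s",
               merge_mode="mean", impute=False, const_num=0):
    """Hypothetical helper: run the pandas-related steps selected by flags."""
    if deduplicate:
        df = df.drop_duplicates()
    if resample:
        # Bucket rows by `interval` and merge each bucket with `merge_mode`
        df = df.set_index(dt_col).resample(interval).agg(merge_mode).reset_index()
    if impute:
        # Only the "const" impute mode is sketched here
        df = df.fillna(const_num)
    return df


raw = pd.DataFrame({
    "ts": pd.to_datetime(["2022-01-01 00:00:00",
                          "2022-01-01 00:00:00",
                          "2022-01-01 00:00:02"]),
    "value": [1.0, 1.0, 3.0],
})
clean = preprocess(raw, "ts", deduplicate=True, resample=True,
                   interval="s", impute=True, const_num=0)
print(clean)
```

The order of the steps (deduplicate, then resample, then impute) is a choice made for this sketch; a real implementation would need to fix and document it, since cascade calls currently let the user pick the order.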
  3. We can use Descriptor and Property to manage properties and methods. For more info, please refer to Chronos: Make TSDataset more friendly #5656.
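@property
def get_cycle_length(self):
    cycle_length = (...)
    return cycle_length

@get_cycle_length.setter  # property objects provide .setter, not .setattr
def get_cycle_length(self, value):
    # Check for illegal input
    if not isinstance(value, str):
        raise TypeError("the cycle_length mode must be a str")
    # Hypothetical attribute that stores the mode used by the getter
    self._cycle_length_mode = value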

# Usage 
tsdataset.get_cycle_length = 'min'  # Set the mode of cycle_length.
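For the Descriptor half of the suggestion, a minimal self-contained sketch could look like the following. `StrOption` and `DemoDataset` are hypothetical names, and the allowed mode strings are illustrative rather than taken from Chronos.

```python
class StrOption:
    """Hypothetical descriptor: a string attribute restricted to allowed values."""

    def __init__(self, allowed):
        self.allowed = tuple(allowed)

    def __set_name__(self, owner, name):
        # Store the value on the instance under a private name
        self.private_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class, return the descriptor itself
        return getattr(instance, self.private_name)

    def __set__(self, instance, value):
        # Check for illegal input once, for every attribute using this descriptor
        if not isinstance(value, str):
            raise TypeError(f"expected str, got {type(value).__name__}")
        if value not in self.allowed:
            raise ValueError(f"{value!r} is not one of {self.allowed}")
        setattr(instance, self.private_name, value)


class DemoDataset:
    # Illustrative set of modes, not the real Chronos options
    cycle_length_mode = StrOption(allowed=("min", "max", "mean", "median", "mode"))

    def __init__(self, cycle_length_mode="mode"):
        self.cycle_length_mode = cycle_length_mode


ds = DemoDataset()
ds.cycle_length_mode = "min"  # validated by StrOption.__set__
print(ds.cycle_length_mode)
```

Compared with one property per option, a descriptor keeps the validation logic in one place and can be reused for other string-valued settings.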
  4. Because of the poor performance of pandas, we can add polars as a new backend. polars has good parallel performance and supports a lazy API.
tsdata = TSDataset.from_pandas(df, ..., use_polars=True)

pandas and polars performance comparison: https://h2oai.github.io/db-benchmark/
Differences between pandas and polars:

  1. polars does not have indexes.
  2. groupby can only return a single data column.
