diff --git a/pages/data-modeling/caching/pre-aggregations.mdx b/pages/data-modeling/caching/pre-aggregations.mdx
index ad2b159..d31d0b3 100644
--- a/pages/data-modeling/caching/pre-aggregations.mdx
+++ b/pages/data-modeling/caching/pre-aggregations.mdx
@@ -648,12 +648,6 @@ cubes:
         granularity: day
 ```
 
-## Indexes
-
-To get the best performance out of your pre-aggregations you will likely want to define indexes too.
-
-Cube recommends "for most queries, there should be at least one index that makes a particular query scan very little amount of data”. You can read all about indexes [here](https://cube.dev/docs/product/caching/using-pre-aggregations#using-indexes).
-
 ## Handling incremental data loads
 
 Sometimes your source data is updated incrementally for example: only the last few days are reloaded or updated while older data remains unchanged. In these cases, it’s more efficient to build your pre-aggregations incrementally instead of rebuilding the entire dataset.
@@ -690,6 +684,271 @@ pre_aggregations:
 
 Without `update_window`, Cube refreshes partitions strictly according to `partition_granularity` (in this case, just the last day).
 
+## Indexes
+
+Indexes make data retrieval faster. Think of an index as a shortcut that points directly to the relevant rows instead of searching through all the data. This speeds up queries that filter, group, or join on specific fields.
+
+In the context of pre-aggregations, indexes help [Cube Store](https://cube.dev/docs/product/deployment#cube-store) quickly locate and read only the data needed for a query, improving performance, especially on large datasets.
+
+Indexes are particularly useful when:
+
+- A pre-aggregation is large. Indexes are often required to achieve optimal performance in that case, especially when a query doesn’t use all dimensions from the pre-aggregation.
+- Queries frequently filter on **high-cardinality dimensions**, such as `product_id` or `date`. Indexes help Cube Store find matching rows faster in these cases.
+- You plan to join one pre-aggregation with another, such as in a [`rollup_join`](/data-modeling/caching/pre-aggregations#rollup_join).
+
+
+Adding indexes doesn’t change your data; it simply makes Cube Store more efficient at finding it.
+
+
+### Using indexes in pre-aggregations
+
+Let’s start with a simple `products` model and define a `products_preagg` pre-aggregation.
+
+Here we add an index on `size` within our pre-aggregation, which Cube Store uses to quickly resolve joins and filters involving that indexed column.
+
+```yaml
+cubes:
+  - name: products
+    sql_table: my_db.main.products
+    data_source: default
+
+    dimensions:
+      - name: id
+        sql: id
+        type: number
+        primary_key: true
+        public: true
+
+      - name: name
+        sql: name
+        type: string
+
+      - name: size
+        sql: size
+        type: string
+
+
+    measures:
+      - name: count
+        type: count
+        title: '# of products'
+
+      - name: price
+        type: sum
+        title: Total USD
+        sql: price
+
+    joins:
+      - name: orders
+        sql: "{CUBE.id} = {orders.product_id}"
+        relationship: one_to_many
+
+    pre_aggregations:
+      - name: products_preagg
+        type: rollup
+        dimensions:
+          - size
+        measures:
+          - count
+          - price
+        indexes:
+          - name: product_index
+            columns:
+              - size
+```
+
+In this example:
+
+- The `products_preagg` pre-aggregation stores aggregated product data by the `size` dimension.
+- The index `product_index` on `size` speeds up queries using that dimension.
+- Make sure every column you index is also included in the pre-aggregation’s dimensions; otherwise, Cube will return an error like the following (shown here for an index on `id`, which is not among the pre-aggregation’s dimensions):
+
+    > Error during create table: Column 'products__id' in index 'products_products_preagg_product_index' is not found in table 'products_products_preagg'
+    >
+
+
+Each index adds to the pre-aggregation build time, since all indexes are created during ingestion. Add only the ones you need.
+
+
+Learn more about indexes [here](https://cube.dev/docs/product/data-modeling/reference/pre-aggregations#indexes).
+
+## Rollup_join
+
+- Cube can run SQL joins across different data sources. For example, you might have products in [PostgreSQL](/data/credentials#postgres) and orders in [MotherDuck](/data/credentials#motherduck).
+
+- All pre-aggregations so far have been of type `rollup` (the default pre-aggregation type). Cube also supports `rollup_join`, which combines data from two or more rollups coming from different data sources.
+
+- `rollup_join` joins pre-aggregated data inside [Cube Store](https://cube.dev/docs/product/deployment#cube-store), so you can query it together efficiently.
+
+
+You don’t need a `rollup_join` to join cubes from the same data source. Just include the other cube’s dimensions and measures directly in your rollup definition, as described [here](/data-modeling/caching/pre-aggregations#performing-joins-across-cubes-in-your-pre-aggregations).
+
+
+Let’s extend the example from the [indexes](/data-modeling/caching/pre-aggregations#indexes) section. We’ll keep the `products` model from the PostgreSQL (default) data source. Since it joins to the `orders` model on the `id` column, we’ll update the pre-aggregation to include `id` and `name`, and add an index on `id`.
+
+```yaml
+
+    pre_aggregations:
+      - name: products_preagg
+        type: rollup
+        dimensions:
+          - id
+          - name
+          - size
+        measures:
+          - count
+          - price
+        indexes:
+          - name: product_index
+            columns:
+              - id
+        refresh_key:
+          every: 1 hour
+```
+
+Next, we add a new `orders` model from the MotherDuck data source to show how to run analytics across databases.
+
+
+```yaml
+cubes:
+  - name: orders
+    sql_table: public.orders
+    data_source: motherduck
+
+    dimensions:
+      - name: id
+        sql: id
+        type: number
+        primary_key: true
+
+      - name: created_at
+        sql: created_at
+        type: time
+
+      - name: product_id
+        sql: product_id
+        type: number
+        public: false
+
+    measures:
+      - name: count
+        type: count
+        title: "# of orders"
+
+    joins:
+      - name: products
+        sql: "{CUBE.product_id} = {products.id}"
+        relationship: many_to_one
+
+    pre_aggregations:
+      - name: orders_preagg
+        type: rollup
+        dimensions:
+          - product_id
+          - created_at
+        measures:
+          - count
+        time_dimension: CUBE.created_at
+        granularity: day
+        indexes:
+          - name: orders_index
+            columns:
+              - product_id
+        refresh_key:
+          every: 1 hour
+
+      - name: orders_with_products_rollup
+        type: rollup_join
+        dimensions:
+          - products.name
+          - orders.created_at
+        measures:
+          - orders.count
+        time_dimension: orders.created_at
+        granularity: day
+        rollups:
+          - products.products_preagg
+          - orders_preagg
+```
+
+**Things to notice:**
+
+- `orders` uses the **MotherDuck** data source.
+- `products` uses the **default** data source (for example, PostgreSQL). Learn more about connecting to multiple data sources [here](/data/credentials).
+- Always reference dimensions explicitly in your joins between models, especially when using a `rollup_join`:
+
+    ```yaml
+    joins:
+      - name: products
+        sql: "{CUBE.product_id} = {products.id}"
+        relationship: many_to_one
+    ```
+
+    If you use `{CUBE}.product_id` or `{products}.id`, Cube will not recognise them as dimension references and will return an error like:
+
+    ```
+    From members are not found in [] for join ...
+    Please make sure join fields are referencing dimensions instead of columns.
+    ```
+
+- Indexes are required when using `rollup_join` pre-aggregations so Cube Store can join multiple pre-aggregations efficiently.
+
+    Without the right index, Cube may fail to plan the join and return an error like:
+
+    ```
+    Error during planning: Can't find index to join table ...
+    Consider creating index ... ON ... (orders__product_id)
+    ```
+
+    That is why we have indexed the **join keys on both sides**:
+
+    ```
+    - `products.products_preagg` → index on `id`
+    - `orders.orders_preagg` → index on `product_id`
+    ```
+
+- `orders_with_products_rollup` combines both pre-aggregations inside **Cube Store** using the type `rollup_join`.
+
+    The `rollups:` property lists which pre-aggregations to join together:
+
+    ```yaml
+    rollups:
+      - products.products_preagg
+      - orders_preagg
+    ```
+
+- We also added a `time_dimension` with **day-level granularity** in `orders_with_products_rollup`.
+
+    We expect users to ask questions at a daily level, such as “How many orders were placed per product each day?”. Setting the `granularity` to **day** ensures Cube builds and queries this data efficiently.
+
+
+    `rollup_join` is an ephemeral pre-aggregation. It uses the referenced pre-aggregations at query time, so freshness is controlled by them, not by the `rollup_join` itself.
+
+
+- Notice that we’ve set the `refresh_key` to **1 hour** on both referenced pre-aggregations (`products_preagg` and `orders_preagg`) to keep the data up to date. Learn more about refreshing pre-aggregations [here](/data-modeling/caching/pre-aggregations#refreshing-pre-aggregations).
+
+### How `rollup_join` works in Embeddable
+
+In this example, we’ll find the total **number of orders** for each **product**. The **product name** comes from the `products` model, while the **orders count** comes from the `orders` model.
+
+
+
+**Things to notice:**
+- The query’s `FROM` clause references both pre-aggregations. This is how Cube joins pre-aggregated datasets from different data sources inside Cube Store.
+
+### Benefits of using `rollup_join`
+
+- Enables **cross-database joins** inside Cube Store
+- Leverages **indexed pre-aggregations** for efficient distributed joins
+- Avoids the need for ETL or database federation
+- Provides consistent, scalable analytics across data sources
+
+Learn more about rollup_join [here](https://cube.dev/docs/product/data-modeling/reference/pre-aggregations#rollup_join).
+
 ## Next Steps
 
 The next step is to setup Embeddable’s [Caching API](/data-modeling/caching/caching-api) to refresh pre-aggregations for each of your security contexts.
diff --git a/public/video/rollup_join_example.mp4 b/public/video/rollup_join_example.mp4
new file mode 100644
index 0000000..0d7bfe5
Binary files /dev/null and b/public/video/rollup_join_example.mp4 differ