|
| 1 | +# Search |
| 2 | + |
| 3 | +<!-- TOC --> |
| 4 | +- [Search](#search) |
| 5 | + - [Usage](#usage) |
| 6 | + - [Implementation 1: Filter on DB Table](#implementation-1-filter-on-db-table) |
| 7 | + - [Implementation 2: Subscription Search](#implementation-2-subscription-search) |
| 8 | + - [Postgres Text Search](#postgres-text-search) |
| 9 | + - [Materialized View subscriptions\_search](#materialized-view-subscriptions_search) |
| 10 | +<!-- TOC --> |
| 11 | + |
| 12 | +The orchestrator-core provides search functionality to find data within the Postgres DB. |
| 13 | + |
| 14 | +The below overview describes which objects can be searched through which endpoint, and the underlying search implementation. |
| 15 | + |
| 16 | +| Object | REST endpoint | GraphQL endpoint | Implementation | |
| 17 | +|----------------|------------------------------------|--------------------------------|---------------------| |
| 18 | +| Subscriptions | `/api/subscriptions/search?query=` | `subscriptions(query: "...")` | Subscription Search | |
| 19 | +| Processes | | `processes(query: "...")` | Filter on DB table | |
| 20 | +| Product Blocks | | `product_blocks(query: "...")` | Filter on DB table | |
| 21 | +| Products | | `products(query: "...")` | Filter on DB table | |
| 22 | +| Resource types | | `resource_types(query: "...")` | Filter on DB table | |
| 23 | +| Workflows | | `workflows(query: "...")` | Filter on DB table | |
| 24 | + |
| 25 | +There are 2 implementations: |
| 26 | + |
| 27 | +1. **Filter on DB Table**: generic implementation to search on values in the DB table for an object, or related tables |
| 28 | +1. **Subscription Search**: specialized implementation to search for values anywhere in a subscription, such as instance values or customer descriptions |
| 29 | + |
| 30 | +## Usage |
| 31 | + |
| 32 | +Call the REST or GraphQL endpoint with the `query` parameter set to the search query. |
| 33 | + |
| 34 | +Some search query examples for Subscriptions: |
| 35 | + |
| 36 | +| Query | Matches subscriptions with | |
| 37 | +|----------------------------------------------------------|----------------------------------------------------------------------| |
| 38 | +| `test` | `test` in any field | |
| 39 | +| `description:test` | `test` in description field | |
| 40 | +| `L2VPN` | `L2VPN` in any field | |
| 41 | +| `tag:L2VPN` | `L2VPN` in product tag field | |
| 42 | +| `tag:(FW \| L2VPN)` <br/> Or: <br/>`tag:FW \| tag:L2VPN` | `FW` **OR** `L2VPN` in product tag field | |
| 43 | +| `description:test tag:L2VPN` | `test` in description field **and** `L2VPN` in product tag field | |
| 44 | +| `description:test -tag:L2VPN` | `test` in description field **and NOT** `L2VPN` in product tag field | |
| 45 | +| `test 123` <br/> Or: <br/>`123 test` | `test` **AND** `123` anywhere |
| 46 | +| `"test 123"` | `test 123` anywhere (Phrase search) | |
| 47 | +| `test*` | any field starting with `test` (Prefix search) | |
| 48 | + |
| 49 | +Note that: |
| 50 | +* For other objects the query syntax is the same |
| 51 | +* Searching is case-insensitive |
| 52 | +* Ordering of words does not matter (unless it is a Phrase) |
| 53 | +* Search words cannot contain the characters `|-*():"` as they are part of the search query grammar |
| 54 | + |
| 55 | +## Implementation 1: Filter on DB Table |
| 56 | + |
| 57 | +This implementation translates the user's search query to `WHERE` clauses on DB columns of the object's DB table. For some objects this extends to related DB tables. |
| 58 | + |
| 59 | +We can distinguish these steps: |
| 60 | + |
| 61 | +* The module `search_query.py` parses the user's query and generates a sequence of sqlalchemy `.filter()` clauses |
| 62 | +* The REST/GraphQL endpoint appends these clauses to the sqlalchemy `select()` to find objects that match the user's query |
| 63 | + |
| 64 | +## Implementation 2: Subscription Search |
| 65 | + |
| 66 | +This is a specialized implementation to allow searching values anywhere in a subscription or in related entities, without sacrificing performance. |
| 67 | + |
| 68 | +We can distinguish these steps/components: |
| 69 | + |
| 70 | + * The DB view `subscriptions_search` is a search index with _Text Search_ (TS) vectors. Both are explained in the next sections |
| 71 | + * The module `search_query.py` parses the user's query into a string that is used to create a TS query |
| 72 | + * The TS query is wrapped in a single sqlalchemy `.filter()` clause to match TS documents in `subscriptions_search` and get the corresponding `subscription_id` |
| 73 | + * The REST/GraphQL endpoint appends this clause to the sqlalchemy `select()` to find subscription objects that match the user's query |
| 74 | + |
| 75 | +The diagram below visualizes the flow and dependencies between components in the Database, API and Frontend. |
| 76 | + |
| 77 | + |
| 78 | + |
| 79 | +<!-- search_overview.png can be opened and modified in draw.io --> |
| 80 | + |
| 81 | +### Postgres Text Search |
| 82 | + |
| 83 | +Postgres _Full Text Search_ (TS) has extensive [documentation](https://www.postgresql.org/docs/current/textsearch.html) but we'll cover the fundamentals in this section. TS queries can be done on "normal" DB tables, i.e. without a search index, but this is too slow. It is recommended to maintain a search index which we do in DB view `subscriptions_search`. |
| 84 | + |
| 85 | +**Creating TS vectors** |
| 86 | + |
| 87 | +The query behind `subscriptions_search` retrieves Subscriptions joined with several other tables (as shown in the previous diagram), forming a "document" of keywords that in some way relate to the subscription. |
| 88 | + |
| 89 | +Each document is turned into a _tsvector_ with Postgres function `to_tsvector()` which consists of these phases: |
| 90 | +* Parse document into _tokens_: split text into _tokens_ using special characters as delimiters |
| 91 | +* Convert tokens into _lexemes_: a lexeme is a normalized token, i.e. this folds upper-case to lower-case. This step can also normalize based on language, but we disable that by using the `'simple'` dictionary (shown in diagram above) |
| 92 | +* Create vector optimized for search: store array of _lexemes_ along with positional information |
| 93 | + |
| 94 | +The array of _lexemes_ makes up the tsvector document that we store in `subsriptions_search`, and which we can query through the Postgres function `to_tsquery()`. |
| 95 | + |
| 96 | +We'll demonstrate how this works through a few examples. You can follow along in any postgres shell (v14 or above). If you have Docker installed, run these commands in separate shells. |
| 97 | + |
| 98 | +```sh |
| 99 | +docker run --rm -e POSTGRES_HOST_AUTH_METHOD=trust --name pg15 postgres:15 |
| 100 | +docker exec -i -t pg15 su - postgres -c psql |
| 101 | +``` |
| 102 | + |
| 103 | +Translating the input text `'color:Light_Blue count:4'` to a tsvector: |
| 104 | + |
| 105 | +``` |
| 106 | +postgres=# select to_tsvector('simple', 'color:Light_Blue count:4'); |
| 107 | + to_tsvector |
| 108 | +---------------------------------------------- |
| 109 | + '4':5 'blue':3 'color':1 'count':4 'light':2 |
| 110 | +(1 row) |
| 111 | +``` |
| 112 | + |
| 113 | +The result `'4':5 'blue':3 'color':1 'count':4 'light':2` is an array of lexemes combined with information about their position in the original text. |
| 114 | + |
| 115 | +With Postgres function `to_tsdebug()` one can investigate how the input was parsed. |
| 116 | + |
| 117 | +``` |
| 118 | +postgres=# select * from ts_debug('simple', 'color:Light_Blue count:4'); |
| 119 | + alias | description | token | dictionaries | dictionary | lexemes |
| 120 | +-----------+------------------+-------+--------------+------------+--------- |
| 121 | + asciiword | Word, all ASCII | color | {simple} | simple | {color} |
| 122 | + blank | Space symbols | : | {} | | |
| 123 | + asciiword | Word, all ASCII | Light | {simple} | simple | {light} |
| 124 | + blank | Space symbols | _ | {} | | |
| 125 | + asciiword | Word, all ASCII | Blue | {simple} | simple | {blue} |
| 126 | + blank | Space symbols | | {} | | |
| 127 | + asciiword | Word, all ASCII | count | {simple} | simple | {count} |
| 128 | + blank | Space symbols | : | {} | | |
| 129 | + uint | Unsigned integer | 4 | {simple} | simple | {4} |
| 130 | + ``` |
| 131 | + |
| 132 | +Note: when we would write "Light-Blue" with a dash instead of underscore, Postgres translates this into the vector `'blue':4 'light':3 'light-blue':2`, which makes it very complicated to query, in particular for UUID strings. To mitigate this we replace all occurrences of `-` with `_` when creating TS Vectors and before executing TS Queries. |
| 133 | + |
| 134 | +**Querying TS Vectors** |
| 135 | + |
| 136 | +To run queries against TS Vectors we have to prepare a TS Query with Postgres function `to_tsquery()`. We also pass the `'simple'` dictionary here to prevent language-specific normalization. |
| 137 | + |
| 138 | +``` |
| 139 | +postgres=# select to_tsquery('simple', 'Light_Blue'); |
| 140 | + to_tsquery |
| 141 | +-------------------- |
| 142 | + 'light' <-> 'blue' |
| 143 | + ``` |
| 144 | + |
| 145 | +The resulting TS query means as much as: `the vector must contain 'light' followed by 'blue'`. Note that the input string has been tokenized similar to TS vectors. |
| 146 | + |
| 147 | +We can execute this TS query against the TS vector: |
| 148 | + |
| 149 | +``` |
| 150 | +postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'Light_Blue'); |
| 151 | + ?column? |
| 152 | +---------- |
| 153 | + t |
| 154 | +(1 row) |
| 155 | +``` |
| 156 | + |
| 157 | +Returns `t` for `true`. |
| 158 | + |
| 159 | +As a final example we can also add an `OR` condition: |
| 160 | + |
| 161 | +``` |
| 162 | +postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'color <-> Green | count <-> 4'); |
| 163 | + ?column? |
| 164 | +---------- |
| 165 | + t |
| 166 | +(1 row) |
| 167 | +``` |
| 168 | + |
| 169 | +Which returns true because while the vector does not contain the color green, it does contain count 4. |
| 170 | + |
| 171 | +Note that the `color <-> Green | count <-> 4` string passed to `ts_query()` must be constructed in a specific way. |
| 172 | +This happens in the orchestrator-core module `search_query.py` as shown in the overview diagram. |
| 173 | + |
| 174 | +### Materialized View subscriptions_search |
| 175 | + |
| 176 | +As mentioned before, `subscriptions_search` is a DB view which lies at the heart of the implementation. |
| 177 | +If you're not familiar with database views; they represent a (usually complicated) database query in the form of a "virtual table". |
| 178 | + |
| 179 | +In this case we're using a [Materialized View](https://www.postgresql.org/docs/current/rules-materializedviews.html#RULES-MATERIALIZEDVIEWS) which is like a normal view, except that the "virtual table" is persisted to save resources and increase performance. |
| 180 | +The view's data is persisted until it is refreshed, which can be done manually or through an update trigger. (further explained below) |
| 181 | + |
| 182 | +This table has a [GIN index](https://www.postgresql.org/docs/current/textsearch-indexes.html#TEXTSEARCH-INDEXES) for efficient search queries. |
| 183 | + |
| 184 | +**Triggers** |
| 185 | + |
| 186 | +Database function `refresh_subscriptions_search_view` takes care of refreshing `subscriptions_search`. It is called by triggers on the following tables: |
| 187 | + |
| 188 | +* `fi_refresh_search` on table `fixed_inputs` |
| 189 | +* `products_refresh_search` on table `products` |
| 190 | +* `siv_refresh_search` on table `subscription_instance_values` |
| 191 | +* `sub_cust_desc_refresh_search` on table `subscription_customer_descriptions` |
| 192 | +* `sub_refresh_search` on table `subscriptions` |
| 193 | + |
| 194 | +The following query shows the current state of the triggers: |
| 195 | + |
| 196 | +```sql |
| 197 | +SELECT tgname, tgenabled |
| 198 | +FROM pg_trigger |
| 199 | +where pg_trigger.tgname in |
| 200 | + ('fi_refresh_search', |
| 201 | + 'products_refresh_search', |
| 202 | + 'sub_cust_desc_refresh_search', |
| 203 | + 'siv_refresh_search', |
| 204 | + 'sub_refresh_search'); |
| 205 | +``` |
| 206 | + |
| 207 | +When all triggers are disabled the output looks like: |
| 208 | + |
| 209 | +``` |
| 210 | + tgname | tgenabled |
| 211 | +------------------------------+----------- |
| 212 | + fi_refresh_search | D |
| 213 | + products_refresh_search | D |
| 214 | + siv_refresh_search | D |
| 215 | + sub_cust_desc_refresh_search | D |
| 216 | + sub_refresh_search | D |
| 217 | +``` |
| 218 | + |
| 219 | +And when all triggers are enabled the output looks like this: |
| 220 | +``` |
| 221 | + tgname | tgenabled |
| 222 | +------------------------------+----------- |
| 223 | + fi_refresh_search | O |
| 224 | + products_refresh_search | O |
| 225 | + siv_refresh_search | O |
| 226 | + sub_cust_desc_refresh_search | O |
| 227 | + sub_refresh_search | O |
| 228 | +``` |
| 229 | + |
| 230 | +Enabling all triggers is done with these statements: |
| 231 | + |
| 232 | +```sql |
| 233 | +ALTER TABLE fixed_inputs ENABLE TRIGGER fi_refresh_search; |
| 234 | +ALTER TABLE products ENABLE TRIGGER products_refresh_search; |
| 235 | +ALTER TABLE subscription_customer_descriptions ENABLE TRIGGER sub_cust_desc_refresh_search; |
| 236 | +ALTER TABLE subscription_instance_values ENABLE TRIGGER siv_refresh_search; |
| 237 | +ALTER TABLE subscriptions ENABLE TRIGGER sub_refresh_search; |
| 238 | +``` |
| 239 | + |
| 240 | +Disabling all triggers is done with these statements: |
| 241 | + |
| 242 | +```sql |
| 243 | +ALTER TABLE fixed_inputs DISABLE TRIGGER fi_refresh_search; |
| 244 | +ALTER TABLE products DISABLE TRIGGER products_refresh_search; |
| 245 | +ALTER TABLE subscription_customer_descriptions DISABLE TRIGGER sub_cust_desc_refresh_search; |
| 246 | +ALTER TABLE subscription_instance_values DISABLE TRIGGER siv_refresh_search; |
| 247 | +ALTER TABLE subscriptions DISABLE TRIGGER sub_refresh_search; |
| 248 | +``` |
| 249 | + |
| 250 | +The following query returns the number of seconds since the last refresh, which can be useful for debugging. |
| 251 | + |
| 252 | +```sql |
| 253 | +SELECT extract(epoch from now())::int - coalesce(pg_catalog.obj_description('subscriptions_search'::regclass)::int, 0); |
| 254 | +``` |
| 255 | + |
| 256 | +**Limitations** |
| 257 | + |
| 258 | +Updating the `subscriptions_search` materialized view is expensive and limited to once every 2 minutes. |
| 259 | + |
| 260 | +This means that when 2 changes happen within, say, 5 seconds of each other, the first change will be picked up directly. |
| 261 | +However, the second change will only be processed on the next refresh of the view. |
| 262 | +So during that period the second change will not show up in the search results. |
0 commit comments