Skip to content

Commit 1d0367d

Browse files
authored
Really fix UUID search + Add documentation for search (#739)
* Add test for searching uuids * Add documentation for search Signed-off-by: Mark90 <[email protected]> * Fixes #732 by actually deploying the uuid search workaround from #603 In #603 the uuid search workaround was added to existing migration da5c9f4cce1c in orchestrator-core commit 3e93263. Because of that, it was never deployed to existing environments where the original migration was already executed. In a new migration (460ec6748e37) this workaround will be deployed onto existing environments. The old migration (da5c9f4cce1c) is restored to its original state before commit 3e93263. Signed-off-by: Mark90 <[email protected]> * Bump version to 2.7.6rc2 --------- Signed-off-by: Mark90 <[email protected]>
1 parent 62d64fd commit 1d0367d

File tree

8 files changed

+507
-27
lines changed

8 files changed

+507
-27
lines changed

.bumpversion.cfg

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
[bumpversion]
2-
current_version = 2.7.6rc1
2+
current_version = 2.7.6rc2
33
commit = False
44
tag = False
55
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(rc(?P<build>\d+))?

docs/reference-docs/search.md

Lines changed: 262 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,262 @@
1+
# Search
2+
3+
<!-- TOC -->
4+
- [Search](#search)
5+
- [Usage](#usage)
6+
- [Implementation 1: Filter on DB Table](#implementation-1-filter-on-db-table)
7+
- [Implementation 2: Subscription Search](#implementation-2-subscription-search)
8+
- [Postgres Text Search](#postgres-text-search)
9+
- [Materialized View subscriptions\_search](#materialized-view-subscriptions_search)
10+
<!-- TOC -->
11+
12+
The orchestrator-core provides search functionality to find data within the Postgres DB.
13+
14+
The below overview describes which objects can be searched through which endpoint, and the underlying search implementation.
15+
16+
| Object | REST endpoint | GraphQL endpoint | Implementation |
17+
|----------------|------------------------------------|--------------------------------|---------------------|
18+
| Subscriptions | `/api/subscriptions/search?query=` | `subscriptions(query: "...")` | Subscription Search |
19+
| Processes | | `processes(query: "...")` | Filter on DB table |
20+
| Product Blocks | | `product_blocks(query: "...")` | Filter on DB table |
21+
| Products | | `products(query: "...")` | Filter on DB table |
22+
| Resource types | | `resource_types(query: "...")` | Filter on DB table |
23+
| Workflows | | `workflows(query: "...")` | Filter on DB table |
24+
25+
There are 2 implementations:
26+
27+
1. **Filter on DB Table**: generic implementation to search on values in the DB table for an object, or related tables
28+
1. **Subscription Search**: specialized implementation to search for values anywhere in a subscription, such as instance values or customer descriptions
29+
30+
## Usage
31+
32+
Call the REST or GraphQL endpoint with the `query` parameter set to the search query.
33+
34+
Some search query examples for Subscriptions:
35+
36+
| Query | Matches subscriptions with |
37+
|----------------------------------------------------------|----------------------------------------------------------------------|
38+
| `test` | `test` in any field |
39+
| `description:test` | `test` in description field |
40+
| `L2VPN` | `L2VPN` in any field |
41+
| `tag:L2VPN` | `L2VPN` in product tag field |
42+
| `tag:(FW \| L2VPN)` <br/> Or: <br/>`tag:FW \| tag:L2VPN` | `FW` **OR** `L2VPN` in product tag field |
43+
| `description:test tag:L2VPN` | `test` in description field **and** `L2VPN` in product tag field |
44+
| `description:test -tag:L2VPN` | `test` in description field **and NOT** `L2VPN` in product tag field |
45+
| `test 123` <br/> Or: <br/>`123 test` | `test` **AND** `123` anywhere
46+
| `"test 123"` | `test 123` anywhere (Phrase search) |
47+
| `test*` | any field starting with `test` (Prefix search) |
48+
49+
Note that:
50+
* For other objects the query syntax is the same
51+
* Searching is case-insensitive
52+
* Ordering of words does not matter (unless it is a Phrase)
53+
* Search words cannot contain the characters `|-*():"` as they are part of the search query grammar
54+
55+
## Implementation 1: Filter on DB Table
56+
57+
This implementation translates the user's search query to `WHERE` clauses on DB columns of the object's DB table. For some objects this extends to related DB tables.
58+
59+
We can distinguish these steps:
60+
61+
* The module `search_query.py` parses the user's query and generates a sequence of sqlalchemy `.filter()` clauses
62+
* The REST/GraphQL endpoint appends these clauses to the sqlalchemy `select()` to find objects that match the user's query
63+
64+
## Implementation 2: Subscription Search
65+
66+
This is a specialized implementation to allow searching values anywhere in a subscription or in related entities, without sacrificing performance.
67+
68+
We can distinguish these steps/components:
69+
70+
* The DB view `subscriptions_search` is a search index with _Text Search_ (TS) vectors. Both are explained in the next sections
71+
* The module `search_query.py` parses the user's query into a string that is used to create a TS query
72+
* The TS query is wrapped in a single sqlalchemy `.filter()` clause to match TS documents in `subscriptions_search` and get the corresponding `subscription_id`
73+
* The REST/GraphQL endpoint appends this clause to the sqlalchemy `select()` to find subscription objects that match the user's query
74+
75+
The diagram below visualizes the flow and dependencies between components in the Database, API and Frontend.
76+
77+
![Diagram search](search_overview.png)
78+
79+
<!-- search_overview.png can be opened and modified in draw.io -->
80+
81+
### Postgres Text Search
82+
83+
Postgres _Full Text Search_ (TS) has extensive [documentation](https://www.postgresql.org/docs/current/textsearch.html) but we'll cover the fundamentals in this section. TS queries can be done on "normal" DB tables, i.e. without a search index, but this is too slow. It is recommended to maintain a search index which we do in DB view `subscriptions_search`.
84+
85+
**Creating TS vectors**
86+
87+
The query behind `subscriptions_search` retrieves Subscriptions joined with several other tables (as shown in the previous diagram), forming a "document" of keywords that in some way relate to the subscription.
88+
89+
Each document is turned into a _tsvector_ with Postgres function `to_tsvector()` which consists of these phases:
90+
* Parse document into _tokens_: split text into _tokens_ using special characters as delimiters
91+
* Convert tokens into _lexemes_: a lexeme is a normalized token, i.e. this folds upper-case to lower-case. This step can also normalize based on language, but we disable that by using the `'simple'` dictionary (shown in diagram above)
92+
* Create vector optimized for search: store array of _lexemes_ along with positional information
93+
94+
The array of _lexemes_ makes up the tsvector document that we store in `subsriptions_search`, and which we can query through the Postgres function `to_tsquery()`.
95+
96+
We'll demonstrate how this works through a few examples. You can follow along in any postgres shell (v14 or above). If you have Docker installed, run these commands in separate shells.
97+
98+
```sh
99+
docker run --rm -e POSTGRES_HOST_AUTH_METHOD=trust --name pg15 postgres:15
100+
docker exec -i -t pg15 su - postgres -c psql
101+
```
102+
103+
Translating the input text `'color:Light_Blue count:4'` to a tsvector:
104+
105+
```
106+
postgres=# select to_tsvector('simple', 'color:Light_Blue count:4');
107+
to_tsvector
108+
----------------------------------------------
109+
'4':5 'blue':3 'color':1 'count':4 'light':2
110+
(1 row)
111+
```
112+
113+
The result `'4':5 'blue':3 'color':1 'count':4 'light':2` is an array of lexemes combined with information about their position in the original text.
114+
115+
With Postgres function `to_tsdebug()` one can investigate how the input was parsed.
116+
117+
```
118+
postgres=# select * from ts_debug('simple', 'color:Light_Blue count:4');
119+
alias | description | token | dictionaries | dictionary | lexemes
120+
-----------+------------------+-------+--------------+------------+---------
121+
asciiword | Word, all ASCII | color | {simple} | simple | {color}
122+
blank | Space symbols | : | {} | |
123+
asciiword | Word, all ASCII | Light | {simple} | simple | {light}
124+
blank | Space symbols | _ | {} | |
125+
asciiword | Word, all ASCII | Blue | {simple} | simple | {blue}
126+
blank | Space symbols | | {} | |
127+
asciiword | Word, all ASCII | count | {simple} | simple | {count}
128+
blank | Space symbols | : | {} | |
129+
uint | Unsigned integer | 4 | {simple} | simple | {4}
130+
```
131+
132+
Note: when we would write "Light-Blue" with a dash instead of underscore, Postgres translates this into the vector `'blue':4 'light':3 'light-blue':2`, which makes it very complicated to query, in particular for UUID strings. To mitigate this we replace all occurrences of `-` with `_` when creating TS Vectors and before executing TS Queries.
133+
134+
**Querying TS Vectors**
135+
136+
To run queries against TS Vectors we have to prepare a TS Query with Postgres function `to_tsquery()`. We also pass the `'simple'` dictionary here to prevent language-specific normalization.
137+
138+
```
139+
postgres=# select to_tsquery('simple', 'Light_Blue');
140+
to_tsquery
141+
--------------------
142+
'light' <-> 'blue'
143+
```
144+
145+
The resulting TS query means as much as: `the vector must contain 'light' followed by 'blue'`. Note that the input string has been tokenized similar to TS vectors.
146+
147+
We can execute this TS query against the TS vector:
148+
149+
```
150+
postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'Light_Blue');
151+
?column?
152+
----------
153+
t
154+
(1 row)
155+
```
156+
157+
Returns `t` for `true`.
158+
159+
As a final example we can also add an `OR` condition:
160+
161+
```
162+
postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'color <-> Green | count <-> 4');
163+
?column?
164+
----------
165+
t
166+
(1 row)
167+
```
168+
169+
Which returns true because while the vector does not contain the color green, it does contain count 4.
170+
171+
Note that the `color <-> Green | count <-> 4` string passed to `ts_query()` must be constructed in a specific way.
172+
This happens in the orchestrator-core module `search_query.py` as shown in the overview diagram.
173+
174+
### Materialized View subscriptions_search
175+
176+
As mentioned before, `subscriptions_search` is a DB view which lies at the heart of the implementation.
177+
If you're not familiar with database views; they represent a (usually complicated) database query in the form of a "virtual table".
178+
179+
In this case we're using a [Materialized View](https://www.postgresql.org/docs/current/rules-materializedviews.html#RULES-MATERIALIZEDVIEWS) which is like a normal view, except that the "virtual table" is persisted to save resources and increase performance.
180+
The view's data is persisted until it is refreshed, which can be done manually or through an update trigger. (further explained below)
181+
182+
This table has a [GIN index](https://www.postgresql.org/docs/current/textsearch-indexes.html#TEXTSEARCH-INDEXES) for efficient search queries.
183+
184+
**Triggers**
185+
186+
Database function `refresh_subscriptions_search_view` takes care of refreshing `subscriptions_search`. It is called by triggers on the following tables:
187+
188+
* `fi_refresh_search` on table `fixed_inputs`
189+
* `products_refresh_search` on table `products`
190+
* `siv_refresh_search` on table `subscription_instance_values`
191+
* `sub_cust_desc_refresh_search` on table `subscription_customer_descriptions`
192+
* `sub_refresh_search` on table `subscriptions`
193+
194+
The following query shows the current state of the triggers:
195+
196+
```sql
197+
SELECT tgname, tgenabled
198+
FROM pg_trigger
199+
where pg_trigger.tgname in
200+
('fi_refresh_search',
201+
'products_refresh_search',
202+
'sub_cust_desc_refresh_search',
203+
'siv_refresh_search',
204+
'sub_refresh_search');
205+
```
206+
207+
When all triggers are disabled the output looks like:
208+
209+
```
210+
tgname | tgenabled
211+
------------------------------+-----------
212+
fi_refresh_search | D
213+
products_refresh_search | D
214+
siv_refresh_search | D
215+
sub_cust_desc_refresh_search | D
216+
sub_refresh_search | D
217+
```
218+
219+
And when all triggers are enabled the output looks like this:
220+
```
221+
tgname | tgenabled
222+
------------------------------+-----------
223+
fi_refresh_search | O
224+
products_refresh_search | O
225+
siv_refresh_search | O
226+
sub_cust_desc_refresh_search | O
227+
sub_refresh_search | O
228+
```
229+
230+
Enabling all triggers is done with these statements:
231+
232+
```sql
233+
ALTER TABLE fixed_inputs ENABLE TRIGGER fi_refresh_search;
234+
ALTER TABLE products ENABLE TRIGGER products_refresh_search;
235+
ALTER TABLE subscription_customer_descriptions ENABLE TRIGGER sub_cust_desc_refresh_search;
236+
ALTER TABLE subscription_instance_values ENABLE TRIGGER siv_refresh_search;
237+
ALTER TABLE subscriptions ENABLE TRIGGER sub_refresh_search;
238+
```
239+
240+
Disabling all triggers is done with these statements:
241+
242+
```sql
243+
ALTER TABLE fixed_inputs DISABLE TRIGGER fi_refresh_search;
244+
ALTER TABLE products DISABLE TRIGGER products_refresh_search;
245+
ALTER TABLE subscription_customer_descriptions DISABLE TRIGGER sub_cust_desc_refresh_search;
246+
ALTER TABLE subscription_instance_values DISABLE TRIGGER siv_refresh_search;
247+
ALTER TABLE subscriptions DISABLE TRIGGER sub_refresh_search;
248+
```
249+
250+
The following query returns the number of seconds since the last refresh, which can be useful for debugging.
251+
252+
```sql
253+
SELECT extract(epoch from now())::int - coalesce(pg_catalog.obj_description('subscriptions_search'::regclass)::int, 0);
254+
```
255+
256+
**Limitations**
257+
258+
Updating the `subscriptions_search` materialized view is expensive and limited to once every 2 minutes.
259+
260+
This means that when 2 changes happen within, say, 5 seconds of each other, the first change will be picked up directly.
261+
However, the second change will only be processed on the next refresh of the view.
262+
So during that period the second change will not show up in the search results.
170 KB
Loading

orchestrator/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,7 +13,7 @@
1313

1414
"""This is the orchestrator workflow engine."""
1515

16-
__version__ = "2.7.6rc1"
16+
__version__ = "2.7.6rc2"
1717

1818
from orchestrator.app import OrchestratorCore
1919
from orchestrator.settings import app_settings

orchestrator/migrations/versions/schema/2023-09-25_da5c9f4cce1c_add_subscription_metadata_to_fulltext_.sql

Lines changed: 34 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -5,10 +5,17 @@ DROP MATERIALIZED VIEW IF EXISTS subscriptions_search;
55
CREATE MATERIALIZED VIEW IF NOT EXISTS subscriptions_search AS
66
WITH rt_info AS (SELECT s.subscription_id,
77
concat_ws(
8-
' ',
9-
string_agg(rt.resource_type || ' ' || siv.value, ' '),
10-
string_agg(distinct 'subscription_instance_id' || ':' || si.subscription_instance_id, ' ')
11-
) AS rt_info
8+
', ',
9+
string_agg(
10+
rt.resource_type || ':' || siv.value,
11+
', '
12+
ORDER BY rt.resource_type
13+
),
14+
string_agg(
15+
distinct 'subscription_instance_id' || ':' || si.subscription_instance_id,
16+
', '
17+
)
18+
) AS rt_info
1219
FROM subscription_instance_values siv
1320
JOIN resource_types rt ON siv.resource_type_id = rt.resource_type_id
1421
JOIN subscription_instances si ON siv.subscription_instance_id = si.subscription_instance_id
@@ -23,46 +30,48 @@ WITH rt_info AS (SELECT s.subscription_id,
2330
'note:' || coalesce(s.note, ''),
2431
'customer_id:' || s.customer_id,
2532
'product_id:' || s.product_id],
26-
' '
27-
) AS sub_info,
33+
', '
34+
) AS sub_info,
2835
array_to_string(
2936
ARRAY ['product_name:' || p.name,
3037
'product_description:' || p.description,
3138
'tag:' || p.tag,
3239
'product_type:', p.product_type],
33-
' '
34-
) AS prod_info
40+
', '
41+
) AS prod_info
3542
FROM subscriptions s
3643
JOIN products p ON s.product_id = p.product_id),
3744
fi_info AS (SELECT s.subscription_id,
38-
string_agg(fi.name || ':' || fi.value, ' ') AS fi_info
45+
string_agg(
46+
fi.name || ':' || fi.value,
47+
', '
48+
ORDER BY fi.name
49+
) AS fi_info
3950
FROM subscriptions s
4051
JOIN products p ON s.product_id = p.product_id
4152
JOIN fixed_inputs fi ON p.product_id = fi.product_id
4253
GROUP BY s.subscription_id),
4354
cust_info AS (SELECT s.subscription_id,
44-
string_agg('customer_description: ' || scd.description, ' ') AS cust_info
55+
string_agg(
56+
'customer_description: ' || scd.description,
57+
', '
58+
) AS cust_info
4559
FROM subscriptions s
4660
JOIN subscription_customer_descriptions scd ON s.subscription_id = scd.subscription_id
4761
GROUP BY s.subscription_id)
48-
-- to_tsvector handles parsing of hyphened words in a peculiar way and is inconsistent with how to_tsquery parses it in Postgres <14
49-
-- Replacing all hyphens with underscores makes the parsing more predictable and removes some issues arising when searching for subscription ids for example
50-
-- See: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0c4f355c6a5fd437f71349f2f3d5d491382572b7
5162
SELECT s.subscription_id,
5263
to_tsvector(
5364
'simple',
54-
replace(
55-
concat_ws(
56-
' ',
57-
spi.sub_info,
58-
spi.prod_info,
59-
fi.fi_info,
60-
rti.rt_info,
61-
ci.cust_info,
62-
md.metadata::text
63-
),
64-
'-', '_')
65-
) as tsv
65+
concat_ws(
66+
', ',
67+
spi.sub_info,
68+
spi.prod_info,
69+
fi.fi_info,
70+
rti.rt_info,
71+
ci.cust_info,
72+
md.metadata::text
73+
)
74+
) as tsv
6675
FROM subscriptions s
6776
LEFT JOIN sub_prod_info spi ON s.subscription_id = spi.subscription_id
6877
LEFT JOIN fi_info fi ON s.subscription_id = fi.subscription_id

0 commit comments

Comments
 (0)