Scoping "Seen Tables". Improving table scan performance.#3741
Scoping "Seen Tables". Improving table scan performance.#3741
Conversation
6836326 to
6257fd3
Compare
|
❌ 54/55 passed, 1 failed, 5 skipped, 4h37m11s total ❌ test_all_grant_types: AssertionError: assert {('ANONYMOUS ...dummy_twu2b')} == {('ANONYMOUS ..._fzcf6'), ...} (12m12.2s)Running from acceptance #8385 |
6257fd3 to
c51ee78
Compare
JCZuurmond
left a comment
There was a problem hiding this comment.
@FastLee : I have to think about the choice to add a scope for a bit.
It feels redundant to pass the in-scope tables, extract the schemas names from those tables, then skip the in-scope schemas when iterating over all schemas. Why not directly loop over the in-scope schemas?
A similar approach is repeated for the table names, which makes the schema loop even more redundant. Why not directly go to the in-scope tables?
Moreover, we typically provide the scope using an include_ at initialization, which might not work here, but I have to think about that
| def _iter_schemas(self) -> Iterable[SchemaInfo]: | ||
| for catalog in self._iter_catalogs(): | ||
| if catalog.name is None: | ||
| if catalog.name is None or catalog.catalog_type in ( |
There was a problem hiding this comment.
This has probably the same effect as the filter on line 186. If not, move the filter there so that filtering catalogs happens inside the _iter_catalog
| if ( | ||
| schema.catalog_name is None | ||
| or schema.name is None | ||
| or schema.name == "information_schema" |
There was a problem hiding this comment.
Move this into _iter_schema
| tables: list = [] | ||
| logger.info(f"Scanning {len(tasks)} schemas for tables") | ||
| table_lists = Threads.gather("list tables", tasks, self.API_LIMIT) | ||
| # Combine tuple of lists to a list |
There was a problem hiding this comment.
Use itertools.chain instead
| for table_list in table_lists[0]: | ||
| tables.extend(table_list) | ||
| for table in tables: | ||
| if not isinstance(table, TableInfo): |
There was a problem hiding this comment.
Is this supposed to happen? It should not based on the type hinting, right?
| ) | ||
| tables: list = [] | ||
| logger.info(f"Scanning {len(tasks)} schemas for tables") | ||
| table_lists = Threads.gather("list tables", tasks, self.API_LIMIT) |
There was a problem hiding this comment.
It is not an API limit, but the number of threads
Scoped down table scan for migrated tables.