Add Datasource Type Exclusion from Schema Refresh #7572
- Add SCHEMAS_REFRESH_EXCLUDED_TYPES setting with default 'results,python'
- Add type-based exclusion check in refresh_schemas()
- Prevents unnecessary errors for datasources without schema support
|
Thank you for your PR and the detailed description! Just one question:
What kind of performance impact are you worried about?
|
Thank you for the question! You're right - the performance impact of exception catching would be minimal in this case. The concern was more about the implementation approach than actual performance. The exception catching approach would look like:

    try:
        ds.query_runner.get_schema(get_stats=False)
        refresh_schema.delay(ds.id)
    except NotSupported:
        logger.info("skip: no schema support")

However, this approach has a conceptual issue: we'd be calling
Additionally, when there are many datasources:
With type-based exclusion:
That said, the performance difference is likely negligible in practice. The main benefit is code clarity and avoiding unnecessary method calls. If the maintainers prefer the exception catching approach for better automatic detection, I'm happy to change it. What do you think?
|
Thank you very much for your detailed explanation!
Yes, but my feeling is that if we just ignore the exception, it is not "checking" but simply "ignoring".
I think that, at least for the query_results / python data sources you described, calling get_schema() does not initialize any connections.
My feeling is that at the point where we ignore the error, the original issue is already solved.
Yes, it might be meaningless (although the performance impact would be minimal).
I think the impact is very small (probably less than 0.1 sec?). What I feel is that the attributes of each data source (e.g., whether schema listing is supported or not) would be better encapsulated inside each data source class rather than defined in global variables, if possible. What do you think?
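A minimal sketch of what the encapsulated approach suggested here could look like. The attribute name supports_schema_refresh and the stand-in classes Results and Postgres are hypothetical and are not part of this PR or of the existing code base; only the get_schema method and the NotSupported exception are named in this thread.

```python
class NotSupported(Exception):
    """Stand-in for the exception raised when a runner cannot list a schema."""


class BaseQueryRunner:
    # Hypothetical flag (an assumption, not an existing attribute):
    # each runner declares whether schema refresh makes sense for it.
    supports_schema_refresh = True

    def get_schema(self, get_stats=False):
        raise NotSupported()


class Results(BaseQueryRunner):
    # A runner without the concept of a schema opts out on the class itself.
    supports_schema_refresh = False


class Postgres(BaseQueryRunner):
    def get_schema(self, get_stats=False):
        # Illustrative return value only.
        return [{"name": "users", "columns": ["id", "name"]}]


def runners_to_refresh(runners):
    """Keep only the runners that declare schema refresh support."""
    return [r for r in runners if r.supports_schema_refresh]
```

With this shape, the refresh loop would consult the runner class instead of a global list of type names, so a new runner without a schema would only need to set the flag.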
What type of PR is this?
Description
Added functionality to exclude datasource types that don't need or can't perform schema refresh from the schema refresh process.
Background
Some datasource types (results, python, etc.) don't implement the get_schema method, causing NotSupported exceptions during schema refresh, which generates error logs and metrics.
Error logs before fix:
These datasources don't have the concept of schema, so they should be excluded from the beginning.
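The idea of excluding these types up front can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the DataSource dataclass and the select_for_refresh helper are hypothetical, and the real refresh_schemas() also applies the paused, blacklist, and disabled-organization checks.

```python
from dataclasses import dataclass

# Mirrors the default described in this PR: 'results,python'.
EXCLUDED_TYPES = {"results", "python"}


@dataclass
class DataSource:
    # Hypothetical stand-in for the real datasource model.
    id: int
    type: str


def select_for_refresh(data_sources, excluded_types=EXCLUDED_TYPES):
    """Split datasource ids into (to_refresh, skipped) by type name."""
    to_refresh, skipped = [], []
    for ds in data_sources:
        if ds.type in excluded_types:
            # Skipped before any refresh task is queued, so no
            # NotSupported error is ever raised for this datasource.
            skipped.append(ds.id)
        else:
            to_refresh.append(ds.id)
    return to_refresh, skipped
```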
Changes
Flow Diagram
Before Fix:

    flowchart TD
        Start[refresh_schemas start] --> Loop{Each datasource}
        Loop --> Paused{paused?}
        Paused -->|Yes| SkipPaused[Skip: paused]
        Paused -->|No| Blacklist{blacklist?}
        Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
        Blacklist -->|No| OrgDisabled{org.is_disabled?}
        OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
        OrgDisabled -->|No| Execute[Execute refresh_schema]
        Execute --> Error{NotSupported exception}
        Error -->|results/python| ErrorLog[❌ Error logs]
        Error -->|pg/mysql etc| Success[✅ Success]
        SkipPaused --> Loop
        SkipBlacklist --> Loop
        SkipOrg --> Loop
        ErrorLog --> Loop
        Success --> Loop
        Loop --> End[Complete]

After Fix:

    flowchart TD
        Start[refresh_schemas start] --> Loop{Each datasource}
        Loop --> Paused{paused?}
        Paused -->|Yes| SkipPaused[Skip: paused]
        Paused -->|No| Blacklist{blacklist?}
        Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
        Blacklist -->|No| TypeExcluded{type in EXCLUDED_TYPES?}
        TypeExcluded -->|Yes| SkipType[✅ Skip: type_excluded]
        TypeExcluded -->|No| OrgDisabled{org.is_disabled?}
        OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
        OrgDisabled -->|No| Execute[Execute refresh_schema]
        Execute --> Success[✅ Success]
        SkipPaused --> Loop
        SkipBlacklist --> Loop
        SkipType --> Loop
        SkipOrg --> Loop
        Success --> Loop
        Loop --> End[Complete]

Implementation Details
New Setting
- SCHEMAS_REFRESH_EXCLUDED_TYPES: Set of datasource types to exclude
- Environment variable: REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES
- Default: "results,python" (two types that definitely cause errors)
Schema Refresh Logic Update
- refresh_schemas() function
- reason=type_excluded
Benefits
Usage
Default Behavior
Without setting the environment variable, results and python are automatically excluded.
Exclude Additional Types (.env file)
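As a rough sketch of how a comma-separated value of REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES could be turned into a set of type names: the helper names parse_excluded_types and load_excluded_types are hypothetical, and the PR may rely on an existing settings helper instead. The default value is the one stated in this PR.

```python
import os


def parse_excluded_types(raw):
    """Split a comma-separated string into a set, ignoring blanks and spaces."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def load_excluded_types(environ=os.environ):
    """Read the setting from the environment, falling back to the default."""
    raw = environ.get("REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES", "results,python")
    return parse_excluded_types(raw)
```

For example, a value such as "results,python,url" (where url is only an illustrative extra type name) would yield a set containing all three names, while leaving the variable unset would yield results and python.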
How is this tested?
Unit Tests
New test:
- test_skips_excluded_datasource_types: Verifies excluded types are correctly skipped
Existing test compatibility:
- test_calls_refresh_of_all_data_sources: PASSED
- test_skips_paused_data_sources: PASSED
Test Execution Results:
Manual Testing (Verification)
Test Steps:
- results and python datasources
- refresh_schemas()
Execution Command:
Execution Logs:
Verification Results:
- results and python correctly skipped (no errors)
- pg (PostgreSQL) executes normally (not appearing in logs is normal)
Related Tickets & Documents
Fixes #7571
Mobile & Desktop Screenshots/Recordings (if there are UI changes)
N/A (backend-only changes)
Additional Information
Implementation Approach
Initially attempted to automatically detect the presence of the get_schema method, but abandoned due to:
- hasattr() cannot detect because get_schema exists in BaseQueryRunner
Therefore, adopted explicit type name specification approach. This approach:
- ENABLED_QUERY_RUNNERS)
Datasource Types That Don't Need Schema Refresh
The following types don't implement the get_schema method and are candidates for exclusion:
- results - Query Results (references other query results)
- python - Python execution
Backward Compatibility
- results and python in existing environments
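The hasattr() limitation noted under Implementation Approach can be reproduced with a small stand-alone sketch. The classes below are simplified stand-ins written for illustration, assuming only what this PR states: the base class defines get_schema, and runners without schema support end up raising NotSupported.

```python
class NotSupported(Exception):
    """Stand-in for the exception raised by runners without schema support."""


class BaseQueryRunner:
    def get_schema(self, get_stats=False):
        # The base class defines the method, so every subclass inherits it.
        raise NotSupported()


class Results(BaseQueryRunner):
    # Does not override get_schema.
    pass


def detected_by_hasattr(runner):
    """Attribute lookup succeeds through inheritance, so this cannot tell."""
    return hasattr(runner, "get_schema")


def detected_by_call(runner):
    """Only actually calling the method reveals the lack of support."""
    try:
        runner.get_schema()
    except NotSupported:
        return False
    return True
```

Here detected_by_hasattr(Results()) is True even though the call raises NotSupported, which is why attribute-based detection does not work for these runners.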