Add Datasource Type Exclusion from Schema Refresh #7572

snickerjp · 2025-11-14T15:47:36Z

What type of PR is this?

Feature

Description

Adds the ability to exclude datasource types that don't need, or can't perform, a schema refresh from the schema refresh process.

Background

Some datasource types (results, python, etc.) don't implement the get_schema method, causing NotSupported exceptions during schema refresh, which generates error logs and metrics.

Error logs before fix:

[WARNING] Failed refreshing schema for the data source: Query Results
Traceback (most recent call last):
  File "/app/redash/tasks/queries/maintenance.py", line 166, in refresh_schema
    ds.get_schema(refresh=True)
  File "/app/redash/query_runner/__init__.py", line 232, in get_schema
    raise NotSupported()
redash.query_runner.NotSupported
[INFO] task=refresh_schema state=failed ds_id=1 runtime=0.00

[WARNING] Failed refreshing schema for the data source: python
Traceback (most recent call last):
  ...
redash.query_runner.NotSupported
[INFO] task=refresh_schema state=failed ds_id=2 runtime=0.00

These datasources have no concept of a schema, so they should be excluded up front rather than failing on every refresh.

Changes

Flow Diagram

Before Fix:

flowchart TD
    Start[refresh_schemas start] --> Loop{Each datasource}
    Loop --> Paused{paused?}
    Paused -->|Yes| SkipPaused[Skip: paused]
    Paused -->|No| Blacklist{blacklist?}
    Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
    Blacklist -->|No| OrgDisabled{org.is_disabled?}
    OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
    OrgDisabled -->|No| Execute[Execute refresh_schema]
    Execute --> Error{NotSupported exception}
    Error -->|results/python| ErrorLog[❌ Error logs]
    Error -->|pg/mysql etc| Success[✅ Success]
    SkipPaused --> Loop
    SkipBlacklist --> Loop
    SkipOrg --> Loop
    ErrorLog --> Loop
    Success --> Loop
    Loop --> End[Complete]

After Fix:

flowchart TD
    Start[refresh_schemas start] --> Loop{Each datasource}
    Loop --> Paused{paused?}
    Paused -->|Yes| SkipPaused[Skip: paused]
    Paused -->|No| Blacklist{blacklist?}
    Blacklist -->|Yes| SkipBlacklist[Skip: blacklist]
    Blacklist -->|No| TypeExcluded{type in EXCLUDED_TYPES?}
    TypeExcluded -->|Yes| SkipType[✅ Skip: type_excluded]
    TypeExcluded -->|No| OrgDisabled{org.is_disabled?}
    OrgDisabled -->|Yes| SkipOrg[Skip: org_disabled]
    OrgDisabled -->|No| Execute[Execute refresh_schema]
    Execute --> Success[✅ Success]
    SkipPaused --> Loop
    SkipBlacklist --> Loop
    SkipType --> Loop
    SkipOrg --> Loop
    Success --> Loop
    Loop --> End[Complete]

Implementation Details

New Setting

- SCHEMAS_REFRESH_EXCLUDED_TYPES: Set of datasource types to exclude
- Environment variable: REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES
- Default value: "results,python" (two types that definitely cause errors)

Schema Refresh Logic Update

- Added type exclusion check in refresh_schemas() function
- Excluded types are logged with reason=type_excluded
- Maintains consistency with existing exclusion mechanisms (blacklist, paused, org.is_disabled)
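As a rough illustration of the check order in the "After Fix" diagram, here is a minimal, self-contained sketch. It does not import Redash: the `DataSource` dataclass, `BLACKLISTED_IDS`, `skip_reason`, and `plan_refresh` are hypothetical stand-ins, and only the reason strings and the default excluded types come from this description.

```python
from dataclasses import dataclass

# Assumed stand-ins for settings; the real values live in Redash's settings module.
SCHEMAS_REFRESH_EXCLUDED_TYPES = {"results", "python"}
BLACKLISTED_IDS = set()  # hypothetical: IDs excluded by the existing blacklist


@dataclass
class DataSource:
    """Minimal stand-in for a Redash data source (field names are assumptions)."""
    id: int
    type: str
    paused: bool = False
    org_disabled: bool = False


def skip_reason(ds):
    """Return the skip reason for a datasource, or None if it should be refreshed."""
    if ds.paused:
        return "paused"
    if ds.id in BLACKLISTED_IDS:
        return "blacklist"
    if ds.type in SCHEMAS_REFRESH_EXCLUDED_TYPES:
        return "type_excluded"
    if ds.org_disabled:
        return "org_disabled"
    return None


def plan_refresh(data_sources):
    """Split datasources into IDs to refresh and (id, reason) pairs to skip."""
    to_refresh, skipped = [], []
    for ds in data_sources:
        reason = skip_reason(ds)
        if reason is None:
            to_refresh.append(ds.id)
        else:
            skipped.append((ds.id, reason))
    return to_refresh, skipped
```

In this sketch the type check is a plain set membership test on `ds.type`, so an excluded datasource is skipped without touching its query runner.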

Benefits

- Reduces unnecessary error logs and metrics
- Prevents wasteful endpoint access
- Improves schema refresh process efficiency

Usage

Default Behavior

If the environment variable is not set, results and python are excluded automatically.

Exclude Additional Types (.env file)

REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES=results,python,json,url
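For illustration, a comma-separated value like the one above could be turned into a set as sketched below. The helper names `parse_excluded_types` and `load_excluded_types` are hypothetical, and trimming whitespace around entries is an assumption; the parsing used in the actual change may differ.

```python
import os


def parse_excluded_types(raw):
    """Turn a comma-separated string into a set of type names, ignoring blanks."""
    return {item.strip() for item in raw.split(",") if item.strip()}


def load_excluded_types(environ=os.environ):
    """Read the setting from a mapping, falling back to the documented default."""
    # The default "results,python" is taken from this PR description.
    raw = environ.get("REDASH_SCHEMAS_REFRESH_EXCLUDED_TYPES", "results,python")
    return parse_excluded_types(raw)
```

With this sketch, an unset variable yields `{"results", "python"}`, while an empty string yields an empty set, so no type is excluded.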

How is this tested?

Unit tests (pytest)
Manually

Unit Tests

New test:

test_skips_excluded_datasource_types: Verifies excluded types are correctly skipped

Existing test compatibility:

test_calls_refresh_of_all_data_sources: PASSED
test_skips_paused_data_sources: PASSED

Test Execution Results:

3 passed, 21 warnings in 9.07s
✅ test_calls_refresh_of_all_data_sources PASSED
✅ test_skips_excluded_datasource_types PASSED
✅ test_skips_paused_data_sources PASSED

Manual Testing (Verification)

Test Steps:

1. Create results and python datasources
2. Execute refresh_schemas()
3. Check logs

Execution Command:

docker compose exec worker python -c "
from redash import create_app
from redash.tasks.queries.maintenance import refresh_schemas
from redash import models

app = create_app()
with app.app_context():
    print('=== Data sources ===')
    for ds in models.DataSource.query:
        print(f'ID={ds.id} Name={ds.name} Type={ds.type}')
    print()
    print('=== Running refresh_schemas ===')
    refresh_schemas()
"

Execution Logs:

=== Data sources ===
ID=1 Name=Query Results Type=results
ID=2 Name=python Type=python
ID=3 Name=redash Type=pg

=== Running refresh_schemas ===
[INFO] task=refresh_schemas state=start
[INFO] task=refresh_schema state=skip ds_id=1 reason=type_excluded
[INFO] task=refresh_schema state=skip ds_id=2 reason=type_excluded
[INFO] task=refresh_schemas state=finish total_runtime=0.01

Verification Results:

✅ results and python correctly skipped (no errors)
✅ pg (PostgreSQL) is processed normally (it is expected not to appear in these logs)
✅ Error logs and stack traces completely eliminated

Related Tickets & Documents

Fixes #7571

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

N/A (backend-only changes)

Additional Information

Implementation Approach

Initially attempted to automatically detect whether get_schema is implemented, but abandoned this due to:

- hasattr() cannot detect support, because get_schema exists in BaseQueryRunner and is therefore present on every runner
- Checking for a method override is complex and hard to maintain
- Exception catching approach has performance impact

Therefore, adopted the approach of specifying type names explicitly. This approach:

- Is simple and easy to understand
- Works reliably
- Allows flexible control via environment variables
- Is consistent with other Redash settings (like ENABLED_QUERY_RUNNERS)
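To make the hasattr() point concrete, here is a small sketch using simplified stand-in classes, not the real Redash ones. `Results`, `Postgres`, and `overrides_get_schema` are hypothetical names, and the `get_schema(self, get_stats=False)` signature is assumed from the snippets in this thread.

```python
class NotSupported(Exception):
    """Stand-in for redash.query_runner.NotSupported."""


class BaseQueryRunner:
    """Simplified stand-in: the base class defines get_schema and raises."""

    def get_schema(self, get_stats=False):
        raise NotSupported()


class Results(BaseQueryRunner):
    """Does not override get_schema, like the results and python runners."""


class Postgres(BaseQueryRunner):
    """Overrides get_schema with a dummy implementation."""

    def get_schema(self, get_stats=False):
        return [{"name": "users", "columns": ["id", "email"]}]


def overrides_get_schema(runner_cls):
    """True if the class resolves get_schema to something other than the base method."""
    return runner_cls.get_schema is not BaseQueryRunner.get_schema


# hasattr() is true for both classes, so it cannot tell them apart.
print(hasattr(Results, "get_schema"), hasattr(Postgres, "get_schema"))
# Comparing against the base method does distinguish them in this simple case.
print(overrides_get_schema(Results), overrides_get_schema(Postgres))
```

The override comparison works here, but it only reports whether some class in the hierarchy replaced the base method, not whether that replacement actually succeeds, which is part of why an explicit list was preferred.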

Datasource Types That Don't Need Schema Refresh

The following types don't implement the get_schema method and are candidates for exclusion:

- results - Query Results (references other query results)
- python - Python execution
- And potentially many others

Backward Compatibility

- Default value automatically excludes results and python in existing environments
- Can revert to previous behavior (attempt all datasources) by setting the environment variable to an empty string
- Does not affect existing exclusion mechanisms (blacklist, paused, org.is_disabled)


yoshiokatsuneo · 2025-11-14T16:16:06Z

Thank you for your PR and the detailed description!

Just a question.

> Exception catching approach has performance impact

May I ask what kind of performance impact you are worried about?
I just thought there is also the option of ignoring the NotSupported exception.

snickerjp · 2025-11-14T17:15:35Z

Thank you for the question!

You're right - the performance impact of exception catching would be minimal in this case. The concern was more about the implementation approach rather than actual performance.

The exception catching approach would look like:

try:
    ds.query_runner.get_schema(get_stats=False)
    refresh_schema.delay(ds.id)
except NotSupported:
    logger.info("skip: no schema support")

However, this approach has a conceptual issue: we'd be calling get_schema() just to check if it's supported, which feels wrong because:

- get_schema() is meant to actually retrieve schema, not to check capability
- Even with get_stats=False, it might still initialize connections or perform setup
- It's semantically unclear - the code looks like it's trying to get schema, but it's actually just checking support

Additionally, when there are many datasources:

- Exception catching would call get_schema() for every datasource during refresh_schemas() execution (every 30 minutes by default)
- Some query runners might initialize connections when accessing the query_runner property
- Python exception handling has overhead (stack unwinding, traceback creation)

With type-based exclusion:

- Skip check happens before any query runner instantiation
- O(1) set lookup: ds.type in EXCLUDED_TYPES
- No method calls, no exceptions, no overhead

That said, the performance difference is likely negligible in practice. The main benefit is code clarity and avoiding unnecessary method calls.

If the maintainers prefer the exception catching approach for better automatic detection, I'm happy to change it. What do you think?

yoshiokatsuneo · 2025-11-15T07:23:24Z

@snickerjp

Thank you very much for your detailed explanation!

> However, this approach has a conceptual issue: we'd be calling get_schema() just to check if it's supported, which feels wrong because:
> get_schema() is meant to actually retrieve schema, not to check capability

Yes, but my feeling is that if we simply ignore the exception, we are not "checking" but just "ignoring".

> Even with get_stats=False, it might still initialize connections or perform setup

I think that, at least for the query_results / python data sources you described, calling get_schema() does not initialize any connections.

> It's semantically unclear - the code looks like it's trying to get schema, but it's actually just checking support

My feeling is that once we ignore the error, the original issue is already solved.

> Exception catching would call get_schema() for every datasource during refresh_schemas() execution (every 30 minutes by default)

Yes, it might be meaningless. (Although the performance impact will be minimal.)

> Python exception handling has overhead (stack unwinding, traceback creation)

I think the impact is very small. (Probably less than 0.1 sec?)

My feeling is that the attributes of each Data Source (e.g. whether schema listing is supported) are best encapsulated inside each Data Source class rather than defined in global variables, if possible.
If we need to detect whether each Data Source supports get_schema, I think we could add a method (e.g. "is_get_method_supported"?) to each Data Source class. (Although that would make the change bigger, and I'm not sure whether it is worth doing when the main issue (error logging) is already solved.)

What do you think?
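For illustration only, the per-class capability method floated in this comment might look roughly like the sketch below. The classes are simplified stand-ins rather than Redash's real runners, `supports_schema_refresh` is a hypothetical name standing in for the tentative "is_get_method_supported", and defaulting to True is an assumption chosen so that only runners without schema support need to change.

```python
class BaseQueryRunner:
    """Simplified stand-in for the base runner class (not the real implementation)."""

    @classmethod
    def supports_schema_refresh(cls):
        # Hypothetical capability flag; assumed to default to True.
        return True


class Postgres(BaseQueryRunner):
    """Inherits the default and is treated as refreshable."""


class Results(BaseQueryRunner):
    @classmethod
    def supports_schema_refresh(cls):
        # No schema concept, so opt out inside the class itself.
        return False


class Python(BaseQueryRunner):
    @classmethod
    def supports_schema_refresh(cls):
        return False


def refreshable(runner_classes):
    """Return the names of runner classes that declare schema refresh support."""
    return [cls.__name__ for cls in runner_classes if cls.supports_schema_refresh()]
```

Compared with a global exclusion list, this keeps the knowledge inside each class, at the cost of touching every runner that opts out.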

snickerjp added 3 commits November 14, 2025 14:49

Add datasource type exclusion from schema refresh

32cef44

- Add SCHEMAS_REFRESH_EXCLUDED_TYPES setting with default 'results,python'
- Add type-based exclusion check in refresh_schemas()
- Prevents unnecessary errors for datasources without schema support

Add test for datasource type exclusion

6be5135

Fix ruff W293: Remove whitespace from blank line

b2ef5e7
