Skip to content

Conversation

@diegoimbert
Copy link
Contributor

@diegoimbert diegoimbert commented Dec 9, 2025

Detect which specific tables were accessed from a ducklake/datatable/db resource when parsing assets.

@diegoimbert diegoimbert marked this pull request as ready for review December 10, 2025 13:36
@diegoimbert diegoimbert marked this pull request as draft December 10, 2025 13:37
@claude
Copy link
Contributor

claude bot commented Dec 10, 2025

Claude finished @diegoimbert's task in 3m 43s —— View job


Code Review Complete ✅

This PR introduces specific table tracking for database assets (Ducklake, DataTable, and database resources), enabling Windmill to track not just which databases are accessed, but which specific tables within those databases. This is a significant enhancement to the asset tracking system.


Summary of Changes

Backend Changes

  • Asset Parsers Enhanced: SQL, Python, and TypeScript parsers now detect table-level access patterns
  • Access Type Tracking: Distinguishes between Read (R), Write (W), and Read-Write (RW) operations
  • Database Support: Handles ATTACH, USE, and dot-notation syntax (e.g., db.table_name)
  • Path Format: Assets now use hierarchical paths like ducklake_name/table_name or u/user/pg_db/table_name

Frontend Changes

  • UI Updates: Components updated to display and manage specific tables
  • DB Manager: Opens directly to specific tables when clicked from asset lists
  • Asset Display: Improved icons and formatting for table-specific assets
  • Path Parsing: Logic added to extract base database and specific table from asset paths

Code Quality Assessment

✅ Strengths

  1. Comprehensive Testing: All three parsers (SQL, Python, TypeScript) include extensive unit tests covering edge cases
  2. Consistent Design: Similar patterns used across all three language parsers
  3. Type Safety: Strong typing in both Rust backend and TypeScript frontend
  4. Backward Compatibility: Handles assets without specific tables (falls back to database-level tracking)
  5. Access Type Merging: Smart logic to combine R/W access types when same table accessed multiple times

⚠️ Areas for Improvement

I've left inline comments on specific issues. Key concerns:

1. Path Validation & Security

  • No validation that table names don't contain / characters, which could create malformed paths
  • String interpolation in SQL construction (Python/TS parsers) should be reviewed for edge cases
  • Consider sanitizing user input from parsed code before using in paths

2. Access Type Merging Logic (Priority: Medium)

The current logic in asset_parser.rs:46-52 treats (Some(R), None)None. This loses information. Consider:

(None, Some(x)) | (Some(x), None) => Some(x),  // Preserve known access

3. SQL Parser Stack Management (Priority: High)

The Query statement pushes R twice (lines 210 and 344), which could cause incorrect access type detection in nested queries. This needs verification.

4. Frontend Path Parsing

Multiple locations split asset.path by / without validating the expected format:

  • ExploreAssetButton.svelte:61-62 - assumes 3+ segments for resources
  • Similar patterns in ducklake/datatable handlers
  • Should add defensive validation or use a helper function

5. Missing Asset Types in Frontend

lib.ts:parseAssetFromString() doesn't handle ducklake:// or datatable:// prefixes, though these exist in the backend parser.

6. Performance Considerations

  • TypeScript parser clones entire var_identifiers HashMap on every block scope entry
  • API query uses subqueries that could be optimized with JOINs for large datasets
  • Consider adding database indexes if not present

Security Review

✅ No Critical Issues Found

  • Asset paths are validated through parse_asset_syntax()
  • Database queries use parameterized queries (sqlx)
  • No SQL injection vectors identified
  • Proper workspace isolation maintained

💡 Recommendations

  1. Add input sanitization for table names extracted from user code
  2. Document path format expectations to prevent malformed paths
  3. Consider rate limiting asset parsing operations if exposed to untrusted code

Testing Recommendations

The PR includes good unit tests, but consider adding:

  1. Integration tests for full parse → store → retrieve flow
  2. Edge case tests:
    • Table names with special characters (table-name, table.name)
    • Very long paths
    • Unicode table names
    • Nested queries with multiple access types
  3. Performance tests with large codebases
  4. Frontend E2E tests for the DB Manager specific table navigation

Best Practices Adherence

✅ Follows Windmill Guidelines

  • Clean code with single responsibilities
  • Avoids duplication (mostly - see inline comments)
  • Rust performance optimizations applied correctly
  • Proper error handling throughout

⚠️ Minor Deviations

  • Some code duplication in frontend path parsing (see inline comments)
  • Could extract more helper functions for better testability
  • Some SQL query complexity that could be simplified

Reproduction Instructions for Testers

To test the specific asset tables feature, navigate to the script editor and create a new Python, TypeScript, or SQL script. Write code that accesses a database with specific tables - for example, in Python use db = wmill.datatable('mydata') followed by db.query('SELECT * FROM users'), or in SQL use ATTACH 'ducklake://mydl' AS dl; SELECT * FROM dl.customers;. After saving the script, go to the Assets page in the workspace navigation. You should see the specific tables listed (like mydata/users or mydl/customers) rather than just the database names. Each asset should show whether it's being read, written, or both. Click the "Manage" button next to any database asset to open the DB Manager directly to that specific table. In the Flow editor, when you reference a script with database assets, the flow's asset graph should display the specific tables being accessed with appropriate icons and access indicators.


Recommendation

Approve with minor revisions. The implementation is solid and well-tested. The issues identified are mostly minor improvements and edge case handling. The critical issue with SQL parser stack management (Query statement) should be verified, but likely works correctly in practice. Consider addressing the inline comments before merging, particularly:

  1. Fix the Query statement double-push issue if confirmed
  2. Add path validation in frontend
  3. Consider the access type merging logic improvement
  4. Add missing ducklake/datatable parsing in frontend

Overall, this is a valuable feature that significantly improves asset tracking granularity! 🎉

Ok((input, path))
}
// Detect when we do 'a.b' and 'a' is associated with an asset in var_identifiers
// Or when we access 'b' and we did USE a;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential Logic Issue: When accessing tables with USE context, the code constructs paths like {path}/{specific_table}. However, there's no validation that specific_table doesn't contain / characters itself, which could lead to malformed paths.

Consider adding path sanitization or validation for the table name components.

Comment on lines +61 to +64
if name.0.len() == 1 {
let ident = name.0.first()?.as_ident()?;
if ident.quote_style.is_some() {
return None;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential False Negative: This check prevents single identifiers from being matched when access_type is None. However, this might miss legitimate table references in contexts where the access type isn't yet determined.

For example, in a CTE (Common Table Expression) or subquery, you might reference a table before the outer query's access type is known. Consider whether this is the intended behavior.

Comment on lines +84 to +85
let (kind, path) = self.var_identifiers.get(&ident.value)?;
let path = if name.0.len() == 2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Quality: This logic for extracting table names from multi-part object names could be extracted into a helper function for better readability and testability. The pattern is used multiple times in this file.

Comment on lines +288 to +294
sqlparser::ast::Statement::AttachDuckDBDatabase {
database_path,
database_alias,
..
} => {
if let Some((kind, path)) = parse_asset_syntax(&database_path.value, true) {
if kind == AssetKind::Ducklake
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security Note: The parse_asset_syntax function is used here to extract asset information from database paths. Ensure that this function properly validates and sanitizes the input to prevent path traversal attacks or other injection vulnerabilities, especially since these paths are user-controlled (coming from SQL statements).

The current implementation looks safe, but it's worth documenting this security consideration.

Comment on lines +192 to +197
let duckdb_conn_prefix = match kind {
AssetKind::DataTable => "datatable",
AssetKind::Ducklake => "ducklake",
_ => return Ok(()),
};
let sql = format!("ATTACH '{duckdb_conn_prefix}://{path}' AS dt; USE dt; {sql}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential Issue: When constructing the SQL string for parsing, this code directly interpolates the path variable which comes from user input (the Python code being parsed). While this is passed to windmill_parser_sql::parse_assets which should be safe, consider whether there could be any edge cases with unusual characters in the path.

Example: What if path contains quotes or special SQL characters? The SQL parser should handle this, but it's worth verifying.

Comment on lines 38 to +42
fn visit_stmt_assign(&mut self, node: rustpython_ast::StmtAssign) {
// Check if the value is a call to a tracked function
if let Some((kind, name)) = self.extract_asset_from_call(&node.value) {
// Track all target variables
for target in &node.targets {
if let Expr::Name(name_expr) = target {
let Ok(var_name) = name_expr.id.parse::<String>();
self.var_identifiers
.insert(var_name, (kind.clone(), name.clone()));
}
}
} else {
// If not wmill.datatable or similar, remove any tracked variables
// It means the identifier is no longer refering to an asset
for target in &node.targets {
if let Expr::Name(name_expr) = target {
let Ok(var_name) = name_expr.id.parse::<String>();
let removed = self.var_identifiers.remove(&var_name);
// if a db = wmill.datatable() or similar was removed, but never used (e.g db.query(...)),
// we still want to register the asset as unknown access type
match removed {
Some((kind, name)) => {
if self
.assets
.iter()
.all(|a| !(a.kind == kind && a.path == name))
{
self.assets.push(ParseAssetsResult {
kind,
access_type: None,
path: name,
});
}
}
None => {}
if let Some(Expr::Name(expr_name)) = node.targets.first() {
// Remove any tracked variables with that name in case of reassignment
let Ok(var_name) = expr_name.id.parse::<String>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Quality: The error handling here silently ignores the Err result from id.parse::<String>(). While this is probably safe since we're just skipping invalid identifiers, consider logging or at least adding a comment explaining why errors are ignored.

This pattern appears multiple times in the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's infaillible

Comment on lines 60 to +62
if (asset.kind === 'resource' && isDbType(_resourceMetadata?.resource_type)) {
let resourcePath = asset.path.split('/').slice(0, 3).join('/')
let specificTable = asset.path.split('/')[3] as string | undefined
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential Bug: This assumes that the resource path has at least 3 segments (u/user/resource). If the asset path doesn't follow this pattern, slice(0, 3) could produce unexpected results, and accessing [3] could be undefined.

Consider adding validation:

let pathParts = asset.path.split('/')
if (pathParts.length < 3) {
    console.error('Invalid resource path format:', asset.path)
    return
}
let resourcePath = pathParts.slice(0, 3).join('/')
let specificTable = pathParts[3]

FROM asset
LEFT JOIN resource ON asset.kind = 'resource' AND asset.path = resource.path AND resource.workspace_id = $1
LEFT JOIN resource ON asset.kind = 'resource'
AND array_to_string((string_to_array(asset.path, '/'))[1:3], '/') = resource.path -- With specific table, asset path can be e.g u/diego/pg_db/table_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL Query Complexity: This comment indicates complex path manipulation for resources with specific tables. The expression array_to_string((string_to_array(asset.path, '/'))[1:3], '/') is doing array slicing in SQL which can be hard to understand and maintain.

Consider:

  1. Adding a code comment explaining the path format and why this slicing is needed
  2. Or moving this logic to application code where it's easier to test and understand
  3. Adding an example: -- Example: 'u/user/pg_db/table_name' -> 'u/user/pg_db'

Comment on lines 71 to +73
} else if (asset.kind === 'ducklake') {
dbManagerDrawer?.openDrawer({ type: 'ducklake', ducklake: asset.path })
let ducklake = asset.path.split('/')[0]
let specificTable = asset.path.split('/')[1] as string | undefined
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Duplication: The pattern of splitting asset.path and extracting the base path and specific table is repeated for ducklake (lines 72-74) and datatable (lines 76-78). This is also similar to the resource handling (lines 61-62).

Consider extracting a helper function:

function parseAssetPath(path: string, minSegments: number = 1): { basePath: string; specificTable?: string } {
    const parts = path.split('/')
    return {
        basePath: parts.slice(0, minSegments).join('/'),
        specificTable: parts[minSegments]
    }
}

Comment on lines 46 to 47
AND (asset.kind <> 'resource' OR resource.path IS NOT NULL)
AND (asset.usage_kind <> 'flow' OR asset.usage_path = ANY(SELECT path FROM flow WHERE workspace_id = $1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance Consideration: This query filters assets by checking if their usage paths exist in the flow and script tables using ANY(SELECT ...) subqueries. For large datasets, this could be slow.

Consider:

  1. Using JOINs instead of subqueries for better query optimization
  2. Adding indexes on flow.path and script.path if they don't already exist
  3. Or materializing this data if the query becomes a bottleneck

The current approach ensures data consistency (filtering out assets for deleted scripts/flows), but may have performance implications.

Comment on lines +207 to +209
match statement {
sqlparser::ast::Statement::Query(_) => {
// don't forget pop() in post_visit_statement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness Issue: When visiting a Query statement, the code pushes R (read) access type twice - once in pre_visit_statement (line 210) and once in pre_visit_query (line 344). This means after visiting a query, you'll have an extra R on the stack that won't be popped until post_visit_statement.

This could lead to incorrect access type detection for nested structures. The post_visit_query at line 352 pops one R, but the one pushed in pre_visit_statement also needs to be popped in post_visit_statement (line 333).

Verify this is working correctly with nested queries.

@diegoimbert diegoimbert marked this pull request as ready for review December 10, 2025 15:41
@claude
Copy link
Contributor

claude bot commented Dec 10, 2025

Claude finished @diegoimbert's task in 3m 21s —— View job


Code Review Complete ✅

This PR introduces specific table tracking for database assets, enabling Windmill to track not just which databases are accessed, but which specific tables within those databases. This is a valuable enhancement to the asset tracking system.


Summary of Changes

Backend Changes

  • Asset Parsers Enhanced: SQL, Python, and TypeScript parsers now detect table-level access patterns
  • Access Type Tracking: Distinguishes between Read (R), Write (W), and Read-Write (RW) operations
  • Database Support: Handles ATTACH, USE, and dot-notation syntax (e.g., db.table_name)
  • Path Format: Assets now use hierarchical paths like ducklake_name/table_name or u/user/pg_db/table_name

Frontend Changes

  • UI Updates: Components updated to display and manage specific tables
  • DB Manager: Opens directly to specific tables when clicked from asset lists
  • Asset Display: Improved icons and formatting for table-specific assets
  • Path Parsing: Logic added to extract base database and specific table from asset paths

Code Quality Assessment

✅ Strengths

  1. Comprehensive Testing: All three parsers (SQL, Python, TypeScript) include extensive unit tests covering edge cases
  2. Consistent Design: Similar patterns used across all three language parsers
  3. Type Safety: Strong typing in both Rust backend and TypeScript frontend
  4. Backward Compatibility: Handles assets without specific tables (falls back to database-level tracking)
  5. Smart Access Type Merging: Logic to combine R/W access types when same table accessed multiple times

⚠️ Areas for Improvement

Building on the previous review, here are additional concerns and observations:

1. Access Type Merging Logic (Priority: HIGH)

The current logic in asset_parser.rs:46-52 treats (Some(R), None)None, which loses information. This is problematic because:

  • If an asset is first detected with R access, then later with ambiguous access (None), we lose the known R
  • This makes the tracking less useful than it could be

Recommendation: Change the merge logic to preserve known access types:

(None, None) => None,
(None, Some(x)) | (Some(x), None) => Some(x),  // Preserve known access
(Some(R), Some(W)) | (Some(W), Some(R)) => Some(RW),
// ... rest unchanged

2. SQL Parser Stack Management (Priority: HIGH - Needs Verification)

In windmill-parser-sql/src/asset_parser.rs:

  • Line 210: Statement::Query(_) pushes R in pre_visit_statement
  • Line 344: pre_visit_query also pushes R
  • This means nested queries have R pushed twice but only popped once in post_visit_query (line 352)
  • The post_visit_statement (line 333) pops the first one

This could cause incorrect access type detection in nested queries. For example:

INSERT INTO table1 
SELECT * FROM (SELECT * FROM table2)  -- nested query

The inner SELECT might incorrectly inherit W access from the INSERT context due to stack mismanagement.

Recommendation: Verify this behavior with nested query tests, or restructure to avoid double-pushing.

3. Path Validation & Security (Priority: MEDIUM)

Several areas lack input validation:

  • Backend: No validation that table names don't contain / characters before constructing paths like {db}/{table}

    • Example: If user code references table/with/slashes, the path becomes malformed
    • Occurs in: sql/asset_parser.rs:74, sql/asset_parser.rs:87, py/asset_parser.rs:197, ts/asset_parser.rs:190
  • Frontend: Path splitting assumes specific formats without validation

    • ExploreAssetButton.svelte:61-62: Assumes resource paths have ≥3 segments
    • If path is u/user (only 2 segments), pathParts[3] is undefined
    • Similar issues in ducklake/datatable handlers (lines 72-78)

Recommendation:

  • Backend: Sanitize table names before path construction (replace / with _ or reject)
  • Frontend: Add defensive checks or extract to a helper function

4. Frontend Code Duplication (Priority: LOW)

The pattern of splitting asset.path and extracting components is repeated in ExploreAssetButton.svelte:

  • Lines 61-62 (resources)
  • Lines 72-74 (ducklake)
  • Lines 76-78 (datatable)

Recommendation: Extract to a helper function:

function parseAssetPath(path: string, minSegments: number = 1): {
    basePath: string;
    specificTable?: string;
} {
    const parts = path.split('/');
    return {
        basePath: parts.slice(0, minSegments).join('/'),
        specificTable: parts[minSegments]
    };
}

5. Missing Asset Types in Frontend (Priority: MEDIUM)

lib.ts:parseAssetFromString() doesn't handle ducklake:// or datatable:// prefixes:

  • Lines 53-62 only handle res://, $res:, and s3://
  • But backend supports ducklake:// and datatable:// (see parse_asset_syntax in asset_parser.rs:89-92)

Recommendation: Add cases for ducklake and datatable parsing.

6. TypeScript Parser Performance (Priority: LOW)

windmill-parser-ts/src/asset_parser.rs:82:

let saved_var_identifiers = self.var_identifiers.clone();

This clones the entire HashMap on every block scope entry. For deeply nested code with many blocks, this could be inefficient.

Recommendation: Consider using a stack-based approach or reference counting if this becomes a performance bottleneck.

7. API Query Performance (Priority: LOW)

windmill-api/src/assets.rs:47-48:

AND (asset.usage_kind <> 'flow' OR asset.usage_path = ANY(SELECT path FROM flow WHERE workspace_id = $1))
AND (asset.usage_kind <> 'script' OR asset.usage_path = ANY(SELECT path FROM script WHERE workspace_id = $1))

Uses subqueries in the WHERE clause. For large datasets, this could be slow.

Recommendation: Consider using JOINs instead:

LEFT JOIN flow ON asset.usage_kind = 'flow' AND asset.usage_path = flow.path AND flow.workspace_id = $1
LEFT JOIN script ON asset.usage_kind = 'script' AND asset.usage_path = script.path AND script.workspace_id = $1
WHERE asset.workspace_id = $1
  AND (asset.kind <> 'resource' OR resource.path IS NOT NULL)
  AND (asset.usage_kind <> 'flow' OR flow.path IS NOT NULL)
  AND (asset.usage_kind <> 'script' OR script.path IS NOT NULL)

Also ensure indexes exist on flow.path and script.path.


Security Review

✅ No Critical Issues Found

  • Asset paths are validated through parse_asset_syntax()
  • Database queries use parameterized queries (sqlx)
  • No SQL injection vectors identified
  • Proper workspace isolation maintained

💡 Recommendations

  1. Add input sanitization for table names extracted from user code (especially for / characters)
  2. Document path format expectations to prevent malformed paths
  3. Consider rate limiting asset parsing operations if exposed to untrusted code

Testing Recommendations

The PR includes good unit tests, but consider adding:

  1. Integration tests for full parse → store → retrieve flow
  2. Edge case tests:
    • Table names with special characters (table-name, table.name, table/name)
    • Very long paths (stress test)
    • Unicode table names
    • Nested queries with multiple access types (to verify stack management)
    • Queries that reference same table multiple times with different access patterns
  3. Performance tests with large codebases (100+ asset references)
  4. Frontend E2E tests for the DB Manager specific table navigation

Adherence to Best Practices

✅ Follows Windmill Guidelines

  • Clean code with single responsibilities
  • Avoids duplication (mostly - some frontend duplication noted)
  • Rust performance optimizations applied correctly
  • Proper error handling throughout
  • No blocking operations in async contexts

⚠️ Minor Deviations

  • Some code duplication in frontend path parsing
  • Could extract more helper functions for better testability
  • SQL query complexity could be simplified

Additional Observations

  1. Good defensive programming: The parsers handle missing cases gracefully and fall back to unknown access types rather than failing
  2. Excellent test coverage: Each parser has comprehensive tests demonstrating various scenarios
  3. Consistent API: The asset format (kind/path/table) is consistent across all three languages
  4. Developer experience: The feature integrates well with existing UI components (DB Manager opening to specific tables is a nice touch)

Reproduction Instructions for Testers

To test the specific asset tables feature, navigate to the script editor and create a new Python, TypeScript, or SQL script. Write code that accesses a database with specific tables - for example, in Python use db = wmill.datatable('mydata') followed by db.query('SELECT * FROM users'), or in SQL use ATTACH 'ducklake://mydl' AS dl; SELECT * FROM dl.customers;, or in TypeScript use let sql = wmill.datatable('dt') followed by await sql\SELECT * FROM friends`.fetch(). After saving the script, go to the Assets page in the workspace navigation. You should see the specific tables listed (like mydata/usersormydl/customers`) rather than just the database names. Each asset should show whether it's being read, written, or both using R, W, or RW indicators. Click the "Manage" button next to any database asset to open the DB Manager directly to that specific table - the table should be pre-selected in the table list on the left sidebar. In the Flow editor, when you reference a script with database assets, the flow's asset graph should display the specific tables being accessed with appropriate icons and access indicators, distinguishing between read operations (shown with read indicators) and write operations (shown with write indicators).


Recommendation

Approve with minor revisions. The implementation is solid and well-tested. The issues identified are mostly minor improvements and edge case handling. The critical issues should be addressed:

Must Fix Before Merge:

  1. Verify SQL parser stack management for nested queries (or add test confirming it works)
  2. ⚠️ Consider the access type merging logic - current behavior might be intentional but loses information

Should Fix (Can be follow-up):

  1. Add path validation/sanitization for table names with /
  2. Add frontend path parsing validation
  3. Add missing ducklake/datatable parsing in frontend parseAssetFromString
  4. Extract frontend path parsing helper

Nice to Have:

  1. Optimize TypeScript parser HashMap cloning
  2. Optimize API query with JOINs instead of subqueries

Overall, this is a valuable feature that significantly improves asset tracking granularity! The code quality is high, testing is comprehensive, and the implementation follows established patterns. The concerns raised are mostly about edge cases and potential optimizations rather than fundamental issues. 🎉


@rubenfiszel rubenfiszel disabled auto-merge December 11, 2025 13:34
@rubenfiszel rubenfiszel merged commit c6c7f34 into main Dec 11, 2025
4 checks passed
@rubenfiszel rubenfiszel deleted the di/specific-asset-tables-2 branch December 11, 2025 13:34
@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants