-
Notifications
You must be signed in to change notification settings - Fork 853
Specific asset tables #7323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specific asset tables #7323
Conversation
This reverts commit 6e1588d.
|
Claude finished @diegoimbert's task in 3m 43s —— View job Code Review Complete ✅This PR introduces specific table tracking for database assets (Ducklake, DataTable, and database resources), enabling Windmill to track not just which databases are accessed, but which specific tables within those databases. This is a significant enhancement to the asset tracking system. Summary of ChangesBackend Changes
Frontend Changes
Code Quality Assessment✅ Strengths
|
| Ok((input, path)) | ||
| } | ||
| // Detect when we do 'a.b' and 'a' is associated with an asset in var_identifiers | ||
| // Or when we access 'b' and we did USE a; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Logic Issue: When accessing tables with USE context, the code constructs paths like {path}/{specific_table}. However, there's no validation that specific_table doesn't contain / characters itself, which could lead to malformed paths.
Consider adding path sanitization or validation for the table name components.
| if name.0.len() == 1 { | ||
| let ident = name.0.first()?.as_ident()?; | ||
| if ident.quote_style.is_some() { | ||
| return None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential False Negative: This check prevents single identifiers from being matched when access_type is None. However, this might miss legitimate table references in contexts where the access type isn't yet determined.
For example, in a CTE (Common Table Expression) or subquery, you might reference a table before the outer query's access type is known. Consider whether this is the intended behavior.
| let (kind, path) = self.var_identifiers.get(&ident.value)?; | ||
| let path = if name.0.len() == 2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Quality: This logic for extracting table names from multi-part object names could be extracted into a helper function for better readability and testability. The pattern is used multiple times in this file.
| sqlparser::ast::Statement::AttachDuckDBDatabase { | ||
| database_path, | ||
| database_alias, | ||
| .. | ||
| } => { | ||
| if let Some((kind, path)) = parse_asset_syntax(&database_path.value, true) { | ||
| if kind == AssetKind::Ducklake |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Security Note: The parse_asset_syntax function is used here to extract asset information from database paths. Ensure that this function properly validates and sanitizes the input to prevent path traversal attacks or other injection vulnerabilities, especially since these paths are user-controlled (coming from SQL statements).
The current implementation looks safe, but it's worth documenting this security consideration.
| let duckdb_conn_prefix = match kind { | ||
| AssetKind::DataTable => "datatable", | ||
| AssetKind::Ducklake => "ducklake", | ||
| _ => return Ok(()), | ||
| }; | ||
| let sql = format!("ATTACH '{duckdb_conn_prefix}://{path}' AS dt; USE dt; {sql}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Issue: When constructing the SQL string for parsing, this code directly interpolates the path variable which comes from user input (the Python code being parsed). While this is passed to windmill_parser_sql::parse_assets which should be safe, consider whether there could be any edge cases with unusual characters in the path.
Example: What if path contains quotes or special SQL characters? The SQL parser should handle this, but it's worth verifying.
| fn visit_stmt_assign(&mut self, node: rustpython_ast::StmtAssign) { | ||
| // Check if the value is a call to a tracked function | ||
| if let Some((kind, name)) = self.extract_asset_from_call(&node.value) { | ||
| // Track all target variables | ||
| for target in &node.targets { | ||
| if let Expr::Name(name_expr) = target { | ||
| let Ok(var_name) = name_expr.id.parse::<String>(); | ||
| self.var_identifiers | ||
| .insert(var_name, (kind.clone(), name.clone())); | ||
| } | ||
| } | ||
| } else { | ||
| // If not wmill.datatable or similar, remove any tracked variables | ||
| // It means the identifier is no longer refering to an asset | ||
| for target in &node.targets { | ||
| if let Expr::Name(name_expr) = target { | ||
| let Ok(var_name) = name_expr.id.parse::<String>(); | ||
| let removed = self.var_identifiers.remove(&var_name); | ||
| // if a db = wmill.datatable() or similar was removed, but never used (e.g db.query(...)), | ||
| // we still want to register the asset as unknown access type | ||
| match removed { | ||
| Some((kind, name)) => { | ||
| if self | ||
| .assets | ||
| .iter() | ||
| .all(|a| !(a.kind == kind && a.path == name)) | ||
| { | ||
| self.assets.push(ParseAssetsResult { | ||
| kind, | ||
| access_type: None, | ||
| path: name, | ||
| }); | ||
| } | ||
| } | ||
| None => {} | ||
| if let Some(Expr::Name(expr_name)) = node.targets.first() { | ||
| // Remove any tracked variables with that name in case of reassignment | ||
| let Ok(var_name) = expr_name.id.parse::<String>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Quality: The error handling here silently ignores the Err result from id.parse::<String>(). While this is probably safe since we're just skipping invalid identifiers, consider logging or at least adding a comment explaining why errors are ignored.
This pattern appears multiple times in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's infaillible
| if (asset.kind === 'resource' && isDbType(_resourceMetadata?.resource_type)) { | ||
| let resourcePath = asset.path.split('/').slice(0, 3).join('/') | ||
| let specificTable = asset.path.split('/')[3] as string | undefined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Bug: This assumes that the resource path has at least 3 segments (u/user/resource). If the asset path doesn't follow this pattern, slice(0, 3) could produce unexpected results, and accessing [3] could be undefined.
Consider adding validation:
let pathParts = asset.path.split('/')
if (pathParts.length < 3) {
console.error('Invalid resource path format:', asset.path)
return
}
let resourcePath = pathParts.slice(0, 3).join('/')
let specificTable = pathParts[3]| FROM asset | ||
| LEFT JOIN resource ON asset.kind = 'resource' AND asset.path = resource.path AND resource.workspace_id = $1 | ||
| LEFT JOIN resource ON asset.kind = 'resource' | ||
| AND array_to_string((string_to_array(asset.path, '/'))[1:3], '/') = resource.path -- With specific table, asset path can be e.g u/diego/pg_db/table_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL Query Complexity: This comment indicates complex path manipulation for resources with specific tables. The expression array_to_string((string_to_array(asset.path, '/'))[1:3], '/') is doing array slicing in SQL which can be hard to understand and maintain.
Consider:
- Adding a code comment explaining the path format and why this slicing is needed
- Or moving this logic to application code where it's easier to test and understand
- Adding an example:
-- Example: 'u/user/pg_db/table_name' -> 'u/user/pg_db'
| } else if (asset.kind === 'ducklake') { | ||
| dbManagerDrawer?.openDrawer({ type: 'ducklake', ducklake: asset.path }) | ||
| let ducklake = asset.path.split('/')[0] | ||
| let specificTable = asset.path.split('/')[1] as string | undefined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Duplication: The pattern of splitting asset.path and extracting the base path and specific table is repeated for ducklake (lines 72-74) and datatable (lines 76-78). This is also similar to the resource handling (lines 61-62).
Consider extracting a helper function:
function parseAssetPath(path: string, minSegments: number = 1): { basePath: string; specificTable?: string } {
const parts = path.split('/')
return {
basePath: parts.slice(0, minSegments).join('/'),
specificTable: parts[minSegments]
}
}| AND (asset.kind <> 'resource' OR resource.path IS NOT NULL) | ||
| AND (asset.usage_kind <> 'flow' OR asset.usage_path = ANY(SELECT path FROM flow WHERE workspace_id = $1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance Consideration: This query filters assets by checking if their usage paths exist in the flow and script tables using ANY(SELECT ...) subqueries. For large datasets, this could be slow.
Consider:
- Using JOINs instead of subqueries for better query optimization
- Adding indexes on
flow.pathandscript.pathif they don't already exist - Or materializing this data if the query becomes a bottleneck
The current approach ensures data consistency (filtering out assets for deleted scripts/flows), but may have performance implications.
| match statement { | ||
| sqlparser::ast::Statement::Query(_) => { | ||
| // don't forget pop() in post_visit_statement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correctness Issue: When visiting a Query statement, the code pushes R (read) access type twice - once in pre_visit_statement (line 210) and once in pre_visit_query (line 344). This means after visiting a query, you'll have an extra R on the stack that won't be popped until post_visit_statement.
This could lead to incorrect access type detection for nested structures. The post_visit_query at line 352 pops one R, but the one pushed in pre_visit_statement also needs to be popped in post_visit_statement (line 333).
Verify this is working correctly with nested queries.
|
Claude finished @diegoimbert's task in 3m 21s —— View job Code Review Complete ✅This PR introduces specific table tracking for database assets, enabling Windmill to track not just which databases are accessed, but which specific tables within those databases. This is a valuable enhancement to the asset tracking system. Summary of ChangesBackend Changes
Frontend Changes
Code Quality Assessment✅ Strengths
|
Detect which specific tables were accessed from a ducklake/datatable/db resource when parsing assets.