chore(wren-ai-service): improve add quotes #1913
Conversation
Walkthrough

Moved SQL quoting into `src.core.engine.add_quotes`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Caller
    participant Engine as src.core.engine.add_quotes
    participant Tokenizer as Tokenizer / Token stream
    Caller->>Engine: add_quotes(sql)
    Engine->>Engine: squish_sql(sql)
    Engine->>Tokenizer: tokenize(sql)
    Tokenizer-->>Engine: token stream
    Engine->>Engine: analyze tokens (IDENT, DOT, STAR, L_PAREN, ...)
    alt dotted identifiers to quote
        Engine->>Engine: record edits (quote each identifier in chain)
        note right of Engine #eef2ff: edits recorded right-to-left to\npreserve original offsets
    end
    Engine->>Engine: apply edits -> quoted_sql
    Engine-->>Caller: (quoted_sql, "")
    alt on error
        Engine-->>Caller: ("", "error message")
    end
```
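The diagrammed flow (tokenize, record edits, apply right-to-left) can be illustrated with a stdlib-only sketch. This uses a naive regex instead of sqlglot's tokenizer, a tiny hardcoded keyword list, and ignores already-quoted identifiers, so it is an assumption-laden illustration rather than the engine's actual implementation:

```python
import re

# Hypothetical, simplified stand-in for the tokenizer-based add_quotes:
# find dotted identifier chains, record (start, end, replacement) edits,
# then apply them right-to-left so earlier offsets stay valid.
_CHAIN = re.compile(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*\b")
_KEYWORDS = {"select", "from", "where", "as", "join", "on", "and", "or"}

def add_quotes_sketch(sql: str) -> tuple[str, str]:
    try:
        edits = []  # (start, end, replacement)
        for m in _CHAIN.finditer(sql):
            text = m.group(0)
            if text.lower() in _KEYWORDS:
                continue
            if sql[m.end():m.end() + 1] == "(":  # function call -> skip
                continue
            quoted = ".".join(f'"{part}"' for part in text.split("."))
            edits.append((m.start(), m.end(), quoted))
        out = sql
        # Apply edits right-to-left so earlier offsets remain valid.
        for start, end, replacement in reversed(edits):
            out = out[:start] + replacement + out[end:]
        return out, ""
    except Exception as e:
        return "", str(e)
```

The right-to-left application is the key trick from the diagram's note: replacing text near the end of the string first means the recorded offsets for earlier edits never shift.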
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
wren-ai-service/eval/utils.py (1)
Lines 27-35: Fix multi-statement quoting and add an optional dialect parameter in add_quotes

The current implementation in wren-ai-service/eval/utils.py silently drops all but the first statement when given a multi-statement SQL string, and uses a default parser configuration that can fail on unquoted reserved words. Based on the test output (sqlglot v27.8.0):

- Multi-statement input like `select a from t; select b from t` is parsed by parse_one as only the first statement, producing `SELECT "a" FROM "t"` and dropping `select b from t`.
- Dialect-specific quoting (e.g. PostgreSQL, BigQuery, DuckDB) works as expected when you pass `read=` and `dialect=`, but the default (no dialect) parser errors on reserved identifiers like `select select, from from`.

To address these issues, update add_quotes as follows:

- Change the signature to accept an optional dialect:

```diff
-def add_quotes(sql: str) -> Tuple[str, bool]:
+def add_quotes(sql: str, dialect: Optional[str] = None) -> Tuple[str, bool]:
```

- Parse all statements and preserve them:

```diff
-    node = parse_one(sql, read=dialect) if dialect else parse_one(sql)
-    quoted_sql = node.sql(dialect=dialect, identify=True) if dialect else node.sql(identify=True)
+    # Use parse(...) to handle multi-statement inputs
+    expressions = parse(sql, read=dialect)  # returns List[Expression]
+    quoted_statements = [
+        expr.sql(dialect=dialect, identify=True)
+        for expr in expressions
+    ]
+    quoted_sql = "; ".join(quoted_statements)
```

- Thread the dialect through both parsing and SQL generation:

```diff
-    quoted_sql = parse_one(sql).sql(identify=True)
+    expressions = parse(sql, read=dialect)
+    quoted_sql = "; ".join(
+        expr.sql(dialect=dialect, identify=True)
+        for expr in expressions
+    )
```

- Return the full, re-quoted script:

```diff
     return quoted_sql, True
```

These changes ensure:

- All statements in a multi-statement SQL string are preserved and properly quoted.
- Dialect-specific quoting is deterministic and matches what connectors like DuckDB, PostgreSQL, and BigQuery expect.
- Reserved identifiers parse and quote correctly when a dialect is provided, and errors on unquoted reserved words are avoided.
🧹 Nitpick comments (3)
wren-ai-service/src/core/engine.py (2)
Line 8: Narrow import looks good; consider future-proofing for dialect-specific quoting.

Switching to `from sqlglot import parse_one` is fine and reduces namespace clutter. To avoid subtle differences across data sources (e.g., DuckDB vs BigQuery quoting), consider making the quoting dialect explicit at the call sites (see the suggestion below on the add_quotes function for a small API extension).
Lines 52-60: Make quoting deterministic across engines, avoid logging raw SQL, and align failure behavior.

- Determinism: `parse_one(sql).sql(identify=True)` uses default dialects; output may vary from what each connector expects. Add an optional `dialect` arg and pass it through to both parsing and generation to lock the quoting style (e.g., "duckdb", "postgres", "bigquery").
- Logging hygiene: logging full SQL at exception level can leak sensitive schema or literal info. Prefer a generic message (the stack trace is still included) and avoid embedding the SQL.
- Failure behavior: returning `""` on error forces all callers to handle empty SQL specially. Consider returning the original `sql` (like utils.add_quotes does) plus the error string, so downstream code can decide whether to proceed or fail.

Apply this diff within the current block:

```diff
-def add_quotes(sql: str) -> Tuple[str, str]:
+def add_quotes(sql: str, dialect: Optional[str] = None) -> Tuple[str, str]:
     try:
-        quoted_sql = parse_one(sql).sql(identify=True)
+        node = parse_one(sql, read=dialect) if dialect else parse_one(sql)
+        quoted_sql = (
+            node.sql(dialect=dialect, identify=True)
+            if dialect
+            else node.sql(identify=True)
+        )
     except Exception as e:
-        logger.exception(f"Error in adding quotes to {sql}: {e}")
-
-        return "", str(e)
+        # Avoid logging raw SQL to reduce leakage; stack trace is preserved.
+        logger.exception("Error adding quotes to SQL")
+        return sql, str(e)
     return quoted_sql, ""
```

To ensure we pick the right dialects at call sites, please confirm which connectors can be used here and wire the corresponding dialect (e.g., "duckdb" for DuckDB, "postgres" for Postgres, "bigquery" for BigQuery). I can follow up with a small patch once you confirm the list.
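If raw SQL must never appear in logs but failures still need to be correlatable, truncation plus a digest is one option. A stdlib-only sketch (the function name and the 512-character cutoff are arbitrary choices for illustration, not project code):

```python
import hashlib
import logging

logger = logging.getLogger("wren-ai-service")

def redact_sql(sql: str, limit: int = 512) -> str:
    """Truncate long SQL and append a short digest so log lines stay
    correlatable without emitting the full (possibly sensitive) statement."""
    digest = hashlib.sha1(sql.encode("utf-8")).hexdigest()[:12]
    body = sql[:limit] + "..." if len(sql) > limit else sql
    return f"{body} [sha1:{digest}]"

# Example use inside an exception handler:
#     logger.exception("Error adding quotes to SQL: %s", redact_sql(sql))
```

The digest lets two log lines for the same statement be matched without storing the statement itself.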
wren-ai-service/eval/utils.py (1)
Lines 27-35: Replace print-based error reporting with structured logging and keep stack traces.

Using print here is inconsistent with engine.py, which already uses a logger. Prefer a module logger, and avoid logging the raw SQL string to reduce leakage risk. The function's return contract stays the same.

Apply this diff within the current block:

```diff
 def add_quotes(sql: str) -> Tuple[str, bool]:
     try:
         quoted_sql = parse_one(sql).sql(identify=True)
         return quoted_sql, True
     except Exception as e:
-        print(f"Error in adding quotes to SQL: {sql}")
-        print(f"Error: {e}")
-        return sql, False
+        logger.exception("Error adding quotes to SQL")
+        return sql, False
```

Add this near the imports (outside the selected range):

```python
import logging

logger = logging.getLogger("wren-ai-service")
```

I can push a follow-up commit that introduces the logger in this module and removes the print statements if you'd like.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- wren-ai-service/eval/utils.py (2 hunks)
- wren-ai-service/src/core/engine.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/eval/utils.py (1)
Line 17: LGTM on narrowed import.

`from sqlglot import parse_one` is clearer and avoids importing the whole module namespace.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
wren-ai-service/eval/data_curation/utils.py (1)
Lines 47-48: Fix success check for add_quotes: empty string means "no error".

The second return value is an error message. Asserting truthiness will fail on success.

Apply this diff:

```diff
-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert error == "", f"Error in quoting SQL: {error}"
```

wren-ai-service/tools/run_sql.py (2)
Lines 48-50: Fix success check for add_quotes in the duckdb path.

An empty string means no error; the current assert fails on success.

Apply this diff:

```diff
-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in adding quotes to SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert error == "", f"Error in adding quotes to SQL: {error}"
```

Lines 71-73: Fix success check for add_quotes in the non-duckdb path.

Same issue as above.

Apply this diff:

```diff
-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in adding quotes to SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert error == "", f"Error in adding quotes to SQL: {error}"
```
🧹 Nitpick comments (10)
wren-ai-service/src/core/engine.py (2)
Lines 118-136: Quote qualifiers in dotted function calls; skip only the function name.

Current logic skips the entire dotted chain when followed by '(', so qualifiers like db.schema remain unquoted in e.g. db.schema.my_func(...). Prefer quoting the qualifiers while leaving the final identifier (the function name) unquoted.

Apply this diff to adjust the behavior:

```diff
-            # If the next token after the chain is '(', it's a function call -> skip
-            if j + 1 < n and toks[j + 1].token_type == TokenType.L_PAREN:
-                i = j + 1
-                continue
+            # If the next token after the chain is '(', it's a function call.
+            # Quote only the qualifiers in the chain and skip the function name.
+            is_func_call = j + 1 < n and toks[j + 1].token_type == TokenType.L_PAREN
+            last_ident_index = len(chain_tokens) - 1 if is_func_call else None
@@
             for k in range(
                 0, len(chain_tokens), 2
             ):  # Process only identifiers (skip dots)
-                ident_token = chain_tokens[k]
+                if is_func_call and k == last_ident_index:
+                    # leave function name unquoted
+                    continue
+                ident_token = chain_tokens[k]
                 token_text = sql[ident_token.start : ident_token.end + 1]
```

Consider adding a unit test covering dataset-qualified UDFs: SELECT db.schema.fn(col) FROM t; expecting '"db"."schema".fn("col")'. Do you want me to add it?
Lines 146-148: Avoid logging full SQL on failures to reduce PII leakage; truncate or hash.

Logging the entire SQL at error level can leak sensitive data. Truncate it and include an MD5/SHA1 digest to correlate logs.

Apply this diff to redact and truncate:

```diff
-        logger.exception(f"Error in adding quotes to {sql}: {e}")
+        redacted_sql = (sql[:512] + "...") if len(sql) > 512 else sql
+        logger.exception(f"Error in add_quotes (len={len(sql)}): {redacted_sql} :: {e}")
```

wren-ai-service/eval/data_curation/utils.py (5)
Lines 215-218: Use the function parameter data_source instead of st.session_state for endpoint selection.

Mixing sources can lead to mismatches when session state differs from the argument.

Apply this diff:

```diff
-        WREN_ENGINE_ENDPOINT
-        if st.session_state["data_source"] == "duckdb"
-        else WREN_IBIS_ENDPOINT,
+        WREN_ENGINE_ENDPOINT if data_source == "duckdb" else WREN_IBIS_ENDPOINT,
```
Lines 221-229: Typo: rename quesiton_sql_pair → question_sql_pair for clarity.

Keeps the comprehension readable and consistent.

Apply this diff:

```diff
         {
-            **quesiton_sql_pair,
+            **question_sql_pair,
             "context": context,
             "data": sql_data,
             "document": document,
         }
-        for quesiton_sql_pair, context, sql_data, document in zip(
+        for question_sql_pair, context, sql_data, document in zip(
             question_sql_pairs, contexts, sqls_data, documents
         )
```
Line 25: Deduplicate remove_limit_statement: import it from core.engine rather than reimplementing.

Reduces drift and keeps behavior consistent across modules.

Apply this diff:

```diff
-from src.core.engine import add_quotes
+from src.core.engine import add_quotes, remove_limit_statement
@@
-def remove_limit_statement(sql: str) -> str:
-    pattern = r"\s*LIMIT\s+\d+(\s*;?\s*--.*|\s*;?\s*)$"
-    modified_sql = re.sub(pattern, "", sql, flags=re.IGNORECASE)
-
-    return modified_sql
+# Use remove_limit_statement from src.core.engine
```

Also applies to: 273-278
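For reference, the pattern quoted in the diff above behaves like this minimal standalone version (a sketch reproducing that regex, not the module itself):

```python
import re

# Pattern copied from the reimplementation shown in the diff above.
_LIMIT_RE = re.compile(r"\s*LIMIT\s+\d+(\s*;?\s*--.*|\s*;?\s*)$", re.IGNORECASE)

def remove_limit_statement(sql: str) -> str:
    """Strip a trailing LIMIT clause, optionally followed by ';' or a -- comment."""
    return _LIMIT_RE.sub("", sql)
```

Note the anchor `$`: only a LIMIT at the very end of the string is removed, so a LIMIT inside a string literal mid-query is untouched.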
Lines 84-104: Avoid creating unused aiohttp.ClientSession contexts.

Both get_validated_question_sql_pairs and get_contexts_from_sqls open a ClientSession but never use it, which only adds overhead.

Consider removing the unnecessary contexts:

```diff
-    async with aiohttp.ClientSession():
-        for question_sql_pair in question_sql_pairs:
+    for question_sql_pair in question_sql_pairs:
         ...
-        results = await asyncio.gather(*tasks)
-        return [
+    results = await asyncio.gather(*tasks)
+    return [
```

Repeat similarly for get_contexts_from_sqls. If you intend to reuse a session for downstream calls, thread it through function parameters and use it.

Also applies to: 119-141
Lines 38-49: The type annotation of is_sql_valid should reflect Optional[str] for the error.

The function returns (True, None) on success; annotate as Tuple[bool, Optional[str]] or normalize to always return a string.

Apply this diff:

```diff
 async def is_sql_valid(
     sql: str,
@@
-) -> Tuple[bool, str]:
+) -> Tuple[bool, Optional[str]]:
```

wren-ai-service/tools/run_sql.py (1)
Lines 40-46: Optional: deduplicate quoting logic across branches.

Compute quoted_sql once before the conditional to simplify and avoid divergence.

Example:

```diff
 def get_data_from_wren_engine(
     sql: str,
@@
 ):
-    if dataset_type == "duckdb":
-        quoted_sql, error = add_quotes(sql)
-        assert error == "", f"Error in adding quotes to SQL: {error}"
+    quoted_sql, error = add_quotes(sql)
+    assert error == "", f"Error in adding quotes to SQL: {error}"
+    if dataset_type == "duckdb":
         response = requests.get(
```

Also applies to: 71-73
wren-ai-service/tests/pytest/test_engine.py (2)
Lines 51-62: Add a dataset-qualified function-call test.

Ensure only the function name remains unquoted while qualifiers are quoted (aligning with the proposed engine change).

Add this test:

```diff
+    def test_add_quotes_dataset_qualified_function_call(self):
+        sql = "SELECT project.dataset.my_func(col) FROM tbl"
+        result, error = add_quotes(sql)
+        assert error == ""
+        assert '"project"."dataset".my_func("col")' in result
+        assert '"tbl"' in result
```
Lines 212-231: Protect squish_sql from modifying string literals.

Add a regression test to prevent future regressions where squish_sql alters content inside quoted strings.

Append this test:

```diff
+    def test_squish_sql_preserves_string_literals(self):
+        sql = "SELECT 'a b', \"c d\", `e f`, [g h]"
+        result = squish_sql(sql)
+        assert result.startswith("SELECT")
+        assert "'a b'" in result
+        assert '"c d"' in result
+        assert "`e f`" in result
+        assert "[g h]" in result
```
📜 Review details
📒 Files selected for processing (5)

- wren-ai-service/eval/data_curation/utils.py (1 hunks)
- wren-ai-service/eval/utils.py (2 hunks)
- wren-ai-service/src/core/engine.py (2 hunks)
- wren-ai-service/tests/pytest/test_engine.py (1 hunks)
- wren-ai-service/tools/run_sql.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- wren-ai-service/eval/utils.py
🧰 Additional context used
🧬 Code graph analysis (3)
wren-ai-service/eval/data_curation/utils.py (1)
- wren-ai-service/src/core/engine.py (1): add_quotes (57-150)

wren-ai-service/tests/pytest/test_engine.py (1)
- wren-ai-service/src/core/engine.py (4): add_quotes (57-150), clean_generation_result (30-42), remove_limit_statement (45-49), squish_sql (52-54)

wren-ai-service/tools/run_sql.py (1)
- wren-ai-service/src/core/engine.py (1): add_quotes (57-150)
🔇 Additional comments (2)
wren-ai-service/src/core/engine.py (1)
Line 8: A dependency-check script (elided here) verifies whether sqlglot is declared in the project's dependency manifests (requirements.txt, pyproject.toml, setup.py) or referenced elsewhere. Its output will confirm whether a version pin exists or one needs to be added; the review comment will be updated accordingly.
wren-ai-service/tests/pytest/test_engine.py (1)
Lines 91-103: Revisit the expectation that "SQL" is always an identifier.

Depending on the tokenizer/dialect, SQL might not be treated as an identifier when standalone. If this test flakes across versions, consider loosening or removing the '"SQL"' assertion.

Would you like me to parameterize the test by sqlglot version/dialect, or relax the assertion to only check quoting on non-keyword tokens (e.g., "INVALID" and "PARENTHESES")?
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
wren-ai-service/eval/utils.py (3)
Lines 189-197: Fix the regex to support both 'bird_' and 'spider_' filenames.

The current code returns None for spider_... filenames because it always reads group(1). Use a single named group.

```diff
-    match = re.search(
-        r"bird_(.+?)_eval_dataset\.toml|spider_(.+?)_eval_dataset\.toml", path
-    )
-    if match:
-        return match.group(1)
+    match = re.search(
+        r"(?:bird|spider)_(?P<db>.+?)_eval_dataset\.toml", path
+    )
+    if match:
+        return match.group("db")
```
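The corrected pattern can be exercised standalone (parse_db_name is a hypothetical wrapper for illustration, not the project's function name):

```python
import re

def parse_db_name(path: str):
    """Extract the database name from either filename scheme via one named group."""
    match = re.search(r"(?:bird|spider)_(?P<db>.+?)_eval_dataset\.toml", path)
    return match.group("db") if match else None
```

Because the prefix alternation is non-capturing, the named group `db` is populated for both schemes, avoiding the group(1)-is-None pitfall of the original alternation.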
Lines 222-224: Avoid a mutable default argument for pipe_components.

Using {} as a default can leak state across calls.

```diff
-def engine_config(
-    mdl: dict, pipe_components: dict[str, Any] = {}, path: str = ""
-) -> dict:
+def engine_config(
+    mdl: dict, pipe_components: Optional[dict[str, Any]] = None, path: str = ""
+) -> dict:
+    if pipe_components is None:
+        pipe_components = {}
```
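The leak is easy to demonstrate in isolation (shared_config/fresh_config are illustrative names, not from the codebase):

```python
def shared_config(components={}):  # BUG: the default dict is created once and reused
    components["calls"] = components.get("calls", 0) + 1
    return components

def fresh_config(components=None):  # Fix: allocate a new dict per call
    if components is None:
        components = {}
    components["calls"] = components.get("calls", 0) + 1
    return components
```

Default values are evaluated once at function definition time, so every call without an argument mutates the same dict.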
Lines 159-164: Adjust add_quotes usage across all call sites.

The helper add_quotes returns (quoted_sql, error_message) with an empty string on success, not a boolean. Any code unpacking the second element as no_error: bool will mis-handle errors. We need to:

- In wren-ai-service/eval/utils.py (lines 159-164), rename no_error → error, invert the condition, and include the real error message:

```diff
-    quoted_sql, no_error = add_quotes(sql)
-    if not no_error:
-        print(f"Error in quoting SQL: {sql}")
-        quoted_sql = sql
+    quoted_sql, error = add_quotes(sql)
+    if error:
+        print(f"Error quoting SQL: {error}; sql={sql}")
+        quoted_sql = sql
```

- In wren-ai-service/eval/utils.py (lines 36-37), change the early assert to check for an empty error string:

```diff
-    quoted_sql, error = add_quotes(sql)
-    assert not error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert not error, f"Error in quoting SQL: {error}; sql={sql}"
```

- In wren-ai-service/eval/data_curation/utils.py (lines 47-48), unpack into error and assert accordingly:

```diff
-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert not error, f"Error in quoting SQL: {error}; sql={sql}"
```

- In wren-ai-service/tools/run_sql.py (lines 48-49 and 71-72), update the unpacking and assertions:

```diff
-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in adding quotes to SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert not error, f"Error adding quotes to SQL: {error}; sql={sql}"
```

Please apply these fixes across any other call sites (e.g., in src/pipelines/generation/utils/sql.py, rename error_message to error or handle it consistently). This is required to prevent silent failures or inverted error handling.
🧹 Nitpick comments (3)
wren-ai-service/eval/utils.py (3)
Lines 36-37: Prefer explicit error handling over assert; include error details.

assert can be optimized away (e.g., under python -O) and hides failures. Raise with context (or fall back) instead.

```diff
-    quoted_sql, error = add_quotes(sql)
-    assert not error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    if error:
+        raise ValueError(f"Error quoting SQL: {error}; sql={sql!r}")
```
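The assert-free contract looks like this in isolation (require_quoted is a hypothetical helper name; the (quoted_sql, error) pair mirrors the add_quotes return contract described above):

```python
def require_quoted(quoted_sql: str, error: str, sql: str) -> str:
    """Fail loudly with context; unlike assert, this check survives `python -O`."""
    if error:  # empty string means success
        raise ValueError(f"Error quoting SQL: {error}; sql={sql!r}")
    return quoted_sql
```

Because assertions are stripped when Python runs with optimizations enabled, an explicit raise is the only way to guarantee the error path executes in all environments.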
Lines 49-55: Avoid JSON parsing before the status check (duckdb path).

If the response is non-JSON on error, await response.json() will raise.

```diff
     ) as response:
-        data = await response.json()
-
-        if response.status != 200:
-            return {"data": [], "columns": []}
+        if response.status != 200:
+            return {"data": [], "columns": []}
+        data = await response.json()
```
Line 7: Nit: mixed typing styles.

You're mixing built-in generics (list[str]) with typing.List/Dict. It works, but consider unifying for consistency.
📜 Review details
📒 Files selected for processing (1)
- wren-ai-service/eval/utils.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
wren-ai-service/eval/utils.py (1)
- wren-ai-service/src/core/engine.py (1): add_quotes (57-150)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pytest
- GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/eval/utils.py (1)
Line 21: LGTM: centralized quoting via engine.add_quotes.

Good move removing the local duplication and importing the engine's single source of truth.
Actionable comments posted: 4
♻️ Duplicate comments (1)
wren-ai-service/src/core/engine.py (1)
Lines 53-65: Make squish_sql quote-aware; the current newline replacement alters strings/quoted identifiers.

Replacing all newlines with spaces can change string literals/JSON payloads and quoted identifiers. Collapse whitespace only outside '...', "...", `...`, and [...].

Apply this diff:

```diff
-def squish_sql(sql: str) -> str:
-    return (
-        sqlparse.format(
-            sql,
-            strip_comments=False,
-            reindent=False,  # don't add newlines/indent
-            keyword_case=None,  # don't change case
-        )
-        .replace("\n", " ")
-        .replace("\r", " ")
-        .strip()
-    )
+def squish_sql(sql: str) -> str:
+    """
+    Collapse whitespace outside of string/quoted identifier contexts.
+    Preserves content inside: single quotes ('...'), double quotes ("..."),
+    backticks (`...`), and bracket quotes ([...]).
+    """
+    s = sql.replace("\r\n", "\n").replace("\r", "\n")
+    out = []
+    in_single = in_double = in_backtick = in_bracket = False
+    prev_space = False
+    i = 0
+    while i < len(s):
+        ch = s[i]
+        if not (in_single or in_double or in_backtick or in_bracket):
+            if ch in "'\"`[":
+                if ch == "'":
+                    in_single = True
+                elif ch == '"':
+                    in_double = True
+                elif ch == "`":
+                    in_backtick = True
+                else:
+                    in_bracket = True
+                out.append(ch)
+                prev_space = False
+            elif ch.isspace():
+                if not prev_space:
+                    out.append(" ")
+                    prev_space = True
+            else:
+                out.append(ch)
+                prev_space = False
+        else:
+            out.append(ch)
+            if in_single and ch == "'":
+                if i + 1 < len(s) and s[i + 1] == "'":
+                    out.append(s[i + 1]); i += 1
+                else:
+                    in_single = False
+            elif in_double and ch == '"':
+                if i + 1 < len(s) and s[i + 1] == '"':
+                    out.append(s[i + 1]); i += 1
+                else:
+                    in_double = False
+            elif in_backtick and ch == "`":
+                in_backtick = False
+            elif in_bracket and ch == "]":
+                in_bracket = False
+        i += 1
+    return "".join(out).strip()
```
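A more compact regex-based alternative to the state machine above (a sketch; it handles the same four quote styles but, like the state machine, makes no attempt at escaped closing brackets inside [...]):

```python
import re

# Match a quoted region first so its contents pass through verbatim;
# otherwise collapse any whitespace run to a single space.
_QUOTED_OR_WS = re.compile(
    r"""('(?:[^']|'')*'|"(?:[^"]|"")*"|`[^`]*`|\[[^\]]*\])|(\s+)""",
    re.S,
)

def squish_sql(sql: str) -> str:
    return _QUOTED_OR_WS.sub(
        lambda m: m.group(1) if m.group(1) else " ", sql
    ).strip()
```

The alternation order matters: because quoted regions are tried before the whitespace branch, whitespace inside a literal is never reached by the collapse rule.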
🧹 Nitpick comments (4)
wren-ai-service/src/core/engine.py (4)
Lines 7-9: Trim imports once squish_sql no longer needs sqlparse.

If you adopt the quote-aware squish, remove the unused import.

```diff
-import sqlparse
 from sqlglot.tokens import Token, Tokenizer, TokenType
```
Lines 73-84: Broaden identifier token types defensively.

Some sqlglot versions use IDENTIFIER. Include it if present.

```diff
     def is_ident(tok: Token):
-        # SQLGlot uses VAR for identifiers, but also treats SQL keywords as identifiers in some contexts
-        return tok.token_type in (
-            TokenType.VAR,
-            TokenType.SCHEMA,
-            TokenType.TABLE,
-            TokenType.COLUMN,
-            TokenType.DATABASE,
-            TokenType.INDEX,
-            TokenType.VIEW,
-        )
+        ident_types = {
+            TokenType.VAR,
+            TokenType.SCHEMA,
+            TokenType.TABLE,
+            TokenType.COLUMN,
+            TokenType.DATABASE,
+            TokenType.INDEX,
+            TokenType.VIEW,
+        }
+        if hasattr(TokenType, "IDENTIFIER"):
+            ident_types.add(getattr(TokenType, "IDENTIFIER"))
+        return tok.token_type in ident_types
```
Line 67: Add targeted tests for edge cases.

Please add tests for:
- Strings/JSON with embedded newlines and multiple spaces (preserved).
- Ident chains with function calls: schema.func(col), db.schema.func(col).
- Dialect-specific quote pairs: MySQL/BigQuery backticks; SQL Server brackets (including internal ] escaping).
- Wildcards t.* and plain * untouched.
- Already-quoted identifiers remain unchanged.
I can draft these test cases in tests/pytest/test_engine.py if helpful.
Lines 138-142: Token.start and Token.end are stable; no change needed for the slicing logic.

Verified in SQLGlot's official API reference that every Python Token returned by tokenize includes .start and .end attributes (character offsets into the input SQL), along with .line and .col metadata (sqlglot.com). There is no separate .span property in the Python API, but you can reconstruct a span tuple via (token.start, token.end) if desired.

To guard against future breaking changes:

- Pin your SQLGlot dependency to a known-good minimum version that provides these attributes (they have existed since at least v10.x). For example, in your pyproject.toml or requirements.txt:

```
sqlglot >= 10.0.0, < 27.0.0
```

- (Optional) Add a small fallback in case edge cases or rust-tokenizer tokens expose a different API:

```python
start = getattr(ident_token, "start", None)
end = getattr(ident_token, "end", None)
if hasattr(ident_token, "span"):
    start, end = ident_token.span
```

With these safeguards, you can reliably rely on .start/.end offsets for your slice-and-replace logic.
📜 Review details
📒 Files selected for processing (1)
- wren-ai-service/src/core/engine.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pytest
- GitHub Check: Analyze (go)
Actionable comments posted: 1
🧹 Nitpick comments (9)
wren-ai-service/tests/pytest/test_engine.py (9)
Lines 1-5: Stabilize imports across CI/IDE environments.

Confirm that src is importable from the test runner's working directory (e.g., via a src-layout, namespace package, or pytest.ini pythonpath). If not, tests may fail to import in CI.
Lines 20-29: Also assert alias quoting in the FROM clause.

Current checks don't validate whether the table alias is quoted in FROM. Add a focused assertion.

```diff
@@
     assert '"u"."name"' in result
     assert '"u"."email"' in result
     assert '"users"' in result
+    # Alias should be quoted in FROM as well
+    assert 'FROM "users" "u"' in result
```
Lines 40-49: Cover alias quoting alongside the wildcard.

Validate alias quoting in FROM when a wildcard is used.

```diff
@@
     assert "t.*" in result  # Should not be quoted
     assert '"name"' in result
     assert '"table1"' in result
+    # Alias should be quoted in FROM
+    assert 'FROM "table1" "t"' in result
```
Lines 62-80: Add assertions for alias and literal handling in the complex query.

Tighten expectations for alias quoting and boolean literal preservation.

```diff
@@
     assert '"posts"' in result
     assert '"categories"' in result
+    # Alias from AS should be quoted
+    assert '"category_name"' in result
+    # Boolean literal should remain unquoted
+    assert " true" in result
```
Lines 90-102: Reduce brittleness in the malformed SQL check.

Depending on sqlglot tokenization, SQL might not always be treated as an identifier. Consider relaxing that specific assertion to avoid flaky failures across versions.

```diff
@@
-    assert '"SQL"' in result
     assert '"PARENTHESES"' in result
```
Lines 128-142: Add negatives to ensure keywords aren't quoted.

Make the intent explicit by asserting keywords are never quoted.

```diff
@@
     assert "ORDER BY " in result
     assert '"name"' in result  # But identifiers should be quoted
     assert '"users"' in result
     assert '"active"' in result
     assert '"created_at"' in result
+    # Ensure SQL keywords are not quoted
+    assert '"SELECT"' not in result
+    assert '"FROM"' not in result
+    assert '"WHERE"' not in result
+    assert '"ORDER"' not in result
```
Lines 155-161: Add a test for triple single quotes.

You already handle ''' in the implementation; add a focused test.

```diff
@@
         cleaned = clean_generation_result(result)
         assert '"""' not in cleaned
         assert "SELECT * FROM users" in cleaned
+
+    def test_clean_generation_result_removes_single_triple_quotes(self):
+        """Test removing triple single quotes."""
+        result = "'''SELECT * FROM users'''"
+        cleaned = clean_generation_result(result)
+        assert "'''" not in cleaned
+        assert "SELECT * FROM users" in cleaned
```
Lines 193-199: Minor: unify case-insensitive assertion style.

For consistency with other tests, consider asserting on result.upper().

```diff
@@
-    assert "limit" not in result
+    assert "LIMIT" not in result.upper()
```
Lines 200-206: Add a test for LIMIT with a trailing comment.

Your regex handles -- comment at EOL; add a test to lock it in.

```diff
@@
         result = remove_limit_statement(sql)
         assert "name = 'limit'" in result
         assert "LIMIT 10" not in result
+
+    def test_remove_limit_statement_with_trailing_comment(self):
+        """Test removing LIMIT when a trailing SQL comment exists."""
+        sql = "SELECT * FROM users LIMIT 10 -- comment"
+        result = remove_limit_statement(sql)
+        assert "LIMIT" not in result.upper()
+        assert "SELECT * FROM users" in result
```
📜 Review details
📒 Files selected for processing (1)
- wren-ai-service/tests/pytest/test_engine.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
wren-ai-service/tests/pytest/test_engine.py (1)
- wren-ai-service/src/core/engine.py (3): add_quotes (67-160), clean_generation_result (31-43), remove_limit_statement (46-50)
🔇 Additional comments (11)
wren-ai-service/tests/pytest/test_engine.py (11)
Lines 11-19: LGTM: baseline quoting assertions cover the core path. Good happy-path coverage for simple identifiers.

Lines 30-39: LGTM: preserves already-quoted identifiers. Matches the intended behavior.

Lines 81-89: LGTM: schema-prefixed identifiers. Good coverage of dotted chains with multiple levels.

Lines 103-110: LGTM: empty input behavior. Covers the non-error empty case.

Lines 111-119: LGTM: whitespace normalization sanity check. Matches the expected pre-tokenization squish.

Lines 120-127: LGTM: dotted chain spacing preserved. Good guard against spacing regressions.

Lines 147-154: LGTM: code-fence removal. Covers typical ```sql code fences.

Lines 162-168: LGTM: semicolon stripping. Covers the trailing-semicolon case.

Lines 169-174: LGTM: normalization exactness. The exact-match assertion is helpful here.

Lines 179-185: LGTM: basic LIMIT removal. Captures the primary behavior.

Lines 186-192: LGTM: LIMIT with semicolon. Good end-of-statement coverage.
We could add quotes to SQL without transpilation using sqlglot.
Summary by CodeRabbit

- Bug Fixes
- Refactor
- New Features
- Tests