
Conversation

@cyyeh cyyeh (Member) commented Aug 22, 2025

We could add quotes to SQL without transpilation using sqlglot.
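The idea behind "quoting without transpilation" is to rewrite identifiers in place rather than parsing and regenerating the whole statement. A toy, regex-based sketch of that idea (illustrative only; the actual implementation in this PR walks sqlglot's token stream):

```python
import re

# Illustrative keyword list; the real implementation relies on the
# tokenizer's token types instead of a hand-maintained set.
KEYWORDS = {"select", "from", "where", "join", "on", "as", "and", "or", "group", "by"}

def add_quotes_sketch(sql: str) -> str:
    """Quote bare identifiers and each part of dotted chains in place.

    Limitations of this sketch: function names and already-quoted
    identifiers are not special-cased, which the token-based version handles.
    """
    def repl(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in KEYWORDS:
            return word  # leave SQL keywords untouched
        # quote each part of a dotted chain: a.b -> "a"."b"
        return ".".join(f'"{part}"' for part in word.split("."))

    return re.sub(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*\b", repl, sql)
```

For example, `add_quotes_sketch("select a.b from t")` yields `select "a"."b" from "t"` without ever building a syntax tree.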

Summary by CodeRabbit

  • Bug Fixes

    • More reliable and consistent automatic quoting of SQL identifiers and clearer error reporting when quoting fails.
  • Refactor

    • Centralized quoting logic into the core engine and removed duplicate in-file implementations; SQL whitespace/line-ending normalization added.
  • New Features

    • Added a public SQL normalization utility (squish_sql) and unified token-based quoting behavior.
  • Tests

    • New comprehensive tests for quoting, normalization, limit removal, and cleaning of generated SQL.

coderabbitai bot (Contributor) commented Aug 22, 2025

Walkthrough

Moved SQL-quoting into src/core/engine.py, introducing token-based add_quotes and squish_sql; removed local add_quotes and sqlglot usage from several modules and updated imports; added tests validating engine utilities and quoting behaviors.

Changes

  • Engine: token-based quoting & utilities (wren-ai-service/src/core/engine.py)
    Added squish_sql(sql: str); reimplemented add_quotes with token-driven quoting via sqlparse/Tokenizer/TokenType (identifier-chain detection, skipping wildcards and functions, right-to-left edits); removed sqlglot usage; updated error handling to return ("", error_msg) on failure.
  • Call sites: import centralized add_quotes (wren-ai-service/eval/utils.py, wren-ai-service/eval/data_curation/utils.py, wren-ai-service/tools/run_sql.py)
    Removed local add_quotes definitions and sqlglot imports; these modules now import add_quotes from src.core.engine. Call sites keep the (quoted_sql, error) contract, and their assert/error-checking logic is unchanged aside from the relocation.
  • Tests: engine utilities (wren-ai-service/tests/pytest/test_engine.py)
    New test module covering add_quotes, clean_generation_result, remove_limit_statement, and squish_sql across many SQL scenarios (simple and dotted identifiers, quoted identifiers, wildcards, functions, joins, malformed input, whitespace normalization, limit removal, code-block stripping).
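The "right-to-left edits" detail in the engine summary is worth spelling out: applying replacements from the end of the string backward keeps the character offsets of not-yet-applied edits valid. A minimal, generic sketch of that technique (function name and edit format are illustrative, not the PR's actual code):

```python
def apply_edits(text: str, edits):
    """Apply (start, end, replacement) edits to text; `end` is exclusive.

    Sorting by start offset in reverse order means each splice only
    shifts characters to its right, so the offsets recorded for the
    remaining (earlier) edits are still correct when they are applied.
    """
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text
```

For example, `apply_edits("select a from t", [(7, 8, '"a"'), (14, 15, '"t"')])` returns `select "a" from "t"`; applying the same edits left to right without offset adjustment would splice at stale positions.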

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Engine as src.core.engine.add_quotes
  participant Tokenizer as Tokenizer / Token stream
  Caller->>Engine: add_quotes(sql)
  Engine->>Engine: squish_sql(sql)
  Engine->>Tokenizer: tokenize(sql)
  Tokenizer-->>Engine: token stream
  Engine->>Engine: analyze tokens (IDENT, DOT, STAR, L_PAREN, ...)
  alt dotted identifiers to quote
    Engine->>Engine: record edits (quote each identifier in chain)
    note right of Engine #eef2ff: edits recorded right-to-left to\npreserve original offsets
  end
  Engine->>Engine: apply edits -> quoted_sql
  Engine-->>Caller: (quoted_sql, "")
  alt on error
    Engine-->>Caller: ("", "error message")
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • yichieh-lu
  • imAsterSun

Poem

I nibble tokens, hop through quotes,
I squish the lines and tidy notes.
Dotted names snug, wildcards spared,
Tests applaud the work I've paired.
A rabbit hops — SQL neat and shared. 🐇



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
wren-ai-service/eval/utils.py (1)

27-35: Fix multi-statement quoting and add optional dialect parameter in add_quotes

The current implementation in wren-ai-service/eval/utils.py silently drops all but the first statement when given a multi-statement SQL string and uses a default parser configuration that can fail on unquoted reserved words. Based on the test output (sqlglot v27.8.0):

  • Multi-statement input like
    select a from t; select b from t
    is parsed by parse_one as only the first statement, producing
    SELECT "a" FROM "t" and dropping select b from t.
  • Dialect-specific quoting (e.g. PostgreSQL, BigQuery, DuckDB) works as expected when you pass read= and dialect=, but the default (no dialect) parser errors on reserved identifiers like
    select select, from from.

To address these issues, update add_quotes as follows:

• Change the signature to accept an optional dialect:

-def add_quotes(sql: str) -> Tuple[str, bool]:
+def add_quotes(sql: str, dialect: Optional[str] = None) -> Tuple[str, bool]:

• Parse all statements and preserve them:

-    node = parse_one(sql, read=dialect) if dialect else parse_one(sql)
-    quoted_sql = node.sql(dialect=dialect, identify=True) if dialect else node.sql(identify=True)
+    # Use parse(...) to handle multi-statement inputs
+    expressions = parse(sql, read=dialect)  # returns List[Expression]
+    quoted_statements = [
+        expr.sql(dialect=dialect, identify=True)
+        for expr in expressions
+    ]
+    quoted_sql = "; ".join(quoted_statements)

• Thread the dialect through both parsing and SQL generation:

-    quoted_sql = parse_one(sql).sql(identify=True)
+    expressions = parse(sql, read=dialect)
+    quoted_sql = "; ".join(
+        expr.sql(dialect=dialect, identify=True)
+        for expr in expressions
+    )

• Return the full, re-quoted script:

return quoted_sql, True

These changes ensure:

  • All statements in a multi-statement SQL string are preserved and properly quoted.
  • Dialect-specific quoting is deterministic and matches what connectors like DuckDB, PostgreSQL, and BigQuery expect.
  • Reserved identifiers parse and quote correctly when a dialect is provided, and errors on unquoted reserved words are avoided.
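The shape of the multi-statement fix (quote each statement, then rejoin) can be shown in isolation. The sketch below deliberately splits on `;`, which is exactly the simplification a real implementation must avoid, since `;` can appear inside string literals; that is why the review recommends the parser's own multi-statement entry point (sqlglot.parse) instead:

```python
def quote_all_statements(sql: str, quote_one) -> str:
    # Naive illustration of the per-statement map/rejoin pattern.
    # quote_one stands in for any single-statement quoting function.
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    return "; ".join(quote_one(s) for s in statements)
```

With `str.upper` standing in for the quoting step, `quote_all_statements("select a from t; select b from t", str.upper)` preserves both statements instead of silently dropping the second one.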
🧹 Nitpick comments (3)
wren-ai-service/src/core/engine.py (2)

8-8: Narrow import looks good; consider future-proofing for dialect-specific quoting.

Switching to from sqlglot import parse_one is fine and reduces namespace clutter. To avoid subtle differences across data sources (e.g., DuckDB vs BigQuery quoting), consider making the quoting dialect explicit in the call sites (see suggestion below on the add_quotes function for a small API extension).


52-60: Make quoting deterministic across engines, avoid logging raw SQL, and align failure behavior.

  • Determinism: parse_one(sql).sql(identify=True) uses default dialects; output may vary from expectations per connector. Add an optional dialect arg and pass it through to both parsing and generation to lock the quoting style (e.g., "duckdb", "postgres", "bigquery").
  • Logging hygiene: logging full SQL at exception level can leak sensitive schema or literal info. Prefer a generic message (stack trace still included) and avoid embedding the SQL.
  • Failure behavior: returning "" on error forces all callers to handle empty SQL specially. Consider returning the original sql (like utils.add_quotes does) and the error string so downstream code can decide whether to proceed or fail.

Apply this diff within the current block:

-def add_quotes(sql: str) -> Tuple[str, str]:
+def add_quotes(sql: str, dialect: Optional[str] = None) -> Tuple[str, str]:
     try:
-        quoted_sql = parse_one(sql).sql(identify=True)
+        node = parse_one(sql, read=dialect) if dialect else parse_one(sql)
+        quoted_sql = (
+            node.sql(dialect=dialect, identify=True)
+            if dialect
+            else node.sql(identify=True)
+        )
     except Exception as e:
-        logger.exception(f"Error in adding quotes to {sql}: {e}")
-
-        return "", str(e)
+        # Avoid logging raw SQL to reduce leakage; stack trace is preserved.
+        logger.exception("Error adding quotes to SQL")
+        return sql, str(e)
 
     return quoted_sql, ""

To ensure we pick the right dialects at call sites, please confirm which connectors can be used here and wire the corresponding dialect (e.g., "duckdb" for DuckDB, "postgres" for Postgres, "bigquery" for BigQuery). I can follow up with a small patch once you confirm the list.
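For context on why the dialect matters here: different engines use different identifier quote characters. A stdlib-only sketch of the conventions (an illustrative subset; the real code should delegate this to sqlglot's dialect support, and escaping rules vary, e.g. BigQuery uses backslash escapes rather than doubling):

```python
# Identifier quote pairs per dialect (illustrative subset).
QUOTE_PAIRS = {
    "postgres": ('"', '"'),
    "duckdb": ('"', '"'),
    "bigquery": ("`", "`"),
    "tsql": ("[", "]"),
}

def quote_ident(name: str, dialect: str = "postgres") -> str:
    open_q, close_q = QUOTE_PAIRS.get(dialect, ('"', '"'))
    # Escape an embedded closing quote by doubling it, as Postgres,
    # DuckDB, and T-SQL do; this is a simplification for other dialects.
    return f"{open_q}{name.replace(close_q, close_q * 2)}{close_q}"
```

So the same identifier renders as `"col"` for Postgres/DuckDB, `` `col` `` for BigQuery, and `[col]` for SQL Server, which is why locking the dialect at the call site makes the output deterministic.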

wren-ai-service/eval/utils.py (1)

27-35: Replace print-based error reporting with structured logging and keep stack traces.

Using print in a library/utility module makes operational visibility harder and is inconsistent with engine.py which already uses a logger. Prefer a module logger and avoid logging the raw SQL string to reduce leakage risk. The function’s return contract stays the same.

Apply this diff within the current block:

 def add_quotes(sql: str) -> Tuple[str, bool]:
     try:
         quoted_sql = parse_one(sql).sql(identify=True)
         return quoted_sql, True
     except Exception as e:
-        print(f"Error in adding quotes to SQL: {sql}")
-        print(f"Error: {e}")
-        return sql, False
+        logger.exception("Error adding quotes to SQL")
+        return sql, False

Add this near the imports (outside the selected range):

import logging
logger = logging.getLogger("wren-ai-service")

I can push a follow-up commit that introduces the logger in this module and removes the print statements if you’d like.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0c44ace and a4cbb1e.

📒 Files selected for processing (2)
  • wren-ai-service/eval/utils.py (2 hunks)
  • wren-ai-service/src/core/engine.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/eval/utils.py (1)

17-17: LGTM on narrowed import.

from sqlglot import parse_one is clearer and avoids importing the whole module namespace.

@cyyeh cyyeh added the module/ai-service (ai-service related) and ci/ai-service (ai-service related) labels Aug 22, 2025
@cyyeh cyyeh marked this pull request as draft August 26, 2025 10:22
@cyyeh cyyeh marked this pull request as ready for review August 27, 2025 00:17
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
wren-ai-service/eval/data_curation/utils.py (1)

47-48: Fix success check for add_quotes: empty string means “no error”.

The second return value is an error message. Asserting truthiness will fail on success.

Apply this diff:

-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert error == "", f"Error in quoting SQL: {error}"
wren-ai-service/tools/run_sql.py (2)

48-50: Fix success check for add_quotes in duckdb path.

Empty string means no error; current assert fails on success.

Apply this diff:

-        quoted_sql, no_error = add_quotes(sql)
-        assert no_error, f"Error in adding quotes to SQL: {sql}"
+        quoted_sql, error = add_quotes(sql)
+        assert error == "", f"Error in adding quotes to SQL: {error}"

71-73: Fix success check for add_quotes in non-duckdb path.

Same issue as above.

Apply this diff:

-        quoted_sql, no_error = add_quotes(sql)
-        assert no_error, f"Error in adding quotes to SQL: {sql}"
+        quoted_sql, error = add_quotes(sql)
+        assert error == "", f"Error in adding quotes to SQL: {error}"
🧹 Nitpick comments (10)
wren-ai-service/src/core/engine.py (2)

118-136: Quote qualifiers in dotted function calls; skip only the function name.

Current logic skips the entire dotted chain when followed by '(', so qualifiers like db.schema remain unquoted in e.g., db.schema.my_func(...). Prefer quoting the qualifiers while leaving the final identifier (the function name) unquoted.

Apply this diff to adjust the behavior:

-            # If the next token after the chain is '(', it's a function call -> skip
-            if j + 1 < n and toks[j + 1].token_type == TokenType.L_PAREN:
-                i = j + 1
-                continue
+            # If the next token after the chain is '(', it's a function call.
+            # Quote only the qualifiers in the chain and skip the function name.
+            is_func_call = j + 1 < n and toks[j + 1].token_type == TokenType.L_PAREN
+            last_ident_index = len(chain_tokens) - 1 if is_func_call else None
@@
-            for k in range(
+            for k in range(
                 0, len(chain_tokens), 2
             ):  # Process only identifiers (skip dots)
-                ident_token = chain_tokens[k]
+                if is_func_call and k == last_ident_index:
+                    # leave function name unquoted
+                    continue
+                ident_token = chain_tokens[k]
                 token_text = sql[ident_token.start : ident_token.end + 1]

Consider adding a unit test covering dataset-qualified UDFs: SELECT db.schema.fn(col) FROM t; expecting '"db"."schema".fn("col")'. Do you want me to add it?
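The suggested behavior, quote the qualifiers but leave the trailing function name bare, can be sketched in isolation (a hypothetical helper, not the PR's code, operating on the dotted name already extracted from the token chain):

```python
def quote_call_qualifiers(dotted_name: str) -> str:
    """Quote every segment of a dotted chain except the last.

    db.schema.my_func -> "db"."schema".my_func
    A bare name (no qualifiers) is returned unchanged.
    """
    *qualifiers, func_name = dotted_name.split(".")
    return ".".join([f'"{q}"' for q in qualifiers] + [func_name])
```

This mirrors the intent of the diff above: in `db.schema.my_func(col)`, only `my_func` stays unquoted so the engine still resolves it as a function.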


146-148: Avoid logging full SQL on failures to reduce PII leakage; truncate or hash.

Logging the entire SQL at error level can leak sensitive data. Truncate and include an MD5/SHA1 to correlate logs.

Apply this diff to redact and truncate:

-        logger.exception(f"Error in adding quotes to {sql}: {e}")
+        redacted_sql = (sql[:512] + "...") if len(sql) > 512 else sql
+        logger.exception(f"Error in add_quotes (len={len(sql)}): {redacted_sql} :: {e}")
wren-ai-service/eval/data_curation/utils.py (5)

215-218: Use the function parameter data_source instead of st.session_state for endpoint selection.

Mixing sources can lead to mismatches when session state differs from the argument.

Apply this diff:

-            WREN_ENGINE_ENDPOINT
-            if st.session_state["data_source"] == "duckdb"
-            else WREN_IBIS_ENDPOINT,
+            WREN_ENGINE_ENDPOINT if data_source == "duckdb" else WREN_IBIS_ENDPOINT,

221-229: Typo: rename quesiton_sql_pair → question_sql_pair for clarity.

Keeps comprehension readable and consistent.

Apply this diff:

-            {
-                **quesiton_sql_pair,
+            {
+                **question_sql_pair,
                 "context": context,
                 "data": sql_data,
                 "document": document,
             }
-            for quesiton_sql_pair, context, sql_data, document in zip(
-                question_sql_pairs, contexts, sqls_data, documents
+            for question_sql_pair, context, sql_data, document in zip(
+                question_sql_pairs, contexts, sqls_data, documents
             )

25-25: Deduplicate remove_limit_statement: import from core.engine rather than reimplementing.

Reduces drift and keeps behavior consistent across modules.

Apply this diff:

-from src.core.engine import add_quotes
+from src.core.engine import add_quotes, remove_limit_statement
@@
-def remove_limit_statement(sql: str) -> str:
-    pattern = r"\s*LIMIT\s+\d+(\s*;?\s*--.*|\s*;?\s*)$"
-    modified_sql = re.sub(pattern, "", sql, flags=re.IGNORECASE)
-
-    return modified_sql
+# Use remove_limit_statement from src.core.engine

Also applies to: 273-278


84-104: Avoid creating unused aiohttp.ClientSession contexts.

Both get_validated_question_sql_pairs and get_contexts_from_sqls open a ClientSession but never use it, which only adds overhead.

Consider removing the unnecessary contexts:

-    async with aiohttp.ClientSession():
-        for question_sql_pair in question_sql_pairs:
+    for question_sql_pair in question_sql_pairs:
             ...
-        results = await asyncio.gather(*tasks)
-        return [
+    results = await asyncio.gather(*tasks)
+    return [

Repeat similarly for get_contexts_from_sqls. If you intend to reuse a session for downstream calls, thread it through function parameters and use it.

Also applies to: 119-141


38-49: Type annotation of is_sql_valid should reflect Optional[str] for the error.

The function returns (True, None) on success; annotate as Tuple[bool, Optional[str]] or normalize to always return a string.

Apply this diff:

-async def is_sql_valid(
+async def is_sql_valid(
     sql: str,
@@
-) -> Tuple[bool, str]:
+) -> Tuple[bool, Optional[str]]:
wren-ai-service/tools/run_sql.py (1)

40-46: Optional: Deduplicate quoting logic across branches.

Compute quoted_sql once before the conditional to simplify and avoid divergence.

Example:

 def get_data_from_wren_engine(
     sql: str,
@@
 ):
-    if dataset_type == "duckdb":
-        quoted_sql, error = add_quotes(sql)
-        assert error == "", f"Error in adding quotes to SQL: {error}"
+    quoted_sql, error = add_quotes(sql)
+    assert error == "", f"Error in adding quotes to SQL: {error}"
+    if dataset_type == "duckdb":
         response = requests.get(

Also applies to: 71-73

wren-ai-service/tests/pytest/test_engine.py (2)

51-62: Add a dataset-qualified function-call test.

Ensure only function name remains unquoted while qualifiers are quoted (aligning with the proposed engine change).

Add this test:

+    def test_add_quotes_dataset_qualified_function_call(self):
+        sql = "SELECT project.dataset.my_func(col) FROM tbl"
+        result, error = add_quotes(sql)
+        assert error == ""
+        assert '"project"."dataset".my_func("col")' in result
+        assert '"tbl"' in result

212-231: Protect squish_sql from modifying string literals.

Add a regression test to prevent future regressions where squish_sql alters content inside quoted strings.

Append this test:

+    def test_squish_sql_preserves_string_literals(self):
+        sql = "SELECT 'a   b', \"c   d\", `e   f`, [g   h]"
+        result = squish_sql(sql)
+        assert result.startswith("SELECT")
+        assert "'a   b'" in result
+        assert '"c   d"' in result
+        assert "`e   f`" in result
+        assert "[g   h]" in result
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between a4cbb1e and 9b7ed84.

📒 Files selected for processing (5)
  • wren-ai-service/eval/data_curation/utils.py (1 hunks)
  • wren-ai-service/eval/utils.py (2 hunks)
  • wren-ai-service/src/core/engine.py (2 hunks)
  • wren-ai-service/tests/pytest/test_engine.py (1 hunks)
  • wren-ai-service/tools/run_sql.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/eval/utils.py
🧰 Additional context used
🧬 Code graph analysis (3)
wren-ai-service/eval/data_curation/utils.py (1)
wren-ai-service/src/core/engine.py (1)
  • add_quotes (57-150)
wren-ai-service/tests/pytest/test_engine.py (1)
wren-ai-service/src/core/engine.py (4)
  • add_quotes (57-150)
  • clean_generation_result (30-42)
  • remove_limit_statement (45-49)
  • squish_sql (52-54)
wren-ai-service/tools/run_sql.py (1)
wren-ai-service/src/core/engine.py (1)
  • add_quotes (57-150)
🔇 Additional comments (2)
wren-ai-service/src/core/engine.py (1)

8-8: The above script will verify whether sqlglot is declared in your project’s dependency manifests (requirements.txt, pyproject.toml, setup.py) or referenced elsewhere. Based on its output, we can confirm if a version pin exists or if you need to add one.

Once we see the results, we’ll update the review comment accordingly.

wren-ai-service/tests/pytest/test_engine.py (1)

91-103: Revisit expectation that "SQL" is always an identifier.

Depending on the tokenizer/dialect, SQL might not be treated as an identifier when standalone. If this test flakes across versions, consider loosening or removing the '"SQL"' assertion.

Would you like me to parameterize the test by sqlglot version/dialect, or relax the assertion to only check quoting on non-keyword tokens (e.g., "INVALID" and "PARENTHESES")?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
wren-ai-service/eval/utils.py (3)

189-197: Fix regex to support both 'bird_' and 'spider_' filenames

Current code returns None for spider_... because it always reads group(1). Use a single named group.

-    match = re.search(
-        r"bird_(.+?)_eval_dataset\.toml|spider_(.+?)_eval_dataset\.toml", path
-    )
-    if match:
-        return match.group(1)
+    match = re.search(
+        r"(?:bird|spider)_(?P<db>.+?)_eval_dataset\.toml", path
+    )
+    if match:
+        return match.group("db")
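Wrapped as a standalone function, the suggested single-named-group pattern handles both prefixes (the helper name is illustrative):

```python
import re

def extract_db_name(path: str):
    # One named group for both prefixes, unlike the original
    # two-alternative pattern that always read group(1) and so
    # returned None for spider_* filenames.
    match = re.search(r"(?:bird|spider)_(?P<db>.+?)_eval_dataset\.toml", path)
    return match.group("db") if match else None
```

Both `bird_movies_eval_dataset.toml` and `spider_college_eval_dataset.toml` now yield their database names.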

222-224: Avoid mutable default argument for pipe_components

Using {} as a default can leak state across calls.

-def engine_config(
-    mdl: dict, pipe_components: dict[str, Any] = {}, path: str = ""
-) -> dict:
+def engine_config(
+    mdl: dict, pipe_components: Optional[dict[str, Any]] = None, path: str = ""
+) -> dict:
+    if pipe_components is None:
+        pipe_components = {}
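The failure mode is easy to reproduce in isolation; the two functions below are illustrative, not from the codebase:

```python
from typing import Optional

def with_mutable_default(acc: dict = {}) -> dict:
    # The default dict is created once at definition time, so the
    # same object is mutated and reused across calls.
    acc["calls"] = acc.get("calls", 0) + 1
    return acc

def with_none_default(acc: Optional[dict] = None) -> dict:
    if acc is None:
        acc = {}  # a fresh dict on every call
    acc["calls"] = acc.get("calls", 0) + 1
    return acc
```

Calling `with_mutable_default()` twice reports 2 calls because state leaked between invocations, while `with_none_default()` always starts from a clean dict.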

159-164: Adjust add_quotes usage across all call sites

The helper add_quotes returns (quoted_sql, error_message) (empty string on success), not a boolean. Any code unpacking the second element as no_error: bool will mis-handle errors. We need to:

• In wren-ai-service/eval/utils.py (lines 159–164), rename no_errorerror, invert the condition, and include the real error message:

-        quoted_sql, no_error = add_quotes(sql)
-        if not no_error:
-            print(f"Error in quoting SQL: {sql}")
-            quoted_sql = sql
+        quoted_sql, error = add_quotes(sql)
+        if error:
+            print(f"Error quoting SQL: {error}; sql={sql}")
+            quoted_sql = sql

• In wren-ai-service/eval/utils.py (lines 36–37), change the early assert to check for an empty error string:

-    quoted_sql, error = add_quotes(sql)
-    assert not error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert not error, f"Error in quoting SQL: {error}; sql={sql}"

• In wren-ai-service/eval/data_curation/utils.py (lines 47–48), unpack into error and assert accordingly:

-    quoted_sql, no_error = add_quotes(sql)
-    assert no_error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    assert not error, f"Error in quoting SQL: {error}; sql={sql}"

• In wren-ai-service/tools/run_sql.py (lines 48–49 and 71–72), update the unpacking and assertions:

-        quoted_sql, no_error = add_quotes(sql)
-        assert no_error, f"Error in adding quotes to SQL: {sql}"
+        quoted_sql, error = add_quotes(sql)
+        assert not error, f"Error adding quotes to SQL: {error}; sql={sql}"

Please apply these fixes across any other call sites (e.g., in src/pipelines/generation/utils/sql.py, rename error_message to error or handle it consistently). This is required to prevent silent failures or inverted error handling.

🧹 Nitpick comments (3)
wren-ai-service/eval/utils.py (3)

36-37: Prefer explicit error handling over assert; include error details

assert can be optimized away and hides failures. Raise with context (or fallback) instead.

-    quoted_sql, error = add_quotes(sql)
-    assert not error, f"Error in quoting SQL: {sql}"
+    quoted_sql, error = add_quotes(sql)
+    if error:
+        raise ValueError(f"Error quoting SQL: {error}; sql={sql!r}")

49-55: Avoid JSON parsing before status check (duckdb path)

If the response is non-JSON on error, await response.json() will raise.

-        ) as response:
-            data = await response.json()
-
-            if response.status != 200:
-                return {"data": [], "columns": []}
+        ) as response:
+            if response.status != 200:
+                return {"data": [], "columns": []}
+            data = await response.json()

7-7: Nit: mixed typing styles

You’re mixing built-in generics (list[str]) with typing.List/Dict. It works, but consider unifying for consistency.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 9b7ed84 and 8e714fd.

📒 Files selected for processing (1)
  • wren-ai-service/eval/utils.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
wren-ai-service/eval/utils.py (1)
wren-ai-service/src/core/engine.py (1)
  • add_quotes (57-150)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pytest
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
wren-ai-service/eval/utils.py (1)

21-21: LGTM: centralized quoting via engine.add_quotes

Good move removing local duplication and importing the engine’s single source of truth.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
wren-ai-service/src/core/engine.py (1)

53-65: Make squish_sql quote-aware; current newline replacement alters strings/quoted identifiers.

Replacing all newlines with spaces can change string literals/JSON payloads and quoted identifiers. Collapse whitespace only outside '...', "...", ..., and [...].

Apply this diff:

-def squish_sql(sql: str) -> str:
-    return (
-        sqlparse.format(
-            sql,
-            strip_comments=False,
-            reindent=False,  # don't add newlines/indent
-            keyword_case=None,  # don't change case
-        )
-        .replace("\n", " ")
-        .replace("\r", " ")
-        .strip()
-    )
+def squish_sql(sql: str) -> str:
+    """
+    Collapse whitespace outside of string/quoted identifier contexts.
+    Preserves content inside: single quotes ('...'), double quotes ("..."),
+    backticks (`...`), and bracket quotes ([...]).
+    """
+    s = sql.replace("\r\n", "\n").replace("\r", "\n")
+    out = []
+    in_single = in_double = in_backtick = in_bracket = False
+    prev_space = False
+    i = 0
+    while i < len(s):
+        ch = s[i]
+        if not (in_single or in_double or in_backtick or in_bracket):
+            if ch in "'\"`[":
+                if ch == "'":
+                    in_single = True
+                elif ch == '"':
+                    in_double = True
+                elif ch == "`":
+                    in_backtick = True
+                else:
+                    in_bracket = True
+                out.append(ch)
+                prev_space = False
+            elif ch.isspace():
+                if not prev_space:
+                    out.append(" ")
+                    prev_space = True
+            else:
+                out.append(ch)
+                prev_space = False
+        else:
+            out.append(ch)
+            if in_single and ch == "'":
+                if i + 1 < len(s) and s[i + 1] == "'":
+                    out.append(s[i + 1]); i += 1
+                else:
+                    in_single = False
+            elif in_double and ch == '"':
+                if i + 1 < len(s) and s[i + 1] == '"':
+                    out.append(s[i + 1]); i += 1
+                else:
+                    in_double = False
+            elif in_backtick and ch == "`":
+                in_backtick = False
+            elif in_bracket and ch == "]":
+                in_bracket = False
+        i += 1
+    return "".join(out).strip()
🧹 Nitpick comments (4)
wren-ai-service/src/core/engine.py (4)

7-9: Trim imports once squish_sql no longer needs sqlparse.

If you adopt the quote-aware squish, remove the unused import.

-import sqlparse
-from sqlglot.tokens import Token, Tokenizer, TokenType
+from sqlglot.tokens import Token, Tokenizer, TokenType

73-84: Broaden identifier token types defensively.

Some sqlglot versions use IDENTIFIER. Include it if present.

-        def is_ident(tok: Token):
-            # SQLGlot uses VAR for identifiers, but also treats SQL keywords as identifiers in some contexts
-            return tok.token_type in (
-                TokenType.VAR,
-                TokenType.SCHEMA,
-                TokenType.TABLE,
-                TokenType.COLUMN,
-                TokenType.DATABASE,
-                TokenType.INDEX,
-                TokenType.VIEW,
-            )
+        def is_ident(tok: Token):
+            ident_types = {
+                TokenType.VAR,
+                TokenType.SCHEMA,
+                TokenType.TABLE,
+                TokenType.COLUMN,
+                TokenType.DATABASE,
+                TokenType.INDEX,
+                TokenType.VIEW,
+            }
+            if hasattr(TokenType, "IDENTIFIER"):
+                ident_types.add(getattr(TokenType, "IDENTIFIER"))
+            return tok.token_type in ident_types

67-67: Add targeted tests for edge cases.

Please add tests for:

  • Strings/JSON with embedded newlines and multiple spaces (preserved).
  • Ident chains with function calls: schema.func(col), db.schema.func(col).
  • Dialect-specific quote pairs: MySQL/BigQuery backticks; SQL Server brackets (including internal ] escaping).
  • Wildcards t.* and plain * untouched.
  • Already-quoted identifiers remain unchanged.

I can draft these test cases in tests/pytest/test_engine.py if helpful.
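As a reference point for the wildcard and dotted-chain cases, a tiny stand-alone helper (hypothetical, not the engine's implementation) shows the quoting these tests would expect:

```python
def quote_ident_chain(chain: str) -> str:
    """Quote each dot-separated part of an identifier chain,
    leaving wildcards and already-quoted parts untouched."""
    quoted = [
        part if part == "*" or part.startswith('"') else f'"{part}"'
        for part in chain.split(".")
    ]
    return ".".join(quoted)

print(quote_ident_chain("db.schema.table"))  # "db"."schema"."table"
print(quote_ident_chain("t.*"))              # "t".*
print(quote_ident_chain('"users".name'))     # "users"."name"
```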


138-142: Token.start and Token.end are stable – no change needed for slicing logic

Verified in SQLGlot’s official API reference that every Python Token returned by tokenize includes .start and .end attributes (character offsets into the input SQL), along with .line and .col metadata (sqlglot.com). There is no separate .span property in the Python API, but you can reconstruct a span tuple via (token.start, token.end) if desired.

To guard against future breaking changes:

  • Pin your SQLGlot dependency to a known-good minimum version that provides these attributes (they have existed since at least v10.x). For example, in your pyproject.toml or requirements.txt:

    sqlglot >= 10.0.0, < 27.0.0
  • (Optional) Add a small fallback in case of edge cases or rust-tokenizer tokens exposing a different API:

    start = getattr(ident_token, "start", None)
    end = getattr(ident_token, "end", None)
    if hasattr(ident_token, "span"):
        start, end = ident_token.span

With these safeguards, you can reliably rely on .start/.end offsets for your slice-and-replace logic.
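The slice-and-replace step itself is library-agnostic. A minimal stdlib-only sketch (offsets hard-coded as a tokenizer might report them) shows why the edits must be applied right-to-left: editing left-to-right would shift every later offset.

```python
def quote_spans(sql: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each inclusive (start, end) span in double quotes,
    editing right-to-left so earlier offsets stay valid."""
    for start, end in sorted(spans, reverse=True):
        ident = sql[start : end + 1]
        if not ident.startswith('"'):
            sql = sql[:start] + f'"{ident}"' + sql[end + 1 :]
    return sql

# Offsets as a tokenizer might report them for this statement:
print(quote_spans("SELECT name FROM users", [(7, 10), (17, 21)]))
# SELECT "name" FROM "users"
```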

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8e714fd and ab7e9de.

📒 Files selected for processing (1)
  • wren-ai-service/src/core/engine.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pytest
  • GitHub Check: Analyze (go)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (9)
wren-ai-service/tests/pytest/test_engine.py (9)

1-5: Stabilize imports across CI/IDE environments

Confirm that src is importable from the test runner’s working directory (e.g., via a src-layout, namespace package, or pytest.ini pythonpath). If not, tests may fail to import in CI.


20-29: Also assert alias quoting in FROM clause

Current checks don’t validate whether the table alias is quoted in FROM. Add a focused assertion.

@@
         assert '"u"."name"' in result
         assert '"u"."email"' in result
         assert '"users"' in result
+        # Alias should be quoted in FROM as well
+        assert 'FROM "users" "u"' in result

40-49: Cover alias quoting alongside wildcard

Validate alias quoting in FROM when wildcard is used.

@@
         assert "t.*" in result  # Should not be quoted
         assert '"name"' in result
         assert '"table1"' in result
+        # Alias should be quoted in FROM
+        assert 'FROM "table1" "t"' in result

62-80: Add assertions for alias and literal handling in complex query

Tighten expectations for alias quoting and boolean literal preservation.

@@
         assert '"posts"' in result
         assert '"categories"' in result
+        # Alias from AS should be quoted
+        assert '"category_name"' in result
+        # Boolean literal should remain unquoted
+        assert " true" in result

90-102: Reduce brittleness in malformed SQL check

Depending on sqlglot tokenization, SQL might not always be treated as an identifier. Consider relaxing that specific assertion to avoid flaky failures across versions.

@@
-        assert '"SQL"' in result
         assert '"PARENTHESES"' in result

128-142: Add negatives to ensure keywords aren’t quoted

Make the intent explicit by asserting keywords are never quoted.

@@
         assert "ORDER BY " in result
         assert '"name"' in result  # But identifiers should be quoted
         assert '"users"' in result
         assert '"active"' in result
         assert '"created_at"' in result
+        # Ensure SQL keywords are not quoted
+        assert '"SELECT"' not in result
+        assert '"FROM"' not in result
+        assert '"WHERE"' not in result
+        assert '"ORDER"' not in result

155-161: Add test for triple single quotes

You already handle ''' in the implementation; add a focused test.

@@
         cleaned = clean_generation_result(result)
         assert '"""' not in cleaned
         assert "SELECT * FROM users" in cleaned
+
+    def test_clean_generation_result_removes_single_triple_quotes(self):
+        """Test removing triple single quotes."""
+        result = "'''SELECT * FROM users'''"
+        cleaned = clean_generation_result(result)
+        assert "'''" not in cleaned
+        assert "SELECT * FROM users" in cleaned
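For context, the behaviors these tests pin down (fence stripping, whitespace squashing, trailing-semicolon removal) can be mirrored by a small reference function; the real implementation lives in src/core/engine.py and may differ in detail:

```python
import re


def clean_generation_result(result: str) -> str:
    """Illustrative mirror of the cleaning the tests assert:
    collapse whitespace, strip code fences and triple quotes,
    and drop a trailing semicolon."""
    collapsed = re.sub(r"\s+", " ", result).strip()
    return (
        collapsed.replace("```sql", "")
        .replace("```", "")
        .replace('"""', "")
        .replace("'''", "")
        .strip()
        .rstrip(";")
    )

print(clean_generation_result("```sql\nSELECT * FROM users;\n```"))
# SELECT * FROM users
```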

193-199: Minor: unify case-insensitive assertion style

For consistency with other tests, consider asserting on result.upper().

@@
-        assert "limit" not in result
+        assert "LIMIT" not in result.upper()

200-206: Add test for LIMIT with trailing comment

Your regex handles -- comment at EOL; add a test to lock it in.

@@
         result = remove_limit_statement(sql)
         assert "name = 'limit'" in result
         assert "LIMIT 10" not in result
+
+    def test_remove_limit_statement_with_trailing_comment(self):
+        """Test removing LIMIT when a trailing SQL comment exists."""
+        sql = "SELECT * FROM users LIMIT 10 -- comment"
+        result = remove_limit_statement(sql)
+        assert "LIMIT" not in result.upper()
+        assert "SELECT * FROM users" in result
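A regex in this spirit would satisfy both new tests (shown only to make the expected behavior concrete; the engine's actual pattern may differ):

```python
import re


def remove_limit_statement(sql: str) -> str:
    """Strip a trailing LIMIT <n>, optionally followed by a semicolon
    and/or a -- comment. Anchoring at end-of-string leaves 'limit'
    inside string literals earlier in the statement untouched."""
    pattern = r"\s+LIMIT\s+\d+\s*;?\s*(--[^\n]*)?\s*$"
    return re.sub(pattern, "", sql, flags=re.IGNORECASE)

print(remove_limit_statement("SELECT * FROM users LIMIT 10 -- comment"))
# SELECT * FROM users
```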
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between ab7e9de and 0fbf035.

📒 Files selected for processing (1)
  • wren-ai-service/tests/pytest/test_engine.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
wren-ai-service/tests/pytest/test_engine.py (1)
wren-ai-service/src/core/engine.py (3)
  • add_quotes (67-160)
  • clean_generation_result (31-43)
  • remove_limit_statement (46-50)
🔇 Additional comments (11)
wren-ai-service/tests/pytest/test_engine.py (11)

11-19: LGTM: baseline quoting assertions cover core path

Good happy-path coverage for simple identifiers.


30-39: LGTM: preserves already-quoted identifiers

Matches intended behavior.


81-89: LGTM: schema-prefixed identifiers

Good coverage of dotted chains with multiple levels.


103-110: LGTM: empty input behavior

Covers non-error empty case.


111-119: LGTM: whitespace normalization sanity check

Matches expected pre-tokenization squish.


120-127: LGTM: dotted chain spacing preserved

Good guard against spacing regressions.


147-154: LGTM: code-fence removal

Covers typical ```sql blocks.


162-168: LGTM: semicolon stripping

Covers trailing semicolon case.


169-174: LGTM: normalization exactness

Exact match assertion is helpful here.


179-185: LGTM: basic LIMIT removal

Captures the primary behavior.


186-192: LGTM: LIMIT with semicolon

Good end-of-statement coverage.

@cyyeh cyyeh merged commit 2da04ec into main Aug 27, 2025
12 checks passed
@cyyeh cyyeh deleted the chore/ai-service/improve-add-quotes branch August 27, 2025 02:40
Labels
ci/ai-service, module/ai-service, wren-ai-service