
Conversation

@dhirenmathur
Contributor

@dhirenmathur dhirenmathur commented Sep 30, 2025

Summary by CodeRabbit

  • New Features

    • Global inference cache shared across projects for faster repeat processing.
    • Docstring requests now accept an optional metadata field.
  • Performance Improvements

    • Cache-aware batching, improved tokenization, and smarter reference resolution.
    • Chunking and consolidation for large nodes to stay within token limits.
  • Background Tasks

    • Automated cache cleanup (expired and least-accessed) with stats reporting.
  • Database

    • Added inference cache table and indexes; removed project linkage to enable global caching.
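As a rough illustration of the flow these notes describe (the real logic lives in InferenceCacheService and InferenceService; the helper below is hypothetical and heavily simplified):

from typing import Any, Callable, Dict

_CACHE: Dict[str, Dict[str, Any]] = {}  # stand-in for the new inference_cache table


def get_or_infer(content_hash: str, run_llm: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Reuse an inference cached by any project; otherwise call the LLM and store the result."""
    hit = _CACHE.get(content_hash)
    if hit is not None:
        return hit                      # global cache hit: no LLM call
    inference = run_llm()               # cache miss: generate docstring/tags/embedding
    _CACHE[content_hash] = inference    # keyed only by content hash, not by project
    return inference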

@coderabbitai
Contributor

coderabbitai bot commented Sep 30, 2025

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title highlights parsing performance and knowledge-graph improvements, which matches some of the changes (e.g., chunk handling for large files), but it does not mention the new inference caching system and related migrations that make up the bulk of the PR. The phrasing is clear, but the title is only partially related to the full changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cache_parse

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c950f83 and 99084e2.

📒 Files selected for processing (7)
  • app/alembic/versions/20250928_simple_global_cache.py (1 hunks)
  • app/modules/parsing/knowledge_graph/inference_service.py (8 hunks)
  • app/modules/parsing/services/cache_cleanup_service.py (1 hunks)
  • app/modules/parsing/services/inference_cache_service.py (1 hunks)
  • app/modules/parsing/tasks/cache_cleanup_tasks.py (1 hunks)
  • app/modules/parsing/utils/cache_diagnostics.py (1 hunks)
  • potpie-ui (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
app/modules/parsing/services/inference_cache_service.py (1)
app/modules/parsing/models/inference_cache_model.py (1)
  • InferenceCache (16-35)
app/modules/parsing/knowledge_graph/inference_service.py (6)
app/core/database.py (1)
  • get_db (29-34)
app/modules/parsing/knowledge_graph/inference_schema.py (3)
  • DocstringRequest (6-9)
  • DocstringResponse (18-19)
  • DocstringNode (12-15)
app/modules/parsing/services/inference_cache_service.py (3)
  • InferenceCacheService (10-148)
  • get_cached_inference (14-50)
  • store_inference (52-128)
app/modules/parsing/utils/content_hash.py (2)
  • generate_content_hash (6-26)
  • is_content_cacheable (29-49)
app/modules/parsing/utils/cache_diagnostics.py (2)
  • analyze_cache_misses (10-106)
  • log_diagnostics_summary (109-146)
app/modules/projects/projects_service.py (1)
  • get_project_from_db_by_id_sync (220-233)
app/modules/parsing/services/cache_cleanup_service.py (1)
app/modules/parsing/models/inference_cache_model.py (1)
  • InferenceCache (16-35)
app/alembic/versions/20250928_simple_global_cache.py (1)
app/alembic/versions/20250923_add_inference_cache_table.py (2)
  • upgrade (20-63)
  • downgrade (66-73)
app/modules/parsing/utils/cache_diagnostics.py (2)
app/modules/parsing/models/inference_cache_model.py (1)
  • InferenceCache (16-35)
app/modules/parsing/utils/content_hash.py (1)
  • generate_content_hash (6-26)
app/modules/parsing/tasks/cache_cleanup_tasks.py (2)
app/core/database.py (1)
  • get_db (29-34)
app/modules/parsing/services/cache_cleanup_service.py (4)
  • CacheCleanupService (11-90)
  • cleanup_expired_entries (31-43)
  • cleanup_least_accessed (45-71)
  • get_cleanup_stats (73-90)
🪛 Ruff (0.13.3)
app/modules/parsing/services/inference_cache_service.py

15-15: Unused method argument: project_id

(ARG002)


126-126: Use raise without specifying exception name

Remove exception name

(TRY201)

app/modules/parsing/knowledge_graph/inference_service.py

379-379: Unused method argument: project_id

(ARG002)


398-398: Do not catch blind exception: Exception

(BLE001)


615-615: Do not catch blind exception: Exception

(BLE001)


849-849: Do not catch blind exception: Exception

(BLE001)


926-926: Do not catch blind exception: Exception

(BLE001)


980-980: Do not catch blind exception: Exception

(BLE001)


990-990: Consider moving this statement to an else block

(TRY300)


992-992: Do not catch blind exception: Exception

(BLE001)


993-993: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

app/modules/parsing/utils/cache_diagnostics.py

91-91: Do not catch blind exception: Exception

(BLE001)


92-92: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

app/modules/parsing/tasks/cache_cleanup_tasks.py

21-21: Consider moving this statement to an else block

(TRY300)


46-46: Consider moving this statement to an else block

(TRY300)


67-67: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (1)
potpie-ui (1)

1-1: Confirm the potpie-ui submodule update.

Only the pointer changed, so I can’t review the underlying UI code here. Please confirm the linked commit was vetted (tests, lint, sanity checks) or share a brief summary of what it contains so we know the update is intentional.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (14)
app/modules/parsing/utils/content_hash.py (1)

44-47: Consider cross-platform line ending compatibility.

The current implementation splits on "\n" only, which may not correctly handle Windows-style "\r\n" line endings. This could affect the uniqueness calculation on different platforms.

Apply this diff to improve cross-platform compatibility:

-    lines = node_text.strip().split("\n")
+    lines = node_text.strip().splitlines()

The splitlines() method handles all line ending styles (\n, \r\n, \r) uniformly.
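A quick REPL check illustrates the difference:

text = "def foo():\r\n    return 1\r\n"
print(text.strip().split("\n"))   # ['def foo():\r', '    return 1']  (stray \r kept)
print(text.strip().splitlines())  # ['def foo():', '    return 1']    (all endings handled)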

app/modules/parsing/services/cache_cleanup_service.py (2)

11-15: Add validation for TTL configuration.

The environment variable INFERENCE_CACHE_TTL_DAYS is parsed without validation. If the value is non-numeric, negative, or zero, it could cause unexpected behavior or failures.

Apply this diff to add validation:

     def __init__(self, db: Session):
         self.db = db
         # Default TTL: 30 days, configurable via environment
-        self.cache_ttl_days = int(os.getenv("INFERENCE_CACHE_TTL_DAYS", "30"))
+        try:
+            self.cache_ttl_days = int(os.getenv("INFERENCE_CACHE_TTL_DAYS", "30"))
+            if self.cache_ttl_days <= 0:
+                logger.warning(f"Invalid INFERENCE_CACHE_TTL_DAYS: {self.cache_ttl_days}, using default 30")
+                self.cache_ttl_days = 30
+        except ValueError:
+            logger.warning("Invalid INFERENCE_CACHE_TTL_DAYS format, using default 30")
+            self.cache_ttl_days = 30

31-57: Optimize bulk deletion for better performance.

Lines 51-52 delete entries one at a time in a loop, which is inefficient for large deletions. If entries_to_remove is substantial (e.g., 10,000+), this will generate thousands of individual DELETE statements and significantly slow down the cleanup task.

Apply this diff to use bulk deletion:

     def cleanup_least_accessed(self, max_entries: int = 100000) -> int:
         """Remove least accessed entries if cache grows too large"""
         total_entries = self.db.query(InferenceCache).count()
 
         if total_entries <= max_entries:
             return 0
 
         entries_to_remove = total_entries - max_entries
 
-        # Get least accessed entries
-        least_accessed = (
+        # Get IDs of least accessed entries
+        least_accessed_ids = (
             self.db.query(InferenceCache)
             .order_by(
                 InferenceCache.access_count.asc(), InferenceCache.last_accessed.asc()
             )
             .limit(entries_to_remove)
-            .all()
+            .with_entities(InferenceCache.id)
+            .subquery()
         )
 
-        # Delete them
-        for entry in least_accessed:
-            self.db.delete(entry)
+        # Bulk delete using subquery
+        result = self.db.execute(
+            delete(InferenceCache).where(InferenceCache.id.in_(least_accessed_ids))
+        )
 
         self.db.commit()
 
-        logger.info(f"Cleaned up {entries_to_remove} least accessed cache entries")
-        return entries_to_remove
+        deleted_count = result.rowcount
+        logger.info(f"Cleaned up {deleted_count} least accessed cache entries")
+        return deleted_count
app/alembic/versions/20250923_add_inference_cache_table.py (2)

52-54: Potential redundant index on unique column.

The UNIQUE constraint on content_hash (line 47) already creates an index in PostgreSQL, so the explicit index creation on lines 52-53 may be redundant. However, using op.f() for standardized naming is a best practice.

This is acceptable but note that PostgreSQL will use the unique constraint's index for lookups regardless.


20-64: Consider indexing access_count for cleanup queries.

The access_count column (line 45) is used by cache cleanup services to identify least-accessed entries for eviction but lacks an index. Cache cleanup queries filtering or ordering by access_count will require full table scans as the cache grows.

Add an index on access_count for cleanup performance:

     op.create_index(
         op.f("ix_inference_cache_last_accessed"), "inference_cache", ["last_accessed"]
     )
+    op.create_index(
+        op.f("ix_inference_cache_access_count"), "inference_cache", ["access_count"]
+    )

And in the downgrade:

 def downgrade():
+    op.drop_index(
+        op.f("ix_inference_cache_access_count"), table_name="inference_cache"
+    )
     op.drop_index(
         op.f("ix_inference_cache_last_accessed"), table_name="inference_cache"
     )
app/modules/parsing/models/inference_cache_model.py (2)

19-19: Redundant index=True on primary key.

Primary keys are automatically indexed by the database. The explicit index=True is redundant but harmless.

-    id = Column(Integer, primary_key=True, index=True)
+    id = Column(Integer, primary_key=True)

35-35: Consider adding index for access_count to support cleanup queries.

The access_count column is used by cache cleanup services to identify least-accessed entries but lacks an index. This will cause performance issues as the cache grows and cleanup queries need to filter/sort by access count.

This aligns with the missing index in the migration file.

-    access_count = Column(Integer, default=1, nullable=False)
+    access_count = Column(Integer, default=1, nullable=False, index=True)

Note: This requires a corresponding migration update in 20250923_add_inference_cache_table.py.

thoughts/shared/plans/simplified-global-cache-implementation.md (1)

23-26: Add language identifier to fenced code block.

The code block lacks a language identifier for syntax highlighting.

-```
+```text
 content → content_hash → global cache → inference
 (project_id stored as metadata only)

Based on coding guidelines (markdownlint MD040).

app/modules/parsing/knowledge_graph/inference_service.py (6)

208-254: Remove unnecessary assignment at line 240.

The method correctly splits large nodes into processable chunks. However, line 240 assigns `current_tokens` after adding a line to `current_chunk_lines`, but this value is never used since `test_tokens` is recalculated on the next iteration.



Apply this diff to remove the dead assignment:

             else:
                 current_chunk_lines.append(line)
-                current_tokens = test_tokens

Optionally, consider using unpacking at line 222 for better style (as per coding guidelines):

         for line in lines:
-            test_chunk = "\n".join(current_chunk_lines + [line])
+            test_chunk = "\n".join([*current_chunk_lines, line])

256-289: Move inline import to module level.

The consolidation logic is sound, but the inline import of DocstringNode at line 283 is unnecessary since it's already imported from inference_schema at line 17.

Remove the inline import and rely on the module-level import:

-        # Create single consolidated docstring for parent node
-        from app.modules.parsing.knowledge_graph.inference_schema import DocstringNode
-
         consolidated_docstring = DocstringNode(

291-336: Remove redundant inline import.

The chunk processing logic correctly consolidates multi-chunk responses. However, the inline import of DocstringResponse at lines 324-326 is redundant since it's already imported at line 19.

Remove the inline import:

             # Create a mock response list for consolidation
             mock_responses = []
             for chunk_doc in chunk_docstrings:
-                from app.modules.parsing.knowledge_graph.inference_schema import (
-                    DocstringResponse,
-                )
-
                 mock_responses.append(DocstringResponse(docstrings=[chunk_doc]))

547-559: Replace hardcoded condition with proper log level check.

The if True: at line 548 unconditionally runs diagnostics. Consider checking if debug logging is enabled instead.

-            # Run diagnostics on nodes if DEBUG logging is enabled
-            if True:
+            if logger.isEnabledFor(logging.DEBUG):
                 try:

852-855: Use logger.exception for better error context.

When logging errors within an exception handler, use logger.exception instead of logger.error to automatically include the stack trace.

                 except Exception as e:
-                    logger.error(f"Failed to process batch {batch_index}: {e}")
+                    logger.exception(f"Failed to process batch {batch_index}: {e}")
                     # Continue with next batch instead of failing entire operation
                     return DocstringResponse(docstrings=[])

356-365: Extract repeated cache initialization to a helper method.

The cache service initialization pattern is duplicated in batch_nodes and generate_docstrings. Consider extracting to a helper method to reduce duplication and ensure consistent error handling.

Add a new method:

def _get_cache_service(self) -> Optional[InferenceCacheService]:
    """Initialize cache service with proper resource management"""
    try:
        db_gen = get_db()
        db = next(db_gen)
        cache_service = InferenceCacheService(db)
        # Store generator for cleanup
        cache_service._db_gen = db_gen
        return cache_service
    except Exception as e:
        logger.warning(f"Failed to initialize cache service: {e}. Continuing without cache.")
        return None

def _close_cache_service(self, cache_service: Optional[InferenceCacheService]) -> None:
    """Clean up cache service database session"""
    if cache_service and hasattr(cache_service, '_db_gen'):
        try:
            next(cache_service._db_gen)
        except StopIteration:
            pass

Then replace both initialization blocks with cache_service = self._get_cache_service() and add cleanup calls before returning.

Also applies to: 783-791
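If the helpers are adopted, the call sites could be reduced to something like the following sketch (the surrounding method bodies in inference_service.py are elided):

        cache_service = self._get_cache_service()
        try:
            ...  # existing cache-aware batching / result storage
        finally:
            self._close_cache_service(cache_service)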

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26cb74b and c950f83.

📒 Files selected for processing (17)
  • .gitignore (1 hunks)
  • app/alembic/versions/20250923_add_inference_cache_table.py (1 hunks)
  • app/alembic/versions/20250928_simple_global_cache.py (1 hunks)
  • app/core/config_provider.py (2 hunks)
  • app/modules/media/media_controller.py (1 hunks)
  • app/modules/media/media_router.py (2 hunks)
  • app/modules/media/media_service.py (1 hunks)
  • app/modules/parsing/knowledge_graph/inference_schema.py (1 hunks)
  • app/modules/parsing/knowledge_graph/inference_service.py (6 hunks)
  • app/modules/parsing/models/inference_cache_model.py (1 hunks)
  • app/modules/parsing/services/cache_cleanup_service.py (1 hunks)
  • app/modules/parsing/services/inference_cache_service.py (1 hunks)
  • app/modules/parsing/tasks/cache_cleanup_tasks.py (1 hunks)
  • app/modules/parsing/utils/cache_diagnostics.py (1 hunks)
  • app/modules/parsing/utils/content_hash.py (1 hunks)
  • deployment/stage/convo-server/convo-api-supervisord.conf (1 hunks)
  • thoughts/shared/plans/simplified-global-cache-implementation.md (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
app/alembic/versions/20250928_simple_global_cache.py (1)
app/alembic/versions/20250923_add_inference_cache_table.py (2)
  • upgrade (20-63)
  • downgrade (66-73)
app/modules/parsing/services/cache_cleanup_service.py (1)
app/modules/parsing/models/inference_cache_model.py (1)
  • InferenceCache (16-35)
app/modules/parsing/models/inference_cache_model.py (1)
app/core/base_model.py (1)
  • Base (9-16)
app/modules/parsing/knowledge_graph/inference_service.py (6)
app/core/database.py (1)
  • get_db (29-34)
app/modules/parsing/knowledge_graph/inference_schema.py (3)
  • DocstringRequest (6-9)
  • DocstringResponse (18-19)
  • DocstringNode (12-15)
app/modules/parsing/services/inference_cache_service.py (2)
  • get_cached_inference (14-44)
  • store_inference (46-122)
app/modules/parsing/utils/content_hash.py (2)
  • generate_content_hash (6-26)
  • is_content_cacheable (29-49)
app/modules/parsing/utils/cache_diagnostics.py (2)
  • analyze_cache_misses (10-104)
  • log_diagnostics_summary (107-144)
app/modules/projects/projects_service.py (1)
  • get_project_from_db_by_id_sync (220-233)
app/modules/parsing/utils/cache_diagnostics.py (2)
app/modules/parsing/models/inference_cache_model.py (1)
  • InferenceCache (16-35)
app/modules/parsing/utils/content_hash.py (1)
  • generate_content_hash (6-26)
app/modules/parsing/services/inference_cache_service.py (1)
app/modules/parsing/models/inference_cache_model.py (1)
  • InferenceCache (16-35)
app/alembic/versions/20250923_add_inference_cache_table.py (1)
app/alembic/versions/20250928_simple_global_cache.py (2)
  • upgrade (17-24)
  • downgrade (27-36)
app/modules/parsing/tasks/cache_cleanup_tasks.py (2)
app/core/database.py (1)
  • get_db (29-34)
app/modules/parsing/services/cache_cleanup_service.py (4)
  • CacheCleanupService (11-76)
  • cleanup_expired_entries (17-29)
  • cleanup_least_accessed (31-57)
  • get_cleanup_stats (59-76)
🪛 markdownlint-cli2 (0.18.1)
thoughts/shared/plans/simplified-global-cache-implementation.md

23-23: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.1)
app/modules/parsing/knowledge_graph/inference_service.py

222-222: Consider [*current_chunk_lines, line] instead of concatenation

Replace with [*current_chunk_lines, line]

(RUF005)


240-240: Local variable current_tokens is assigned to but never used

Remove assignment to unused variable current_tokens

(F841)


343-343: Unused method argument: project_id

(ARG002)


360-360: Do not catch blind exception: Exception

(BLE001)


558-558: Do not catch blind exception: Exception

(BLE001)


786-786: Do not catch blind exception: Exception

(BLE001)


836-836: Do not catch blind exception: Exception

(BLE001)


850-850: Consider moving this statement to an else block

(TRY300)


852-852: Do not catch blind exception: Exception

(BLE001)


853-853: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

app/modules/parsing/utils/cache_diagnostics.py

89-89: Do not catch blind exception: Exception

(BLE001)


90-90: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

app/modules/parsing/services/inference_cache_service.py

15-15: Unused method argument: project_id

(ARG002)


120-120: Use raise without specifying exception name

Remove exception name

(TRY201)

app/modules/parsing/tasks/cache_cleanup_tasks.py

20-20: Consider moving this statement to an else block

(TRY300)


22-22: Do not catch blind exception: Exception

(BLE001)


23-23: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


24-24: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


41-41: Consider moving this statement to an else block

(TRY300)


43-43: Do not catch blind exception: Exception

(BLE001)


44-44: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


45-45: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


58-58: Consider moving this statement to an else block

(TRY300)


60-60: Do not catch blind exception: Exception

(BLE001)


61-61: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


62-62: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🔇 Additional comments (24)
deployment/stage/convo-server/convo-api-supervisord.conf (1)

1-13: LGTM! Cosmetic change only.

The only change in this file is the addition of a trailing newline, which has no functional impact on the supervisord configuration.

app/core/config_provider.py (2)

20-22: LGTM: Formatting improvement.

The multiline formatting improves readability without changing behavior.


123-130: LGTM: Good formatting with trailing comma.

The multiline formatting improves readability, and the trailing comma follows Python best practices for easier future modifications.

app/modules/media/media_service.py (1)

60-62: LGTM: Formatting improvement for readability.

The multi-line formatting of the logger message is a minor readability improvement with no functional impact.

app/modules/media/media_controller.py (1)

39-40: LGTM: Consistent formatting with trailing comma.

Adding the trailing comma follows Python best practices for multi-line dictionaries and reduces diff noise when adding future fields.

app/modules/media/media_router.py (1)

41-42: LGTM: Consistent formatting across multimodal error responses.

The trailing commas in both HTTPException detail dictionaries follow Python best practices and maintain consistency with the formatting changes in media_controller.py.

Also applies to: 188-189

.gitignore (1)

74-74: Confirm the ignore-rule rename from .thoughts/ to thoughts/ is intentional.
No hidden .thoughts/ directories were detected. Please verify that renaming the ignore rule from .thoughts/ to thoughts/ is intentional and won't expose previously ignored content.

app/modules/parsing/utils/content_hash.py (1)

6-26: LGTM: Hash generation is appropriate for cache keys.

The whitespace normalization (collapsing all whitespace to single spaces) is a reasonable trade-off for content-based caching, as it focuses on semantic equivalence rather than formatting. The optional node_type prefix provides useful differentiation for identical code in different contexts.
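For illustration, the normalization described here behaves roughly like the sketch below (the real function is generate_content_hash in content_hash.py; the hash algorithm shown is an assumption):

import hashlib
from typing import Optional


def illustrative_content_hash(node_text: str, node_type: Optional[str] = None) -> str:
    normalized = " ".join(node_text.split())  # collapse all whitespace to single spaces
    key = f"{node_type}:{normalized}" if node_type else normalized
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Formatting-only differences produce the same cache key:
assert illustrative_content_hash("def foo():\n    return 1", "FUNCTION") == \
    illustrative_content_hash("def foo():  return 1", "FUNCTION")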

app/modules/parsing/knowledge_graph/inference_schema.py (1)

1-1: LGTM: Backward-compatible metadata field addition.

The new optional metadata field is properly typed and defaults to None, ensuring backward compatibility. This enables metadata propagation for the cache layer introduced in this PR.

Also applies to: 9-9
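For context, the updated request schema looks roughly like this (field names other than metadata are assumed for illustration):

from typing import Any, Dict, Optional

from pydantic import BaseModel


class DocstringRequest(BaseModel):
    node_id: str
    text: str
    metadata: Optional[Dict[str, Any]] = None  # new optional field, defaults to None


# Existing call sites keep working because the field is optional:
request = DocstringRequest(node_id="abc123", text="def foo(): ...")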

app/alembic/versions/20250928_simple_global_cache.py (1)

17-21: Please run these SQL queries against your PostgreSQL database and share the outputs to confirm whether the inference_cache_project_id_fkey constraint exists:

SELECT conname
FROM pg_constraint
WHERE conname = 'inference_cache_project_id_fkey'
  AND conrelid = 'inference_cache'::regclass;

SELECT conname, contype
FROM pg_constraint
WHERE conrelid = 'inference_cache'::regclass
  AND contype = 'f';
app/alembic/versions/20250923_add_inference_cache_table.py (1)

48-48: Note: FK constraint is removed in subsequent migration.

The foreign key constraint with CASCADE delete on project_id is removed in migration 20250928_simple_global_cache.py to enable global cache independence. This temporary FK creates a brief inconsistency where the cache would be deleted with projects, which contradicts the global cache design.

While the migration chain is correct, consider adding a comment here referencing the subsequent removal for maintainability.

app/modules/parsing/services/inference_cache_service.py (3)

14-44: LGTM! Global cache lookup correctly ignores project_id.

The implementation correctly performs lookups using only content_hash, with project_id kept for API compatibility as documented. Access tracking updates are appropriate.

Note: The Ruff ARG002 warning about unused project_id is a false positive—the parameter is intentionally kept for API compatibility during the migration to global cache.


37-37: Use server-side timestamp for consistency.

Using func.now() for last_accessed generates a SQL NOW() call, which is correct. However, ensure this matches the Column definition's server_default=func.now() behavior for consistency.

This is acceptable as-is but be aware that func.now() is evaluated server-side during commit.


124-142: Inconsistent project_id filtering in stats contradicts global cache design.

The get_cache_stats method filters by project_id (lines 128-129), which contradicts the global cache architecture where project_id is metadata only. This filtering suggests project-scoped analytics, but the cache itself is global.

While this may be intentional for analytics purposes, it creates confusion about whether the cache is truly global or project-scoped.

Clarify the intended use of project_id filtering in stats:

  • If stats should be global-only, remove the filtering
  • If project-scoped analytics are desired, document this as a metadata-only use case

Based on the plan document (thoughts/shared/plans/simplified-global-cache-implementation.md), project_id is "metadata only" for tracing, suggesting the filter should be removed or clearly documented as analytics-only.
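A global-only variant of the stats query could look roughly like this sketch (assumes the columns shown in the model review above):

from sqlalchemy import func
from sqlalchemy.orm import Session

from app.modules.parsing.models.inference_cache_model import InferenceCache


def get_global_cache_stats(db: Session) -> dict:
    entries, accesses = db.query(
        func.count(InferenceCache.id),
        func.coalesce(func.sum(InferenceCache.access_count), 0),
    ).one()
    return {"entries": entries, "total_accesses": int(accesses)}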

app/modules/parsing/models/inference_cache_model.py (1)

16-36: LGTM! Model schema aligns with migration.

The ORM model correctly defines all columns, constraints, and indexes matching the Alembic migration. The explicit __tablename__ appropriately overrides the base class's auto-generated name.

thoughts/shared/plans/simplified-global-cache-implementation.md (1)

1-376: Excellent implementation plan with clear phases and verification criteria.

The plan document provides a well-structured migration path from the broken project-scoped cache to a simplified global cache. The three-phase approach (fix logic → remove FK → simplify integration) minimizes risk while delivering immediate value.

Key strengths:

  • Clear problem statement with root cause analysis
  • Incremental phases with automated and manual verification criteria
  • Comprehensive test strategy covering cross-project sharing and persistence
  • Performance considerations and rollback strategy

The embedded code examples match the actual implementation, demonstrating that the plan was followed.

app/modules/parsing/utils/cache_diagnostics.py (2)

10-104: Well-designed diagnostic utility with defensive programming.

The analyze_cache_misses function provides comprehensive analysis of cache effectiveness with appropriate defensive checks for malformed data. The categorization of issues (missing types, unresolved references, cache misses) with examples is helpful for debugging.


107-144: LGTM! Clear and actionable diagnostic output.

The summary logging provides well-formatted diagnostic information with appropriate categorization and example nodes for each issue type. The formatting with separators and percentage calculations makes it easy to identify cache problems.

app/modules/parsing/knowledge_graph/inference_service.py (6)

5-5: LGTM! Imports support the new caching infrastructure.

The additional imports for Optional, get_db, InferenceCacheService, and content hashing utilities are appropriate for the cache-aware inference processing being introduced.

Also applies to: 13-13, 21-25


83-91: LGTM! Robust input validation.

The added None and type checking prevents potential crashes when node text is missing or incorrectly typed, with appropriate warnings to surface data quality issues.


338-343: Unused parameter is acceptable for API consistency.

The project_id parameter is currently unused within batch_nodes itself (the cache service doesn't use it for lookups per the implementation). However, this is acceptable as it maintains API consistency and supports future enhancements where project-specific batching logic may be needed.


648-653: LGTM! Safety check prevents empty batches.

The added conditional ensures that only non-empty batches are appended, preventing potential issues downstream.


769-779: LGTM! Effective cache-first processing strategy.

The implementation correctly separates cached nodes from those requiring LLM processing, processing cached nodes first via update_neo4j_with_cached_inference before batching the remainder. This optimizes throughput by avoiding unnecessary LLM calls.
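The split itself is simple; a self-contained sketch of the idea, using the cached_inference key visible in the code quoted later in this review:

from typing import Any, Dict, List, Tuple


def split_by_cache(nodes: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate nodes that already carry a cached inference from those still needing the LLM."""
    cached = [n for n in nodes if n.get("cached_inference")]
    pending = [n for n in nodes if not n.get("cached_inference")]
    return cached, pending


cached, pending = split_by_cache(
    [{"node_id": "a", "cached_inference": {"docstring": "..."}}, {"node_id": "b"}]
)
assert [n["node_id"] for n in cached] == ["a"]
assert [n["node_id"] for n in pending] == ["b"]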


808-840: LGTM! Robust cache storage with graceful failure handling.

The cache storage logic correctly extracts inference data and stores it with appropriate metadata (content_hash, node_type, tags, embedding). The broad exception catch at line 836 is appropriate here - cache write failures should not break the inference pipeline.

Comment on lines 783 to 791
        try:
            db = next(get_db())
            cache_service = InferenceCacheService(db)
        except Exception as e:
            logger.warning(
                f"Failed to initialize cache service for result storage: {e}"
            )
            cache_service = None


⚠️ Potential issue | 🟠 Major

Fix resource leak: improperly managed database session.

Same issue as in batch_nodes - the database session from get_db() is not properly closed, causing a connection leak.

Apply the same fix pattern as suggested for batch_nodes:

         # Process LLM batches and cache results
         try:
-            db = next(get_db())
+            db_gen = get_db()
+            db = next(db_gen)
             cache_service = InferenceCacheService(db)
         except Exception as e:
             logger.warning(
                 f"Failed to initialize cache service for result storage: {e}"
             )
             cache_service = None
+            db_gen = None

Add cleanup before the method returns (after line 870):

+        # Clean up database session
+        if 'db_gen' in locals() and db_gen:
+            try:
+                next(db_gen)
+            except StopIteration:
+                pass
+
         return updated_docstrings
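An alternative to advancing the generator by hand is to let contextlib drive it, which guarantees cleanup even on errors; a sketch, assuming get_db() is a standard yield-once session generator:

from contextlib import contextmanager

from app.core.database import get_db
from app.modules.parsing.services.inference_cache_service import InferenceCacheService

with contextmanager(get_db)() as db:      # wraps the generator in a context manager
    cache_service = InferenceCacheService(db)
    ...                                    # store results while the session is open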
🧰 Tools
🪛 Ruff (0.13.1)

786-786: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In app/modules/parsing/knowledge_graph/inference_service.py around lines 783 to
791, the code calls next(get_db()) to obtain a DB session but never closes the
generator, leaking connections; change to capture the generator (db_gen =
get_db()) then db = next(db_gen) and, before the method returns (add the cleanup
just prior to the return around line 870), add the cleanup block that checks if
db_gen is in locals() and calls next(db_gen) inside a try/except StopIteration
to close the generator; keep existing exception handling for initialization but
ensure the cleanup runs whether cache_service was created or set to None.

Comment on lines 980 to 1013
    async def update_neo4j_with_cached_inference(self, node: Dict[str, Any]) -> None:
        """Update Neo4j with cached inference data for a single node"""
        cached_inference = node.get("cached_inference", {})
        if not cached_inference:
            return

        # Extract inference data
        docstring = cached_inference.get("docstring", "")
        tags = cached_inference.get("tags", [])
        embedding = self.generate_embedding(docstring)

        with self.driver.session() as session:
            project = self.project_manager.get_project_from_db_by_id_sync(
                node.get("project_id", "")
            )
            repo_path = project.get("repo_path") if project else None
            is_local_repo = True if repo_path else False

            session.run(
                """
                MATCH (n:NODE {repoId: $repo_id, node_id: $node_id})
                SET n.docstring = $docstring,
                    n.embedding = $embedding,
                    n.tags = $tags
                """
                + ("" if is_local_repo else ", n.text = null, n.signature = null"),
                repo_id=node.get("project_id", ""),
                node_id=node["node_id"],
                docstring=docstring,
                embedding=embedding,
                tags=tags,
            )

        logger.debug(f"Updated Neo4j with cached inference for node {node['node_id']}")

⚠️ Potential issue | 🟠 Major

Verify project_id type and simplify boolean assignment.

Several issues to address:

  1. Potential type mismatch: Line 992 retrieves project_id as a string (with default ""), but get_project_from_db_by_id_sync expects an int parameter according to the method signature. This could cause a runtime error.

  2. Unnecessary async: This method has no await calls - the Neo4j session operations and project lookup are all synchronous. Consider making it a regular def method unless async is required for future compatibility.

  3. Simplify boolean assignment: Line 996 can be simplified.

Apply this diff to fix the type issue and simplify:

-        with self.driver.session() as session:
-            project = self.project_manager.get_project_from_db_by_id_sync(
-                node.get("project_id", "")
-            )
-            repo_path = project.get("repo_path") if project else None
-            is_local_repo = True if repo_path else False
+        with self.driver.session() as session:
+            # Convert project_id to int for DB lookup
+            project_id = node.get("project_id")
+            if project_id:
+                try:
+                    project_id = int(project_id) if isinstance(project_id, str) else project_id
+                except (ValueError, TypeError):
+                    logger.warning(f"Invalid project_id for node {node['node_id']}: {project_id}")
+                    project_id = None
+            
+            project = self.project_manager.get_project_from_db_by_id_sync(project_id) if project_id else None
+            repo_path = project.get("repo_path") if project else None
+            is_local_repo = bool(repo_path)

Additionally, consider changing the method signature to def instead of async def since there are no async operations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

    async def update_neo4j_with_cached_inference(self, node: Dict[str, Any]) -> None:
        """Update Neo4j with cached inference data for a single node"""
        cached_inference = node.get("cached_inference", {})
        if not cached_inference:
            return

        # Extract inference data
        docstring = cached_inference.get("docstring", "")
        tags = cached_inference.get("tags", [])
        embedding = self.generate_embedding(docstring)

        with self.driver.session() as session:
            # Convert project_id to int for DB lookup
            project_id = node.get("project_id")
            if project_id:
                try:
                    project_id = int(project_id) if isinstance(project_id, str) else project_id
                except (ValueError, TypeError):
                    logger.warning(f"Invalid project_id for node {node['node_id']}: {project_id}")
                    project_id = None

            project = (
                self.project_manager.get_project_from_db_by_id_sync(project_id)
                if project_id
                else None
            )
            repo_path = project.get("repo_path") if project else None
            is_local_repo = bool(repo_path)

            session.run(
                """
                MATCH (n:NODE {repoId: $repo_id, node_id: $node_id})
                SET n.docstring = $docstring,
                    n.embedding = $embedding,
                    n.tags = $tags
                """
                + ("" if is_local_repo else ", n.text = null, n.signature = null"),
                repo_id=node.get("project_id", ""),
                node_id=node["node_id"],
                docstring=docstring,
                embedding=embedding,
                tags=tags,
            )

        logger.debug(f"Updated Neo4j with cached inference for node {node['node_id']}")

Comment on lines +72 to +87
        # Check if entry already exists
        existing_entry = (
            self.db.query(InferenceCache)
            .filter(InferenceCache.content_hash == content_hash)
            .first()
        )

        if existing_entry:
            # Entry already exists - update access tracking and return
            existing_entry.access_count += 1
            existing_entry.last_accessed = func.now()
            self.db.commit()
            logger.debug(
                f"Cache entry already exists for content_hash: {content_hash[:12]}... (updated access tracking)"
            )
            return existing_entry

⚠️ Potential issue | 🟡 Minor

Race condition: check-then-insert is not atomic.

The check for existing entry (lines 73-77) followed by insert (lines 90-106) creates a race window where concurrent processes can both pass the existence check and attempt to insert the same content_hash, leading to unique constraint violations.

While the exception handler (lines 107-122) recovers from this race condition, it's inefficient to rely on exception handling for expected concurrent behavior.

Consider using an atomic upsert with PostgreSQL's ON CONFLICT:

from sqlalchemy.dialects.postgresql import insert

def store_inference(self, content_hash: str, inference_data: Dict[str, Any], ...) -> InferenceCache:
    stmt = insert(InferenceCache).values(
        content_hash=content_hash,
        project_id=project_id,
        node_type=node_type,
        content_length=content_length,
        inference_data=inference_data,
        embedding_vector=embedding_vector,
        tags=tags,
    ).on_conflict_do_update(
        index_elements=['content_hash'],
        set_={
            'access_count': InferenceCache.access_count + 1,
            'last_accessed': func.now()
        }
    ).returning(InferenceCache)
    
    result = self.db.execute(stmt)
    cache_entry = result.scalar_one()
    self.db.commit()
    return cache_entry

This eliminates the race condition and is more efficient.

Comment on lines +89 to +90
        except Exception as e:
            logger.error(f"Error analyzing node {node_id}: {e}")

🛠️ Refactor suggestion | 🟠 Major

Use logging.exception to include traceback.

When logging errors within an exception handler, logging.exception automatically includes the stack trace, which is more helpful for debugging than logging.error.

         except Exception as e:
-            logger.error(f"Error analyzing node {node_id}: {e}")
+            logger.exception(f"Error analyzing node {node_id}: {e}")

Based on coding guidelines (Ruff TRY400).

Note: The broad Exception catch is acceptable here since this is a diagnostic utility that should be resilient to individual node failures and continue processing remaining nodes.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         except Exception as e:
-            logger.error(f"Error analyzing node {node_id}: {e}")
+            logger.exception(f"Error analyzing node {node_id}: {e}")
🧰 Tools
🪛 Ruff (0.13.1)

89-89: Do not catch blind exception: Exception

(BLE001)


90-90: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In app/modules/parsing/utils/cache_diagnostics.py around lines 89 to 90, replace
the logger.error call inside the except block with logger.exception so the stack
trace is recorded; keep the same contextual message (e.g., "Error analyzing node
{node_id}:") and pass the exception implicitly by calling logger.exception to
include the traceback while preserving the broad Exception catch behavior.

@dhirenmathur
Contributor Author

@coderabbitai generate docstring

@potpie-ai potpie-ai deleted a comment from cursor bot Sep 30, 2025
@coderabbitai
Contributor

coderabbitai bot commented Sep 30, 2025

Tip

For best results, initiate chat on the files or code changes.

Task: Generate docstrings for PR #452

I'll generate comprehensive docstrings for all public classes, methods, and functions across the new inference caching system to improve coverage from 58.14% to meet the 80% threshold.

Files requiring docstrings:

  1. app/modules/parsing/services/inference_cache_service.py - InferenceCacheService class and methods
  2. app/modules/parsing/services/cache_cleanup_service.py - CacheCleanupService class and methods
  3. app/modules/parsing/knowledge_graph/inference_service.py - Updated methods and new helper functions
  4. app/modules/parsing/utils/cache_diagnostics.py - Diagnostic utility functions
  5. app/modules/parsing/utils/content_hash.py - Hash generation utilities
  6. app/modules/parsing/models/inference_cache_model.py - ORM model class
  7. app/modules/parsing/tasks/cache_cleanup_tasks.py - Celery task functions
  8. app/alembic/versions/ - Migration scripts

@dhirenmathur I'll generate comprehensive docstrings across all 8 files in the inference caching system. The docstrings will follow Google-style format and cover:

  • Service classes: InferenceCacheService and CacheCleanupService with all methods
  • Updated inference logic: New cache-aware methods in InferenceService
  • Utility functions: Cache diagnostics and content hashing utilities
  • ORM model: InferenceCache table documentation
  • Celery tasks: Cache cleanup periodic tasks
  • Migration scripts: Database schema change documentation

This will bring your docstring coverage from 58.14% to well above the 80% threshold. The docstrings will include detailed descriptions, parameter specifications, return values, and any raised exceptions.
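For reference, a Google-style docstring on one of the cache methods would look roughly like this (method shown out of its class; the signature is assumed from the review, not copied from the code):

from typing import Any, Dict, Optional


def get_cached_inference(self, content_hash: str, project_id: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Fetch a previously stored inference for identical content.

    Args:
        content_hash: Hash of the normalized node text, used as the global cache key.
        project_id: Kept only for API compatibility; lookups ignore it.

    Returns:
        The cached inference payload, or None on a cache miss.
    """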


⏳ Preparing changes...

@coderabbitai
Contributor

coderabbitai bot commented Sep 30, 2025

Caution

The CodeRabbit agent's plans did not produce any file changes.

dhirenmathur and others added 8 commits October 2, 2025 11:31
…nection_handling, and revert_script for database migration among other things.
…ached embedding reuse

- Preserved robust database connection cleanup in inference_service.py
- Maintained cached embedding vector reuse functionality
- Kept robust TTL configuration with proper error handling
- Preserved proper database connection cleanup in cache tasks
- Maintained correct node counting logic in cache diagnostics
* chore(submodule): refresh potpie-ui to latest origin/main

* chore: refresh potpie-ui submodule and update .gitignore

* chore: Auto-fix pre-commit issues

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@sonarqubecloud

sonarqubecloud bot commented Oct 9, 2025
