Parsing improvements: Speed up parsing for new branches / Improve KG for large files #452
base: main
Conversation
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
🧰 Additional context used
🧬 Code graph analysis (6)
app/modules/parsing/services/inference_cache_service.py (1)
app/modules/parsing/knowledge_graph/inference_service.py (6)
app/modules/parsing/services/cache_cleanup_service.py (1)
app/alembic/versions/20250928_simple_global_cache.py (1)
app/modules/parsing/utils/cache_diagnostics.py (2)
app/modules/parsing/tasks/cache_cleanup_tasks.py (2)
🪛 Ruff (0.13.3)
app/modules/parsing/services/inference_cache_service.py
15-15: Unused method argument (ARG002)
126-126: Use raise without specifying exception name (TRY201)
app/modules/parsing/knowledge_graph/inference_service.py
379-379: Unused method argument (ARG002)
398-398: Do not catch blind exception: Exception (BLE001)
615-615: Do not catch blind exception: Exception (BLE001)
849-849: Do not catch blind exception: Exception (BLE001)
926-926: Do not catch blind exception: Exception (BLE001)
980-980: Do not catch blind exception: Exception (BLE001)
990-990: Consider moving this statement to an else block (TRY300)
992-992: Do not catch blind exception: Exception (BLE001)
993-993: Use logging.exception instead of logging.error (TRY400)
app/modules/parsing/utils/cache_diagnostics.py
91-91: Do not catch blind exception: Exception (BLE001)
92-92: Use logging.exception instead of logging.error (TRY400)
app/modules/parsing/tasks/cache_cleanup_tasks.py
21-21: Consider moving this statement to an else block (TRY300)
46-46: Consider moving this statement to an else block (TRY300)
67-67: Consider moving this statement to an else block (TRY300)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 12
🧹 Nitpick comments (14)
app/modules/parsing/utils/content_hash.py (1)
44-47: Consider cross-platform line ending compatibility.

The current implementation splits on "\n" only, which may not correctly handle Windows-style "\r\n" line endings. This could affect the uniqueness calculation on different platforms. Apply this diff to improve cross-platform compatibility:

```diff
-    lines = node_text.strip().split("\n")
+    lines = node_text.strip().splitlines()
```

The splitlines() method handles all line ending styles (\n, \r\n, \r) uniformly.
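For a quick illustration of the difference, a standalone snippet (not code from the PR):

```python
# split("\n") leaves a stray "\r" on Windows-style input,
# while splitlines() normalizes all line-ending styles.
text = "def foo():\r\n    return 1\r\n"
print(text.strip().split("\n"))   # ['def foo():\r', '    return 1']
print(text.strip().splitlines())  # ['def foo():', '    return 1']
```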
app/modules/parsing/services/cache_cleanup_service.py (2)
11-15: Add validation for TTL configuration.

The environment variable INFERENCE_CACHE_TTL_DAYS is parsed without validation. If the value is non-numeric, negative, or zero, it could cause unexpected behavior or failures. Apply this diff to add validation:

```diff
     def __init__(self, db: Session):
         self.db = db
         # Default TTL: 30 days, configurable via environment
-        self.cache_ttl_days = int(os.getenv("INFERENCE_CACHE_TTL_DAYS", "30"))
+        try:
+            self.cache_ttl_days = int(os.getenv("INFERENCE_CACHE_TTL_DAYS", "30"))
+            if self.cache_ttl_days <= 0:
+                logger.warning(f"Invalid INFERENCE_CACHE_TTL_DAYS: {self.cache_ttl_days}, using default 30")
+                self.cache_ttl_days = 30
+        except ValueError:
+            logger.warning(f"Invalid INFERENCE_CACHE_TTL_DAYS format, using default 30")
+            self.cache_ttl_days = 30
```
31-57: Optimize bulk deletion for better performance.

Lines 51-52 delete entries one at a time in a loop, which is inefficient for large deletions. If entries_to_remove is substantial (e.g., 10,000+), this will generate thousands of individual DELETE statements and significantly slow down the cleanup task. Apply this diff to use bulk deletion:

```diff
     def cleanup_least_accessed(self, max_entries: int = 100000) -> int:
         """Remove least accessed entries if cache grows too large"""
         total_entries = self.db.query(InferenceCache).count()

         if total_entries <= max_entries:
             return 0

         entries_to_remove = total_entries - max_entries

-        # Get least accessed entries
-        least_accessed = (
+        # Get IDs of least accessed entries
+        least_accessed_ids = (
             self.db.query(InferenceCache)
             .order_by(
                 InferenceCache.access_count.asc(), InferenceCache.last_accessed.asc()
             )
             .limit(entries_to_remove)
-            .all()
+            .with_entities(InferenceCache.id)
+            .subquery()
         )

-        # Delete them
-        for entry in least_accessed:
-            self.db.delete(entry)
+        # Bulk delete using subquery
+        result = self.db.execute(
+            delete(InferenceCache).where(InferenceCache.id.in_(least_accessed_ids))
+        )
         self.db.commit()

-        logger.info(f"Cleaned up {entries_to_remove} least accessed cache entries")
-        return entries_to_remove
+        deleted_count = result.rowcount
+        logger.info(f"Cleaned up {deleted_count} least accessed cache entries")
+        return deleted_count
```

app/alembic/versions/20250923_add_inference_cache_table.py (2)
52-54: Potential redundant index on unique column.

The UNIQUE constraint on content_hash (line 47) already creates an index in PostgreSQL, so the explicit index creation on lines 52-53 may be redundant. However, using op.f() for standardized naming is a best practice. This is acceptable, but note that PostgreSQL will use the unique constraint's index for lookups regardless.
20-64: Consider indexing access_count for cleanup queries.

The access_count column (line 45) is used by cache cleanup services to identify least-accessed entries for eviction but lacks an index. Cache cleanup queries filtering or ordering by access_count will require full table scans as the cache grows. Add an index on access_count for cleanup performance:

```diff
     op.create_index(
         op.f("ix_inference_cache_last_accessed"), "inference_cache", ["last_accessed"]
     )
+    op.create_index(
+        op.f("ix_inference_cache_access_count"), "inference_cache", ["access_count"]
+    )
```

And in the downgrade:

```diff
 def downgrade():
+    op.drop_index(
+        op.f("ix_inference_cache_access_count"), table_name="inference_cache"
+    )
     op.drop_index(
         op.f("ix_inference_cache_last_accessed"), table_name="inference_cache"
     )
```

app/modules/parsing/models/inference_cache_model.py (2)
19-19: Redundant index=True on primary key.

Primary keys are automatically indexed by the database. The explicit index=True is redundant but harmless.

```diff
-    id = Column(Integer, primary_key=True, index=True)
+    id = Column(Integer, primary_key=True)
```
35-35: Consider adding index for access_count to support cleanup queries.

The access_count column is used by cache cleanup services to identify least-accessed entries but lacks an index. This will cause performance issues as the cache grows and cleanup queries need to filter/sort by access count. This aligns with the missing index in the migration file.

```diff
-    access_count = Column(Integer, default=1, nullable=False)
+    access_count = Column(Integer, default=1, nullable=False, index=True)
```

Note: This requires a corresponding migration update in 20250923_add_inference_cache_table.py.

thoughts/shared/plans/simplified-global-cache-implementation.md (1)
23-26: Add language identifier to fenced code block.

The code block lacks a language identifier for syntax highlighting.

````diff
-```
+```text
 content → content_hash → global cache → inference (project_id stored as metadata only)
````

Based on coding guidelines (markdownlint MD040).

app/modules/parsing/knowledge_graph/inference_service.py (6)

208-254: Remove unnecessary assignment at line 240.

The method correctly splits large nodes into processable chunks. However, line 240 assigns current_tokens after adding a line to current_chunk_lines, but this value is never used since test_tokens is recalculated on the next iteration. Apply this diff to remove the dead assignment:

```diff
             else:
                 current_chunk_lines.append(line)
-                current_tokens = test_tokens
```

Optionally, consider using unpacking at line 222 for better style (as per coding guidelines):

```diff
         for line in lines:
-            test_chunk = "\n".join(current_chunk_lines + [line])
+            test_chunk = "\n".join([*current_chunk_lines, line])
```
256-289: Move inline import to module level.

The consolidation logic is sound, but the inline import of DocstringNode at line 283 is unnecessary since it's already imported from inference_schema at line 17. Remove the inline import and rely on the module-level import:

```diff
         # Create single consolidated docstring for parent node
-        from app.modules.parsing.knowledge_graph.inference_schema import DocstringNode
-
         consolidated_docstring = DocstringNode(
```
291-336: Remove redundant inline import.

The chunk processing logic correctly consolidates multi-chunk responses. However, the inline import of DocstringResponse at lines 324-326 is redundant since it's already imported at line 19. Remove the inline import:

```diff
         # Create a mock response list for consolidation
         mock_responses = []
         for chunk_doc in chunk_docstrings:
-            from app.modules.parsing.knowledge_graph.inference_schema import (
-                DocstringResponse,
-            )
-
             mock_responses.append(DocstringResponse(docstrings=[chunk_doc]))
```
547-559: Replace hardcoded condition with proper log level check.

The if True: at line 548 unconditionally runs diagnostics. Consider checking if debug logging is enabled instead.

```diff
-        # Run diagnostics on nodes if DEBUG logging is enabled
-        if True:
+        if logger.isEnabledFor(logging.DEBUG):
             try:
```
852-855: Use logger.exception for better error context.

When logging errors within an exception handler, use logger.exception instead of logger.error to automatically include the stack trace.

```diff
         except Exception as e:
-            logger.error(f"Failed to process batch {batch_index}: {e}")
+            logger.exception(f"Failed to process batch {batch_index}: {e}")
             # Continue with next batch instead of failing entire operation
             return DocstringResponse(docstrings=[])
```
356-365: Extract repeated cache initialization to a helper method.

The cache service initialization pattern is duplicated in batch_nodes and generate_docstrings. Consider extracting to a helper method to reduce duplication and ensure consistent error handling. Add a new method:

```python
    def _get_cache_service(self) -> Optional[InferenceCacheService]:
        """Initialize cache service with proper resource management"""
        try:
            db_gen = get_db()
            db = next(db_gen)
            cache_service = InferenceCacheService(db)
            # Store generator for cleanup
            cache_service._db_gen = db_gen
            return cache_service
        except Exception as e:
            logger.warning(f"Failed to initialize cache service: {e}. Continuing without cache.")
            return None

    def _close_cache_service(self, cache_service: Optional[InferenceCacheService]) -> None:
        """Clean up cache service database session"""
        if cache_service and hasattr(cache_service, '_db_gen'):
            try:
                next(cache_service._db_gen)
            except StopIteration:
                pass
```

Then replace both initialization blocks with cache_service = self._get_cache_service() and add cleanup calls before returning.

Also applies to: 783-791
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
.gitignore (1 hunks)
app/alembic/versions/20250923_add_inference_cache_table.py (1 hunks)
app/alembic/versions/20250928_simple_global_cache.py (1 hunks)
app/core/config_provider.py (2 hunks)
app/modules/media/media_controller.py (1 hunks)
app/modules/media/media_router.py (2 hunks)
app/modules/media/media_service.py (1 hunks)
app/modules/parsing/knowledge_graph/inference_schema.py (1 hunks)
app/modules/parsing/knowledge_graph/inference_service.py (6 hunks)
app/modules/parsing/models/inference_cache_model.py (1 hunks)
app/modules/parsing/services/cache_cleanup_service.py (1 hunks)
app/modules/parsing/services/inference_cache_service.py (1 hunks)
app/modules/parsing/tasks/cache_cleanup_tasks.py (1 hunks)
app/modules/parsing/utils/cache_diagnostics.py (1 hunks)
app/modules/parsing/utils/content_hash.py (1 hunks)
deployment/stage/convo-server/convo-api-supervisord.conf (1 hunks)
thoughts/shared/plans/simplified-global-cache-implementation.md (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (8)
app/alembic/versions/20250928_simple_global_cache.py (1)
app/alembic/versions/20250923_add_inference_cache_table.py (2)
upgrade(20-63)
downgrade(66-73)
app/modules/parsing/services/cache_cleanup_service.py (1)
app/modules/parsing/models/inference_cache_model.py (1)
InferenceCache(16-35)
app/modules/parsing/models/inference_cache_model.py (1)
app/core/base_model.py (1)
Base(9-16)
app/modules/parsing/knowledge_graph/inference_service.py (6)
app/core/database.py (1)
get_db(29-34)
app/modules/parsing/knowledge_graph/inference_schema.py (3)
DocstringRequest(6-9)
DocstringResponse(18-19)
DocstringNode(12-15)
app/modules/parsing/services/inference_cache_service.py (2)
get_cached_inference(14-44)
store_inference(46-122)
app/modules/parsing/utils/content_hash.py (2)
generate_content_hash(6-26)
is_content_cacheable(29-49)
app/modules/parsing/utils/cache_diagnostics.py (2)
analyze_cache_misses(10-104)
log_diagnostics_summary(107-144)
app/modules/projects/projects_service.py (1)
get_project_from_db_by_id_sync(220-233)
app/modules/parsing/utils/cache_diagnostics.py (2)
app/modules/parsing/models/inference_cache_model.py (1)
InferenceCache(16-35)
app/modules/parsing/utils/content_hash.py (1)
generate_content_hash(6-26)
app/modules/parsing/services/inference_cache_service.py (1)
app/modules/parsing/models/inference_cache_model.py (1)
InferenceCache(16-35)
app/alembic/versions/20250923_add_inference_cache_table.py (1)
app/alembic/versions/20250928_simple_global_cache.py (2)
upgrade(17-24)
downgrade(27-36)
app/modules/parsing/tasks/cache_cleanup_tasks.py (2)
app/core/database.py (1)
get_db(29-34)
app/modules/parsing/services/cache_cleanup_service.py (4)
CacheCleanupService(11-76)
cleanup_expired_entries(17-29)
cleanup_least_accessed(31-57)
get_cleanup_stats(59-76)
🪛 markdownlint-cli2 (0.18.1)
thoughts/shared/plans/simplified-global-cache-implementation.md
23-23: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.1)
app/modules/parsing/knowledge_graph/inference_service.py
222-222: Consider [*current_chunk_lines, line] instead of concatenation
Replace with [*current_chunk_lines, line]
(RUF005)
240-240: Local variable current_tokens is assigned to but never used
Remove assignment to unused variable current_tokens
(F841)
343-343: Unused method argument: project_id
(ARG002)
360-360: Do not catch blind exception: Exception
(BLE001)
558-558: Do not catch blind exception: Exception
(BLE001)
786-786: Do not catch blind exception: Exception
(BLE001)
836-836: Do not catch blind exception: Exception
(BLE001)
850-850: Consider moving this statement to an else block
(TRY300)
852-852: Do not catch blind exception: Exception
(BLE001)
853-853: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
app/modules/parsing/utils/cache_diagnostics.py
89-89: Do not catch blind exception: Exception
(BLE001)
90-90: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
app/modules/parsing/services/inference_cache_service.py
15-15: Unused method argument: project_id
(ARG002)
120-120: Use raise without specifying exception name
Remove exception name
(TRY201)
app/modules/parsing/tasks/cache_cleanup_tasks.py
20-20: Consider moving this statement to an else block
(TRY300)
22-22: Do not catch blind exception: Exception
(BLE001)
23-23: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
24-24: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
41-41: Consider moving this statement to an else block
(TRY300)
43-43: Do not catch blind exception: Exception
(BLE001)
44-44: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
45-45: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
58-58: Consider moving this statement to an else block
(TRY300)
60-60: Do not catch blind exception: Exception
(BLE001)
61-61: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
62-62: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
🔇 Additional comments (24)
deployment/stage/convo-server/convo-api-supervisord.conf (1)
1-13: LGTM! Cosmetic change only. The only change in this file is the addition of a trailing newline, which has no functional impact on the supervisord configuration.
app/core/config_provider.py (2)
20-22: LGTM: Formatting improvement. The multiline formatting improves readability without changing behavior.
123-130: LGTM: Good formatting with trailing comma. The multiline formatting improves readability, and the trailing comma follows Python best practices for easier future modifications.
app/modules/media/media_service.py (1)
60-62: LGTM: Formatting improvement for readability. The multi-line formatting of the logger message is a minor readability improvement with no functional impact.
app/modules/media/media_controller.py (1)
39-40: LGTM: Consistent formatting with trailing comma. Adding the trailing comma follows Python best practices for multi-line dictionaries and reduces diff noise when adding future fields.
app/modules/media/media_router.py (1)
41-42: LGTM: Consistent formatting across multimodal error responses. The trailing commas in both HTTPException detail dictionaries follow Python best practices and maintain consistency with the formatting changes in media_controller.py.

Also applies to: 188-189
.gitignore (1)
74-74: Confirm no .thoughts/ directory exists and intent.

No hidden .thoughts/ directories were detected. Please verify that renaming the ignore rule from .thoughts/ to thoughts/ is intentional and won't expose previously ignored content.

app/modules/parsing/utils/content_hash.py (1)

6-26: LGTM: Hash generation is appropriate for cache keys.

The whitespace normalization (collapsing all whitespace to single spaces) is a reasonable trade-off for content-based caching, as it focuses on semantic equivalence rather than formatting. The optional node_type prefix provides useful differentiation for identical code in different contexts.
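As a rough sketch of the hashing approach described here (illustrative only; the function name, normalization details, and hash choice are assumptions, not the PR's exact code):

```python
import hashlib
import re

def content_hash_sketch(node_text: str, node_type: str | None = None) -> str:
    # Collapse all whitespace so formatting-only differences produce the same key
    normalized = re.sub(r"\s+", " ", node_text.strip())
    # Optional node_type prefix keeps identical text in different contexts distinct
    payload = f"{node_type}:{normalized}" if node_type else normalized
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```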
app/modules/parsing/knowledge_graph/inference_schema.py (1)
1-1: LGTM: Backward-compatible metadata field addition.

The new optional metadata field is properly typed and defaults to None, ensuring backward compatibility. This enables metadata propagation for the cache layer introduced in this PR.

Also applies to: 9-9
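A minimal sketch of the kind of schema change being described (field names other than metadata are assumptions):

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel

class DocstringRequest(BaseModel):
    node_id: str  # assumed existing field
    text: str     # assumed existing field
    # New optional field; defaulting to None keeps existing callers working unchanged
    metadata: Optional[Dict[str, Any]] = None
```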
app/alembic/versions/20250928_simple_global_cache.py (1)
17-21: Please run these SQL queries against your PostgreSQL database and share the outputs to confirm whether the inference_cache_project_id_fkey constraint exists:

```sql
SELECT conname FROM pg_constraint
WHERE conname = 'inference_cache_project_id_fkey'
  AND conrelid = 'inference_cache'::regclass;

SELECT conname, contype FROM pg_constraint
WHERE conrelid = 'inference_cache'::regclass
  AND contype = 'f';
```

app/alembic/versions/20250923_add_inference_cache_table.py (1)
48-48: Note: FK constraint is removed in subsequent migration.

The foreign key constraint with CASCADE delete on project_id is removed in migration 20250928_simple_global_cache.py to enable global cache independence. This temporary FK creates a brief inconsistency where the cache would be deleted with projects, which contradicts the global cache design. While the migration chain is correct, consider adding a comment here referencing the subsequent removal for maintainability.
app/modules/parsing/services/inference_cache_service.py (3)
14-44: LGTM! Global cache lookup correctly ignores project_id.

The implementation correctly performs lookups using only content_hash, with project_id kept for API compatibility as documented. Access tracking updates are appropriate.

Note: The Ruff ARG002 warning about unused project_id is a false positive—the parameter is intentionally kept for API compatibility during the migration to global cache.
37-37: Use server-side timestamp for consistency.

Using func.now() for last_accessed generates a SQL NOW() call, which is correct. However, ensure this matches the Column definition's server_default=func.now() behavior for consistency. This is acceptable as-is, but be aware that func.now() is evaluated server-side during commit.

124-142: Inconsistent project_id filtering in stats contradicts global cache design.

The get_cache_stats method filters by project_id (lines 128-129), which contradicts the global cache architecture where project_id is metadata only. This filtering suggests project-scoped analytics, but the cache itself is global. While this may be intentional for analytics purposes, it creates confusion about whether the cache is truly global or project-scoped.

Clarify the intended use of project_id filtering in stats:
- If stats should be global-only, remove the filtering
- If project-scoped analytics are desired, document this as a metadata-only use case

Based on the plan document (thoughts/shared/plans/simplified-global-cache-implementation.md), project_id is "metadata only" for tracing, suggesting the filter should be removed or clearly documented as analytics-only.
app/modules/parsing/models/inference_cache_model.py (1)
16-36: LGTM! Model schema aligns with migration.

The ORM model correctly defines all columns, constraints, and indexes matching the Alembic migration. The explicit __tablename__ appropriately overrides the base class's auto-generated name.

thoughts/shared/plans/simplified-global-cache-implementation.md (1)

1-376: Excellent implementation plan with clear phases and verification criteria.

The plan document provides a well-structured migration path from the broken project-scoped cache to a simplified global cache. The three-phase approach (fix logic → remove FK → simplify integration) minimizes risk while delivering immediate value.
Key strengths:
- Clear problem statement with root cause analysis
- Incremental phases with automated and manual verification criteria
- Comprehensive test strategy covering cross-project sharing and persistence
- Performance considerations and rollback strategy
The embedded code examples match the actual implementation, demonstrating that the plan was followed.
app/modules/parsing/utils/cache_diagnostics.py (2)
10-104: Well-designed diagnostic utility with defensive programming.

The analyze_cache_misses function provides comprehensive analysis of cache effectiveness with appropriate defensive checks for malformed data. The categorization of issues (missing types, unresolved references, cache misses) with examples is helpful for debugging.

107-144: LGTM! Clear and actionable diagnostic output.

The summary logging provides well-formatted diagnostic information with appropriate categorization and example nodes for each issue type. The formatting with separators and percentage calculations makes it easy to identify cache problems.
app/modules/parsing/knowledge_graph/inference_service.py (6)
5-5: LGTM! Imports support the new caching infrastructure.

The additional imports for Optional, get_db, InferenceCacheService, and content hashing utilities are appropriate for the cache-aware inference processing being introduced.

Also applies to: 13-13, 21-25

83-91: LGTM! Robust input validation.

The added None and type checking prevents potential crashes when node text is missing or incorrectly typed, with appropriate warnings to surface data quality issues.

338-343: Unused parameter is acceptable for API consistency.

The project_id parameter is currently unused within batch_nodes itself (the cache service doesn't use it for lookups per the implementation). However, this is acceptable as it maintains API consistency and supports future enhancements where project-specific batching logic may be needed.

648-653: LGTM! Safety check prevents empty batches.

The added conditional ensures that only non-empty batches are appended, preventing potential issues downstream.
769-779: LGTM! Effective cache-first processing strategy.

The implementation correctly separates cached nodes from those requiring LLM processing, processing cached nodes first via update_neo4j_with_cached_inference before batching the remainder. This optimizes throughput by avoiding unnecessary LLM calls.
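A condensed sketch of the cache-first flow this comment describes (function and argument names are assumptions; the real methods live in inference_service.py):

```python
# Sketch only: illustrates the split between cache hits and LLM-bound nodes.
async def run_inference_cache_first(service, cache, nodes):
    cached, pending = [], []
    for node in nodes:
        key = generate_content_hash(node["text"], node.get("type"))
        hit = cache.get_cached_inference(key)
        if hit:
            node["cached_inference"] = hit.inference_data
            cached.append(node)
        else:
            pending.append(node)

    # Cache hits are written straight to Neo4j, skipping the LLM entirely
    for node in cached:
        await service.update_neo4j_with_cached_inference(node)

    # Only the misses continue on to LLM batching
    return pending
```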
808-840: LGTM! Robust cache storage with graceful failure handling.

The cache storage logic correctly extracts inference data and stores it with appropriate metadata (content_hash, node_type, tags, embedding). The broad exception catch at line 836 is appropriate here - cache write failures should not break the inference pipeline.
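For illustration, the "graceful failure" pattern amounts to a best-effort write around store_inference (a sketch; argument names follow the service signature shown later in this review, and the helper itself is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def store_inference_best_effort(cache_service, content_hash, doc, node_type, embedding):
    """Best-effort cache write: a failure here must not fail the inference pipeline."""
    try:
        cache_service.store_inference(
            content_hash=content_hash,
            inference_data={"docstring": doc.docstring, "tags": doc.tags},
            node_type=node_type,
            embedding_vector=embedding,
            tags=doc.tags,
        )
    except Exception:
        logger.exception("Cache write failed; continuing without caching this node")
```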
```python
        try:
            db = next(get_db())
            cache_service = InferenceCacheService(db)
        except Exception as e:
            logger.warning(
                f"Failed to initialize cache service for result storage: {e}"
            )
            cache_service = None
```
Fix resource leak: improperly managed database session.
Same issue as in batch_nodes - the database session from get_db() is not properly closed, causing a connection leak.
Apply the same fix pattern as suggested for batch_nodes:
```diff
         # Process LLM batches and cache results
         try:
-            db = next(get_db())
+            db_gen = get_db()
+            db = next(db_gen)
             cache_service = InferenceCacheService(db)
         except Exception as e:
             logger.warning(
                 f"Failed to initialize cache service for result storage: {e}"
             )
             cache_service = None
+            db_gen = None
```

Add cleanup before the method returns (after line 870):

```diff
+        # Clean up database session
+        if 'db_gen' in locals() and db_gen:
+            try:
+                next(db_gen)
+            except StopIteration:
+                pass
+
         return updated_docstrings
```

🧰 Tools
🪛 Ruff (0.13.1)
786-786: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In app/modules/parsing/knowledge_graph/inference_service.py around lines 783 to
791, the code calls next(get_db()) to obtain a DB session but never closes the
generator, leaking connections; change to capture the generator (db_gen =
get_db()) then db = next(db_gen) and, before the method returns (add the cleanup
just prior to the return around line 870), add the cleanup block that checks if
db_gen is in locals() and calls next(db_gen) inside a try/except StopIteration
to close the generator; keep existing exception handling for initialization but
ensure the cleanup runs whether cache_service was created or set to None.
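As an alternative to manually advancing the generator, a yield-based provider like get_db() can also be closed via contextlib.closing — a sketch, assuming get_db() yields a session and releases it in its finally block:

```python
from contextlib import closing

# Sketch: closing() calls db_gen.close() on exit, which runs the generator's
# cleanup (its finally block), so the session is released even on errors.
with closing(get_db()) as db_gen:
    db = next(db_gen)
    cache_service = InferenceCacheService(db)
    # ... use cache_service for the duration of the block ...
```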
```python
    async def update_neo4j_with_cached_inference(self, node: Dict[str, Any]) -> None:
        """Update Neo4j with cached inference data for a single node"""
        cached_inference = node.get("cached_inference", {})
        if not cached_inference:
            return

        # Extract inference data
        docstring = cached_inference.get("docstring", "")
        tags = cached_inference.get("tags", [])
        embedding = self.generate_embedding(docstring)

        with self.driver.session() as session:
            project = self.project_manager.get_project_from_db_by_id_sync(
                node.get("project_id", "")
            )
            repo_path = project.get("repo_path") if project else None
            is_local_repo = True if repo_path else False

            session.run(
                """
                MATCH (n:NODE {repoId: $repo_id, node_id: $node_id})
                SET n.docstring = $docstring,
                    n.embedding = $embedding,
                    n.tags = $tags
                """
                + ("" if is_local_repo else ", n.text = null, n.signature = null"),
                repo_id=node.get("project_id", ""),
                node_id=node["node_id"],
                docstring=docstring,
                embedding=embedding,
                tags=tags,
            )

        logger.debug(f"Updated Neo4j with cached inference for node {node['node_id']}")
```
Verify project_id type and simplify boolean assignment.

Several issues to address:

- Potential type mismatch: Line 992 retrieves project_id as a string (with default ""), but get_project_from_db_by_id_sync expects an int parameter according to the method signature. This could cause a runtime error.
- Unnecessary async: This method has no await calls - the Neo4j session operations and project lookup are all synchronous. Consider making it a regular def method unless async is required for future compatibility.
- Simplify boolean assignment: Line 996 can be simplified.

Apply this diff to fix the type issue and simplify:
```diff
-        with self.driver.session() as session:
-            project = self.project_manager.get_project_from_db_by_id_sync(
-                node.get("project_id", "")
-            )
-            repo_path = project.get("repo_path") if project else None
-            is_local_repo = True if repo_path else False
+        with self.driver.session() as session:
+            # Convert project_id to int for DB lookup
+            project_id = node.get("project_id")
+            if project_id:
+                try:
+                    project_id = int(project_id) if isinstance(project_id, str) else project_id
+                except (ValueError, TypeError):
+                    logger.warning(f"Invalid project_id for node {node['node_id']}: {project_id}")
+                    project_id = None
+
+            project = self.project_manager.get_project_from_db_by_id_sync(project_id) if project_id else None
+            repo_path = project.get("repo_path") if project else None
+            is_local_repo = bool(repo_path)
```

Additionally, consider changing the method signature to def instead of async def since there are no async operations.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Suggested change:

```python
    async def update_neo4j_with_cached_inference(self, node: Dict[str, Any]) -> None:
        """Update Neo4j with cached inference data for a single node"""
        cached_inference = node.get("cached_inference", {})
        if not cached_inference:
            return

        # Extract inference data
        docstring = cached_inference.get("docstring", "")
        tags = cached_inference.get("tags", [])
        embedding = self.generate_embedding(docstring)

        with self.driver.session() as session:
            # Convert project_id to int for DB lookup
            project_id = node.get("project_id")
            if project_id:
                try:
                    project_id = int(project_id) if isinstance(project_id, str) else project_id
                except (ValueError, TypeError):
                    logger.warning(f"Invalid project_id for node {node['node_id']}: {project_id}")
                    project_id = None

            project = (
                self.project_manager.get_project_from_db_by_id_sync(project_id)
                if project_id
                else None
            )
            repo_path = project.get("repo_path") if project else None
            is_local_repo = bool(repo_path)

            session.run(
                """
                MATCH (n:NODE {repoId: $repo_id, node_id: $node_id})
                SET n.docstring = $docstring,
                    n.embedding = $embedding,
                    n.tags = $tags
                """
                + ("" if is_local_repo else ", n.text = null, n.signature = null"),
                repo_id=node.get("project_id", ""),
                node_id=node["node_id"],
                docstring=docstring,
                embedding=embedding,
                tags=tags,
            )

        logger.debug(f"Updated Neo4j with cached inference for node {node['node_id']}")
```
```python
        # Check if entry already exists
        existing_entry = (
            self.db.query(InferenceCache)
            .filter(InferenceCache.content_hash == content_hash)
            .first()
        )

        if existing_entry:
            # Entry already exists - update access tracking and return
            existing_entry.access_count += 1
            existing_entry.last_accessed = func.now()
            self.db.commit()
            logger.debug(
                f"Cache entry already exists for content_hash: {content_hash[:12]}... (updated access tracking)"
            )
            return existing_entry
```
Race condition: check-then-insert is not atomic.
The check for existing entry (lines 73-77) followed by insert (lines 90-106) creates a race window where concurrent processes can both pass the existence check and attempt to insert the same content_hash, leading to unique constraint violations.
While the exception handler (lines 107-122) recovers from this race condition, it's inefficient to rely on exception handling for expected concurrent behavior.
Consider using an atomic upsert with PostgreSQL's ON CONFLICT:
```python
from sqlalchemy.dialects.postgresql import insert

def store_inference(self, content_hash: str, inference_data: Dict[str, Any], ...) -> InferenceCache:
    stmt = insert(InferenceCache).values(
        content_hash=content_hash,
        project_id=project_id,
        node_type=node_type,
        content_length=content_length,
        inference_data=inference_data,
        embedding_vector=embedding_vector,
        tags=tags,
    ).on_conflict_do_update(
        index_elements=['content_hash'],
        set_={
            'access_count': InferenceCache.access_count + 1,
            'last_accessed': func.now()
        }
    ).returning(InferenceCache)

    result = self.db.execute(stmt)
    cache_entry = result.scalar_one()
    self.db.commit()
    return cache_entry
```

This eliminates the race condition and is more efficient.
```python
        except Exception as e:
            logger.error(f"Error analyzing node {node_id}: {e}")
```
🛠️ Refactor suggestion | 🟠 Major
Use logging.exception to include traceback.
When logging errors within an exception handler, logging.exception automatically includes the stack trace, which is more helpful for debugging than logging.error.
```diff
         except Exception as e:
-            logger.error(f"Error analyzing node {node_id}: {e}")
+            logger.exception(f"Error analyzing node {node_id}: {e}")
```

Based on coding guidelines (Ruff TRY400).
Note: The broad Exception catch is acceptable here since this is a diagnostic utility that should be resilient to individual node failures and continue processing remaining nodes.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        except Exception as e:
            logger.exception(f"Error analyzing node {node_id}: {e}")
```
🧰 Tools
🪛 Ruff (0.13.1)
89-89: Do not catch blind exception: Exception
(BLE001)
90-90: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In app/modules/parsing/utils/cache_diagnostics.py around lines 89 to 90, replace
the logger.error call inside the except block with logger.exception so the stack
trace is recorded; keep the same contextual message (e.g., "Error analyzing node
{node_id}:") and pass the exception implicitly by calling logger.exception to
include the traceback while preserving the broad Exception catch behavior.
@coderabbitai generate docstring
Tip: For best results, initiate chat on the files or code changes.

Task: Generate docstrings for PR #452

I'll generate comprehensive docstrings for all public classes, methods, and functions across the new inference caching system to improve coverage from 58.14% to meet the 80% threshold. Files requiring docstrings:
@dhirenmathur I'll generate comprehensive docstrings across all 8 files in the inference caching system. The docstrings will follow Google-style format and cover:
This will bring your docstring coverage from 58.14% to well above the 80% threshold. The docstrings will include detailed descriptions, parameter specifications, return values, and any raised exceptions.

⏳ Preparing changes...
Caution: The CodeRabbit agent's plans did not produce any file changes.
…nection_handling, and revert_script for database migration among other things.
…ached embedding reuse
- Preserved robust database connection cleanup in inference_service.py
- Maintained cached embedding vector reuse functionality
- Kept robust TTL configuration with proper error handling
- Preserved proper database connection cleanup in cache tasks
- Maintained correct node counting logic in cache diagnostics
* chore(submodule): refresh potpie-ui to latest origin/main
* chore: refresh potpie-ui submodule and update .gitignore
* chore: Auto-fix pre-commit issues
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>



Summary by CodeRabbit
New Features
Performance Improvements
Background Tasks
Database