Skip to content

Conversation

@JSv4
Copy link
Collaborator

@JSv4 JSv4 commented Dec 27, 2025

This proposal designs an elegant, performant MCP (Model Context Protocol)
interface for OpenContracts that provides read-only access to public resources.

Key features:

  • One-corpus-at-a-time model for focused exploration
  • Anonymous user permission model (public resources only)
  • 4 MCP resources: corpus, document, annotation, thread
  • 7 MCP tools for discovery and retrieval
  • Performance optimized using AnnotationQueryOptimizer
  • Respects consolidated permissioning guide
  • Comprehensive implementation examples and testing strategy

Implementation includes:

  • Resource URI scheme (corpus://, document://, annotation://, thread://)
  • Tool definitions (list_public_corpuses, list_documents, get_document_text,
    list_annotations, search_corpus, list_threads, get_thread_messages)
  • Security considerations (rate limiting, input validation)
  • Performance optimizations (caching, pagination, query optimization)
  • Complete code examples for server.py, resources.py, tools.py
  • Testing strategy with unit and integration tests
  • Deployment instructions

Addresses: #387

This proposal designs an elegant, performant MCP (Model Context Protocol)
interface for OpenContracts that provides read-only access to public resources.

Key features:
- One-corpus-at-a-time model for focused exploration
- Anonymous user permission model (public resources only)
- 4 MCP resources: corpus, document, annotation, thread
- 7 MCP tools for discovery and retrieval
- Performance optimized using AnnotationQueryOptimizer
- Respects consolidated permissioning guide
- Comprehensive implementation examples and testing strategy

Implementation includes:
- Resource URI scheme (corpus://, document://, annotation://, thread://)
- Tool definitions (list_public_corpuses, list_documents, get_document_text,
  list_annotations, search_corpus, list_threads, get_thread_messages)
- Security considerations (rate limiting, input validation)
- Performance optimizations (caching, pagination, query optimization)
- Complete code examples for server.py, resources.py, tools.py
- Testing strategy with unit and integration tests
- Deployment instructions

Addresses: #387
@claude
Copy link

claude bot commented Dec 27, 2025

PR Review: MCP Interface Proposal for OpenContracts

Overall Assessment

Status: APPROVE with Minor Suggestions

This is an excellent, well-researched proposal that demonstrates deep understanding of OpenContracts' architecture and permissioning model. The document is comprehensive, production-ready, and follows established patterns throughout the codebase.

Strengths

1. Architecture Alignment

  • Correctly uses AnonymousUser() for public-only access model
  • Properly leverages visible_to_user() manager methods throughout
  • Correctly applies AnnotationQueryOptimizer.get_document_annotations() for performance
  • Follows the consolidated permissioning guide's rules for annotation inheritance

2. Security Design

  • Read-only model prevents data corruption
  • Anonymous user constraint automatically enforces public-only access
  • Input validation with slug pattern matching
  • Rate limiting strategy included
  • Proper IDOR prevention mindset (same error messages)

3. Performance Considerations

  • Smart use of select_related() and prefetch_related()
  • Pagination patterns with limit/offset
  • Caching strategy for expensive operations
  • Query optimizer usage to prevent N+1 queries
  • Acknowledged the need for search_by_embedding implementation

4. Documentation Quality

  • Clear examples with JSON responses
  • Complete code implementation samples
  • Testing strategy with unit and integration tests
  • Deployment instructions
  • Future enhancement roadmap

Issues & Suggestions

Critical Issues: None

Major Concerns

1. Missing Model Method Verification

The proposal references several methods that may not exist in the current codebase:

Line 88: corpus.document_count() - Verify this method exists on Corpus model

"document_count": corpus.document_count(),  # Does this method exist?

Suggested fix: Use corpus.documents.count() or verify the method exists

Lines 557, 567: corpus.embed_text() and search_by_embedding() - These appear to be proposed/planned methods

embedder_path, query_vector = corpus.embed_text(query)  # Verify method exists
doc_results = (Document.objects...search_by_embedding(...))  # Verify method exists

Suggested action: Add a note that search_corpus implementation depends on embedding infrastructure that may need additional development, or confirm these methods exist.

2. Async/Sync Boundary Issues

The proposal marks resource functions as async def but uses synchronous Django ORM calls inside:

Lines 80-92, 117-146, 177-202: Resource functions marked async but contain blocking ORM calls

async def get_corpus_resource(corpus_slug: str) -> str:  # async keyword
    anonymous = AnonymousUser()
    corpus = Corpus.objects.visible_to_user(anonymous).get(slug=corpus_slug)  # blocking call!

Issue: Mixing async def with synchronous ORM will block the event loop.

Suggested fix: Either:

  1. Use sync_to_async wrapper for Django ORM calls
  2. Remove async keyword and keep functions synchronous (simpler, recommended for Django)
  3. Use Django 4.1+ async ORM interface (if available)

Example fix:

from asgiref.sync import sync_to_async

async def get_corpus_resource(corpus_slug: str) -> str:
    anonymous = AnonymousUser()
    corpus = await sync_to_async(
        lambda: Corpus.objects.visible_to_user(anonymous).get(slug=corpus_slug)
    )()
    # ... rest of implementation

Minor Issues

3. Incomplete Error Handling

Lines 129-130, 259-260: Uses raise NotFoundError() but class is not imported/defined

if not document:
    raise NotFoundError()  # Not defined in imports

Suggested fix: Import or define the exception, or use Django's Http404 / custom exception

4. Missing Helper Function Implementations

Multiple formatter functions are referenced but not implemented:

  • format_corpus_summary() (line 335)
  • format_document_summary() (line 393)
  • format_annotation() (line 506)
  • format_message_with_replies() (line 284)
  • format_thread_summary() (line 650)

Suggestion: Add a section showing example implementations of these formatters, or note they need to be implemented.

5. Thread Message Recursion Potential

Lines 277-284: build_threaded_messages() references format_message_with_replies() which would need to recursively load replies

def build_threaded_messages(thread: Conversation, user) -> list:
    messages = (ChatMessage.objects
                .visible_to_user(user)
                .filter(conversation=thread, parent_message__isnull=True)
                .order_by('created_at'))
    return [format_message_with_replies(msg, user) for msg in messages]  # Recursive?

Concern: Could cause N+1 queries if not using prefetch_related('replies')

Suggested fix: Show prefetching pattern:

messages = (ChatMessage.objects
            .visible_to_user(user)
            .filter(conversation=thread, parent_message__isnull=True)
            .prefetch_related('replies__replies')  # Prefetch nested replies
            .order_by('created_at'))

6. Corpus Methods Referenced

Line 557: corpus.embed_text(query) - Verify this exists or note it's proposed
Line 560: text_search_fallback() - Not implemented, needs definition

7. Testing File Truncation

The test example at line 1020-1044 appears to be cut off mid-implementation. Complete the example or note it's abbreviated.

8. Configuration Settings Integration

The Django settings example (lines 1002-1014) should note where to import env from:

# config/settings/base.py
import environ  # Add this

env = environ.Env()

9. URI Parsing Fragility

Lines 767-786: URI parsing using string split("/") is fragile

parts = uri.replace("document://", "").split("/")
return await get_document_resource(parts[0], parts[1])  # What if slug contains encoded slash?

Suggested improvement: Use regex or proper URI parsing:

import re

def parse_document_uri(uri: str) -> tuple[str, str]:
    match = re.match(r'^document://([^/]+)/([^/]+)$', uri)
    if not match:
        raise ValueError(f"Invalid document URI: {uri}")
    return match.group(1), match.group(2)

Documentation/Clarity Improvements

10. CHANGELOG Update Needed

Per CLAUDE.md guidelines, this significant feature proposal should have a CHANGELOG entry:

## [Unreleased]

### Added
- MCP (Model Context Protocol) interface proposal for read-only public resource access
  - 4 resource types: corpus, document, annotation, thread
  - 7 tools for discovery and retrieval
  - Anonymous user permission model
  - Location: `docs/mcp/mcp_interface_proposal.md`

11. ConversationTypeChoices Reference

Line 632: Uses ConversationTypeChoices.THREAD - should note import location:

from opencontractserver.types.enums import ConversationTypeChoices

Code Quality Assessment

Best Practices ✅

  • Follows Django ORM patterns correctly
  • Uses proper permission checking throughout
  • Implements pagination consistently
  • Includes comprehensive docstrings
  • Shows both good and bad examples in documentation

Testing Coverage ✅

  • Unit test examples for resources
  • Integration test workflow examples
  • Permission boundary testing shown
  • Good separation of public/private access tests

Performance ✅

  • Query optimization awareness demonstrated
  • Caching strategy appropriate
  • Prefetching patterns shown
  • Notes about N+1 query prevention

Security Analysis

Strengths

  1. Anonymous-only model prevents privilege escalation
  2. Read-only prevents data tampering
  3. Public filter applied consistently
  4. Input validation included
  5. Rate limiting considered

Recommendations

  1. Add request logging for security monitoring
  2. Consider adding response size limits (prevent large document DoS)
  3. Document rate limit bypass for authenticated API access (future)

Recommendations

Before Merging (Documentation PR)

  1. ✅ Fix async/sync issues in code examples (lines 80-586)
  2. ✅ Add imports for NotFoundError or use Django exceptions
  3. ✅ Verify corpus.document_count() method exists or adjust
  4. ✅ Add note about search_corpus dependencies on embedding infrastructure
  5. ✅ Complete truncated test example (line 1044+)
  6. ✅ Add CHANGELOG entry per project guidelines
  7. ✅ Show formatter function implementations or stub signatures

For Implementation Phase

  1. Implement helper formatters in formatters.py
  2. Add comprehensive tests as outlined
  3. Benchmark performance with large corpuses
  4. Add monitoring/observability hooks
  5. Consider adding structured logging with request IDs

Performance Considerations

Potential Bottlenecks

  1. Full text retrieval (line 135-136): Large PDFs could cause memory issues
    • Suggestion: Add chunked reading or size limits
  2. Thread message trees: Deep nesting could cause performance issues
    • Suggestion: Limit recursion depth or use iterative approach
  3. Annotation listing: Large documents with 1000+ annotations
    • Suggestion: Default pagination is good (100), but document this limit

Alignment with Repository Guidelines

CLAUDE.md Compliance ✅

  • Follows permission model from consolidated guide
  • Uses query optimizers as specified
  • Respects anonymous user permission rules
  • Follows Django ORM best practices
  • Testing strategy aligns with backend patterns

Branch Strategy ✅

  • This is a documentation PR, appropriate for merge to development branch

Final Verdict

This is production-quality documentation that demonstrates excellent understanding of the codebase. The async/sync issues and missing imports are the only blockers for code examples. Once those are addressed, this proposal is ready for implementation.

Suggested Next Steps

  1. Address async/sync code examples
  2. Add missing imports/exceptions
  3. Verify or note methods that may not exist yet
  4. Update CHANGELOG.md
  5. Merge proposal documentation
  6. Create implementation plan issue referencing this proposal

Score: 9/10 - Excellent work! The minor issues are easily fixable and don't detract from the overall quality of the proposal.


Review generated by Claude Code using repository conventions from CLAUDE.md

@JSv4 JSv4 linked an issue Dec 27, 2025 that may be closed by this pull request
JSv4 and others added 2 commits December 28, 2025 17:11
- Convert async tool functions to sync with sync_to_async wrapper pattern
- Add URIParser class with regex-based URI validation for safety
- Add complete helper function implementations (format_corpus_summary,
  format_document_summary, format_annotation, format_thread_summary,
  format_message, format_message_with_replies, text_search_fallback)
- Fix NotFoundError references to use proper Django exceptions
- Add prefetch_related for threaded messages to avoid N+1 queries
- Add proper imports to all resource and tool functions
- Update CHANGELOG with MCP interface proposal entry

Addresses review feedback on PR #726 for issue #387
@claude
Copy link

claude bot commented Dec 28, 2025

PR Review: MCP Interface Proposal (#726)

Overall Assessment

This is an excellent, well-thought-out proposal for adding MCP (Model Context Protocol) interface to OpenContracts. The design is elegant, follows established patterns from the codebase, and demonstrates deep understanding of the permissioning model. The documentation is comprehensive and production-ready.

Recommendation: Approve with minor suggestions


Strengths

1. Architecture & Design Excellence

  • ✅ Clean separation of concerns (resources vs tools)
  • ✅ Hierarchical URI scheme that mirrors data model
  • ✅ One-corpus-at-a-time scoping prevents complexity
  • ✅ Read-only design eliminates security concerns
  • ✅ Anonymous user model correctly leverages existing permissions

2. Performance Awareness

  • ✅ Correctly uses AnnotationQueryOptimizer (lines 557-561 in proposal)
  • ✅ Proper prefetch_related for threaded messages (line 299)
  • select_related for annotation labels (line 571)
  • ✅ Pagination built into all list operations
  • ✅ Caching strategy outlined (lines 1202-1217)

3. Security & Permissioning

  • ✅ Anonymous user model verified in opencontractserver/annotations/query_optimizer.py:50-72
  • ✅ Correctly uses visible_to_user() pattern throughout
  • ✅ Regex-based URI validation prevents injection (URIParser class, lines 978-1013)
  • ✅ Rate limiting considerations included

4. Code Quality

  • ✅ Comprehensive imports in all code examples
  • ✅ Proper error handling with Django exceptions
  • ✅ Type hints using modern Python syntax (int | None)
  • ✅ Well-documented with inline comments
  • ✅ CHANGELOG properly updated

Issues & Suggestions

Critical Issues

None found - This is a proposal document, not production code, so no blocking issues.

Medium Priority Suggestions

1. Sync vs Async Pattern Consistency (Lines 337-363, 945-1150)

The proposal uses sync_to_async wrapper pattern, but the actual tool implementations are synchronous. While this works, consider documenting the threading implications more clearly.

Suggestion: Add note about Django ORM thread-safety:

# Note: Django ORM is not thread-safe across async contexts.
# Always use sync_to_async for ORM operations in async handlers.
# Each wrapped call runs in a separate thread pool executor.

2. Error Response Standardization (Throughout tool implementations)

Different error types (DoesNotExist, ValueError) are raised but there's no unified error response format for MCP clients.

Suggestion: Add error handling wrapper:

def mcp_error_handler(func):
    """Wrap tool functions to return consistent error responses."""
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except Model.DoesNotExist as e:
            return {"error": "not_found", "message": str(e)}
        except ValueError as e:
            return {"error": "invalid_input", "message": str(e)}
        except PermissionDenied as e:
            return {"error": "forbidden", "message": "Access denied"}
    return wrapper

3. Missing Model Method Documentation (Lines 372, 651)

The proposal assumes methods like corpus.document_count() and corpus.embed_text() exist.

Recommendation:

  • Verify these methods exist or document they need to be added
  • For embed_text(), check if it's corpus.embedder.embed_text() or similar

4. Message Prefetch Depth (Line 299, 879)

.prefetch_related('replies__replies')  # Prefetch 2 levels

Suggestion: Document the depth limitation more clearly and consider making it configurable:

# Prefetch depth = 2 levels (root -> reply -> nested reply)
# For deeper threads, replies at depth > 2 will trigger additional queries
# Adjust PREFETCH_DEPTH in config.py if needed
PREFETCH_DEPTH = getattr(settings, 'MCP_MESSAGE_PREFETCH_DEPTH', 2)
prefetch_string = '__'.join(['replies'] * PREFETCH_DEPTH)

Low Priority / Nice-to-Have

5. Bounding Box Format (Lines 165-169)

"bounding_box": {
  "top": 120,
  "left": 50,
  "right": 450,
  "bottom": 145
}

Question: Does this match the PAWLs format used elsewhere in OpenContracts? Consider documenting the coordinate system (pixels, points, normalized 0-1, etc.)

6. Thread Conversation Type (Lines 756-759)

conversation_type=ConversationTypeChoices.THREAD,

Suggestion: Verify ConversationTypeChoices.THREAD exists in the codebase. If not, document it needs to be added to the enum.

7. Testing Coverage (Lines 1308-1396)

Excellent test examples, but consider adding:

  • Test for rate limiting behavior
  • Test for malformed URI patterns
  • Test for performance with large result sets (1000+ documents)
  • Test for deeply nested message threads

Code Examples Review

search_corpus Implementation (Lines 635-677)

Excellent, but one potential issue:

embedder_path, query_vector = corpus.embed_text(query)

if not query_vector:
    # Fallback to text search if embeddings unavailable
    return text_search_fallback(corpus, query, limit, anonymous)

Question: Does corpus.embed_text() exist? Check the actual method signature. It might be:

# Verify this exists in opencontractserver/corpuses/models.py
if hasattr(corpus, 'embedder') and corpus.embedder:
    query_vector = corpus.embedder.embed_text(query)
else:
    return text_search_fallback(...)

format_message_with_replies (Lines 901-924)

Excellent recursion handling with depth limits and prefetch awareness. One suggestion:

if current_depth >= max_depth:
    # Stop recursion at max depth
    formatted["replies"] = []
    formatted["has_more_replies"] = message.replies.exists() if hasattr(message, 'replies') else False
    return formatted

The .exists() call triggers a query even with prefetch. Use:

formatted["has_more_replies"] = len(list(message.replies.all())) > 0 if hasattr(message, 'replies') else False

Or better, just check the prefetched cache:

formatted["has_more_replies"] = bool(getattr(message, '_prefetched_objects_cache', {}).get('replies'))

Alignment with CLAUDE.md Guidelines

✅ Follows Guidelines

  • Respects consolidated permissioning guide (docs/permissioning/consolidated_permissioning_guide.md)
  • Uses visible_to_user() pattern (not deprecated resolve_oc_model_queryset)
  • Proper use of AnnotationQueryOptimizer
  • Avoids N+1 queries with prefetch/select_related
  • Anonymous user model correctly implemented
  • CHANGELOG properly updated

✅ Code Quality Standards

  • No magic numbers (uses defaults in function signatures)
  • DRY principle (format functions reused)
  • Single responsibility (each function has one purpose)
  • Constants could go in opencontractserver/constants/ if implemented

⚠️ Minor Deviations

  1. No tests provided - Only test examples in proposal. Per CLAUDE.md, actual tests should be added in opencontractserver/tests/
  2. Commit message credits Claude - Per CLAUDE.md line 3: "Never credit Claude or Claude Code in commit messages, PR messages, etc."

Security Review

✅ Excellent Security Posture

  1. Input validation: URIParser uses regex patterns for slug validation
  2. Permission model: Correctly uses anonymous user with is_public=True filtering
  3. SQL injection: No raw SQL, all Django ORM
  4. XSS: JSON serialization handles escaping
  5. IDOR prevention: Permission checks prevent access enumeration
  6. Rate limiting: Included in proposal

Recommendation

Add CORS configuration guidance for MCP server deployment:

# If serving via HTTP (not stdio)
CORS_ALLOWED_ORIGINS = env.list('MCP_CORS_ORIGINS', default=[])

Performance Considerations

✅ Excellent Performance Design

  • Query optimizer usage eliminates N+1 problems
  • Pagination prevents large memory usage
  • Caching strategy for frequently accessed data
  • Prefetch depth limits prevent recursive query explosions

Suggestion

Add query monitoring hooks for production:

# opencontractserver/mcp/middleware.py
class MCPQueryMonitor:
    """Log slow queries for MCP endpoints"""
    def __init__(self, threshold_ms=100):
        self.threshold = threshold_ms
    
    def log_slow_query(self, tool_name, duration_ms, query_count):
        if duration_ms > self.threshold:
            logger.warning(f"Slow MCP query: {tool_name} took {duration_ms}ms with {query_count} queries")

Implementation Recommendations

Phase 1: Core Implementation (This PR)

  1. Create opencontractserver/mcp/ directory structure
  2. Implement resource handlers (resources.py)
  3. Implement tool functions (tools.py)
  4. Add URI parser with regex validation
  5. Write unit tests for all functions

Phase 2: Server & Integration

  1. Implement MCP server entry point (server.py)
  2. Add Django settings configuration
  3. Add deployment documentation
  4. Integration tests with MCP client
  5. Performance testing with realistic data

Phase 3: Production Hardening

  1. Rate limiting implementation
  2. Caching layer with Redis
  3. Query monitoring and logging
  4. Load testing
  5. Security audit

Missing Documentation

Consider adding these sections to the proposal:

  1. Migration Path: How to enable MCP for existing deployments?
  2. Monitoring: What metrics should be tracked?
  3. Client Examples: Example MCP client code for common workflows
  4. Troubleshooting: Common issues and solutions
  5. API Versioning: How to handle breaking changes in the future?

Minor Typos/Issues in CHANGELOG

CHANGELOG.md:6-19 - Perfect format! Excellent use of the "Keep a Changelog" structure.

One suggestion: Add line about where implementation will live:

- **MCP server implementation**: Located in `opencontractserver/mcp/` (to be implemented)

Final Recommendations

Before Merging (If this were production code)

  1. ✅ Verify all model methods exist (corpus.embed_text(), corpus.document_count())
  2. ✅ Verify ConversationTypeChoices.THREAD enum value exists
  3. ✅ Add actual unit tests in opencontractserver/tests/mcp/
  4. ✅ Add integration tests
  5. ✅ Update commit messages to remove Claude attribution

For This Proposal PR

Ship it! This is excellent design documentation. The actual implementation can be tracked in follow-up issues/PRs.


Summary

This is outstanding work that demonstrates:

  • Deep understanding of OpenContracts architecture
  • Excellent design principles
  • Performance awareness
  • Security-first approach
  • Production-ready thinking

The proposal is comprehensive enough to serve as implementation specification. Minor suggestions above are truly minor - the core design is solid.

Score: 9.5/10 (Would be 10/10 with actual implementation + tests)

Great job! This will make OpenContracts knowledge bases accessible to AI assistants in a safe, performant way.

@codecov
Copy link

codecov bot commented Dec 28, 2025

Codecov Report

❌ Patch coverage is 93.74302% with 56 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/mcp/server.py 57.40% 46 Missing ⚠️
opencontractserver/mcp/tools.py 92.17% 9 Missing ⚠️
opencontractserver/mcp/resources.py 98.30% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Add Model Context Protocol (MCP) server module that provides AI assistants
with structured access to public OpenContracts data:

- Resources: corpus, document, annotation, thread (via custom URI schemes)
- Tools: list_public_corpuses, list_documents, get_document_text,
  list_annotations, search_corpus, list_threads, get_thread_messages
- Anonymous user model ensures only public resources are accessible
- Uses existing visible_to_user() and AnnotationQueryOptimizer patterns
- Rate limiting via Django cache
- Comprehensive unit tests (19 tests)

Closes #726
@claude
Copy link

claude bot commented Dec 29, 2025

Code Review: MCP Interface Implementation

This is an excellent, well-architected proposal for adding MCP support to OpenContracts.

Strengths

  1. Exceptional Documentation - 1400+ line proposal with clear architecture, implementation examples, security considerations
  2. Solid Permission Model - Correctly uses AnonymousUser, proper slug validation, follows .visible_to_user() pattern
  3. Performance Optimized - Uses AnnotationQueryOptimizer, prefetch_related for threads, enforces pagination limits
  4. Clean Architecture - Proper separation: server.py, tools.py, resources.py, formatters.py, permissions.py
  5. Comprehensive Testing - 346 lines covering URI parsing, permissions, resources, tools, formatters

CRITICAL Issues

1. Rate Limiter Bug (permissions.py:75-89)

  • Current cache.set() resets TTL on every call, making rate limiting ineffective
  • Fix: Use cache.incr() for atomic increment

2. Missing Dependency Version (requirements/base.txt:81)

  • Comment references wrong GitHub URL
  • Should point to modelcontextprotocol/python-sdk and pin version

HIGH Priority Issues

3. Exception Handling Too Broad (tools.py:131-135, 228-251)

  • Catching bare Exception hides KeyboardInterrupt
  • Should catch specific exceptions: IOError, OSError, FileNotFoundError

4. Missing Input Validation (tools.py:144-198)

  • list_annotations doesn't validate page parameter
  • Negative page numbers could cause issues

5. embed_text() Return Assumption (tools.py:228)

  • No verification that embed_text() returns tuple
  • Will crash if returns None

MEDIUM Priority Issues

  1. Document Count Inefficiency (formatters.py:19) - Executes COUNT query every time, should use queryset annotation
  2. Missing Logging - No logging for errors, rate limiting, security violations
  3. Test Coverage Gaps - Missing tests for get_document_text, search_corpus, get_thread_messages, error conditions

Security Considerations

  1. DoS via Large Files - get_document_text reads entire files into memory. Add file size limits (10MB max)

Recommendations

Before Merge (Critical)

  1. Fix rate limiter cache bug
  2. Fix dependency package/version
  3. Narrow exception handling
  4. Add input validation
  5. Add file size limits
  6. Merge changelog entries (single Unreleased header)

Follow-up (Separate PRs)

  1. Add comprehensive logging
  2. Expand test coverage to 90%+
  3. Add MCP client integration tests
  4. Document deployment instructions

Overall Assessment

Quality Score: 8.5/10

High-quality work with excellent architecture and documentation. Issues are mostly minor refinements. Permission model is sound, performance optimizations appropriate, follows project conventions.

Recommendation: Approve with minor changes - Fix critical issues (rate limiter, dependency), then merge.

Great work! The one-corpus-at-a-time model is elegant and the anonymous user permission strategy is exactly right.

- Add ASGI routing for /mcp/* endpoints in config/asgi.py
- Implement SSE transport handlers (GET /mcp/sse/, POST /mcp/messages/)
- Refactor server.py to use factory pattern for cleaner initialization
- Apply code formatting fixes from pre-commit hooks
@claude
Copy link

claude bot commented Dec 29, 2025

Code Review - PR #726: MCP Interface Implementation

This is an impressive and comprehensive implementation of an MCP (Model Context Protocol) interface for OpenContracts. The design is elegant, well-documented, and follows the project's architecture patterns. Below is my detailed feedback.


🎯 Overall Assessment

Summary: High-quality implementation with excellent documentation. The code follows OpenContracts conventions and implements a clean read-only public API. A few minor issues to address before merging.

Recommendation: Approve with minor revisions


✅ Strengths

1. Excellent Documentation

  • Comprehensive 1457-line proposal document (docs/mcp/mcp_interface_proposal.md) with clear examples
  • Well-documented code with docstrings following project standards
  • Clear URI scheme and resource hierarchy

2. Security-First Design

  • Anonymous user model ensures only public resources are accessible
  • Regex-based URI parsing prevents injection attacks (opencontractserver/mcp/server.py:40-77)
  • Slug validation with [A-Za-z0-9\-]+ pattern prevents directory traversal
  • Rate limiting implementation using Django cache

3. Performance Optimization

  • Proper use of existing visible_to_user() manager methods
  • Limit capping at 100 results (tools.py:36, 71)
  • Uses prefetch_related for threaded messages (mentioned in docs)
  • Leverages AnnotationQueryOptimizer for efficient queries

4. Clean Architecture

  • Separation of concerns: server.py, resources.py, tools.py, formatters.py, permissions.py
  • ASGI integration is non-invasive - routes /mcp/* separately (config/asgi.py:57-74)
  • Follows one-corpus-at-a-time model as designed

5. Test Coverage

  • Comprehensive test suite with 339 lines covering:
    • URI parsing edge cases
    • Permission validation
    • Anonymous user access patterns
    • Rate limiting
    • Formatters
  • Tests follow project patterns using TestCase and setUpTestData

⚠️ Issues to Address

1. Critical: MCP Server Disabled by Default

File: config/settings/base.py:988

MCP_SERVER = {
    "enabled": env.bool("MCP_SERVER_ENABLED", default=False),
    ...
}

Issue: The MCP server is disabled by default, but the ASGI routing unconditionally loads it.

Impact: If MCP_SERVER_ENABLED=False, the /mcp/* routes will still be mounted but may not function correctly.

Recommendation:

# In config/asgi.py
from django.conf import settings

if settings.MCP_SERVER.get('enabled', False):
    from opencontractserver.mcp.server import mcp_asgi_app
    http_application = create_http_router(django_application, mcp_asgi_app)
else:
    http_application = django_application

2. Bug: Rate Limiter Counter Logic

File: opencontractserver/mcp/permissions.py:84-88

def check_rate_limit(self, client_id: str) -> bool:
    key = f"mcp:ratelimit:{client_id}"
    current = cache.get(key, 0)
    if current >= self.max_requests:
        return False
    cache.set(key, current + 1, self.window_seconds)  # BUG: Doesn't preserve TTL correctly
    return True

Issue: Each call to cache.set() resets the TTL window. A client making 99 requests with 1 second between each would never hit the limit in a 60-second window.

Recommendation: Use atomic operations:

def check_rate_limit(self, client_id: str) -> bool:
    key = f"mcp:ratelimit:{client_id}"
    
    # Try to increment atomically
    try:
        current = cache.incr(key)
    except ValueError:
        # Key doesn't exist, create it
        cache.set(key, 1, self.window_seconds)
        return True
    
    return current <= self.max_requests

3. Missing: Rate Limiter Integration

File: opencontractserver/mcp/server.py:292-301

Issue: The RateLimiter class is implemented and tested, but never actually used in the server handlers.

Recommendation: Add rate limiting to the call_tool handler:

from .permissions import RateLimiter
from django.conf import settings

rate_limiter = RateLimiter(
    max_requests=settings.MCP_SERVER['rate_limit']['requests'],
    window_seconds=settings.MCP_SERVER['rate_limit']['window']
)

@mcp_server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    # Extract client ID from request context
    client_id = "anonymous"  # TODO: Extract from ASGI scope if available
    
    if not rate_limiter.check_rate_limit(client_id):
        raise ValueError("Rate limit exceeded")
    
    handler = TOOL_HANDLERS.get(name)
    # ... rest of implementation

4. Missing: Error Handling for File Operations

File: opencontractserver/mcp/resources.py (referenced in docs, not fully shown)

Concern: Document text extraction should handle missing/corrupt files:

full_text = ""
if document.txt_extract_file:
    try:
        with document.txt_extract_file.open('r') as f:
            full_text = f.read()
    except (IOError, OSError) as e:
        logger.error(f"Failed to read text file for {document.slug}: {e}")
        full_text = "[Text extraction file unavailable]"

5. Potential Issue: Underscore in Slug Pattern

Files: Multiple

Current Pattern: [A-Za-z0-9\-]+

Observation: OpenContracts models use SlugField, which allows underscores by default. If any existing slugs contain underscores, they'll be inaccessible via MCP.

Recommendation:

  • Verify that all existing slugs in production match this pattern
  • Or update pattern to [A-Za-z0-9\-_]+ if underscores are used
  • Document this restriction clearly in the proposal

6. Minor: Changelog Credit Line

File: CHANGELOG.md:6

Per CLAUDE.md baseline rules #3: "Never credit Claude or Claude Code in commit messages, PR messages, etc."

While this is a proposal/documentation PR rather than a production feature, consider whether the extensive implementation details in the changelog are appropriate at this stage.


🔍 Security Review

✅ Good Security Practices

  1. Input validation: Regex-based slug validation prevents injection
  2. Anonymous-only access: No privilege escalation possible
  3. Read-only operations: No mutations allowed
  4. Permission filtering: Uses existing .visible_to_user() consistently

⚠️ Recommendations

  1. Add CORS headers if MCP will be accessed from browsers
  2. Consider request size limits for search queries (prevent DoS)
  3. Add logging for failed permission checks (security monitoring)
  4. Document authentication in deployment section (currently anonymous-only, but extensible)

🎨 Code Quality

Style & Conventions ✅

  • Follows CLAUDE.md DRY principles
  • Proper separation of concerns
  • Type hints used consistently (str | None, tuple[str, str])
  • Imports follow project conventions

Testing ✅

  • Good coverage of edge cases
  • Follows TestCase pattern correctly
  • Uses setUpTestData for efficiency
  • Tests security boundaries (private corpus denial)

Performance ✅

  • Pagination implemented correctly
  • Limit capping prevents abuse
  • Uses efficient querysets
  • Async/sync bridging with sync_to_async is appropriate

📋 Testing Recommendations

  1. Run the test suite to ensure all tests pass:

    docker compose -f test.yml run django pytest opencontractserver/mcp/tests/ -v
  2. Integration test the ASGI routing:

    • Start local server
    • Verify /mcp/sse/ endpoint responds
    • Verify non-MCP routes still work
  3. Test with actual MCP client (if available):

    • Verify SSE transport works end-to-end
    • Test resource URIs resolve correctly
    • Confirm tools execute and return valid JSON

📚 Documentation Quality

The docs/mcp/mcp_interface_proposal.md is exceptional:

  • Clear problem statement and design principles
  • Complete code examples with context
  • Implementation details for all resources and tools
  • Security considerations addressed
  • Testing strategy outlined

Minor suggestions:

  1. Add a "Quick Start" section at the top
  2. Include example MCP client configuration
  3. Document environment variables needed (MCP_SERVER_ENABLED, etc.)

🚀 Deployment Considerations

  1. Environment Variables: Document required settings in deployment guide
  2. Cache Backend: Ensure production cache (Redis) is configured for rate limiting
  3. Monitoring: Add metrics for MCP endpoint usage
  4. Graceful Degradation: Handle MCP disabled state cleanly

✅ Checklist for Merge


🎉 Conclusion

This is excellent work! The MCP interface is well-designed, performant, and follows OpenContracts architecture principles. The few issues identified are straightforward to fix and don't diminish the quality of this contribution.

Once the critical issues are addressed, this will be a valuable addition to the project.

Estimated effort to address issues: 1-2 hours

Great job on this comprehensive implementation! 🚀

- Add docs/mcp/README.md with getting started guide, tool/resource
  reference, transport options, and architecture overview
- Remove proposal document (replaced by actual documentation)
- Switch to Streamable HTTP transport (stateless mode) for reliability
- Use DocumentPath for corpus document membership (source of truth)
- Add CorpusSettings embed model configuration UI
@claude
Copy link

claude bot commented Dec 29, 2025

PR Review: MCP Interface Implementation

This is a well-designed implementation of a Model Context Protocol server for OpenContracts. The code quality is high, security considerations are solid, and the architecture aligns well with project conventions.

Strengths

Architecture & Design

  • Clean separation of concerns across server.py, tools.py, resources.py, formatters.py, permissions.py, config.py
  • Stateless HTTP transport avoids session initialization race conditions
  • Follows OpenContracts patterns: Uses visible_to_user() correctly
  • URI parsing with regex prevents injection attacks

Security Model

  • Read-only by design
  • Anonymous user model ensures only public resources accessible
  • Slug validation prevents injection
  • IDOR prevention via slug-based lookups

Performance

  • Uses AnnotationQueryOptimizer to eliminate N+1 queries
  • Prefetching for threaded messages
  • Proper pagination with limits capped at 100

Test Coverage

  • Comprehensive unit tests for URI parsing, permissions, resources, tools
  • Tests verify public/private access separation

Critical Issues

1. Missing MCP Dependency (BLOCKING)

requirements/base.txt adds mcp>=1.0.0, but this package does not exist on PyPI yet.

ACTION REQUIRED: Update requirements/base.txt with working dependency source or wait for PyPI release.

2. Frontend Permission Logic Complexity (HIGH)

CorpusSettings.tsx (lines 645-668) has complex ownership detection with two fallback mechanisms that is fragile.

RECOMMENDATION: Add isOwner boolean field to GraphQL CorpusType instead of inferring from permissions.

3. Rate Limiting Not Enforced (MEDIUM)

RateLimiter class exists but is not used in MCP server handlers.

RECOMMENDATION: Add rate limiting to call_tool and read_resource handlers.

4. Missing creator.id Field (MEDIUM)

CorpusSettings.tsx checks corpus.creator.id - verify GraphQL fragment includes this field.

5. Potential N+1 in Document Search (LOW)

tools.py:78 fetches corpus document IDs then filters - use subquery instead for large corpora.

Recommendations Summary

Must Fix Before Merge:

  1. Fix MCP dependency in requirements/base.txt
  2. Verify/add creator.id field in GraphQL

Should Fix (High Priority):
3. Simplify frontend ownership logic
4. Enforce rate limiting
5. Fix N+1 in document search

Nice to Have:
6. Add integration tests for ASGI routing
7. Implement caching for resources
8. Consolidate constants
9. Export mcp_asgi_app from init.py

Verdict

Overall Assessment: High-quality work with excellent architecture and security design.

Recommendation: Approve with required changes

Great work! The code demonstrates strong understanding of both MCP protocol and OpenContracts architecture. With the dependency issue resolved, this will be a valuable addition.

Files Reviewed: 16 | Lines Added: 1959 | Lines Removed: 22

@JSv4 JSv4 mentioned this pull request Dec 29, 2025
- Add tests for tools.py: list_documents, get_document_text,
  list_annotations, search_corpus, list_threads, get_thread_messages
- Add tests for resources.py: document, annotation, thread resources
- Add tests for formatters.py: all formatter functions now at 100%
- Add tests for server.py components: lifespan manager, URI parser

Coverage improvements:
- tools.py: 18% → 92%
- resources.py: 22% → 98%
- formatters.py: 37% → 100%
- Overall MCP module: 59.76% → 94%
@claude
Copy link

claude bot commented Dec 29, 2025

Code Review - PR #726: MCP Interface Implementation

This is an excellent implementation of the Model Context Protocol (MCP) server for OpenContracts. The code quality is high, architecture is sound, and testing is comprehensive. Below are my detailed findings:


✅ Strengths

1. Architecture & Design

  • Clean separation of concerns: Resources, tools, formatters, and permissions are properly modularized
  • Stateless HTTP transport: Smart choice to use StreamableHTTP in stateless mode, avoiding session initialization race conditions
  • Read-only by design: Security-first approach with no mutation operations
  • Anonymous user model: Elegant solution for public-only access using Django's existing permission system

2. Security

  • Proper permission checks: Consistently uses visible_to_user() pattern throughout
  • Input validation: Regex-based URI parsing prevents injection attacks (slug pattern: [A-Za-z0-9\-]+)
  • No IDOR vulnerabilities: All queries filtered by visibility before slug matching
  • Rate limiting foundation: RateLimiter class ready for production deployment (though not currently enforced)

3. Performance

  • Query optimization: Uses AnnotationQueryOptimizer to prevent N+1 queries
  • Proper prefetching: Thread messages use prefetch_related('replies__replies')
  • Pagination enforcement: Hard limits (max 100 results) prevent resource exhaustion
  • Efficient document lookups: Uses get_documents() (DocumentPath source of truth)

4. Test Coverage

  • Comprehensive test suite: 1432 lines of tests covering all components
  • URI parsing tests: Validates both valid and invalid patterns
  • Permission boundary tests: Ensures private resources are inaccessible to AnonymousUser
  • Edge cases: No text file, empty results, nonexistent resources, max depth recursion
  • Integration tests: Full resource and tool workflows

5. Code Quality

  • Type hints: Modern Python with from __future__ import annotations and TYPE_CHECKING
  • Documentation: Clear docstrings with Args/Returns sections
  • Error handling: Appropriate exception raising (DoesNotExist, ValueError, ObjectDoesNotExist)
  • Consistent patterns: All tools follow the same structure and formatting

⚠️ Issues Found

CRITICAL: Corpus Visibility Permission Check Regression (Frontend)

Location: frontend/src/components/corpuses/CorpusSettings.tsx:644-666

Issue: The frontend permission logic for changing corpus visibility has a critical flaw:

// Compare by ID first, fallback to email comparison for reliability
const isOwnerByIdentity = Boolean(
  currentUser &&
    corpus.creator &&
    ((currentUser.id &&
      corpus.creator.id &&
      currentUser.id === corpus.creator.id) ||
      (currentUser.email &&
        corpus.creator.email &&
        currentUser.email === corpus.creator.email))
);

Problems:

  1. Email comparison is unreliable: Email can be changed, making it an unstable identifier
  2. Race condition: If currentUser isn't loaded yet, the fallback to hasFullOwnerPermissions may incorrectly grant access
  3. Inconsistency with backend: Backend SetCorpusVisibility mutation checks ownership via corpus.creator == user (Django User instance equality, which uses ID)

Impact:

  • Non-owners with all core permissions could potentially change corpus visibility in the UI
  • Email-based ownership check could fail if user changes email between sessions

Recommendation:

// Only trust ID-based comparison, require currentUser to be loaded
const isOwner = Boolean(
  currentUser?.id && 
  corpus.creator?.id && 
  currentUser.id === corpus.creator.id
);

const canChangeVisibility = isOwner || canPermission;

If currentUser is guaranteed to be loaded by AuthGate, remove the fallback entirely. If not, show a loading state instead of falling back to permission inference.


MEDIUM: Missing Bounding Box Data in Annotation Formatter

Location: opencontractserver/mcp/formatters.py:39-56

Issue: The format_annotation() function doesn't include the bounding_box field, but tests expect it:

# Test expects this (test_mcp.py:1033):
self.assertEqual(data["bounding_box"]["x"], 10)

# But formatter doesn't include it (formatters.py:49-56):
return {
    "id": str(annotation.id),
    "page": annotation.page,
    "raw_text": annotation.raw_text or "",
    "annotation_label": label_data,
    "structural": annotation.structural,
    "created": annotation.created.isoformat() if annotation.created else None,
}
# Missing: "bounding_box": annotation.bounding_box

Impact: MCP clients cannot get annotation bounding box coordinates for rendering

Recommendation: Add bounding box to formatter:

return {
    "id": str(annotation.id),
    "page": annotation.page,
    "raw_text": annotation.raw_text or "",
    "annotation_label": label_data,
    "bounding_box": annotation.bounding_box,  # Add this
    "structural": annotation.structural,
    "created": annotation.created.isoformat() if annotation.created else None,
}

MEDIUM: Rate Limiter Not Enforced

Location: opencontractserver/mcp/permissions.py:59-89

Issue: RateLimiter class is implemented and tested, but never actually used in the MCP server request handling

Impact: No protection against DoS attacks or resource exhaustion from malicious clients

Recommendation:

  1. Add rate limiting to create_mcp_asgi_app() in server.py:
    from .permissions import RateLimiter
    from django.conf import settings
    
    rate_limiter = RateLimiter(
        max_requests=settings.MCP_SERVER['rate_limit']['requests'],
        window_seconds=settings.MCP_SERVER['rate_limit']['window']
    )
    
    async def app(scope, receive, send):
        # Extract client ID from scope (IP address, session, etc.)
        client_id = scope.get('client')[0] if scope.get('client') else 'unknown'
        
        if not rate_limiter.check_rate_limit(client_id):
            # Return 429 Too Many Requests
            await send({...})
            return
        # ... rest of handler
  2. Or document that rate limiting should be handled at the reverse proxy level (nginx/Cloudflare)

LOW: Missing MCP Dependency in requirements.txt

Issue: The PR uses mcp library but doesn't add it to requirements/base.txt

Evidence: Imports like:

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.server.streamable_http_manager import StreamableHTTPSessionManager
from mcp.types import Resource, TextContent, Tool

Impact: Installation will fail without manual pip install mcp

Recommendation: Add to requirements/base.txt:

mcp>=1.0.0  # Model Context Protocol server

LOW: Configuration Enabled by Default

Location: config/settings/base.py:988-999

Issue: MCP_SERVER['enabled'] defaults to False, but the HTTP router in config/asgi.py unconditionally routes /mcp/* to the MCP app

# asgi.py always routes to MCP regardless of settings
if path == "/mcp" or path.startswith("/mcp/"):
    await mcp_app(scope, receive, send)

Impact: Minor - MCP endpoint is exposed even when MCP_SERVER['enabled'] = False

Recommendation: Check the setting in the ASGI router:

from django.conf import settings

async def router(scope, receive, send):
    path = scope.get("path", "")
    if settings.MCP_SERVER.get('enabled', False) and (path == "/mcp" or path.startswith("/mcp/")):
        await mcp_app(scope, receive, send)
    else:
        await django_app(scope, receive, send)

LOW: Hardcoded Magic Numbers

Location: Multiple files

Issue: Several hardcoded limits that should be constants:

  1. opencontractserver/mcp/resources.py:29 - [:20] label limit
  2. opencontractserver/mcp/formatters.py:86 - max_depth: int = 3 recursion limit
  3. opencontractserver/mcp/resources.py:99 - [:500] text preview length

Recommendation: Move to opencontractserver/mcp/config.py or opencontractserver/constants/:

# mcp/config.py
MAX_LABELS_IN_RESOURCE = 20
MAX_MESSAGE_REPLY_DEPTH = 3
DOCUMENT_TEXT_PREVIEW_LENGTH = 500

🎯 Best Practices Observed

  1. CLAUDE.md Compliance:

    • ✅ Uses existing patterns (visible_to_user(), AnnotationQueryOptimizer)
    • ✅ Proper DocumentPath usage (source of truth for corpus membership)
    • ✅ No dead code or duplication
    • ✅ Tests follow established patterns (TransactionTestCase, TestCase)
  2. Django Best Practices:

    • ✅ No raw SQL queries
    • ✅ Proper use of select_related/prefetch_related
    • ✅ Type annotations throughout
    • ✅ ASGI app properly integrated with Django lifecycle
  3. MCP Protocol Compliance:

    • ✅ Correct JSON-RPC 2.0 structure
    • ✅ Proper resource URI patterns
    • ✅ Tool schemas with validation
    • ✅ Both stdio and HTTP transports supported

📝 Minor Suggestions

  1. Documentation: Add API rate limit guidance to docs/mcp/README.md
  2. Error messages: Consider i18n support for user-facing error strings
  3. Monitoring: Add logging for MCP request metrics (request count, latency, errors)
  4. Caching: Consider caching corpus metadata (currently re-fetched on every request)

🏁 Verdict

APPROVE WITH MINOR FIXES REQUIRED

This is excellent work! The MCP implementation is well-architected, thoroughly tested, and follows project conventions. The critical issue with corpus visibility permissions should be fixed before merge, but it's a frontend-only issue that doesn't affect the MCP server itself.

Required fixes before merge:

  1. ✅ Fix corpus visibility ownership check in CorpusSettings.tsx
  2. ✅ Add bounding_box to annotation formatter
  3. ✅ Add mcp to requirements.txt

Recommended for follow-up PR:

  • Enforce rate limiting in production
  • Move magic numbers to constants
  • Add setting check to ASGI router
  • Add monitoring/logging

Great job! This is production-ready code once the required fixes are applied. 🚀

@JSv4 JSv4 merged commit 21a5f77 into main Dec 29, 2025
13 checks passed
@JSv4 JSv4 deleted the claude/add-mcp-interface-t4xGr branch December 29, 2025 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add MCP Server Interface

3 participants