A comprehensive end-to-end system for building intelligent chatbots from PDF documentation using Oracle Database 23ai's native vector search capabilities, RDF knowledge graphs, and advanced deduplication techniques.
This project transforms Oracle AI Vector Search documentation into a production-ready RAG (Retrieval-Augmented Generation) system by:
- Extracting structured content from PDF documents with intelligent deduplication
- Building RDF knowledge graphs with semantic relationships
- Creating vector embeddings optimized for Oracle Database 23ai
- Providing AI-powered Q&A with comprehensive duplicate detection
- Validating system quality through multi-dimensional analysis
PDF Document       →  RDF Graph      →  Vector Embeddings  →  Oracle Database  →  AI Chatbot
      ↓                   ↓                    ↓                     ↓                 ↓
Content Extraction    Knowledge          Embeddings            Vector Storage     Smart Retrieval
& Deduplication       Relationships      Generation            with Metadata      & LLM Response
| File | Purpose | Key Features |
|---|---|---|
| createRDFGraph.py | PDF → RDF conversion | Advanced deduplication, TOC extraction, semantic chunking |
| oracleRAG.py | Traditional RAG pipeline | Direct PDF processing, Oracle vector storage |
| rdfPoweredChatbot.py | RDF-enhanced chatbot | Knowledge graph integration, relationship-aware responses |
| File | Purpose | Key Features |
|---|---|---|
| validateRDFGraph.py | Quality assessment | 5-tier validation, chatbot readiness scoring |
| rdfDupAnalysis.py | Duplicate detection | Content analysis, URI validation, relationship audit |
# Python packages
pip install PyPDF2 rdflib oracledb langchain numpy
pip install sentence-transformers langchain-community langchain-ollama
# Docker (for Oracle Database)
# Ollama with Llama model (for local LLM)
The easiest way to get started is using Docker with Oracle Database 23ai Free:
# Start Oracle Database 23ai container with vector support
docker run --name free23ai -d -p 1521:1521 \
-e ORACLE_PASSWORD=Welcome12345 \
-e APP_USER=testuser \
-e APP_USER_PASSWORD=Welcome12345 \
gvenzl/oracle-free:23.7-slim-faststart
That's it! The scripts work with this setup out of the box.
Their default connection settings match the Docker container:
# Connection example
import oracledb

connection = oracledb.connect(
    user="testuser",
    password="Welcome12345",
    dsn="localhost:1521/FREEPDB1"
)
If you prefer using Oracle Autonomous Database, update the connection parameters and point at the downloaded wallet:
# Connect to ADB
import oracledb

connection = oracledb.connect(
    user="admin",
    password="Password",
    dsn="mydb_high",
    config_dir="/Users/dev/Wallets/Wallet_mydb",
    wallet_location="/Users/dev/Wallets/Wallet_mydb",
    wallet_password="Password"
)
- Start Oracle Database (if using Docker)
docker run --name free23ai -d -p 1521:1521 \
-e ORACLE_PASSWORD=Welcome12345 \
-e APP_USER=testuser \
-e APP_USER_PASSWORD=Welcome12345 \
gvenzl/oracle-free:23.7-slim-faststart
- Generate RDF Knowledge Graph
python createRDFGraph.py
# Creates: vectorsearchcleanedchunked.nt
- Validate Graph Quality
python validateRDFGraph.py
# Outputs: Comprehensive quality assessment
- Build Vector-Powered Chatbot
python rdfPoweredChatbot.py
# Creates: Complete RAG system with chatbot interface
# Run oracleRAG.py for direct PDF processing
python oracleRAG.py
# Features:
# - Direct PDF chunking
# - HuggingFace embeddings
# - Oracle vector storage
# - Basic similarity search
# Step 1: Create clean RDF graph
python createRDFGraph.py
# Step 2: Validate quality (optional but recommended)
python validateRDFGraph.py
# Step 3: Build enhanced chatbot
python rdfPoweredChatbot.py
# Benefits:
# - Semantic relationships preserved
# - Advanced duplicate prevention
# - Context-aware responses
# - Rich metadata integration
- Header-level deduplication using normalized comparison
- Position-based overlap detection to prevent content reuse
- Content similarity analysis using sequence matching
- Final validation to ensure no duplicate chunks exist
- Table of Contents extraction for document structure
- Conservative header variations for robust matching
- Natural break point detection for optimal chunking
- Technical content validation with domain-specific keywords
- Native VECTOR data type support for Oracle 23ai
- Efficient similarity search using cosine distance
- LOB handling for large text content
- Rich metadata storage with JSON support
- Comprehensive error handling and recovery mechanisms
- Real-time duplicate detection during retrieval
- Debug monitoring with detailed logging
- Quality validation with scoring system
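The header-level and content-similarity checks listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in createRDFGraph.py: the function names, the `(header, body)` tuple shape, and the 0.9 threshold are all assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def normalize_header(header: str) -> str:
    """Lowercase, drop a leading section number, strip punctuation, collapse spaces."""
    text = header.lower().strip()
    text = re.sub(r"^\d+(\.\d+)*\.?\s+", "", text)  # "1.1 Vector Indexes" -> "vector indexes"
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Compare two chunk bodies by sequence matching (threshold is an assumed value)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= threshold


def deduplicate(chunks, threshold: float = 0.9):
    """Keep the first chunk for each normalized header and each near-identical body."""
    seen_headers = set()
    kept = []
    for header, body in chunks:
        key = normalize_header(header)
        if key in seen_headers:
            continue  # header-level duplicate
        if any(is_near_duplicate(body, kept_body, threshold) for _, kept_body in kept):
            continue  # content-level duplicate
        seen_headers.add(key)
        kept.append((header, body))
    return kept
```

Comparing every new chunk against all kept chunks is quadratic, which is usually acceptable for a single document but may need a cheaper pre-filter (for example a hash of the normalized body) on larger corpora.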
The validation system scores the graph out of 100 points across five tiers:
- Basic Structure (20 pts) - Headers, chunks, triples count
- Content Quality (25 pts) - Word distribution, depth analysis
- Chunk Characteristics (20 pts) - Size optimization, consistency
- Semantic Relationships (15 pts) - Connection richness, diversity
- Chatbot Readiness (20 pts) - Technical coverage, actionable content
The combined score maps to a readiness rating:
- 85%+: EXCELLENT - Production ready
- 70-84%: GOOD - Minor optimizations needed
- 50-69%: FAIR - Moderate improvements required
- <50%: POOR - Significant restructuring needed
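The tier weights and rating bands above can be combined as in the following sketch. The dictionary keys and function names are hypothetical, and treating the bands as continuous (anything from 70 up to but not including 85 is GOOD) is an assumption, since the source lists whole-number ranges only.

```python
# Maximum points per tier, taken from the list above (they sum to 100).
TIER_MAX_POINTS = {
    "basic_structure": 20,
    "content_quality": 25,
    "chunk_characteristics": 20,
    "semantic_relationships": 15,
    "chatbot_readiness": 20,
}


def overall_percentage(tier_scores: dict) -> float:
    """Sum the per-tier scores, clamping each one to its allowed range."""
    total_max = sum(TIER_MAX_POINTS.values())
    earned = 0.0
    for tier, max_points in TIER_MAX_POINTS.items():
        earned += min(max(tier_scores.get(tier, 0), 0), max_points)
    return 100.0 * earned / total_max


def rating(percentage: float) -> str:
    """Map a percentage to the rating bands described above."""
    if percentage >= 85:
        return "EXCELLENT"
    if percentage >= 70:
        return "GOOD"
    if percentage >= 50:
        return "FAIR"
    return "POOR"
```

Clamping keeps a single over-reported tier from masking weaknesses elsewhere in the total.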
# After running rdfPoweredChatbot.py
chain, connection, retriever_connection = process_rdf_documents()
# Ask questions
answer = chain.invoke({"question": "What are Vector Indexes?"})
print(answer)
# System provides:
# - Context from relevant document sections
# - Semantic relationship awareness
# - Duplicate detection during retrieval
# - Rich metadata for citations
- "What are Vector Indexes?"
- "How do I create vector embeddings?"
- "What are the different index types I can use?"
- "What are the performance considerations for vector indexes?"
Docker Database Not Starting
# Check container status
docker ps -a
# View logs if container failed
docker logs free23ai
# Restart if needed
docker restart free23ai
Connection Errors
# Verify database is running
docker exec -it free23ai sqlplus testuser/Welcome12345@FREEPDB1
# Check if container ports are accessible
telnet localhost 1521
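Where telnet is not installed, the same port check can be done from Python. This is a small sketch using only the standard library; the function name is hypothetical. An open port only shows that the listener accepts TCP connections, not that the database has finished starting.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 1521, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("listener reachable" if port_is_open() else "listener not reachable")
```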
LOB Object Errors
# Solution: Use robust retriever in rdfPoweredChatbot.py
# The system automatically falls back to LOB-safe methods
Duplicate Content in Results
# Run analysis tools first:
python rdfDupAnalysis.py
# Then regenerate with stricter deduplication:
# Adjust similarity thresholds in createRDFGraph.py
Empty RDF Graph
# Check PDF path and format
# Verify TOC extraction in createRDFGraph.py
# Review header matching patterns
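A quick way to confirm whether the generated file is actually empty is to count its triples. N-Triples stores one triple per line, each terminated by a period, so a line count is a reasonable sanity check. The helper below is a hypothetical sketch that does not validate syntax; the default file name is the output of createRDFGraph.py mentioned earlier.

```python
from pathlib import Path


def count_ntriples(path: str = "vectorsearchcleanedchunked.nt") -> int:
    """Count lines that look like triples, skipping blank lines and comments."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue
            if stripped.endswith("."):
                count += 1
    return count
```

A count of zero points at the PDF path or header matching rather than at the later embedding steps.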
- Chunk Size: Optimal range 300-1500 characters for embeddings
- Batch Processing: Use batch sizes of 100 for large datasets
- Index Creation: Add database indexes for faster retrieval
- Memory Management: Monitor connection pooling for large documents
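The batch-size recommendation above can be applied with a small generator plus `cursor.executemany`. This is a sketch under assumptions: the helper names are hypothetical, and `connection` is expected to be an open python-oracledb connection (or any DB-API connection whose cursor works as a context manager).

```python
from itertools import islice


def batched(rows, batch_size: int = 100):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(connection, sql: str, rows, batch_size: int = 100) -> int:
    """Insert rows in batches and commit once at the end; returns the row count."""
    inserted = 0
    with connection.cursor() as cursor:
        for batch in batched(rows, batch_size):
            cursor.executemany(sql, batch)
            inserted += len(batch)
    connection.commit()
    return inserted
```

Committing once at the end keeps the load atomic; committing per batch instead would trade that for smaller transactions on very large documents.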
# HuggingFace (default)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
VECTOR_DIMENSION = 384
# Nomic (alternative)
EMBEDDING_MODEL = "nomic-embed-text-v1.5"
VECTOR_DIMENSION = 768
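Because the vector column has a fixed dimension, switching embedding models without updating VECTOR_DIMENSION is an easy mistake to make. The sketch below checks the length of an embedding and L2-normalizes it before storage. The function names are hypothetical, and it uses plain Python rather than numpy so it runs without extra packages.

```python
import math

VECTOR_DIMENSION = 384  # 384 for all-MiniLM-L6-v2, 768 for nomic-embed-text-v1.5


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def validate_embedding(vector, expected_dim: int = VECTOR_DIMENSION):
    """Reject embeddings of the wrong size, then return the normalized vector."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, expected {expected_dim}"
        )
    return l2_normalize(vector)
```

Failing early here gives a clearer message than a database error at insert time.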
CHUNK_SIZE = 1000 # Characters per chunk
CHUNK_OVERLAP = 20 # Overlap between chunks
MAX_CHUNK_SIZE = 450 # Maximum chunk size for embeddings
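To show how a chunk size and overlap interact with natural break points, here is a minimal chunker that prefers paragraph, sentence, and word boundaries. It is an illustrative sketch, not the logic in createRDFGraph.py: the separator list, the rule that a break must fall in the second half of the window, and the function names are all assumptions.

```python
CHUNK_SIZE = 1000    # characters per chunk, as in the settings above
CHUNK_OVERLAP = 20   # characters shared between neighbouring chunks


def find_break(text: str, start: int, hard_end: int) -> int:
    """Pick a split point no later than hard_end, preferring natural boundaries."""
    if hard_end >= len(text):
        return len(text)
    window = text[start:hard_end]
    for separator in ("\n\n", ". ", " "):
        position = window.rfind(separator)
        if position > len(window) // 2:  # ignore breaks that would leave a tiny chunk
            return start + position + len(separator)
    return hard_end  # no natural boundary found, cut at the size limit


def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP):
    """Split text into chunks of at most chunk_size characters with overlap."""
    if not 0 <= overlap < chunk_size // 2:
        raise ValueError("overlap must be non-negative and less than half the chunk size")
    chunks = []
    start = 0
    while start < len(text):
        end = find_break(text, start, start + chunk_size)
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks
```

Requiring the overlap to stay below half the chunk size guarantees that each iteration moves forward, so the loop always terminates.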
VECTOR_TABLE = "rdf_vector_chunks" # RDF-enhanced system
VECTOR_TABLE = "hf_emb" # Traditional RAG system
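For orientation, the sketch below shows what a table with a native VECTOR column and a cosine-distance query might look like. The column names (`chunk_text`, `metadata`, `embedding`) and helper functions are hypothetical and may not match the schema the scripts create. Binding a vector as `array.array("f", ...)` assumes a python-oracledb release with VECTOR support.

```python
import array

VECTOR_TABLE = "rdf_vector_chunks"
VECTOR_DIMENSION = 384


def create_table_sql(table: str = VECTOR_TABLE, dimension: int = VECTOR_DIMENSION) -> str:
    """DDL with CLOB text, JSON metadata, and a fixed-size FLOAT32 vector."""
    return (
        f"CREATE TABLE {table} ("
        "id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, "
        "chunk_text CLOB, "
        "metadata JSON, "
        f"embedding VECTOR({dimension}, FLOAT32))"
    )


def search_sql(table: str = VECTOR_TABLE) -> str:
    """Exact nearest-neighbour query ordered by cosine distance."""
    return (
        "SELECT id, chunk_text, "
        "VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance "
        f"FROM {table} ORDER BY distance FETCH FIRST :top_k ROWS ONLY"
    )


def to_bind_vector(values):
    """Convert a sequence of floats to the 32-bit array used for binding."""
    return array.array("f", values)


def similarity_search(connection, query_embedding, top_k: int = 4):
    """Run the query and return (id, text, distance) tuples with LOBs read into str."""
    binds = {"query_vector": to_bind_vector(query_embedding), "top_k": top_k}
    with connection.cursor() as cursor:
        cursor.execute(search_sql(), binds)
        rows = cursor.fetchall()
    # CLOB columns may come back as LOB objects; read them into plain strings.
    return [
        (row_id, text.read() if hasattr(text, "read") else text, distance)
        for row_id, text, distance in rows
    ]
```

The table name is interpolated into the SQL because identifiers cannot be bound, so it should only ever come from trusted configuration, never from user input.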
- PyPDF2: PDF text extraction
- rdflib: RDF graph creation and SPARQL queries
- oracledb: Oracle Database connectivity
- langchain: Document processing and RAG pipeline
- numpy: Vector operations and normalization
- sentence-transformers: Text embedding generation
- langchain-ollama: Local LLM integration
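- langchain-nomic: Nomic embedding support (not part of the install commands above; install it separately if you use the Nomic model)
- difflib: Content similarity analysis (Python standard library, no installation needed)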
- Fork the repository
- Create a feature branch
- Add comprehensive tests
- Update documentation
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Oracle Database 23ai team for native vector search capabilities
- LangChain community for RAG framework
- HuggingFace for pre-trained embedding models
- Ollama project for local LLM deployment
For questions or issues:
- Check the troubleshooting section above
- Review the validation output for specific recommendations
- Examine debug logs for detailed error information
- Open an issue with relevant log excerpts and system configuration
Built with ❤️ for Oracle AI Vector Search and knowledge graph enthusiasts