A comprehensive guide for connecting locally deployed Large Language Models (LLMs) with local datasets using the Model Context Protocol (MCP).
This repository provides step-by-step instructions for:
- Setting up a local development environment with Conda
- Installing and configuring llama-cpp-python
- Running local language models (demonstrated with Microsoft Phi-4)
- Connecting models with local data sources via MCP (coming soon)
Prerequisites:
- Ubuntu 22.04+ (tested on 22.04.5 LTS)
- 8GB+ RAM (16GB recommended for larger models)
- Internet connection for initial setup
- Basic familiarity with command line operations
Download Miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
Install Miniconda:
bash Miniconda3-latest-Linux-x86_64.sh
Follow the interactive prompts. Accept the license and default installation location unless you have specific requirements.
Reload your shell configuration:
source ~/.bashrc
Create a dedicated conda environment:
conda create -n llama python=3.10
Activate the environment:
conda activate llama
Add conda-forge channel for better package availability:
conda config --add channels conda-forge
Verify your setup:
conda env list
conda config --show channels
Install llama-cpp-python:
conda install llama-cpp-python
Note: If you have a CUDA-capable GPU, consider installing the CUDA-enabled build for better performance:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
(older llama-cpp-python releases used -DLLAMA_CUBLAS=on instead)
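To confirm the installation before moving on, a quick check from Python is enough (recent llama-cpp-python releases expose a __version__ attribute; if yours does not, a clean import with no error is already a good sign):

import llama_cpp
# A successful import confirms the package and its native library are installed.
print("llama-cpp-python version:", llama_cpp.__version__)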
Download a model (Microsoft Phi-4 example):
mkdir -p models
cd models
wget https://huggingface.co/microsoft/phi-4-gguf/resolve/main/phi-4-q4.gguf
cd ..
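As an alternative to wget, the download can be scripted. This is a minimal sketch assuming the huggingface_hub package is installed (pip install huggingface_hub); the repository and file name are reused from the URL above:

from huggingface_hub import hf_hub_download

# Download phi-4-q4.gguf into ./models
path = hf_hub_download(
    repo_id="microsoft/phi-4-gguf",
    filename="phi-4-q4.gguf",
    local_dir="models",
)
print("Model saved to:", path)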
Create a file called test_model.py:
from llama_cpp import Llama
# Initialize the model
model = Llama(
    model_path="./models/phi-4-q4.gguf",
    n_ctx=2048,      # Context window
    n_threads=4,     # Number of CPU threads
    verbose=False
)
# Generate text
prompt = "Q: Explain the de Sitter thermodynamics using the Painlevé-Gullstrand (PG) coordinates"
output = model(
    prompt,
    max_tokens=1000,
    stop=["Q:", "\n\n"],
    echo=False,
    temperature=0.7
)
print("Response:")
print(output['choices'][0]['text'])

Run the script:
python test_model.py

Interactive chat example:
from llama_cpp import Llama
model = Llama(model_path="./models/phi-4-q4.gguf", verbose=False)
print("Chat with Phi-4 (type 'quit' to exit)")
print("-" * 40)
while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break
    response = model(f"Human: {user_input}\nAssistant:",
                     max_tokens=500,
                     stop=["Human:", "\n\n"],
                     temperature=0.7)
print(f"Assistant: {response['choices'][0]['text'].strip()}")- Memory Usage: Quantized models (Q4, Q8) balance performance and memory usage
Performance tips:
- Memory Usage: Quantized models (Q4, Q8) balance performance and memory usage
- Context Length: Adjust n_ctx based on your use case and available RAM
- Threading: Set n_threads to your CPU core count for optimal performance
- GPU Acceleration: Use CUDA or Metal builds for significant speed improvements (see the sketch after this list)
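Putting these tips together, here is a sketch of a tuned initialization. The values are illustrative, not recommendations, and n_gpu_layers only takes effect when llama-cpp-python was built with CUDA or Metal support:

import os
from llama_cpp import Llama

model = Llama(
    model_path="./models/phi-4-q4.gguf",
    n_ctx=4096,                   # larger context window; costs more RAM
    n_threads=os.cpu_count(),     # match the number of CPU cores
    n_gpu_layers=-1,              # offload all layers on CUDA/Metal builds
    verbose=False,
)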
Troubleshooting:

Installation fails on conda install:
# Try pip installation instead
pip install llama-cpp-python

Model loading errors:
- Verify the model file path is correct
- Ensure sufficient RAM is available
- Check model file integrity with ls -la models/ (a quick check script follows this list)
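If the model still refuses to load, a short Python check can tell you whether the problem is the path, a truncated download, or the loader itself (illustrative sketch):

import os
from llama_cpp import Llama

path = "./models/phi-4-q4.gguf"
if not os.path.isfile(path):
    raise SystemExit(f"Model file not found: {path}")

# A truncated download is usually far smaller than the published file size.
print(f"File size: {os.path.getsize(path) / 1e9:.2f} GB")

try:
    model = Llama(model_path=path, n_ctx=512, verbose=True)  # verbose=True prints loader diagnostics
except Exception as exc:
    raise SystemExit(f"Loader error: {exc}")
print("Model loaded successfully")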
Slow inference:
- Reduce model size (try Q2_K or Q4_K variants)
- Increase the n_threads parameter
- Consider GPU acceleration
Roadmap:
- Basic local LLM setup
- Model inference examples
- MCP integration for local data sources (see the preview sketch after this list)
- Advanced prompting techniques
- Performance optimization guide
- Docker containerization
- Web interface development
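MCP integration is still on the roadmap; as a preview, the sketch below shows roughly what a minimal MCP server exposing local files could look like with the official Python SDK (the mcp package's FastMCP helper). The server name and the read_local_file tool are illustrative choices, not code from this repository:

from mcp.server.fastmcp import FastMCP

# Hypothetical minimal MCP server that exposes local text files as a tool.
mcp = FastMCP("local-data")

@mcp.tool()
def read_local_file(path: str) -> str:
    """Return the contents of a local text file."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    mcp.run()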
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- llama.cpp for the efficient C++ implementation
- Microsoft for the Phi-4 model
- Hugging Face for model hosting