An AI-powered medical assistant that accepts patient voice and medical image inputs, processes them through a multimodal RAG pipeline, and returns realistic doctor-like spoken responses.
This project demonstrates a full-stack, multimodal chatbot designed for simulated medical consultations. It leverages state-of-the-art models for speech-to-text (STT), image-based diagnostic reasoning, and text-to-speech (TTS) to create an interactive, human-like AI doctor assistant.
- **Voice Input:** Transcribe patient speech into text using Groq's Whisper API.
- **Image Diagnosis:** Analyze medical images (e.g., X-rays, dermatology photos) with LLaMA-4 Scout via Groq for diagnostic insights.
- **Speech Output:** Convert AI-generated responses into natural-sounding doctor voices with ElevenLabs.
- **Web Interface:** User-friendly Gradio UI for recording audio, uploading images, and receiving text and audio feedback.
- **Prompt Engineering:** Tailored system prompts to ensure concise, human-like, and clinically appropriate responses (a sketch of such a prompt follows this list).
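The README does not reproduce the actual prompt, so the following is a hedged sketch of a system prompt with these properties. The wording is hypothetical, not the project's exact text:

```python
# Hypothetical system prompt illustrating the "concise, human-like,
# clinically appropriate" constraints described above; the project's
# actual wording may differ.
SYSTEM_PROMPT = (
    "You are a doctor speaking directly to a patient. "
    "Look at the image and the patient's description, then answer in "
    "one short paragraph of plain, natural speech. "
    "Do not use bullet points, markdown, or phrases like 'as an AI'. "
    "If something looks concerning, say so calmly and recommend seeing "
    "a clinician in person."
)
```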
- **Backend:** Python 3.10
- **Frontend:** Gradio
- **STT:** Groq Whisper (`whisper-large-v3`); see the pipeline sketch after this list
- **Image Analysis:** `meta-llama/llama-4-scout-17b-16e-instruct` on Groq
- **TTS:** ElevenLabs API (`eleven_turbo_v2`)
- **Containerization:** Docker (optional, for a GPU-based Space)
- **Deployment:** Hugging Face Spaces
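To make the stack concrete, here is a minimal sketch of how the three hosted APIs could chain together, assuming the official `groq` SDK and the v1+ `elevenlabs` SDK. The helper names (`transcribe`, `diagnose`, `speak`) and the placeholder voice ID are illustrative, not the project's actual code:

```python
import base64
import os

from groq import Groq                     # pip install groq
from elevenlabs.client import ElevenLabs  # pip install elevenlabs

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
tts_client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])


def transcribe(audio_path: str) -> str:
    """STT: send the recorded patient audio to Groq's Whisper endpoint."""
    with open(audio_path, "rb") as f:
        result = groq_client.audio.transcriptions.create(
            file=(audio_path, f.read()),
            model="whisper-large-v3",
        )
    return result.text


def diagnose(question: str, image_path: str) -> str:
    """Image analysis: ask LLaMA-4 Scout about the image through Groq's
    OpenAI-compatible chat API, passing the image as a base64 data URL."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = groq_client.chat.completions.create(
        model="meta-llama/llama-4-scout-17b-16e-instruct",
        messages=[
            # A system prompt like the one sketched in the features
            # section would typically be prepended here.
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
    )
    return response.choices[0].message.content


def speak(text: str, out_path: str = "reply.mp3") -> str:
    """TTS: render the doctor's reply with ElevenLabs.
    The voice_id below is a placeholder, not the project's actual voice."""
    audio = tts_client.text_to_speech.convert(
        text=text,
        voice_id="YOUR_VOICE_ID",
        model_id="eleven_turbo_v2",
    )
    with open(out_path, "wb") as f:
        for chunk in audio:  # convert() streams the audio as byte chunks
            f.write(chunk)
    return out_path
```

Keeping each stage as a separate function makes it easy to swap a single component, for example a different Whisper size or TTS voice, without touching the rest of the pipeline.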
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/multimodal-medical-chatbot.git
  cd multimodal-medical-chatbot
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # Linux / macOS
  venv\Scripts\activate     # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file in the project root:

  ```
  GROQ_API_KEY=your_groq_api_key
  ELEVENLABS_API_KEY=your_elevenlabs_api_key
  ```
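A common way to load these keys at startup is `python-dotenv`. This is an assumption about how the app reads its configuration, not code shown in this repository:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the project root into the process environment

GROQ_API_KEY = os.environ["GROQ_API_KEY"]
ELEVENLABS_API_KEY = os.environ["ELEVENLABS_API_KEY"]
```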
Start the Gradio app locally (a sketch of the underlying interface wiring appears after the steps below):

```bash
python app.py
```

Open your browser at http://localhost:7860, then:
- Upload or record patient audio.
- Upload a medical image.
- View the transcribed text and the AI doctor’s diagnosis.
- Listen to the spoken response.
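For reference, the four steps above map onto a small amount of Gradio wiring. This is a minimal sketch that reuses the hypothetical `transcribe`, `diagnose`, and `speak` helpers from the pipeline sketch earlier; the project's actual `app.py` may be organized differently:

```python
import gradio as gr

# Hypothetical helpers from the pipeline sketch above.
from pipeline import transcribe, diagnose, speak


def consult(audio_path, image_path):
    """One end-to-end turn: audio -> text -> diagnosis -> spoken reply."""
    question = transcribe(audio_path)        # STT via Groq Whisper
    answer = diagnose(question, image_path)  # vision LLM via Groq
    return question, answer, speak(answer)   # TTS via ElevenLabs


demo = gr.Interface(
    fn=consult,
    inputs=[
        gr.Audio(sources=["microphone", "upload"], type="filepath"),
        gr.Image(type="filepath"),
    ],
    outputs=[
        gr.Textbox(label="Transcribed speech"),
        gr.Textbox(label="Doctor's response"),
        gr.Audio(label="Spoken response"),
    ],
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```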