<div align="center">
<img alt="License" src="https://img.shields.io/badge/license-Apache 2.0-blue?style=for-the-badge">
<a href="https://epflight.github.io/MultiMeditron/index.html">
<img alt="Documentation build" src="https://img.shields.io/github/actions/workflow/status/EPFLiGHT/MultiMeditron/docs.yml?style=for-the-badge&label=Documentation">
</a>
<img alt="Docker build" src="https://img.shields.io/github/actions/workflow/status/EPFLiGHT/MultiMeditron/docker.yml?style=for-the-badge&label=Docker">
</div>
# MultiMeditron

<img src="assets/multimeditron.png" alt="MultiMeditron">
MultiMeditron pitch, link to paper on arxiv, etc

**MultiMeditron** is a **modular multimodal large language model (LLM)** built by students and researchers from [**LiGHT Lab**](https://www.light-laboratory.org/).
It is designed to seamlessly integrate multiple modalities such as text, images, or other data types into a single unified model architecture.
## Install Dependencies

Build and run our Docker image.
All scripts assume execution from the repository root inside the container.

## 🚀 Key Features

* **🔗 Modular Design:**
Easily plug in new modalities by following our well-documented interface. Each modality embedder (e.g., CLIP, Whisper, etc.) can be independently developed and added to the model.

* **🧩 Modality Interleaving:**
Supports interleaved multimodal inputs (e.g., text-image-text sequences), enabling complex reasoning across different data types (see the sketch after this list).

* **⚡ Scalable Architecture:**
Designed for distributed and multi-node environments — ideal for large-scale training or inference.

* **🧠 Flexible Model Backbone:**
Combine any modality embedder (like CLIP or SigLIP) with any LLM (like Llama, Qwen, or custom fine-tuned models).
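For instance, an interleaved text-image-text request is expressed as a single sample in the same `conversations`/`modalities` format used by the inference example further down. A minimal sketch; the prompt, file paths, and the one-entry-per-attachment-token convention are illustrative assumptions:

```python
# Hypothetical interleaved sample (text-image-text), following the
# conversations/modalities format shown in the inference example below.
ATTACHMENT_TOKEN = "<|reserved_special_token_0|>"

sample = {
    "conversations": [
        {
            "role": "user",
            "content": (
                f"Compare these two scans. First: {ATTACHMENT_TOKEN} "
                f"Second: {ATTACHMENT_TOKEN} What changed between them?"
            ),
        }
    ],
    # Presumably one entry per attachment token, in order of appearance.
    "modalities": [
        {"type": "image", "value": "path/to/scan_1.png"},
        {"type": "image", "value": "path/to/scan_2.png"},
    ],
}
```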
To build and run the image yourself:

```bash
docker build -t project-name -f docker/Dockerfile .
docker run --gpus all -it \
    -v $(pwd):/workspace \
    project-name
```
**Comment on lines +10 to +15 (Contributor):**

Do we want the user to build their own Docker image? We have a CI that builds a new one on every push to master (for both the ARM64 and AMD64 architectures); if a user wants to revert to an older version of master, they can do so by updating the tag. It would be better for them to just:

Suggested change:

```bash
# Uncomment for ARM64 architecture
# docker pull michelducartier24/multimeditron-git:latest-arm64
docker pull michelducartier24/multimeditron-git:latest-amd64
```

If Docker is not used, install dependencies manually:

```bash
pip install -r requirements.txt
```

**Comment on the `pip install` line (Contributor):**

Suggested change:

```bash
pip install ".[flash-attn]"
```

Note: Maybe it would be nice to offer this as a Python package? Do we already have a `pyproject.toml`? Then the instructions here could be `pip install -e .`.

## 🏗️ Model Architecture

<div align="center">
<img src="./assets/architecture.png" alt="MultiMeditron architecture">
</div>
## Running our Code

Model checkpoints are published on Hugging Face. To download a checkpoint and generate a reply with it, use the `generate` helper script:

```bash
generate.sh examples/sample_input.?
```

## ⚙️ Setup

### Using Docker (recommended)

On AMD64 architecture:

```bash
docker pull michelducartier24/multimeditron-git:latest-amd64
```

On ARM64 architecture:

```bash
docker pull michelducartier24/multimeditron-git:latest-arm64
```

### Using uv

**Prerequisite:** To install the right version of torch for your CUDA driver, please refer to [this documentation](https://pytorch.org/get-started/locally/).

Install [uv](https://docs.astral.sh/uv/):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Clone the repository:

```bash
git clone https://github.com/EPFLiGHT/MultiMeditron.git
cd MultiMeditron
```

Install dependencies:

```bash
uv pip install -e ".[flash-attn]"
```

## Reproduce the Paper

All experiments are configured using Hydra. Configuration files are stored in the `cookbooks/` directory.

- The **main recipe**, `cookbooks/main.yaml`, represents the final model configuration reported in the paper.
- **Ablation recipes** live in `cookbooks/ablations/`.
- Evaluation-specific settings live under `cookbooks/eval/`.
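Since the recipes are plain YAML consumed through Hydra/OmegaConf, you can also load and tweak one programmatically and pass the result as your own configuration. A minimal sketch; the field mentioned in the comment is a hypothetical key, not the actual schema:

```python
from omegaconf import OmegaConf

# Load the main recipe and inspect it.
cfg = OmegaConf.load("cookbooks/main.yaml")
print(OmegaConf.to_yaml(cfg))

# Adjust a field (hypothetical key shown), then save a custom cookbook
# that can be passed to scripts/train.sh instead of cookbooks/main.yaml.
# cfg.training.max_steps = 1000
OmegaConf.save(cfg, "cookbooks/custom.yaml")
```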

Before you can run the training, you need to download our dataset via:

```bash
download.sh
```

The main training run writes checkpoints to `checkpoints/main/`. You can use our `main.yaml` configuration to reproduce the training run of the MultiMeditron paper, or provide your own configuration:

```bash
bash scripts/train.sh cookbooks/main.yaml
```

The ablation script executes a well-defined set of ablations and writes checkpoints to a separate subdirectory, `checkpoints/ablations/<ablation_name>/`:

```bash
bash scripts/ablate.sh
```

Evaluation is handled by a single entry point:

```bash
bash scripts/eval.sh
```

By default, this script:
- Finds the latest checkpoint for the main model and all ablations
- Runs all available benchmarks
- Saves raw evaluation outputs to `data/eval/`
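Because the raw outputs are just structured files under `data/eval/`, you can also inspect them directly. A minimal sketch, assuming JSON files; the actual layout and format of the outputs are not specified here:

```python
import json
from pathlib import Path

# Assumption: evaluation writes JSON files somewhere under data/eval/.
# Adjust the glob pattern (and any key access) to the real output format.
for result_file in sorted(Path("data/eval").glob("**/*.json")):
    with result_file.open() as f:
        results = json.load(f)
    print(result_file, "->", type(results).__name__)
```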

Evaluation does not produce plots or tables directly; it only generates structured data.
Analysis and visualization are intentionally separated from evaluation.

Aggregate and post-process results:

```bash
python scripts/analyze.py
```

Generate plots and figures:

```bash
python scripts/plot.py
```

All plots should be generated solely from files in `data/eval`, ensuring full reproducibility.

**Comment on lines +68 to +76 (Contributor):**

We don't really have a lot of plots and figures at the moment (just tables), but maybe for the future it would be great.

## 💬 Inference Example

Here’s an example showing how to use **MultiMeditron** with **Llama 3.1 (8B)** and a single image input.

```python
import os

import torch
from transformers import AutoTokenizer

from multimeditron.dataset.preprocessor import modality_preprocessor
from multimeditron.dataset.loader import FileSystemImageLoader
from multimeditron.model.model import MultiModalModelForCausalLM
from multimeditron.dataset.preprocessor.modality_preprocessor import ModalityRetriever, SamplePreprocessor
from multimeditron.model.data_loader import DataCollatorForMultimodal

ATTACHMENT_TOKEN = "<|reserved_special_token_0|>"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({'additional_special_tokens': [ATTACHMENT_TOKEN]})
attachment_token_idx = tokenizer.convert_tokens_to_ids(ATTACHMENT_TOKEN)

# Load model
model = MultiModalModelForCausalLM.from_pretrained("path/to/trained/model")
model.to("cuda")

# Define input
modalities = [{"type": "image", "value": "path/to/image"}]
conversations = [{
    "role": "user",
    "content": f"{ATTACHMENT_TOKEN} Describe the image."
}]
sample = {"conversations": conversations, "modalities": modalities}

loader = FileSystemImageLoader(base_path=os.getcwd())

collator = DataCollatorForMultimodal(
    tokenizer=tokenizer,
    tokenizer_type="llama",
    modality_processors=model.processors(),
    modality_loaders={"image": loader},
    attachment_token_idx=attachment_token_idx,
    add_generation_prompt=True,
)

batch = collator([sample])

with torch.no_grad():
    outputs = model.generate(batch=batch, temperature=0.1)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0])
```

## 🧩 Adding a New Modality

MultiMeditron’s architecture is fully **extensible**.
To add a new modality, see the [developer documentation](https://epflight.github.io/MultiMeditron/guides/add_modality.html) for a step-by-step guide.
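At a high level, a new modality needs an embedder that turns raw inputs into a fixed number of token embeddings the LLM can attend to, plus the matching loader and processor registration covered in the guide. A hypothetical sketch; the class name, modality, and dimensions below are illustrative and do not reflect the actual MultiMeditron interface:

```python
import torch
import torch.nn as nn

class ECGEmbedder(nn.Module):
    """Hypothetical embedder for a new 'ecg' modality (all names are illustrative)."""

    def __init__(self, in_channels: int = 12, llm_hidden_dim: int = 4096, num_tokens: int = 8):
        super().__init__()
        # Encode the raw signal, pool it to a fixed number of "modality tokens",
        # then project to the LLM hidden size so the tokens can stand in for the
        # attachment token's position in the input sequence.
        self.encoder = nn.Conv1d(in_channels, 256, kernel_size=25, stride=10)
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Linear(256, llm_hidden_dim)

    def forward(self, signal: torch.Tensor) -> torch.Tensor:
        # signal: (batch, channels, time)
        feats = self.pool(self.encoder(signal))   # (batch, 256, num_tokens)
        feats = feats.transpose(1, 2)             # (batch, num_tokens, 256)
        return self.proj(feats)                   # (batch, num_tokens, llm_hidden_dim)
```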

## The Data Pipeline

You can download preprocessed data via `download.sh`. This will generate ready-to-use data formatted to be compatible with our training code.

Alternatively, you can run `download_raw.sh` and `python preprocess.py` to download and preprocess the data yourself.
The preprocessing uses third-party LLM tools:
- it requires an OpenAI API key in an environment variable
- it is not fully deterministic: the data produced by `download.sh` and by `download_raw.sh && python preprocess.py` may differ significantly due to changes in third-party APIs
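If you run the preprocessing yourself, it helps to fail fast when the key is missing. A minimal sketch, assuming the key is read from `OPENAI_API_KEY`; the actual variable name is whatever `preprocess.py` expects:

```python
import os
import subprocess

# Assumption: preprocess.py reads the key from OPENAI_API_KEY; check the script for the real name.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("Set your OpenAI API key before running preprocessing.")

subprocess.run(["bash", "download_raw.sh"], check=True)
subprocess.run(["python", "preprocess.py"], check=True)
```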

## Extend the Paper

MultiMeditron mini pitch v2: it's designed to be reproducible, extensible, and modular.
You can, for example, write your own modality projectors.
Point to interesting code files.
Point to API docs.
etc.

## ⚖️ License

This project is licensed under the Apache 2.0 License; see the [LICENSE 🎓](LICENSE) file for details.

## 📖 Cite us

TODO