Aidand vlm full finetune #116

Open · wants to merge 13 commits into base: main
6 changes: 6 additions & 0 deletions learn/vlm-finetuning/.gitignore
@@ -0,0 +1,6 @@
donotcommit/
examples/
deepspeed_configs/
outputs/
!sample_data/images/
deploy_vars.sh
77 changes: 77 additions & 0 deletions learn/vlm-finetuning/2p5-7b.yaml
@@ -0,0 +1,77 @@
base_model: Qwen/Qwen2.5-VL-7B-Instruct
processor_type: AutoProcessor

# these 3 lines are needed for now to handle vision chat templates with images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

# Qwen 2.5 uses the same chat template as Qwen 2.0;
# the Qwen 2.5 template itself has known bugs with images:
# https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/discussions/11
chat_template: qwen2_vl
datasets:
  - path: sample_data/train.jsonl
    type: chat_template
    ds_type: json
    split: train
    field_messages: messages
dataset_prepared_path: last_run_prepared
val_set_size: 0.0
output_dir: ./outputs/out

adapter:
lora_model_dir:

sequence_len: 8192
pad_to_sequence_len: false

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 8
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

gradient_checkpointing: true
logging_steps: 1
flash_attention: true
eager_attention:

warmup_ratio: 0.1
weight_decay: 0.0

deepspeed: deepspeed_configs/zero1_torch_compile.json # multi-gpu only

# Save a checkpoint every 1000 steps
save_strategy: # Set to `"no"` to skip checkpoint saves, `"epoch"` at end of each epoch, `"best"` when better result is achieved, leave empty to infer from `save_steps`.
save_steps: 1000
evals_per_epoch:
saves_per_epoch:
# Save only the last 5 checkpoints
save_total_limit: 5

lora_r:
lora_alpha:
lora_dropout:
lora_target_modules:

bf16: true
fp16:
tf32: true

plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

max_steps: 10
130 changes: 130 additions & 0 deletions learn/vlm-finetuning/README.md
@@ -0,0 +1,130 @@
# Full Finetuning Qwen 2.5 7B and deploying on Fireworks

### Pre-requisites:
- [miniconda3](https://www.anaconda.com/docs/getting-started/miniconda/install)
  - Or any other way to create a Python 3.10 environment
- git
- One 8xH100 node (one 8xA100 node should work too)

### Setup

```bash
git clone -b aidand-vlm-full-finetune [email protected]:aidando73/cookbook.git
# Or if using https:
git clone -b aidand-vlm-full-finetune https://github.com/aidando73/cookbook.git

cd cookbook/learn/vlm-finetuning

# Create environment
conda create --name axolotl-env python=3.10 -y
conda activate axolotl-env

# Install dependencies
pip install uv
uv pip install -U packaging==23.2 setuptools==75.8.0 wheel==0.45.1 ninja==1.11.1.4 requests==2.32.3 "huggingface-hub[cli]==0.31.0" torch==2.5.1
uv pip install --no-build-isolation "axolotl[flash-attn,deepspeed]==0.9.2"
```
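
Optionally, run a quick sanity check to confirm PyTorch sees the GPUs before kicking off training (a minimal sketch; assumes the `axolotl-env` environment is still active):

```python
# Quick GPU/PyTorch sanity check (hypothetical helper, not part of the repo)
import torch

print(f"torch version: {torch.__version__}")            # expect 2.5.1
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")        # expect 8 on an 8xH100 node
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
```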

We'll be using `axolotl` to finetune the model. I've also tried finetuning with `trl`, but found that its liger-kernel support wasn't working and that it introduces a few breaking changes to config files.

To learn more about `axolotl`, see the [docs](https://docs.axolotl.ai/).

```bash
# Fetch deepspeed configs and examples
axolotl fetch deepspeed_configs
axolotl fetch examples
```

### Formatting your dataset

Your dataset should be a `.jsonl` file in a format similar to (but not exactly the same as) the OpenAI chat format, e.g.:

```json
{"messages": [{"role": "user", "content": [{"type": "text", "text": "What's in these two images?"}, {"type": "image", "base64": "data:image/jpeg;base64,..."}, {"type": "image", "path": "path/to/image/relative/to/where/command/is/being/executed.jpg"}]}, {"role": "assistant", "content": [{"type": "text", "text": "There are two images of a cat and a dog."}]}]}
{"messages": [{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]}, {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image", "url": "https://example.com/cat.jpg"}]}, {"role": "assistant", "content": [{"type": "text", "text": "There is a cat in the image."}]}]}
```

Reference the [axolotl multimodal docs](https://docs.axolotl.ai/docs/multimodal.html#dataset-format) for more details.
You can ask Claude/Cursor/ChatGPT to generate a script to format your dataset if you give it a few samples of your data.
It's recommended to avoid using "url" for images, as network conditions could cause your training run to fail.
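
If you'd rather write the conversion yourself, here's a minimal sketch. It assumes a hypothetical input file `raw_data.jsonl` with `image_path`, `question`, and `answer` fields per line; adapt the field names to your own data and point the `path` under `datasets` in [2p5-7b.yaml](2p5-7b.yaml) at the output file:

```python
import json

# Hypothetical input: one JSON object per line with image_path/question/answer fields.
with open("raw_data.jsonl") as fin, open("train.jsonl", "w") as fout:
    for line in fin:
        row = json.loads(line)
        example = {
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": row["question"]},
                        # Prefer local paths (or base64) over "url" so the run
                        # doesn't depend on network conditions.
                        {"type": "image", "path": row["image_path"]},
                    ],
                },
                {
                    "role": "assistant",
                    "content": [{"type": "text", "text": row["answer"]}],
                },
            ]
        }
        fout.write(json.dumps(example) + "\n")
```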

For this tutorial, we'll be using a sample synthetic dataset, [sample_data/train.jsonl](sample_data/train.jsonl). It contains 50 rows of food images (specified by path), with assistant responses that reason in `<think>...</think>` tags before classifying them. These responses were generated with Qwen 2.5 VL 32B Instruct, and the images were downloaded from https://huggingface.co/datasets/ethz/food101.
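
To eyeball what a row looks like, you can print the roles and content types of the first example (run from the `vlm-finetuning` directory):

```python
import json

with open("sample_data/train.jsonl") as f:
    row = json.loads(f.readline())

for message in row["messages"]:
    # Each content entry is a dict with a "type" of "text" or "image".
    print(message["role"], [part["type"] for part in message["content"]])
```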

#### Common issues:

- Messages with `{"content": "Regular text here"}` are not supported. Use `{"content": [{"type": "text", "text": "Regular text here"}]}` instead. Otherwise you will get an error like:
```
pyarrow.lib.ArrowInvalid: JSON parse error: Column(/messages/[]/content) changed from string to array in row 0
```
- If using relative image paths, they should be relative to the directory axolotl is called from. E.g., if an image lives at `image_dir/image.jpg` inside the `vlm-finetuning` directory but you call axolotl from its parent directory, the path in your dataset should be `vlm-finetuning/image_dir/image.jpg`. A quick pre-flight check for both issues is sketched below.
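
A minimal pre-flight check for both issues (a sketch; it assumes your dataset is `sample_data/train.jsonl` and that you run it from the same directory you'll launch `axolotl` from):

```python
import json
import os
import sys

ok = True
with open("sample_data/train.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        row = json.loads(line)
        for message in row["messages"]:
            content = message["content"]
            # "content" must be a list of {"type": ...} parts, not a plain string.
            if not isinstance(content, list):
                print(f"line {lineno}: content is {type(content).__name__}, expected a list")
                ok = False
                continue
            for part in content:
                # Relative image paths are resolved from the directory axolotl is called from.
                if part["type"] == "image" and "path" in part and not os.path.exists(part["path"]):
                    print(f"line {lineno}: missing image {part['path']}")
                    ok = False

sys.exit(0 if ok else 1)
```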

### Training

An already prepared axolotl config file is provided in [2p5-7b.yaml](2p5-7b.yaml).

To finetune, run:

```bash
axolotl train 2p5-7b.yaml
```

The final model will be saved in `outputs/out` and checkpoints will be saved in `outputs/out/checkpoint-<step>`.
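
If you'd rather deploy an intermediate checkpoint than the final model, you can list what was saved (a small sketch):

```python
from pathlib import Path

# Checkpoints are written to outputs/out/checkpoint-<step> during training.
for ckpt in sorted(Path("outputs/out").glob("checkpoint-*"),
                   key=lambda p: int(p.name.split("-")[1])):
    print(ckpt)
```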

### Deploying on Fireworks

**Pre-requisite:**
- Install [firectl](https://docs.fireworks.ai/tools-sdks/firectl/firectl)

### Deployment

1. Deployment requires a few variables to be set, so we'll keep them in a file called `deploy_vars.sh` and load them whenever we run commands. First, create the file:

```bash
cp deploy_vars.sh.example deploy_vars.sh
```

2. Next open `deploy_vars.sh` and add your account ID and API key.
3. Validate that `ACCOUNT_ID`, `FIREWORKS_API_KEY`, and `CHECKPOINT` are correct:

```bash
source deploy_vars.sh && firectl -a $ACCOUNT_ID list models --api-key $FIREWORKS_API_KEY # You should see either an empty list or all your current models
ls $CHECKPOINT/config.json $CHECKPOINT/model-*-of-*.safetensors # You should see config.json and *.safetensors files
```

4. Next we create the model:

```bash
# Axolotl checkpoints are missing a few config files for some reason,
# so download them from the base model (we exclude the weights)
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir $CHECKPOINT --exclude "*.safetensors" "model.safetensors.index.json"

# Load variables before running firectl commands
source deploy_vars.sh && firectl -a $ACCOUNT_ID create model $MODEL_NAME $CHECKPOINT --api-key $FIREWORKS_API_KEY
```

Next we create the deployment:

```bash
source deploy_vars.sh && firectl -a $ACCOUNT_ID create deployment accounts/$ACCOUNT_ID/models/$MODEL_NAME \
--accelerator-type="NVIDIA_H100_80GB" \
--min-replica-count 1 \
--accelerator-count 1 \
--api-key $FIREWORKS_API_KEY \
--deployment-id $MODEL_NAME # We set the deployment ID to the model name
```

Wait until the deployment is ready.

```bash
source deploy_vars.sh && watch -c "firectl -a $ACCOUNT_ID list deployments --order-by='create_time desc' --api-key $FIREWORKS_API_KEY"
```

Then you can test the deployment:

```bash
source deploy_vars.sh && python fw_req.py --model accounts/$ACCOUNT_ID/models/$MODEL_NAME#accounts/$ACCOUNT_ID/deployments/$MODEL_NAME --api-key $FIREWORKS_API_KEY
```
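
`fw_req.py` calls the REST endpoint directly with `requests`. If you'd rather hit the deployment from your own code, the Fireworks inference API is OpenAI-compatible, so a sketch like the one below should also work (assumes `pip install openai` and that you've run `source deploy_vars.sh` so `FIREWORKS_API_KEY`, `ACCOUNT_ID`, and `MODEL_NAME` are set):

```python
import base64
import os

from openai import OpenAI

# The Fireworks inference endpoint speaks the OpenAI chat-completions API.
client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],
    base_url="https://api.fireworks.ai/inference/v1",
)

account = os.environ["ACCOUNT_ID"]
model = os.environ["MODEL_NAME"]

# Same test image fw_req.py uses, sent as a base64 data URL.
with open("icecream.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    # Same model#deployment identifier passed to fw_req.py above.
    model=f"accounts/{account}/models/{model}#accounts/{account}/deployments/{model}",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What's in this image?"},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```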

### Thanks for trying Fireworks

Please email me at [email protected] if you have any questions/feedback. Or [drop something in my calendar](https://calendar.google.com/calendar/u/0/appointments/schedules/AcZssZ2iKVtCNOXAOLoYRcGh4ppHL_ztUU-osdlrAeR8dyvoZY2V-pMMMu_ozOjvTVeLg65Erkuu0UET).
16 changes: 16 additions & 0 deletions learn/vlm-finetuning/deploy_vars.sh.example
@@ -0,0 +1,16 @@
# Input required 👇
# You can get your API key at https://fireworks.ai/settings/users/api-keys
export FIREWORKS_API_KEY=
# (After you set your API key) You can get your account ID by running
# source deploy_vars.sh && firectl get account --api-key $FIREWORKS_API_KEY
export ACCOUNT_ID=





# This is the directory where the checkpoint is stored
# You can also point this at checkpoints like outputs/out/checkpoint-1000
export CHECKPOINT=outputs/out
# Model name
export MODEL_NAME=sft-qwen2p5-vl-7b-instruct
60 changes: 60 additions & 0 deletions learn/vlm-finetuning/fw_req.py
@@ -0,0 +1,60 @@
#!/usr/bin/env python3
import requests
import base64
import json
import os
import argparse
import sys

def main():
parser = argparse.ArgumentParser(description='Query Fireworks AI vision model with an image')
parser.add_argument('--model',
help='Model to use',
required=True)
parser.add_argument('--api-key',
help='Fireworks API key',
required=True)
args = parser.parse_args()

# Read and encode image
with open("icecream.png", 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')

payload = {
"model": args.model,
"messages": [
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {
"url": f"data:image/png;base64,{image_data}"
}},
{"type": "text", "text": "What's in this image?"},
]
}
]
}

try:
response = requests.post(
"https://api.fireworks.ai/inference/v1/chat/completions",
headers={
"Authorization": f"Bearer {args.api_key}",
"Content-Type": "application/json"
},
json=payload
)

response.raise_for_status()
result = response.json()
print(result['choices'][0]['message']['content'])
except requests.exceptions.RequestException as e:
print(f"Error making request: {e}", file=sys.stderr)
sys.exit(1)
except json.JSONDecodeError as e:
print(f"Error parsing response: {e}", file=sys.stderr)
print(f"Raw response: {response.text}", file=sys.stderr)
sys.exit(1)

if __name__ == "__main__":
main()
Binary file added learn/vlm-finetuning/sample_data/images/1.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/10.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/11.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/12.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/13.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/14.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/15.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/16.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/17.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/18.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/2.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/20.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/21.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/22.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/24.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/25.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/26.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/27.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/28.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/29.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/3.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/30.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/31.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/32.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/33.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/34.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/35.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/36.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/37.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/38.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/39.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/4.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/40.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/41.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/42.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/43.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/44.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/45.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/46.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/48.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/49.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/5.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/6.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/7.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/8.jpeg
Binary file added learn/vlm-finetuning/sample_data/images/9.jpeg