
SparseFusion

SparseFusion: Exploring Mixture-of-Experts Visual Language Models

🌟 Project Overview

SparseFusion is an open-source, "from-scratch" implementation of a Mixture-of-Experts (MoE) based Visual Language Model (VLM).

This repository builds on the "from-scratch" SeeMOE codebase by AviSoori1x. Its core focus is exploring and implementing advanced concepts in MoE systems and optimizing the VLM for practical experimentation. The goal is to deepen practical understanding of MoE mechanics, vision-language fusion, and efficient deep learning, especially when operating under resource constraints.

📦 Available on the Hugging Face Hub.

✨ Key Features & Implementations

This project delves into and implements crucial features found in modern MoE and VLM architectures, extending the foundational SeeMOE codebase.

Current Features (Original SeeMOE Foundation):

  • Modular Architecture: Clear separation of components (Router, Experts, Vision Encoder, Language Decoder, Multimodal Projector).
  • MoE Routing: Implements a NoisyTopkRouter for routing tokens to a subset of experts (a minimal sketch follows this list).
  • Vision-Language Fusion: Concatenates pooled image features with text embeddings.
  • Transformer Decoder: A decoder structure processing combined multimodal tokens.
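
For concreteness, here is a minimal sketch of noisy top-k routing as described above. The class name NoisyTopkRouter matches the feature name, but the layer names and shapes here are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopkRouter(nn.Module):
    """Noisy top-k gating: each token is routed to its top_k highest-scoring experts."""
    def __init__(self, n_embed: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.route_linear = nn.Linear(n_embed, num_experts)  # clean routing logits
        self.noise_linear = nn.Linear(n_embed, num_experts)  # learned noise scale

    def forward(self, x):
        # x: (batch, seq_len, n_embed)
        logits = self.route_linear(x)
        noise = torch.randn_like(logits) * F.softplus(self.noise_linear(x))
        noisy_logits = logits + noise
        # Keep only the top_k logits per token; mask the rest before the softmax
        # so every non-selected expert receives exactly zero routing weight.
        top_vals, top_idx = noisy_logits.topk(self.top_k, dim=-1)
        masked = torch.full_like(noisy_logits, float("-inf"))
        masked.scatter_(-1, top_idx, top_vals)
        router_probs = F.softmax(masked, dim=-1)
        return router_probs, top_idx
```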

🚀 Core Implementations & Learning Objectives:

These are the primary features being integrated; they represent key learning objectives and significant steps in understanding advanced MoE VLM design. Illustrative code sketches for each mechanism follow the list.

  • Expert Capacity Control:

    • What it adds: Mechanisms to limit the number of tokens an individual expert processes, preventing overload and managing token flow within the MoE layer.
    • Learning Focus: Understanding practical resource management and token dispatching in MoE architectures.
  • Auxiliary Load Balancing Loss:

    • What it adds: A training objective designed to encourage a more even distribution of tokens across all experts.
    • Learning Focus: Exploring methods for stabilizing MoE training and improving expert utilization, mitigating "dead expert" issues.
  • Subword Tokenization (e.g., BPE, SentencePiece):

    • What it adds: Transition from character-level tokenization to a more advanced subword scheme for the language model component.
    • Learning Focus: Implementing real-world text processing techniques crucial for building capable language models.
  • Pre-trained Vision Encoder Integration (e.g., CLIP-ViT, ViT-S/16):

    • What it adds: Integration of powerful, pre-trained Vision Transformers to leverage learned visual representations.
    • Learning Focus: Understanding how to effectively integrate pre-trained backbones into custom VLM architectures, optimizing for available compute resources.
  • Modality Type Embeddings:

    • What it adds: Introduction of distinct learned embeddings to differentiate between visual and text tokens.
    • Learning Focus: Exploring techniques for robust multimodal fusion and clear signal separation within the model.
  • Expert Dropout:

    • What it adds: Application of dropout regularization specifically to the outputs of experts.
    • Learning Focus: Understanding regularization strategies in MoE models to promote generalization and diverse expert specialization.
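
The sketches below show, under stated assumptions, one common way each of these mechanisms can be realized in PyTorch; all class, function, and argument names are illustrative rather than the project's actual API. First, expert capacity control: each expert accepts at most a fixed number of tokens per batch, and overflow tokens are simply not dispatched to that expert.

```python
import torch

def dispatch_with_capacity(x_flat, top_idx, num_experts, top_k=2, capacity_factor=1.25):
    """Cap how many tokens each expert may process in one forward pass.

    x_flat:  (num_tokens, n_embed) flattened token activations
    top_idx: (num_tokens, top_k) expert indices chosen by the router
    Returns one (token_positions, expert_inputs) pair per expert.
    """
    num_tokens = x_flat.size(0)
    capacity = int(capacity_factor * num_tokens * top_k / num_experts)
    assignments = []
    for e in range(num_experts):
        positions = (top_idx == e).any(dim=-1).nonzero(as_tuple=True)[0]
        positions = positions[:capacity]  # tokens beyond capacity are dropped for this expert
        assignments.append((positions, x_flat[positions]))
    return assignments
```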
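
Second, an auxiliary load-balancing loss in the Switch Transformer style: for each expert, the fraction of tokens dispatched to it is multiplied by the mean routing probability it receives, a product that is minimized when routing is uniform. It would be added to the language-modeling loss with a small coefficient (the exact weighting is a tuning choice).

```python
import torch

def load_balancing_loss(router_probs, top_idx, num_experts):
    """Switch-Transformer-style auxiliary loss (a sketch, not the project's exact form).

    router_probs: (num_tokens, num_experts) router softmax outputs
    top_idx:      (num_tokens, top_k) expert indices actually selected
    """
    # Fraction of routing slots dispatched to each expert.
    counts = torch.bincount(top_idx.reshape(-1), minlength=num_experts).float()
    token_fraction = counts / top_idx.numel()
    # Mean routing probability assigned to each expert.
    prob_fraction = router_probs.mean(dim=0)
    return num_experts * torch.sum(token_fraction * prob_fraction)
```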
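
Third, subword tokenization: instead of a character-level vocabulary, a pre-trained BPE or SentencePiece tokenizer can be loaded from the Hugging Face Hub. GPT-2's BPE tokenizer is used here purely as an example.

```python
from transformers import AutoTokenizer

# Any pre-trained BPE / SentencePiece tokenizer works; GPT-2's BPE is just an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token by default

batch = tokenizer(["a cat sitting on a mat"], return_tensors="pt",
                  padding=True, truncation=True, max_length=64)
print(batch["input_ids"].shape)  # (1, num_subword_tokens)
```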
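
Fourth, pre-trained vision encoder integration: this sketch loads a frozen ViT-S/16 backbone through timm (one of several reasonable choices) and uses its pooled output as the image representation.

```python
import timm
import torch

# num_classes=0 makes timm return pooled features instead of classification logits.
vision_encoder = timm.create_model("vit_small_patch16_224", pretrained=True, num_classes=0)
vision_encoder.eval()
for p in vision_encoder.parameters():
    p.requires_grad = False  # keep the pre-trained backbone frozen

images = torch.randn(2, 3, 224, 224)         # dummy batch of 224x224 RGB images
with torch.no_grad():
    image_features = vision_encoder(images)  # (2, 384) pooled features for ViT-S/16
```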
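
Fifth, modality type embeddings: a small learned embedding, indexed by modality id, is added to every token before fusion so the decoder can distinguish visual positions from text positions. The 0 = image / 1 = text convention below is an assumption.

```python
import torch
import torch.nn as nn

class ModalityTypeEmbedding(nn.Module):
    """Adds a learned per-modality vector: id 0 for image tokens, id 1 for text tokens."""
    def __init__(self, n_embed: int, num_modalities: int = 2):
        super().__init__()
        self.type_embed = nn.Embedding(num_modalities, n_embed)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, n_embed), text_tokens: (B, N_txt, n_embed)
        img_ids = torch.zeros(image_tokens.shape[:2], dtype=torch.long, device=image_tokens.device)
        txt_ids = torch.ones(text_tokens.shape[:2], dtype=torch.long, device=text_tokens.device)
        image_tokens = image_tokens + self.type_embed(img_ids)
        text_tokens = text_tokens + self.type_embed(txt_ids)
        return torch.cat([image_tokens, text_tokens], dim=1)  # fused multimodal sequence
```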
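
Finally, expert dropout amounts to placing a dropout layer on each expert's output before the router-weighted combination; the 0.1 rate below is illustrative.

```python
import torch.nn as nn

class Expert(nn.Module):
    """A feed-forward MoE expert with dropout applied to its output."""
    def __init__(self, n_embed: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),  # expert dropout: regularizes this expert's contribution
        )

    def forward(self, x):
        return self.net(x)
```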

🔬 Future Architectural Explorations:

These are more advanced architectural concepts planned for deeper exploration and implementation once the core system is stable.

  • Rotary Positional Embeddings (RoPE): Investigating the benefits of replacing absolute positional embeddings with RoPE for better generalization to longer sequences (a minimal sketch follows this list).
  • Multi-Token Image Support: Exploring how to represent images as sequences of visual tokens (e.g., patch-based features) for more granular visual understanding within the VLM.
  • Advanced Interpretability: Developing tools and visualizations to understand model behavior, including:
    • Expert Usage Maps: Heatmaps and statistics for token-to-expert routing and overall expert load.
    • Multimodal Attention Heatmaps: Visualizations of cross-modal attention patterns within the transformer decoder.
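
As a starting point for the RoPE item above, here is a minimal rotary-embedding sketch; the half-split rotation convention and shapes are assumptions, not the project's design.

```python
import torch

def apply_rope(x, base: float = 10000.0):
    """Rotate channel pairs of x (batch, seq_len, head_dim) by position-dependent angles,
    replacing additive absolute positional embeddings."""
    _, seq_len, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()  # (seq_len, half), broadcast over the batch dim
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```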

🔺 Scaling & Deployment Considerations:

These considerations acknowledge techniques for large-scale MoE deployments that are generally outside the scope of experimentation on free GPU resources but are important for a comprehensive understanding.

  • ViT Pre-training from Scratch: Recognizing the immense computational cost and data requirements involved in pre-training large vision backbones.
  • Expert Parallelism (e.g., DeepSpeed-MoE): Understanding distributed training techniques that shard experts across multiple GPUs for extremely large models.
  • Caching Token-Expert Assignments: Exploring optimization strategies for inference in distributed MoE systems.

🛠️ Technologies Used

  • Python
  • PyTorch: Core deep learning framework.
  • Hugging Face Transformers / Timm: For pre-trained models and tokenizers.
  • NumPy, Matplotlib: For data handling, visualization, and analysis.
  • MLflow: optional, for experiment tracking.

🚀 Getting Started

This section details how to set up the project and begin experimentation.

Prerequisites:

  • Python 3.x
  • pip (Python package manager)
  • Compute Environment: GPU acceleration is highly recommended (and practically required for training). Free cloud GPUs such as Google Colab or Kaggle Kernels work well given local hardware constraints.

Installation:

  1. Clone the repository:
    git clone https://github.com/DerrickKirimi/SparseFusion.git 
    cd SparseFusion
  2. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: .\venv\Scripts\activate
  3. Install dependencies:
    pip install -r requirements.txt

Usage:

  • Example command for training:
    python train.py --epochs 1 --batch_size 16 --model sparsefusion
    Parameters like batch size or number of experts may need adjustment based on the compute environment's VRAM.
  • Further detailed instructions on how to run inference and use specific features will be provided as the project develops.

🧪 Experiments & Analysis

This project serves as a platform for hands-on experimentation and analysis of MoE and VLM concepts. Key areas of focus include:

  • MoE Ablation Studies: Comparing different configurations of MoE (e.g., analyzing routing behavior, top-k selection, and the impact of load balancing vs. its absence).
  • Expert Load Visualization: Developing tools to interpret expert utilization patterns and token routing dynamics.
  • Multimodal Fusion Diagnostics: Tracing attention and token flow to gain insights into cross-modal interactions within the VLM.

📄 License

This project is shared for educational and research purposes. If you'd like to use or contribute to it, feel free to reach out via LinkedIn.

🙏 Acknowledgements

  • Inspired by the "from-scratch" SeeMOE implementation by AviSoori1x, which served as a foundational learning resource.
  • Driven by research ideas on Mixture-of-Experts, Visual Language Models, and sparse modeling from the open machine learning community.
  • Grateful for the free compute resources provided by Google Colab and Kaggle, which enable these experiments.
