Awesome-LLM-VLM-Foundation-Models 🚀⭐⭐⭐

An awesome curated list of LLMs, VLMs, and other foundation models.

| No. | Model | Year | Company | Size & Context Window | Best For / Strengths | Access & Cost |
|-----|-------|------|---------|------------------------|----------------------|---------------|
| 1 | GPT-4o (“Omni”) | 2024-05 | OpenAI | Multimodal / 128K tokens | Text+image+audio+voice; fast & free-tier use | Free tier in ChatGPT; API: $2.50/1M in, $10/1M out |
| 2 | GPT-4o mini | 2024-07 | OpenAI | ~8B params / 128K tokens | Cost-effective multimodal | ChatGPT replacement; API: $0.15/1M in, $0.60/1M out |
| 3 | o4-mini-high / o4-mini | 2025-04 | OpenAI | Compact reasoning / multimodal | STEM, coding, fast reasoning with vision | API: $1.10/1M in, $4.40/1M out |
| 4 | o3-mini-high / o3-mini | 2024 | OpenAI | Small reasoning models | Technical/scientific reasoning on a budget | API: same pricing as o3 & mini models |
| 5 | Llama 4 Maverick | 2025 | Meta AI | Mixture-of-Experts (128 experts), 400B params / 1M context | Coding, reasoning; GPT-4o-level | $0.19–$0.49/1M in & out tokens |
| 6 | Llama 4 Scout | 2025 | Meta AI | 109B params (fits one A100/H100) / 10M context | Generalist, long-context small model | Open weights |
| 7 | Llama 3.1 405B | 2024 | Meta AI | 405B params / 128K tokens | Research, long-context, coding | Open source |
| 8 | Claude 3.7 Sonnet | 2025-02 | Anthropic | ~175B params / 200K tokens | Extended reasoning & coding | API (paid via Anthropic) |
| 9 | Claude 4 | 2025-05 | Anthropic | ~200B params / 200K tokens | Advanced coding, creative writing, multimodal tasks | API (paid via Anthropic); AWS Bedrock, Vertex AI |
| 10 | Claude Opus 4.1 | 2025-08 | Anthropic | ~250B params / 200K tokens | Superior coding, creative writing, transparent reasoning | API (paid via Anthropic); $15/1M out |
| 11 | Gemini 2.5 Pro | 2025 | Google DM | Undisclosed; multimodal / 1M tokens | Advanced reasoning, multimodal | API (paid via Google) |
| 12 | Gemini 2.5 Pro Preview | 2025-03 | Google DM | Undisclosed; multimodal / 1M tokens | Advanced reasoning, coding, multimodal tasks | Google AI Studio (free experimental access); API: $1.25/1M in, $10/1M out |
| 13 | Gemini 2.5 Flash | 2025-04 | Google DM | Undisclosed; multimodal / 1M tokens | Fast, cost-effective multimodal tasks | API: $0.30/1M in, $2.50/1M out; Google AI Studio, Vertex AI |
| 14 | Gemini 2.5 Flash-Lite | 2025-07 | Google DM | Undisclosed; multimodal / 1M tokens | Cost-effective, low-latency tasks like classification, summarization | API: $0.10/1M in, $0.40/1M out; Google AI Studio, Vertex AI |
| 15 | Stable LM 2 12B | 2024-04 | Stability AI | 12B params | Open model with good benchmarks | Open source |
| 16 | Qwen 2.5-VL 32B | 2025-03 | Alibaba | 32B params; multimodal / 128K tokens | Vision+language tasks | Open source (Apache 2.0) |
| 17 | Mistral Small 3.1 | 2025-03 | Mistral AI | 24B params / 128K tokens | Image & doc understanding | Open source (Apache 2.0) |
| 18 | Gemma 3 (27B) | 2025-03 | Google DM | 27B params | Efficient single-GPU model | Open source |
| 19 | EmbeddingGemma | 2025-09 | Google DM | 308M params / 2K tokens; optimized for on-device embeddings | Text embeddings for semantic search, clustering | Open source |
| 20 | Fox-1 1.6B Instruct | 2024-11 | Fox-1 project | 1.6B params | Instruction-following small LLM, conversational | Open source (Apache 2.0) |
| 21 | Grok 3 | 2025-02 | xAI (Elon Musk) | Unknown (chat-focused) / 1M tokens | Conversational AI, Twitter/X integration | Proprietary (likely X Premium) |
| 22 | Grok 3 mini | 2025-02 | xAI (Elon Musk) | Small; reasoning-focused / 1M tokens | Cost-effective reasoning, coding, STEM tasks | Proprietary (likely X Premium) |
| 23 | Grok 4 | 2025-07 | xAI (Elon Musk) | ~2.4T params / 256K tokens | Advanced reasoning, coding (Grok 4 Code), multimodal, real-time data | SuperGrok $30/mo, Heavy $300/mo; API: $3/1M in, $15/1M out; X Premium+ access |
| 24 | Grok 4 Heavy | 2025-07 | xAI (Elon Musk) | Unknown | Advanced reasoning, coding, real-time data, high-compute tasks | SuperGrok Heavy $300/mo; API: $3/1M in, $15/1M out |
| 25 | DeepSeek R1 | 2025 | DeepSeek AI | Reasoning-focused / 128K tokens | Reasoning tasks, competitive with GPT-4.5 | Open weights |
| 26 | DeepSeek V3.1 | 2025 | DeepSeek AI | Undisclosed; reasoning-focused / 128K tokens | Advanced reasoning, coding, cost-efficiency | Open weights |
| 27 | Cerebras Qwen3-32B | 2025-05 | Cerebras | 32B params | High-speed reasoning | Open source (Apache 2.0) |
| 28 | Kimi K2 | 2025-07 | Moonshot AI | 1T params MoE (32B active) / 128K tokens | Agentic intelligence, coding, reasoning, tool use | Open source (Modified MIT); API: $0.15/1M in, $2.50/1M out |
| 29 | gpt-oss-20b | 2025-08 | OpenAI | 21B params (3.6B active) / 128K tokens | Reasoning, agentic tasks, local deployment, low latency | Open source (Apache 2.0); downloadable via Hugging Face, Ollama, GitHub |
| 30 | gpt-oss-120b | 2025-08 | OpenAI | 117B params (5.1B active) / 128K tokens | Deep reasoning, agentic tasks, enterprise-grade deployment | Open source (Apache 2.0); downloadable via Hugging Face, Ollama, GitHub |
| 31 | GPT-5 | 2025-08 | OpenAI | ~15T params (unconfirmed) / 400K tokens | Advanced reasoning, coding, multimodal, scientific tasks | API: $1.25/1M in, $10/1M out; ChatGPT Plus/Pro/Team, free-tier access |

Foundation Models Leaderboards (2025)

This is a curated list of new and up-to-date leaderboards for Large Language Models (LLMs), Vision-Language Models (VLMs), and multimodal models, published or updated in 2025. Each leaderboard provides performance metrics, rankings, and comparisons for state-of-the-art foundation models.

  1. LLM Leaderboard 2025 - llm-stats.com
    Comprehensive leaderboard for LLMs with performance metrics and benchmark data. Includes interactive analysis tools to compare models like GPT-4o, Llama, o1, Gemini, and Claude based on context window, speed, and price.

  2. Open LLM Leaderboard - Hugging Face
    Evaluates open-source LLMs using benchmarks like IFEval, BBH, and MATH. Features real-time filtering and analysis of models, with community voting and comprehensive results.

  3. LLM Leaderboard 2025 - Vellum
    Compares capabilities, price, and context window for leading commercial and open-source LLMs. Features 2025 benchmark data from model providers and independent evaluations, focusing on non-saturated benchmarks (excluding MMLU).

  4. LLM Leaderboard - Artificial Analysis
    Ranks over 100 LLMs across metrics like intelligence, price, performance, speed (tokens per second), and context window. Provides detailed comparisons for models from OpenAI, Google, DeepSeek, Alibaba Cloud and others.

  5. SEAL LLM Leaderboards
    Expert-driven, private evaluations of LLMs across domains like coding and instruction following. Uses curated datasets to prevent overfitting and ensure high-complexity evaluations.

  6. Open VLM Leaderboard - Hugging Face
    Ranks open-source VLMs using 23 multimodal benchmarks (e.g., MMBench_V11, MathVista). Evaluates models like GPT-4V, Gemini, QwenVLPlus, and LLaVA on image-text tasks.

  7. Zero-Shot Video Question Answering on Video-MME
    Tracks zero-shot video question-answering results for LLM-powered video conversational models on the Video-MME benchmark (related tracks cover datasets such as TGIF-QA).

Frameworks and Tools for LLMs, VLMs, and Foundation Models (2025)

This list highlights key frameworks, tools, and libraries for developing, deploying, and managing Large Language Models (LLMs), Vision-Language Models (VLMs), and foundation models.

🛠 Application Development & Prompt Engineering Frameworks

  1. LangChain
    A versatile framework for building LLM-powered applications. It simplifies prompt chaining, memory management, and integration with external data sources like vector databases and APIs. Used for chatbots, RAG systems, and agent-based workflows (see the sketch after this list).

  2. LlamaIndex
    A data framework designed for connecting LLMs with custom data sources. It excels in data ingestion, indexing, and retrieval for RAG applications, enabling semantic search and context-aware querying. Ideal for document analysis and knowledge base systems.

  3. DSPy
    A framework for programming foundation models by defining tasks rather than crafting prompts. It optimizes pipelines for LLMs using modular components, improving performance in tasks like reasoning and text generation. Suited for developers seeking maintainable codebases.

  4. Semantic Kernel
    A Microsoft-developed SDK for integrating LLMs into applications. It supports orchestration of AI tasks, memory management, and plugins for connecting to external tools. Used for building scalable AI agents in Python, C#, and Java.

  5. AutoGen
    A Python-based framework for creating multi-agent LLM systems. It enables agents to collaborate on tasks like data retrieval and code execution, enhancing complex workflows. Used for building autonomous AI agents and research.
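
A minimal LangChain sketch illustrating item 1 above: a prompt piped into a chat model and a string parser via the LCEL pipe syntax. It assumes the `langchain-openai` package is installed and an `OPENAI_API_KEY` is set; the prompt text and model name are illustrative, not prescriptive.

```python
# Minimal LCEL chain: prompt -> chat model -> string output.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

print(chain.invoke({"text": "LangChain wires prompts, models, and parsers into pipelines."}))
```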


🔍 Retrieval-Augmented Generation (RAG) & Semantic Search

  1. Haystack
    An open-source framework for building LLM-powered search and RAG applications. It supports semantic search, document retrieval, and question answering, with integrations for Hugging Face, OpenAI, and vector stores like Pinecone. Used for enterprise search systems.

  2. Chroma
    An open-source embedding database optimized for managing and searching vector embeddings. Commonly used for semantic search and RAG pipelines with LangChain or LlamaIndex (see the sketch after this list).

  3. Jina
    A scalable cloud-native framework for multimodal search and neural semantic retrieval. Supports building RAG pipelines with images, text, and more.

  4. Qdrant
    An open-source vector search engine for storing and querying embeddings at scale. Built for semantic search, recommendation engines, and RAG applications.
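
A minimal Chroma sketch illustrating item 2 above: an in-memory collection that embeds documents with Chroma's default embedding function and answers a semantic query. The document texts and IDs are illustrative; a production RAG pipeline would typically plug in its own embedder and a persistent client.

```python
# In-memory vector store: add two documents, then run a semantic query.
import chromadb

client = chromadb.Client()  # use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection(name="docs")
collection.add(
    ids=["1", "2"],
    documents=[
        "Chroma is an embedding database for RAG pipelines.",
        "Qdrant is a vector search engine for embeddings at scale.",
    ],
)
print(collection.query(query_texts=["vector database for RAG"], n_results=1))
```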


🚀 Model Serving & Deployment

  1. Ollama
    A lightweight framework for running LLMs locally. It provides a simple API and supports models like Llama 3 and Gemma, enabling developers to build and test AI applications on personal hardware. Perfect for local AI development and prototyping.

  2. OpenLLM
    Run any open-source LLM (Llama 3.3, Qwen2.5, Phi3, and more) or custom model as an OpenAI-compatible API with a single command.

  3. vLLM
    An open-source library designed to serve LLMs efficiently and at scale, especially for inference. Uses PagedAttention to optimize memory usage, batching, and throughput (see the sketch after this list).

  4. Text Generation Inference (TGI)
    Hugging Face’s optimized inference server for deploying large Transformer models with low latency and high throughput.

  5. FastChat
    A powerful open-source framework to serve and chat with LLMs interactively. Includes a web UI, REST API, and support for various model families like Vicuna and LLaMA.
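
A minimal vLLM sketch illustrating item 3 above: offline batch generation, with PagedAttention managing KV-cache memory under the hood. It assumes `pip install vllm` and a GPU with enough memory for the chosen checkpoint; the Hugging Face model ID is illustrative.

```python
# Offline batch inference with vLLM.
from vllm import LLM, SamplingParams

prompts = ["Explain PagedAttention in one sentence."]
params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported causal LM
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```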


⚙️ ML Workflow Automation & Management

  1. MLflow
    An open-source platform for managing the machine learning lifecycle, including LLMs and VLMs. It supports experiment tracking, model versioning, and deployment, with integrations for LangChain, LlamaIndex, and DSPy. Ideal for reproducible AI workflows (see the sketch after this list).

  2. n8n
    An open-source, low-code workflow automation platform. It integrates LLMs with external tools and APIs to automate tasks like data processing or chatbot responses. Used for building scalable AI-driven workflows with minimal coding.

  3. Flowise
    An open-source, low-code platform for building LLM applications. It features a drag-and-drop interface and integrates with LangChain and LlamaIndex, making it accessible for non-coders to create chatbots and RAG systems.
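
A minimal MLflow tracking sketch illustrating item 1 above: logging the parameters and loss curve of a hypothetical LoRA fine-tuning run. The run name, parameter names, and loss values are illustrative placeholders.

```python
# Log params and metrics for one experiment run; inspect with `mlflow ui`.
import mlflow

with mlflow.start_run(run_name="lora-finetune-demo"):
    mlflow.log_param("base_model", "meta-llama/Llama-3.1-8B")
    mlflow.log_param("lora_rank", 8)
    for step, loss in enumerate([2.1, 1.7, 1.4]):  # placeholder loss curve
        mlflow.log_metric("train_loss", loss, step=step)
```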


🧑‍🔧 Fine-Tuning & Training Optimization

  1. Hugging Face Transformers
    A comprehensive library for training, fine-tuning, and deploying LLMs and VLMs. It supports models like BERT, GPT, and CLIP, with tools for NLP, computer vision, and multimodal tasks. Used for research and production-grade AI applications.

  2. PEFT (Parameter-Efficient Fine-Tuning)
    A library for efficient fine-tuning of large models using techniques like LoRA, prompt tuning, and adapters. Ideal for customizing LLMs on limited hardware (see the sketch after this list).

  3. bitsandbytes
    A lightweight CUDA extension for quantization and low-bit inference/training of LLMs. Enables memory-efficient training of large models.

  4. LMFlow
    A framework for easy and fast fine-tuning, instruction tuning, and deployment of LLMs. Includes support for model compression and evaluation.
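
A minimal PEFT sketch illustrating item 2 above: wrapping a causal LM with a LoRA adapter so that only the low-rank adapter weights are trained. The base model ID and the `target_modules` (typical for Llama-style attention projections) are assumptions to adapt per model.

```python
# Attach a LoRA adapter; only adapter parameters remain trainable.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
lora = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Llama-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of total parameters
```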


✅ Evaluation, Testing & Benchmarking

  1. DeepEval
    A testing framework for evaluating LLM applications. It offers over 14 research-backed metrics to assess RAG pipelines and safety risks, integrating with frameworks like LangChain and LlamaIndex. Used for quality assurance in AI development (see the sketch after this list).

  2. PromptTools
    A Python library for debugging, comparing, and evaluating LLM prompts with visualizations and logging support.

  3. AlpacaEval
    A community-driven evaluation toolkit for benchmarking LLMs' instruction-following ability using standardized prompts.

  4. OpenCompass
    A comprehensive open-source framework for large-scale benchmarking of LLMs and VLMs using curated datasets and metrics.

  5. VLMEvalKit 🖼️
    An open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks (supporting 220+ LMMs and 80+ benchmarks).
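
A minimal DeepEval sketch illustrating item 1 above: scoring a single input/output pair with the answer-relevancy metric. It assumes `pip install deepeval` plus a configured judge LLM (by default via `OPENAI_API_KEY`); the test strings are illustrative.

```python
# Score one LLM response for answer relevancy.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(
    input="What does vLLM optimize?",
    actual_output="vLLM uses PagedAttention to optimize memory use and throughput.",
)
metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```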


📊 Interactive UI & Demos

  1. Gradio
    An intuitive Python library for creating interactive web interfaces for ML models. Popular for prototyping and demonstrating LLM/VLM applications (see the sketch after this list).

  2. Open WebUI
    An open-source web interface for interacting with local and hosted LLMs. Supports multiple backends and provides a sleek, extensible UI.
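
A minimal Gradio sketch illustrating item 1 above: a text-in/text-out demo where the `echo` stub stands in for a real model call.

```python
# One-function web demo; replace `echo` with an actual LLM call.
import gradio as gr

def echo(message: str) -> str:
    return f"You said: {message}"

gr.Interface(fn=echo, inputs="text", outputs="text", title="LLM demo stub").launch()
```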


🎨 Multimodal & Vision-Language Models

  1. OpenMMLab Multimodal-GPT
    Built on the open-source multimodal model OpenFlamingo; creates visual instruction data from open datasets covering VQA, image captioning, visual reasoning, text OCR, and visual dialogue.

  2. OpenMMLab MMagic
    Multimodal Advanced, Generative, and Intelligent Creation (MMagic), OpenMMLab's toolbox for generative image and video editing.


🧠 Interpretability & Analysis

  1. Transformer Lens
    A library for visualizing and interpreting transformer internals. Helps researchers understand model behavior neuron-by-neuron.

🖼️ Vision‑Language Model (VLM) Zoo

  1. GLM-4.1V-9B-Thinking, 🤗 HF - MIT License 🚀
    Open-source VLM from THUDM, excelling in multimodal reasoning with support for a 64K context, 4K image processing, and bilingual (English/Chinese) use. It outperforms many similarly sized models and rivals larger ones such as Qwen2.5-VL-72B on 18 of 28 benchmarks, including STEM and long-document understanding.

  2. Qwen 2.5 VL (7B / 72B)
    Multimodal VLM from Alibaba with dynamic resolution, video input, object localization and support for ~29 languages. Top open‑source performer in OCR and agentic workflows.

  3. Gemma 3 (4B–27B)
    Google’s open multimodal model with a SigLIP image encoder; excels in multilingual captioning and VQA, with strong 128K-context performance.

  4. PaliGemma
    Compact Gemma-2B-based VLM combining a SigLIP visual encoder with strong captioning, segmentation, and VQA transferability.

  5. Llama 3.2 Vision (11B/90B)
    Vision‑adapted Llama model with excellent OCR, document understanding, VQA, and 128k token context.

  6. Phi‑4 Multimodal
    Microsoft’s VLM supporting vision‑language tasks with MIT license and edge‑friendly capabilities.

  7. DeepSeek‑VL
    Open‑source VLM optimized for scientific reasoning and compact deployment.

  8. CogVLM
    Strong performer in VQA and vision-centric tasks.

  9. BakLLaVA
    An LMM from the LAION / Ontocord / Skunkworks OSS AI group, combining a Mistral 7B base with the LLaVA architecture for efficient VQA pipelines; see the inference sketch below.
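
A minimal VLM inference sketch for models in this zoo, using the Hugging Face `image-to-text` pipeline with a LLaVA-family checkpoint. The model ID, image URL, and LLaVA-style prompt format are assumptions; other VLMs expect their own chat templates.

```python
# Image + question -> text answer via a LLaVA-style checkpoint.
from transformers import pipeline

pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
result = pipe(
    "https://example.com/cat.jpg",  # placeholder image URL
    prompt="USER: <image>\nWhat is shown in this image? ASSISTANT:",
    generate_kwargs={"max_new_tokens": 64},
)
print(result[0]["generated_text"])
```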

📄 OCR Model Zoo

  1. OCRFlux
    OCRFlux is a multimodal, LLM-based toolkit for converting PDFs and images into clean, readable Markdown. Its 3B-parameter model runs on a single NVIDIA RTX 3090 GPU, making it accessible for local deployment.

  2. Llama-3.1-Nemotron-Nano-VL-8B-V1, 🤗 HF
    Llama-3.1-Nemotron-Nano-VL-8B-V1 (by NVIDIA) is a leading document-intelligence vision-language model (VLM) for querying and summarizing images and video from the physical or virtual world.

  3. Qwen 2.5 VL (32B / 72B)
    State-of-the-art open OCR performance (~75% accuracy), outperforming even Mistral-OCR; excels at document, video, and multilingual text extraction.

  4. Mistral‑OCR
    Purpose‑trained OCR variant of Mistral, delivering ~72.2% accuracy on structured document benchmarks.

  5. Llama 3.2 Vision (11B / 90B)
    Strong OCR and document understanding capabilities, part of the top open VLMs.

  6. Gemma 3 27B
    Offers competitive OCR performance through its vision‑language architecture.

  7. DeepSeek-V3-0324
    Lightweight, open‑source OCR-ready VLM evaluated in 2025 benchmarks.

  8. TextHawk 2
    Bilingual OCR and grounding VLM with state-of-the-art results across OCRBench, DocVQA, and ChartQA, using 16× fewer tokens.

  9. VISTA‑OCR
    New lightweight generative OCR model unifying detection and recognition with only 150M params; interactive and high‑accuracy.

  10. PP‑DocBee
    Multimodal document understanding model with superior performance on English/Chinese benchmarks.

📄 Medical LLMs, VLMs and MLLMs (multimodal)

| Model | Year | Company/Org | Used for |
|-------|------|-------------|----------|
| MedGemma | 2025 | Google DeepMind | OCR, image captioning, general vision NLP (4B–27B parameters) |
| MedSigLIP | 2025 | Google DeepMind | Scalable multimodal medical reasoning |
| Med-Gemini | 2024 | Google DeepMind | Multimodal medical applications |
| LLaVA-Med | 2024 | Microsoft | Large Language-and-Vision Assistant for Biomedicine, built toward multimodal GPT-4-level capabilities |
| CONCH | 2024 | Mahmood Lab + Harvard Medical School | Vision-language pathology foundation model (Nature Medicine) |
| BioMistral-7B | 2024 | CNRS + Mistral | Medical-domain LLM fine-tuned on PubMed (7B parameters) |
| BioMedLM 2.7B | 2024 | Stanford CRFM + MosaicML | Medical-domain LLM trained exclusively on biomedical abstracts and papers from The Pile |
| Med-PaLM M | 2022–2023 | Google Research | Multimodal medical Q&A with image and text input |

Papers 📑 ⭐

| No. | Title | Authors | Venue | Year |
|-----|-------|---------|-------|------|
| 1 | Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models | Yukang Yang, et al. | arXiv:2502.20332 | 2025 |
| 2 | Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models | James Chua, et al. | arXiv:2506.13206 | 2025 |
| 3 | Emergent Response Planning in Large Language Models | Zhichen Dong, et al. | arXiv:2502.06258 | 2025 |
| 4 | Emergent Abilities in Large Language Models: A Survey | Leonardo Berti, et al. | arXiv:2503.05788 | 2025 |
| 5 | LIMO: Less Is More for Reasoning | Yixin Ye, et al. | arXiv:2502.03387 | 2025 |
| 6 | An Introduction to Vision-Language Modeling | Florian Bordes, et al. | arXiv:2405.17247 | 2024 |
| 7 | What Matters When Building Vision-Language Models? | Hugo Laurençon, et al. | arXiv:2405.02246 | 2024 |
| 8 | Building and better understanding vision-language models: insights and future directions | Hugo Laurençon, et al. | arXiv:2408.12637 | 2024 |
| 9 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Zhiyu Wu, et al. | arXiv:2412.10302 | 2024 |
| 10 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Peng Wang, et al. | arXiv:2409.12191 | 2024 |
| 11 | PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-Language Adapter | Junfei Xiao, et al. | arXiv:2402.10896 | 2024 |
| 12 | Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving | Akshay Gopalkrishnan, et al. | arXiv:2403.19838 | 2024 |

Conferences & Papers 🥇 📑 ⭐

  1. CVPR - IEEE / CVF Computer Vision and Pattern Recognition Conference

  2. NeurIPS - Conference on Neural Information Processing Systems

  3. ICLR - International Conference on Learning Representations

  4. ACL - Association for Computational Linguistics

If you need support with your AI project, or if you're simply an AI and new-technology enthusiast, don't hesitate to connect with me on LinkedIn 👍
