Awesome curated list of LLM, VLM and other Foundation Models
No. | Model | Year | Company | Size & Context Window | Best For / Strengths | Access & Cost |
---|---|---|---|---|---|---|
1 | GPT-4o (“Omni”) | 2024-05 | OpenAI | Multimodal / 128K tokens | Text+image+audio+voice; fast & free-tier use | Free-tier in ChatGPT; API: $2.50/1M in, $10/1M out |
2 | GPT-4o mini | 2024-07 | OpenAI | Undisclosed (est. ~8 B params) / 128K tokens | Cost-effective multimodal | Replaced GPT-3.5 Turbo in ChatGPT; API: $0.15/1M in, $0.60/1M out |
3 | o4-mini-high / o4-mini | 2025-04 | OpenAI | Compact reasoning, multimodal / 200K tokens | STEM, coding, fast reasoning with vision | API: $1.10/1M in, $4.40/1M out |
4 | o3-mini-high / o3-mini | 2025-01 | OpenAI | Small reasoning models | Technical/scientific reasoning on a budget | API: $1.10/1M in, $4.40/1M out |
5 | Llama 4 Maverick | 2025 | Meta AI | MoE (128 experts), 400B total / 17B active params, 1M context | Coding, reasoning; GPT-4o-level | $0.19-$0.49/1M in & out tokens |
6 | Llama 4 Scout | 2025 | Meta AI | MoE (16 experts), 109B total / 17B active params, 10M context; fits on a single A100/H100 | Generalist, long-context small model | Open weights |
7 | Llama 3.1 405B | 2024 | Meta AI | 405 B params / 128K tokens | Research, long-context, coding | Open source |
8 | Claude 3.7 Sonnet | 2025-02 | Anthropic | Undisclosed / 200K tokens | Extended reasoning & coding | API (paid via Anthropic) |
9 | Claude 4 | 2025-05 | Anthropic | Undisclosed / 200K tokens | Advanced coding, creative writing, multimodal tasks | API (paid via Anthropic); AWS Bedrock, Vertex AI |
10 | Claude 4.1 Opus | 2025-08 | Anthropic | Undisclosed / 200K tokens | Superior coding, creative writing, transparent reasoning | API (paid via Anthropic): $15/1M in, $75/1M out |
11 | Gemini 2.5 Pro | 2025-06 | Google DM | Undisclosed; multimodal / 1M tokens | Advanced reasoning, multimodal | API (paid via Google) |
12 | Gemini 2.5 Pro Preview | 2025-03 | Google DM | Undisclosed; multimodal / 1M tokens | Advanced reasoning, coding, multimodal tasks | Google AI Studio (free experimental access); API: $1.25/1M in, $10/1M out |
13 | Gemini 2.5 Flash | 2025-04 | Google DM | Undisclosed; multimodal / 1M tokens | Fast, cost-effective multimodal tasks | API: $0.3/1M tokens in, $2.5/1M out; Google AI Studio, Vertex AI |
14 | Gemini 2.5 Flash-Lite | 2025-07 | Google DM | Undisclosed; multimodal / 1M tokens | Cost-effective, low-latency tasks like classification, summarization | API: $0.10/1M in, $0.40/1M out; Google AI Studio, Vertex AI |
15 | Stable LM 2 12B | 2024-04 | Stability AI | 12 B params | Open model with good benchmarks | Open source |
16 | Qwen 2.5-VL 32B | 2025-03 | Alibaba | 32 B params; multimodal / 128K tokens | Vision+language tasks | Open source (Apache 2.0) |
17 | Mistral Small 3.1 | 2025-03 | Mistral AI | 24 B params; 128K tokens | Image & doc understanding | Open source (Apache 2.0) |
18 | Gemma 3 (27B) | 2025-03 | Google DM | 27 B params | One-GPU efficient model | Open source |
19 | EmbeddingGemma | 2025-09 | Google DM | Small, optimized for embeddings, on-device use cases / 308M params / 2K tokens | Text embeddings for semantic search, clustering | Open source |
20 | Fox-1 1.6B Instruct | 2024-11 | Fox-1 project | 1.6 B params | Instruction-following small LLM, conversational | Open source (Apache 2.0) |
21 | Grok 3 | 2025-02 | xAI (Elon Musk) | Unknown (Chat-focused) / 1M tokens | Conversational AI, Twitter/X integration | Proprietary (likely X Premium) |
22 | Grok-3 mini | 2025-02 | xAI (Elon Musk) | Small; reasoning-focused / 1M tokens | Cost-effective reasoning, coding, STEM tasks | Proprietary (likely X Premium) |
23 | Grok 4 | 2025-07 | xAI (Elon Musk) | Undisclosed / 256K tokens | Advanced reasoning, coding (Grok 4 Code), multimodal, real-time data | SuperGrok $30/mo, Heavy $300/mo; API: $3/1M in, $15/1M out; X Premium+ access |
24 | Grok-4 Heavy | 2025-07 | xAI (Elon Musk) | Unknown | Advanced reasoning, coding, real-time data, high-compute tasks | SuperGrok Heavy $300/mo; API: $3/1M in, $15/1M out |
25 | DeepSeek R1 | 2025-01 | DeepSeek AI | MoE, 671B total / 37B active; reasoning-focused / 128K tokens | Reasoning tasks, competitive with OpenAI o1 | Open weights |
26 | DeepSeek V3.1 | 2025-08 | DeepSeek AI | Undisclosed; reasoning-focused / 128K tokens | Advanced reasoning, coding, cost-efficiency | Open weights |
27 | Cerebras Qwen3-32B | 2025-05 | Alibaba (Qwen) / Cerebras | 32 B params | High-speed reasoning (Qwen3-32B served on Cerebras hardware) | Open source (Apache 2.0) |
28 | Kimi K2 | 2025-07 | Moonshot AI | 1T params (32B active) / 128K tokens | Mixture-of-experts (MoE). Agentic intelligence, coding, reasoning, tool use | Open source (Modified MIT); API: $0.15/1M in, $2.50/1M out |
29 | gpt-oss-20b | 2025-08 | OpenAI | 21B params (3.6B active) / 128K tokens | Reasoning, agentic tasks, local deployment, low latency | Open source (Apache 2.0), downloadable via Hugging Face, Ollama, GitHub |
30 | gpt-oss-120b | 2025-08 | OpenAI | 117B params (5.1B active) / 128K tokens | Deep reasoning, agentic tasks, enterprise-grade deployment | Open source (Apache 2.0), downloadable via Hugging Face, Ollama, GitHub |
31 | GPT-5 | 2025-08 | OpenAI | Undisclosed / 400K tokens | Advanced reasoning, coding, multimodal, scientific tasks | API: $1.25/1M in, $10/1M out; ChatGPT Plus/Pro/Team, Free-tier access |
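Most of the hosted models in this table are reachable through a provider SDK or an OpenAI-compatible endpoint. Below is a minimal sketch of a chat-completion call with the OpenAI Python SDK; it assumes the openai package and an OPENAI_API_KEY environment variable, the model id is just one example from the table, and pricing/availability can change.

```python
# Minimal chat-completion sketch using the OpenAI Python SDK
# (assumes the openai package and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # one example from the table; swap in another model id
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what a context window is in one sentence."},
    ],
    max_tokens=100,
)
print(response.choices[0].message.content)
```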
This is a curated list of new and up-to-date leaderboards for Large Language Models (LLMs), Vision-Language Models (VLMs), and multimodal models, published or updated in 2025. Each leaderboard provides performance metrics, rankings, and comparisons for state-of-the-art foundation models.
- LLM Leaderboard 2025 - llm-stats.com: Comprehensive leaderboard for LLMs with performance metrics and benchmark data. Includes interactive analysis tools to compare models like GPT-4o, Llama, o1, Gemini, and Claude based on context window, speed, and price.
- Open LLM Leaderboard - Hugging Face: Evaluates open-source LLMs using benchmarks like IFEval, BBH, and MATH. Features real-time filtering and analysis of models, with community voting and comprehensive results.
- LLM Leaderboard 2025 - Vellum: Compares capabilities, price, and context window for leading commercial and open-source LLMs. Features 2025 benchmark data from model providers and independent evaluations, focusing on non-saturated benchmarks (excluding MMLU).
- LLM Leaderboard - Artificial Analysis: Ranks over 100 LLMs across metrics like intelligence, price, performance, speed (tokens per second), and context window. Provides detailed comparisons for models from OpenAI, Google, DeepSeek, Alibaba Cloud, and others.
- SEAL LLM Leaderboards: Expert-driven, private evaluations of LLMs across domains like coding and instruction following. Uses curated datasets to prevent overfitting and ensure high-complexity evaluations.
- Open VLM Leaderboard - Hugging Face: Ranks open-source VLMs using 23 multimodal benchmarks (e.g., MMBench_V11, MathVista). Evaluates models like GPT-4V, Gemini, QwenVLPlus, and LLaVA on image-text tasks.
- Zero-Shot Video Question Answer on Video-MME: Presents zero-shot question-answering results on the TGIF-QA dataset for LLM-powered video conversational models.
This list highlights key frameworks, tools, and libraries for developing, deploying, and managing Large Language Models (LLMs), Vision-Language Models (VLMs), and foundation models.
- LangChain: A versatile framework for building LLM-powered applications. It simplifies prompt chaining, memory management, and integration with external data sources like vector databases and APIs. Used for chatbots, RAG systems, and agent-based workflows (see the sketch after this list).
- LlamaIndex: A data framework designed for connecting LLMs with custom data sources. It excels in data ingestion, indexing, and retrieval for RAG applications, enabling semantic search and context-aware querying. Ideal for document analysis and knowledge base systems.
- DSPy: A framework for programming foundation models by defining tasks rather than crafting prompts. It optimizes pipelines for LLMs using modular components, improving performance in tasks like reasoning and text generation. Suited for developers seeking maintainable codebases.
- Semantic Kernel: A Microsoft-developed SDK for integrating LLMs into applications. It supports orchestration of AI tasks, memory management, and plugins for connecting to external tools. Used for building scalable AI agents in Python, C#, and Java.
- AutoGen: A Python-based framework for creating multi-agent LLM systems. It enables agents to collaborate on tasks like data retrieval and code execution, enhancing complex workflows. Used for building autonomous AI agents and research.
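For a feel of how these orchestration frameworks are used, here is a minimal sketch of prompt chaining with LangChain's expression language (LCEL). It assumes the langchain-core and langchain-openai packages plus an OPENAI_API_KEY; exact imports can shift between LangChain releases.

```python
# Minimal LangChain prompt-chaining sketch (assumes langchain-core,
# langchain-openai, and an OPENAI_API_KEY in the environment).
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Prompt template with a single input variable.
prompt = ChatPromptTemplate.from_template(
    "Summarize the following text in one sentence:\n\n{text}"
)

# Chat model; the model name is an illustrative choice.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# LCEL pipeline: prompt -> model -> plain-string output.
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"text": "LangChain simplifies building LLM applications."}))
```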
- Haystack: An open-source framework for building LLM-powered search and RAG applications. It supports semantic search, document retrieval, and question answering, with integrations for Hugging Face, OpenAI, and vector stores like Pinecone. Used for enterprise search systems.
- Chroma: An open-source embedding database optimized for managing and searching vector embeddings. Commonly used for semantic search and RAG pipelines with LangChain or LlamaIndex (see the sketch after this list).
- Jina: A scalable cloud-native framework for multimodal search and neural semantic retrieval. Supports building RAG pipelines with images, text, and more.
- Qdrant: An open-source vector search engine for storing and querying embeddings at scale. Built for semantic search, recommendation engines, and RAG applications.
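A minimal sketch of the vector-store pattern these tools implement, using Chroma's in-memory client; it assumes the chromadb package, and the default embedding function downloads a small local model on first use.

```python
# Minimal Chroma semantic-search sketch (assumes the chromadb package).
import chromadb

client = chromadb.Client()  # in-memory client; use PersistentClient for disk storage
collection = client.create_collection(name="docs")

# Add a few documents; Chroma embeds them with its default embedding function.
collection.add(
    documents=[
        "Qdrant is a vector search engine.",
        "Haystack builds RAG pipelines.",
        "Chroma stores and queries embeddings.",
    ],
    ids=["doc1", "doc2", "doc3"],
)

# Query by text; returns the closest documents by embedding similarity.
results = collection.query(query_texts=["Which tool stores embeddings?"], n_results=2)
print(results["documents"])
```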
- Ollama: A lightweight framework for running LLMs locally. It provides a simple API and supports models like Llama 3 and Gemma, enabling developers to build and test AI applications on personal hardware. Perfect for local AI development and prototyping (see the sketch after this list).
- OpenLLM: Runs any open-source LLM (Llama 3.3, Qwen2.5, Phi-3, and more) or custom models as OpenAI-compatible APIs with a single command.
- vLLM: An open-source library designed to serve LLMs efficiently and at scale, especially for inference. Uses PagedAttention to optimize memory usage, batching, and throughput.
- Text Generation Inference (TGI): Hugging Face's optimized inference server for deploying large Transformer models with low latency and high throughput.
- FastChat: A powerful open-source framework to serve and chat with LLMs interactively. Includes a web UI, REST API, and support for various model families like Vicuna and LLaMA.
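To illustrate local serving, here is a minimal sketch that queries an Ollama server over its REST API; it assumes Ollama is running locally and a model tag such as llama3 has already been pulled.

```python
# Minimal sketch of calling a locally running Ollama server over its REST API
# (assumes `ollama serve` is running and a model such as `llama3` has been pulled).
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",                # any locally pulled model tag
        "prompt": "Explain RAG in one sentence.",
        "stream": False,                  # return one JSON object instead of a stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```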
- MLflow: An open-source platform for managing the machine learning lifecycle, including LLMs and VLMs. It supports experiment tracking, model versioning, and deployment, with integrations for LangChain, LlamaIndex, and DSPy. Ideal for reproducible AI workflows (see the sketch after this list).
- n8n: An open-source, low-code workflow automation platform. It integrates LLMs with external tools and APIs to automate tasks like data processing or chatbot responses. Used for building scalable AI-driven workflows with minimal coding.
- Flowise: An open-source, low-code platform for building LLM applications. It features a drag-and-drop interface and integrates with LangChain and LlamaIndex, making it accessible for non-coders to create chatbots and RAG systems.
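A minimal experiment-tracking sketch with MLflow; it assumes the mlflow package, logs to the local ./mlruns directory by default, and the parameter and metric values are illustrative.

```python
# Minimal MLflow experiment-tracking sketch (assumes the mlflow package).
import mlflow

mlflow.set_experiment("prompt-experiments")

with mlflow.start_run(run_name="baseline-prompt"):
    # Log the knobs of an LLM call as parameters...
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("temperature", 0.2)
    # ...and evaluation results as metrics (values here are illustrative).
    mlflow.log_metric("answer_relevancy", 0.87)
    mlflow.log_metric("latency_seconds", 1.4)

# Inspect logged runs in a browser with: mlflow ui
```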
- Hugging Face Transformers: A comprehensive library for training, fine-tuning, and deploying LLMs and VLMs. It supports models like BERT, GPT, and CLIP, with tools for NLP, computer vision, and multimodal tasks. Used for research and production-grade AI applications.
- PEFT (Parameter-Efficient Fine-Tuning): A library for efficient fine-tuning of large models using techniques like LoRA, prompt tuning, and adapters. Ideal for customizing LLMs on limited hardware (see the LoRA sketch after this list).
- bitsandbytes: A lightweight CUDA extension for quantization and low-bit inference/training of LLMs. Enables memory-efficient training of large models.
- LMFlow: A framework for easy and fast fine-tuning, instruction tuning, and deployment of LLMs. Includes support for model compression and evaluation.
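A minimal LoRA setup with PEFT and Transformers; GPT-2 is used as a small stand-in base model, and the target module names are architecture-specific assumptions that you would adjust for your own model.

```python
# Minimal PEFT/LoRA sketch (assumes the transformers and peft packages;
# the base model and target modules are illustrative choices).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "gpt2"  # small stand-in; swap in any causal LM you can fit
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA adds small trainable low-rank matrices to the chosen projection modules.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # module names depend on the architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```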
- DeepEval: A testing framework for evaluating LLM applications. It offers over 14 research-backed metrics to assess RAG pipelines and safety risks, integrating with frameworks like LangChain and LlamaIndex. Used for quality assurance in AI development (see the sketch after this list).
- PromptTools: A Python library for debugging, comparing, and evaluating LLM prompts with visualizations and logging support.
- AlpacaEval: A community-driven evaluation toolkit for benchmarking LLMs' instruction-following ability using standardized prompts.
- OpenCompass: A comprehensive open-source framework for large-scale benchmarking of LLMs and VLMs using curated datasets and metrics.
- VLMEvalKit 🖼️: An open-source evaluation toolkit for large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks (220+ models and 80+ benchmarks supported).
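A minimal DeepEval sketch following its documented quick-start pattern; it assumes the deepeval package and an LLM judge configured through OPENAI_API_KEY, and the test case and metric choice are illustrative.

```python
# Minimal DeepEval sketch (assumes the deepeval package and an LLM judge
# configured via OPENAI_API_KEY; follows DeepEval's documented quick-start API).
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What does PagedAttention optimize in vLLM?",
    actual_output="PagedAttention optimizes GPU memory usage for the KV cache.",
)

# Scores how relevant the answer is to the question using an LLM-as-judge.
metric = AnswerRelevancyMetric(threshold=0.7)

evaluate(test_cases=[test_case], metrics=[metric])
```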
- Gradio: An intuitive Python library for creating interactive web interfaces for ML models. Popular for prototyping and demonstrating LLM/VLM applications (see the sketch after this list).
- Open WebUI: An open-source web interface for interacting with local and hosted LLMs. Supports multiple backends and provides a sleek, extensible UI.
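A minimal Gradio sketch; the answer function is a stub standing in for a real model call (for example, the Ollama request shown earlier).

```python
# Minimal Gradio demo sketch (assumes the gradio package).
import gradio as gr

def answer(prompt: str) -> str:
    # Replace this stub with a call to your LLM of choice.
    return f"You asked: {prompt}"

demo = gr.Interface(fn=answer, inputs="text", outputs="text", title="LLM demo")
demo.launch()  # serves a local web UI, by default at http://127.0.0.1:7860
```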
- OpenMMLab Multimodal-GPT: Based on the open-source multimodal model OpenFlamingo; builds visual instruction data from open datasets covering VQA, image captioning, visual reasoning, text OCR, and visual dialogue.
- OpenMMLab MMagic: Multimodal Advanced, Generative, and Intelligent Creation (MMagic).
- Transformer Lens: A library for visualizing and interpreting transformer internals. Helps researchers understand model behavior neuron by neuron (see the sketch after this list).
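A minimal TransformerLens sketch that caches and inspects activations of GPT-2 small; it assumes the transformer_lens package, and the chosen hook name is just one example of what can be read from the cache.

```python
# Minimal TransformerLens sketch (assumes the transformer_lens package;
# GPT-2 small is used as a lightweight example model).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Run the model and cache every intermediate activation.
logits, cache = model.run_with_cache("The Eiffel Tower is in")

# Inspect one cached activation: the residual stream after block 0.
resid = cache["resid_post", 0]
print(resid.shape)  # (batch, sequence_position, d_model)

# Greedy next-token prediction from the final logits.
next_token_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode([next_token_id]))
```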
- GLM-4.1V-9B-Thinking, 🤗 HF - MIT License 🚀: Open-source VLM from THUDM, excelling in multimodal reasoning with support for 64K context, 4K image processing, and bilingual (English/Chinese) capabilities. It outperforms many models of similar size and rivals larger models like Qwen2.5-VL-72B on 18/28 benchmarks, including STEM and long-document understanding.
- Qwen 2.5 VL (7B / 72B): Multimodal VLM from Alibaba with dynamic resolution, video input, object localization, and support for ~29 languages. Top open-source performer in OCR and agentic workflows.
- Gemma 3 (4B–27B): Google's open multimodal model with a SigLIP image encoder; excels in multilingual captioning and VQA, with strong 128K-context performance.
- PaliGemma: Compact Gemma-2B-based VLM combining a SigLIP visual encoder with strong captioning, segmentation, and VQA transferability.
- Llama 3.2 Vision (11B / 90B): Vision-adapted Llama model with excellent OCR, document understanding, VQA, and 128K-token context.
- Phi-4 Multimodal: Microsoft's VLM supporting vision-language tasks, MIT-licensed and edge-friendly.
- DeepSeek-VL: Open-source VLM optimized for scientific reasoning and compact deployment.
- CogVLM: Strong-performing model in VQA and vision-centric tasks.
- BakLLaVA: LMM from the LAION, Ontocord, and Skunkworks OSS AI group combining Mistral 7B with the LLaVA architecture for efficient VQA pipelines (a generic inference sketch follows this list).
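As a rough illustration of how these open VLMs are typically queried, here is a hedged sketch using Hugging Face Transformers with a LLaVA-1.5 checkpoint; the prompt template, processor behavior, and memory requirements vary by model and library version, and device_map="auto" additionally assumes the accelerate package.

```python
# Hedged VLM inference sketch (assumes transformers, torch, pillow, accelerate;
# llava-hf/llava-1.5-7b-hf is an illustrative checkpoint, roughly 14 GB in fp16).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # any local image

# LLaVA-1.5 expects an <image> placeholder in its chat-style prompt.
prompt = "USER: <image>\nDescribe this image in one sentence.\nASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```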
- OCRFlux: A toolkit built on a multimodal large language model for converting PDFs and images into clean, readable, plain Markdown text. Its 3B-parameter model can run on a single NVIDIA 3090 GPU, making it accessible for local deployment.
- Llama-3.1-Nemotron-Nano-VL-8B-V1, 🤗 HF: A leading document-intelligence vision-language model from NVIDIA that enables querying and summarizing images and video from the physical or virtual world.
- Qwen 2.5 VL (32B / 72B): State-of-the-art open OCR performance (~75% accuracy), outperforming even Mistral-OCR; excels in document, video, and multilingual text extraction.
- Mistral-OCR: Purpose-trained OCR variant from Mistral, delivering ~72.2% accuracy on structured document benchmarks.
- Llama 3.2 Vision (11B / 90B): Strong OCR and document-understanding capabilities, among the top open VLMs.
- Gemma 3 27B: Offers competitive OCR performance through its vision-language architecture.
- DeepSeek-v3-03-24: Lightweight, open-source OCR-ready model evaluated in 2025 benchmarks.
- TextHawk 2: Bilingual OCR and grounding VLM with state-of-the-art results across OCRBench, DocVQA, and ChartQA while using 16× fewer image tokens.
- VISTA-OCR: New lightweight generative OCR model unifying detection and recognition with only 150M params; interactive and high-accuracy.
- PP-DocBee: Multimodal document-understanding model with superior performance on English/Chinese benchmarks (a generic OCR-prompting sketch follows this list).
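Many of the OCR-capable VLMs above can be exposed through OpenAI-compatible endpoints (for example via vLLM or OpenLLM from the tooling section). The sketch below shows the general pattern of sending a page image as a base64 data URL and asking for a Markdown transcription; the base URL and model id are placeholders, and multimodal message formats can differ between servers.

```python
# Hedged sketch: asking an OCR-capable VLM behind an OpenAI-compatible endpoint
# to transcribe a page image as Markdown. The base_url and model id are
# placeholders for whatever server/model you are actually running.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="my-ocr-vlm",  # placeholder model id exposed by the local server
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this page as clean Markdown."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```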
Model | Year | Company/Org | Used for |
---|---|---|---|
MedGemma | 2025 | Google DeepMind | Medical image and text comprehension (4B multimodal and 27B variants) |
MedSigLIP | 2025 | Google DeepMind | Lightweight medical image encoder for classification, retrieval, and zero-shot labeling (~400M parameters) |
Med‑Gemini | 2024 | Google DeepMind | Multimodal medical applications |
LLaVA-Med | 2024 | Microsoft | Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities |
CONCH | 2024 | Mahmood Lab + Harvard Medical School | Vision-Language Pathology Foundation Model - Nature Medicine |
BioMistral‑7B | 2024 | CNRS + Mistral | Medical-domain fine-tuned LLM on PubMed (7B parameters) |
BioMedLM 2.7B | 2024 | Stanford CRFM+MosaicML | Medical-domain trained exclusively on biomedical abstracts and papers from The Pile |
Med‑PaLM M | 2023 | Google Research | Multimodal medical Q&A with image and text input |
No. | Title | Authors | Source | Year |
---|---|---|---|---|
1 | Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models | Yukang Yang, et al. | arXiv:2502.20332 | 2025 |
2 | Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models | James Chua, et al. | arXiv:2506.13206 | 2025 |
3 | Emergent Response Planning in Large Language Models | Zhichen Dong, et al. | arXiv:2502.06258 | 2025 |
4 | Emergent Abilities in Large Language Models: A Survey | Leonardo Berti, et al. | arXiv:2503.05788 | 2025 |
5 | LIMO: Less Is More for Reasoning | Yixin Ye, et al. | arXiv:2502.03387 | 2025 |
6 | An Introduction to Vision-Language Modeling | Florian Bordes, et al. | arXiv:2405.17247 | 2024 |
7 | What Matters When Building Vision-Language Models? | Hugo Laurençon, et al. | arXiv:2405.02246 | 2024 |
8 | Building and better understanding vision-language models: insights and future directions | Hugo Laurençon, et al. | arXiv:2408.12637 | 2024 |
9 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | Zhiyu Wu, et al. | arXiv:2412.10302 | 2024 |
10 | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Peng Wang, et al. | arXiv:2409.12191 | 2024 |
11 | PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter | Junfei Xiao, et al. | arXiv:2402.10896 | 2024 |
12 | Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving | Akshay Gopalkrishnan, et al. | arXiv:2403.19838 | 2024 |
- CVPR - IEEE/CVF Conference on Computer Vision and Pattern Recognition
- NeurIPS - Conference on Neural Information Processing Systems
- ICLR - International Conference on Learning Representations
- ACL - Association for Computational Linguistics
If you need support with your AI project, or if you're simply an AI and new-technology enthusiast, don't hesitate to connect with me on LinkedIn 👍