This section collects research papers on reasoning with large language models, beginning with Chain-of-Thought (CoT) reasoning. CoT prompting elicits explicit intermediate reasoning steps, which improves the logical coherence and performance of large language models on tasks that require complex, multi-step thought.
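As a quick, illustrative reference for the technique these papers study, below is a minimal sketch of zero-shot chain-of-thought prompting. It is a hypothetical example, not drawn from any specific paper listed here: `build_cot_prompt` and `call_llm` are placeholder names, and `call_llm` stands in for whatever completion API is in use.

```python
# Minimal sketch of zero-shot chain-of-thought (CoT) prompting.
# All names here are illustrative placeholders, not a specific library's API.

def build_cot_prompt(question: str) -> str:
    # The trailing "Let's think step by step." nudges the model to write out
    # intermediate reasoning before committing to a final answer.
    return f"Q: {question}\nA: Let's think step by step."

def call_llm(prompt: str) -> str:
    # Placeholder: substitute your own model client (OpenAI-compatible, vLLM, ...).
    raise NotImplementedError

if __name__ == "__main__":
    prompt = build_cot_prompt(
        "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    )
    print(prompt)
    # answer = call_llm(prompt)
    # A CoT-style answer would reason 45 min = 0.75 h, 60 / 0.75 = 80 km/h,
    # before stating the final answer.
```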
Chain-of-Thought (CoT) Reasoning
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
- Chain-of-Thought Reasoning Without Prompting
- Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs (also listed under Tool-Augmented Reasoning / Agentic Reasoning)
- Self-Harmonized Chain of Thought
- To CoT or not to CoT? Chain-of-thought helps mainly with math and symbolic reasoning
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
- Compressed Chain of Thought: Efficient Reasoning Through Dense Representations
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
- BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
- Demystifying Long Chain-of-Thought Reasoning in LLMs
- SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models
- LightThinker: Thinking Step-by-Step Compression
- Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
- Chain of Draft: Thinking Faster by Writing Less
- HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
- Can Language Models Perform Robust Reasoning in Chain-of-thought Prompting with Noisy Rationales?
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
- Unlocking the Capabilities of Thought: A Reasoning Boundary Framework to Quantify and Optimize Chain-of-Thought
- Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
- Decompose, Analyze and Rethink: Solving Intricate Problems with Human-like Reasoning Cycle
- Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
- Efficient Inference for Large Reasoning Models: A Survey
- The Impact of Reasoning Step Length on Large Language Models
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters!
- Token-Budget-Aware LLM Reasoning
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models
Multi-Hop Reasoning
- K-Level Reasoning with Large Language Models (repeated)
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- Offline Reinforcement Learning for LLM Multi-Step Reasoning
- BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering
- Similarity is Not All You Need: Endowing Retrieval-Augmented Generation with Multi-layered Thoughts
- QueryAgent: A Reliable and Efficient Reasoning Framework with Environmental Feedback-based Self-Correction
- Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study
- Premise Order Matters in Reasoning with Large Language Models
- MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition
- Improving Multi-Hop Reasoning in LLMs by Learning from Rich Human Feedback
- Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA
- Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning
- MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge
- RConE: Rough Cone Embedding for Multi-Hop Logical Query Answering on Multi-Modal Knowledge Graphs
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Mathematical Reasoning
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning
- Common 7B Language Models Already Possess Strong Math Capabilities
- MathScale: Scaling Instruction Tuning for Mathematical Reasoning
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision
- Learn Beyond The Answer: Training Language Models with Reflection for Mathematical Reasoning
- Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
- Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models — The Story Goes On
- InfinityMATH: A Scalable Instruction Tuning Dataset in Programmatic Mathematical Reasoning
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
- Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
- BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
- Self-rewarding correction for mathematical reasoning
- SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
- Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
- Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving
- AlphaMath Almost Zero: Process Supervision without Process
- DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving
- Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing
- MATHPILE: A Billion-Token-Scale Pre-training Corpus for Math
- OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
- Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Commonsense Reasoning
- Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
- Structured Chemistry Reasoning with Large Language Models
- Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
- CheXWorld: Exploring Image World Modeling for Radiograph Representation Learning
- KnowZRel: Common Sense Knowledge-based Zero-Shot Relationship Retrieval for Generalised Scene Graph Generation
- What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
- Benchmarks for Automated Commonsense Reasoning: A Survey
- The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation
Visual Reasoning
- Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
- Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
- Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
- ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
- HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
- Progressive Multimodal Reasoning via Active Retrieval
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation
- MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
- MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models
- Video-R1: Reinforcing Video Reasoning in MLLMs
- LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
- SWE-BENCH MULTIMODAL: Do AI Systems Generalize to Visual Software Domains?
- How Far Are We from Intelligent Visual Deductive Reasoning?
- LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
- Visual-RFT: Visual Reinforcement Fine-Tuning
- Token-Efficient Long Video Understanding for Multimodal LLMs
- OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
Temporal Reasoning
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models (also listed under Visual Reasoning)
- Large Language Models Can Learn Temporal Reasoning
- Back to the Future: Towards Explainable Temporal Reasoning with Large Language Models
- Large language models-guided dynamic adaptation for temporal knowledge graph reasoning
- Timebench: A comprehensive evaluation of temporal reasoning abilities in large language models
- Timo: Towards better temporal reasoning for language models
- Temporal reasoning transfer from text to video
- TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues
- Tram: Benchmarking temporal reasoning for large language models
Code/Algorithmic Reasoning
- CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
- DeepSeek-Coder: When the Large Language Model Meets Programming — The Rise of Code Intelligence
- Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
- MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
- Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
- Competitive Programming with Large Reasoning Models
- CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
- Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning (doubtful)
- OctoPack: Instruction Tuning Code Large Language Models
- Open-Book Neural Algorithmic Reasoning
- Towards Advancing Code Generation with Large Language Models: A Research Roadmap
- CodeRosetta: Pushing the Boundaries of Unsupervised Code Translation for Parallel Programming
- SEMCODER: Training Code Language Models with Comprehensive Semantics Reasoning
- Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning
- Detecting command injection vulnerabilities in Linux-based embedded firmware with LLM-based taint analysis of library functions
- CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis
- DeGPT: Optimizing Decompiler Output with LLM
- Refining Decompiled C Code with Large Language Models
- VERT: Verified Equivalent Rust Transpilation with Large Language Models as Few-Shot Learners
- Harnessing the Power of LLM to Support Binary Taint Analysis
- GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion
🔍 Retrieval-Augmented Generation (RAG)-based Reasoning
- BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack
- ReLiK: Retrieve and LinK, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget
- RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation
- ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
- Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs
- LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs
- OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
- MemoRAG: Moving Towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
- StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
- CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmented Generation
- RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
- RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
- OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
- Chain-of-Retrieval Augmented Generation
- DeepRAG: Thinking to Retrieval Step by Step for Large Language Models
- SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
- SearchRAG: Can search engines be helpful for answering LLM-based medical Questions?
- Retrieval-augmented Large Language Models for Financial Time Series Forecasting
- ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
- Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
- Searching for Best Practices in Retrieval-Augmented Generation
- RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
- RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
- Graph Retrieval-Augmented Generation: A Survey
🛠️ Tool-Augmented Reasoning / Agentic Reasoning
- Efficient Tool Use with Chain-of-Abstraction Reasoning
- Can large language models explore in-context?
- Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
- Knowledge Mechanisms in Large Language Models: A Survey and Perspective
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
- MALT: Improving Reasoning with Multi-Agent LLM Training
- Efficiently Serving LLM Reasoning Programs with Certaindex
- GuardReasoner: Towards Reasoning-based LLM Safeguards
- The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
- START: Self-taught Reasoner with Tools
- Open Deep Search: Democratizing Search with Open-source Reasoning Agents
- Let Models Speak Ciphers: Multiagent Debate Through Embeddings
- ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search
- AVATAR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
- Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
- Towards an AI co-scientist
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
- Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
- Toward Efficient Inference for Mixture of Experts
- MAGDI: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models
- Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents
- Auditing Prompt Caching in Language Model APIs
- Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
- SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
- TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
- Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
🎮 Reinforcement Learning for Reasoning
- Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
- Teaching Large Language Models to Reason with Reinforcement Learning
- Iterative Reasoning Preference Optimization
- Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
- Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
- Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning
- The Lessons of Developing Process Reward Models in Mathematical Reasoning
- Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
- Compositional Preference Models for Aligning LLMs
- Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
- DeepSeek-R1-Zero & DeepSeek-R1: Reinforcement Learning for Advanced Reasoning
- Retrospex: Language Agent Meets Offline Reinforcement Learning Critic
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search
- RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
- Reasoning in Flux: Enhancing Large Language Models Reasoning through Uncertainty-aware Adaptive Guidance
- Training Language Models to Reason Efficiently
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- Reasoning Language Models: A Blueprint
Multilingual Reasoning
- An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- MindMerger: Efficiently Boosting LLM Reasoning in non-English Languages
- MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations
- LangBridge: Multilingual Reasoning Without Multilingual Supervision
- M4u: Evaluating multilingual understanding and reasoning for large multimodal models
- A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity
- The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models
- mCoT: Multilingual instruction tuning for reasoning consistency in language models
- Towards better understanding of program-of-thought reasoning in cross-lingual and multilingual environments
- Language models are multilingual chain-of-thought reasoners
- What is missing in multilingual visual reasoning and how to fix it
- XCOPA: A multilingual dataset for causal commonsense reasoning
- SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment
🧬 Meta-Reasoning / Self-Evolving Reasoning
- Self-Discover: Large Language Models Self-Compose Reasoning Structures
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Advancing LLM Reasoning Generalists with Preference Trees
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
- THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
- Self-Recognition in Language Models
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
- Not All LLM Reasoners Are Created Equal
- RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
- Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
- Reverse Thinking Makes LLMs Stronger Reasoners
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability
- Are Your LLMs Capable of Stable Reasoning?
- Diving into Self-Evolving Training for Multimodal Reasoning
- B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
- Evolving Deeper LLM Thinking
- Large Language Models Think Too Fast To Explore Effectively
- Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
- LIMO: Less is More for Reasoning
- Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Small Models Struggle to Learn from Strong Reasoners
- Diverse Inference and Verification for Advanced Reasoning
- I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
- Understanding the Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation
- Large Memory Models (LM2): Enhancing AI's Long-Term Reasoning
- What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models
- Effectively Controlling Reasoning Models through Thinking Intervention
- DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
- MPO: Boosting LLM Agents with Meta Plan Optimization
- A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
🧠💬 Social/Cognitive Reasoning
- Can Large Language Models Understand Context?
- OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
- Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
- Great Models Think Alike and This Undermines AI Oversight
- Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
- JudgeLRM: Large Reasoning Models as a Judge
- Understanding Social Reasoning in Language Models with Language Models
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
- "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding
- Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
- Two Heads are Better than One: Zero-shot Cognitive Reasoning via Multi-LLM Knowledge Fusion
- Do LLM Agents Exhibit Social Behavior?
- Evaluating Social Biases in LLM Reasoning
- Can LLMs Reason Like Humans? Assessing Theory of Mind Reasoning in LLMs for Open-Ended Questions