Reiji2/robotics_arXiv_daily

Updated on 2026.01.08

Usage instructions: here

Table of Contents
  1. Manipulation
  2. VLM
  3. VLA
  4. Humanoid
  5. Dexterous

Manipulation

Publish Date Title Authors PDF Code
2025-07-23 ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents Hesheng Wang Team 2507.17462 null
2025-07-23 Ctx2TrajGen: Traffic Context-Aware Microscale Vehicle Trajectories using Generative Adversarial Imitation Learning Byeongjoon Noh Team 2507.17418 null
2025-07-23 Confounded Causal Imitation Learning with Instrumental Variables Zhi Geng Team 2507.17309 null
2025-07-23 Prolonging Tool Life: Learning Skillful Use of General-purpose Tools through Lifespan-guided Reinforcement Learning Takamitsu Matsubara Team 2507.17275 null
2025-07-23 Towards Human-level Intelligence via Human-like Whole-Body Manipulation Zhaohui An Team 2507.17141 null
2025-07-22 Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots Aitor Arrieta Team 2507.17049 null
2025-07-19 Sensor-Space Based Robust Kinematic Control of Redundant Soft Manipulator by Learning Charlie C. L. Wang Team 2507.16842 null
2025-07-22 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning Fu-En Yang Team 2507.16815 null
2025-07-22 Equivariant Goal Conditioned Contrastive Reinforcement Learning Robert Platt Team 2507.16139 null
2025-07-21 Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers Iman Soltani Team 2507.15833 null
2025-07-21 Strong, Accurate, and Low-Cost Robot Manipulator Donghyun Kim Team 2507.15693 null
2025-07-21 Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos Zongqing Lu Team 2507.15597 null
2025-07-22 GR-3 Technical Report Yichu Yang Team 2507.15493 null
2025-07-20 Learning-Based Modeling of a Magnetically Steerable Soft Suction Device for Endoscopic Endonasal Interventions Eric Diller Team 2507.15155 null
2025-07-20 Reinforcement Learning for Flow-Matching Policies Somayeh Sojoudi Team 2507.15073 null
2025-07-20 Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper Yunzhu Li Team 2507.15062 null
2025-07-20 LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading Lu Zhang Team 2507.14995 null
2025-07-20 Heterogeneous object manipulation on nonlinear soft surface through linear controller Andres Faiña Team 2507.14967 null
2025-07-20 KGN-Pro: Keypoint-Based Grasp Prediction through Probabilistic 2D-3D Correspondence Learning Guangyao Zhai Team 2507.14820 null
2025-07-19 BT-TL-DMPs: A Novel Robot TAMP Framework Combining Behavior Tree, Temporal Logic and Dynamical Movement Primitives Yongchun Fang Team 2507.14582 null
2025-07-18 Improving Low-Cost Teleoperation: Augmenting GELLO with Force Kai Arulkumaran Team 2507.13602 null
2025-07-17 The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner Kai Chen Team 2507.13332 null
2025-07-17 ZipMPC: Compressed Context-Dependent MPC Cost via Imitation Learning Johannes A. Stork Team 2507.13088 null
2025-07-17 Generalist Bimanual Manipulation via Foundation Video Diffusion Models Jun Zhu Team 2507.12898 null
2025-07-17 Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) Jost Tobias Springenberg Team 2507.12856 null
2025-07-17 DEMONSTRATE: Zero-shot Language to Robotic Control via Multi-task Demonstration Learning Melanie N. Zeilinger Team 2507.12855 null
2025-07-17 Learning to Predict Mobile Robot Stability in Off-Road Environments Parikshit Maini Team 2507.12731 null
2025-07-18 EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos Xiaolong Wang Team 2507.12440 null
2025-07-16 The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey Jiming Chen Team 2507.11840 null
2025-07-15 Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification Zsolt Kira Team 2507.11662 null
2025-07-15 MPC-based Coarse-to-Fine Motion Planning for Robotic Object Transportation in Cluttered Environments Steven Liu Team 2507.11211 null
2025-07-15 A Robust Controller based on Gaussian Processes for Robotic Manipulators with Unknown Uncertainty Ruggero Carli Team 2507.11170 null
2025-07-15 Enhancing Autonomous Manipulator Control with Human-in-loop for Uncertain Assembly Environments Kazuya Yoshida Team 2507.11006 null
2025-07-15 Object-Centric Mobile Manipulation through SAM2-Guided Perception and Imitation Learning Jun Morimoto Team 2507.10899 null
2025-07-14 Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection Colin Bellinger Team 2507.10814 null
2025-07-14 rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding Kaiyu Hang Team 2507.10776 null
2025-07-14 A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers Arko Barman Team 2507.10775 null
2025-07-14 Vision Language Action Models in Robotic Manipulation: A Systematic Review Irfan Hussain Team 2507.10672 null
2025-07-16 GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning Dandan Tu Team 2507.10628 null
2025-07-14 MP1: Mean Flow Tames Policy Learning in 1-step for Robotic Manipulation Mengyuan Liu Team 2507.10543 null
2025-07-14 Prompt Informed Reinforcement Learning for Visual Coverage Path Planning Venkat Margapuri Team 2507.10284 null
2025-07-14 Should We Ever Prefer Decision Transformer for Offline Reinforcement Learning? Keith Ross Team 2507.10174 null
2025-07-16 MTF-Grasp: A Multi-tier Federated Learning Approach for Robotic Grasping Monowar Bhuyan Team 2507.10158 null
2025-07-13 Learning to Control Dynamical Agents via Spiking Neural Networks and Metropolis-Hastings Sampling Ali Al-Zawqari Team 2507.09540 null
2025-07-13 Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles Keqiang Li Team 2507.09537 null
2025-07-13 SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation Boyu Wang Team 2507.09459 null
2025-07-12 DAA*: Deep Angular A Star for Image-based Path Planning Zhiwei Xu Team 2507.09305 null
2025-07-15 Learning and Transferring Better with Depth Information in Visual Reinforcement Learning Jingdong Zhao Team 2507.09180 null
2025-07-12 PRAG: Procedural Action Generator Karla Stepanova Team 2507.09167 null
2025-07-12 Towards Human-level Dexterity via Robot Learning Gagan Khandate Team 2507.09117 null
2025-07-11 Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction Max Simchowitz Team 2507.09061 null
2025-07-11 Behavioral Exploration: Learning to Explore via In-Context Adaptation Sergey Levine Team 2507.09041 null
2025-07-11 Learning human-to-robot handovers through 3D scene reconstruction Changjae Oh Team 2507.08726 null
2025-07-11 Learning Robust Motion Skills via Critical Adversarial Attacks for Humanoid Robots Yue Gao Team 2507.08303 null
2025-07-11 CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations He Wang Team 2507.08262 null
2025-07-10 Imitation Learning for Obstacle Avoidance Using End-to-End CNN-Based Sensor Fusion Raafat E. Shalaby Team 2507.08112 null
2025-07-15 EXPO: Stable Reinforcement Learning with Expressive Policies Chelsea Finn Team 2507.07986 null
2025-07-15 Reinforcement Learning with Action Chunking Sergey Levine Team 2507.07969 null
2025-07-09 Self-Wearing Adaptive Garments via Soft Robotic Unfurling Allison M. Okamura Team 2507.07221 null
2025-07-09 Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand Xinjun Sheng Team 2507.06822 null
2025-07-09 Learning safe, constrained policies via imitation learning: Connection to Probabilistic Inference and a Naive Algorithm George A. Vouros Team 2507.06780 null
2025-07-13 Spatial-Temporal Aware Visuomotor Diffusion Policy Learning Yanwei Fu Team 2507.06710 null
2025-07-09 Value from Observations: Towards Large-Scale Imitation Learning via Self-Improvement Martin Riedmiller Team 2507.06701 null
2025-07-09 Goal-Oriented Skill Abstraction for Offline Multi-Task Reinforcement Learning Jian Cheng Team 2507.06628 null
2025-07-09 Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic Fabio Ramos Team 2507.06625 null
2025-07-09 Token Bottleneck: One Token to Remember Dynamics Sangdoo Yun Team 2507.06543 null
2025-07-08 Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction Alessio Del Bue Team 2507.06404 null
2025-07-08 EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow Liang Wang Team 2507.06224 null
2025-07-08 Is Diversity All You Need for Scalable Robotic Manipulation? Hongyang Li Team 2507.06219 null
2025-07-08 Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model Toshiaki Tsuji Team 2507.06174 null
2025-07-08 Learning Agile Tensile Perching for Aerial Robots from Demonstrations Basaran Bahadir Kocer Team 2507.06172 null
2025-07-08 SCCRUB: Surface Cleaning Compliant Robot Utilizing Bristles Jeffrey Ian Lipton Team 2507.06053 null
2025-07-08 LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving Jian Sun Team 2507.05754 null
2025-07-08 Hybrid Diffusion Policies with Projective Geometric Algebra for Efficient Robot Manipulation Learning Daniel Rakita Team 2507.05695 null
2025-07-08 Integrating Diffusion-based Multi-task Learning with Online Reinforcement Learning for Robust Quadruped Robot Control Bin Liang Team 2507.05674 null
2025-07-08 Stable Tracking-in-the-Loop Control of Cable-Driven Surgical Manipulators under Erroneous Kinematic Chains Michael C. Yip Team 2507.05663 null
2025-07-08 DreamGrasp: Zero-Shot 3D Multi-Object Reconstruction from Partial-View Images for Robotic Manipulation Frank Chongwoo Park Team 2507.05627 null
2025-07-07 Gaussian Process-Based Active Exploration Strategies in Vision and Touch Nadia Figueroa Team 2507.05522 null
2025-07-07 A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation Russ Tedrake Team 2507.05331 null
2025-07-07 VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting Yanzhi Wang Team 2507.05116 null
2025-07-07 When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning Sebastien Ourselin Team 2507.05011 null
2025-07-07 Training-free Generation of Temporally Consistent Rewards from VLMs Jian Tang Team 2507.04789 null
2025-07-07 DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics Mingsheng Shang Team 2507.04661 null
2025-07-07 PRISM: Pointcloud Reintegrated Inference via Segmentation and Cross-attention for Manipulation Chee-Meng Chew Team 2507.04633 null
2025-07-07 Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts Junjie Hu Team 2507.04631 null
2025-07-06 VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation Lei Han Team 2507.04524 null
2025-07-06 DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge Xin Jin Team 2507.04447 null
2025-07-06 Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks Yi Fang Team 2507.04331 null
2025-07-05 Are Learning-Based Approaches Ready for Real-World Indoor Navigation? A Case for Imitation Learning Sebastian Houben Team 2507.04086 null
2025-07-05 Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation Yadan Luo Team 2507.04049 null
2025-07-08 RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot Hao Dong Team 2507.03930 null
2025-07-05 DK-RRT: Deep Koopman RRT for Collision-Aware Motion Planning of Space Manipulators in Dynamic Debris Environments Dezhi Yu Team 2507.03878 null
2025-07-04 Dexterous Teleoperation of 20-DoF ByteDexter Hand via Human Motion Retargeting Zeyu Ren Team 2507.03227 null
2025-07-02 cVLA: Towards Efficient Camera-Space VLAs Thomas Brox Team 2507.02190 null
2025-07-02 Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN Matthias Kerzel Team 2507.02171 null
2025-07-02 TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types Wei-Shi Zheng Team 2507.01857 null
2025-07-02 S3D: A Spatial Steerable Surgical Drilling Framework for Robotic Spinal Fixation Procedures Farshid Alambeigi Team 2507.01779 null
2025-07-03 TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control Yanwei Fu Team 2507.01424 null
2025-07-01 Search-Based Robot Motion Planning With Distance-Based Adaptive Motion Primitives Bakir Lacevic Team 2507.01198 null
2025-07-01 Imitation Learning for Satellite Attitude Control under Unknown Perturbations Xiaoli Bai Team 2507.01161 null
2025-07-01 SonoGym: High Performance Simulation for Challenging Surgical Tasks with Robotic Ultrasound Philipp Fürnstahl Team 2507.01152 null
2025-07-01 Geometry-aware 4D Video Generation for Robot Manipulation Shuran Song Team 2507.01099 null
2025-07-01 DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation Pulkit Agrawal Team 2507.01008 null
2025-07-04 Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations Yunzhu Li Team 2507.00990 null
2025-07-01 HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning Chenjia Bai Team 2507.00833 null
2025-07-01 Learning Steerable Imitation Controllers from Unstructured Animal Motions Stelian Coros Team 2507.00677 null
2025-07-01 RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation Siddhartha Srinivasa Team 2507.00435 null
2025-07-01 Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning Yang Gao Team 2506.23944 null
2025-06-30 World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation Lin Shao Team 2506.23919 null
2025-06-30 Advancing Learnable Multi-Agent Pathfinding Solvers with Active Fine-Tuning Alexey Skrynnik Team 2506.23793 null
2025-06-30 PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? Ransalu Senanayake Team 2506.23725 null
2025-07-04 ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation Mac Schwager Team 2506.23126 null
2025-06-29 Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots Yue Gao Team 2506.23125 null
2025-06-28 Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation Navid Azizan Team 2506.22827 null
2025-06-28 SPI-BoTER: Error Compensation for Industrial Robots via Sparse Attention Masking and Hybrid Loss with Spatial-Physical Information Yuqiang Wu Team 2506.22788 null
2025-06-28 Learning Efficient Robotic Garment Manipulation with Standardization Bin He Team 2506.22769 null
2025-06-28 RoboPearls: Editable Video Simulation for Robot Manipulation Xiaodan Liang Team 2506.22756 null
2025-06-27 Spherical Pendulum with Quad-Rotor Thrust Vectoring Actuation -- A Novel Mechatronics and Control Benchmark Platform Tsu-Chin Tsao Team 2506.22410 null
2025-06-27 RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation Abhinav Valada Team 2506.22007 null
2025-06-26 Experimental investigation of pose informed reinforcement learning for skid-steered visual navigation Venkat Krovi Team 2506.21732 null
2025-06-24 Ark: An Open-source Python-based Framework for Robot Learning Haitham Bou-Ammar Team 2506.21628 null
2025-06-24 FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models Huiping Zhuang Team 2506.21627 null
2025-06-26 ACTLLM: Action Consistency Tuned Large Language Model Chenliang Xu Team 2506.21250 null
2025-07-02 World-aware Planning Narratives Enhance Large Vision-Language Model Planner Xipeng Qiu Team 2506.21230 null
2025-06-26 UAIbot: Beginner-friendly web-based simulator for interactive robotics learning and research Vinicius Mariano Gonçalves Team 2506.21178 null
2025-06-26 Knowledge-Driven Imitation Learning: Enabling Generalization Across Diverse Conditions Cewu Lu Team 2506.21057 null
2025-06-26 Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends Zeng-Guang Hou Team 2506.20966 null
2025-06-25 Learning-Based Distance Estimation for 360° Single-Sensor Setups Andreas Zell Team 2506.20586 null
2025-06-25 Learn to Position -- A Novel Meta Method for Robotic Positioning Xiaoming Tao Team 2506.20445 null
2025-06-25 Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration Quanquan Gu Team 2506.20307 null
2025-06-24 Unified Vision-Language-Action Model Zhaoxiang Zhang Team 2506.19850 null
2025-06-24 T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models Qingyao Wu Team 2506.19498 null
2025-06-24 Is an object-centric representation beneficial for robotic manipulation ? Liming Chen Team 2506.19408 null
2025-06-24 Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference Nutan Chen Team 2506.19303 null
2025-06-25 AnchorDP3: 3D Affordance Guided Sparse Diffusion Policy for Robotic Manipulation Hui Shen Team 2506.19269 null
2025-06-24 Robust Behavior Cloning Via Global Lipschitz Regularization Sean B. Andersson Team 2506.19250 null
2025-06-23 CUPID: Curating Data your Robot Loves with Influence Functions Jeannette Bohg Team 2506.19121 null
2025-06-23 Multimodal Anomaly Detection with a Mixture-of-Experts Dongheui Lee Team 2506.19077 null
2025-06-25 FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation Lillian Chin Team 2506.18960 null
2025-06-23 RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base Xiangyang Xue Team 2506.18856 null
2025-06-23 SViP: Sequencing Bimanual Visuomotor Policies with Object-Centric Motion Primitives Jia Pan Team 2506.18825 null
2025-06-23 Learning Point Correspondences In Radar 3D Point Clouds For Radar-Inertial Odometry Jan Steinbrener Team 2506.18580 null
2025-06-23 Robots and Children that Learn Together : Improving Knowledge Retention by Teaching Peer-Like Interactive Robots Alessandro Di Nuovo Team 2506.18365 null
2025-06-23 Robotic Manipulation of a Rotating Chain with Bottom End Fixed Quang-Cuong Pham Team 2506.18355 null
2025-06-23 Sharpening the Spear: Adaptive Expert-Guided Adversarial Attack Against DRL-based Autonomous Driving Policies Xiaolin Chang Team 2506.18304 null
2025-06-23 Learning Approach to Efficient Vision-based Active Tracking of a Flying Target by an Unmanned Aerial Vehicle Souma Chowdhury Team 2506.18264 null
2025-06-22 RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation Yao Mu Team 2506.18088 null
2025-06-21 RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models Xiao Li Team 2506.17639 null
2025-06-21 Imitation Learning for Active Neck Motion Enabling Robot Manipulation beyond the Field of View Yasuo Kuniyoshi Team 2506.17624 null
2025-06-20 Kinematic Model Optimization via Differentiable Contact Manifold for In-Space Manipulation Satyandra K. Gupta Team 2506.17458 null
2025-06-20 Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping Jingjin Yu Team 2506.17110 null
2025-06-24 Learning Accurate Whole-body Throwing with High-frequency Residual Policy and Pullback Tube Acceleration Marco Hutter Team 2506.16986 null
2025-06-20 Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections Shuran Song Team 2506.16685 null
2025-06-19 CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity Yunzhu Li Team 2506.16652 null
2025-06-19 Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control Ran Tian Team 2506.16565 null
2025-06-19 An Optimization-Augmented Control Framework for Single and Coordinated Multi-Arm Robotic Manipulation Ozgur S. Oguz Team 2506.16555 null
2025-06-19 Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining Ding Zhao Team 2506.16475 null
2025-06-19 GoalLadder: Incremental Goal Discovery with Vision-Language Models Shimon Whiteson Team 2506.16396 null
2025-06-19 CapsDT: Diffusion-Transformer for Capsule Robot Manipulation Hongliang Ren Team 2506.16263 null
2025-06-19 ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models Siyuan Huang Team 2506.16211 null
2025-06-19 FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation Wei Tang Team 2506.16201 null
2025-06-19 ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation Jitendra Malik Team 2506.15953 null
2025-06-18 Learning from Planned Data to Improve Robotic Pick-and-Place Planning Efficiency Kensuke Harada Team 2506.15920 null
2025-06-18 Improving Robotic Manipulation: Techniques for Object Pose Estimation, Accommodating Positional Uncertainty, and Disassembly Tasks from Examples Viral Rasik Galaiya Team 2506.15865 null
2025-06-18 Vision in Action: Learning Active Perception from Human Demonstrations Shuran Song Team 2506.15666 null
2025-06-18 Learning Task-Agnostic Skill Bases to Uncover Motor Primitives in Animal Behaviors Anqi Wu Team 2506.15190 null
2025-06-18 Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation Yukiyasu Domae Team 2506.15157 null
2025-06-18 TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality Eiichi Yoshida Team 2506.15146 null
2025-06-17 RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills Chuang Gan Team 2506.14763 null
2025-06-17 Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation Mustafa Mukadam Team 2506.14754 null
2025-06-17 SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning Shuo Wang Team 2506.14648 null
2025-06-17 Latent Action Diffusion for Cross-Embodiment Manipulation Robert K. Katzschmann Team 2506.14608 null
2025-06-19 ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes Hao Dong Team 2506.14317 null
2025-06-17 Steering Robots with Inference-Time Interactions Yanwei Wang Team 2506.14287 null
2025-06-17 AMPLIFY: Actionless Motion Priors for Robot Learning from Videos Animesh Garg Team 2506.14198 null
2025-06-17 Non-Overlap-Aware Egocentric Pose Estimation for Collaborative Perception in Connected Autonomy Peng Gao Team 2506.14180 null
2025-06-17 GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation Yebin Liu Team 2506.14135 null
2025-06-16 ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning Abhishek Gupta Team 2506.13867 null
2025-06-16 Touch begins where vision ends: Generalizable policies for contact-rich manipulation Raunaq Bhirangi Team 2506.13762 null
2025-06-16 Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins Wei-Chiu Ma Team 2506.13761 null
2025-06-16 What Matters in Learning from Large-Scale Datasets for Robot Manipulation Danfei Xu Team 2506.13536 null
2025-06-16 A Survey on Imitation Learning for Contact-Rich Tasks in Robotics Arash Ajoudani Team 2506.13498 null
2025-06-16 Learning Swing-up Maneuvers for a Suspended Aerial Manipulation Platform in a Hierarchical Control Framework Christian Ott Team 2506.13478 null
2025-06-16 VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation Wei Pan Team 2506.13428 null
2025-06-15 SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration Wenwu Zhu Team 2506.12723 null
2025-06-15 Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence Andrea Bajcsy Team 2506.12678 null
2025-06-15 Goal-based Self-Adaptive Generative Adversarial Imitation Learning (Goal-SAGAIL) for Multi-goal Robotic Manipulation Tasks George Vogiatzis Team 2506.12676 null
2025-06-14 AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making Qingyao Wu Team 2506.12374 null
2025-06-13 Role of Uncertainty in Model Development and Control Design for a Manufacturing Process Francis Assadian Team 2506.12273 null
2025-06-13 SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies Danfei Xu Team 2506.11948 null
2025-06-13 mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity Robert K. Katzschmann Team 2506.11916 null
2025-06-13 ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations Maria Bauza Villalonga Team 2506.11775 null
2025-06-13 Control Architecture and Design for a Multi-robotic Visual Servoing System in Automated Manufacturing Environment Rongfei Li Team 2506.11387 null
2025-06-12 Influence Functions for Data Attribution in Linear System Identification and LQR Control Dongmei Chen Team 2506.11293 null
2025-06-12 Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation Cordelia Schmid Team 2506.11261 null
2025-06-12 Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop Angjoo Kanazawa Team 2506.10968 null
2025-06-12 GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation Jiangmiao Pang Team 2506.10966 null
2025-06-12 Human-Robot Navigation using Event-based Cameras and Reinforcement Learning Rodrigo Verschae Team 2506.10790 null
2025-06-12 Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success Kapil Katyal Team 2506.10359 null
2025-06-11 Innovative Adaptive Imaged Based Visual Servoing Control of 6 DoFs Industrial Robot Manipulators Francis Assadian Team 2506.10240 null
2025-06-11 One For All: LLM-based Heterogeneous Mission Planning in Precision Agriculture Stefano Carpin Team 2506.10106 null
2025-06-11 eFlesh: Highly customizable Magnetic Touch Sensing using Cut-Cell Microstructures Raunaq Bhirangi Team 2506.09994 null
2025-06-11 Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation Xiao Ma Team 2506.09990 null
2025-06-11 From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models Chen Feng Team 2506.09930 null
2025-06-11 Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving Chen Lv Team 2506.09800 null
2025-06-11 CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings Davide Boscaini Team 2506.09699 null
2025-06-11 Advances on Affordable Hardware Platforms for Human Demonstration Acquisition in Agricultural Applications Néstor García Team 2506.09494 null
2025-06-11 DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects Hong Liu Team 2506.09491 null
2025-06-11 Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation Le Wang Team 2506.09422 null
2025-06-11 Analyzing Key Objectives in Human-to-Robot Retargeting for Dexterous Manipulation Xiang Li Team 2506.09384 null
2025-06-11 ContextBuddy: AI-Enhanced Contextual Insights for Security Alert Investigation (Applied to Intrusion Detection) Cecile Paris Team 2506.09365 null
2025-06-10 UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation Li Fei-Fei Team 2506.09284 null
2025-06-10 Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism Bolei Zhou Team 2506.09176 null
2025-06-10 FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency Jian Tang Team 2506.08822 null
2025-06-10 Towards Biosignals-Free Autonomous Prosthetic Hand Control via Imitation Learning Xianta Jiang Team 2506.08795 null
2025-06-10 Bayesian Inverse Physics for Neuro-Symbolic Robot Learning Frank Kirchner Team 2506.08756 null
2025-06-10 Deep Reinforcement Learning-Based Motion Planning and PDE Control for Flexible Manipulators Jouni Mattila Team 2506.08639 null
2025-06-10 RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping Gitta Kutyniok Team 2506.08632 null
2025-06-10 Periodic Bipedal Gait Learning Using Reward Composition Based on a Novel Gait Planner for Humanoid Robots Lijun Zhu Team 2506.08416 null
2025-06-11 HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation Cong Wang Team 2506.08296 null
2025-06-09 ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving Xinggang Wang Team 2506.08052 null
2025-06-09 BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Tieniu Tan Team 2506.07961 null
2025-06-09 BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation Xilin Chen Team 2506.07530 null
2025-06-09 Reinforcement Learning via Implicit Imitation Guidance Chelsea Finn Team 2506.07505 null
2025-06-09 RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy Hui Cheng Team 2506.07490 null
2025-06-08 CARoL: Context-aware Adaptation for Robot Learning Xuan Wang Team 2506.07006 null
2025-06-07 SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game Shanghang Zhang Team 2506.06690 null
2025-06-07 RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation Si Liu Team 2506.06677 null
2025-06-07 Self-Adapting Improvement Loops for Robotic Learning Chen Sun Team 2506.06658 null
2025-06-06 Enhancing Robot Safety via MLLM-Based Semantic Interpretation of Failure Data Somil Bansal Team 2506.06570 null
2025-06-06 NeSyPack: A Neuro-Symbolic Framework for Bimanual Logistics Packing Changliu Liu Team 2506.06567 null
2025-06-06 MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping Farshad Khorrami Team 2506.06535 null
2025-06-06 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model Mingkui Tan Team 2506.06199 null
2025-06-06 Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization Tingnan Zhang Team 2506.06196 null
2025-06-10 BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning Rudolf Lioutikov Team 2506.06072 null
2025-06-06 Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning Ping Luo Team 2506.05985 null
2025-06-06 Optimal Robotic Velcro Peeling with Force Feedback Volkan Isler Team 2506.05812 null
2025-06-06 Where Do We Look When We Teach? Analyzing Human Gaze Behavior Across Demonstration Devices in Robot Imitation Learning Hiroshi Bito Team 2506.05808 null
2025-06-06 FlowOE: Imitation Learning with Flow Policy from Ensemble RL Experts for Optimal Execution under Heston Volatility and Concave Market Impacts Zhi Chen Team 2506.05755 null
2025-06-06 You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping Xiangyang Xue Team 2506.05719 null
2025-06-05 A Smooth Sea Never Made a Skilled $\texttt{SAILOR}$: Robust Imitation via Learning to Search Gokul Swamy Team 2506.05294 null
2025-06-05 LiPo: A Lightweight Post-optimization Framework for Smoothing Action Chunks Generated by Learned Policies Suhan Park Team 2506.05165 null
2025-06-05 DemoSpeedup: Accelerating Visuomotor Policies via Entropy-Guided Demonstration Acceleration Huazhe Xu Team 2506.05064 null
2025-06-06 ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning Jian Tang Team 2506.04941 null
2025-06-05 Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion Qi Dou Team 2506.04716 null
2025-06-05 Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning Wanxiang Che Team 2506.04625 null
2025-06-04 SGN-CIRL: Scene Graph-based Navigation with Curriculum, Imitation, and Reinforcement Learning Aleksandr Panov Team 2506.04505 null
2025-06-04 Object-centric 3D Motion Field for Robot Learning from Human Videos Pieter Abbeel Team 2506.04227 null
2025-06-04 Splatting Physical Scenes: End-to-End Real-to-Sim from Imperfect Robot Data Leonard Hasenclever Team 2506.04120 null
2025-06-04 STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization Liqiang Nie Team 2506.03863 link
2025-06-04 SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models Jian Tang Team 2506.03574 null
2025-06-05 Confidence-Guided Human-AI Collaboration: Reinforcement Learning with Distributional Proxy Value Propagation for Autonomous Driving Hu Chuan Team 2506.03568 link
2025-06-03 ORV: 4D Occupancy-centric Robot Video Generation Hao Zhao Team 2506.03079 null
2025-06-03 Geometric Visual Servo Via Optimal Transport Ashutosh Tiwari Team 2506.02768 null
2025-06-03 Rodrigues Network for Learning Robot Actions Leonidas Guibas Team 2506.02618 null
2025-06-03 Reachability Weighted Offline Goal-conditioned Resampling Joni Pajarinen Team 2506.02577 null
2025-06-02 Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning Pheng-Ann Heng Team 2506.01953 null
2025-06-02 Feel the Force: Contact-Driven Learning from Humans Lerrel Pinto Team 2506.01944 null
2025-06-02 Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control Dahua Lin Team 2506.01943 null
2025-06-02 FreeTacMan: Robot-free Visuo-Tactile Data Collection System for Contact-rich Manipulation Hongyang Li Team 2506.01941 null
2025-06-02 Learning with pyCub: A New Simulation and Exercise Framework for Humanoid Robotics Matej Hoffmann Team 2506.01756 null
2025-06-02 Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning Kang Liu Team 2506.01710 link
2025-06-02 WoMAP: World Models For Embodied Open-Vocabulary Object Localization Anirudha Majumdar Team 2506.01600 null
2025-06-02 FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens Yuexin Ma Team 2506.01583 null
2025-06-02 Trajectory First: A Curriculum for Discovering Diverse Policies Marc Toussaint Team 2506.01568 null
2025-06-02 Variational Adaptive Noise and Dropout towards Stable Recurrent Neural Networks Shingo Murata Team 2506.01350 null
2025-06-01 OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation Valts Blukis Team 2506.01196 null
2025-06-01 HoMeR: Learning In-the-Wild Mobile Manipulation via Hybrid Imitation and Whole-Body Control Jeannette Bohg Team 2506.01185 null
2025-06-01 Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning Jing Li Team 2506.00782 null
2025-05-31 XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity Benjamin Busam Team 2506.00599 null
2025-05-31 Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents Zhou Yu Team 2506.00320 null
2025-05-30 3D Gaussian Splat Vulnerabilities Polo Chau Team 2506.00280 null
2025-05-30 Bi-Manual Joint Camera Calibration and Scene Representation Weiming Zhi Team 2505.24819 null
2025-05-30 MagicGripper: A Multimodal Sensor-Integrated Gripper for Contact-Rich Robotic Manipulation Dandan Zhang Team 2505.24382 null
2025-05-30 Imitation Learning-Based Path Generation for the Complex Assembly of Deformable Objects Christoffer Sloth Team 2505.24339 null
2025-05-30 SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping Hao Dong Team 2505.24305 null
2025-05-30 Safety-Aware Robust Model Predictive Control for Robotic Arms in Dynamic Environments Suwoong Lee Team 2505.24209 null
2025-05-30 Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control Guanya Shi Team 2505.24198 null
2025-05-29 Mobi-$π$: Mobilizing Your Robot Learning Policy Jeannette Bohg Team 2505.23692 null
2025-05-30 Normalizing Flows are Capable Models for RL Benjamin Eysenbach Team 2505.23527 null
2025-05-29 Optimization-based Posture Generation for Whole-body Contact Motion by Contact Point Search on the Body Surface Masayuki Inaba Team 2505.23501 null
2025-05-29 Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents Lichao Sun Team 2505.23450 null
2025-05-29 Enhanced DACER Algorithm with High Diffusion Efficiency Shengbo Eben Li Team 2505.23426 null
2025-05-29 RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer Zhizhong Su Team 2505.23171 null
2025-05-28 SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning Yuke Zhu Team 2505.22626 null
2025-05-28 Hybrid Learning for Cold-Start-Aware Microservice Scheduling in Dynamic Edge Environments Weijia Jia Team 2505.22424 link
2025-05-28 Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning Marian Verhelst Team 2505.22404 null
2025-05-28 State and Input Constrained Adaptive Tracking Control of Uncertain Euler-Lagrange Systems with Robustness and Feasibility Analysis Shubhendu Bhasin Team 2505.22352 null
2025-05-28 ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation Wenqiang Zhang Team 2505.22159 null
2025-05-28 Learning Compositional Behaviors from Demonstration and Language Jiajun Wu Team 2505.21981 null
2025-05-29 ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge Yi Xu Team 2505.21906 null
2025-05-28 Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories Siddharth Ancha Team 2505.21851 null
2025-05-27 PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation Tianmin Shu Team 2505.21652 null
2025-05-30 Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks Bryan A. Plummer Team 2505.21649 null
2025-05-27 CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception Tapomayukh Bhattacharjee Team 2505.21495 null
2025-05-27 EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation Robert Platt Team 2505.21351 null
2025-05-27 EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild Gonzalo Ferrer Team 2505.21282 null
2025-05-27 Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations Tanvi Verma Team 2505.21182 null
2025-05-27 Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning George Retsinas Team 2505.20962 null
2025-05-27 Learning Unified Force and Position Control for Legged Loco-Manipulation Siyuan Huang Team 2505.20829 null
2025-05-27 Spatial RoboGrasp: Generalized Robotic Grasping Control Policy Luhui Hu Team 2505.20814 null
2025-05-27 Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt Jianyu Chen Team 2505.20795 null
2025-05-28 ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image Ruohan Gao Team 2505.20498 null
2025-05-26 OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation Farshad Khorrami Team 2505.20425 null
2025-05-26 Co-Design of Soft Gripper with Neural Physics Xiaolong Wang Team 2505.20404 null
2025-05-26 EgoZero: Robot Learning from Smart Glasses Lerrel Pinto Team 2505.20290 null
2025-05-26 URPlanner: A Universal Paradigm For Collision-Free Robotic Motion Planning Based on Deep Reinforcement Learning Marcelo H. Ang Jr Team 2505.20175 null
2025-05-27 MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents Xiaodan Liang Team 2505.20148 link
2025-05-26 ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving Dongbin Zhao Team 2505.20024 link
2025-05-26 Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^π$-Realizable MDPs Luca Viano Team 2505.19946 null
2025-05-26 TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning Dongbin Zhao Team 2505.19769 null
2025-05-26 Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning Jean-Baptiste Mouret Team 2505.19717 null
2025-05-25 Structured Reinforcement Learning for Combinatorial Decision-Making Maximilian Schiffer Team 2505.19053 link
2025-05-25 WorldEval: World Model as Real-World Robot Policies Evaluator Yi Xu Team 2505.19017 null
2025-05-25 Online Knowledge Distillation with Reward Guidance Chen Jia Team 2505.18952 null
2025-05-24 Guided by Guardrails: Control Barrier Functions as Safety Instructors for Robotic Learning Giovanni Beltrame Team 2505.18858 null
2025-05-24 On the Dual-Use Dilemma in Physical Reasoning and Force Nikolaus Correll Team 2505.18792 null
2025-05-24 VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning Ziwei Wang Team 2505.18719 null
2025-05-24 MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations Hong Thanh Nguyen Team 2505.18595 null
2025-05-24 Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning Zhiyun Lin Team 2505.18487 null
2025-05-24 Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy Yu She Team 2505.18474 null
2025-05-24 ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning Yu She Team 2505.18472 null
2025-05-23 ProgRM: Build Better GUI Agents with Progress Rewards Kai Yu Team 2505.18121 null
2025-05-23 Classification of assembly tasks combining multiple primitive actions using Transformers and xLSTMs Pedro Neto Team 2505.18012 null
2025-05-23 Is Single-View Mesh Reconstruction Ready for Robotics? Ingmar Posner Team 2505.17966 null
2025-05-23 SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data Donghyun Kim Team 2505.17695 null
2025-05-23 Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning Giorgia Ramponi Team 2505.17610 null
2025-05-23 Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy Bin Zhao Team 2505.17434 null
2025-05-23 Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space Hui Cheng Team 2505.17389 null
2025-05-22 ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems Farhad Imani Team 2505.17295 null
2025-05-22 CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning Limin Wang Team 2505.17006 null
2025-05-22 3D Equivariant Visuomotor Policy Learning via Spherical Projection Robin Walters Team 2505.16969 null
2025-05-22 Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only Donglin Wang Team 2505.16856 null
2025-05-22 Find the Fruit: Designing a Zero-Shot Sim2Real Deep RL Planner for Occlusion Aware Plant Manipulation Soumik Sarkar Team 2505.16547 null
2025-05-24 ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models Xiuying Chen Team 2505.16517 null
2025-05-22 Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) Junchi Yan Team 2505.16394 null
2025-05-22 TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation Hengdi Zhang Team 2505.16289 null
2025-05-22 SEM: Enhancing Spatial Understanding for Robust Robot Manipulation Zhizhong Su Team 2505.16196 null
2025-05-22 Tactile-based Reinforcement Learning for Adaptive Grasping under Observation Uncertainties Yang Ye Team 2505.16167 null
2025-05-21 WaveTouch: Active Tactile Sensing Using Vibro-Feedback for Classification of Variable Stiffness and Infill Density Objects Bakhtiyar Orazbayev Team 2505.16062 null
2025-05-25 Proactive Hierarchical Control Barrier Function-Based Safety Prioritization in Close Human-Robot Interaction Scenarios Prashanth Krishnamurthy Team 2505.16055 null
2025-05-21 UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning Si Liu Team 2505.15725 null
2025-05-21 Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization Junwei Liang Team 2505.15660 null
2025-05-21 FLARE: Robot Learning with Implicit World Modeling Linxi Fan Team 2505.15659 null
2025-05-21 Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Ken Goldberg Team 2505.15517 null
2025-05-21 Guided Policy Optimization under Partial Observability Zongqing Lu Team 2505.15418 link
2025-05-21 Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control Jungwook Choi Team 2505.15304 null
2025-05-21 Learning-based Autonomous Oversteer Control and Collision Avoidance Seung-Hyun Kong Team 2505.15275 null
2025-05-21 Filtering Learning Histories Enhances In-Context Reinforcement Learning Santiago Paternain Team 2505.15143 null
2025-05-21 Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation Xiaodong He Team 2505.15098 null
2025-05-20 RoboCulture: A Robotics Platform for Automated Biological Experimentation Milica Radisic Team 2505.14941 null
2025-05-20 Imitation Learning via Focused Satisficing Brian Ziebart Team 2505.14820 null
2025-05-20 DORA: Object Affordance-Guided Reinforcement Learning for Dexterous Robotic Manipulation Jianwei Zhang Team 2505.14819 null
2025-05-20 Vid2World: Crafting Video Diffusion Models to Interactive World Models Mingsheng Long Team 2505.14357 null
2025-05-20 AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory Ping Luo Team 2505.14030 null
2025-05-20 RLVR-World: Training World Models with Reinforcement Learning Mingsheng Long Team 2505.13934 link
2025-05-20 Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning Yutong Ban Team 2505.13925 null
2025-05-20 Learning to Insert for Constructive Neural Vehicle Routing Solver Qingfu Zhang Team 2505.13904 null
2025-05-20 Structured Agent Distillation for Large Language Model Yanzhi Wang Team 2505.13820 null
2025-05-21 Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation Georgia Chalvatzaki Team 2505.13667 null
2025-05-19 TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion Minh Nhat Vu Team 2505.13549 null
2025-05-19 GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation Rose Hendrix Team 2505.13441 null
2025-05-19 KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture R. James Cotton Team 2505.13436 null
2025-05-19 TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation Jiangmiao Pang Team 2505.12748 null
2025-05-19 Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation Chi-Wing Fu Team 2505.12744 null
2025-05-19 Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning Taesup Moon Team 2505.12737 null
2025-05-19 DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories Linxi Fan Team 2505.12705 null
2025-05-19 Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion Qi Wu Team 2505.12679 null
2025-05-19 HIL: Hybrid Imitation Learning of Diverse Parkour Skills from Videos Xue Bin Peng Team 2505.12619 null
2025-05-18 MTIL: Encoding Full History with Mamba for Temporal Imitation Learning Zhouping Yin Team 2505.12410 link
2025-05-18 PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis Zhipong Cai Team 2505.12294 null
2025-05-20 RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction Bo Zhao Team 2505.12224 null
2025-05-20 Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study Hae-Won Park Team 2505.12222 null
2025-05-17 L2D2: Robot Learning from 2D Drawings Dylan P. Losey Team 2505.12072 null
2025-05-17 H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos Shanghang Zhang Team 2505.11920 null
2025-05-17 GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation Junwei Liang Team 2505.11865 null
2025-05-17 Learning IMU Bias with Diffusion Model Guoquan Huang Team 2505.11763 null
2025-05-16 Zero-Shot Visual Generalization in Robot Manipulation Gaurav Sukhatme Team 2505.11719 null
2025-05-16 Employing Laban Shape for Generating Emotionally and Functionally Expressive Trajectories in Robotic Manipulators Alessandro Roncone Team 2505.11716 null
2025-05-16 EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video Jian Zhang Team 2505.11709 null
2025-05-16 Grounded Task Axes: Zero-Shot Semantic Skill Generalization via Task-Axis Controllers and Visual Foundation Models Oliver Kroemer Team 2505.11680 null
2025-05-16 SHIELD: Safety on Humanoids via CBFs In Expectation on Learned Dynamics Aaron D. Ames Team 2505.11494 null
2025-05-16 Exploiting Radiance Fields for Grasp Generation on Novel Synthetic Views Todor Stoyanov Team 2505.11467 null
2025-05-16 ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations Jesse Zhang Team 2505.10911 null
2025-05-16 Counterfactual Behavior Cloning: Offline Imitation Learning from Imperfect Human Demonstrations Dylan P. Losey Team 2505.10760 null
2025-05-15 Infinigen-Sim: Procedural Generation of Articulated Simulation Assets Jia Deng Team 2505.10755 null
2025-05-15 Knowledge capture, adaptation and composition (KCAC): A framework for cross-task curriculum learning in robotic manipulation Yan Jin Team 2505.10522 null
2025-05-15 IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning Junshan Zhang Team 2505.10442 null
2025-05-15 NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning Chengyuan Chen Team 2505.10359 null
2025-05-15 SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning Axel Krieger Team 2505.10251 null
2025-05-15 Training People to Reward Robots Matthew Howard Team 2505.10151 null
2025-05-15 EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation Jianye Hao Team 2505.10105 null
2025-05-15 FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation Qing Li Team 2505.10075 null
2025-05-15 APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots Guillaume Sartoretti Team 2505.10022 null
2025-05-15 ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts Yang Yu Team 2505.10010 link
2025-05-16 PointArena: Probing Multimodal Grounding Through Language-Guided Pointing Ranjay Krishna Team 2505.09990 null
2025-05-15 Learning Diverse Natural Behaviors for Enhancing the Agility of Quadrupedal Robots Chunlin Chen Team 2505.09979 null
2025-05-14 Learning Rock Pushability on Rough Planetary Terrain Cagri Kilic Team 2505.09833 null
2025-05-14 Trailblazer: Learning offroad costmaps for long range planning Srikanth Saripalli Team 2505.09739 null
2025-05-14 EnerVerse-AC: Envisioning Embodied Environments with Action Condition Guanghui Ren Team 2505.09723 null
2025-05-14 ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation Daniel Seita Team 2505.09698 null
2025-05-14 DataMIL: Selecting Data for Robot Imitation Learning with Datamodels Roberto Martín-Martín Team 2505.09603 null
2025-05-14 Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware Ken Goldberg Team 2505.09601 null
2025-05-14 VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation Shuo Wang Team 2505.09577 null
2025-05-14 Learning Long-Context Diffusion Policies via Past-Token Prediction Chelsea Finn Team 2505.09561 null
2025-05-14 Distilling Realizable Students from Unrealizable Teachers Sanjiban Choudhury Team 2505.09546 null
2025-05-14 Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion Qixin Cao Team 2505.09424 null
2025-05-14 Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model Keith Ross Team 2505.09308 null
2025-05-14 Latent Theory of Mind: A Decentralized Diffusion Architecture for Cooperative Manipulation Guillaume Sartoretti Team 2505.09144 null
2025-05-14 FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis He Wang Team 2505.09109 null
2025-05-14 Imitation Learning for Adaptive Control of a Virtual Soft Exoglove Letizia Gionfrida Team 2505.09099 null
2025-05-13 ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate, Irregular Bio-products Manipulation Dongyi Wang Team 2505.08986 null
2025-05-13 Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness Wolfram Burgard Team 2505.08627 null
2025-05-13 Beyond Predefined Actions: Integrating Behavior Trees and Dynamic Movement Primitives for Robot Learning from Demonstration Todor Stoyanov Team 2505.08625 null
2025-05-13 From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation Jianye Hao Team 2505.08548 null
2025-05-13 Parameter Estimation using Reinforcement Learning Causal Curiosity: Limits and Challenges Weisi Guo Team 2505.08453 null
2025-05-13 Adaptive Diffusion Policy Optimization for Robotic Manipulation Zhuang Yang Team 2505.08376 null
2025-05-13 Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation Qianchun Lu Team 2505.08364 null
2025-05-13 Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning Biwei Huang Team 2505.08361 null
2025-05-13 HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands Yunhui Liu Team 2505.08213 null
2025-05-13 CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding Shuo Wang Team 2505.08194 null
2025-05-12 What Matters for Batch Online Reinforcement Learning in Robotics? Chelsea Finn Team 2505.08078 null
2025-05-12 H$^{\mathbf{3}}$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning Huazhe Xu Team 2505.07819 null
2025-05-12 Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models Jia-Bin Huang Team 2505.07815 null
2025-05-12 Improving Trajectory Stitching with Flow Models Ioannis Havoutis Team 2505.07802 null
2025-05-12 Guiding Data Collection via Factored Scaling Curves Anirudha Majumdar Team 2505.07728 null
2025-05-12 GelFusion: Enhancing Robotic Manipulation under Visual Constraints via Visuotactile Fusion Peng Yin Team 2505.07455 null
2025-05-12 ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning Donglin Wang Team 2505.07395 null
2025-05-11 X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real Sanjiban Choudhury Team 2505.07096 null
2025-05-11 YOPOv2-Tracker: An End-to-End Agile Tracking and Navigation Framework from Perception to Action Bailing Tian Team 2505.06923 null
2025-05-10 JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes Harish Ravichandar Team 2505.06771 null
2025-05-10 Learned IMU Bias Prediction for Invariant Visual Inertial Odometry Nikolay Atanasov Team 2505.06748 null
2025-05-10 ACORN: Adaptive Contrastive Optimization for Safe and Robust Fine-Grained Robotic Manipulation Zixian Yue Team 2505.06628 null
2025-05-10 Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach Xiaokang Yang Team 2505.06482 null
2025-05-09 Adaptive Wiping: Adaptive contact-rich manipulation through few-shot imitation learning with Force-Torque feedback and pre-trained object representations Gentiane Venture Team 2505.06451 null
2025-05-09 VIN-NBV: A View Introspection Network for Next-Best-View Selection for Resource-Efficient 3D Reconstruction Roni Sengupta Team 2505.06219 null
2025-05-09 Neuro-Symbolic Concepts Jiajun Wu Team 2505.06191 null
2025-05-07 Efficient Sensorimotor Learning for Open-world Robot Manipulation Yifeng Zhu Team 2505.06136 null
2025-05-09 Robot Learning Using Multi-Coordinate Elastic Maps Reza Azadeh Team 2505.06092 null
2025-05-09 TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations Abhinav Shrivastava Team 2505.06079 null
2025-05-09 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks Farshad Khorrami Team 2505.05800 null
2025-05-09 Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives Mac Schwager Team 2505.05787 null
2025-05-09 FlowHFT: Flow Policy Induced Optimal High-Frequency Trading under Diverse Market Conditions Steve Yang Team 2505.05784 null
2025-05-08 CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations Stephen Tu Team 2505.04999 null
2025-05-08 CubeDAgger: Improved Robustness of Interactive Imitation Learning without Violation of Dynamic Stability Taisuke Kobayashi Team 2505.04897 null
2025-05-08 D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation Daniel Seita Team 2505.04860 null
2025-05-07 Steerable Scene Generation with Post Training and Inference-Time Search Russ Tedrake Team 2505.04831 null
2025-05-07 Primal-dual algorithm for contextual stochastic combinatorial optimization Axel Parmentier Team 2505.04757 null
2025-05-07 Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation Henrik I. Christensen Team 2505.04619 null
2025-05-06 OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation Donglin Wang Team 2505.03912 null
2025-05-06 AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control Xiaolong Wang Team 2505.03738 null
2025-05-06 Meta-Optimization and Program Search using Language Models for Task and Motion Planning Marc Toussaint Team 2505.03725 null
2025-05-06 Ergodic Generative Flows Yinchuan Li Team 2505.03561 null
2025-05-06 RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation Sifa Zheng Team 2505.03344 null
2025-05-06 The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning Abhinav Valada Team 2505.03296 null
2025-05-05 Sim2Real Transfer for Vision-Based Grasp Verification Markus Vincze Team 2505.03046 link
2025-05-05 Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks Jia Deng Team 2505.02915 null
2025-05-05 Re-purposing a modular origami manipulator into an adaptive physical computer for machine learning and robotic perception Suyi Li Team 2505.02744 null
2025-05-05 Spatiotemporal Non-Uniformity-Aware Online Task Scheduling in Collaborative Edge Computing for Industrial Internet of Things Bo Lei Team 2505.02597 null
2025-05-05 Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning Jianqiang Li Team 2505.02483 null
2025-05-05 MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans Siyuan Huang Team 2505.02388 null
2025-05-04 Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning Hao Su Team 2505.02228 null
2025-05-04 CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation Hao Dong Team 2505.02166 null
2025-05-04 Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions Mingyu Ding Team 2505.02152 null
2025-05-03 Act Natural! Extending Naturalistic Projection to Multimodal Behavior Scenarios David Fridovich-Keil Team 2505.01945 null
2025-05-07 RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation Xiaodan Liang Team 2505.01709 null
2025-05-02 FalconWing: An Open-Source Platform for Ultra-Light Fixed-Wing Aircraft Research Sayan Mitra Team 2505.01383 null
2025-05-06 Robotic Visual Instruction Xianzheng Ma Team 2505.00693 null
2025-05-01 Towards Autonomous Micromobility through Scalable Urban Simulation Bolei Zhou Team 2505.00690 null
2025-05-01 DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation Yang Gao Team 2505.00527 null
2025-05-01 Optimal Interactive Learning on the Job via Facility Location Planning George Konidaris Team 2505.00490 null
2025-04-30 LLM-based Interactive Imitation Learning for Robotic Manipulation Stefan Wermter Team 2504.21769 null
2025-04-30 RoboGround: Robotic Manipulation with Grounded Vision-Language Priors Zhou Zhao Team 2504.21530 null
2025-04-30 Provably-Safe, Online System Identification Ram Vasudevan Team 2504.21486 null
2025-04-29 TesserAct: Learning 4D Embodied World Models Chuang Gan Team 2504.20995 null
2025-04-29 XPG-RL: Reinforcement Learning with Explainable Priority Guidance for Efficiency-Boosted Mechanical Search Elena Shrestha Team 2504.20969 null
2025-04-29 PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations Xuguang Lan Team 2504.20520 null
2025-04-29 SPARK Hand: Scooping-Pinching Adaptive Robotic Hand with Kempe Mechanism for Vertical Passive Grasp in Environmental Constraints Wenzeng Zhang Team 2504.20506 null
2025-04-28 UTTG_ A Universal Teleoperation Approach via Online Trajectory Generation Hesheng Wang Team 2504.19736 null
2025-04-28 GPA-RAM: Grasp-Pretraining Augmented Robotic Attention Mamba for Spatial Task Learning Mengyuan Liu Team 2504.19683 null
2025-04-27 PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies Edward Adelson Team 2504.19341 null
2025-04-29 Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation Marco Hutter Team 2504.19322 link
2025-04-27 Learning to Drive from a World Model Yassine Yousfi Team 2504.19077 null
2025-04-26 RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning Pieter Abbeel Team 2504.18904 null
2025-04-26 Imitation Learning for Autonomous Driving: Insights from Real-World Testing Tufan Kumbasar Team 2504.18847 null
2025-04-26 Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots Alfredo Weitzenfeld Team 2504.18794 null
2025-04-26 STDArm: Transferring Visuomotor Policies From Static Data Training to Dynamic Robot Manipulation Yanyong Zhang Team 2504.18792 null
2025-04-25 Generalization Capability for Imitation Learning Yixiao Wang Team 2504.18538 null
2025-04-25 Instrumentation for Better Demonstrations: A Case Study Francis wyffels Team 2504.18481 null
2025-04-25 Action Flow Matching for Continual Robot Learning Lantao Liu Team 2504.18471 null
2025-04-25 Design and Evaluation of a UGV-Based Robotic Platform for Precision Soil Moisture Remote Sensing George Nikolakopoulos Team 2504.18284 null
2025-04-28 Implementation Analysis of Collaborative Robot Digital Twins in Physics Engines Hans D. Schotten Team 2504.18200 null
2025-04-25 Offline Learning of Controllable Diverse Behaviors Ludovic Denoyer Team 2504.18160 null
2025-04-24 CIVIL: Causal and Intuitive Visual Imitation Learning Dylan P. Losey Team 2504.17959 null
2025-04-24 Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning Prithviraj Ammanabrolu Team 2504.17950 null
2025-04-24 Learning Attentive Neural Processes for Planning with Pushing Actions Nicholas Roy Team 2504.17924 null
2025-04-24 CaRL: Learning Scalable Planning Policies with Simple Rewards Andreas Geiger Team 2504.17838 null
2025-04-23 Learning Underwater Active Perception in Simulation Donald G. Dansereau Team 2504.17817 null
2025-04-24 Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation Jiangmiao Pang Team 2504.17784 null
2025-04-24 Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control Dong Xuan Team 2504.17771 null
2025-04-24 Robotic Grinding Skills Learning Based on Geodesic Length Dynamic Motion Primitives Han Ding Team 2504.17216 null
2025-04-23 Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators Roberto Horowitz Team 2504.17080 null
2025-04-23 A Systematic Approach to Design Real-World Human-in-the-Loop Deep Reinforcement Learning: Salient Features, Challenges and Trade-offs Younes Zerouali Team 2504.17006 null
2025-04-23 Latent Diffusion Planning for Imitation Learning Chelsea Finn Team 2504.16925 null
2025-04-23 MOSAIC: A Skill-Centric Algorithmic Framework for Long-Horizon Manipulation Planning Maxim Likhachev Team 2504.16738 null
2025-04-23 ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance Shanghang Zhang Team 2504.16464 null
2025-04-22 Mass-Adaptive Admittance Control for Robotic Manipulators Logan E. Beaver Team 2504.16224 null
2025-04-22 $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization Ury Zhilinsky Team 2504.16054 null
2025-04-22 SPECI: Skill Prompts based Hierarchical Continual Imitation Learning for Robot Manipulation Xiangli Nie Team 2504.15561 null
2025-04-22 VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation Matei Ciocarlie Team 2504.15535 null
2025-04-22 Few-Shot Vision-Language Action-Incremental Policy Learning Weili Guan Team 2504.15517 null
2025-04-21 LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning Boyuan Chen Team 2504.15472 null
2025-04-23 Advancing Embodied Intelligence in Robotic-Assisted Endovascular Procedures: A Systematic Review of AI Solutions Peng Qi Team 2504.15327 null
2025-04-21 Immersive Teleoperation Framework for Locomanipulation Tasks Dimitrios Kanoulas Team 2504.15229 null
2025-04-21 A Genetic Fuzzy-Enabled Framework on Robotic Manipulation for In-Space Servicing Kelly Cohen Team 2504.15226 null
2025-04-21 A General Infrastructure and Workflow for Quadrotor Deep Reinforcement Learning and Reality Deployment Huaping Liu Team 2504.15129 null
2025-04-21 SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks Animesh Garg Team 2504.14857 null
2025-04-20 Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline Hongsheng Li Team 2504.14709 null
2025-04-24 Latent Representations for Visual Proprioception in Inexpensive Robots Ladislau Bölöni Team 2504.14634 null
2025-04-18 DiffOG: Differentiable Policy Trajectory Optimization with Generalizability Yu She Team 2504.13807 null
2025-04-18 Imitation Learning with Precisely Labeled Human Demonstrations Yilong Song Team 2504.13803 null
2025-04-21 SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM Javier Civera Team 2504.13713 link
2025-04-18 Self-Mixing Laser Interferometry: In Search of an Ambient Noise-Resilient Alternative to Acoustic Sensing Francis wyffels Team 2504.13711 null
2025-04-18 On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting Jan Peters Team 2504.13618 null
2025-04-18 A Model-Based Approach to Imitation Learning through Multi-Step Predictions Na Li Team 2504.13413 null
2025-04-17 RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins Ping Luo Team 2504.13059 null
2025-04-17 Adaptive Task Space Non-Singular Terminal Super-Twisting Sliding Mode Control of a 7-DOF Robotic Manipulator E. Witrant Team 2504.13056 null
2025-04-17 Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation Iman Soltani Team 2504.12967 null
2025-04-17 TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors Yi Yang Team 2504.12799 null
2025-04-17 Trajectory Adaptation using Large Language Models Ravi Prakash Team 2504.12755 null
2025-04-17 Embodied Neuromorphic Control Applied on a 7-DOF Robotic Manipulator Lei Wang Team 2504.12702 link
2025-04-21 A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation Xiaodan Liang Team 2504.12636 null
2025-04-17 Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration Jeannette Bohg Team 2504.12609 null
2025-04-16 Adapting a World Model for Trajectory Following in a 3D Game Raluca Georgescu Team 2504.12299 null
2025-04-16 Towards Forceful Robotic Foundation Models: a Literature Survey Nikolaus Correll Team 2504.11827 null
2025-04-14 Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning Fei Liu Team 2504.11493 link
2025-04-15 Next-Future: Sample-Efficient Policy Learning for Robotic-Arm Tasks Suryansh Kumar Team 2504.11247 null
2025-04-17 CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image Yi Zhu Team 2504.11230 null
2025-04-15 Superfast Configuration-Space Convex Set Computation on GPUs for Online Motion Planning Daniela Rus Team 2504.10783 link
2025-04-14 Improving In-Context Learning with Reasoning Distillation Xiang Gao Team 2504.10647 null
2025-04-14 Flying Hand: End-Effector-Centric Framework for Versatile Aerial Manipulation Teleoperation and Policy Learning Guanya Shi Team 2504.10334 null
2025-04-14 Look-to-Touch: A Vision-Enhanced Proximity and Tactile Sensor for Distance and Geometry Perception in Robotic Manipulation Guoying Gu Team 2504.10280 null
2025-04-14 Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models Hui Cheng Team 2504.10041 link
2025-04-14 Efficient Task-specific Conditional Diffusion Policies: Shortcut Model Acceleration and SO(3) Optimization Wei Sui Team 2504.09927 null
2025-04-12 Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators Marco M. Nicotra Team 2504.09188 null
2025-04-11 BiFlex: A Passive Bimodal Stiffness Flexible Wrist for Manipulation in Unstructured Environments Roberto Martín-Martín Team 2504.08706 null
2025-04-11 Diffusion Models for Robotic Manipulation: A Survey Rania Rayyes Team 2504.08438 null
2025-04-10 Echo: An Open-Source, Low-Cost Teleoperation System with Force Feedback for Dataset Collection in Robot Learning Dzmitry Tsetserukou Team 2504.07939 null
2025-04-10 TOCALib: Optimal control library with interpolation for bimanual manipulation and obstacles avoidance Aleksandr Panov Team 2504.07708 null
2025-04-10 Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction Hesheng Wang Team 2504.07375 link
2025-04-09 Adaptive Vision-Guided Robotic Arm Control for Precision Pruning in Dynamic Orchard Environments Manoj Karkee Team 2504.07309 null
2025-04-09 AssistanceZero: Scalably Solving Assistance Games Anca Dragan Team 2504.07091 link
2025-04-09 Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation Huazhe Xu Team 2504.06961 null
2025-04-09 Developing Modular Grasping and Manipulation Pipeline Infrastructure to Streamline Performance Benchmarking Holly Yanco Team 2504.06819 null
2025-04-09 Interactive Expressive Motion Generation Using Dynamic Movement Primitives Kai O. Arras Team 2504.06735 null
2025-04-09 Overcoming Dynamic Environments: A Hybrid Approach to Motion Planning for Manipulators Gavin Paul Team 2504.06596 null
2025-04-09 CAFE-AD: Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving Yanyong Zhang Team 2504.06584 link
2025-04-09 OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning Tyler Fenstermaker Team 2504.06538 null
2025-04-08 ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface Rui Chen Team 2504.06156 null
2025-04-08 MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos Marc Pollefeys Team 2504.06084 null
2025-04-08 Learning-enhanced electronic skin for tactile sensing on deformable surface based on electrical impedance tomography Yunjie Yang Team 2504.05987 null
2025-04-08 Stratified Expert Cloning with Adaptive Selection for User Retention in Large-Scale Recommender Systems Yongqi Liu Team 2504.05628 null
2025-04-08 TW-CRL: Time-Weighted Contrastive Reward Learning for Efficient Inverse Reinforcement Learning Stephen Xia Team 2504.05585 null
2025-04-07 SPARK-Remote: A Cost-Effective System for Remote Bimanual Robot Teleoperation Karthik Desingh Team 2504.05488 null
2025-04-07 RobustDexGrasp: Robust Dexterous Grasping of General Objects from Single-view Perception Jie Song Team 2504.05287 null
2025-04-07 Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation Wei Zhang Team 2504.05225 link
2025-04-07 Wavelet Policy: Imitation Policy Learning in Frequency Domain with Wavelet Transforms Hongrui Zhu Team 2504.04991 null
2025-04-07 Embodied Perception for Test-time Grasping Detection Adaptation with Knowledge Infusion Fengyu Zhou Team 2504.04795 null
2025-04-06 Tool-as-Interface: Learning Robot Policies from Human Tool Usage through Imitation Learning Katherine Driggs-Campbell Team 2504.04612 null
2025-04-06 Diffusion-Based Approximate MPC: Fast and Consistent Imitation of Multi-Modal Action Distributions Katherine J. Kuchenbecker Team 2504.04603 null
2025-04-06 DexTOG: Learning Task-Oriented Dexterous Grasp with Language Cewu Lu Team 2504.04573 null
2025-04-06 DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Cluttered Environments Lin Shao Team 2504.04516 null
2025-04-06 Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers Yuke Zhu Team 2504.04395 null
2025-04-05 ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning Robert K. Katzschmann Team 2504.04259 null
2025-04-09 Digital Gene: Learning about the Physical World through Analytic Concepts Cewu Lu Team 2504.04170 null
2025-04-04 Dexterous Manipulation through Imitation Learning: A Survey Hong Zhang Team 2504.03515 null
2025-04-04 GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction Weiming Zhi Team 2504.03129 null
2025-04-03 Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets Abhishek Gupta Team 2504.02792 null
2025-04-03 Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision Shibiao Xu Team 2504.02477 null
2025-04-02 RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics Qiang Nie Team 2504.02069 null
2025-04-02 Slot-Level Robotic Placement via Visual Imitation from Single Human Video Arsalan Mousavian Team 2504.01959 null
2025-04-02 Learning with Imperfect Models: When Multi-step Prediction Mitigates Compounding Error Nikolai Matni Team 2504.01766 null
2025-04-02 TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication Karla Stepanova Team 2504.01708 null
2025-04-02 8-DoFs Cable Driven Parallel Robots for Bimanual Teleportation Josie Hughes Team 2504.01554 null
2025-04-02 Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers Yuki Uranishi Team 2504.01301 null
2025-04-02 The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction Matthew K. X. J. Pan Team 2504.01260 null
2025-04-01 Energy Weighted Learning Progress Guided Interleaved Multi-Task Learning Erhan Oztop Team 2504.00707 null
2025-04-01 Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs Masaya Kinoshita Team 2504.00614 null
2025-04-01 Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation Dong Wang Team 2504.00420 null
2025-03-31 CBIL: Collective Behavior Imitation Learning for Fish from Real Videos Taku Komura Team 2504.00234 null
2025-04-02 Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation Yuke Zhu Team 2503.24361 null
2025-04-02 AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World Sergey Levine Team 2503.24278 link
2025-03-31 HACTS: a Human-As-Copilot Teleoperation System for Robot Learning Jian Tang Team 2503.24070 null
2025-03-31 Learning 3D-Gaussian Simulators from RGB Videos Georg Martius Team 2503.24009 null
2025-03-31 ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos Dinesh Jayaraman Team 2503.23877 link
2025-03-31 Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile as Feedback Enables Pure Simulation Learning Yue Wang Team 2503.23835 null
2025-03-30 Can Visuo-motor Policies Benefit from Random Exploration Data? A Case Study on Stacking Florian T. Pokorny Team 2503.23571 null

(back to top)

VLM

Publish Date Title Authors PDF Code
2025-07-23 BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems Christian Berger Team 2507.17722 null
2025-07-23 InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation Jiangmiao Pang Team 2507.17520 null
2025-07-23 Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection Elisa Ricci Team 2507.17456 null
2025-07-23 VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization Shoaib Ehsan Team 2507.17455 null
2025-07-23 Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection Xi Li Team 2507.17436 null
2025-07-23 Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models Guanghui Sun Team 2507.17379 null
2025-07-23 RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding Tianyang Wang Team 2507.17353 null
2025-07-23 HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study Maria Spence Team 2507.17118 null
2025-07-23 FedVLM: Scalable Personalized Vision-Language Models through Federated Learning Habeeb Olufowobi Team 2507.17088 null
2025-07-22 VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings Kannan Achan Team 2507.17080 null
2025-07-22 Controllable Hybrid Captioner for Improved Long-form Video Understanding Arun Reddy Team 2507.17047 null
2025-07-22 Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning Kai Chen Team 2507.16814 null
2025-07-22 Cooling Matters: Benchmarking Large Language Models and Vision-Language Models on Liquid-Cooled Versus Air-Cooled H100 GPU Systems Arslan Munir Team 2507.16781 null
2025-07-22 Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation Ke Yang Team 2507.16716 null
2025-07-22 Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory Marco Hutter Team 2507.16713 null
2025-07-22 Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models Chao Zhang Team 2507.16524 null
2025-07-22 SceneLoom: Communicating Data with Scene Context Siming Chen Team 2507.16466 null
2025-07-22 Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models Isao Echizen Team 2507.16257 null
2025-07-22 SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction Jiaqi Wang Team 2507.15852 null
2025-07-21 Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models Erkut Erdem Team 2507.15824 null
2025-07-23 Visual-Language Model Knowledge Distillation Method for Image Quality Assessment Jiarun Song Team 2507.15680 null
2025-07-21 Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging Margret Keuper Team 2507.15576 null
2025-07-21 HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation Robby T. Tan Team 2507.15542 null
2025-07-21 Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner Lin Ma Team 2507.15509 null
2025-07-21 One Last Attention for Your Vision-Language Model Zhiqiang Shen Team 2507.15480 null
2025-07-21 EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent Xinlei Chen Team 2507.15428 null
2025-07-21 In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems Christoph Busch Team 2507.15285 null
2025-07-21 VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving Tong Heng Lee Team 2507.15266 null
2025-07-20 Survey of GenAI for Automotive Software Development: From Requirements to Executable Code Alois Knoll Team 2507.15025 null
2025-07-20 Hierarchical Cross-modal Prompt Learning for Vision-Language Models Zhenhua Huang Team 2507.14976 null
2025-07-20 FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models Mengnan Du Team 2507.14823 null
2025-07-19 IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark Ruiheng Zhang Team 2507.14449 null
2025-07-18 CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation Nicolas Thome Team 2507.14312 null
2025-07-18 In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding Leonid Sigal Team 2507.14298 null
2025-07-18 VLA-Mark: A cross modal watermark for large vision-language alignment model Xuming Hu Team 2507.14067 null
2025-07-18 EdgeVLA: Efficient Vision-Language-Action Models Benjamin Bolte Team 2507.14049 null
2025-07-18 Moodifier: MLLM-Enhanced Emotion-Driven Image Editing Sharon X. Huang Team 2507.14024 null
2025-07-18 When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models Alberto Cazzaniga Team 2507.13868 null
2025-07-18 Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions Jiajun Zhang Team 2507.13773 null
2025-07-17 LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning Margrit Betke Team 2507.13568 null
2025-07-17 COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark Vasu Sharma Team 2507.13405 null
2025-07-17 VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning Jiaya Jia Team 2507.13348 null
2025-07-17 Leveraging Language Prior for Infrared Small Target Detection Pravendra Singh Team 2507.13113 null
2025-07-17 GLAD: Generalizable Tuning for Vision-Language Models Shifeng Chen Team 2507.13089 null
2025-07-17 Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection Changwen Zheng Team 2507.13061 null
2025-07-21 LaViPlan: Language-Guided Visual Path Planning with RLVR Hayeon Oh Team 2507.12911 null
2025-07-17 City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning Xiaowen Chu Team 2507.12795 null
2025-07-16 VLMgineer: Vision Language Models as Robotic Toolsmiths Dinesh Jayaraman Team 2507.12644 null
2025-07-16 NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting Chaoli Wang Team 2507.12621 null
2025-07-16 MindJourney: Test-Time Scaling with World Models for Spatial Reasoning Chuang Gan Team 2507.12508 null
2025-07-16 ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving Xinge Zhu Team 2507.12499 null
2025-07-15 Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering Dimosthenis Karatzas Team 2507.12490 null
2025-07-20 PhysX-3D: Physical-Grounded 3D Asset Generation Ziwei Liu Team 2507.12465 null
2025-07-16 Describe Anything Model for Visual Question Answering on Text-rich Images Min Xu Team 2507.12441 null
2025-07-16 AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models Sihao Ding Team 2507.12414 null
2025-07-16 Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models Bernhard Kainz Team 2507.12236 null
2025-07-16 InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing Wen-Huang Cheng Team 2507.12060 null
2025-07-16 GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models Rongrong Ji Team 2507.11969 null
2025-07-16 POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering Qin Jin Team 2507.11939 null
2025-07-15 Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis Lihang Ying Team 2507.11730 null
2025-07-18 How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study Rossella Arcucci Team 2507.11200 null
2025-07-15 Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities Yang Zhang Team 2507.11155 null
2025-07-15 Assessing Color Vision Test in Large Vision-language Models Hongyang Chen Team 2507.11153 null
2025-07-15 MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models Hamza Moustafa Team 2507.11114 null
2025-07-15 Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander Lei Chen Team 2507.11079 null
2025-07-15 Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection Guanzhong Tian Team 2507.11003 null
2025-07-14 EmbRACE-3K: Embodied Reasoning and Action in Complex Environments Xiaojuan Qi Team 2507.10548 null
2025-07-14 CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding Yi Wang Team 2507.10449 null
2025-07-14 Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter Bin Luo Team 2507.10355 null
2025-07-14 Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection Wenqiang Zhang Team 2507.10225 null
2025-07-14 BlueGlass: A Framework for Composite AI Safety Kay-Ulrich Scholl Team 2507.10106 null
2025-07-14 Foundation Model Driven Robotics: A Comprehensive Review Ammar Waheed Team 2507.10087 null
2025-07-14 LayLens: Improving Deepfake Understanding through Simplified Explanations Abhinav Dhall Team 2507.10066 null
2025-07-14 CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books Dimosthenis Karatzas Team 2507.10053 null
2025-07-14 Text-Driven Causal Representation Learning for Source-Free Domain Generalization Zhen Lei Team 2507.09961 null
2025-07-13 NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection Pulei Xiong Team 2507.09795 null
2025-07-13 Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score Muhammad Haris Khan Team 2507.09615 null
2025-07-13 Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations Guiguang Ding Team 2507.09500 null
2025-07-13 GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? Huaxiu Yao Team 2507.09491 null
2025-07-12 Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models Tat-Seng Chua Team 2507.09209 null
2025-07-12 MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models Dahan Wang Team 2507.09184 null
2025-07-12 OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering Niaz Abdolrahim Team 2507.09155 null
2025-07-12 RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze Honghan Wu Team 2507.09097 null
2025-07-11 BlindSight: Harnessing Sparsity for Efficient VLMs Steven K. Reinhardt Team 2507.09071 null
2025-07-11 Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery Seana Coulson Team 2507.09011 null
2025-07-11 VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models Olivier Déforges Team 2507.08982 null
2025-07-11 ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way Subarna Tripathi Team 2507.08679 null
2025-07-11 Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance András Lőrincz Team 2507.08624 null
2025-07-11 Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data Ambedkar Dukkipati Team 2507.08610 null
2025-07-11 BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis Hui Xiong Team 2507.08607 null
2025-07-11 Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R Sanidhya Kashyap Team 2507.08505 null
2025-07-11 LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning Lei Fan Team 2507.08496 null
2025-07-11 Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models Jianping Fan Team 2507.08410 null
2025-07-11 Making VLMs More Robot-Friendly: Self-Critical Distillation of Low-Level Procedural Reasoning Yejin Choi Team 2507.08224 null
2025-07-10 CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why Thomas Brox Team 2507.07985 null
2025-07-10 Scaling RL to Long Videos Song Han Team 2507.07966 null
2025-07-10 SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment Lei Fan Team 2507.07939 null
2025-07-10 MoSE: Skill-by-Skill Mixture-of-Expert Learning for Autonomous Driving Chao Zhang Team 2507.07818 null
2025-07-10 Energy-Guided Decoding for Object Hallucination Mitigation Christopher Zach Team 2507.07731 null
2025-07-10 One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models Cairong Zhao Team 2507.07709 null
2025-07-10 Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought Daiki Chijiwa Team 2507.07685 null
2025-07-11 ViLU: Learning Vision-Language Uncertainties for Failure Prediction Nicolas Thome Team 2507.07620 null
2025-07-10 LOSC: LiDAR Open-voc Segmentation Consolidator Renaud Marlet Team 2507.07605 null
2025-07-10 The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs Qun Liu Team 2507.07562 null
2025-07-10 ArchiveGPT: A human-centered evaluation of using a vision language model for image cataloguing Markus Huff Team 2507.07551 null
2025-07-11 Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning David Martins de Matos Team 2507.07340 null
2025-07-09 ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation Suren Kumar Team 2507.07317 null
2025-07-09 LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation Angel X. Chang Team 2507.07299 null
2025-07-09 MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning Dan Goldwasser Team 2507.07297 null
2025-07-09 4KAgent: Agentic Any Image to 4K Super-Resolution Zhengzhong Tu Team 2507.07105 null
2025-07-11 Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models Junfei Xiao Team 2507.07104 null
2025-07-09 Evaluating Attribute Confusion in Fashion Text-to-Image Generation Davide Talon Team 2507.07079 null
2025-07-09 Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM Sibei Yang Team 2507.06973 null
2025-07-09 CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale Quan Wang Team 2507.06959 null
2025-07-09 VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation Tat-Seng Chua Team 2507.06899 null
2025-07-09 HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement Yanning Zhang Team 2507.06814 null
2025-07-09 Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu Donghyeok Choi Team 2507.06761 null
2025-07-09 Text-promptable Object Counting via Quantity Awareness Enhancement Li Li Team 2507.06679 null
2025-07-09 Cross-Modal Dual-Causal Learning for Long-Term Action Recognition Fan Chao Team 2507.06603 null
2025-07-09 Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection Xiangmin Xu Team 2507.06510 null
2025-07-09 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds Nick Haber Team 2507.06484 null
2025-07-08 VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic Andreas A. Malikopoulos Team 2507.06441 null
2025-07-08 CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions Yi R. Fung Team 2507.06210 null
2025-07-08 Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling Naga Harshita Marupaka Team 2507.06183 null
2025-07-10 Skywork-R1V3 Technical Report Yahui Zhou Team 2507.06167 null
2025-07-08 LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models Hongming Shan Team 2507.06140 null
2025-07-08 GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing Hao Liu Team 2507.05887 null
2025-07-08 Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports Hitomi Yanaka Team 2507.05799 null
2025-07-08 SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning Tao He Team 2507.05798 null
2025-07-08 A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation Yue Gao Team 2507.05731 null
2025-07-09 Integrated Structural Prompt Learning for Vision-Language Models Bin Luo Team 2507.05677 null
2025-07-08 R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding Shabnam Ghadar Team 2507.05673 null
2025-07-08 Dynamic Rank Adaptation for Vision-Language Models Bin Luo Team 2507.05668 null
2025-07-08 Structured Task Solving via Modular Embodied Intelligence: A Case Study on Rubik's Cube Shenghai Yuan Team 2507.05607 null
2025-07-08 Rethinking Layered Graphic Design Generation with a Top-Down Approach Qifeng Chen Team 2507.05601 null
2025-07-08 PaddleOCR 3.0 Technical Report Yanjun Ma Team 2507.05595 null
2025-07-07 Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality Junxiao Wang Team 2507.05515 null
2025-07-07 Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model Even Oldridge Team 2507.05513 null
2025-07-07 OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts Priyadarshini Panda Team 2507.05427 null
2025-07-07 pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models Ramtin Pedarsani Team 2507.05394 null
2025-07-07 NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving Cheng Lu Team 2507.05227 null
2025-07-07 All in One: Visual-Description-Guided Unified Point Cloud Segmentation Rao Muhammad Anwer Team 2507.05211 null
2025-07-07 Differential Attention for Multimodal Crisis Event Analysis Abdullah-Al-Zubaer Imran Team 2507.05165 null
2025-07-07 INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling Bo Zheng Team 2507.05056 null
2025-07-07 Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision Nicolas Padoy Team 2507.05020 null
2025-07-07 Training-free Generation of Temporally Consistent Rewards from VLMs Jian Tang Team 2507.04789 null
2025-07-07 Vision-Language Models Can't See the Obvious Sanath Narayan Team 2507.04741 null
2025-07-07 An analysis of vision-language models for fabric retrieval Fabio Poiesi Team 2507.04735 null
2025-07-07 A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets Jie Zhou Team 2507.04699 null
2025-07-07 MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding Dinesh Manocha Team 2507.04686 null
2025-07-07 Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation Chang Xu Team 2507.04680 null
2025-07-06 VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation Lei Han Team 2507.04524 null
2025-07-08 FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection Ruixuan Wang Team 2507.04511 null
2025-07-06 MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization Changhao Chen Team 2507.04509 null
2025-07-06 Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection Sanasam Ranbir Singh Team 2507.04458 null
2025-07-06 Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions Johan Bos Team 2507.04377 null
2025-07-05 LVLM-Composer's Explicit Planning for Image Generation Amina Grant Team 2507.04152 null
2025-07-05 Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation Hunter Young Team 2507.04151 null
2025-07-05 PresentAgent: Multimodal Agent for Presentation Video Generation Yang Zhao Team 2507.04036 null
2025-07-05 A Comparative Study of Specialized LLMs as Dense Retrievers Jiafeng Guo Team 2507.03958 null
2025-07-03 ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects Cewu Lu Team 2507.02600 null
2025-07-02 cVLA: Towards Efficient Camera-Space VLAs Thomas Brox Team 2507.02190 null
2025-07-02 Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges Anuj Sharma Team 2507.02074 null
2025-07-01 Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames Cordelia Schmid Team 2507.02001 null
2025-07-02 How Do Vision-Language Models Process Conflicting Information Across Modalities? Ellie Pavlick Team 2507.01790 null
2025-07-02 Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition Muzammil Behzad Team 2507.01673 null
2025-07-02 MARVIS: Modality Adaptive Reasoning over VISualizations Chinmay Hegde Team 2507.01544 null
2025-07-02 Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence Martin Schramm Team 2507.01504 null
2025-07-02 BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments Mingzhai Sun Team 2507.01485 null
2025-07-03 TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control Yanwei Fu Team 2507.01424 null
2025-07-02 CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning Yoshitaka Ushiku Team 2507.01409 null
2025-07-02 Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model Xi Li Team 2507.01351 null
2025-07-02 AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation Jiawei Zhang Team 2507.01255 null
2025-07-02 GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Jie Tang Team 2507.01006 null
2025-07-04 Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations Yunzhu Li Team 2507.00990 null
2025-07-01 Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact Seyedali Mirjalili Team 2507.00951 null
2025-07-01 The Age of Sensorial Zero Trust: Why We Can No Longer Trust Our Senses Fabio Correa Xavier Team 2507.00907 null
2025-07-01 ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models Yaqi Xie Team 2507.00898 null
2025-07-01 GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond Luc Van Gool Team 2507.00886 null
2025-07-01 UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement Xiangxiang Chu Team 2507.00721 null
2025-07-01 Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English Rajesh Sharma Team 2507.00700 null
2025-07-01 Context-Aware Academic Emotion Dataset and Benchmark Wenwu Yang Team 2507.00586 null
2025-07-01 Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation Rong Xiao Team 2507.00537 null
2025-07-01 Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving Yadan Luo Team 2507.00525 null
2025-06-30 EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations Sungzoon Cho Team 2506.24016 null
2025-06-30 The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models Tieniu Tan Team 2506.24000 null
2025-06-30 GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models Hassan Rivaz Team 2506.23903 null
2025-06-30 A Closer Look at Conditional Prompt Tuning for Vision-Language Models Heng Tao Shen Team 2506.23856 null
2025-06-30 Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model Fahad Shahbaz Khan Team 2506.23822 null
2025-06-30 Visual Textualization for Image Prompted Object Detection Yan Xu Team 2506.23785 null
2025-06-30 PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? Ransalu Senanayake Team 2506.23725 null
2025-06-30 On the Domain Robustness of Contrastive Vision-Language Models Erik Rodner Team 2506.23663 null
2025-06-30 CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models Bing Qin Team 2506.23590 null
2025-06-30 A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation Jie Xu Team 2506.23584 null
2025-07-01 ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding ShengJing Yang Team 2506.23491 null
2025-06-30 Sanitizing Manufacturing Dataset Labels Using Vision-Language Models Vinh Nguyen Team 2506.23465 null
2025-06-29 GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields Yutaka Matsuo Team 2506.23352 null
2025-06-29 IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering Brandon Y. Feng Team 2506.23329 null
2025-07-01 SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting Hongliang Ren Team 2506.23309 null
2025-06-29 Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models Tanmoy Chakraborty Team 2506.23122 null
2025-06-29 MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings Zhicheng Dou Team 2506.23115 null
2025-06-29 Empowering Small VLMs to Think with Dynamic Memorization and Exploration Long Chen Team 2506.23061 null
2025-06-29 SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions Maarten Sap Team 2506.23046 null
2025-06-28 Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models Swadesh Swain Team 2506.22982 null
2025-06-27 MiCo: Multi-image Contrast for Reinforcement Visual Reasoning Hengshuang Zhao Team 2506.22434 null
2025-06-27 Test-Time Consistency in Vision Language Models Leonid Sigal Team 2506.22395 null
2025-06-27 Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation Xun Xu Team 2506.22375 null
2025-06-27 Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment Bo Du Team 2506.22283 null
2025-06-27 COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication Albert Gatt Team 2506.22274 null
2025-06-27 Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs Mahdieh Soleymani Baghshah Team 2506.22146 null
2025-06-27 Universal Retrieval for Multimodal Trajectory Modeling Dehan Kong Team 2506.22056 null
2025-06-27 Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation Daisuke Deguchi Team 2506.22032 null
2025-06-27 SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation Xulei Yang Team 2506.21892 null
2025-06-27 Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles Matthew J. Barth Team 2506.21885 null
2025-06-27 Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation Zhiting Hu Team 2506.21876 null
2025-06-27 On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling Ben Y. Zhao Team 2506.21874 null
2025-06-27 Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling Yong Man Ro Team 2506.21863 null
2025-06-27 Embodied Domain Adaptation for Object Detection Feras Dayoub Team 2506.21860 null
2025-06-27 The Cost of Avoiding Backpropagation Hui Guan Team 2506.21833 null
2025-06-26 ViStruct: Simulating Expert-Like Reasoning Through Task Decomposition and Visual Attention Cues Carolina Nobre Team 2506.21762 null
2025-06-26 Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs Ismini Lourentzou Team 2506.21656 null
2025-06-26 Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration Jian Wu Team 2506.21509 null
2025-06-26 Global and Local Entailment Learning for Natural World Imagery Nathan Jacobs Team 2506.21476 null
2025-06-26 Spatial Mental Modeling from Limited Views Li Fei-Fei Team 2506.21458 null
2025-06-27 ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models Ziwei Liu Team 2506.21356 null
2025-06-26 LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning Hayaru Shouno Team 2506.21317 null
2025-06-26 DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images Ganesh Ramakrishnan Team 2506.21316 null
2025-06-26 World-aware Planning Narratives Enhance Large Vision-Language Model Planner Xipeng Qiu Team 2506.21230 null
2025-06-26 Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion Jian Liang Team 2506.21144 null
2025-06-26 V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling Bin Ran Team 2506.21041 null
2025-06-26 Multimodal Prompt Alignment for Facial Expression Recognition Shutao Li Team 2506.21017 null
2025-06-26 Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology S Kevin Zhou Team 2506.21001 null
2025-06-26 TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation Yihong Wu Team 2506.20991 null
2025-06-26 SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes Zheng Zhang Team 2506.20990 null
2025-06-26 Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends Zeng-Guang Hou Team 2506.20966 null
2025-06-26 E-FreeM2: Efficient Training-Free Multi-Scale and Cross-Modal News Verification via MLLMs Minh-Son Dao Team 2506.20944 null
2025-06-25 Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models Zafer Dogan Team 2506.20832 null
2025-06-25 How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? Bastian Leibe Team 2506.20795 null
2025-06-27 Shape2Animal: Creative Animal Generation from Natural Silhouettes Trung-Nghia Le Team 2506.20616 null
2025-06-25 HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction Maja Matarić Team 2506.20566 null
2025-06-25 Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation Morten Rieger Hannemose Team 2506.20449 null
2025-06-25 CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition Michael Gienger Team 2506.20373 null
2025-06-25 Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards Bo Zheng Team 2506.20332 null
2025-06-25 MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations Vikram S. Adve Team 2506.20100 null
2025-06-24 Unified Vision-Language-Action Model Zhaoxiang Zhang Team 2506.19850 null
2025-06-24 Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models Christoph M. Friedrich Team 2506.19825 null
2025-06-24 CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation Jiangmiao Pang Team 2506.19816 null
2025-06-24 UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation Zhongliang Jiang Team 2506.19694 null
2025-06-24 PEVLM: Parallel Encoding for Vision-Language Models Yong Wu Team 2506.19651 null
2025-06-24 V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis Zuozhu Liu Team 2506.19610 null
2025-06-24 ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP Bokui Chen Team 2506.19608 null
2025-06-24 Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects Angelo Cangelosi Team 2506.19579 null
2025-06-24 Visual hallucination detection in large vision-language models via evidential conflict Liping Jing Team 2506.19513 null
2025-06-24 T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models Qingyao Wu Team 2506.19498 null
2025-06-24 Emergence of Text Readability in Vision Language Models Bohyung Han Team 2506.19389 null
2025-06-24 Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference Nutan Chen Team 2506.19303 null
2025-06-24 Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models Dan Zeng Team 2506.19300 null
2025-06-24 Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding Hui Xiong Team 2506.19288 null
2025-06-24 MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models Bo Zheng Team 2506.19257 null
2025-06-24 Scaffolding Dexterous Manipulation with Vision-Language Models Dorsa Sadigh Team 2506.19212 null
2025-06-23 Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition Bjoern W. Schuller Team 2506.19079 null
2025-06-23 HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models Krzysztof Czarnecki Team 2506.19072 null
2025-06-23 GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs Guanxi Shen Team 2506.18985 null
2025-06-23 VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning Jian Zhang Team 2506.18564 null
2025-06-23 Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey Heng Tao Shen Team 2506.18504 null
2025-06-23 InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models Wenhai Wang Team 2506.18385 null
2025-06-23 Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review Jing Qin Team 2506.18378 null
2025-06-23 Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? Bill Howe Team 2506.18322 null
2025-06-24 Referring Expression Instance Retrieval and A Strong End-to-End Baseline Jinqiao Wang Team 2506.18246 null
2025-06-23 Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning Xinhai Zhao Team 2506.18234 null
2025-06-22 See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis Xiaoxiao Li Team 2506.18140 null
2025-06-22 CLGRPO: Reasoning Ability Enhancement for Small VLMs Zhiwang Zhang Team 2506.18048 null
2025-06-22 Adapting Vision-Language Models for Evaluating World Models Sarah Parisot Team 2506.17967 null
2025-06-21 RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models Marco Pavone Team 2506.17811 null
2025-06-21 MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation Xiaochuan Shi Team 2506.17664 null
2025-06-21 Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning Yu-Chiang Frank Wang Team 2506.17645 null
2025-06-21 CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning Xiaoling Wang Team 2506.17629 null
2025-06-21 DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving Zhengzhong Tu Team 2506.17590 null
2025-06-21 HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models Tao He Team 2506.17587 null
2025-06-20 Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction Jose Dolz Team 2506.17503 null
2025-06-20 Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation Ismail Ben Ayed Team 2506.17500 null
2025-06-20 General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting Georgios Georgakis Team 2506.17462 null
2025-06-20 Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? Klara Nahrstedt Team 2506.17417 null
2025-06-20 VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning Hengshuang Zhao Team 2506.17221 null
2025-06-20 Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens Chuang Gan Team 2506.17218 null
2025-06-20 Do We Need Large VLMs for Spotting Soccer Actions? Sandeep Chaurasia Team 2506.17144 null
2025-06-20 Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments Nathaniel D. Bastian Team 2506.16994 null
2025-06-20 FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation Jinqiao Wang Team 2506.16806 null
2025-06-20 Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes Chen Feng Team 2506.16805 null
2025-06-20 Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models Xiaohua Xu Team 2506.16760 null
2025-06-20 TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion Xinbo Gao Team 2506.16730 null
2025-06-20 V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos Xiaoyu Qin Team 2506.16716 null
2025-06-20 VLM-Empowered Multi-Mode System for Efficient and Safe Planetary Navigation Liang Ding Team 2506.16703 null
2025-06-20 LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation Jing Liu Team 2506.16691 null
2025-06-19 CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity Yunzhu Li Team 2506.16652 null
2025-06-19 History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation Fatemeh Afghah Team 2506.16623 null
2025-06-19 GoalLadder: Incremental Goal Discovery with Vision-Language Models Shimon Whiteson Team 2506.16396 null
2025-06-19 CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset Amith Adiraju Team 2506.16385 null
2025-06-19 FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models Tat-Seng Chua Team 2506.16218 null
2025-06-19 AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models Shanghang Zhang Team 2506.16112 null
2025-06-19 Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation Yansong Tang Team 2506.16058 null
2025-06-19 DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning Zongqing Lu Team 2506.16012 null
2025-06-18 VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics Michal Štefánik Team 2506.15903 null
2025-06-18 GenRecal: Generation after Recalibration from Large to Small Vision-Language Models Yueh-Hua Wu Team 2506.15681 null
2025-06-18 Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning Imran Razzak Team 2506.15649 null
2025-06-18 FindingDory: A Benchmark to Evaluate Memory in Embodied Agents Zsolt Kira Team 2506.15635 null
2025-06-18 WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts Rémi Lebret Team 2506.15594 link
2025-06-18 DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement Zhuang Li Team 2506.15583 link
2025-06-18 Context-Informed Grounding Supervision Minjoon Seo Team 2506.15480 link
2025-06-19 OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models Guotai Wang Team 2506.15318 null
2025-06-18 MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering Adrian K. Davison Team 2506.15298 null
2025-06-18 ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections Shin'ichi Satoh Team 2506.15180 null
2025-06-18 DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory Yue Gao Team 2506.15096 null
2025-06-18 An Empirical Study of Bugs in Data Visualization Libraries Chengnian Sun Team 2506.15084 link
2025-06-17 PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning Yeyun Gong Team 2506.14907 link
2025-06-17 RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills Chuang Gan Team 2506.14763 null
2025-06-17 Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models Yuke Zhu Team 2506.14727 null
2025-06-17 AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions Dacheng Tao Team 2506.14697 null
2025-06-17 Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models Jiaheng Wei Team 2506.14674 null
2025-06-17 StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery Michelle Pasco Team 2506.14670 null
2025-06-17 SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks Liang Lin Team 2506.14512 null
2025-06-17 Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation? Soumik Sarkar Team 2506.14507 link
2025-06-17 Adapting Lightweight Vision Language Models for Radiological Visual Question Answering Chang Sun Team 2506.14451 null
2025-06-17 Causally Steered Diffusion for Automated Video Counterfactual Generation Sotirios A. Tsaftaris Team 2506.14404 null
2025-06-17 Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments Xuesu Xiao Team 2506.14233 null
2025-06-17 Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology Benjamin Kwan Team 2506.14136 null
2025-06-17 A Hierarchical Test Platform for Vision Language Model (VLM)-Integrated Real-World Autonomous Driving Ziran Wang Team 2506.14100 null
2025-06-16 Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation Hyeongwoo Kim Team 2506.14015 null
2025-06-16 GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics Mac Schwager Team 2506.14009 null
2025-06-16 Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography Alejandro Santos-Díaz Team 2506.13964 null
2025-06-16 HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment Abdul Bais Team 2506.13925 null
2025-06-16 Touch begins where vision ends: Generalizable policies for contact-rich manipulation Raunaq Bhirangi Team 2506.13762 null
2025-06-16 Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins Wei-Chiu Ma Team 2506.13761 null
2025-06-16 OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning Yonghang Tai Team 2506.13723 null
2025-06-16 ROSA: Harnessing Robot States for Vision-Language and Action Alignment Xiaoyan Sun Team 2506.13679 null
2025-06-16 DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models Hanspeter Pfister Team 2506.13638 null
2025-06-16 VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation Wei Pan Team 2506.13428 null
2025-06-16 Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation Marija Popović Team 2506.13367 null
2025-06-16 Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling Rei Kawakami Team 2506.13282 null
2025-06-16 Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments Ee-Chien Chang Team 2506.13205 null
2025-06-16 Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence Bernard Ghanem Team 2506.13187 null
2025-06-16 GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models Jun Wang Team 2506.13166 null
2025-06-16 Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs Byung-Hoon Kim Team 2506.13102 null
2025-06-16 PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue Siqi Liu Team 2506.13063 null
2025-06-17 HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs Xuezhi Cao Team 2506.13038 null
2025-06-15 CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making Zuozhu Liu Team 2506.12849 null
2025-06-15 Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models Chang D. Yoo Team 2506.12822 null
2025-06-15 Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models Wentao Zhang Team 2506.12776 null
2025-06-15 NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models Jitao Sang Team 2506.12706 null
2025-06-15 Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context Sandeep Singhal Team 2506.12683 null
2025-06-14 Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation Yuexian Zou Team 2506.12609 null
2025-06-13 Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale Minsu Cho Team 2506.12009 null
2025-06-13 How Visual Representations Map to Language Feature Space in Multimodal LLMs Neel Nanda Team 2506.11976 null
2025-06-13 Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation Kaifu Zhang Team 2506.11820 null
2025-06-13 MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space Jan Strich Team 2506.11684 null
2025-06-13 VLM@school -- Evaluation of AI image understanding on German middle school knowledge Vincent Tischler Team 2506.11604 null
2025-06-13 EasyARC: Evaluating Vision Language Models on True Visual Reasoning Aylin Akkus Team 2506.11595 null
2025-06-13 Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis Johannes Betz Team 2506.11526 null
2025-06-13 Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs Min-Yen Kan Team 2506.11515 null
2025-06-13 Taming Stable Diffusion for Computed Tomography Blind Super-Resolution Lichao Mou Team 2506.11496 null
2025-06-13 On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving Mert D. Pesé Team 2506.11472 null
2025-06-12 Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving Liam Paull Team 2506.11234 null
2025-06-12 AIR: Zero-shot Generative Model Adaptation with Iterative Refinement Ngai-Man Cheung Team 2506.10895 link
2025-06-13 RationalVLA: A Rational Vision-Language-Action Model with Dual System Haoang Li Team 2506.10826 null
2025-06-12 Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding Mir Feroskhan Team 2506.10756 null
2025-06-13 IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain Yefeng Zheng Team 2506.10730 link
2025-06-12 GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning Guan Huang Team 2506.10639 null
2025-06-12 Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning Yong Liu Team 2506.10575 null
2025-06-12 LLMs Are Not Yet Ready for Deepfake Image Detection Kristen Moore Team 2506.10474 null
2025-06-12 UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models Shuai Lu Team 2506.10342 null
2025-06-12 Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions Gaowei Chen Team 2506.10334 null
2025-06-12 HalLoc: Token-level Localization of Hallucinations for Vision Language Models Gunhee Kim Team 2506.10286 null
2025-06-11 Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval Francis Ferraro Team 2506.10202 null
2025-06-11 Improving Personalized Search with Regularized Low-Rank Parameter Updates Bryan Russell Team 2506.10182 null
2025-06-11 A Navigation Framework Utilizing Vision-Language Models Kaiyu Tang Team 2506.10172 null
2025-06-11 One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence Marinka Zitnik Team 2506.10157 null
2025-06-11 ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs Lijuan Wang Team 2506.10128 null
2025-06-11 Test-Time Adaptation for Generalizable Task Progress Estimation Alessandra Russo Team 2506.10085 null
2025-06-11 Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing Tieniu Tan Team 2506.09965 link
2025-06-11 From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models Chen Feng Team 2506.09930 null
2025-06-11 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation Hyunjung Shim Team 2506.09883 link
2025-06-11 Adding simple structure at inference improves Vision-Language Compositionality Gorka Azkune Team 2506.09691 link
2025-06-11 FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models Liangqiong Qu Team 2506.09638 null
2025-06-11 Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs Jaehyung Kim Team 2506.09522 link
2025-06-11 Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning Jia Li Team 2506.09473 null
2025-06-11 TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision Susmit Jha Team 2506.09445 null
2025-06-11 DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt Ge Li Team 2506.09353 null
2025-06-10 UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation Li Fei-Fei Team 2506.09284 null
2025-06-10 MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models Harshvardhan Sikka Team 2506.09172 null
2025-06-10 VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning Zhenfei Yin Team 2506.09049 null
2025-06-11 Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs Yonatan Belinkov Team 2506.09047 null
2025-06-10 Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Jiaqi Wang Team 2506.09040 null
2025-06-10 Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models Liansheng Wang Team 2506.08990 null
2025-06-10 Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions Yejin Choi Team 2506.08927 null
2025-06-12 Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought Shanghang Zhang Team 2506.08817 null
2025-06-10 Multimodal Representation Alignment for Cross-modal Information Retrieval Luis A. Leiva Team 2506.08774 null
2025-06-10 PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly Xiaodan Liang Team 2506.08708 null
2025-06-10 VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism Weijiang Yu Team 2506.08691 null
2025-06-10 ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction Taesup Kim Team 2506.08678 null
2025-06-10 Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs Ang Li Team 2506.08543 null
2025-06-10 Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring Jiaheng Wei Team 2506.08429 null
2025-06-11 SafeCoT: Improving VLM Safety with Minimal Reasoning Chaochao Lu Team 2506.08399 null
2025-06-10 SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding Jaeyoung Do Team 2506.08391 null
2025-06-09 A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks Matthias Bethge Team 2506.08227 null
2025-06-11 GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra Guha Balakrishnan Team 2506.08194 null
2025-06-09 Open World Scene Graph Generation using Vision Language Models Anuj Karpatne Team 2506.08189 null
2025-06-09 CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems Ramya Korlakai Vinayak Team 2506.08071 null
2025-06-10 Vision Transformers Don't Need Trained Registers Yossi Gandelsman Team 2506.08010 null
2025-06-09 Hidden in plain sight: VLMs overlook their visual representations Trevor Darrell Team 2506.08008 null
2025-06-09 BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Tieniu Tan Team 2506.07961 null
2025-06-09 Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations Yiqing Shen Team 2506.07943 null
2025-06-09 Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models Zsolt Kira Team 2506.07936 null
2025-06-09 SAM2Auto: Auto Annotation Using FLASH Q. M. Jonathan Wu Team 2506.07850 null
2025-06-09 Image Reconstruction as a Tool for Feature Analysis Andrey Kuznetsov Team 2506.07803 null
2025-06-09 Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger Shiming Xiang Team 2506.07785 null
2025-06-09 Language-Vision Planner and Executor for Text-to-Visual Reasoning Ling Liu Team 2506.07778 null
2025-06-10 ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models Shuai Lu Team 2506.07739 null
2025-06-09 OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting Bastian Leibe Team 2506.07697 null
2025-06-09 Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline Idan Szpektor Team 2506.07631 null
2025-06-09 Event-Priori-Based Vision-Language Model for Efficient Visual Understanding Michele Magno Team 2506.07627 null
2025-06-10 SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems Zhengzhong Tu Team 2506.07564 null
2025-06-10 GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition Conghui He Team 2506.07553 null
2025-06-09 Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent Ting Yang Ling Team 2506.07509 null
2025-06-09 Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency Xinggang Wang Team 2506.07497 null
2025-06-09 CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization Hyun Myung Team 2506.07484 null
2025-06-09 LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments Josh Park Team 2506.07416 null
2025-06-09 MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems Tao Qi Team 2506.07399 null
2025-06-06 CoMemo: LVLMs Need Image Context with Image Memory Jifeng Dai Team 2506.06279 null
2025-06-06 Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding André F. T. Martins Team 2506.06275 null
2025-06-06 Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study Lena Maier-Hein Team 2506.06232 null
2025-06-06 GenIR: Generative Visual Feedback for Mental Image Retrieval James Davis Team 2506.06220 null
2025-06-06 STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving Horst Possegger Team 2506.06218 null
2025-06-06 WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management Zijian Wang Team 2506.06084 null
2025-06-06 Full Conformal Adaptation of Medical Vision-Language Models Jose Dolz Team 2506.06076 null
2025-06-06 BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning Rudolf Lioutikov Team 2506.06072 null
2025-06-06 MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks Yiren Song Team 2506.05982 null
2025-06-06 HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios Weihao Gu Team 2506.05883 null
2025-06-06 Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions? Hitomi Yanaka Team 2506.05765 null
2025-06-06 MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory João Magalhães Team 2506.05696 null
2025-06-06 DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models Xianpeng Lang Team 2506.05667 null
2025-06-05 MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Furong Huang Team 2506.05523 null
2025-06-05 Degradation-Aware Image Enhancement via Vision-Language Classification Zibo Meng Team 2506.05450 null
2025-06-06 Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs Xiaodan Liang Team 2506.05318 null
2025-06-05 MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm Xiang Bai Team 2506.05218 null
2025-06-05 Quantifying Cross-Modality Memorization in Vision-Language Models Chiyuan Zhang Team 2506.05198 null
2025-06-05 CIVET: Systematic Evaluation of Understanding in VLMs Giuseppe Riccardi Team 2506.05146 null
2025-06-05 PixCell: A generative foundation model for digital histopathology images Dimitris Samaras Team 2506.05127 null
2025-06-05 A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions Dung Nguyen Team 2506.05061 null
2025-06-05 Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System Moju Zhao Team 2506.05020 null
2025-06-05 ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT Mikołaj Koszowski Team 2506.04929 null
2025-06-05 SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs Dacheng Tao Team 2506.04743 null
2025-06-05 Robust Few-Shot Vision-Language Model Adaptation Shu Kong Team 2506.04713 null
2025-06-05 HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model Sung Ju Hwang Team 2506.04704 null
2025-06-05 SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents Yu-Wing Tai Team 2506.04606 null
2025-06-05 MuSciClaims: Multimodal Scientific Claim Verification Niranjan Balasubramanian Team 2506.04585 null
2025-06-05 Handle-based Mesh Deformation Guided By Vision Language Model Aniket Bera Team 2506.04562 null
2025-06-04 RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics Shanghang Zhang Team 2506.04308 null
2025-06-04 Image Editing As Programs with Diffusion Models Xinchao Wang Team 2506.04158 null
2025-06-04 Recent Advances in Medical Image Classification Ngoc Quoc Ly Team 2506.04129 null
2025-06-04 LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward Jing Li Team 2506.04070 null
2025-06-04 Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization Min Zhang Team 2506.04039 null
2025-06-04 Vocabulary-free few-shot learning for Vision-Language Models Christophe De Vleeschouwer Team 2506.04005 null
2025-06-04 DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models Anders Holst Team 2506.03933 null
2025-06-04 Zero-Shot Temporal Interaction Localization for Egocentric Videos Hesheng Wang Team 2506.03662 null
2025-06-04 Spatial Understanding from Videos: Structured Prompts Meet Simulation Data Liqiang Nie Team 2506.03642 null
2025-06-04 VLMs Can Aggregate Scattered Training Patches Chaochao Lu Team 2506.03614 null
2025-06-04 BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance Ngan Le Team 2506.03589 null
2025-06-04 MiMo-VL Technical Report Bingquan Xia Team 2506.03569 null
2025-06-04 Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation Yixin Zhang Team 2506.03521 null
2025-06-04 DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models Aliaksandr Siarohin Team 2506.03517 null
2025-06-04 POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning Weixin Yao Team 2506.03511 link
2025-06-03 Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views Hansaem Kim Team 2506.03371 null
2025-06-03 Robustness in Both Domains: CLIP Needs a Robust Text Encoder Volkan Cevher Team 2506.03355 null
2025-06-03 Grounded Vision-Language Interpreter for Integrated Task and Motion Planning Atsushi Hashimoto Team 2506.03270 null
2025-06-03 OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models Li Yi Team 2506.03135 null
2025-06-03 EgoVLM: Policy Optimization for Egocentric Video Understanding Linshen Liu Team 2506.03097 null
2025-06-03 DPO Learning with LLMs-Judge Signal for Computer Use Agents Phillip Howard Team 2506.03095 null
2025-06-03 From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit Demba Ba Team 2506.03093 null
2025-06-03 Text-guided Generation of Efficient Personalized Inspection Plans Aniket Bera Team 2506.02917 null
2025-06-04 FlySearch: Exploring how vision-language models explore Maciej Wołczyk Team 2506.02896 null
2025-06-03 Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights Tony Wu Team 2506.02865 null
2025-06-03 SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking Yiwei Wang Team 2506.02803 null
2025-06-04 Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning Arash Afkanpour Team 2506.02738 null
2025-06-03 Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation Toshihiko Yamasaki Team 2506.02708 null
2025-06-03 Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet Zhi Wang Team 2506.02671 null
2025-06-03 Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models Dong Seog Han Team 2506.02615 null
2025-06-03 Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models Farzan Farnia Team 2506.02557 null
2025-06-03 Sign Language: Towards Sign Understanding for Robot Autonomy David Hsu Team 2506.02556 null
2025-06-03 SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence Yueming Jin Team 2506.02555 null
2025-06-03 Rethinking Post-Unlearning Behavior of Large Vision-Language Models Kyomin Jung Team 2506.02541 null
2025-06-04 MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection Qingyao Wu Team 2506.02535 null
2025-06-03 VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments Yu Wang Team 2506.02387 null
2025-06-03 Auto-Labeling Data for Object Detection Jason J. Corso Team 2506.02359 null
2025-06-03 RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models Jianzong Wang Team 2506.02354 null
2025-05-30 ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL Lili Qiu Team 2505.24875 null
2025-05-30 ProxyThinker: Test-Time Guidance through Small Visual Reasoners Vicente Ordonez Team 2505.24872 null
2025-05-30 GenSpace: Benchmarking Spatially-Aware Image Generation Zhou Zhao Team 2505.24870 null
2025-05-30 Time Blindness: Why Video-Language Models Can't See What Humans Can? Mohamed Elhoseiny Team 2505.24867 null
2025-05-30 Conformal Prediction for Zero-Shot Models Jose Dolz Team 2505.24693 null
2025-05-30 BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models Khoa Luu Team 2505.24649 null
2025-05-30 SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition Wadii Boulila Team 2505.24600 null
2025-05-30 AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders Liang Ding Team 2505.24519 null
2025-05-30 CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation Thamar Solorio Team 2505.24456 null
2025-05-30 Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning Matthias Hein Team 2505.24424 null
2025-05-30 MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs Sophia Ananiadou Team 2505.24423 null
2025-05-30 Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering Fadoua Ghourabi Team 2505.24371 null
2025-05-30 KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval Yong Li Team 2505.24342 null
2025-05-30 ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving Songan Zhang Team 2505.24317 null
2025-05-30 Benchmarking Foundation Models for Zero-Shot Biometric Tasks Arun Ross Team 2505.24214 null
2025-05-30 Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap Baharan Mirzasoleiman Team 2505.24208 null
2025-05-30 DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? Xuegong Zhang Team 2505.24173 null
2025-05-30 CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs Xuchen Song Team 2505.24120 null
2025-05-29 mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation Zhengzhong Tu Team 2505.24073 null
2025-05-29 Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding Tinoosh Mohsenin Team 2505.23990 null
2025-05-29 ZeroGUI: Automating Online GUI Learning at Zero Human Cost Jifeng Dai Team 2505.23762 link
2025-05-29 Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint David M. Chan Team 2505.23759 link
2025-05-29 To Trust Or Not To Trust Your Vision-Language Model's Prediction Olga Fink Team 2505.23745 link
2025-05-29 LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization Jing Liao Team 2505.23740 null
2025-05-29 Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better Sergey Levine Team 2505.23705 null
2025-05-29 Grounded Reinforcement Learning for Visual Reasoning Katerina Fragkiadaki Team 2505.23678 null
2025-05-29 Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition Liangcai Gao Team 2505.23566 null
2025-05-30 Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information Weiping Li Team 2505.23558 link
2025-05-29 TRAP: Targeted Redirecting of Agentic Preferences Gagandeep Singh Team 2505.23518 null
2025-05-29 VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation Xu-Cheng Yin Team 2505.23484 link
2025-05-29 Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model Muzammil Behzad Team 2505.23358 null
2025-05-29 LADA: Scalable Label-Specific CLIP Adapter for Continual Learning Min-Ling Zhang Team 2505.23271 link
2025-05-29 VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation Panayiotis Kolios Team 2505.23267 null
2025-05-29 Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion Tao Xiang Team 2505.23266 null
2025-05-29 ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering Lei Wang Team 2505.23242 null
2025-05-29 PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents Jinjin Gu Team 2505.23130 null
2025-05-29 Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation Yu Cheng Team 2505.23043 link
2025-05-29 An Empirical Study of Federated Prompt Learning for Vision Language Model Mang Ye Team 2505.23024 null
2025-05-29 SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model Zhenwei Shi Team 2505.23010 null
2025-05-29 QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining Muhao Chen Team 2505.23004 link
2025-05-28 Zero-Shot Vision Encoder Grafting via LLM Surrogates Tom Goldstein Team 2505.22664 link
2025-05-28 Training Free Stylized Abstraction Vishal M. Patel Team 2505.22663 null
2025-05-28 VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models Dong Yu Team 2505.22654 null
2025-05-28 Sherlock: Self-Correcting Reasoning in Vision-Language Models Ruqi Zhang Team 2505.22651 null
2025-05-28 Hypothesis Testing in Imaging Inverse Problems Marcelo Pereyra Team 2505.22481 null
2025-05-28 Zero-Shot 3D Visual Grounding from Vision-Language Models Junwei Liang Team 2505.22429 null
2025-05-28 IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth Syed Masum Billah Team 2505.22305 null
2025-05-28 Investigating Mechanisms for In-Context Vision Language Binding Vineet Gandhi Team 2505.22200 null
2025-05-29 Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging Piji Li Team 2505.22150 null
2025-05-28 3D Question Answering via only 2D Vision-Language Models Qianru Sun Team 2505.22143 null
2025-05-28 Reinforced Reasoning for Embodied Planning Bo Jin Team 2505.22050 null
2025-05-28 Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization Xinlei Chen Team 2505.22038 null
2025-05-28 Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset Muhammad Abdul-Mageed Team 2505.21979 null
2025-05-29 DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation Xin Tan Team 2505.21969 null
2025-05-28 Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack Usman Naseem Team 2505.21967 null
2025-05-28 Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs Byonghyo Shim Team 2505.21955 null
2025-05-28 Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge Yi Xu Team 2505.21906 null
2025-05-28 Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation Christian Desrosiers Team 2505.21844 null
2025-05-27 MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning Vivek Gupta Team 2505.21771 null
2025-05-27 MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis Christian Wachinger Team 2505.21698 null
2025-05-27 ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models Yueting Zhuang Team 2505.21500 null
2025-05-27 AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery Qing Wang Team 2505.21499 null
2025-05-27 Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration Ziwei Zhu Team 2505.21472 null
2025-05-27 ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models Wentao Zhang Team 2505.21465 null
2025-05-27 LazyVLM: Neuro-Symbolic Approach to Video Analytics M. Tamer Özsu Team 2505.21459 null
2025-05-27 DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models Soumik Sarkar Team 2505.21382 null
2025-05-27 XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration Min Zhang Team 2505.21279 null
2025-05-27 Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation Yutao Yue Team 2505.21106 null
2025-05-27 DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response Naoto Yokoya Team 2505.21089 null
2025-05-27 LPOI: Listwise Preference Optimization for Vision Language Models Gunhee Kim Team 2505.21061 null
2025-05-27 RefAV: Towards Planning-Centric Scenario Mining Neehar Peri Team 2505.20981 null
2025-05-27 On VLMs for Diverse Tasks in Multimodal Meme Classification Jasabanta Patro Team 2505.20937 null
2025-05-27 A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models Bugeun Kim Team 2505.20901 null
2025-05-27 AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding Joon Son Chung Team 2505.20862 null
2025-05-27 Rendering-Aware Reinforcement Learning for Vector Graphics Generation Marco Pedersoli Team 2505.20793 null
2025-05-27 FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation Mir Feroskhan Team 2505.20783 null
2025-05-27 Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models Yao Yang Team 2505.20728 null
2025-05-27 ManiTaskGen: A Comprehensive Task Generator for Benchmarking and Improving Vision-Language Agents on Embodied Decision-Making Hao Su Team 2505.20726 null
2025-05-27 Automating eHMI Action Design with LLMs for Automated Vehicle Communication Takeo Igarashi Team 2505.20711 null
2025-05-27 GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning Sundong Kim Team 2505.20672 null
2025-05-26 Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models Naoto Yokoya Team 2505.20236 null
2025-05-26 Agentic 3D Scene Generation with Spatially Contextualized VLMs Chi-Keung Tang Team 2505.20129 null
2025-05-26 MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models James M. Rehg Team 2505.20122 null
2025-05-27 EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition Sören Auer Team 2505.20033 null
2025-05-26 ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers Elmar Rückert Team 2505.20032 null
2025-05-26 Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models Ernest K. Ryu Team 2505.20021 null
2025-05-26 Can Visual Encoder Learn to See Arrows? Hiroaki Ozaki Team 2505.19944 null
2025-05-26 Attention! You Vision Language Model Could Be Maliciously Manipulated Shudong Zhang Team 2505.19911 null
2025-05-26 Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement Muzammil Behzad Team 2505.19895 null
2025-05-26 One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP Kehuan Zhang Team 2505.19840 null
2025-05-26 TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning Dongbin Zhao Team 2505.19769 null
2025-05-26 Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality Alessandro Bruno Team 2505.19696 null
2025-05-26 Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs Shu-Tao Xia Team 2505.19678 null
2025-05-26 JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models Yingchun Wang Team 2505.19610 null
2025-05-26 What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation Rongrong Ji Team 2505.19569 null
2025-05-26 FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models Ruixuan Li Team 2505.19536 null
2025-05-26 Locality-Aware Zero-Shot Human-Object Interaction Detection Minsu Cho Team 2505.19503 null
2025-05-26 Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models Guoliang Kang Team 2505.19498 null
2025-05-26 Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model Yu Cheng Team 2505.19406 null
2025-05-27 DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving Hao Zhao Team 2505.19381 null
2025-05-26 DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models Fatemeh Afghah Team 2505.19373 null
2025-05-23 VideoGameBench: Can Vision-Language Models complete popular video games? Ofir Press Team 2505.18134 null
2025-05-23 One RL to See Them All: Visual Triple Unified Reinforcement Learning Junjie Yan Team 2505.18129 null
2025-05-23 CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays Edward Choi Team 2505.18087 null
2025-05-23 FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation Shibiao Xu Team 2505.18053 null
2025-05-23 Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation Bogdan Sorin Coseriu Team 2505.18039 null
2025-05-23 Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling Mun Yong Yi Team 2505.17982 null
2025-05-23 VLM Models and Automated Grading of Atopic Dermatitis Hamed Ghodrati Team 2505.17835 null
2025-05-23 Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations Chao Shen Team 2505.17812 null
2025-05-23 U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding Hongcheng Guo Team 2505.17779 null
2025-05-23 SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain Yu Li Team 2505.17727 null
2025-05-23 Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek Xiangdong Zhou Team 2505.17702 null
2025-05-23 Towards General Continuous Memory for Vision-Language Models Biwei Huang Team 2505.17670 null
2025-05-23 EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications Min Yang Team 2505.17654 null
2025-05-23 HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning Jianfei Yang Team 2505.17645 null
2025-05-23 Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports Takahiro Omi Team 2505.17625 null
2025-05-23 CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment Zeng-Guang Hou Team 2505.17619 null
2025-05-23 Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving Wangmeng Zuo Team 2505.17609 null
2025-05-23 A Unified Multi-Scale Attention-Based Network for Automatic 3D Segmentation of Lung Parenchyma & Nodules In Thoracic CT Images Furqan Shaukat Team 2505.17602 null
2025-05-23 Multimodal Conversation Structure Understanding David Bamman Team 2505.17536 null
2025-05-23 Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding Sungzoon Cho Team 2505.17529 null
2025-05-22 Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models Mike Zheng Shou Team 2505.16854 link
2025-05-23 LaViDa: A Large Diffusion Language Model for Multimodal Understanding Aditya Grover Team 2505.16839 link
2025-05-22 From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization Huaxiu Yao Team 2505.16832 link
2025-05-22 Perceptual Quality Assessment for Embodied AI Guangtao Zhai Team 2505.16815 link
2025-05-22 SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving Hongsheng Li Team 2505.16805 null
2025-05-22 REOBench: Benchmarking Robustness of Earth Observation Foundation Models Tianjin Huang Team 2505.16793 link
2025-05-22 Single Domain Generalization for Few-Shot Counting via Universal Representation Matching Xinghao Chen Team 2505.16778 link
2025-05-22 IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models AiTi Aw Team 2505.16774 link
2025-05-22 Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation Jianbing Shen Team 2505.16763 null
2025-05-22 SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images Mahsa Baktashmotlagh Team 2505.16659 null
2025-05-22 Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models Pål Halvorsen Team 2505.16647 null
2025-05-22 MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation Zongqing Lu Team 2505.16602 null
2025-05-22 ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models Xiuying Chen Team 2505.16517 null
2025-05-22 Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models Yaochu Jin Team 2505.16446 null
2025-05-22 Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models Kai Han Team 2505.16416 link
2025-05-22 Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression Souvik Kundu Team 2505.16411 link
2025-05-22 VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving Samuel Labi Team 2505.16377 null
2025-05-22 MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing Xinhan Di Team 2505.16279 null
2025-05-22 When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification Jiaheng Wei Team 2505.16149 null
2025-05-22 Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation Junfeng Fang Team 2505.16146 null
2025-05-21 InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition Xue Yang Team 2505.15818 null
2025-05-21 From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems Soujanya Poria Team 2505.15685 null
2025-05-21 FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models Qian Wang Team 2505.15644 null
2025-05-21 Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models Ya Wang Team 2505.15576 link
2025-05-21 TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving Abdallah Shami Team 2505.15564 null
2025-05-21 Clapper: Compact Learning and Video Representation in VLMs Fuzheng Zhang Team 2505.15529 null
2025-05-21 Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Ken Goldberg Team 2505.15517 null
2025-05-21 Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought Libo Qin Team 2505.15510 null
2025-05-21 Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts Soma Biswas Team 2505.15506 link
2025-05-21 Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification Irwin King Team 2505.15504 null
2025-05-21 Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models Bryan Hooi Team 2505.15489 null
2025-05-21 Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL Qing Li Team 2505.15436 null
2025-05-21 TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models Keze Wang Team 2505.15435 null
2025-05-21 On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable? Mohammad Yaqub Team 2505.15425 null
2025-05-21 Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study Hwanjo Yu Team 2505.15389 null
2025-05-21 RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation Farshad Khorrami Team 2505.15373 null
2025-05-21 Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition Youngsook Song Team 2505.15367 null
2025-05-21 AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving Diange Yang Team 2505.15298 null
2025-05-21 Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs Zibin Zheng Team 2505.15265 null
2025-05-21 Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation Kyomin Jung Team 2505.15249 null
2025-05-20 UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens Wentao Zhang Team 2505.14671 null
2025-05-20 CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation Faez Ahmed Team 2505.14646 null
2025-05-20 Debating for Better Reasoning: An Unsupervised Multimodal Approach Mirella Lapata Team 2505.14627 null
2025-05-21 PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models Wenjia Zhang Team 2505.14481 null
2025-05-20 RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding Serge Belongie Team 2505.14462 link
2025-05-20 SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation Masafumi Oyamada Team 2505.14381 null
2025-05-20 Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds Agnieszka Wykowska Team 2505.14366 null
2025-05-20 DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning Xing Yu Team 2505.14362 link
2025-05-20 Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives Gui-Song Xia Team 2505.14361 null
2025-05-20 Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey Dongwoo Kim Team 2505.14340 null
2025-05-20 Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models Chong Feng Team 2505.14257 null
2025-05-20 Visual Agentic Reinforcement Fine-Tuning Jiaqi Wang Team 2505.14246 link
2025-05-20 VoQA: Visual-only Question Answering Lei Huang Team 2505.14227 null
2025-05-20 Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models Matthew Purver Team 2505.14160 null
2025-05-20 Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent Xuming Hu Team 2505.14141 null
2025-05-20 NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI Benedikt Wiestler Team 2505.14064 null
2025-05-20 ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs Minlie Huang Team 2505.14035 null
2025-05-20 Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models Yalin Wang Team 2505.13973 null
2025-05-20 APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight Ambuj Singh Team 2505.13921 link
2025-05-20 InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning Jingkuan Song Team 2505.13888 null
2025-05-19 ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models Greg Durrett Team 2505.13444 null
2025-05-19 G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning Baobao Chang Team 2505.13426 link
2025-05-19 Seeing, Saying, Solving: An LLM-to-TL Framework for Cooperative Robots Shreyas Kousik Team 2505.13376 null
2025-05-20 Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning Lan-Zhe Guo Team 2505.13317 null
2025-05-19 I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models R. Maria del Rio-Chanona Team 2505.13302 link
2025-05-19 Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts Sashank Varma Team 2505.13281 null
2025-05-19 From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection Jian Liang Team 2505.13233 link
2025-05-19 ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models Pekka Marttinen Team 2505.13180 link
2025-05-19 Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model Dong Yu Team 2505.13062 null
2025-05-20 3D Visual Illusion Depth Estimation Yunde Jia Team 2505.13061 link
2025-05-19 MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO Ying Shan Team 2505.13031 link
2025-05-19 Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption Tomoki Hamagami Team 2505.12912 link
2025-05-19 TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks Jin Dong Team 2505.12884 null
2025-05-19 FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models Renxin Zhong Team 2505.12835 null
2025-05-19 VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection Ransalu Senanayake Team 2505.12715 null
2025-05-19 TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning Soodeh Nikan Team 2505.12670 null
2025-05-19 Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps Miguel P. Eckstein Team 2505.12660 null
2025-05-19 AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use Fei Wei Team 2505.12650 link
2025-05-19 Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency Zhengyu Zhao Team 2505.12644 null
2025-05-19 Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents Honglak Lee Team 2505.12632 null
2025-05-16 Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner Hong Bu Team 2505.11404 null
2025-05-16 Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild Guillaume Sartoretti Team 2505.11350 null
2025-05-16 Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models Joyce Chai Team 2505.11326 null
2025-05-16 Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation Chang D. Yoo Team 2505.11221 null
2025-05-16 Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing Begüm Demir Team 2505.11121 null
2025-05-16 CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs Natalia Díaz-Rodríguez Team 2505.11060 null
2025-05-16 Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere Prashant Singh Team 2505.11029 null
2025-05-16 On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating Alessandro Rinaldo Team 2505.10860 null
2025-05-16 Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities Shan Lin Team 2505.10764 null
2025-05-15 GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data? Tanwi Mallick Team 2505.10714 null
2025-05-15 MOSAIC: A Multi-View 2.5D Organ Slice Selector with Cross-Attentional Reasoning for Anatomically-Aware CT Localization in Medical Organ Segmentation Muzammil Behzad Team 2505.10672 null
2025-05-15 CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier Ziyang Ou Team 2505.10664 null
2025-05-15 Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding Chong Feng Team 2505.10634 null
2025-05-15 MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly Mark Steedman Team 2505.10610 null
2025-05-18 MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models Vithursan Thangarasa Team 2505.10526 null
2025-05-16 AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges Manoj Karkee Team 2505.10468 null
2025-05-15 Vision language models have difficulty recognizing virtual objects J. G. Trafton Team 2505.10453 null
2025-05-15 MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models Xiaodong Gu Team 2505.10088 link
2025-05-15 AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection Chengjie Wang Team 2505.09926 link
2025-05-14 Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling Nikolaus Correll Team 2505.09731 null
2025-05-14 ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation Daniel Seita Team 2505.09698 null
2025-05-14 LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models Yanan Sun Team 2505.09659 link
2025-05-14 Variational Visual Question Answering Marcus Rohrbach Team 2505.09591 null
2025-05-14 VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation Shuo Wang Team 2505.09577 null
2025-05-14 Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput Lin Ma Team 2505.09498 null
2025-05-14 Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition Muzammil Behzad Team 2505.09336 null
2025-05-14 MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning Bin-Bin Gao Team 2505.09265 null
2025-05-14 Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models Ross Greer Team 2505.09139 null
2025-05-14 Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning Qing Li Team 2505.09118 null
2025-05-14 OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions Hao Zhou Team 2505.09092 link
2025-05-13 Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training Heng Ji Team 2505.08971 link
2025-05-15 Behind Maya: Building a Multilingual Vision Language Model Alham Fikri Aji Team 2505.08910 link
2025-05-12 Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare Imon Banerjee Team 2505.08818 null
2025-05-13 Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving Xiang Bai Team 2505.08725 link
2025-05-13 OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Yu Cheng Team 2505.08617 link
2025-05-13 From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation Jianye Hao Team 2505.08548 null
2025-05-13 Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? Jimmy Huang Team 2505.08468 link
2025-05-13 MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos Wei Zhang Team 2505.08367 null
2025-05-13 Removing Watermarks with Partial Regeneration using Semantic Information Michael W. Mahoney Team 2505.08234 link
2025-05-13 CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding Shuo Wang Team 2505.08194 null
2025-05-13 DSADF: Thinking Fast and Slow for Decision Making Shufei Zhang Team 2505.08189 null
2025-05-12 Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models Jia-Bin Huang Team 2505.07815 null
2025-05-12 Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction Andrew Yates Team 2505.07730 null
2025-05-12 Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images Vasily Konovalov Team 2505.07704 null
2025-05-12 Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models Yihong Gong Team 2505.07690 null
2025-05-12 Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization Sung Ju Hwang Team 2505.07675 null
2025-05-12 Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning Hanwang Zhang Team 2505.07538 null
2025-05-12 AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography Xiaomeng Li Team 2505.07347 null
2025-05-12 Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning Yahui Zhou Team 2505.07263 null
2025-05-12 Incomplete In-context Learning Yangshijie Zhang Team 2505.07251 null
2025-05-12 UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning Dzmitry Tsetserukou Team 2505.07236 null
2025-05-12 Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection Ningjiang Chen Team 2505.07219 link
2025-05-12 Internet of Agents: Fundamentals, Applications, and Challenges Dusit Niyato Team 2505.07176 null
2025-05-12 Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning Weiping Wang Team 2505.07172 null
2025-05-12 EmoVLM-KD: Fusing Distilled Expertise with Vision-Language Models for Visual Emotion Analysis Eunil Park Team 2505.07164 null
2025-05-11 A Vision-Language Foundation Model for Leaf Disease Identification Luyl-Da Quach Team 2505.07019 null
2025-05-11 Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models Binod Bhattarai Team 2505.07001 null
2025-05-11 UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms Zhenze Liu Team 2505.06832 null
2025-05-10 STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation Jean Oh Team 2505.06729 null
2025-05-10 METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection Shuo Yang Team 2505.06663 link
2025-05-10 Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation Nancy F. Chen Team 2505.06594 null
2025-05-09 MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks Bo Yan Team 2505.06152 link
2025-05-09 Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI Dominik Bollmann Team 2505.05895 null
2025-05-09 Describe Anything in Medical Images Min Xu Team 2505.05804 null
2025-05-09 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks Farshad Khorrami Team 2505.05800 null
2025-05-08 Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos Nina S. T. Hirata Team 2505.05681 null
2025-05-08 X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP James Bailey Team 2505.05528 link
2025-05-08 Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging Junxian He Team 2505.05464 link
2025-05-08 SITE: towards Spatial Intelligence Thorough Evaluation Boqing Gong Team 2505.05456 null
2025-05-08 DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning Jun Ma Team 2505.05360 null
2025-05-08 Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization Joon Son Chung Team 2505.05343 link
2025-05-08 Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects Matteo Matteucci Team 2505.05318 null
2025-05-08 Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models Meng Zhang Team 2505.05189 null
2025-05-08 OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning Qingming Huang Team 2505.05180 link
2025-05-08 Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models Joachim Denzler Team 2505.05163 null
2025-05-08 CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models Furao Shen Team 2505.05130 null
2025-05-08 X-Driver: Explainable Autonomous Driving with Vision-Language Models Zengfeng Zeng Team 2505.05098 null
2025-05-08 Image-Text Relation Prediction for Multilingual Tweets Edison Marrese-Taylor Team 2505.05040 null
2025-05-09 G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness Youngjae Yu Team 2505.05026 null
2025-05-08 Split Matching for Inductive Zero-shot Semantic Segmentation Daisuke Deguchi Team 2505.05023 null
2025-05-08 LVLM-MPC Collaboration for Autonomous Driving: A Safety-Aware and Task-Scalable Control Architecture Tatsuya Suzuki Team 2505.04980 null
2025-05-07 Vision-Language-Action Models: Concepts, Progress, Applications and Challenges Manoj Karkee Team 2505.04769 null
2025-05-07 "I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments Xinlei He Team 2505.04488 null
2025-05-07 DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception Zhuotao Tian Team 2505.04410 link
2025-05-07 CM1 -- A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models Gernot A. Fink Team 2505.04214 null
2025-05-07 R^3-VQA: "Read the Room" by Video Social Reasoning Lifeng Fan Team 2505.04147 null
2025-05-06 X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains Hoifung Poon Team 2505.03981 null
2025-05-06 Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning Victor Amblard Team 2505.03703 null
2025-05-06 Distribution-Conditional Generation: From Class Distribution to Creative Generation Xin Geng Team 2505.03667 null
2025-05-06 Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images Zhenan Sun Team 2505.03611 null
2025-05-06 Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection Ming-Hsuan Yang Team 2505.03610 null
2025-05-06 Mitigating Image Captioning Hallucinations in Vision-Language Models Xi Li Team 2505.03420 null
2025-05-07 Enhancing Target-unspecific Tasks through a Features Matrix Jun Yu Team 2505.03414 null
2025-05-06 Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models Aiden Doherty Team 2505.03374 null
2025-05-06 A Vision-Language Model for Focal Liver Lesion Classification Chen Yen-Wei Team 2505.03350 null
2025-05-06 From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection Rong Xiao Team 2505.03334 null
2025-05-06 Seeing the Abstract: Translating the Abstract Language for Vision Language Models Yiming Wang Team 2505.03242 link
2025-05-06 VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making Juan Carlos Niebles Team 2505.03181 null
2025-05-06 Robust Fairness Vision-Language Learning for Medical Image Analysis Shu Hu Team 2505.03153 link
2025-05-05 Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation Manish Dhakal Team 2505.02971 null
2025-05-05 LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery David M. Chan Team 2505.02829 null
2025-05-05 HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction Dzmitry Tsetserukou Team 2505.02569 null
2025-05-05 Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality Jimmy Lin Team 2505.02466 null
2025-05-05 Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey Songcan Chen Team 2505.02448 null
2025-05-05 SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing Sijie Zhu Team 2505.02370 link
2025-05-05 TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment Xinwei He Team 2505.02325 null
2025-05-04 Compositional Image-Text Matching and Retrieval by Grounding Entities Jana Košecká Team 2505.02278 null
2025-05-04 Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin Xinyang Chen Team 2505.02056 null
2025-05-04 A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models Xinya Du Team 2505.01958 null
2025-05-03 PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications Santosh Patapati Team 2505.01881 null
2025-05-03 Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos Anett Hoppe Team 2505.01790 null
2025-05-03 An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding Guoliang Xing Team 2505.01743 null
2025-05-03 Vision and Intention Boost Large Language Model in Long-Term Action Anticipation Yanning Zhang Team 2505.01713 null
2025-05-03 RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation Xiaodan Liang Team 2505.01709 null
2025-05-03 Topology-Aware CLIP Few-Shot Learning Dazhi Huang Team 2505.01694 null
2025-05-02 TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action Jenq-Neng Hwang Team 2505.01583 null
2025-05-02 Grounding Task Assistance with Multimodal Cues from a Single Demonstration Andrew D. Wilson Team 2505.01578 null
2025-05-02 Dynamic Robot Tool Use with Vision Language Models Ahmed H. Qureshi Team 2505.01399 null
2025-05-02 Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages Valerio Guarrasi Team 2505.01096 null
2025-05-02 Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation Valerio Guarrasi Team 2505.01091 null
2025-05-02 Transferable Adversarial Attacks on Black-Box Vision-Language Models Matt Fredrikson Team 2505.01050 null
2025-04-30 Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis Alexei Kaltchenko Team 2505.00746 null
2025-05-01 Robotic Visual Instruction Xianzheng Ma Team 2505.00693 null
2025-05-01 Visual Test-time Scaling for GUI Agent Grounding Honglak Lee Team 2505.00684 null
2025-05-01 DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation Yang Gao Team 2505.00527 null
2025-05-01 LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving Henry X. Liu Team 2505.00284 null
2025-05-01 AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care Tianming Liu Team 2505.00275 null
2025-04-30 V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving Markus Lienkamp Team 2505.00156 null
2025-04-30 Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models Xintao Wu Team 2505.00150 null
2025-04-30 Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design Mahdi S. Hosseini Team 2505.00134 null
2025-04-30 Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization Ganesh Ramakrishnan Team 2504.21831 null
2025-04-30 Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models Lin Lee Cheong Team 2504.21559 null
2025-04-30 RoboGround: Robotic Manipulation with Grounded Vision-Language Priors Zhou Zhao Team 2504.21530 null
2025-04-30 Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection William Hsu Team 2504.21344 null
2025-04-29 MemeBLIP2: A novel lightweight multimodal system to detect harmful memes Lisha Xu Team 2504.21226 null
2025-04-29 GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model Yue Zhao Team 2504.21186 null
2025-04-29 Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization Xiaojun Chang Team 2504.21063 null
2025-04-29 Real-Time Wayfinding Assistant for Blind and Low-Vision Users Farhan Sadaf Team 2504.20976 null
2025-04-29 FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models Elisa Ricci Team 2504.20860 null
2025-04-29 In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer Yi Yang Team 2504.20690 null
2025-04-29 SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data Freda Shi Team 2504.20648 null
2025-04-29 PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations Xuguang Lan Team 2504.20520 null
2025-04-29 Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception Xiaoqiang Li Team 2504.20468 null
2025-04-29 Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks Dimitrios K. Nasiopoulos Team 2504.20419 null
2025-04-29 FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding Bo Zheng Team 2504.20384 null
2025-04-28 A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports Christoph M. Friedrich Team 2504.20220 null
2025-04-28 Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains Rui Yan Team 2504.20199 null
2025-04-28 SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning Alan Yuille Team 2504.20024 null
2025-04-28 EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia Diego Marcos Team 2504.19742 null
2025-04-28 Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model Guoying Zhao Team 2504.19739 null
2025-04-28 VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning Xiaobo Xia Team 2504.19627 null
2025-04-28 LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning Aimin Yang Team 2504.19524 null
2025-04-27 DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning Shini Han Team 2504.19127 null
2025-04-27 Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction Jian Liu Team 2504.19086 null
2025-04-26 Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation Arif Mahmood Team 2504.18856 null
2025-04-26 Video CLIP Model for Multi-View Echocardiography Interpretation Norihiko Takeda Team 2504.18800 null
2025-04-25 A Review of 3D Object Detection with Vision-Language Models Manoj Karkee Team 2504.18738 null
2025-04-25 Proof-of-TBI -- Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction Donna Broshek Team 2504.18671 null
2025-04-25 Generalization Capability for Imitation Learning Yixiao Wang Team 2504.18538 null
2025-04-25 Fast-Slow Thinking for Large Vision-Language Model Reasoning Fei Wu Team 2504.18458 null
2025-04-25 Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation Guang Yang Team 2504.18453 null
2025-04-25 Revisiting Data Auditing in Large Vision-Language Models Zhuosheng Zhang Team 2504.18349 null
2025-04-25 A Large Vision-Language Model based Environment Perception System for Visually Impaired People Shiguo Lian Team 2504.18027 null
2025-04-24 CAMU: Context Augmentation for Meme Understanding Aditya Joshi Team 2504.17902 null
2025-04-24 FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model Waikeung Wong Team 2504.17826 null
2025-04-25 Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction Weiyan Wen Team 2504.17671 null
2025-04-24 SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting Qingming Huang Team 2504.17395 null
2025-04-24 M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction Tatsunori Mori Team 2504.17353 null
2025-04-24 DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model Hao Yang Team 2504.17315 null
2025-04-24 Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning Khimya Khetarpal Team 2504.17282 null
2025-04-24 Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation Minhyuk Sung Team 2504.17207 null
2025-04-23 Distilling semantically aware orders for autoregressive image generation Marco Pedersoli Team 2504.17069 null
2025-04-23 DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs Ran Xu Team 2504.17040 null
2025-04-24 V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations Yi R. Fung Team 2504.16727 null
2025-04-23 Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes Giovanni Fusco Team 2504.16538 null
2025-04-23 TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance Jiaya Jia Team 2504.16505 null
2025-04-23 FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing Biplab Banerjee Team 2504.16433 null
2025-04-22 CLIP-IT: CLIP-based Pairing for Histology Images Classification Eric Granger Team 2504.16181 null
2025-04-22 MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Lili Qiu Team 2504.16083 null
2025-04-22 MR. Video: "MapReduce" is the Principle for Long Video Understanding Yu-Xiong Wang Team 2504.16082 null
2025-04-22 Describe Anything: Detailed Localized Image and Video Captioning Yin Cui Team 2504.16072 null
2025-04-22 Vision language models are unreliable at trivial spatial cognition J. Gregory Trafton Team 2504.16061 null
2025-04-22 Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation Joyce Chai Team 2504.16060 null
2025-04-22 Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis Judy Gichoya Team 2504.16047 null
2025-04-22 LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Mike Zheng Shou Team 2504.16030 null
2025-04-24 Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models Tolga Çukur Team 2504.15929 null
2025-04-21 CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting Mohit Bansal Team 2504.15485 null
2025-04-21 Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models Guilin Liu Team 2504.15271 null
2025-04-21 KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking Kijung Shin Team 2504.15135 link
2025-04-21 Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation Serge Belongie Team 2504.14988 link
2025-04-21 VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform Kun Gai Team 2504.14904 null
2025-04-21 Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation Yunji Chen Team 2504.14848 null
2025-04-20 OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding Zuozhu Liu Team 2504.14692 null
2025-04-20 NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation Juho Kannala Team 2504.14638 null
2025-04-20 LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation Yongsheng Gao Team 2504.14467 null
2025-04-20 Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability Sandra Avila Team 2504.14446 null
2025-04-19 Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models Nathaniel D. Bastian Team 2504.14395 null
2025-04-19 How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? James Zou Team 2504.14391 null
2025-04-19 A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling Adriana Kovashka Team 2504.14359 null
2025-04-19 Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses Chau Yuen Team 2504.14326 null
2025-04-19 Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization Xu Yang Team 2504.14200 null
2025-04-19 Bayesian Principles Improve Prompt Learning In Vision-Language Models Mijung Park Team 2504.14123 null
2025-04-19 PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models Ozlem Ozmen Garibay Team 2504.14117 null
2025-04-21 Analysing the Robustness of Vision-Language-Models to Common Corruptions Umair Bin Mansoor Team 2504.13690 null
2025-04-18 EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model Beng Chin Ooi Team 2504.13650 link
2025-04-18 PV-VLM: A Multimodal Vision-Language Approach Incorporating Sky Images for Intra-Hour Photovoltaic Power Forecasting Miao Yu Team 2504.13624 null
2025-04-18 Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization Huadong Ma Team 2504.13460 null
2025-04-18 Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety Ross Greer Team 2504.13399 null
2025-04-17 VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture Yanbo Huang Team 2504.13365 null
2025-04-17 Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models Jacky Liang Team 2504.13351 null
2025-04-17 WildFireCan-MMD: A Multimodal dataset for Classification of User-generated Content During Wildfires in Canada Marzieh Amini Team 2504.13231 null
2025-04-17 PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding Christoph Feichtenhofer Team 2504.13180 null
2025-04-17 Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling David M. Chan Team 2504.13169 link
2025-04-17 Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training Zhanhui Kang Team 2504.13123 null
2025-04-17 Probing and Inducing Combinational Creativity in Vision-Language Models Zilong Zheng Team 2504.13120 null
2025-04-17 Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration Yong Hong Kuo Team 2504.13119 null
2025-04-17 Early Accessibility: Automating Alt-Text Generation for UI Icons During App Development Christoph Csallner Team 2504.13069 null
2025-04-17 NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Michael Qizhe Shieh Team 2504.13055 null
2025-04-17 Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning Wenwu Zhu Team 2504.12680 link
2025-04-17 VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization Siheng Chen Team 2504.12661 null
2025-04-16 Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation Éric Granger Team 2504.12436 link
2025-04-16 FLIP Reasoning Challenge Roger Wattenhofer Team 2504.12256 null
2025-04-16 Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models - Hanno Gottschalk Team 2504.12137 null
2025-04-17 Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions Zhi-Qi Cheng Team 2504.11967 null
2025-04-16 Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning Yi Chang Team 2504.11930 null
2025-04-16 A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification Janis Keuper Team 2504.11838 null
2025-04-17 DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment Moncef Gabbouj Team 2504.11733 null
2025-04-16 Interpreting the Linear Structure of Vision-language Model Embedding Spaces Stephanie Gil Team 2504.11695 null
2025-04-16 VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps Mariano Ceccato Team 2504.11675 null
2025-04-15 Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation Majid Mirmehdi Team 2504.11669 null
2025-04-17 PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage Lina Wang Team 2504.11509 null
2025-04-15 From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation Jungong Han Team 2504.11368 null
2025-04-17 UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis Yan Lu Team 2504.11257 null
2025-04-15 R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning Ran He Team 2504.11195 null
2025-04-15 Benchmarking Vision Language Models on German Factual Data Vincent Tischler Team 2504.11108 null
2025-04-16 Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR Gongshen Liu Team 2504.11101 null
2025-04-15 QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models Yu Wang Team 2504.11038 null
2025-04-15 Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles Ross Greer Team 2504.10873 null
2025-04-15 LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation Mohsen Imani Team 2504.10854 null
2025-04-15 Enhancing Features in Long-tailed Data Using Large Vision Mode Xuesong Li Team 2504.10852 null
2025-04-14 ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models Lifeng Zhou Team 2504.10757 null
2025-04-14 AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark Yu-Xiong Wang Team 2504.10568 null
2025-04-14 Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding Jiashi Feng Team 2504.10465 null
2025-04-15 GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents Run Luo Team 2504.10458 null
2025-04-14 SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model Yanning Zhang Team 2504.10320 null
2025-04-15 Breaking the Data Barrier -- Building GUI Agents Through Task Generalization Junxian He Team 2504.10127 null
2025-04-14 AGO: Adaptive Grounding for Open World 3D Occupancy Prediction Andreas Zell Team 2504.10117 null
2025-04-14 CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography Jun-Cheng Chen Team 2504.10090 null
2025-04-14 Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure Frédéric Dufaux Team 2504.10049 null
2025-04-14 Aligning Anime Video Generation with Human Feedback Zuxuan Wu Team 2504.10044 null
2025-04-14 KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks Takamitsu Matsubara Team 2504.10011 null
2025-04-14 GenTe: Generative Real-world Terrains for General Legged Robot Locomotion Control Xiaoqiang Ji Team 2504.09997 null
2025-04-14 Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models Keisuke Ozawa Team 2504.09979 null
2025-04-14 Can VLMs Assess Similarity Between Graph Visualizations? Jinwook Seo Team 2504.09859 null
2025-04-14 VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents Jun Suzuki Team 2504.09795 null
2025-04-13 A Survey on Efficient Vision-Language Models Nirmalya Roy Team 2504.09724 null
2025-04-13 Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference Tadahiro Taniguchi Team 2504.09620 null
2025-04-13 DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning Mukesh Prasad Team 2504.09598 null
2025-04-13 Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation Yunhong Wang Team 2504.09480 null
2025-04-13 Identity-Aware Vision-Language Model for Explainable Face Forgery Detection Yu-Gang Jiang Team 2504.09439 null
2025-04-13 BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning Boqing Gong Team 2504.09426 null
2025-04-12 PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks Yang Liu Team 2504.09258 null
2025-04-11 AstroLLaVA: towards the unification of astronomical data and natural language Dimitrios Tanoglidis Team 2504.08583 null
2025-04-11 EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models Jinwoo Kim Team 2504.08205 null
2025-04-10 Investigating Vision-Language Model for Point Cloud-based Vehicle Classification Camille Kamga Team 2504.08154 null
2025-04-10 The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search David Ha Team 2504.08066 null
2025-04-10 VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning Feng Zhao Team 2504.07956 null
2025-04-10 SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos Yuhao Chen Team 2504.07867 null
2025-04-10 CollEX -- A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections Chris Biemann Team 2504.07643 null
2025-04-10 VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Tiancheng Zhao Team 2504.07615 link
2025-04-10 TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs Xuezhi Cao Team 2504.07556 null
2025-04-10 Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models Xian-Sheng Hua Team 2504.07521 link
2025-04-10 Kimi-VL Technical Report Ziwei Chen Team 2504.07491 link
2025-04-09 Perception in Reflection Vishal M. Patel Team 2504.07165 null
2025-04-09 Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation Marzieh Fadaee Team 2504.07072 null
2025-04-09 Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition Aythami Morales Team 2504.06925 null
2025-04-09 MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking Hesheng Wang Team 2504.06863 null
2025-04-09 ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models Namhoon Lee Team 2504.06838 null
2025-04-09 LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding Bo XU Team 2504.06835 null
2025-04-08 PromptHMR: Promptable Human Mesh Recovery Muhammed Kocabas Team 2504.06397 null
2025-04-08 SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation Zhaozheng Yin Team 2504.06389 null
2025-04-08 OmniSVG: A Unified Scalable Vector Graphics Generation Model Yu-Gang Jiang Team 2504.06263 null
2025-04-08 Latent Multimodal Reconstruction for Misinformation Detection Panagiotis C. Petrantonakis Team 2504.06010 link
2025-04-08 Measuring Déjà vu Memorization Efficiently Kamalika Chaudhuri Team 2504.05651 null
2025-04-08 A Lightweight Large Vision-language Model for Multimodal Medical Images Navid Toosy Saidy Team 2504.05575 null
2025-04-10 ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering Shafiq Joty Team 2504.05506 null
2025-04-07 Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models Aliasghar Arab Team 2504.05477 null
2025-04-07 Taxonomy-Aware Evaluation of Vision-Language Models Stella Frank Team 2504.05457 null
2025-04-07 Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly Anamaria Crisan Team 2504.05445 null
2025-04-07 InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Dimitrios Tzionas Team 2504.05303 null
2025-04-07 SmolVLM: Redefining small and efficient multimodal models Thomas Wolf Team 2504.05299 null
2025-04-07 A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text? Ismail Ben Ayed Team 2504.05227 null
2025-04-07 Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation Wei Zhang Team 2504.05225 null
2025-04-08 A Taxonomy of Self-Handover Katsushi Ikeuchi Team 2504.04939 null
2025-04-07 SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models Lorenz Hufe Team 2504.04893 null
2025-04-07 Don't Lag, RAG: Training-Free Adversarial Detection Using RAG Ofer Hadar Team 2504.04858 null
2025-04-07 OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance Xinhan Di Team 2504.04781 null
2025-04-07 Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding Zahir Alsulaimawi Team 2504.04772 null
2025-04-07 Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions Yue Wang Team 2504.04744 null
2025-04-07 Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data Venkatesh Saligrama Team 2504.04740 null
2025-04-06 M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models Ruixiang Tang Team 2504.04633 null
2025-04-06 Foundation Models for Software Engineering of Cyber-Physical Systems: the Road Ahead Shaukat Ali Team 2504.04630 null
2025-04-06 Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection Xiaomeng Huang Team 2504.04517 link
2025-04-06 OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning Jose M. Alvarez Team 2504.04348 null
2025-04-06 MedM-VL: What Makes a Good Medical LVLM? Ji Wu Team 2504.04323 null
2025-04-05 GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill Siyuan Huang Team 2504.04191 null
2025-04-05 LATTE: Lightweight Attention-based Traffic Accident Anticipation Engine Zhenning Li Team 2504.04103 null
2025-04-05 TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection Xiaohua Xu Team 2504.04099 null
2025-04-04 VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models Anelia Angelova Team 2504.03970 null
2025-04-04 Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models Matias Valdenegro-Toro Team 2504.03440 null
2025-04-04 SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding Naoto Yokoya Team 2504.03254 null
2025-04-04 Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators Lawson L. S. Wong Team 2504.03245 null
2025-04-04 Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation Robby T. Tan Team 2504.03193 null
2025-04-04 NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving Zhengzhong Tu Team 2504.03164 null
2025-04-04 TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference Xianpeng Lang Team 2504.03154 null
2025-04-04 MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories Arvind Ramanathan Team 2504.03153 null
2025-04-03 QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding Bryan Wang Team 2504.02971 null
2025-04-03 STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection Naoufel Werghi Team 2504.02823 null
2025-04-03 Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models Zeynep Akata Team 2504.02821 null
2025-04-03 Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence Serena Yeung-Levy Team 2504.02799 null
2025-04-03 Robot-Led Vision Language Model Wellbeing Assessment of Children Hatice Gunes Team 2504.02765 null
2025-04-04 Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme Pengfei Liu Team 2504.02587 null
2025-04-03 Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision Shibiao Xu Team 2504.02477 null
2025-04-03 Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation Rui Yan Team 2504.02438 null
2025-04-03 ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback Hailong Wang Team 2504.02357 null
2025-04-03 Large (Vision) Language Models are Unsupervised In-Context Learners Maria Brbic Team 2504.02349 link
2025-04-03 Re-thinking Temporal Search for Long-Form Video Understanding Manling Li Team 2504.02259 null
2025-04-03 SocialGesture: Delving into Multi-person Gesture Understanding James M. Rehg Team 2504.02244 null
2025-04-02 FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs Fatima Albreiki Team 2504.01916 link
2025-04-02 Is Temporal Prompting All We Need For Limited Labeled Action Recognition? Xiaobo Jin Team 2504.01890 null
2025-04-02 Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images Abdullah-Al-Zubaer Imran Team 2504.01838 link
2025-04-02 BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing Leonidas Guibas Team 2504.01786 null
2025-04-02 AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization Linli Xu Team 2504.01735 null
2025-04-02 Reasoning LLMs for User-Aware Multimodal Conversational Agents Mohamed Chetouani Team 2504.01700 null
2025-04-02 CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition Hamzah Luqman Team 2504.01666 link
2025-04-02 BioAtt: Anatomical Prior Driven Low-Dose CT Denoising UiHyun Cho Team 2504.01662 null
2025-04-02 Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models Ming-Hsuan Yang Team 2504.01589 null

(back to top)

VLA

Publish Date Title Authors PDF Code
2025-07-23 InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation Jiangmiao Pang Team 2507.17520 null
2025-07-23 ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents Hesheng Wang Team 2507.17462 null
2025-07-23 Confidence Calibration in Vision-Language-Action Models Richard Zemel Team 2507.17383 null
2025-07-23 VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback Harold Soh Team 2507.17294 null
2025-07-22 ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning Fu-En Yang Team 2507.16815 null
2025-07-21 Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos Zongqing Lu Team 2507.15597 null
2025-07-22 GR-3 Technical Report Yichu Yang Team 2507.15493 null
2025-07-18 EdgeVLA: Efficient Vision-Language-Action Models Benjamin Bolte Team 2507.14049 null
2025-07-21 LaViPlan: Language-Guided Visual Path Planning with RLVR Hayeon Oh Team 2507.12911 null
2025-07-17 AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation Jun Zhu Team 2507.12768 null
2025-07-18 EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos Xiaolong Wang Team 2507.12440 null
2025-07-14 Vision Language Action Models in Robotic Manipulation: A Systematic Review Irfan Hussain Team 2507.10672 null
2025-07-12 Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization Yang Gao Team 2507.09160 null
2025-07-09 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds Nick Haber Team 2507.06484 null
2025-07-07 NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving Cheng Lu Team 2507.05227 null
2025-07-10 VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting Yanzhi Wang Team 2507.05116 null
2025-07-17 DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge Xin Jin Team 2507.04447 null
2025-07-06 Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties Yunxin Liu Team 2507.04227 null
2025-07-03 DexVLG: Dexterous Vision-Language-Grasp Model at Scale He Wang Team 2507.02747 null
2025-07-02 cVLA: Towards Efficient Camera-Space VLAs Thomas Brox Team 2507.02190 null
2025-07-02 A Survey on Vision-Language-Action Models: An Action Tokenization Perspective Yaodong Yang Team 2507.01925 null
2025-07-02 MoIRA: Modular Instruction Routing Architecture for Multi-Task Robotics Nadiya Shvai Team 2507.01843 null
2025-07-03 TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control Yanwei Fu Team 2507.01424 null
2025-07-01 VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers Tong He Team 2507.01016 null
2025-07-01 Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding Bo Zhao Team 2507.00416 null
2025-06-30 A Survey on Vision-Language-Action Models for Autonomous Driving Lijun Sun Team 2506.24044 null
2025-06-27 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration Li Zhang Team 2506.22242 null
2025-06-26 WorldVLA: Towards Autoregressive Action World Model Hao Chen Team 2506.21539 null
2025-06-26 Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends Zeng-Guang Hou Team 2506.20966 null
2025-06-24 Unified Vision-Language-Action Model Zhaoxiang Zhang Team 2506.19850 null
2025-06-24 CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation Jiangmiao Pang Team 2506.19816 null
2025-07-07 RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models Marco Pavone Team 2506.17811 null
2025-06-21 RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models Xiao Li Team 2506.17639 null
2025-06-21 VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models Lin Shao Team 2506.17561 null
2025-06-19 CapsDT: Diffusion-Transformer for Capsule Robot Manipulation Hongliang Ren Team 2506.16263 null
2025-06-19 ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models Siyuan Huang Team 2506.16211 null
2025-06-19 ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes Hao Dong Team 2506.14317 null
2025-06-16 GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics Mac Schwager Team 2506.14009 null
2025-06-16 AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning Jiaqi Ma Team 2506.13757 link
2025-06-19 LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction Shankar Sastry Team 2506.13751 null
2025-06-16 CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding Haoang Li Team 2506.13725 null
2025-06-16 ROSA: Harnessing Robot States for Vision-Language and Action Alignment Xiaoyan Sun Team 2506.13679 null
2025-06-16 Block-wise Adaptive Caching for Accelerating Diffusion Policy Zhi Wang Team 2506.13456 null
2025-06-19 A Comprehensive Survey on Continual Learning in Generative Models Cheng-Lin Liu Team 2506.13045 link
2025-06-19 SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration Wenwu Zhu Team 2506.12723 null
2025-06-13 RationalVLA: A Rational Vision-Language-Action Model with Dual System Haoang Li Team 2506.10826 null
2025-06-11 EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models Linfeng Zhang Team 2506.10100 null
2025-06-11 SAFE: Multitask Failure Detection for Vision-Language-Action Models Florian Shkurti Team 2506.09937 null
2025-06-11 From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models Chen Feng Team 2506.09930 null
2025-06-17 An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models Harshvardhan Sikka Team 2506.09172 null
2025-06-10 FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency Jian Tang Team 2506.08822 null
2025-06-10 Hybrid Reasoning for Perception, Explanation, and Autonomous Action in Manufacturing Sebastian W. Pattinson Team 2506.08462 null
2025-06-11 TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization Qi Wang Team 2506.08440 null
2025-06-11 HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation Cong Wang Team 2506.08296 null
2025-06-14 Agentic Surgical AI: Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion in a Vision-Language-Action Framework Jason H. Moore Team 2506.08185 link
2025-06-09 BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Tieniu Tan Team 2506.07961 null
2025-06-09 Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse Chris Xiaoxuan Lu Team 2506.07639 null
2025-06-09 BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation Xilin Chen Team 2506.07530 link
2025-06-09 Real-Time Execution of Action Chunking Flow Policies Sergey Levine Team 2506.07339 null
2025-06-12 Robotic Policy Learning via Human-assisted Action Preference Optimization Di Hu Team 2506.07127 null
2025-06-07 RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation Si Liu Team 2506.06677 null
2025-06-06 MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping Farshad Khorrami Team 2506.06535 null
2025-06-06 DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models Xianpeng Lang Team 2506.05667 null
2025-06-04 SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models Jian Tang Team 2506.03574 null
2025-06-03 Adversarial Attacks on Robotic Vision Language Action Models J. Zico Kolter Team 2506.03350 link
2025-06-02 Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning Pheng-Ann Heng Team 2506.01953 null
2025-06-02 SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics Remi Cadene Team 2506.01844 link
2025-06-02 MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments Jun Zhu Team 2506.01616 null
2025-06-02 ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding Huaxiu Yao Team 2506.01300 null
2025-06-01 OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation Valts Blukis Team 2506.01196 null
2025-05-31 LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks Zhijie Deng Team 2506.00411 null
2025-05-30 Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction Xuelong Li Team 2505.24156 null
2025-05-29 Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models Hao Zhao Team 2505.23757 link
2025-05-29 Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better Sergey Levine Team 2505.23705 null
2025-05-29 Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents Lichao Sun Team 2505.23450 null
2025-05-29 TrackVLA: Embodied Visual Tracking in the Wild He Wang Team 2505.23189 null
2025-05-28 ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation Wenqiang Zhang Team 2505.22159 null
2025-05-29 ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge Yi Xu Team 2505.21906 null
2025-05-27 EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models Xiang Chen Team 2505.21567 null
2025-06-02 Hume: Introducing System-2 Thinking in Visual-Language-Action Model Xuelong Li Team 2505.21432 null
2025-05-27 Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models Tao Chen Team 2505.21200 null
2025-05-26 Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review Goldie Nejat Team 2505.20503 null
2025-05-26 What Can RL Bring to VLA Generalization? An Empirical Study Yu Wang Team 2505.19789 null
2025-05-26 RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback Yongtao Wang Team 2505.19767 null
2025-05-25 ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning Minh Nhat Vu Team 2505.19080 null
2025-05-24 Genie Centurion: Accelerating Scalable Real-World Robot Training with Human Rewind-and-Refine Guidance Maoqing Yao Team 2505.18793 null
2025-05-24 VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning Ziwei Wang Team 2505.18719 link
2025-05-22 ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems Farhad Imani Team 2505.17295 null
2025-05-22 Interactive Post-Training for Vision-Language-Action Models Philipp Krähenbühl Team 2505.17016 null
2025-05-22 Perceptual Quality Assessment for Embodied AI Guangtao Zhai Team 2505.16815 link
2025-05-22 BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization Lichao Sun Team 2505.16640 null
2025-05-22 DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving Junchi Yan Team 2505.16278 null
2025-05-21 From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems Soujanya Poria Team 2505.15685 link
2025-05-24 Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization Junwei Liang Team 2505.15660 link
2025-05-21 FLARE: Robot Learning with Implicit World Modeling Linxi Fan Team 2505.15659 null
2025-05-21 Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control Jungwook Choi Team 2505.15304 null
2025-05-21 EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy Hongliang Ren Team 2505.15206 null
2025-05-21 Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation Xiaodong He Team 2505.15098 null
2025-05-20 AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory Ping Luo Team 2505.14030 null
2025-05-22 InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning Jingkuan Song Team 2505.13888 link
2025-05-25 RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction Bo Zhao Team 2505.12224 null
2025-05-17 OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning Yang Gao Team 2505.11917 null
2025-05-16 Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions Donglin Wang Team 2505.11214 null
2025-05-16 Conditioning Matters: Training Diffusion Policies is Faster Than You Think Jianye Hao Team 2505.11123 null
2025-05-14 Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware Ken Goldberg Team 2505.09601 null
2025-05-14 RT-cache: Efficient Robot Trajectory Retrieval System Amir Barati Farimani Team 2505.09040 null
2025-05-13 From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation Jianye Hao Team 2505.08548 null
2025-05-17 Training Strategies for Efficient Embodied Reasoning Sergey Levine Team 2505.08243 null
2025-05-12 Pixel Motion as Universal Representation for Robot Control Michael S Ryoo Team 2505.07817 null
2025-05-12 ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning Donglin Wang Team 2505.07395 null
2025-05-15 UniVLA: Learning to Act Anywhere with Task-centric Latent Actions Hongyang Li Team 2505.06111 link
2025-05-09 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks Farshad Khorrami Team 2505.05800 null
2025-05-08 Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments Harshvardhan Sikka Team 2505.05540 link
2025-05-07 Vision-Language-Action Models: Concepts, Progress, Applications and Challenges Manoj Karkee Team 2505.04769 null
2025-05-06 OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation Donglin Wang Team 2505.03912 link
2025-05-16 Task Reconstruction and Extrapolation for $π_0$ using Text Latent Quanyi Li Team 2505.03500 null
2025-05-06 GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data He Wang Team 2505.03233 null
2025-05-06 Automated Data Curation Using GPS & NLP to Generate Instruction-Action Pairs for Autonomous Vehicle Vision-Language Navigation Datasets Ross Greer Team 2505.03174 null
2025-05-04 CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation Hao Dong Team 2505.02166 null
2025-05-04 Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions Mingyu Ding Team 2505.02152 null
2025-04-28 NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks Soujanya Poria Team 2504.19854 null
2025-04-22 $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization Ury Zhilinsky Team 2504.16054 null
2025-04-22 Few-Shot Vision-Language Action-Incremental Policy Learning Weili Guan Team 2504.15517 null
2025-04-18 GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents Xiaobo Xia Team 2504.10458 null
2025-04-09 OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning Tyler Fenstermaker Team 2504.06538 null
2025-04-02 Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning Roozbeh Mottaghi Team 2504.00907 null
2025-03-30 OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model Alois C. Knoll Team 2503.23463 link
2025-03-27 CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models Tsung-Yi Lin Team 2503.22020 null
2025-04-14 MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation Shanghang Zhang Team 2503.20384 null
2025-03-25 Gemini Robotics: Bringing AI into the Physical World Yuxiang Zhou Team 2503.20020 null
2025-03-25 Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy Yuntao Chen Team 2503.19757 null
2025-03-25 DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data Lin Ma Team 2503.19516 null
2025-03-27 GR00T N1: An Open Foundation Model for Generalist Humanoid Robots Yuke Zhu Team 2503.14734 null
2025-03-15 ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis Mingyu Ding Team 2503.14526 null
2025-03-17 MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation Haibin Yan Team 2503.13446 null
2025-03-17 HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model Shanghang Zhang Team 2503.10631 null
2025-03-12 CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games Bo Zheng Team 2503.09527 null
2025-03-11 MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models Zongyuan Ge Team 2503.08007 null
2025-03-10 PointVLA: Injecting the 3D World into Vision-Language-Action Models Yichen Zhu Team 2503.07511 null
2025-03-06 Refined Policy Distillation: From VLA Generalists to RL Experts Florian Walter Team 2503.05833 null
2025-03-06 VLA Model-Expert Collaboration for Bi-directional Manipulation Learning Zeng-Guang Hou Team 2503.04163 null
2025-03-26 OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction Pieter Abbeel Team 2503.03734 null
2025-03-05 SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning Yaodong Yang Team 2503.03480 null
2025-03-04 Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding Haoang Li Team 2503.02310 null
2025-03-03 CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs Dzmitry Tsetserukou Team 2503.01378 null

(back to top)

Humanoid

Publish Date Title Authors PDF Code
2025-07-22 Humanoid Robot Whole-body Geometric Calibration with Embedded Sensors and a Single Plane Florent Lamiraux Team 2507.16369 null
2025-07-20 Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture Lisa Dargasz Team 2507.15895 null
2025-07-21 EMP: Executable Motion Prior for Humanoid Robot Standing Upper-body Motion Imitation Rong Xiong Team 2507.15649 null
2025-07-16 Robot Drummer: Learning Rhythmic Skills for Humanoid Drumming Loris Roveda Team 2507.11498 null
2025-07-15 From Production Logistics to Smart Manufacturing: The Vision for a New RoboCup Industrial League Shohei Yasuda Team 2507.11402 null
2025-07-14 Physics-Informed Neural Networks with Unscented Kalman Filter for Sensorless Joint Torque Estimation in Humanoid Robots Daniele Pucci Team 2507.10105 null
2025-07-11 Learning Robust Motion Skills via Critical Adversarial Attacks for Humanoid Robots Yue Gao Team 2507.08303 null
2025-07-10 UniTracker: Learning Universal Whole-Body Motion Tracker for Humanoid Robots Weinan Zhang Team 2507.07356 null
2025-07-09 ULC: A Unified and Fine-Grained Controller for Humanoid Loco-Manipulation Zongwu Xie Team 2507.06905 null
2025-07-08 Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction Alessio Del Bue Team 2507.06404 null
2025-07-05 Learning Humanoid Arm Motion via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning Sangbae Kim Team 2507.04140 null
2025-07-01 HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning Chenjia Bai Team 2507.00833 null
2025-06-30 Mechanical Intelligence-Aware Curriculum Reinforcement Learning for Humanoids with Parallel Actuation Dennis Hong Team 2507.00273 null
2025-07-02 DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover Yuexin Ma Team 2506.23152 null
2025-06-29 Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots Yue Gao Team 2506.23125 null
2025-07-10 Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation Navid Azizan Team 2506.22827 null
2025-06-20 Unsupervised Discovery of Behavioral Primitives from Sensorimotor Dynamic Functional Connectivity Matej Hoffmann Team 2506.22473 null
2025-07-14 Ark: An Open-source Python-based Framework for Robot Learning Haitham Bou-Ammar Team 2506.21628 null
2025-07-18 A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots Wenjun Zeng Team 2506.20487 null
2025-06-19 DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning Zongqing Lu Team 2506.16012 link
2025-06-18 TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality Eiichi Yoshida Team 2506.15146 null
2025-06-18 Booster Gym: An End-to-End Reinforcement Learning Framework for Humanoid Robot Locomotion Mingguo Zhao Team 2506.15132 link
2025-06-17 GMT: General Motion Tracking for Humanoid Whole-Body Control Xiaolong Wang Team 2506.14770 null
2025-06-17 Whole-Body Control Framework for Humanoid Robots with Heavy Limbs: A Model-Based Approach Yun-Hui Liu Team 2506.14278 null
2025-06-15 KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills Xuelong Li Team 2506.12851 null
2025-06-19 From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots Zongqing Lu Team 2506.12779 null
2025-06-15 RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control Zongqing Lu Team 2506.12769 null
2025-06-14 Explosive Output to Enhance Jumping Ability: A Variable Reduction Ratio Design Paradigm for Humanoid Robots Knee Joint Qiang Huang Team 2506.12314 null
2025-06-13 mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity Robert K. Katzschmann Team 2506.11916 null
2025-06-11 Exploring EEG Responses during Observation of Actions Performed by Human Actor and Humanoid Robot Michelle J. Johnson Team 2506.10170 null
2025-06-11 Locomotion on Constrained Footholds via Layered Architectures and Model Predictive Control Aaron D. Ames Team 2506.09979 null
2025-06-11 Attention-Based Map Encoding for Learning Generalized Legged Locomotion Marco Hutter Team 2506.09588 null
2025-06-11 Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations Yanan Sui Team 2506.09383 null
2025-06-11 SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending Yue Wang Team 2506.09366 link
2025-06-10 Fast Estimation of Globally Optimal Independent Contact Regions for Robust Grasping and Manipulation Nancy S. Pollard Team 2506.08856 null
2025-06-12 MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains Xuelong Li Team 2506.08840 null
2025-06-10 Periodic Bipedal Gait Learning Using Reward Composition Based on a Novel Gait Planner for Humanoid Robots Lijun Zhu Team 2506.08416 null
2025-06-05 Realizing Text-Driven Motion Generation on NAO Robot: A Reinforcement Learning-Optimized Control Pipeline Qijun Chen Team 2506.05117 link
2025-06-04 Phase-based Nonlinear Model Predictive Control for Humanoid Walking Stabilization with Single and Double Support Time Adjustments Jaeheung Park Team 2506.03856 null
2025-06-03 AURA: Agentic Upskilling via Reinforced Abstractions Dennis Hong Team 2506.02507 null
2025-06-02 Reinforcement Learning with Data Bootstrapping for Dynamic Subgoal Pursuit in Humanoid Robot Navigation Ayonga Hereid Team 2506.02206 null
2025-06-02 Learning with pyCub: A New Simulation and Exercise Framework for Humanoid Robotics Matej Hoffmann Team 2506.01756 null
2025-06-05 Hierarchical Intention-Aware Expressive Motion Generation for Humanoid Robots Chengxu Zhou Team 2506.01563 null
2025-06-01 Humanoid World Models: Open World Foundation Models for Humanoid Robotics Mohammad Al-Sharman Team 2506.01182 null
2025-06-01 iRonCub 3: The Jet-Powered Flying Humanoid Robot Daniele Pucci Team 2506.01125 null
2025-05-30 Learning Aerodynamics for the Control of Flying Humanoid Robots Daniele Pucci Team 2506.00305 null
2025-05-30 Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives -- A Survey Rania Rayyes Team 2506.00098 null
2025-06-05 SignBot: Learning Human-to-Humanoid Sign Language Interaction Guiliang Liu Team 2505.24266 null
2025-05-30 Humanoid Loco-Manipulations Pattern Generation and Stabilization Control Abderrahmane Kheddar Team 2505.24116 null
2025-05-29 Humanoid Loco-manipulation Planning based on Graph Search and Reachability Maps Abderrahmane Kheddar Team 2505.23505 null
2025-05-29 Centroidal Trajectory Generation and Stabilization based on Preview Control for Humanoid Multi-contact Motion Fumio Kanehiro Team 2505.23499 link
2025-06-01 FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control Pieter Abbeel Team 2505.22642 null
2025-05-27 Learning Unified Force and Position Control for Legged Loco-Manipulation Siyuan Huang Team 2505.20829 null
2025-05-27 Gait-Conditioned Reinforcement Learning with Multi-Phase Curriculum for Humanoid Locomotion Chengxu Zhou Team 2505.20619 null
2025-05-26 Integrating emotional intelligence, memory architecture, and gestures to achieve empathetic humanoid robot interaction in an educational setting Paul Craig Team 2505.19803 null
2025-05-26 Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning Jean-Baptiste Mouret Team 2505.19717 null
2025-05-26 Whole-body Multi-contact Motion Control for Humanoid Robots Based on Distributed Tactile Sensors Eiichi Yoshida Team 2505.19580 link
2025-05-26 Heavy lifting tasks via haptic teleoperation of a wheeled humanoid Joao Ramos Team 2505.19530 null
2025-05-26 SMAP: Self-supervised Motion Adaptation for Physically Plausible Humanoid Whole-body Control Junting Dong Team 2505.19463 null
2025-05-25 Towards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous thought Machines (CTM) and Model Context Protocol (MCP) Libo Wang Team 2505.19339 link
2025-05-25 Staircase Recognition and Location Based on Polarization Vision Zhiying Tan Team 2505.19026 null
2025-05-23 DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation Ruqi Huang Team 2505.18078 null
2025-05-22 Unified Multi-Rate Model Predictive Control for a Jet-Powered Humanoid Robot Daniele Pucci Team 2505.16478 null
2025-05-19 TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion Minh Nhat Vu Team 2505.13549 null
2025-05-19 DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories Linxi Fan Team 2505.12705 null
2025-05-19 Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion Qi Wu Team 2505.12679 null
2025-05-16 Bracing for Impact: Robust Humanoid Push Recovery and Locomotion with Reduced Order Models Aaron D. Ames Team 2505.11495 null
2025-05-16 X2C: A Dataset Featuring Nuanced Facial Expressions for Realistic Humanoid Imitation Xiaohan Yu Team 2505.11146 link
2025-05-15 NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance Jiangmiao Pang Team 2505.08712 null
2025-05-13 Rethink Repeatable Measures of Robot Performance with Statistical Query Dylan Khor Team 2505.08216 null
2025-05-14 Neural Brain: A Neuroscience-inspired Framework for Embodied Agents Lin Wang Team 2505.07634 link
2025-05-12 HuB: Learning Extreme Humanoid Balance Yang Gao Team 2505.07294 null
2025-05-11 Dynamic Safety in Complex Environments: Synthesizing Safety Filters with Poisson's Equation Aaron D. Ames Team 2505.06794 null
2025-05-10 JAEGER: Dual-Level Humanoid Whole-Body Controller Zongqing Lu Team 2505.06584 null
2025-05-09 Let Humanoids Hike! Integrative Skill Development on Complex Trails Stella X. Yu Team 2505.06218 null
2025-05-09 Safe-EF: Error Feedback for Nonsmooth Constrained Optimization Ilyas Fatkhullin Team 2505.06053 null
2025-05-09 Human-Robot Collaboration for the Remote Control of Mobile Humanoid Robots with Torso-Arm Coordination Zhi Li Team 2505.05773 null
2025-05-07 Vision-Language-Action Models: Concepts, Progress, Applications and Challenges Manoj Karkee Team 2505.04769 null
2025-05-06 AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control Xiaolong Wang Team 2505.03738 null
2025-05-13 Visual Imitation Enables Contextual Humanoid Control Angjoo Kanazawa Team 2505.03729 null
2025-05-05 TWIST: Teleoperated Whole-Body Imitation System C. Karen Liu Team 2505.02833 null
2025-04-30 LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning Koushil Sreenath Team 2504.21738 null
2025-04-29 SoccerDiffusion: Toward Learning End-to-End Humanoid Robot Soccer from Gameplay Recordings Jianwei Zhang Team 2504.20808 null
2025-04-27 Personalized Artificial General Intelligence (AGI) via Neuroscience-Inspired Continuous Learning Systems Jairaj Singh Shaktawat Team 2504.20109 null
2025-04-24 Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot Koushil Sreenath Team 2504.17249 null
2025-04-20 ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training Jiahao Chen Team 2504.14477 null
2025-04-19 Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning Xuelong Li Team 2504.14305 null
2025-04-18 Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning Fumio Kanehiro Team 2504.13619 link
2025-04-16 EmoACT: a Framework to Embed Emotions into Artificial Agents Based on Affect Control Theory Carmine Tommaso Recchiuto Team 2504.12125 null
2025-04-14 Teacher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain Zhengtao Zhang Team 2504.10390 null
2025-04-14 PreCi: Pretraining and Continual Improvement of Humanoid Locomotion via Model-Assumption-Based Regularization Sehoon Ha Team 2504.09833 null
2025-04-13 Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation Yi Fang Team 2504.09532 null
2025-04-11 Spectral Normalization for Lipschitz-Constrained Policies on Learning Humanoid Locomotion Jaeheung Park Team 2504.08246 null
2025-04-07 MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond Xun Cao Team 2504.05046 null
2025-04-07 A High-Force Gripper with Embedded Multimodal Sensing for Powerful and Perception Driven Grasping Nikos G. Tsagarakis Team 2504.04970 null
2025-04-06 Public speech recognition transcripts as a configuring parameter Christian Licoppe Team 2504.04488 null
2025-04-02 The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction Matthew K. X. J Pan Team 2504.01260 null
2025-04-01 Extended Hybrid Zero Dynamics for Bipedal Walking of the Knee-less Robot SLIDER Petar Kormushev Team 2504.01165 null
2025-04-11 Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs Masaya Kinoshita Team 2504.00614 null
2025-03-30 Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework Ysobel Sims Team 2503.23601 null
2025-03-28 Control of Humanoid Robots with Parallel Mechanisms using Kinematic Actuation Models Nicolas Mansard Team 2503.22459 null
2025-03-28 FLAM: Foundation Model-Based Body Stabilization for Humanoid Locomotion and Manipulation Debin Zhao Team 2503.22249 null
2025-03-27 OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation Wanting Li Team 2503.21257 null
2025-03-26 Anti Robot Speciesism Miklos Sarvary Team 2503.20842 null
2025-03-25 Can Vision-Language Models Answer Face to Face Questions in the Real-World? Roland Memisevic Team 2503.19356 null
2025-03-19 StyleLoco: Generative Adversarial Distillation for Natural Humanoid Robot Locomotion Siyuan Huang Team 2503.15082 null
2025-03-27 GR00T N1: An Open Foundation Model for Generalist Humanoid Robots Yuke Zhu Team 2503.14734 null
2025-03-24 Humanoid Policy ~ Human Policy Xiaolong Wang Team 2503.13441 null
2025-03-17 Humanoids in Hospitals: A Technical Study of Humanoid Surrogates for Dexterous Medical Interventions Michael Yip Team 2503.12725 null
2025-03-16 Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Zongqing Lu Team 2503.12533 null
2025-03-14 Fast and Robust Localization for Humanoid Soccer Robot via Iterative Landmark Matching Dennis W. Hong Team 2503.11020 null
2025-03-13 NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models Michael Black Team 2503.10626 null
2025-03-13 NuExo: A Wearable Exoskeleton Covering all Upper Limb ROM for Outdoor Data Collection and Teleoperation of Humanoid Robots Huimin Lu Team 2503.10554 null
2025-03-12 Natural Humanoid Robot Locomotion with Generative Motion Prior Rong Xiong Team 2503.09015 null
2025-03-13 HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots Renjing Xu Team 2503.09010 null
2025-03-11 LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures Renjing Xu Team 2503.08349 null

(back to top)

Dexterous

Publish Date Title Authors PDF Code
2025-07-19 A 21-DOF Humanoid Dexterous Hand with Hybrid SMA-Motor Actuation: CYJ Hand-0 Erbao Dong Team 2507.14538 null
2025-07-18 Improving Low-Cost Teleoperation: Augmenting GELLO with Force Kai Arulkumaran Team 2507.13602 null
2025-07-16 The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey Jiming Chen Team 2507.11840 null
2025-07-14 Demonstrating the Octopi-1.5 Visual-Tactile-Language Model Harold Soh Team 2507.09985 null
2025-07-09 Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand Xinjun Sheng Team 2507.06822 null
2025-07-07 A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation Russ Tedrake Team 2507.05331 null
2025-07-06 SimLauncher: Launching Sample-Efficient Real-world Robotic Reinforcement Learning via Simulation Pre-training Hao Dong Team 2507.04452 null
2025-07-03 DexVLG: Dexterous Vision-Language-Grasp Model at Scale He Wang Team 2507.02747 null
2025-07-02 TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types Wei-Shi Zheng Team 2507.01857 null
2025-07-01 HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning Chenjia Bai Team 2507.00833 null
2025-06-26 Lightweight Fingernail Haptic Device: Unobstructed Fingerpad Force and Vibration Feedback for Enhanced Virtual Dexterous Manipulation Shoichi Hasegawa Team 2506.21417 null
2025-06-24 Scaffolding Dexterous Manipulation with Vision-Language Models Dorsa Sadigh Team 2506.19212 null
2025-06-24 The MOTIF Hand: A Robotic Hand for Multimodal Observations with Thermal, Inertial, and Force Sensors Daniel Seita Team 2506.19201 null
2025-06-21 VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models Lin Shao Team 2506.17561 null
2025-06-20 Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation Xiaolong Wang Team 2506.17198 null
2025-06-19 ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation Jitendra Malik Team 2506.15953 null
2025-06-17 Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation Mustafa Mukadam Team 2506.14754 null
2025-06-16 CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding Haoang Li Team 2506.13725 null
2025-06-13 ViTaSCOPE: Visuo-tactile Implicit Representation for In-hand Pose and Extrinsic Contact Estimation Nima Fazeli Team 2506.12239 null
2025-06-13 ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations Maria Bauza Villalonga Team 2506.11775 null
2025-06-30 Adaptive event-triggered robust tracking control of soft robots Marios M. Polycarpou Team 2506.09523 null
2025-06-11 Analyzing Key Objectives in Human-to-Robot Retargeting for Dexterous Manipulation Xiang Li Team 2506.09384 null
2025-06-09 TensorTouch: Calibration of Tactile Sensors for High Resolution Stress Tensor and Deformation for Dexterous Manipulation Monroe Kennedy III Team 2506.08291 null
2025-06-09 RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy Hui Cheng Team 2506.07490 null
2025-06-05 GEX: Democratizing Dexterity with Fully-Actuated Dexterous Hand and Exoskeleton Glove Zelin Deng Team 2506.04982 link
2025-06-06 ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning Jian Tang Team 2506.04941 null
2025-06-03 Reachability Weighted Offline Goal-conditioned Resampling Joni Pajarinen Team 2506.02577 null
2025-05-30 Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives -- A Survey Rania Rayyes Team 2506.00098 null
2025-05-30 DexMachina: Functional Retargeting for Bimanual Dexterous Manipulation Shuran Song Team 2505.24853 null
2025-05-28 ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation Wenqiang Zhang Team 2505.22159 null
2025-05-29 DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation Shuran Song Team 2505.21864 null
2025-05-27 Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt Jianyu Chen Team 2505.20795 null
2025-05-25 MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation Xue Bin Peng Team 2505.19086 null
2025-05-24 Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos Mario Bijelic Team 2505.18899 link
2025-05-24 DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets Dzmitry Tsetserukou Team 2505.18876 null
2025-05-27 GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning Ye Shi Team 2505.18763 null
2025-05-22 TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation Hengdi Zhang Team 2505.16289 null
2025-05-21 Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation Xiaodong He Team 2505.15098 null
2025-05-20 Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation Hao Dong Team 2505.13982 null
2025-05-19 Approximating Global Contact-Implicit MPC via Sampling and Local Complementarity Michael Posa Team 2505.13350 null
2025-05-19 TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation Jiangmiao Pang Team 2505.12748 null
2025-05-18 PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis Zhipong Cai Team 2505.12294 null
2025-05-17 OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning Yang Gao Team 2505.11917 null
2025-05-16 EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video Jian Zhang Team 2505.11709 null
2025-05-16 Self-supervised perception for tactile skin covered dexterous hands Mustafa Mukadam Team 2505.11420 null
2025-05-16 Learning Multimodal AI Algorithms for Amplifying Limited User Input into High-dimensional Control Space Reza Abiri Team 2505.11366 link
2025-05-16 Estimating Deformable-Rigid Contact Interactions for a Deformable Tool via Learning and Model-Based Optimization Nima Fazeli Team 2505.10884 null
2025-05-15 SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning Axel Krieger Team 2505.10251 null
2025-05-13 HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands Yunhui Liu Team 2505.08213 null
2025-05-12 DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies Deepak Pathak Team 2505.07813 null
2025-05-08 Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation Georgia Chalvatzaki Team 2505.05287 null
2025-05-04 Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning Sven Behnke Team 2505.02232 null
2025-05-04 KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation Yang Gao Team 2505.01974 null
2025-05-02 DexFlow: A Unified Approach for Dexterous Hand Pose Retargeting and Interaction Miao Li Team 2505.01083 null
2025-05-02 DexCtrl: Towards Sim-to-Real Dexterity with Adaptive Controller Learning Masayoshi Tomizuka Team 2505.00991 null
2025-04-30 Multi-Goal Dexterous Hand Manipulation using Probabilistic Model-based Reinforcement Learning Yunduan Cui Team 2504.21585 null
2025-04-27 PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies Edward Adelson Team 2504.19341 null
2025-04-23 PP-Tac: Paper Picking Using Tactile Feedback in Dexterous Robotic Hands Ziyuan Jiao Team 2504.16649 null
2025-04-22 $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization Ury Zhilinsky Team 2504.16054 null
2025-04-21 LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning Boyuan Chen Team 2504.15472 null
2025-04-21 SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks Animesh Garg Team 2504.14857 null
2025-04-20 BiDexHand: Design and Evaluation of an Open-Source 16-DoF Biomimetic Dexterous Hand Zhengyang Kris Weng Team 2504.14712 null
2025-04-18 On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting Jan Peters Team 2504.13618 null
2025-04-17 RUKA: Rethinking the Design of Humanoid Hands with Learning Lerrel Pinto Team 2504.13165 null
2025-04-17 Adaptive Task Space Non-Singular Terminal Super-Twisting Sliding Mode Control of a 7-DOF Robotic Manipulator E. Witrant Team 2504.13056 null
2025-04-17 Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation Iman Soltani Team 2504.12967 null
2025-04-22 Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration Jeannette Bohg Team 2504.12609 null
2025-04-14 Look-to-Touch: A Vision-Enhanced Proximity and Tactile Sensor for Distance and Geometry Perception in Robotic Manipulation Guoying Gu Team 2504.10280 null
2025-04-08 Functionally graded keratin facilitates tactile sensing in elephant whiskers Katherine J. Kuchenbecker Team 2504.07143 null
2025-04-08 ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface Rui Chen Team 2504.06156 null
2025-04-06 DexTOG: Learning Task-Oriented Dexterous Grasp with Language Cewu Lu Team 2504.04573 null
2025-04-06 DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Cluttered Environments Lin Shao Team 2504.04516 null
2025-04-05 ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning Robert K. Katzschmann Team 2504.04259 null
2025-04-24 Dexterous Manipulation through Imitation Learning: A Survey Hong Zhang Team 2504.03515 null
2025-03-29 Dexterous Non-Prehensile Manipulation for Ungraspable Object via Extrinsic Dexterity Yuanpei Chen Team 2503.23120 null
2025-03-27 ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning Siyuan Huang Team 2503.21860 null
2025-03-25 G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation Ruizhen Hu Team 2503.19457 null
2025-03-16 Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills Zongqing Lu Team 2503.12533 null
2025-03-14 Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping Haruki Nishimura Team 2503.10966 null
2025-03-12 Sequential Multi-Object Grasping with One Dexterous Hand Daniel Seita Team 2503.09078 null
2025-03-16 DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness Yuexin Ma Team 2503.08257 link
2025-03-13 AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems Jianchao Zhu Team 2503.06669 link
2025-03-08 ReJSHand: Efficient Real-Time Hand Pose Estimation and Mesh Reconstruction Using Refined Joint and Skeleton Features Hong Zhang Team 2503.05995 link
2025-03-07 Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction Bin He Team 2503.05231 null
2025-03-06 Dexterous Hand Manipulation via Efficient Imitation-Bootstrapped Online Reinforcement Learning Xiaodong He Team 2503.04014 null
2025-03-05 LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation Alois Knoll Team 2503.03890 null
2025-03-05 Selective Tweezing and Immobilization of Colloids for Dexterous Manipulation of Biological Materials Kimani C. Toussaint Jr Team 2503.03102 null
2025-03-03 TacCap: A Wearable FBG-Based Tactile Sensor for Seamless Human-to-Robot Skill Transfer Mark R. Cutkosky Team 2503.01789 null
2025-03-03 RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation Jun Ma Team 2503.01616 null
2025-03-03 Exo-ViHa: A Cross-Platform Exoskeleton System with Visual and Haptic Feedback for Efficient Dexterous Skill Learning Wenbo Ding Team 2503.01543 null
2025-03-03 KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands Jeffrey Ichnowski Team 2503.01078 null
2025-02-27 Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids Yuke Zhu Team 2502.20396 null
2025-02-28 ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration Feifei Feng Team 2502.19250 null
2025-02-26 Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand Yuanpei Chen Team 2502.18423 null

(back to top)
