| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-07-23 | ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents | Hesheng Wang Team | 2507.17462 | null |
| 2025-07-23 | Ctx2TrajGen: Traffic Context-Aware Microscale Vehicle Trajectories using Generative Adversarial Imitation Learning | Byeongjoon Noh Team | 2507.17418 | null |
| 2025-07-23 | Confounded Causal Imitation Learning with Instrumental Variables | Zhi Geng Team | 2507.17309 | null |
| 2025-07-23 | Prolonging Tool Life: Learning Skillful Use of General-purpose Tools through Lifespan-guided Reinforcement Learning | Takamitsu Matsubara Team | 2507.17275 | null |
| 2025-07-23 | Towards Human-level Intelligence via Human-like Whole-Body Manipulation | Zhaohui An Team | 2507.17141 | null |
| 2025-07-22 | Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots | Aitor Arrieta Team | 2507.17049 | null |
| 2025-07-19 | Sensor-Space Based Robust Kinematic Control of Redundant Soft Manipulator by Learning | Charlie C. L. Wang Team | 2507.16842 | null |
| 2025-07-22 | ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | Fu-En Yang Team | 2507.16815 | null |
| 2025-07-22 | Equivariant Goal Conditioned Contrastive Reinforcement Learning | Robert Platt Team | 2507.16139 | null |
| 2025-07-21 | Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers | Iman Soltani Team | 2507.15833 | null |
| 2025-07-21 | Strong, Accurate, and Low-Cost Robot Manipulator | Donghyun Kim Team | 2507.15693 | null |
| 2025-07-21 | Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos | Zongqing Lu Team | 2507.15597 | null |
| 2025-07-22 | GR-3 Technical Report | Yichu Yang Team | 2507.15493 | null |
| 2025-07-20 | Learning-Based Modeling of a Magnetically Steerable Soft Suction Device for Endoscopic Endonasal Interventions | Eric Diller Team | 2507.15155 | null |
| 2025-07-20 | Reinforcement Learning for Flow-Matching Policies | Somayeh Sojoudi Team | 2507.15073 | null |
| 2025-07-20 | Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper | Yunzhu Li Team | 2507.15062 | null |
| 2025-07-20 | LLM-Enhanced Multi-Agent Reinforcement Learning with Expert Workflow for Real-Time P2P Energy Trading | Lu Zhang Team | 2507.14995 | null |
| 2025-07-20 | Heterogeneous object manipulation on nonlinear soft surface through linear controller | Andres Faiña Team | 2507.14967 | null |
| 2025-07-20 | KGN-Pro: Keypoint-Based Grasp Prediction through Probabilistic 2D-3D Correspondence Learning | Guangyao Zhai Team | 2507.14820 | null |
| 2025-07-19 | BT-TL-DMPs: A Novel Robot TAMP Framework Combining Behavior Tree, Temporal Logic and Dynamical Movement Primitives | Yongchun Fang Team | 2507.14582 | null |
| 2025-07-18 | Improving Low-Cost Teleoperation: Augmenting GELLO with Force | Kai Arulkumaran Team | 2507.13602 | null |
| 2025-07-17 | The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner | Kai Chen Team | 2507.13332 | null |
| 2025-07-17 | ZipMPC: Compressed Context-Dependent MPC Cost via Imitation Learning | Johannes A. Stork Team | 2507.13088 | null |
| 2025-07-17 | Generalist Bimanual Manipulation via Foundation Video Diffusion Models | Jun Zhu Team | 2507.12898 | null |
| 2025-07-17 | Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) | Jost Tobias Springenberg Team | 2507.12856 | null |
| 2025-07-17 | DEMONSTRATE: Zero-shot Language to Robotic Control via Multi-task Demonstration Learning | Melanie N. Zeilinger Team | 2507.12855 | null |
| 2025-07-17 | Learning to Predict Mobile Robot Stability in Off-Road Environments | Parikshit Maini Team | 2507.12731 | null |
| 2025-07-18 | EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos | Xiaolong Wang Team | 2507.12440 | null |
| 2025-07-16 | The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey | Jiming Chen Team | 2507.11840 | null |
| 2025-07-15 | Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification | Zsolt Kira Team | 2507.11662 | null |
| 2025-07-15 | MPC-based Coarse-to-Fine Motion Planning for Robotic Object Transportation in Cluttered Environments | Steven Liu Team | 2507.11211 | null |
| 2025-07-15 | A Robust Controller based on Gaussian Processes for Robotic Manipulators with Unknown Uncertainty | Ruggero Carli Team | 2507.11170 | null |
| 2025-07-15 | Enhancing Autonomous Manipulator Control with Human-in-loop for Uncertain Assembly Environments | Kazuya Yoshida Team | 2507.11006 | null |
| 2025-07-15 | Object-Centric Mobile Manipulation through SAM2-Guided Perception and Imitation Learning | Jun Morimoto Team | 2507.10899 | null |
| 2025-07-14 | Versatile and Generalizable Manipulation via Goal-Conditioned Reinforcement Learning with Grounded Object Detection | Colin Bellinger Team | 2507.10814 | null |
| 2025-07-14 | rt-RISeg: Real-Time Model-Free Robot Interactive Segmentation for Active Instance-Level Object Understanding | Kaiyu Hang Team | 2507.10776 | null |
| 2025-07-14 | A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Flight Computers | Arko Barman Team | 2507.10775 | null |
| 2025-07-14 | Vision Language Action Models in Robotic Manipulation: A Systematic Review | Irfan Hussain Team | 2507.10672 | null |
| 2025-07-16 | GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning | Dandan Tu Team | 2507.10628 | null |
| 2025-07-14 | MP1: Mean Flow Tames Policy Learning in 1-step for Robotic Manipulation | Mengyuan Liu Team | 2507.10543 | null |
| 2025-07-14 | Prompt Informed Reinforcement Learning for Visual Coverage Path Planning | Venkat Margapuri Team | 2507.10284 | null |
| 2025-07-14 | Should We Ever Prefer Decision Transformer for Offline Reinforcement Learning? | Keith Ross Team | 2507.10174 | null |
| 2025-07-16 | MTF-Grasp: A Multi-tier Federated Learning Approach for Robotic Grasping | Monowar Bhuyan Team | 2507.10158 | null |
| 2025-07-13 | Learning to Control Dynamical Agents via Spiking Neural Networks and Metropolis-Hastings Sampling | Ali Al-Zawqari Team | 2507.09540 | null |
| 2025-07-13 | Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles | Keqiang Li Team | 2507.09537 | null |
| 2025-07-13 | SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation | Boyu Wang Team | 2507.09459 | null |
| 2025-07-12 | DAA*: Deep Angular A Star for Image-based Path Planning | Zhiwei Xu Team | 2507.09305 | null |
| 2025-07-15 | Learning and Transferring Better with Depth Information in Visual Reinforcement Learning | Jingdong Zhao Team | 2507.09180 | null |
| 2025-07-12 | PRAG: Procedural Action Generator | Karla Stepanova Team | 2507.09167 | null |
| 2025-07-12 | Towards Human-level Dexterity via Robot Learning | Gagan Khandate Team | 2507.09117 | null |
| 2025-07-11 | Imitation Learning in Continuous Action Spaces: Mitigating Compounding Error without Interaction | Max Simchowitz Team | 2507.09061 | null |
| 2025-07-11 | Behavioral Exploration: Learning to Explore via In-Context Adaptation | Sergey Levine Team | 2507.09041 | null |
| 2025-07-11 | Learning human-to-robot handovers through 3D scene reconstruction | Changjae Oh Team | 2507.08726 | null |
| 2025-07-11 | Learning Robust Motion Skills via Critical Adversarial Attacks for Humanoid Robots | Yue Gao Team | 2507.08303 | null |
| 2025-07-11 | CL3R: 3D Reconstruction and Contrastive Learning for Enhanced Robotic Manipulation Representations | He Wang Team | 2507.08262 | null |
| 2025-07-10 | Imitation Learning for Obstacle Avoidance Using End-to-End CNN-Based Sensor Fusion | Raafat E. Shalaby Team | 2507.08112 | null |
| 2025-07-15 | EXPO: Stable Reinforcement Learning with Expressive Policies | Chelsea Finn Team | 2507.07986 | null |
| 2025-07-15 | Reinforcement Learning with Action Chunking | Sergey Levine Team | 2507.07969 | null |
| 2025-07-09 | Self-Wearing Adaptive Garments via Soft Robotic Unfurling | Allison M. Okamura Team | 2507.07221 | null |
| 2025-07-09 | Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand | Xinjun Sheng Team | 2507.06822 | null |
| 2025-07-09 | Learning safe, constrained policies via imitation learning: Connection to Probabilistic Inference and a Naive Algorithm | George A. Vouros Team | 2507.06780 | null |
| 2025-07-13 | Spatial-Temporal Aware Visuomotor Diffusion Policy Learning | Yanwei Fu Team | 2507.06710 | null |
| 2025-07-09 | Value from Observations: Towards Large-Scale Imitation Learning via Self-Improvement | Martin Riedmiller Team | 2507.06701 | null |
| 2025-07-09 | Goal-Oriented Skill Abstraction for Offline Multi-Task Reinforcement Learning | Jian Cheng Team | 2507.06628 | null |
| 2025-07-09 | Q-STAC: Q-Guided Stein Variational Model Predictive Actor-Critic | Fabio Ramos Team | 2507.06625 | null |
| 2025-07-09 | Token Bottleneck: One Token to Remember Dynamics | Sangdoo Yun Team | 2507.06543 | null |
| 2025-07-08 | Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction | Alessio Del Bue Team | 2507.06404 | null |
| 2025-07-08 | EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow | Liang Wang Team | 2507.06224 | null |
| 2025-07-08 | Is Diversity All You Need for Scalable Robotic Manipulation? | Hongyang Li Team | 2507.06219 | null |
| 2025-07-08 | Fast Bilateral Teleoperation and Imitation Learning Using Sensorless Force Control via Accurate Dynamics Model | Toshiaki Tsuji Team | 2507.06174 | null |
| 2025-07-08 | Learning Agile Tensile Perching for Aerial Robots from Demonstrations | Basaran Bahadir Kocer Team | 2507.06172 | null |
| 2025-07-08 | SCCRUB: Surface Cleaning Compliant Robot Utilizing Bristles | Jeffrey Ian Lipton Team | 2507.06053 | null |
| 2025-07-08 | LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving | Jian Sun Team | 2507.05754 | null |
| 2025-07-08 | Hybrid Diffusion Policies with Projective Geometric Algebra for Efficient Robot Manipulation Learning | Daniel Rakita Team | 2507.05695 | null |
| 2025-07-08 | Integrating Diffusion-based Multi-task Learning with Online Reinforcement Learning for Robust Quadruped Robot Control | Bin Liang Team | 2507.05674 | null |
| 2025-07-08 | Stable Tracking-in-the-Loop Control of Cable-Driven Surgical Manipulators under Erroneous Kinematic Chains | Michael C. Yip Team | 2507.05663 | null |
| 2025-07-08 | DreamGrasp: Zero-Shot 3D Multi-Object Reconstruction from Partial-View Images for Robotic Manipulation | Frank Chongwoo Park Team | 2507.05627 | null |
| 2025-07-07 | Gaussian Process-Based Active Exploration Strategies in Vision and Touch | Nadia Figueroa Team | 2507.05522 | null |
| 2025-07-07 | A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation | Russ Tedrake Team | 2507.05331 | null |
| 2025-07-07 | VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting | Yanzhi Wang Team | 2507.05116 | null |
| 2025-07-07 | When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning | Sebastien Ourselin Team | 2507.05011 | null |
| 2025-07-07 | Training-free Generation of Temporally Consistent Rewards from VLMs | Jian Tang Team | 2507.04789 | null |
| 2025-07-07 | DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics | Mingsheng Shang Team | 2507.04661 | null |
| 2025-07-07 | PRISM: Pointcloud Reintegrated Inference via Segmentation and Cross-attention for Manipulation | Chee-Meng Chew Team | 2507.04633 | null |
| 2025-07-07 | Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts | Junjie Hu Team | 2507.04631 | null |
| 2025-07-06 | VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation | Lei Han Team | 2507.04524 | null |
| 2025-07-06 | DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | Xin Jin Team | 2507.04447 | null |
| 2025-07-06 | Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks | Yi Fang Team | 2507.04331 | null |
| 2025-07-05 | Are Learning-Based Approaches Ready for Real-World Indoor Navigation? A Case for Imitation Learning | Sebastian Houben Team | 2507.04086 | null |
| 2025-07-05 | Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation | Yadan Luo Team | 2507.04049 | null |
| 2025-07-08 | RwoR: Generating Robot Demonstrations from Human Hand Collection for Policy Learning without Robot | Hao Dong Team | 2507.03930 | null |
| 2025-07-05 | DK-RRT: Deep Koopman RRT for Collision-Aware Motion Planning of Space Manipulators in Dynamic Debris Environments | Dezhi Yu Team | 2507.03878 | null |
| 2025-07-04 | Dexterous Teleoperation of 20-DoF ByteDexter Hand via Human Motion Retargeting | Zeyu Ren Team | 2507.03227 | null |
| 2025-07-02 | cVLA: Towards Efficient Camera-Space VLAs | Thomas Brox Team | 2507.02190 | null |
| 2025-07-02 | Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN | Matthias Kerzel Team | 2507.02171 | null |
| 2025-07-02 | TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types | Wei-Shi Zheng Team | 2507.01857 | null |
| 2025-07-02 | S3D: A Spatial Steerable Surgical Drilling Framework for Robotic Spinal Fixation Procedures | Farshid Alambeigi Team | 2507.01779 | null |
| 2025-07-03 | TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control | Yanwei Fu Team | 2507.01424 | null |
| 2025-07-01 | Search-Based Robot Motion Planning With Distance-Based Adaptive Motion Primitives | Bakir Lacevic Team | 2507.01198 | null |
| 2025-07-01 | Imitation Learning for Satellite Attitude Control under Unknown Perturbations | Xiaoli Bai Team | 2507.01161 | null |
| 2025-07-01 | SonoGym: High Performance Simulation for Challenging Surgical Tasks with Robotic Ultrasound | Philipp Fürnstahl Team | 2507.01152 | null |
| 2025-07-01 | Geometry-aware 4D Video Generation for Robot Manipulation | Shuran Song Team | 2507.01099 | null |
| 2025-07-01 | DexWrist: A Robotic Wrist for Constrained and Dynamic Manipulation | Pulkit Agrawal Team | 2507.01008 | null |
| 2025-07-04 | Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations | Yunzhu Li Team | 2507.00990 | null |
| 2025-07-01 | HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning | Chenjia Bai Team | 2507.00833 | null |
| 2025-07-01 | Learning Steerable Imitation Controllers from Unstructured Animal Motions | Stelian Coros Team | 2507.00677 | null |
| 2025-07-01 | RoboEval: Where Robotic Manipulation Meets Structured and Scalable Evaluation | Siddhartha Srinivasa Team | 2507.00435 | null |
| 2025-07-01 | Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning | Yang Gao Team | 2506.23944 | null |
| 2025-06-30 | World4Omni: A Zero-Shot Framework from Image Generation World Model to Robotic Manipulation | Lin Shao Team | 2506.23919 | null |
| 2025-06-30 | Advancing Learnable Multi-Agent Pathfinding Solvers with Active Fine-Tuning | Alexey Skrynnik Team | 2506.23793 | null |
| 2025-06-30 | PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? | Ransalu Senanayake Team | 2506.23725 | null |
| 2025-07-04 | ParticleFormer: A 3D Point Cloud World Model for Multi-Object, Multi-Material Robotic Manipulation | Mac Schwager Team | 2506.23126 | null |
| 2025-06-29 | Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots | Yue Gao Team | 2506.23125 | null |
| 2025-06-28 | Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation | Navid Azizan Team | 2506.22827 | null |
| 2025-06-28 | SPI-BoTER: Error Compensation for Industrial Robots via Sparse Attention Masking and Hybrid Loss with Spatial-Physical Information | Yuqiang Wu Team | 2506.22788 | null |
| 2025-06-28 | Learning Efficient Robotic Garment Manipulation with Standardization | Bin He Team | 2506.22769 | null |
| 2025-06-28 | RoboPearls: Editable Video Simulation for Robot Manipulation | Xiaodan Liang Team | 2506.22756 | null |
| 2025-06-27 | Spherical Pendulum with Quad-Rotor Thrust Vectoring Actuation -- A Novel Mechatronics and Control Benchmark Platform | Tsu-Chin Tsao Team | 2506.22410 | null |
| 2025-06-27 | RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation | Abhinav Valada Team | 2506.22007 | null |
| 2025-06-26 | Experimental investigation of pose informed reinforcement learning for skid-steered visual navigation | Venkat Krovi Team | 2506.21732 | null |
| 2025-06-24 | Ark: An Open-source Python-based Framework for Robot Learning | Haitham Bou-Ammar Team | 2506.21628 | null |
| 2025-06-24 | FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models | Huiping Zhuang Team | 2506.21627 | null |
| 2025-06-26 | ACTLLM: Action Consistency Tuned Large Language Model | Chenliang Xu Team | 2506.21250 | null |
| 2025-07-02 | World-aware Planning Narratives Enhance Large Vision-Language Model Planner | Xipeng Qiu Team | 2506.21230 | null |
| 2025-06-26 | UAIbot: Beginner-friendly web-based simulator for interactive robotics learning and research | Vinicius Mariano Gonçalves Team | 2506.21178 | null |
| 2025-06-26 | Knowledge-Driven Imitation Learning: Enabling Generalization Across Diverse Conditions | Cewu Lu Team | 2506.21057 | null |
| 2025-06-26 | Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends | Zeng-Guang Hou Team | 2506.20966 | null |
| 2025-06-25 | Learning-Based Distance Estimation for 360° Single-Sensor Setups | Andreas Zell Team | 2506.20586 | null |
| 2025-06-25 | Learn to Position -- A Novel Meta Method for Robotic Positioning | Xiaoming Tao Team | 2506.20445 | null |
| 2025-06-25 | Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration | Quanquan Gu Team | 2506.20307 | null |
| 2025-06-24 | Unified Vision-Language-Action Model | Zhaoxiang Zhang Team | 2506.19850 | null |
| 2025-06-24 | T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models | Qingyao Wu Team | 2506.19498 | null |
| 2025-06-24 | Is an object-centric representation beneficial for robotic manipulation? | Liming Chen Team | 2506.19408 | null |
| 2025-06-24 | Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference | Nutan Chen Team | 2506.19303 | null |
| 2025-06-25 | AnchorDP3: 3D Affordance Guided Sparse Diffusion Policy for Robotic Manipulation | Hui Shen Team | 2506.19269 | null |
| 2025-06-24 | Robust Behavior Cloning Via Global Lipschitz Regularization | Sean B. Andersson Team | 2506.19250 | null |
| 2025-06-23 | CUPID: Curating Data your Robot Loves with Influence Functions | Jeannette Bohg Team | 2506.19121 | null |
| 2025-06-23 | Multimodal Anomaly Detection with a Mixture-of-Experts | Dongheui Lee Team | 2506.19077 | null |
| 2025-06-25 | FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation | Lillian Chin Team | 2506.18960 | null |
| 2025-06-23 | RAG-6DPose: Retrieval-Augmented 6D Pose Estimation via Leveraging CAD as Knowledge Base | Xiangyang Xue Team | 2506.18856 | null |
| 2025-06-23 | SViP: Sequencing Bimanual Visuomotor Policies with Object-Centric Motion Primitives | Jia Pan Team | 2506.18825 | null |
| 2025-06-23 | Learning Point Correspondences In Radar 3D Point Clouds For Radar-Inertial Odometry | Jan Steinbrener Team | 2506.18580 | null |
| 2025-06-23 | Robots and Children that Learn Together: Improving Knowledge Retention by Teaching Peer-Like Interactive Robots | Alessandro Di Nuovo Team | 2506.18365 | null |
| 2025-06-23 | Robotic Manipulation of a Rotating Chain with Bottom End Fixed | Quang-Cuong Pham Team | 2506.18355 | null |
| 2025-06-23 | Sharpening the Spear: Adaptive Expert-Guided Adversarial Attack Against DRL-based Autonomous Driving Policies | Xiaolin Chang Team | 2506.18304 | null |
| 2025-06-23 | Learning Approach to Efficient Vision-based Active Tracking of a Flying Target by an Unmanned Aerial Vehicle | Souma Chowdhury Team | 2506.18264 | null |
| 2025-06-22 | RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation | Yao Mu Team | 2506.18088 | null |
| 2025-06-21 | RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models | Xiao Li Team | 2506.17639 | null |
| 2025-06-21 | Imitation Learning for Active Neck Motion Enabling Robot Manipulation beyond the Field of View | Yasuo Kuniyoshi Team | 2506.17624 | null |
| 2025-06-20 | Kinematic Model Optimization via Differentiable Contact Manifold for In-Space Manipulation | Satyandra K. Gupta Team | 2506.17458 | null |
| 2025-06-20 | Monocular One-Shot Metric-Depth Alignment for RGB-Based Robot Grasping | Jingjin Yu Team | 2506.17110 | null |
| 2025-06-24 | Learning Accurate Whole-body Throwing with High-frequency Residual Policy and Pullback Tube Acceleration | Marco Hutter Team | 2506.16986 | null |
| 2025-06-20 | Compliant Residual DAgger: Improving Real-World Contact-Rich Manipulation with Human Corrections | Shuran Song Team | 2506.16685 | null |
| 2025-06-19 | CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity | Yunzhu Li Team | 2506.16652 | null |
| 2025-06-19 | Reimagination with Test-time Observation Interventions: Distractor-Robust World Model Predictions for Visual Model Predictive Control | Ran Tian Team | 2506.16565 | null |
| 2025-06-19 | An Optimization-Augmented Control Framework for Single and Coordinated Multi-Arm Robotic Manipulation | Ozgur S. Oguz Team | 2506.16555 | null |
| 2025-06-19 | Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining | Ding Zhao Team | 2506.16475 | null |
| 2025-06-19 | GoalLadder: Incremental Goal Discovery with Vision-Language Models | Shimon Whiteson Team | 2506.16396 | null |
| 2025-06-19 | CapsDT: Diffusion-Transformer for Capsule Robot Manipulation | Hongliang Ren Team | 2506.16263 | null |
| 2025-06-19 | ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models | Siyuan Huang Team | 2506.16211 | null |
| 2025-06-19 | FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation | Wei Tang Team | 2506.16201 | null |
| 2025-06-19 | ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation | Jitendra Malik Team | 2506.15953 | null |
| 2025-06-18 | Learning from Planned Data to Improve Robotic Pick-and-Place Planning Efficiency | Kensuke Harada Team | 2506.15920 | null |
| 2025-06-18 | Improving Robotic Manipulation: Techniques for Object Pose Estimation, Accommodating Positional Uncertainty, and Disassembly Tasks from Examples | Viral Rasik Galaiya Team | 2506.15865 | null |
| 2025-06-18 | Vision in Action: Learning Active Perception from Human Demonstrations | Shuran Song Team | 2506.15666 | null |
| 2025-06-18 | Learning Task-Agnostic Skill Bases to Uncover Motor Primitives in Animal Behaviors | Anqi Wu Team | 2506.15190 | null |
| 2025-06-18 | Robust Instant Policy: Leveraging Student's t-Regression Model for Robust In-context Imitation Learning of Robot Manipulation | Yukiyasu Domae Team | 2506.15157 | null |
| 2025-06-18 | TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality | Eiichi Yoshida Team | 2506.15146 | null |
| 2025-06-17 | RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills | Chuang Gan Team | 2506.14763 | null |
| 2025-06-17 | Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation | Mustafa Mukadam Team | 2506.14754 | null |
| 2025-06-17 | SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning | Shuo Wang Team | 2506.14648 | null |
| 2025-06-17 | Latent Action Diffusion for Cross-Embodiment Manipulation | Robert K. Katzschmann Team | 2506.14608 | null |
| 2025-06-19 | ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes | Hao Dong Team | 2506.14317 | null |
| 2025-06-17 | Steering Robots with Inference-Time Interactions | Yanwei Wang Team | 2506.14287 | null |
| 2025-06-17 | AMPLIFY: Actionless Motion Priors for Robot Learning from Videos | Animesh Garg Team | 2506.14198 | null |
| 2025-06-17 | Non-Overlap-Aware Egocentric Pose Estimation for Collaborative Perception in Connected Autonomy | Peng Gao Team | 2506.14180 | null |
| 2025-06-17 | GAF: Gaussian Action Field as a Dynamic World Model for Robotic Manipulation | Yebin Liu Team | 2506.14135 | null |
| 2025-06-16 | ATK: Automatic Task-driven Keypoint Selection for Robust Policy Learning | Abhishek Gupta Team | 2506.13867 | null |
| 2025-06-16 | Touch begins where vision ends: Generalizable policies for contact-rich manipulation | Raunaq Bhirangi Team | 2506.13762 | null |
| 2025-06-16 | Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins | Wei-Chiu Ma Team | 2506.13761 | null |
| 2025-06-16 | What Matters in Learning from Large-Scale Datasets for Robot Manipulation | Danfei Xu Team | 2506.13536 | null |
| 2025-06-16 | A Survey on Imitation Learning for Contact-Rich Tasks in Robotics | Arash Ajoudani Team | 2506.13498 | null |
| 2025-06-16 | Learning Swing-up Maneuvers for a Suspended Aerial Manipulation Platform in a Hierarchical Control Framework | Christian Ott Team | 2506.13478 | null |
| 2025-06-16 | VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation | Wei Pan Team | 2506.13428 | null |
| 2025-06-15 | SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration | Wenwu Zhu Team | 2506.12723 | null |
| 2025-06-15 | Adapting by Analogy: OOD Generalization of Visuomotor Policies via Functional Correspondence | Andrea Bajcsy Team | 2506.12678 | null |
| 2025-06-15 | Goal-based Self-Adaptive Generative Adversarial Imitation Learning (Goal-SAGAIL) for Multi-goal Robotic Manipulation Tasks | George Vogiatzis Team | 2506.12676 | null |
| 2025-06-14 | AntiGrounding: Lifting Robotic Actions into VLM Representation Space for Decision Making | Qingyao Wu Team | 2506.12374 | null |
| 2025-06-13 | Role of Uncertainty in Model Development and Control Design for a Manufacturing Process | Francis Assadian Team | 2506.12273 | null |
| 2025-06-13 | SAIL: Faster-than-Demonstration Execution of Imitation Learning Policies | Danfei Xu Team | 2506.11948 | null |
| 2025-06-13 | mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity | Robert K. Katzschmann Team | 2506.11916 | null |
| 2025-06-13 | ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations | Maria Bauza Villalonga Team | 2506.11775 | null |
| 2025-06-13 | Control Architecture and Design for a Multi-robotic Visual Servoing System in Automated Manufacturing Environment | Rongfei Li Team | 2506.11387 | null |
| 2025-06-12 | Influence Functions for Data Attribution in Linear System Identification and LQR Control | Dongmei Chen Team | 2506.11293 | null |
| 2025-06-12 | Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation | Cordelia Schmid Team | 2506.11261 | null |
| 2025-06-12 | Eye, Robot: Learning to Look to Act with a BC-RL Perception-Action Loop | Angjoo Kanazawa Team | 2506.10968 | null |
| 2025-06-12 | GENMANIP: LLM-driven Simulation for Generalizable Instruction-Following Manipulation | Jiangmiao Pang Team | 2506.10966 | null |
| 2025-06-12 | Human-Robot Navigation using Event-based Cameras and Reinforcement Learning | Rodrigo Verschae Team | 2506.10790 | null |
| 2025-06-12 | Demonstrating Multi-Suction Item Picking at Scale via Multi-Modal Learning of Pick Success | Kapil Katyal Team | 2506.10359 | null |
| 2025-06-11 | Innovative Adaptive Imaged Based Visual Servoing Control of 6 DoFs Industrial Robot Manipulators | Francis Assadian Team | 2506.10240 | null |
| 2025-06-11 | One For All: LLM-based Heterogeneous Mission Planning in Precision Agriculture | Stefano Carpin Team | 2506.10106 | null |
| 2025-06-11 | eFlesh: Highly customizable Magnetic Touch Sensing using Cut-Cell Microstructures | Raunaq Bhirangi Team | 2506.09994 | null |
| 2025-06-11 | Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation | Xiao Ma Team | 2506.09990 | null |
| 2025-06-11 | From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models | Chen Feng Team | 2506.09930 | null |
| 2025-06-11 | Reinforced Refinement with Self-Aware Expansion for End-to-End Autonomous Driving | Chen Lv Team | 2506.09800 | null |
| 2025-06-11 | CHIP: A multi-sensor dataset for 6D pose estimation of chairs in industrial settings | Davide Boscaini Team | 2506.09699 | null |
| 2025-06-11 | Advances on Affordable Hardware Platforms for Human Demonstration Acquisition in Agricultural Applications | Néstor García Team | 2506.09494 | null |
| 2025-06-11 | DCIRNet: Depth Completion with Iterative Refinement for Dexterous Grasping of Transparent and Reflective Objects | Hong Liu Team | 2506.09491 | null |
| 2025-06-11 | Time-Unified Diffusion Policy with Action Discrimination for Robotic Manipulation | Le Wang Team | 2506.09422 | null |
| 2025-06-11 | Analyzing Key Objectives in Human-to-Robot Retargeting for Dexterous Manipulation | Xiang Li Team | 2506.09384 | null |
| 2025-06-11 | ContextBuddy: AI-Enhanced Contextual Insights for Security Alert Investigation (Applied to Intrusion Detection) | Cecile Paris Team | 2506.09365 | null |
| 2025-06-10 | UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation | Li Fei-Fei Team | 2506.09284 | null |
| 2025-06-10 | Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism | Bolei Zhou Team | 2506.09176 | null |
| 2025-06-10 | FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency | Jian Tang Team | 2506.08822 | null |
| 2025-06-10 | Towards Biosignals-Free Autonomous Prosthetic Hand Control via Imitation Learning | Xianta Jiang Team | 2506.08795 | null |
| 2025-06-10 | Bayesian Inverse Physics for Neuro-Symbolic Robot Learning | Frank Kirchner Team | 2506.08756 | null |
| 2025-06-10 | Deep Reinforcement Learning-Based Motion Planning and PDE Control for Flexible Manipulators | Jouni Mattila Team | 2506.08639 | null |
| 2025-06-10 | RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping | Gitta Kutyniok Team | 2506.08632 | null |
| 2025-06-10 | Periodic Bipedal Gait Learning Using Reward Composition Based on a Novel Gait Planner for Humanoid Robots | Lijun Zhu Team | 2506.08416 | null |
| 2025-06-11 | HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation | Cong Wang Team | 2506.08296 | null |
| 2025-06-09 | ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving | Xinggang Wang Team | 2506.08052 | null |
| 2025-06-09 | BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | Tieniu Tan Team | 2506.07961 | null |
| 2025-06-09 | BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation | Xilin Chen Team | 2506.07530 | null |
| 2025-06-09 | Reinforcement Learning via Implicit Imitation Guidance | Chelsea Finn Team | 2506.07505 | null |
| 2025-06-09 | RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy | Hui Cheng Team | 2506.07490 | null |
| 2025-06-08 | CARoL: Context-aware Adaptation for Robot Learning | Xuan Wang Team | 2506.07006 | null |
| 2025-06-07 | SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game | Shanghang Zhang Team | 2506.06690 | null |
| 2025-06-07 | RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation | Si Liu Team | 2506.06677 | null |
| 2025-06-07 | Self-Adapting Improvement Loops for Robotic Learning | Chen Sun Team | 2506.06658 | null |
| 2025-06-06 | Enhancing Robot Safety via MLLM-Based Semantic Interpretation of Failure Data | Somil Bansal Team | 2506.06570 | null |
| 2025-06-06 | NeSyPack: A Neuro-Symbolic Framework for Bimanual Logistics Packing | Changliu Liu Team | 2506.06567 | null |
| 2025-06-06 | MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping | Farshad Khorrami Team | 2506.06535 | null |
| 2025-06-06 | 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model | Mingkui Tan Team | 2506.06199 | null |
| 2025-06-06 | Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization | Tingnan Zhang Team | 2506.06196 | null |
| 2025-06-10 | BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning | Rudolf Lioutikov Team | 2506.06072 | null |
| 2025-06-06 | Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning | Ping Luo Team | 2506.05985 | null |
| 2025-06-06 | Optimal Robotic Velcro Peeling with Force Feedback | Volkan Isler Team | 2506.05812 | null |
| 2025-06-06 | Where Do We Look When We Teach? Analyzing Human Gaze Behavior Across Demonstration Devices in Robot Imitation Learning | Hiroshi Bito Team | 2506.05808 | null |
| 2025-06-06 | FlowOE: Imitation Learning with Flow Policy from Ensemble RL Experts for Optimal Execution under Heston Volatility and Concave Market Impacts | Zhi Chen Team | 2506.05755 | null |
| 2025-06-06 | You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping | Xiangyang Xue Team | 2506.05719 | null |
| 2025-06-05 | A Smooth Sea Never Made a Skilled | Gokul Swamy Team | 2506.05294 | null |
| 2025-06-05 | LiPo: A Lightweight Post-optimization Framework for Smoothing Action Chunks Generated by Learned Policies | Suhan Park Team | 2506.05165 | null |
| 2025-06-05 | DemoSpeedup: Accelerating Visuomotor Policies via Entropy-Guided Demonstration Acceleration | Huazhe Xu Team | 2506.05064 | null |
| 2025-06-06 | ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning | Jian Tang Team | 2506.04941 | null |
| 2025-06-05 | Learning dissection trajectories from expert surgical videos via imitation learning with equivariant diffusion | Qi Dou Team | 2506.04716 | null |
| 2025-06-05 | Advancing Tool-Augmented Large Language Models via Meta-Verification and Reflection Learning | Wanxiang Che Team | 2506.04625 | null |
| 2025-06-04 | SGN-CIRL: Scene Graph-based Navigation with Curriculum, Imitation, and Reinforcement Learning | Aleksandr Panov Team | 2506.04505 | null |
| 2025-06-04 | Object-centric 3D Motion Field for Robot Learning from Human Videos | Pieter Abbeel Team | 2506.04227 | null |
| 2025-06-04 | Splatting Physical Scenes: End-to-End Real-to-Sim from Imperfect Robot Data | Leonard Hasenclever Team | 2506.04120 | null |
| 2025-06-04 | STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization | Liqiang Nie Team | 2506.03863 | link |
| 2025-06-04 | SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models | Jian Tang Team | 2506.03574 | null |
| 2025-06-05 | Confidence-Guided Human-AI Collaboration: Reinforcement Learning with Distributional Proxy Value Propagation for Autonomous Driving | Hu Chuan Team | 2506.03568 | link |
| 2025-06-03 | ORV: 4D Occupancy-centric Robot Video Generation | Hao Zhao Team | 2506.03079 | null |
| 2025-06-03 | Geometric Visual Servo Via Optimal Transport | Ashutosh Tiwari Team | 2506.02768 | null |
| 2025-06-03 | Rodrigues Network for Learning Robot Actions | Leonidas Guibas Team | 2506.02618 | null |
| 2025-06-03 | Reachability Weighted Offline Goal-conditioned Resampling | Joni Pajarinen Team | 2506.02577 | null |
| 2025-06-02 | Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning | Pheng-Ann Heng Team | 2506.01953 | null |
| 2025-06-02 | Feel the Force: Contact-Driven Learning from Humans | Lerrel Pinto Team | 2506.01944 | null |
| 2025-06-02 | Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control | Dahua Lin Team | 2506.01943 | null |
| 2025-06-02 | FreeTacMan: Robot-free Visuo-Tactile Data Collection System for Contact-rich Manipulation | Hongyang Li Team | 2506.01941 | null |
| 2025-06-02 | Learning with pyCub: A New Simulation and Exercise Framework for Humanoid Robotics | Matej Hoffmann Team | 2506.01756 | null |
| 2025-06-02 | Reasoning-Table: Exploring Reinforcement Learning for Table Reasoning | Kang Liu Team | 2506.01710 | link |
| 2025-06-02 | WoMAP: World Models For Embodied Open-Vocabulary Object Localization | Anirudha Majumdar Team | 2506.01600 | null |
| 2025-06-02 | FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens | Yuexin Ma Team | 2506.01583 | null |
| 2025-06-02 | Trajectory First: A Curriculum for Discovering Diverse Policies | Marc Toussaint Team | 2506.01568 | null |
| 2025-06-02 | Variational Adaptive Noise and Dropout towards Stable Recurrent Neural Networks | Shingo Murata Team | 2506.01350 | null |
| 2025-06-01 | OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation | Valts Blukis Team | 2506.01196 | null |
| 2025-06-01 | HoMeR: Learning In-the-Wild Mobile Manipulation via Hybrid Imitation and Whole-Body Control | Jeannette Bohg Team | 2506.01185 | null |
| 2025-06-01 | Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning | Jing Li Team | 2506.00782 | null |
| 2025-05-31 | XYZ-IBD: High-precision Bin-picking Dataset for Object 6D Pose Estimation Capturing Real-world Industrial Complexity | Benjamin Busam Team | 2506.00599 | null |
| 2025-05-31 | Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents | Zhou Yu Team | 2506.00320 | null |
| 2025-05-30 | 3D Gaussian Splat Vulnerabilities | Polo Chau Team | 2506.00280 | null |
| 2025-05-30 | Bi-Manual Joint Camera Calibration and Scene Representation | Weiming Zhi Team | 2505.24819 | null |
| 2025-05-30 | MagicGripper: A Multimodal Sensor-Integrated Gripper for Contact-Rich Robotic Manipulation | Dandan Zhang Team | 2505.24382 | null |
| 2025-05-30 | Imitation Learning-Based Path Generation for the Complex Assembly of Deformable Objects | Christoffer Sloth Team | 2505.24339 | null |
| 2025-05-30 | SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping | Hao Dong Team | 2505.24305 | null |
| 2025-05-30 | Safety-Aware Robust Model Predictive Control for Robotic Arms in Dynamic Environments | Suwoong Lee Team | 2505.24209 | null |
| 2025-05-30 | Learning Gentle Humanoid Locomotion and End-Effector Stabilization Control | Guanya Shi Team | 2505.24198 | null |
| 2025-05-29 | Mobi- | Jeannette Bohg Team | 2505.23692 | null |
| 2025-05-30 | Normalizing Flows are Capable Models for RL | Benjamin Eysenbach Team | 2505.23527 | null |
| 2025-05-29 | Optimization-based Posture Generation for Whole-body Contact Motion by Contact Point Search on the Body Surface | Masayuki Inaba Team | 2505.23501 | null |
| 2025-05-29 | Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents | Lichao Sun Team | 2505.23450 | null |
| 2025-05-29 | Enhanced DACER Algorithm with High Diffusion Efficiency | Shengbo Eben Li Team | 2505.23426 | null |
| 2025-05-29 | RoboTransfer: Geometry-Consistent Video Diffusion for Robotic Visual Policy Transfer | Zhizhong Su Team | 2505.23171 | null |
| 2025-05-28 | SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning | Yuke Zhu Team | 2505.22626 | null |
| 2025-05-28 | Hybrid Learning for Cold-Start-Aware Microservice Scheduling in Dynamic Edge Environments | Weijia Jia Team | 2505.22424 | link |
| 2025-05-28 | Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning | Marian Verhelst Team | 2505.22404 | null |
| 2025-05-28 | State and Input Constrained Adaptive Tracking Control of Uncertain Euler-Lagrange Systems with Robustness and Feasibility Analysis | Shubhendu Bhasin Team | 2505.22352 | null |
| 2025-05-28 | ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation | Wenqiang Zhang Team | 2505.22159 | null |
| 2025-05-28 | Learning Compositional Behaviors from Demonstration and Language | Jiajun Wu Team | 2505.21981 | null |
| 2025-05-29 | ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Yi Xu Team | 2505.21906 | null |
| 2025-05-28 | Streaming Flow Policy: Simplifying diffusion | Siddharth Ancha Team | 2505.21851 | null |
| 2025-05-27 | PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation | Tianmin Shu Team | 2505.21652 | null |
| 2025-05-30 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | Bryan A. Plummer Team | 2505.21649 | null |
| 2025-05-27 | CLAMP: Crowdsourcing a LArge-scale in-the-wild haptic dataset with an open-source device for Multimodal robot Perception | Tapomayukh Bhattacharjee Team | 2505.21495 | null |
| 2025-05-27 | EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation | Robert Platt Team | 2505.21351 | null |
| 2025-05-27 | EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild | Gonzalo Ferrer Team | 2505.21282 | null |
| 2025-05-27 | Learning What to Do and What Not To Do: Offline Imitation from Expert and Undesirable Demonstrations | Tanvi Verma Team | 2505.21182 | null |
| 2025-05-27 | Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning | George Retsinas Team | 2505.20962 | null |
| 2025-05-27 | Learning Unified Force and Position Control for Legged Loco-Manipulation | Siyuan Huang Team | 2505.20829 | null |
| 2025-05-27 | Spatial RoboGrasp: Generalized Robotic Grasping Control Policy | Luhui Hu Team | 2505.20814 | null |
| 2025-05-27 | Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt | Jianyu Chen Team | 2505.20795 | null |
| 2025-05-28 | ControlTac: Force- and Position-Controlled Tactile Data Augmentation with a Single Reference Image | Ruohan Gao Team | 2505.20498 | null |
| 2025-05-26 | OSVI-WM: One-Shot Visual Imitation for Unseen Tasks using World-Model-Guided Trajectory Generation | Farshad Khorrami Team | 2505.20425 | null |
| 2025-05-26 | Co-Design of Soft Gripper with Neural Physics | Xiaolong Wang Team | 2505.20404 | null |
| 2025-05-26 | EgoZero: Robot Learning from Smart Glasses | Lerrel Pinto Team | 2505.20290 | null |
| 2025-05-26 | URPlanner: A Universal Paradigm For Collision-Free Robotic Motion Planning Based on Deep Reinforcement Learning | Marcelo H. Ang Jr Team | 2505.20175 | null |
| 2025-05-27 | MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents | Xiaodan Liang Team | 2505.20148 | link |
| 2025-05-26 | ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving | Dongbin Zhao Team | 2505.20024 | link |
| 2025-05-26 | Inverse Q-Learning Done Right: Offline Imitation Learning in | Luca Viano Team | 2505.19946 | null |
| 2025-05-26 | TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning | Dongbin Zhao Team | 2505.19769 | null |
| 2025-05-26 | Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning | Jean-Baptiste Mouret Team | 2505.19717 | null |
| 2025-05-25 | Structured Reinforcement Learning for Combinatorial Decision-Making | Maximilian Schiffer Team | 2505.19053 | link |
| 2025-05-25 | WorldEval: World Model as Real-World Robot Policies Evaluator | Yi Xu Team | 2505.19017 | null |
| 2025-05-25 | Online Knowledge Distillation with Reward Guidance | Chen Jia Team | 2505.18952 | null |
| 2025-05-24 | Guided by Guardrails: Control Barrier Functions as Safety Instructors for Robotic Learning | Giovanni Beltrame Team | 2505.18858 | null |
| 2025-05-24 | On the Dual-Use Dilemma in Physical Reasoning and Force | Nikolaus Correll Team | 2505.18792 | null |
| 2025-05-24 | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Ziwei Wang Team | 2505.18719 | null |
| 2025-05-24 | MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations | Hong Thanh Nguyen Team | 2505.18595 | null |
| 2025-05-24 | Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning | Zhiyun Lin Team | 2505.18487 | null |
| 2025-05-24 | Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy | Yu She Team | 2505.18474 | null |
| 2025-05-24 | ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning | Yu She Team | 2505.18472 | null |
| 2025-05-23 | ProgRM: Build Better GUI Agents with Progress Rewards | Kai Yu Team | 2505.18121 | null |
| 2025-05-23 | Classification of assembly tasks combining multiple primitive actions using Transformers and xLSTMs | Pedro Neto Team | 2505.18012 | null |
| 2025-05-23 | Is Single-View Mesh Reconstruction Ready for Robotics? | Ingmar Posner Team | 2505.17966 | null |
| 2025-05-23 | SynRES: Towards Referring Expression Segmentation in the Wild via Synthetic Data | Donghyun Kim Team | 2505.17695 | null |
| 2025-05-23 | Learning Equilibria from Data: Provably Efficient Multi-Agent Imitation Learning | Giorgia Ramponi Team | 2505.17610 | null |
| 2025-05-23 | Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy | Bin Zhao Team | 2505.17434 | null |
| 2025-05-23 | Bootstrapping Imitation Learning for Long-horizon Manipulation via Hierarchical Data Collection Space | Hui Cheng Team | 2505.17389 | null |
| 2025-05-22 | ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems | Farhad Imani Team | 2505.17295 | null |
| 2025-05-22 | CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning | Limin Wang Team | 2505.17006 | null |
| 2025-05-22 | 3D Equivariant Visuomotor Policy Learning via Spherical Projection | Robin Walters Team | 2505.16969 | null |
| 2025-05-22 | Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only | Donglin Wang Team | 2505.16856 | null |
| 2025-05-22 | Find the Fruit: Designing a Zero-Shot Sim2Real Deep RL Planner for Occlusion Aware Plant Manipulation | Soumik Sarkar Team | 2505.16547 | null |
| 2025-05-24 | ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models | Xiuying Chen Team | 2505.16517 | null |
| 2025-05-22 | Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2) | Junchi Yan Team | 2505.16394 | null |
| 2025-05-22 | TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation | Hengdi Zhang Team | 2505.16289 | null |
| 2025-05-22 | SEM: Enhancing Spatial Understanding for Robust Robot Manipulation | Zhizhong Su Team | 2505.16196 | null |
| 2025-05-22 | Tactile-based Reinforcement Learning for Adaptive Grasping under Observation Uncertainties | Yang Ye Team | 2505.16167 | null |
| 2025-05-21 | WaveTouch: Active Tactile Sensing Using Vibro-Feedback for Classification of Variable Stiffness and Infill Density Objects | Bakhtiyar Orazbayev Team | 2505.16062 | null |
| 2025-05-25 | Proactive Hierarchical Control Barrier Function-Based Safety Prioritization in Close Human-Robot Interaction Scenarios | Prashanth Krishnamurthy Team | 2505.16055 | null |
| 2025-05-21 | UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning | Si Liu Team | 2505.15725 | null |
| 2025-05-21 | Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization | Junwei Liang Team | 2505.15660 | null |
| 2025-05-21 | FLARE: Robot Learning with Implicit World Modeling | Linxi Fan Team | 2505.15659 | null |
| 2025-05-21 | Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | Ken Goldberg Team | 2505.15517 | null |
| 2025-05-21 | Guided Policy Optimization under Partial Observability | Zongqing Lu Team | 2505.15418 | link |
| 2025-05-21 | Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control | Jungwook Choi Team | 2505.15304 | null |
| 2025-05-21 | Learning-based Autonomous Oversteer Control and Collision Avoidance | Seung-Hyun Kong Team | 2505.15275 | null |
| 2025-05-21 | Filtering Learning Histories Enhances In-Context Reinforcement Learning | Santiago Paternain Team | 2505.15143 | null |
| 2025-05-21 | Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation | Xiaodong He Team | 2505.15098 | null |
| 2025-05-20 | RoboCulture: A Robotics Platform for Automated Biological Experimentation | Milica Radisic Team | 2505.14941 | null |
| 2025-05-20 | Imitation Learning via Focused Satisficing | Brian Ziebart Team | 2505.14820 | null |
| 2025-05-20 | DORA: Object Affordance-Guided Reinforcement Learning for Dexterous Robotic Manipulation | Jianwei Zhang Team | 2505.14819 | null |
| 2025-05-20 | Vid2World: Crafting Video Diffusion Models to Interactive World Models | Mingsheng Long Team | 2505.14357 | null |
| 2025-05-20 | AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory | Ping Luo Team | 2505.14030 | null |
| 2025-05-20 | RLVR-World: Training World Models with Reinforcement Learning | Mingsheng Long Team | 2505.13934 | link |
| 2025-05-20 | Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning | Yutong Ban Team | 2505.13925 | null |
| 2025-05-20 | Learning to Insert for Constructive Neural Vehicle Routing Solver | Qingfu Zhang Team | 2505.13904 | null |
| 2025-05-20 | Structured Agent Distillation for Large Language Model | Yanzhi Wang Team | 2505.13820 | null |
| 2025-05-21 | Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation | Georgia Chalvatzaki Team | 2505.13667 | null |
| 2025-05-19 | TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion | Minh Nhat Vu Team | 2505.13549 | null |
| 2025-05-19 | GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation | Rose Hendrix Team | 2505.13441 | null |
| 2025-05-19 | KinTwin: Imitation Learning with Torque and Muscle Driven Biomechanical Models Enables Precise Replication of Able-Bodied and Impaired Movement from Markerless Motion Capture | R. James Cotton Team | 2505.13436 | null |
| 2025-05-19 | TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation | Jiangmiao Pang Team | 2505.12748 | null |
| 2025-05-19 | Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation | Chi-Wing Fu Team | 2505.12744 | null |
| 2025-05-19 | Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning | Taesup Moon Team | 2505.12737 | null |
| 2025-05-19 | DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories | Linxi Fan Team | 2505.12705 | null |
| 2025-05-19 | Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion | Qi Wu Team | 2505.12679 | null |
| 2025-05-19 | HIL: Hybrid Imitation Learning of Diverse Parkour Skills from Videos | Xue Bin Peng Team | 2505.12619 | null |
| 2025-05-18 | MTIL: Encoding Full History with Mamba for Temporal Imitation Learning | Zhouping Yin Team | 2505.12410 | link |
| 2025-05-18 | PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis | Zhipong Cai Team | 2505.12294 | null |
| 2025-05-20 | RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction | Bo Zhao Team | 2505.12224 | null |
| 2025-05-20 | Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study | Hae-Won Park Team | 2505.12222 | null |
| 2025-05-17 | L2D2: Robot Learning from 2D Drawings | Dylan P. Losey Team | 2505.12072 | null |
| 2025-05-17 | H2R: A Human-to-Robot Data Augmentation for Robot Pre-training from Videos | Shanghang Zhang Team | 2505.11920 | null |
| 2025-05-17 | GLOVER++: Unleashing the Potential of Affordance Learning from Human Behaviors for Robotic Manipulation | Junwei Liang Team | 2505.11865 | null |
| 2025-05-17 | Learning IMU Bias with Diffusion Model | Guoquan Huang Team | 2505.11763 | null |
| 2025-05-16 | Zero-Shot Visual Generalization in Robot Manipulation | Gaurav Sukhatme Team | 2505.11719 | null |
| 2025-05-16 | Employing Laban Shape for Generating Emotionally and Functionally Expressive Trajectories in Robotic Manipulators | Alessandro Roncone Team | 2505.11716 | null |
| 2025-05-16 | EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video | Jian Zhang Team | 2505.11709 | null |
| 2025-05-16 | Grounded Task Axes: Zero-Shot Semantic Skill Generalization via Task-Axis Controllers and Visual Foundation Models | Oliver Kroemer Team | 2505.11680 | null |
| 2025-05-16 | SHIELD: Safety on Humanoids via CBFs In Expectation on Learned Dynamics | Aaron D. Ames Team | 2505.11494 | null |
| 2025-05-16 | Exploiting Radiance Fields for Grasp Generation on Novel Synthetic Views | Todor Stoyanov Team | 2505.11467 | null |
| 2025-05-16 | ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations | Jesse Zhang Team | 2505.10911 | null |
| 2025-05-16 | Counterfactual Behavior Cloning: Offline Imitation Learning from Imperfect Human Demonstrations | Dylan P. Losey Team | 2505.10760 | null |
| 2025-05-15 | Infinigen-Sim: Procedural Generation of Articulated Simulation Assets | Jia Deng Team | 2505.10755 | null |
| 2025-05-15 | Knowledge capture, adaptation and composition (KCAC): A framework for cross-task curriculum learning in robotic manipulation | Yan Jin Team | 2505.10522 | null |
| 2025-05-15 | IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning | Junshan Zhang Team | 2505.10442 | null |
| 2025-05-15 | NVSPolicy: Adaptive Novel-View Synthesis for Generalizable Language-Conditioned Policy Learning | Chengyuan Chen Team | 2505.10359 | null |
| 2025-05-15 | SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning | Axel Krieger Team | 2505.10251 | null |
| 2025-05-15 | Training People to Reward Robots | Matthew Howard Team | 2505.10151 | null |
| 2025-05-15 | EmbodiedMAE: A Unified 3D Multi-Modal Representation for Robot Manipulation | Jianye Hao Team | 2505.10105 | null |
| 2025-05-15 | FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation | Qing Li Team | 2505.10075 | null |
| 2025-05-15 | APEX: Action Priors Enable Efficient Exploration for Skill Imitation on Articulated Robots | Guillaume Sartoretti Team | 2505.10022 | null |
| 2025-05-15 | ImagineBench: Evaluating Reinforcement Learning with Large Language Model Rollouts | Yang Yu Team | 2505.10010 | link |
| 2025-05-16 | PointArena: Probing Multimodal Grounding Through Language-Guided Pointing | Ranjay Krishna Team | 2505.09990 | null |
| 2025-05-15 | Learning Diverse Natural Behaviors for Enhancing the Agility of Quadrupedal Robots | Chunlin Chen Team | 2505.09979 | null |
| 2025-05-14 | Learning Rock Pushability on Rough Planetary Terrain | Cagri Kilic Team | 2505.09833 | null |
| 2025-05-14 | Trailblazer: Learning offroad costmaps for long range planning | Srikanth Saripalli Team | 2505.09739 | null |
| 2025-05-14 | EnerVerse-AC: Envisioning Embodied Environments with Action Condition | Guanghui Ren Team | 2505.09723 | null |
| 2025-05-14 | ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | Daniel Seita Team | 2505.09698 | null |
| 2025-05-14 | DataMIL: Selecting Data for Robot Imitation Learning with Datamodels | Roberto Martín-Martín Team | 2505.09603 | null |
| 2025-05-14 | Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware | Ken Goldberg Team | 2505.09601 | null |
| 2025-05-14 | VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation | Shuo Wang Team | 2505.09577 | null |
| 2025-05-14 | Learning Long-Context Diffusion Policies via Past-Token Prediction | Chelsea Finn Team | 2505.09561 | null |
| 2025-05-14 | Distilling Realizable Students from Unrealizable Teachers | Sanjiban Choudhury Team | 2505.09546 | null |
| 2025-05-14 | Exploring Pose-Guided Imitation Learning for Robotic Precise Insertion | Qixin Cao Team | 2505.09424 | null |
| 2025-05-14 | Neural Multivariate Regression: Qualitative Insights from the Unconstrained Feature Model | Keith Ross Team | 2505.09308 | null |
| 2025-05-14 | Latent Theory of Mind: A Decentralized Diffusion Architecture for Cooperative Manipulation | Guillaume Sartoretti Team | 2505.09144 | null |
| 2025-05-14 | FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis | He Wang Team | 2505.09109 | null |
| 2025-05-14 | Imitation Learning for Adaptive Control of a Virtual Soft Exoglove | Letizia Gionfrida Team | 2505.09099 | null |
| 2025-05-13 | ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate, Irregular Bio-products Manipulation | Dongyi Wang Team | 2505.08986 | null |
| 2025-05-13 | Augmented Reality for RObots (ARRO): Pointing Visuomotor Policies Towards Visual Robustness | Wolfram Burgard Team | 2505.08627 | null |
| 2025-05-13 | Beyond Predefined Actions: Integrating Behavior Trees and Dynamic Movement Primitives for Robot Learning from Demonstration | Todor Stoyanov Team | 2505.08625 | null |
| 2025-05-13 | From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Jianye Hao Team | 2505.08548 | null |
| 2025-05-13 | Parameter Estimation using Reinforcement Learning Causal Curiosity: Limits and Challenges | Weisi Guo Team | 2505.08453 | null |
| 2025-05-13 | Adaptive Diffusion Policy Optimization for Robotic Manipulation | Zhuang Yang Team | 2505.08376 | null |
| 2025-05-13 | Learning Like Humans: Advancing LLM Reasoning Capabilities via Adaptive Difficulty Curriculum Learning and Expert-Guided Self-Reformulation | Qianchun Lu Team | 2505.08364 | null |
| 2025-05-13 | Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning | Biwei Huang Team | 2505.08361 | null |
| 2025-05-13 | HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands | Yunhui Liu Team | 2505.08213 | null |
| 2025-05-13 | CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding | Shuo Wang Team | 2505.08194 | null |
| 2025-05-12 | What Matters for Batch Online Reinforcement Learning in Robotics? | Chelsea Finn Team | 2505.08078 | null |
| 2025-05-12 | H | Huazhe Xu Team | 2505.07819 | null |
| 2025-05-12 | Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models | Jia-Bin Huang Team | 2505.07815 | null |
| 2025-05-12 | Improving Trajectory Stitching with Flow Models | Ioannis Havoutis Team | 2505.07802 | null |
| 2025-05-12 | Guiding Data Collection via Factored Scaling Curves | Anirudha Majumdar Team | 2505.07728 | null |
| 2025-05-12 | GelFusion: Enhancing Robotic Manipulation under Visual Constraints via Visuotactile Fusion | Peng Yin Team | 2505.07455 | null |
| 2025-05-12 | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | Donglin Wang Team | 2505.07395 | null |
| 2025-05-11 | X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real | Sanjiban Choudhury Team | 2505.07096 | null |
| 2025-05-11 | YOPOv2-Tracker: An End-to-End Agile Tracking and Navigation Framework from Perception to Action | Bailing Tian Team | 2505.06923 | null |
| 2025-05-10 | JaxRobotarium: Training and Deploying Multi-Robot Policies in 10 Minutes | Harish Ravichandar Team | 2505.06771 | null |
| 2025-05-10 | Learned IMU Bias Prediction for Invariant Visual Inertial Odometry | Nikolay Atanasov Team | 2505.06748 | null |
| 2025-05-10 | ACORN: Adaptive Contrastive Optimization for Safe and Robust Fine-Grained Robotic Manipulation | Zixian Yue Team | 2505.06628 | null |
| 2025-05-10 | Video-Enhanced Offline Reinforcement Learning: A Model-Based Approach | Xiaokang Yang Team | 2505.06482 | null |
| 2025-05-09 | Adaptive Wiping: Adaptive contact-rich manipulation through few-shot imitation learning with Force-Torque feedback and pre-trained object representations | Gentiane Venture Team | 2505.06451 | null |
| 2025-05-09 | VIN-NBV: A View Introspection Network for Next-Best-View Selection for Resource-Efficient 3D Reconstruction | Roni Sengupta Team | 2505.06219 | null |
| 2025-05-09 | Neuro-Symbolic Concepts | Jiajun Wu Team | 2505.06191 | null |
| 2025-05-07 | Efficient Sensorimotor Learning for Open-world Robot Manipulation | Yifeng Zhu Team | 2505.06136 | null |
| 2025-05-09 | Robot Learning Using Multi-Coordinate Elastic Maps | Reza Azadeh Team | 2505.06092 | null |
| 2025-05-09 | TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations | Abhinav Shrivastava Team | 2505.06079 | null |
| 2025-05-09 | 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks | Farshad Khorrami Team | 2505.05800 | null |
| 2025-05-09 | Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives | Mac Schwager Team | 2505.05787 | null |
| 2025-05-09 | FlowHFT: Flow Policy Induced Optimal High-Frequency Trading under Diverse Market Conditions | Steve Yang Team | 2505.05784 | null |
| 2025-05-08 | CLAM: Continuous Latent Action Models for Robot Learning from Unlabeled Demonstrations | Stephen Tu Team | 2505.04999 | null |
| 2025-05-08 | CubeDAgger: Improved Robustness of Interactive Imitation Learning without Violation of Dynamic Stability | Taisuke Kobayashi Team | 2505.04897 | null |
| 2025-05-08 | D-CODA: Diffusion for Coordinated Dual-Arm Data Augmentation | Daniel Seita Team | 2505.04860 | null |
| 2025-05-07 | Steerable Scene Generation with Post Training and Inference-Time Search | Russ Tedrake Team | 2505.04831 | null |
| 2025-05-07 | Primal-dual algorithm for contextual stochastic combinatorial optimization | Axel Parmentier Team | 2505.04757 | null |
| 2025-05-07 | Merging and Disentangling Views in Visual Reinforcement Learning for Robotic Manipulation | Henrik I. Christensen Team | 2505.04619 | null |
| 2025-05-06 | OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation | Donglin Wang Team | 2505.03912 | null |
| 2025-05-06 | AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control | Xiaolong Wang Team | 2505.03738 | null |
| 2025-05-06 | Meta-Optimization and Program Search using Language Models for Task and Motion Planning | Marc Toussaint Team | 2505.03725 | null |
| 2025-05-06 | Ergodic Generative Flows | Yinchuan Li Team | 2505.03561 | null |
| 2025-05-06 | RIFT: Closed-Loop RL Fine-Tuning for Realistic and Controllable Traffic Simulation | Sifa Zheng Team | 2505.03344 | null |
| 2025-05-06 | The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning | Abhinav Valada Team | 2505.03296 | null |
| 2025-05-05 | Sim2Real Transfer for Vision-Based Grasp Verification | Markus Vincze Team | 2505.03046 | link |
| 2025-05-05 | Zero-shot Sim2Real Transfer for Magnet-Based Tactile Sensor on Insertion Tasks | Jia Deng Team | 2505.02915 | null |
| 2025-05-05 | Re-purposing a modular origami manipulator into an adaptive physical computer for machine learning and robotic perception | Suyi Li Team | 2505.02744 | null |
| 2025-05-05 | Spatiotemporal Non-Uniformity-Aware Online Task Scheduling in Collaborative Edge Computing for Industrial Internet of Things | Bo Lei Team | 2505.02597 | null |
| 2025-05-05 | Automated Hybrid Reward Scheduling via Large Language Models for Robotic Skill Learning | Jianqiang Li Team | 2505.02483 | null |
| 2025-05-05 | MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans | Siyuan Huang Team | 2505.02388 | null |
| 2025-05-04 | Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning | Hao Su Team | 2505.02228 | null |
| 2025-05-04 | CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation | Hao Dong Team | 2505.02166 | null |
| 2025-05-04 | Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions | Mingyu Ding Team | 2505.02152 | null |
| 2025-05-03 | Act Natural! Extending Naturalistic Projection to Multimodal Behavior Scenarios | David Fridovich-Keil Team | 2505.01945 | null |
| 2025-05-07 | RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation | Xiaodan Liang Team | 2505.01709 | null |
| 2025-05-02 | FalconWing: An Open-Source Platform for Ultra-Light Fixed-Wing Aircraft Research | Sayan Mitra Team | 2505.01383 | null |
| 2025-05-06 | Robotic Visual Instruction | Xianzheng Ma Team | 2505.00693 | null |
| 2025-05-01 | Towards Autonomous Micromobility through Scalable Urban Simulation | Bolei Zhou Team | 2505.00690 | null |
| 2025-05-01 | DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation | Yang Gao Team | 2505.00527 | null |
| 2025-05-01 | Optimal Interactive Learning on the Job via Facility Location Planning | George Konidaris Team | 2505.00490 | null |
| 2025-04-30 | LLM-based Interactive Imitation Learning for Robotic Manipulation | Stefan Wermter Team | 2504.21769 | null |
| 2025-04-30 | RoboGround: Robotic Manipulation with Grounded Vision-Language Priors | Zhou Zhao Team | 2504.21530 | null |
| 2025-04-30 | Provably-Safe, Online System Identification | Ram Vasudevan Team | 2504.21486 | null |
| 2025-04-29 | TesserAct: Learning 4D Embodied World Models | Chuang Gan Team | 2504.20995 | null |
| 2025-04-29 | XPG-RL: Reinforcement Learning with Explainable Priority Guidance for Efficiency-Boosted Mechanical Search | Elena Shrestha Team | 2504.20969 | null |
| 2025-04-29 | PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations | Xuguang Lan Team | 2504.20520 | null |
| 2025-04-29 | SPARK Hand: Scooping-Pinching Adaptive Robotic Hand with Kempe Mechanism for Vertical Passive Grasp in Environmental Constraints | Wenzeng Zhang Team | 2504.20506 | null |
| 2025-04-28 | UTTG: A Universal Teleoperation Approach via Online Trajectory Generation | Hesheng Wang Team | 2504.19736 | null |
| 2025-04-28 | GPA-RAM: Grasp-Pretraining Augmented Robotic Attention Mamba for Spatial Task Learning | Mengyuan Liu Team | 2504.19683 | null |
| 2025-04-27 | PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies | Edward Adelson Team | 2504.19341 | null |
| 2025-04-29 | Learned Perceptive Forward Dynamics Model for Safe and Platform-aware Robotic Navigation | Marco Hutter Team | 2504.19322 | link |
| 2025-04-27 | Learning to Drive from a World Model | Yassine Yousfi Team | 2504.19077 | null |
| 2025-04-26 | RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning | Pieter Abbeel Team | 2504.18904 | null |
| 2025-04-26 | Imitation Learning for Autonomous Driving: Insights from Real-World Testing | Tufan Kumbasar Team | 2504.18847 | null |
| 2025-04-26 | Hierarchical Reinforcement Learning in Multi-Goal Spatial Navigation with Autonomous Mobile Robots | Alfredo Weitzenfeld Team | 2504.18794 | null |
| 2025-04-26 | STDArm: Transferring Visuomotor Policies From Static Data Training to Dynamic Robot Manipulation | Yanyong Zhang Team | 2504.18792 | null |
| 2025-04-25 | Generalization Capability for Imitation Learning | Yixiao Wang Team | 2504.18538 | null |
| 2025-04-25 | Instrumentation for Better Demonstrations: A Case Study | Francis wyffels Team | 2504.18481 | null |
| 2025-04-25 | Action Flow Matching for Continual Robot Learning | Lantao Liu Team | 2504.18471 | null |
| 2025-04-25 | Design and Evaluation of a UGV-Based Robotic Platform for Precision Soil Moisture Remote Sensing | George Nikolakopoulos Team | 2504.18284 | null |
| 2025-04-28 | Implementation Analysis of Collaborative Robot Digital Twins in Physics Engines | Hans D. Schotten Team | 2504.18200 | null |
| 2025-04-25 | Offline Learning of Controllable Diverse Behaviors | Ludovic Denoyer Team | 2504.18160 | null |
| 2025-04-24 | CIVIL: Causal and Intuitive Visual Imitation Learning | Dylan P. Losey Team | 2504.17959 | null |
| 2025-04-24 | Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning | Prithviraj Ammanabrolu Team | 2504.17950 | null |
| 2025-04-24 | Learning Attentive Neural Processes for Planning with Pushing Actions | Nicholas Roy Team | 2504.17924 | null |
| 2025-04-24 | CaRL: Learning Scalable Planning Policies with Simple Rewards | Andreas Geiger Team | 2504.17838 | null |
| 2025-04-23 | Learning Underwater Active Perception in Simulation | Donald G. Dansereau Team | 2504.17817 | null |
| 2025-04-24 | Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation | Jiangmiao Pang Team | 2504.17784 | null |
| 2025-04-24 | Integrating Learning-Based Manipulation and Physics-Based Locomotion for Whole-Body Badminton Robot Control | Dong Xuan Team | 2504.17771 | null |
| 2025-04-24 | Robotic Grinding Skills Learning Based on Geodesic Length Dynamic Motion Primitives | Han Ding Team | 2504.17216 | null |
| 2025-04-23 | Geometric Formulation of Unified Force-Impedance Control on SE(3) for Robotic Manipulators | Roberto Horowitz Team | 2504.17080 | null |
| 2025-04-23 | A Systematic Approach to Design Real-World Human-in-the-Loop Deep Reinforcement Learning: Salient Features, Challenges and Trade-offs | Younes Zerouali Team | 2504.17006 | null |
| 2025-04-23 | Latent Diffusion Planning for Imitation Learning | Chelsea Finn Team | 2504.16925 | null |
| 2025-04-23 | MOSAIC: A Skill-Centric Algorithmic Framework for Long-Horizon Manipulation Planning | Maxim Likhachev Team | 2504.16738 | null |
| 2025-04-23 | ManipDreamer: Boosting Robotic Manipulation World Model with Action Tree and Visual Guidance | Shanghang Zhang Team | 2504.16464 | null |
| 2025-04-22 | Mass-Adaptive Admittance Control for Robotic Manipulators | Logan E. Beaver Team | 2504.16224 | null |
| 2025-04-22 | | Ury Zhilinsky Team | 2504.16054 | null |
| 2025-04-22 | SPECI: Skill Prompts based Hierarchical Continual Imitation Learning for Robot Manipulation | Xiangli Nie Team | 2504.15561 | null |
| 2025-04-22 | VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation | Matei Ciocarlie Team | 2504.15535 | null |
| 2025-04-22 | Few-Shot Vision-Language Action-Incremental Policy Learning | Weili Guan Team | 2504.15517 | null |
| 2025-04-21 | LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning | Boyuan Chen Team | 2504.15472 | null |
| 2025-04-23 | Advancing Embodied Intelligence in Robotic-Assisted Endovascular Procedures: A Systematic Review of AI Solutions | Peng Qi Team | 2504.15327 | null |
| 2025-04-21 | Immersive Teleoperation Framework for Locomanipulation Tasks | Dimitrios Kanoulas Team | 2504.15229 | null |
| 2025-04-21 | A Genetic Fuzzy-Enabled Framework on Robotic Manipulation for In-Space Servicing | Kelly Cohen Team | 2504.15226 | null |
| 2025-04-21 | A General Infrastructure and Workflow for Quadrotor Deep Reinforcement Learning and Reality Deployment | Huaping Liu Team | 2504.15129 | null |
| 2025-04-21 | SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks | Animesh Garg Team | 2504.14857 | null |
| 2025-04-20 | Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline | Hongsheng Li Team | 2504.14709 | null |
| 2025-04-24 | Latent Representations for Visual Proprioception in Inexpensive Robots | Ladislau Bölöni Team | 2504.14634 | null |
| 2025-04-18 | DiffOG: Differentiable Policy Trajectory Optimization with Generalizability | Yu She Team | 2504.13807 | null |
| 2025-04-18 | Imitation Learning with Precisely Labeled Human Demonstrations | Yilong Song Team | 2504.13803 | null |
| 2025-04-21 | SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM | Javier Civera Team | 2504.13713 | link |
| 2025-04-18 | Self-Mixing Laser Interferometry: In Search of an Ambient Noise-Resilient Alternative to Acoustic Sensing | Francis wyffels Team | 2504.13711 | null |
| 2025-04-18 | On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting | Jan Peters Team | 2504.13618 | null |
| 2025-04-18 | A Model-Based Approach to Imitation Learning through Multi-Step Predictions | Na Li Team | 2504.13413 | null |
| 2025-04-17 | RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins | Ping Luo Team | 2504.13059 | null |
| 2025-04-17 | Adaptive Task Space Non-Singular Terminal Super-Twisting Sliding Mode Control of a 7-DOF Robotic Manipulator | E. Witrant Team | 2504.13056 | null |
| 2025-04-17 | Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation | Iman Soltani Team | 2504.12967 | null |
| 2025-04-17 | TSGS: Improving Gaussian Splatting for Transparent Surface Reconstruction via Normal and De-lighting Priors | Yi Yang Team | 2504.12799 | null |
| 2025-04-17 | Trajectory Adaptation using Large Language Models | Ravi Prakash Team | 2504.12755 | null |
| 2025-04-17 | Embodied Neuromorphic Control Applied on a 7-DOF Robotic Manipulator | Lei Wang Team | 2504.12702 | link |
| 2025-04-21 | A0: An Affordance-Aware Hierarchical Model for General Robotic Manipulation | Xiaodan Liang Team | 2504.12636 | null |
| 2025-04-17 | Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration | Jeannette Bohg Team | 2504.12609 | null |
| 2025-04-16 | Adapting a World Model for Trajectory Following in a 3D Game | Raluca Georgescu Team | 2504.12299 | null |
| 2025-04-16 | Towards Forceful Robotic Foundation Models: a Literature Survey | Nikolaus Correll Team | 2504.11827 | null |
| 2025-04-14 | Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning | Fei Liu Team | 2504.11493 | link |
| 2025-04-15 | Next-Future: Sample-Efficient Policy Learning for Robotic-Arm Tasks | Suryansh Kumar Team | 2504.11247 | null |
| 2025-04-17 | CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image | Yi Zhu Team | 2504.11230 | null |
| 2025-04-15 | Superfast Configuration-Space Convex Set Computation on GPUs for Online Motion Planning | Daniela Rus Team | 2504.10783 | link |
| 2025-04-14 | Improving In-Context Learning with Reasoning Distillation | Xiang Gao Team | 2504.10647 | null |
| 2025-04-14 | Flying Hand: End-Effector-Centric Framework for Versatile Aerial Manipulation Teleoperation and Policy Learning | Guanya Shi Team | 2504.10334 | null |
| 2025-04-14 | Look-to-Touch: A Vision-Enhanced Proximity and Tactile Sensor for Distance and Geometry Perception in Robotic Manipulation | Guoying Gu Team | 2504.10280 | null |
| 2025-04-14 | Prior Does Matter: Visual Navigation via Denoising Diffusion Bridge Models | Hui Cheng Team | 2504.10041 | link |
| 2025-04-14 | Efficient Task-specific Conditional Diffusion Policies: Shortcut Model Acceleration and SO(3) Optimization | Wei Sui Team | 2504.09927 | null |
| 2025-04-12 | Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators | Marco M. Nicotra Team | 2504.09188 | null |
| 2025-04-11 | BiFlex: A Passive Bimodal Stiffness Flexible Wrist for Manipulation in Unstructured Environments | Roberto Martín-Martín Team | 2504.08706 | null |
| 2025-04-11 | Diffusion Models for Robotic Manipulation: A Survey | Rania Rayyes Team | 2504.08438 | null |
| 2025-04-10 | Echo: An Open-Source, Low-Cost Teleoperation System with Force Feedback for Dataset Collection in Robot Learning | Dzmitry Tsetserukou Team | 2504.07939 | null |
| 2025-04-10 | TOCALib: Optimal control library with interpolation for bimanual manipulation and obstacles avoidance | Aleksandr Panov Team | 2504.07708 | null |
| 2025-04-10 | Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction | Hesheng Wang Team | 2504.07375 | link |
| 2025-04-09 | Adaptive Vision-Guided Robotic Arm Control for Precision Pruning in Dynamic Orchard Environments | Manoj Karkee Team | 2504.07309 | null |
| 2025-04-09 | AssistanceZero: Scalably Solving Assistance Games | Anca Dragan Team | 2504.07091 | link |
| 2025-04-09 | Two by Two: Learning Multi-Task Pairwise Objects Assembly for Generalizable Robot Manipulation | Huazhe Xu Team | 2504.06961 | null |
| 2025-04-09 | Developing Modular Grasping and Manipulation Pipeline Infrastructure to Streamline Performance Benchmarking | Holly Yanco Team | 2504.06819 | null |
| 2025-04-09 | Interactive Expressive Motion Generation Using Dynamic Movement Primitives | Kai O. Arras Team | 2504.06735 | null |
| 2025-04-09 | Overcoming Dynamic Environments: A Hybrid Approach to Motion Planning for Manipulators | Gavin Paul Team | 2504.06596 | null |
| 2025-04-09 | CAFE-AD: Cross-Scenario Adaptive Feature Enhancement for Trajectory Planning in Autonomous Driving | Yanyong Zhang Team | 2504.06584 | link |
| 2025-04-09 | OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning | Tyler Fenstermaker Team | 2504.06538 | null |
| 2025-04-08 | ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface | Rui Chen Team | 2504.06156 | null |
| 2025-04-08 | MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos | Marc Pollefeys Team | 2504.06084 | null |
| 2025-04-08 | Learning-enhanced electronic skin for tactile sensing on deformable surface based on electrical impedance tomography | Yunjie Yang Team | 2504.05987 | null |
| 2025-04-08 | Stratified Expert Cloning with Adaptive Selection for User Retention in Large-Scale Recommender Systems | Yongqi Liu Team | 2504.05628 | null |
| 2025-04-08 | TW-CRL: Time-Weighted Contrastive Reward Learning for Efficient Inverse Reinforcement Learning | Stephen Xia Team | 2504.05585 | null |
| 2025-04-07 | SPARK-Remote: A Cost-Effective System for Remote Bimanual Robot Teleoperation | Karthik Desingh Team | 2504.05488 | null |
| 2025-04-07 | RobustDexGrasp: Robust Dexterous Grasping of General Objects from Single-view Perception | Jie Song Team | 2504.05287 | null |
| 2025-04-07 | Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation | Wei Zhang Team | 2504.05225 | link |
| 2025-04-07 | Wavelet Policy: Imitation Policy Learning in Frequency Domain with Wavelet Transforms | Hongrui Zhu Team | 2504.04991 | null |
| 2025-04-07 | Embodied Perception for Test-time Grasping Detection Adaptation with Knowledge Infusion | Fengyu Zhou Team | 2504.04795 | null |
| 2025-04-06 | Tool-as-Interface: Learning Robot Policies from Human Tool Usage through Imitation Learning | Katherine Driggs-Campbell Team | 2504.04612 | null |
| 2025-04-06 | Diffusion-Based Approximate MPC: Fast and Consistent Imitation of Multi-Modal Action Distributions | Katherine J. Kuchenbecker Team | 2504.04603 | null |
| 2025-04-06 | DexTOG: Learning Task-Oriented Dexterous Grasp with Language | Cewu Lu Team | 2504.04573 | null |
| 2025-04-06 | DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Cluttered Environments | Lin Shao Team | 2504.04516 | null |
| 2025-04-06 | Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers | Yuke Zhu Team | 2504.04395 | null |
| 2025-04-05 | ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning | Robert K. Katzschmann Team | 2504.04259 | null |
| 2025-04-09 | Digital Gene: Learning about the Physical World through Analytic Concepts | Cewu Lu Team | 2504.04170 | null |
| 2025-04-04 | Dexterous Manipulation through Imitation Learning: A Survey | Hong Zhang Team | 2504.03515 | null |
| 2025-04-04 | GraphSeg: Segmented 3D Representations via Graph Edge Addition and Contraction | Weiming Zhi Team | 2504.03129 | null |
| 2025-04-03 | Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets | Abhishek Gupta Team | 2504.02792 | null |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Shibiao Xu Team | 2504.02477 | null |
| 2025-04-02 | RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics | Qiang Nie Team | 2504.02069 | null |
| 2025-04-02 | Slot-Level Robotic Placement via Visual Imitation from Single Human Video | Arsalan Mousavian Team | 2504.01959 | null |
| 2025-04-02 | Learning with Imperfect Models: When Multi-step Prediction Mitigates Compounding Error | Nikolai Matni Team | 2504.01766 | null |
| 2025-04-02 | TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication | Karla Stepanova Team | 2504.01708 | null |
| 2025-04-02 | 8-DoFs Cable Driven Parallel Robots for Bimanual Teleportation | Josie Hughes Team | 2504.01554 | null |
| 2025-04-02 | Bi-LAT: Bilateral Control-Based Imitation Learning via Natural Language and Action Chunking with Transformers | Yuki Uranishi Team | 2504.01301 | null |
| 2025-04-02 | The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction | Matthew K. X. J Pan Team | 2504.01260 | null |
| 2025-04-01 | Energy Weighted Learning Progress Guided Interleaved Multi-Task Learning | Erhan Oztop Team | 2504.00707 | null |
| 2025-04-01 | Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs | Masaya Kinoshita Team | 2504.00614 | null |
| 2025-04-01 | Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation | Dong Wang Team | 2504.00420 | null |
| 2025-03-31 | CBIL: Collective Behavior Imitation Learning for Fish from Real Videos | Taku Komura Team | 2504.00234 | null |
| 2025-04-02 | Sim-and-Real Co-Training: A Simple Recipe for Vision-Based Robotic Manipulation | Yuke Zhu Team | 2503.24361 | null |
| 2025-04-02 | AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World | Sergey Levine Team | 2503.24278 | link |
| 2025-03-31 | HACTS: a Human-As-Copilot Teleoperation System for Robot Learning | Jian Tang Team | 2503.24070 | null |
| 2025-03-31 | Learning 3D-Gaussian Simulators from RGB Videos | Georg Martius Team | 2503.24009 | null |
| 2025-03-31 | ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos | Dinesh Jayaraman Team | 2503.23877 | link |
| 2025-03-31 | Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile as Feedback Enables Pure Simulation Learning | Yue Wang Team | 2503.23835 | null |
| 2025-03-30 | Can Visuo-motor Policies Benefit from Random Exploration Data? A Case Study on Stacking | Florian T. Pokorny Team | 2503.23571 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-07-23 | BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems | Christian Berger Team | 2507.17722 | null |
| 2025-07-23 | InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation | Jiangmiao Pang Team | 2507.17520 | null |
| 2025-07-23 | Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection | Elisa Ricci Team | 2507.17456 | null |
| 2025-07-23 | VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization | Shoaib Ehsan Team | 2507.17455 | null |
| 2025-07-23 | Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection | Xi Li Team | 2507.17436 | null |
| 2025-07-23 | Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models | Guanghui Sun Team | 2507.17379 | null |
| 2025-07-23 | RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding | Tianyang Wang Team | 2507.17353 | null |
| 2025-07-23 | HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study | Maria Spence Team | 2507.17118 | null |
| 2025-07-23 | FedVLM: Scalable Personalized Vision-Language Models through Federated Learning | Habeeb Olufowobi Team | 2507.17088 | null |
| 2025-07-22 | VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings | Kannan Achan Team | 2507.17080 | null |
| 2025-07-22 | Controllable Hybrid Captioner for Improved Long-form Video Understanding | Arun Reddy Team | 2507.17047 | null |
| 2025-07-22 | Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning | Kai Chen Team | 2507.16814 | null |
| 2025-07-22 | Cooling Matters: Benchmarking Large Language Models and Vision-Language Models on Liquid-Cooled Versus Air-Cooled H100 GPU Systems | Arslan Munir Team | 2507.16781 | null |
| 2025-07-22 | Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation | Ke Yang Team | 2507.16716 | null |
| 2025-07-22 | Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory | Marco Hutter Team | 2507.16713 | null |
| 2025-07-22 | Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models | Chao Zhang Team | 2507.16524 | null |
| 2025-07-22 | SceneLoom: Communicating Data with Scene Context | Siming Chen Team | 2507.16466 | null |
| 2025-07-22 | Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models | Isao Echizen Team | 2507.16257 | null |
| 2025-07-22 | SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction | Jiaqi Wang Team | 2507.15852 | null |
| 2025-07-21 | Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models | Erkut Erdem Team | 2507.15824 | null |
| 2025-07-23 | Visual-Language Model Knowledge Distillation Method for Image Quality Assessment | Jiarun Song Team | 2507.15680 | null |
| 2025-07-21 | Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging | Margret Keuper Team | 2507.15576 | null |
| 2025-07-21 | HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation | Robby T. Tan Team | 2507.15542 | null |
| 2025-07-21 | Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner | Lin Ma Team | 2507.15509 | null |
| 2025-07-21 | One Last Attention for Your Vision-Language Model | Zhiqiang Shen Team | 2507.15480 | null |
| 2025-07-21 | EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | Xinlei Chen Team | 2507.15428 | null |
| 2025-07-21 | In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems | Christoph Busch Team | 2507.15285 | null |
| 2025-07-21 | VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving | Tong Heng Lee Team | 2507.15266 | null |
| 2025-07-20 | Survey of GenAI for Automotive Software Development: From Requirements to Executable Code | Alois Knoll Team | 2507.15025 | null |
| 2025-07-20 | Hierarchical Cross-modal Prompt Learning for Vision-Language Models | Zhenhua Huang Team | 2507.14976 | null |
| 2025-07-20 | FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models | Mengnan Du Team | 2507.14823 | null |
| 2025-07-19 | IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark | Ruiheng Zhang Team | 2507.14449 | null |
| 2025-07-18 | CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation | Nicolas Thome Team | 2507.14312 | null |
| 2025-07-18 | In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding | Leonid Sigal Team | 2507.14298 | null |
| 2025-07-18 | VLA-Mark: A cross modal watermark for large vision-language alignment model | Xuming Hu Team | 2507.14067 | null |
| 2025-07-18 | EdgeVLA: Efficient Vision-Language-Action Models | Benjamin Bolte Team | 2507.14049 | null |
| 2025-07-18 | Moodifier: MLLM-Enhanced Emotion-Driven Image Editing | Sharon X. Huang Team | 2507.14024 | null |
| 2025-07-18 | When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models | Alberto Cazzaniga Team | 2507.13868 | null |
| 2025-07-18 | Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions | Jiajun Zhang Team | 2507.13773 | null |
| 2025-07-17 | LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning | Margrit Betke Team | 2507.13568 | null |
| 2025-07-17 | COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark | Vasu Sharma Team | 2507.13405 | null |
| 2025-07-17 | VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning | Jiaya Jia Team | 2507.13348 | null |
| 2025-07-17 | Leveraging Language Prior for Infrared Small Target Detection | Pravendra Singh Team | 2507.13113 | null |
| 2025-07-17 | GLAD: Generalizable Tuning for Vision-Language Models | Shifeng Chen Team | 2507.13089 | null |
| 2025-07-17 | Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection | Changwen Zheng Team | 2507.13061 | null |
| 2025-07-21 | LaViPlan: Language-Guided Visual Path Planning with RLVR | Hayeon Oh Team | 2507.12911 | null |
| 2025-07-17 | City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning | Xiaowen Chu Team | 2507.12795 | null |
| 2025-07-16 | VLMgineer: Vision Language Models as Robotic Toolsmiths | Dinesh Jayaraman Team | 2507.12644 | null |
| 2025-07-16 | NLI4VolVis: Natural Language Interaction for Volume Visualization via LLM Multi-Agents and Editable 3D Gaussian Splatting | Chaoli Wang Team | 2507.12621 | null |
| 2025-07-16 | MindJourney: Test-Time Scaling with World Models for Spatial Reasoning | Chuang Gan Team | 2507.12508 | null |
| 2025-07-16 | ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving | Xinge Zhu Team | 2507.12499 | null |
| 2025-07-15 | Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering | Dimosthenis Karatzas Team | 2507.12490 | null |
| 2025-07-20 | PhysX-3D: Physical-Grounded 3D Asset Generation | Ziwei Liu Team | 2507.12465 | null |
| 2025-07-16 | Describe Anything Model for Visual Question Answering on Text-rich Images | Min Xu Team | 2507.12441 | null |
| 2025-07-16 | AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models | Sihao Ding Team | 2507.12414 | null |
| 2025-07-16 | Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models | Bernhard Kainz Team | 2507.12236 | null |
| 2025-07-16 | InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing | Wen-Huang Cheng Team | 2507.12060 | null |
| 2025-07-16 | GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models | Rongrong Ji Team | 2507.11969 | null |
| 2025-07-16 | POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering | Qin Jin Team | 2507.11939 | null |
| 2025-07-15 | Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis | Lihang Ying Team | 2507.11730 | null |
| 2025-07-18 | How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study | Rossella Arcucci Team | 2507.11200 | null |
| 2025-07-15 | Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities | Yang Zhang Team | 2507.11155 | null |
| 2025-07-15 | Assessing Color Vision Test in Large Vision-language Models | Hongyang Chen Team | 2507.11153 | null |
| 2025-07-15 | MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models | Hamza Moustafa Team | 2507.11114 | null |
| 2025-07-15 | Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander | Lei Chen Team | 2507.11079 | null |
| 2025-07-15 | Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection | Guanzhong Tian Team | 2507.11003 | null |
| 2025-07-14 | EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Xiaojuan Qi Team | 2507.10548 | null |
| 2025-07-14 | CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding | Yi Wang Team | 2507.10449 | null |
| 2025-07-14 | Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter | Bin Luo Team | 2507.10355 | null |
| 2025-07-14 | Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection | Wenqiang Zhang Team | 2507.10225 | null |
| 2025-07-14 | BlueGlass: A Framework for Composite AI Safety | Kay-Ulrich Scholl Team | 2507.10106 | null |
| 2025-07-14 | Foundation Model Driven Robotics: A Comprehensive Review | Ammar Waheed Team | 2507.10087 | null |
| 2025-07-14 | LayLens: Improving Deepfake Understanding through Simplified Explanations | Abhinav Dhall Team | 2507.10066 | null |
| 2025-07-14 | CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books | Dimosthenis Karatzas Team | 2507.10053 | null |
| 2025-07-14 | Text-Driven Causal Representation Learning for Source-Free Domain Generalization | Zhen Lei Team | 2507.09961 | null |
| 2025-07-13 | NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection | Pulei Xiong Team | 2507.09795 | null |
| 2025-07-13 | Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score | Muhammad Haris Khan Team | 2507.09615 | null |
| 2025-07-13 | Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations | Guiguang Ding Team | 2507.09500 | null |
| 2025-07-13 | GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? | Huaxiu Yao Team | 2507.09491 | null |
| 2025-07-12 | Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models | Tat-Seng Chua Team | 2507.09209 | null |
| 2025-07-12 | MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models | Dahan Wang Team | 2507.09184 | null |
| 2025-07-12 | OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering | Niaz Abdolrahim Team | 2507.09155 | null |
| 2025-07-12 | RadEyeVideo: Enhancing general-domain Large Vision Language Model for chest X-ray analysis with video representations of eye gaze | Honghan Wu Team | 2507.09097 | null |
| 2025-07-11 | BlindSight: Harnessing Sparsity for Efficient VLMs | Steven K. Reinhardt Team | 2507.09071 | null |
| 2025-07-11 | Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery | Seana Coulson Team | 2507.09011 | null |
| 2025-07-11 | VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models | Olivier Déforges Team | 2507.08982 | null |
| 2025-07-11 | ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | Subarna Tripathi Team | 2507.08679 | null |
| 2025-07-11 | Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance | András Lőrincz Team | 2507.08624 | null |
| 2025-07-11 | Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data | Ambedkar Dukkipati Team | 2507.08610 | null |
| 2025-07-11 | BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis | Hui Xiong Team | 2507.08607 | null |
| 2025-07-11 | Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R | Sanidhya Kashyap Team | 2507.08505 | null |
| 2025-07-11 | LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning | Lei Fan Team | 2507.08496 | null |
| 2025-07-11 | Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models | Jianping Fan Team | 2507.08410 | null |
| 2025-07-11 | Making VLMs More Robot-Friendly: Self-Critical Distillation of Low-Level Procedural Reasoning | Yejin Choi Team | 2507.08224 | null |
| 2025-07-10 | CLIP Won't Learn Object-Attribute Binding from Natural Data and Here is Why | Thomas Brox Team | 2507.07985 | null |
| 2025-07-10 | Scaling RL to Long Videos | Song Han Team | 2507.07966 | null |
| 2025-07-10 | SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment | Lei Fan Team | 2507.07939 | null |
| 2025-07-10 | MoSE: Skill-by-Skill Mixture-of-Expert Learning for Autonomous Driving | Chao Zhang Team | 2507.07818 | null |
| 2025-07-10 | Energy-Guided Decoding for Object Hallucination Mitigation | Christopher Zach Team | 2507.07731 | null |
| 2025-07-10 | One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models | Cairong Zhao Team | 2507.07709 | null |
| 2025-07-10 | Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought | Daiki Chijiwa Team | 2507.07685 | null |
| 2025-07-11 | ViLU: Learning Vision-Language Uncertainties for Failure Prediction | Nicolas Thome Team | 2507.07620 | null |
| 2025-07-10 | LOSC: LiDAR Open-voc Segmentation Consolidator | Renaud Marlet Team | 2507.07605 | null |
| 2025-07-10 | The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs | Qun Liu Team | 2507.07562 | null |
| 2025-07-10 | ArchiveGPT: A human-centered evaluation of using a vision language model for image cataloguing | Markus Huff Team | 2507.07551 | null |
| 2025-07-11 | Entity Re-identification in Visual Storytelling via Contrastive Reinforcement Learning | David Martins de Matos Team | 2507.07340 | null |
| 2025-07-09 | ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation | Suren Kumar Team | 2507.07317 | null |
| 2025-07-09 | LangNavBench: Evaluation of Natural Language Understanding in Semantic Navigation | Angel X. Chang Team | 2507.07299 | null |
| 2025-07-09 | MagiC: Evaluating Multimodal Cognition Toward Grounded Visual Reasoning | Dan Goldwasser Team | 2507.07297 | null |
| 2025-07-09 | 4KAgent: Agentic Any Image to 4K Super-Resolution | Zhengzhong Tu Team | 2507.07105 | null |
| 2025-07-11 | Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models | Junfei Xiao Team | 2507.07104 | null |
| 2025-07-09 | Evaluating Attribute Confusion in Fashion Text-to-Image Generation | Davide Talon Team | 2507.07079 | null |
| 2025-07-09 | Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM | Sibei Yang Team | 2507.06973 | null |
| 2025-07-09 | CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale | Quan Wang Team | 2507.06959 | null |
| 2025-07-09 | VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation | Tat-Seng Chua Team | 2507.06899 | null |
| 2025-07-09 | HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement | Yanning Zhang Team | 2507.06814 | null |
| 2025-07-09 | Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu | Donghyeok Choi Team | 2507.06761 | null |
| 2025-07-09 | Text-promptable Object Counting via Quantity Awareness Enhancement | Li Li Team | 2507.06679 | null |
| 2025-07-09 | Cross-Modal Dual-Causal Learning for Long-Term Action Recognition | Fan Chao Team | 2507.06603 | null |
| 2025-07-09 | Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection | Xiangmin Xu Team | 2507.06510 | null |
| 2025-07-09 | 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds | Nick Haber Team | 2507.06484 | null |
| 2025-07-08 | VisioPath: Vision-Language Enhanced Model Predictive Control for Safe Autonomous Navigation in Mixed Traffic | Andreas A. Malikopoulos Team | 2507.06441 | null |
| 2025-07-08 | CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions | Yi R. Fung Team | 2507.06210 | null |
| 2025-07-08 | Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling | Naga Harshita Marupaka Team | 2507.06183 | null |
| 2025-07-10 | Skywork-R1V3 Technical Report | Yahui Zhou Team | 2507.06167 | null |
| 2025-07-08 | LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models | Hongming Shan Team | 2507.06140 | null |
| 2025-07-08 | GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing | Hao Liu Team | 2507.05887 | null |
| 2025-07-08 | Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports | Hitomi Yanaka Team | 2507.05799 | null |
| 2025-07-08 | SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning | Tao He Team | 2507.05798 | null |
| 2025-07-08 | A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation | Yue Gao Team | 2507.05731 | null |
| 2025-07-09 | Integrated Structural Prompt Learning for Vision-Language Models | Bin Luo Team | 2507.05677 | null |
| 2025-07-08 | R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding | Shabnam Ghadar Team | 2507.05673 | null |
| 2025-07-08 | Dynamic Rank Adaptation for Vision-Language Models | Bin Luo Team | 2507.05668 | null |
| 2025-07-08 | Structured Task Solving via Modular Embodied Intelligence: A Case Study on Rubik's Cube | Shenghai Yuan Team | 2507.05607 | null |
| 2025-07-08 | Rethinking Layered Graphic Design Generation with a Top-Down Approach | Qifeng Chen Team | 2507.05601 | null |
| 2025-07-08 | PaddleOCR 3.0 Technical Report | Yanjun Ma Team | 2507.05595 | null |
| 2025-07-07 | Fine-Grained Vision-Language Modeling for Multimodal Training Assistants in Augmented Reality | Junxiao Wang Team | 2507.05515 | null |
| 2025-07-07 | Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model | Even Oldridge Team | 2507.05513 | null |
| 2025-07-07 | OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts | Priyadarshini Panda Team | 2507.05427 | null |
| 2025-07-07 | pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models | Ramtin Pedarsani Team | 2507.05394 | null |
| 2025-07-07 | NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving | Cheng Lu Team | 2507.05227 | null |
| 2025-07-07 | All in One: Visual-Description-Guided Unified Point Cloud Segmentation | Rao Muhammad Anwer Team | 2507.05211 | null |
| 2025-07-07 | Differential Attention for Multimodal Crisis Event Analysis | Abdullah-Al-Zubaer Imran Team | 2507.05165 | null |
| 2025-07-07 | INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling | Bo Zheng Team | 2507.05056 | null |
| 2025-07-07 | Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision | Nicolas Padoy Team | 2507.05020 | null |
| 2025-07-07 | Training-free Generation of Temporally Consistent Rewards from VLMs | Jian Tang Team | 2507.04789 | null |
| 2025-07-07 | Vision-Language Models Can't See the Obvious | Sanath Narayan Team | 2507.04741 | null |
| 2025-07-07 | An analysis of vision-language models for fabric retrieval | Fabio Poiesi Team | 2507.04735 | null |
| 2025-07-07 | A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets | Jie Zhou Team | 2507.04699 | null |
| 2025-07-07 | MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding | Dinesh Manocha Team | 2507.04686 | null |
| 2025-07-07 | Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation | Chang Xu Team | 2507.04680 | null |
| 2025-07-06 | VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation | Lei Han Team | 2507.04524 | null |
| 2025-07-08 | FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection | Ruixuan Wang Team | 2507.04511 | null |
| 2025-07-06 | MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization | Changhao Chen Team | 2507.04509 | null |
| 2025-07-06 | Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection | Sanasam Ranbir Singh Team | 2507.04458 | null |
| 2025-07-06 | Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions | Johan Bos Team | 2507.04377 | null |
| 2025-07-05 | LVLM-Composer's Explicit Planning for Image Generation | Amina Grant Team | 2507.04152 | null |
| 2025-07-05 | Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation | Hunter Young Team | 2507.04151 | null |
| 2025-07-05 | PresentAgent: Multimodal Agent for Presentation Video Generation | Yang Zhao Team | 2507.04036 | null |
| 2025-07-05 | A Comparative Study of Specialized LLMs as Dense Retrievers | Jiafeng Guo Team | 2507.03958 | null |
| 2025-07-03 | ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects | Cewu Lu Team | 2507.02600 | null |
| 2025-07-02 | cVLA: Towards Efficient Camera-Space VLAs | Thomas Brox Team | 2507.02190 | null |
| 2025-07-02 | Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges | Anuj Sharma Team | 2507.02074 | null |
| 2025-07-01 | Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | Cordelia Schmid Team | 2507.02001 | null |
| 2025-07-02 | How Do Vision-Language Models Process Conflicting Information Across Modalities? | Ellie Pavlick Team | 2507.01790 | null |
| 2025-07-02 | Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition | Muzammil Behzad Team | 2507.01673 | null |
| 2025-07-02 | MARVIS: Modality Adaptive Reasoning over VISualizations | Chinmay Hegde Team | 2507.01544 | null |
| 2025-07-02 | Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence | Martin Schramm Team | 2507.01504 | null |
| 2025-07-02 | BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments | Mingzhai Sun Team | 2507.01485 | null |
| 2025-07-03 | TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control | Yanwei Fu Team | 2507.01424 | null |
| 2025-07-02 | CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning | Yoshitaka Ushiku Team | 2507.01409 | null |
| 2025-07-02 | Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model | Xi Li Team | 2507.01351 | null |
| 2025-07-02 | AIGVE-MACS: Unified Multi-Aspect Commenting and Scoring Model for AI-Generated Video Evaluation | Jiawei Zhang Team | 2507.01255 | null |
| 2025-07-02 | GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | Jie Tang Team | 2507.01006 | null |
| 2025-07-04 | Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations | Yunzhu Li Team | 2507.00990 | null |
| 2025-07-01 | Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact | Seyedali Mirjalili Team | 2507.00951 | null |
| 2025-07-01 | The Age of Sensorial Zero Trust: Why We Can No Longer Trust Our Senses | Fabio Correa Xavier Team | 2507.00907 | null |
| 2025-07-01 | ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models | Yaqi Xie Team | 2507.00898 | null |
| 2025-07-01 | GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond | Luc Van Gool Team | 2507.00886 | null |
| 2025-07-01 | UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement | Xiangxiang Chu Team | 2507.00721 | null |
| 2025-07-01 | Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English | Rajesh Sharma Team | 2507.00700 | null |
| 2025-07-01 | Context-Aware Academic Emotion Dataset and Benchmark | Wenwu Yang Team | 2507.00586 | null |
| 2025-07-01 | Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation | Rong Xiao Team | 2507.00537 | null |
| 2025-07-01 | Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving | Yadan Luo Team | 2507.00525 | null |
| 2025-06-30 | EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations | Sungzoon Cho Team | 2506.24016 | null |
| 2025-06-30 | The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models | Tieniu Tan Team | 2506.24000 | null |
| 2025-06-30 | GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models | Hassan Rivaz Team | 2506.23903 | null |
| 2025-06-30 | A Closer Look at Conditional Prompt Tuning for Vision-Language Models | Heng Tao Shen Team | 2506.23856 | null |
| 2025-06-30 | Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model | Fahad Shahbaz Khan Team | 2506.23822 | null |
| 2025-06-30 | Visual Textualization for Image Prompted Object Detection | Yan Xu Team | 2506.23785 | null |
| 2025-06-30 | PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? | Ransalu Senanayake Team | 2506.23725 | null |
| 2025-06-30 | On the Domain Robustness of Contrastive Vision-Language Models | Erik Rodner Team | 2506.23663 | null |
| 2025-06-30 | CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models | Bing Qin Team | 2506.23590 | null |
| 2025-06-30 | A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation | Jie Xu Team | 2506.23584 | null |
| 2025-07-01 | ZonUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding | ShengJing Yang Team | 2506.23491 | null |
| 2025-06-30 | Sanitizing Manufacturing Dataset Labels Using Vision-Language Models | Vinh Nguyen Team | 2506.23465 | null |
| 2025-06-29 | GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields | Yutaka Matsuo Team | 2506.23352 | null |
| 2025-06-29 | IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | Brandon Y. Feng Team | 2506.23329 | null |
| 2025-07-01 | SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting | Hongliang Ren Team | 2506.23309 | null |
| 2025-06-29 | Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models | Tanmoy Chakraborty Team | 2506.23122 | null |
| 2025-06-29 | MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings | Zhicheng Dou Team | 2506.23115 | null |
| 2025-06-29 | Empowering Small VLMs to Think with Dynamic Memorization and Exploration | Long Chen Team | 2506.23061 | null |
| 2025-06-29 | SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions | Maarten Sap Team | 2506.23046 | null |
| 2025-06-28 | Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models | Swadesh Swain Team | 2506.22982 | null |
| 2025-06-27 | MiCo: Multi-image Contrast for Reinforcement Visual Reasoning | Hengshuang Zhao Team | 2506.22434 | null |
| 2025-06-27 | Test-Time Consistency in Vision Language Models | Leonid Sigal Team | 2506.22395 | null |
| 2025-06-27 | Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation | Xun Xu Team | 2506.22375 | null |
| 2025-06-27 | Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment | Bo Du Team | 2506.22283 | null |
| 2025-06-27 | COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication | Albert Gatt Team | 2506.22274 | null |
| 2025-06-27 | Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs | Mahdieh Soleymani Baghshah Team | 2506.22146 | null |
| 2025-06-27 | Universal Retrieval for Multimodal Trajectory Modeling | Dehan Kong Team | 2506.22056 | null |
| 2025-06-27 | Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation | Daisuke Deguchi Team | 2506.22032 | null |
| 2025-06-27 | SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation | Xulei Yang Team | 2506.21892 | null |
| 2025-06-27 | Integrating Multi-Modal Sensors: A Review of Fusion Techniques for Intelligent Vehicles | Matthew J. Barth Team | 2506.21885 | null |
| 2025-06-27 | Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation | Zhiting Hu Team | 2506.21876 | null |
| 2025-06-27 | On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling | Ben Y. Zhao Team | 2506.21874 | null |
| 2025-06-27 | Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling | Yong Man Ro Team | 2506.21863 | null |
| 2025-06-27 | Embodied Domain Adaptation for Object Detection | Feras Dayoub Team | 2506.21860 | null |
| 2025-06-27 | The Cost of Avoiding Backpropagation | Hui Guan Team | 2506.21833 | null |
| 2025-06-26 | ViStruct: Simulating Expert-Like Reasoning Through Task Decomposition and Visual Attention Cues | Carolina Nobre Team | 2506.21762 | null |
| 2025-06-26 | Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs | Ismini Lourentzou Team | 2506.21656 | null |
| 2025-06-26 | Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration | Jian Wu Team | 2506.21509 | null |
| 2025-06-26 | Global and Local Entailment Learning for Natural World Imagery | Nathan Jacobs Team | 2506.21476 | null |
| 2025-06-26 | Spatial Mental Modeling from Limited Views | Li Fei-Fei Team | 2506.21458 | null |
| 2025-06-27 | ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models | Ziwei Liu Team | 2506.21356 | null |
| 2025-06-26 | LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning | Hayaru Shouno Team | 2506.21317 | null |
| 2025-06-26 | DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images | Ganesh Ramakrishnan Team | 2506.21316 | null |
| 2025-06-26 | World-aware Planning Narratives Enhance Large Vision-Language Model Planner | Xipeng Qiu Team | 2506.21230 | null |
| 2025-06-26 | Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion | Jian Liang Team | 2506.21144 | null |
| 2025-06-26 | V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling | Bin Ran Team | 2506.21041 | null |
| 2025-06-26 | Multimodal Prompt Alignment for Facial Expression Recognition | Shutao Li Team | 2506.21017 | null |
| 2025-06-26 | Style-Aligned Image Composition for Robust Detection of Abnormal Cells in Cytopathology | S Kevin Zhou Team | 2506.21001 | null |
| 2025-06-26 | TSDASeg: A Two-Stage Model with Direct Alignment for Interactive Point Cloud Segmentation | Yihong Wu Team | 2506.20991 | null |
| 2025-06-26 | SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes | Zheng Zhang Team | 2506.20990 | null |
| 2025-06-26 | Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends | Zeng-Guang Hou Team | 2506.20966 | null |
| 2025-06-26 | E-FreeM2: Efficient Training-Free Multi-Scale and Cross-Modal News Verification via MLLMs | Minh-Son Dao Team | 2506.20944 | null |
| 2025-06-25 | Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models | Zafer Dogan Team | 2506.20832 | null |
| 2025-06-25 | How do Foundation Models Compare to Skeleton-Based Approaches for Gesture Recognition in Human-Robot Interaction? | Bastian Leibe Team | 2506.20795 | null |
| 2025-06-27 | Shape2Animal: Creative Animal Generation from Natural Silhouettes | Trung-Nghia Le Team | 2506.20616 | null |
| 2025-06-25 | HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | Maja Matarić Team | 2506.20566 | null |
| 2025-06-25 | Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation | Morten Rieger Hannemose Team | 2506.20449 | null |
| 2025-06-25 | CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition | Michael Gienger Team | 2506.20373 | null |
| 2025-06-25 | Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards | Bo Zheng Team | 2506.20332 | null |
| 2025-06-25 | MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations | Vikram S. Adve Team | 2506.20100 | null |
| 2025-06-24 | Unified Vision-Language-Action Model | Zhaoxiang Zhang Team | 2506.19850 | null |
| 2025-06-24 | Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models | Christoph M. Friedrich Team | 2506.19825 | null |
| 2025-06-24 | CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation | Jiangmiao Pang Team | 2506.19816 | null |
| 2025-06-24 | UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation | Zhongliang Jiang Team | 2506.19694 | null |
| 2025-06-24 | PEVLM: Parallel Encoding for Vision-Language Models | Yong Wu Team | 2506.19651 | null |
| 2025-06-24 | V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis | Zuozhu Liu Team | 2506.19610 | null |
| 2025-06-24 | ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP | Bokui Chen Team | 2506.19608 | null |
| 2025-06-24 | Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects | Angelo Cangelosi Team | 2506.19579 | null |
| 2025-06-24 | Visual hallucination detection in large vision-language models via evidential conflict | Liping Jing Team | 2506.19513 | null |
| 2025-06-24 | T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models | Qingyao Wu Team | 2506.19498 | null |
| 2025-06-24 | Emergence of Text Readability in Vision Language Models | Bohyung Han Team | 2506.19389 | null |
| 2025-06-24 | Robotic Perception with a Large Tactile-Vision-Language Model for Physical Property Inference | Nutan Chen Team | 2506.19303 | null |
| 2025-06-24 | Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models | Dan Zeng Team | 2506.19300 | null |
| 2025-06-24 | Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding | Hui Xiong Team | 2506.19288 | null |
| 2025-06-24 | MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models | Bo Zheng Team | 2506.19257 | null |
| 2025-06-24 | Scaffolding Dexterous Manipulation with Vision-Language Models | Dorsa Sadigh Team | 2506.19212 | null |
| 2025-06-23 | Reading Smiles: Proxy Bias in Foundation Models for Facial Emotion Recognition | Bjoern W. Schuller Team | 2506.19079 | null |
| 2025-06-23 | HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models | Krzysztof Czarnecki Team | 2506.19072 | null |
| 2025-06-23 | GLIMPSE: Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation for Generative LVLMs | Guanxi Shen Team | 2506.18985 | null |
| 2025-06-23 | VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning | Jian Zhang Team | 2506.18564 | null |
| 2025-06-23 | Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey | Heng Tao Shen Team | 2506.18504 | null |
| 2025-06-23 | InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models | Wenhai Wang Team | 2506.18385 | null |
| 2025-06-23 | Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review | Jing Qin Team | 2506.18378 | null |
| 2025-06-23 | Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? | Bill Howe Team | 2506.18322 | null |
| 2025-06-24 | Referring Expression Instance Retrieval and A Strong End-to-End Baseline | JinQiao Wang Team | 2506.18246 | null |
| 2025-06-23 | Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning | Xinhai Zhao Team | 2506.18234 | null |
| 2025-06-22 | See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis | Xiaoxiao Li Team | 2506.18140 | null |
| 2025-06-22 | CLGRPO: Reasoning Ability Enhancement for Small VLMs | Zhiwang Zhang Team | 2506.18048 | null |
| 2025-06-22 | Adapting Vision-Language Models for Evaluating World Models | Sarah Parisot Team | 2506.17967 | null |
| 2025-06-21 | RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models | Marco Pavone Team | 2506.17811 | null |
| 2025-06-21 | MDSAM: Memory-Driven Sparse Attention Matrix for LVLMs Hallucination Mitigation | Xiaochuan Shi Team | 2506.17664 | null |
| 2025-06-21 | Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning | Yu-Chiang Frank Wang Team | 2506.17645 | null |
| 2025-06-21 | CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning | Xiaoling Wang Team | 2506.17629 | null |
| 2025-06-21 | DRAMA-X: A Fine-grained Intent Prediction and Risk Reasoning Benchmark For Driving | Zhengzhong Tu Team | 2506.17590 | null |
| 2025-06-21 | HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models | Tao He Team | 2506.17587 | null |
| 2025-06-20 | Trustworthy Few-Shot Transfer of Medical VLMs through Split Conformal Prediction | Jose Dolz Team | 2506.17503 | null |
| 2025-06-20 | Few-Shot, Now for Real: Medical VLMs Adaptation without Balanced Sets or Validation | Ismail Ben Ayed Team | 2506.17500 | null |
| 2025-06-20 | General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting | Georgios Georgakis Team | 2506.17462 | null |
| 2025-06-20 | Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? | Klara Nahrstedt Team | 2506.17417 | null |
| 2025-06-20 | VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | Hengshuang Zhao Team | 2506.17221 | null |
| 2025-06-20 | Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens | Chuang Gan Team | 2506.17218 | null |
| 2025-06-20 | Do We Need Large VLMs for Spotting Soccer Actions? | Sandeep Chaurasia Team | 2506.17144 | null |
| 2025-06-20 | Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments | Nathaniel D. Bastian Team | 2506.16994 | null |
| 2025-06-20 | FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation | Jinqiao Wang Team | 2506.16806 | null |
| 2025-06-20 | Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes | Chen Feng Team | 2506.16805 | null |
| 2025-06-20 | Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models | Xiaohua Xu Team | 2506.16760 | null |
| 2025-06-20 | TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion | Xinbo Gao Team | 2506.16730 | null |
| 2025-06-20 | V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos | Xiaoyu Qin Team | 2506.16716 | null |
| 2025-06-20 | VLM-Empowered Multi-Mode System for Efficient and Safe Planetary Navigation | Liang Ding Team | 2506.16703 | null |
| 2025-06-20 | LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation | Jing Liu Team | 2506.16691 | null |
| 2025-06-19 | CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity | Yunzhu Li Team | 2506.16652 | null |
| 2025-06-19 | History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation | Fatemeh Afghah Team | 2506.16623 | null |
| 2025-06-19 | GoalLadder: Incremental Goal Discovery with Vision-Language Models | Shimon Whiteson Team | 2506.16396 | null |
| 2025-06-19 | CLIP-MG: Guiding Semantic Attention with Skeletal Pose Features and RGB Data for Micro-Gesture Recognition on the iMiGUE Dataset | Amith Adiraju Team | 2506.16385 | null |
| 2025-06-19 | FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models | Tat-Seng Chua Team | 2506.16218 | null |
| 2025-06-19 | AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models | Shanghang Zhang Team | 2506.16112 | null |
| 2025-06-19 | Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation | Yansong Tang Team | 2506.16058 | null |
| 2025-06-19 | DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning | Zongqing Lu Team | 2506.16012 | null |
| 2025-06-18 | VectorEdits: A Dataset and Benchmark for Instruction-Based Editing of Vector Graphics | Michal Štefánik Team | 2506.15903 | null |
| 2025-06-18 | GenRecal: Generation after Recalibration from Large to Small Vision-Language Models | Yueh-Hua Wu Team | 2506.15681 | null |
| 2025-06-18 | Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning | Imran Razzak Team | 2506.15649 | null |
| 2025-06-18 | FindingDory: A Benchmark to Evaluate Memory in Embodied Agents | Zsolt Kira Team | 2506.15635 | null |
| 2025-06-18 | WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts | Rémi Lebret Team | 2506.15594 | link |
| 2025-06-18 | DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement | Zhuang Li Team | 2506.15583 | link |
| 2025-06-18 | Context-Informed Grounding Supervision | Minjoon Seo Team | 2506.15480 | link |
| 2025-06-19 | OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models | Guotai Wang Team | 2506.15318 | null |
| 2025-06-18 | MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering | Adrian K. Davison Team | 2506.15298 | null |
| 2025-06-18 | ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections | Shin'ichi Satoh Team | 2506.15180 | null |
| 2025-06-18 | DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory | Yue Gao Team | 2506.15096 | null |
| 2025-06-18 | An Empirical Study of Bugs in Data Visualization Libraries | Chengnian Sun Team | 2506.15084 | link |
| 2025-06-17 | PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning | Yeyun Gong Team | 2506.14907 | link |
| 2025-06-17 | RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills | Chuang Gan Team | 2506.14763 | null |
| 2025-06-17 | Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models | Yuke Zhu Team | 2506.14727 | null |
| 2025-06-17 | AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions | Dacheng Tao Team | 2506.14697 | null |
| 2025-06-17 | Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models | Jiaheng Wei Team | 2506.14674 | null |
| 2025-06-17 | StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery | Michelle Pasco Team | 2506.14670 | null |
| 2025-06-17 | SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks | Liang Lin Team | 2506.14512 | null |
| 2025-06-17 | Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation? | Soumik Sarkar Team | 2506.14507 | link |
| 2025-06-17 | Adapting Lightweight Vision Language Models for Radiological Visual Question Answering | Chang Sun Team | 2506.14451 | null |
| 2025-06-17 | Causally Steered Diffusion for Automated Video Counterfactual Generation | Sotirios A. Tsaftaris Team | 2506.14404 | null |
| 2025-06-17 | Narrate2Nav: Real-Time Visual Navigation with Implicit Language Reasoning in Human-Centric Environments | Xuesu Xiao Team | 2506.14233 | null |
| 2025-06-17 | Interpreting Biomedical VLMs on High-Imbalance Out-of-Distributions: An Insight into BiomedCLIP on Radiology | Benjamin Kwan Team | 2506.14136 | null |
| 2025-06-17 | A Hierarchical Test Platform for Vision Language Model (VLM)-Integrated Real-World Autonomous Driving | Ziran Wang Team | 2506.14100 | null |
| 2025-06-16 | Disentangling 3D from Large Vision-Language Models for Controlled Portrait Generation | Hyeongwoo Kim Team | 2506.14015 | null |
| 2025-06-16 | GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics | Mac Schwager Team | 2506.14009 | null |
| 2025-06-16 | Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography | Alejandro Santos-Díaz Team | 2506.13964 | null |
| 2025-06-16 | HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment | Abdul Bais Team | 2506.13925 | null |
| 2025-06-16 | Touch begins where vision ends: Generalizable policies for contact-rich manipulation | Raunaq Bhirangi Team | 2506.13762 | null |
| 2025-06-16 | Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins | Wei-Chiu Ma Team | 2506.13761 | null |
| 2025-06-16 | OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning | Yonghang Tai Team | 2506.13723 | null |
| 2025-06-16 | ROSA: Harnessing Robot States for Vision-Language and Action Alignment | Xiaoyan Sun Team | 2506.13679 | null |
| 2025-06-16 | DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models | Hanspeter Pfister Team | 2506.13638 | null |
| 2025-06-16 | VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation | Wei Pan Team | 2506.13428 | null |
| 2025-06-16 | Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation | Marija Popović Team | 2506.13367 | null |
| 2025-06-16 | Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling | Rei Kawakami Team | 2506.13282 | null |
| 2025-06-16 | Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments | Ee-Chien Chang Team | 2506.13205 | null |
| 2025-06-16 | Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence | Bernard Ghanem Team | 2506.13187 | null |
| 2025-06-16 | GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models | Jun Wang Team | 2506.13166 | null |
| 2025-06-16 | Rethinking Test-Time Scaling for Medical AI: Model and Task-Aware Strategies for LLMs and VLMs | Byung-Hoon Kim Team | 2506.13102 | null |
| 2025-06-16 | PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue | Siqi Liu Team | 2506.13063 | null |
| 2025-06-17 | HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs | Xuezhi Cao Team | 2506.13038 | null |
| 2025-06-15 | CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making | Zuozhu Liu Team | 2506.12849 | null |
| 2025-06-15 | Enhancing Rating-Based Reinforcement Learning to Effectively Leverage Feedback from Large Vision-Language Models | Chang D. Yoo Team | 2506.12822 | null |
| 2025-06-15 | Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models | Wentao Zhang Team | 2506.12776 | null |
| 2025-06-15 | NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models | Jitao Sang Team | 2506.12706 | null |
| 2025-06-15 | Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context | Sandeep Singhal Team | 2506.12683 | null |
| 2025-06-14 | Not All Tokens and Heads Are Equally Important: Dual-Level Attention Intervention for Hallucination Mitigation | Yuexian Zou Team | 2506.12609 | null |
| 2025-06-13 | Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale | Minsu Cho Team | 2506.12009 | null |
| 2025-06-13 | How Visual Representations Map to Language Feature Space in Multimodal LLMs | Neel Nanda Team | 2506.11976 | null |
| 2025-06-13 | Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation | Kaifu Zhang Team | 2506.11820 | null |
| 2025-06-13 | MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space | Jan Strich Team | 2506.11684 | null |
| 2025-06-13 | VLM@school -- Evaluation of AI image understanding on German middle school knowledge | Vincent Tischler Team | 2506.11604 | null |
| 2025-06-13 | EasyARC: Evaluating Vision Language Models on True Visual Reasoning | Aylin Akkus Team | 2506.11595 | null |
| 2025-06-13 | Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis | Johannes Betz Team | 2506.11526 | null |
| 2025-06-13 | Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs | Min-Yen Kan Team | 2506.11515 | null |
| 2025-06-13 | Taming Stable Diffusion for Computed Tomography Blind Super-Resolution | Lichao Mou Team | 2506.11496 | null |
| 2025-06-13 | On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving | Mert D. Pesé Team | 2506.11472 | null |
| 2025-06-12 | Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving | Liam Paull Team | 2506.11234 | null |
| 2025-06-12 | AIR: Zero-shot Generative Model Adaptation with Iterative Refinement | Ngai-Man Cheung Team | 2506.10895 | link |
| 2025-06-13 | RationalVLA: A Rational Vision-Language-Action Model with Dual System | Haoang Li Team | 2506.10826 | null |
| 2025-06-12 | Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding | Mir Feroskhan Team | 2506.10756 | null |
| 2025-06-13 | IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain | Yefeng Zheng Team | 2506.10730 | link |
| 2025-06-12 | GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning | Guan Huang Team | 2506.10639 | null |
| 2025-06-12 | Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning | Yong Liu Team | 2506.10575 | null |
| 2025-06-12 | LLMs Are Not Yet Ready for Deepfake Image Detection | Kristen Moore Team | 2506.10474 | null |
| 2025-06-12 | UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models | Shuai Lu Team | 2506.10342 | null |
| 2025-06-12 | Using Vision Language Models to Detect Students' Academic Emotion through Facial Expressions | Gaowei Chen Team | 2506.10334 | null |
| 2025-06-12 | HalLoc: Token-level Localization of Hallucinations for Vision Language Models | Gunhee Kim Team | 2506.10286 | null |
| 2025-06-11 | Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval | Francis Ferraro Team | 2506.10202 | null |
| 2025-06-11 | Improving Personalized Search with Regularized Low-Rank Parameter Updates | Bryan Russell Team | 2506.10182 | null |
| 2025-06-11 | A Navigation Framework Utilizing Vision-Language Models | Kaiyu Tang Team | 2506.10172 | null |
| 2025-06-11 | One Patient, Many Contexts: Scaling Medical AI Through Contextual Intelligence | Marinka Zitnik Team | 2506.10157 | null |
| 2025-06-11 | ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs | Lijuan Wang Team | 2506.10128 | null |
| 2025-06-11 | Test-Time Adaptation for Generalizable Task Progress Estimation | Alessandra Russo Team | 2506.10085 | null |
| 2025-06-11 | Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | Tieniu Tan Team | 2506.09965 | link |
| 2025-06-11 | From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models | Chen Feng Team | 2506.09930 | null |
| 2025-06-11 | 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation | Hyunjung Shim Team | 2506.09883 | link |
| 2025-06-11 | Adding simple structure at inference improves Vision-Language Compositionality | Gorka Azkune Team | 2506.09691 | link |
| 2025-06-11 | FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models | Liangqiong Qu Team | 2506.09638 | null |
| 2025-06-11 | Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs | Jaehyung Kim Team | 2506.09522 | link |
| 2025-06-11 | Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning | Jia Li Team | 2506.09473 | null |
| 2025-06-11 | TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision | Susmit Jha Team | 2506.09445 | null |
| 2025-06-11 | DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt | Ge Li Team | 2506.09353 | null |
| 2025-06-10 | UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation | Li Fei-Fei Team | 2506.09284 | null |
| 2025-06-10 | MultiNet: An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Harshvardhan Sikka Team | 2506.09172 | null |
| 2025-06-10 | VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning | Zhenfei Yin Team | 2506.09049 | null |
| 2025-06-11 | Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs | Yonatan Belinkov Team | 2506.09047 | null |
| 2025-06-10 | Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better | Jiaqi Wang Team | 2506.09040 | null |
| 2025-06-10 | Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models | Liansheng Wang Team | 2506.08990 | null |
| 2025-06-10 | Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions | Yejin Choi Team | 2506.08927 | null |
| 2025-06-12 | Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | Shanghang Zhang Team | 2506.08817 | null |
| 2025-06-10 | Multimodal Representation Alignment for Cross-modal Information Retrieval | Luis A. Leiva Team | 2506.08774 | null |
| 2025-06-10 | PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Xiaodan Liang Team | 2506.08708 | null |
| 2025-06-10 | VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism | Weijiang Yu Team | 2506.08691 | null |
| 2025-06-10 | ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction | Taesup Kim Team | 2506.08678 | null |
| 2025-06-10 | Convergence of Spectral Principal Paths: How Deep Networks Distill Linear Representations from Noisy Inputs | Ang Li Team | 2506.08543 | null |
| 2025-06-10 | Better Reasoning with Less Data: Enhancing VLMs Through Unified Modality Scoring | Jiaheng Wei Team | 2506.08429 | null |
| 2025-06-11 | SafeCoT: Improving VLM Safety with Minimal Reasoning | Chaochao Lu Team | 2506.08399 | null |
| 2025-06-10 | SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding | Jaeyoung Do Team | 2506.08391 | null |
| 2025-06-09 | A Good CREPE needs more than just Sugar: Investigating Biases in Compositional Vision-Language Benchmarks | Matthias Bethge Team | 2506.08227 | null |
| 2025-06-11 | GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra | Guha Balakrishnan Team | 2506.08194 | null |
| 2025-06-09 | Open World Scene Graph Generation using Vision Language Models | Anuj Karpatne Team | 2506.08189 | null |
| 2025-06-09 | CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems | Ramya Korlakai Vinayak Team | 2506.08071 | null |
| 2025-06-10 | Vision Transformers Don't Need Trained Registers | Yossi Gandelsman Team | 2506.08010 | null |
| 2025-06-09 | Hidden in plain sight: VLMs overlook their visual representations | Trevor Darrell Team | 2506.08008 | null |
| 2025-06-09 | BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | Tieniu Tan Team | 2506.07961 | null |
| 2025-06-09 | Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations | Yiqing Shen Team | 2506.07943 | null |
| 2025-06-09 | Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models | Zsolt Kira Team | 2506.07936 | null |
| 2025-06-09 | SAM2Auto: Auto Annotation Using FLASH | Q. M. Jonathan Wu Team | 2506.07850 | null |
| 2025-06-09 | Image Reconstruction as a Tool for Feature Analysis | Andrey Kuznetsov Team | 2506.07803 | null |
| 2025-06-09 | Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger | Shiming Xiang Team | 2506.07785 | null |
| 2025-06-09 | Language-Vision Planner and Executor for Text-to-Visual Reasoning | Ling Liu Team | 2506.07778 | null |
| 2025-06-10 | ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models | Shuai Lu Team | 2506.07739 | null |
| 2025-06-09 | OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting | Bastian Leibe Team | 2506.07697 | null |
| 2025-06-09 | Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline | Idan Szpektor Team | 2506.07631 | null |
| 2025-06-09 | Event-Priori-Based Vision-Language Model for Efficient Visual Understanding | Michele Magno Team | 2506.07627 | null |
| 2025-06-10 | SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems | Zhengzhong Tu Team | 2506.07564 | null |
| 2025-06-10 | GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition | Conghui He Team | 2506.07553 | null |
| 2025-06-09 | Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent | Ting Yang Ling Team | 2506.07509 | null |
| 2025-06-09 | Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency | Xinggang Wang Team | 2506.07497 | null |
| 2025-06-09 | CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization | Hyun Myung Team | 2506.07484 | null |
| 2025-06-09 | LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments | Josh Park Team | 2506.07416 | null |
| 2025-06-09 | MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems | Tao Qi Team | 2506.07399 | null |
| 2025-06-06 | CoMemo: LVLMs Need Image Context with Image Memory | Jifeng Dai Team | 2506.06279 | null |
| 2025-06-06 | Movie Facts and Fibs (MF | André F. T. Martins Team | 2506.06275 | null |
| 2025-06-06 | Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study | Lena Maier-Hein Team | 2506.06232 | null |
| 2025-06-06 | GenIR: Generative Visual Feedback for Mental Image Retrieval | James Davis Team | 2506.06220 | null |
| 2025-06-06 | STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Horst Possegger Team | 2506.06218 | null |
| 2025-06-06 | WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management | Zijian Wang Team | 2506.06084 | null |
| 2025-06-06 | Full Conformal Adaptation of Medical Vision-Language Models | Jose Dolz Team | 2506.06076 | null |
| 2025-06-06 | BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning | Rudolf Lioutikov Team | 2506.06072 | null |
| 2025-06-06 | MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks | Yiren Song Team | 2506.05982 | null |
| 2025-06-06 | HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios | Weihao Gu Team | 2506.05883 | null |
| 2025-06-06 | Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions? | Hitomi Yanaka Team | 2506.05765 | null |
| 2025-06-06 | MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory | João Magalhães Team | 2506.05696 | null |
| 2025-06-06 | DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models | Xianpeng Lang Team | 2506.05667 | null |
| 2025-06-05 | MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning | Furong Huang Team | 2506.05523 | null |
| 2025-06-05 | Degradation-Aware Image Enhancement via Vision-Language Classification | Zibo Meng Team | 2506.05450 | null |
| 2025-06-06 | Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Xiaodan Liang Team | 2506.05318 | null |
| 2025-06-05 | MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm | Xiang Bai Team | 2506.05218 | null |
| 2025-06-05 | Quantifying Cross-Modality Memorization in Vision-Language Models | Chiyuan Zhang Team | 2506.05198 | null |
| 2025-06-05 | CIVET: Systematic Evaluation of Understanding in VLMs | Giuseppe Riccardi Team | 2506.05146 | null |
| 2025-06-05 | PixCell: A generative foundation model for digital histopathology images | Dimitris Samaras Team | 2506.05127 | null |
| 2025-06-05 | A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions | Dung Nguyen Team | 2506.05061 | null |
| 2025-06-05 | Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System | Moju Zhao Team | 2506.05020 | null |
| 2025-06-05 | ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT | Mikołaj Koszowski Team | 2506.04929 | null |
| 2025-06-05 | SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs | Dacheng Tao Team | 2506.04743 | null |
| 2025-06-05 | Robust Few-Shot Vision-Language Model Adaptation | Shu Kong Team | 2506.04713 | null |
| 2025-06-05 | HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Sung Ju Hwang Team | 2506.04704 | null |
| 2025-06-05 | SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents | Yu-Wing Tai Team | 2506.04606 | null |
| 2025-06-05 | MuSciClaims: Multimodal Scientific Claim Verification | Niranjan Balasubramanian Team | 2506.04585 | null |
| 2025-06-05 | Handle-based Mesh Deformation Guided By Vision Language Model | Aniket Bera Team | 2506.04562 | null |
| 2025-06-04 | RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | Shanghang Zhang Team | 2506.04308 | null |
| 2025-06-04 | Image Editing As Programs with Diffusion Models | Xinchao Wang Team | 2506.04158 | null |
| 2025-06-04 | Recent Advances in Medical Image Classification | Ngoc Quoc Ly Team | 2506.04129 | null |
| 2025-06-04 | LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward | Jing Li Team | 2506.04070 | null |
| 2025-06-04 | Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization | Min Zhang Team | 2506.04039 | null |
| 2025-06-04 | Vocabulary-free few-shot learning for Vision-Language Models | Christophe De Vleeschouwer Team | 2506.04005 | null |
| 2025-06-04 | DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models | Anders Holst Team | 2506.03933 | null |
| 2025-06-04 | Zero-Shot Temporal Interaction Localization for Egocentric Videos | Hesheng Wang Team | 2506.03662 | null |
| 2025-06-04 | Spatial Understanding from Videos: Structured Prompts Meet Simulation Data | Liqiang Nie Team | 2506.03642 | null |
| 2025-06-04 | VLMs Can Aggregate Scattered Training Patches | Chaochao Lu Team | 2506.03614 | null |
| 2025-06-04 | BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance | Ngan Le Team | 2506.03589 | null |
| 2025-06-04 | MiMo-VL Technical Report | Bingquan Xia Team | 2506.03569 | null |
| 2025-06-04 | Target Semantics Clustering via Text Representations for Robust Universal Domain Adaptation | Yixin Zhang Team | 2506.03521 | null |
| 2025-06-04 | DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models | Aliaksandr Siarohin Team | 2506.03517 | null |
| 2025-06-04 | POLARIS: A High-contrast Polarimetric Imaging Benchmark Dataset for Exoplanetary Disk Representation Learning | Weixin Yao Team | 2506.03511 | link |
| 2025-06-03 | Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views | Hansaem Kim Team | 2506.03371 | null |
| 2025-06-03 | Robustness in Both Domains: CLIP Needs a Robust Text Encoder | Volkan Cevher Team | 2506.03355 | null |
| 2025-06-03 | Grounded Vision-Language Interpreter for Integrated Task and Motion Planning | Atsushi Hashimoto Team | 2506.03270 | null |
| 2025-06-03 | OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | Li Yi Team | 2506.03135 | null |
| 2025-06-03 | EgoVLM: Policy Optimization for Egocentric Video Understanding | Linshen Liu Team | 2506.03097 | null |
| 2025-06-03 | DPO Learning with LLMs-Judge Signal for Computer Use Agents | Phillip Howard Team | 2506.03095 | null |
| 2025-06-03 | From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit | Demba Ba Team | 2506.03093 | null |
| 2025-06-03 | Text-guided Generation of Efficient Personalized Inspection Plans | Aniket Bera Team | 2506.02917 | null |
| 2025-06-04 | FlySearch: Exploring how vision-language models explore | Maciej Wołczyk Team | 2506.02896 | null |
| 2025-06-03 | Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights | Tony Wu Team | 2506.02865 | null |
| 2025-06-03 | SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking | Yiwei Wang Team | 2506.02803 | null |
| 2025-06-04 | Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning | Arash Afkanpour Team | 2506.02738 | null |
| 2025-06-03 | Iterative Self-Improvement of Vision Language Models for Image Scoring and Self-Explanation | Toshihiko Yamasaki Team | 2506.02708 | null |
| 2025-06-03 | Small Aid, Big Leap: Efficient Test-Time Adaptation for Vision-Language Models with AdaptNet | Zhi Wang Team | 2506.02671 | null |
| 2025-06-03 | Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models | Dong Seog Han Team | 2506.02615 | null |
| 2025-06-03 | Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models | Farzan Farnia Team | 2506.02557 | null |
| 2025-06-03 | Sign Language: Towards Sign Understanding for Robot Autonomy | David Hsu Team | 2506.02556 | null |
| 2025-06-03 | SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence | Yueming Jin Team | 2506.02555 | null |
| 2025-06-03 | Rethinking Post-Unlearning Behavior of Large Vision-Language Models | Kyomin Jung Team | 2506.02541 | null |
| 2025-06-04 | MemoryOut: Learning Principal Features via Multimodal Sparse Filtering Network for Semi-supervised Video Anomaly Detection | Qingyao Wu Team | 2506.02535 | null |
| 2025-06-03 | VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments | Yu Wang Team | 2506.02387 | null |
| 2025-06-03 | Auto-Labeling Data for Object Detection | Jason J. Corso Team | 2506.02359 | null |
| 2025-06-03 | RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models | Jianzong Wang Team | 2506.02354 | null |
| 2025-05-30 | ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL | Lili Qiu Team | 2505.24875 | null |
| 2025-05-30 | ProxyThinker: Test-Time Guidance through Small Visual Reasoners | Vicente Ordonez Team | 2505.24872 | null |
| 2025-05-30 | GenSpace: Benchmarking Spatially-Aware Image Generation | Zhou Zhao Team | 2505.24870 | null |
| 2025-05-30 | Time Blindness: Why Video-Language Models Can't See What Humans Can? | Mohamed Elhoseiny Team | 2505.24867 | null |
| 2025-05-30 | Conformal Prediction for Zero-Shot Models | Jose Dolz Team | 2505.24693 | null |
| 2025-05-30 | BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models | Khoa Luu Team | 2505.24649 | null |
| 2025-05-30 | SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition | Wadii Boulila Team | 2505.24600 | null |
| 2025-05-30 | AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders | Liang Ding Team | 2505.24519 | null |
| 2025-05-30 | CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation | Thamar Solorio Team | 2505.24456 | null |
| 2025-05-30 | Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning | Matthias Hein Team | 2505.24424 | null |
| 2025-05-30 | MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs | Sophia Ananiadou Team | 2505.24423 | null |
| 2025-05-30 | Grid-LOGAT: Grid Based Local and Global Area Transcription for Video Question Answering | Fadoua Ghourabi Team | 2505.24371 | null |
| 2025-05-30 | KEVER^2: Knowledge-Enhanced Visual Emotion Reasoning and Retrieval | Yong Li Team | 2505.24342 | null |
| 2025-05-30 | ROAD: Responsibility-Oriented Reward Design for Reinforcement Learning in Autonomous Driving | Songan Zhang Team | 2505.24317 | null |
| 2025-05-30 | Benchmarking Foundation Models for Zero-Shot Biometric Tasks | Arun Ross Team | 2505.24214 | null |
| 2025-05-30 | Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap | Baharan Mirzasoleiman Team | 2505.24208 | null |
| 2025-05-30 | DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? | Xuegong Zhang Team | 2505.24173 | null |
| 2025-05-30 | CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs | Xuchen Song Team | 2505.24120 | null |
| 2025-05-29 | mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation | Zhengzhong Tu Team | 2505.24073 | null |
| 2025-05-29 | Multi-RAG: A Multimodal Retrieval-Augmented Generation System for Adaptive Video Understanding | Tinoosh Mohsenin Team | 2505.23990 | null |
| 2025-05-29 | ZeroGUI: Automating Online GUI Learning at Zero Human Cost | Jifeng Dai Team | 2505.23762 | link |
| 2025-05-29 | Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint | David M. Chan Team | 2505.23759 | link |
| 2025-05-29 | To Trust Or Not To Trust Your Vision-Language Model's Prediction | Olga Fink Team | 2505.23745 | link |
| 2025-05-29 | LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization | Jing Liao Team | 2505.23740 | null |
| 2025-05-29 | Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better | Sergey Levine Team | 2505.23705 | null |
| 2025-05-29 | Grounded Reinforcement Learning for Visual Reasoning | Katerina Fragkiadaki Team | 2505.23678 | null |
| 2025-05-29 | Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition | Liangcai Gao Team | 2505.23566 | null |
| 2025-05-30 | Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information | Weiping Li Team | 2505.23558 | link |
| 2025-05-29 | TRAP: Targeted Redirecting of Agentic Preferences | Gagandeep Singh Team | 2505.23518 | null |
| 2025-05-29 | VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation | Xu-Cheng Yin Team | 2505.23484 | link |
| 2025-05-29 | Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model | Muzammil Behzad Team | 2505.23358 | null |
| 2025-05-29 | LADA: Scalable Label-Specific CLIP Adapter for Continual Learning | Min-Ling Zhang Team | 2505.23271 | link |
| 2025-05-29 | VLM-RRT: Vision Language Model Guided RRT Search for Autonomous UAV Navigation | Panayiotis Kolios Team | 2505.23267 | null |
| 2025-05-29 | Disrupting Vision-Language Model-Driven Navigation Services via Adversarial Object Fusion | Tao Xiang Team | 2505.23266 | null |
| 2025-05-29 | ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering | Lei Wang Team | 2505.23242 | null |
| 2025-05-29 | PhotoArtAgent: Intelligent Photo Retouching with Language Model-Based Artist Agents | Jinjin Gu Team | 2505.23130 | null |
| 2025-05-29 | Are Unified Vision-Language Models Necessary: Generalization Across Understanding and Generation | Yu Cheng Team | 2505.23043 | link |
| 2025-05-29 | An Empirical Study of Federated Prompt Learning for Vision Language Model | Mang Ye Team | 2505.23024 | null |
| 2025-05-29 | SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model | Zhenwei Shi Team | 2505.23010 | null |
| 2025-05-29 | QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining | Muhao Chen Team | 2505.23004 | link |
| 2025-05-28 | Zero-Shot Vision Encoder Grafting via LLM Surrogates | Tom Goldstein Team | 2505.22664 | link |
| 2025-05-28 | Training Free Stylized Abstraction | Vishal M. Patel Team | 2505.22663 | null |
| 2025-05-28 | VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models | Dong Yu Team | 2505.22654 | null |
| 2025-05-28 | Sherlock: Self-Correcting Reasoning in Vision-Language Models | Ruqi Zhang Team | 2505.22651 | null |
| 2025-05-28 | Hypothesis Testing in Imaging Inverse Problems | Marcelo Pereyra Team | 2505.22481 | null |
| 2025-05-28 | Zero-Shot 3D Visual Grounding from Vision-Language Models | Junwei Liang Team | 2505.22429 | null |
| 2025-05-28 | IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth | Syed Masum Billah Team | 2505.22305 | null |
| 2025-05-28 | Investigating Mechanisms for In-Context Vision Language Binding | Vineet Gandhi Team | 2505.22200 | null |
| 2025-05-29 | Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging | Piji Li Team | 2505.22150 | null |
| 2025-05-28 | 3D Question Answering via only 2D Vision-Language Models | Qianru Sun Team | 2505.22143 | null |
| 2025-05-28 | Reinforced Reasoning for Embodied Planning | Bo Jin Team | 2505.22050 | null |
| 2025-05-28 | Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization | Xinlei Chen Team | 2505.22038 | null |
| 2025-05-28 | Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset | Muhammad Abdul-Mageed Team | 2505.21979 | null |
| 2025-05-29 | DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation | Xin Tan Team | 2505.21969 | null |
| 2025-05-28 | Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack | Usman Naseem Team | 2505.21967 | null |
| 2025-05-28 | Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs | Byonghyo Shim Team | 2505.21955 | null |
| 2025-05-28 | Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Yi Xu Team | 2505.21906 | null |
| 2025-05-28 | Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation | Christian Desrosiers Team | 2505.21844 | null |
| 2025-05-27 | MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning | Vivek Gupta Team | 2505.21771 | null |
| 2025-05-27 | MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis | Christian Wachinger Team | 2505.21698 | null |
| 2025-05-27 | ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models | Yueting Zhuang Team | 2505.21500 | null |
| 2025-05-27 | AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery | Qing Wang Team | 2505.21499 | null |
| 2025-05-27 | Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration | Ziwei Zhu Team | 2505.21472 | null |
| 2025-05-27 | ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models | Wentao Zhang Team | 2505.21465 | null |
| 2025-05-27 | LazyVLM: Neuro-Symbolic Approach to Video Analytics | M. Tamer Özsu Team | 2505.21459 | null |
| 2025-05-27 | DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models | Soumik Sarkar Team | 2505.21382 | null |
| 2025-05-27 | XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration | Min Zhang Team | 2505.21279 | null |
| 2025-05-27 | Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation | Yutao Yue Team | 2505.21106 | null |
| 2025-05-27 | DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response | Naoto Yokoya Team | 2505.21089 | null |
| 2025-05-27 | LPOI: Listwise Preference Optimization for Vision Language Models | Gunhee Kim Team | 2505.21061 | null |
| 2025-05-27 | RefAV: Towards Planning-Centric Scenario Mining | Neehar Peri Team | 2505.20981 | null |
| 2025-05-27 | On VLMs for Diverse Tasks in Multimodal Meme Classification | Jasabanta Patro Team | 2505.20937 | null |
| 2025-05-27 | A Stereotype Content Analysis on Color-related Social Bias in Large Vision Language Models | Bugeun Kim Team | 2505.20901 | null |
| 2025-05-27 | AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding | Joon Son Chung Team | 2505.20862 | null |
| 2025-05-27 | Rendering-Aware Reinforcement Learning for Vector Graphics Generation | Marco Pedersoli Team | 2505.20793 | null |
| 2025-05-27 | FM-Planner: Foundation Model Guided Path Planning for Autonomous Drone Navigation | Mir Feroskhan Team | 2505.20783 | null |
| 2025-05-27 | Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models | Yao Yang Team | 2505.20728 | null |
| 2025-05-27 | ManiTaskGen: A Comprehensive Task Generator for Benchmarking and Improving Vision-Language Agents on Embodied Decision-Making | Hao Su Team | 2505.20726 | null |
| 2025-05-27 | Automating eHMI Action Design with LLMs for Automated Vehicle Communication | Takeo Igarashi Team | 2505.20711 | null |
| 2025-05-27 | GIFARC: Synthetic Dataset for Leveraging Human-Intuitive Analogies to Elevate AI Reasoning | Sundong Kim Team | 2505.20672 | null |
| 2025-05-26 | Seeing is Believing, but How Much? A Comprehensive Analysis of Verbalized Calibration in Vision-Language Models | Naoto Yokoya Team | 2505.20236 | null |
| 2025-05-26 | Agentic 3D Scene Generation with Spatially Contextualized VLMs | Chi-Keung Tang Team | 2505.20129 | null |
| 2025-05-26 | MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models | James M. Rehg Team | 2505.20122 | null |
| 2025-05-27 | EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition | Sören Auer Team | 2505.20033 | null |
| 2025-05-26 | ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers | Elmar Rückert Team | 2505.20032 | null |
| 2025-05-26 | Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models | Ernest K. Ryu Team | 2505.20021 | null |
| 2025-05-26 | Can Visual Encoder Learn to See Arrows? | Hiroaki Ozaki Team | 2505.19944 | null |
| 2025-05-26 | Attention! You Vision Language Model Could Be Maliciously Manipulated | Shudong Zhang Team | 2505.19911 | null |
| 2025-05-26 | Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement | Muzammil Behzad Team | 2505.19895 | null |
| 2025-05-26 | One Surrogate to Fool Them All: Universal, Transferable, and Targeted Adversarial Attacks with CLIP | Kehuan Zhang Team | 2505.19840 | null |
| 2025-05-26 | TeViR: Text-to-Video Reward with Diffusion Models for Efficient Reinforcement Learning | Dongbin Zhao Team | 2505.19769 | null |
| 2025-05-26 | Modeling Beyond MOS: Quality Assessment Models Must Integrate Context, Reasoning, and Multimodality | Alessandro Bruno Team | 2505.19696 | null |
| 2025-05-26 | Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs | Shu-Tao Xia Team | 2505.19678 | null |
| 2025-05-26 | JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models | Yingchun Wang Team | 2505.19610 | null |
| 2025-05-26 | What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation | Rongrong Ji Team | 2505.19569 | null |
| 2025-05-26 | FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models | Ruixuan Li Team | 2505.19536 | null |
| 2025-05-26 | Locality-Aware Zero-Shot Human-Object Interaction Detection | Minsu Cho Team | 2505.19503 | null |
| 2025-05-26 | Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models | Guoliang Kang Team | 2505.19498 | null |
| 2025-05-26 | Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model | Yu Cheng Team | 2505.19406 | null |
| 2025-05-27 | DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving | Hao Zhao Team | 2505.19381 | null |
| 2025-05-26 | DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models | Fatemeh Afghah Team | 2505.19373 | null |
| 2025-05-23 | VideoGameBench: Can Vision-Language Models complete popular video games? | Ofir Press Team | 2505.18134 | null |
| 2025-05-23 | One RL to See Them All: Visual Triple Unified Reinforcement Learning | Junjie Yan Team | 2505.18129 | null |
| 2025-05-23 | CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays | Edward Choi Team | 2505.18087 | null |
| 2025-05-23 | FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation | Shibiao Xu Team | 2505.18053 | null |
| 2025-05-23 | Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation | Bogdan Sorin Coseriu Team | 2505.18039 | null |
| 2025-05-23 | Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling | Mun Yong Yi Team | 2505.17982 | null |
| 2025-05-23 | VLM Models and Automated Grading of Atopic Dermatitis | Hamed Ghodrati Team | 2505.17835 | null |
| 2025-05-23 | Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations | Chao Shen Team | 2505.17812 | null |
| 2025-05-23 | U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding | Hongcheng Guo Team | 2505.17779 | null |
| 2025-05-23 | SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain | Yu Li Team | 2505.17727 | null |
| 2025-05-23 | Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek | Xiangdong Zhou Team | 2505.17702 | null |
| 2025-05-23 | Towards General Continuous Memory for Vision-Language Models | Biwei Huang Team | 2505.17670 | null |
| 2025-05-23 | EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications | Min Yang Team | 2505.17654 | null |
| 2025-05-23 | HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning | Jianfei Yang Team | 2505.17645 | null |
| 2025-05-23 | Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports | Takahiro Omi Team | 2505.17625 | null |
| 2025-05-23 | CAS-IQA: Teaching Vision-Language Models for Synthetic Angiography Quality Assessment | Zeng-Guang Hou Team | 2505.17619 | null |
| 2025-05-23 | Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving | Wangmeng Zuo Team | 2505.17609 | null |
| 2025-05-23 | A Unified Multi-Scale Attention-Based Network for Automatic 3D Segmentation of Lung Parenchyma & Nodules In Thoracic CT Images | Furqan Shaukat Team | 2505.17602 | null |
| 2025-05-23 | Multimodal Conversation Structure Understanding | David Bamman Team | 2505.17536 | null |
| 2025-05-23 | Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding | Sungzoon Cho Team | 2505.17529 | null |
| 2025-05-22 | Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models | Mike Zheng Shou Team | 2505.16854 | link |
| 2025-05-23 | LaViDa: A Large Diffusion Language Model for Multimodal Understanding | Aditya Grover Team | 2505.16839 | link |
| 2025-05-22 | From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization | Huaxiu Yao Team | 2505.16832 | link |
| 2025-05-22 | Perceptual Quality Assessment for Embodied AI | Guangtao Zhai Team | 2505.16815 | link |
| 2025-05-22 | SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving | Hongsheng Li Team | 2505.16805 | null |
| 2025-05-22 | REOBench: Benchmarking Robustness of Earth Observation Foundation Models | Tianjin Huang Team | 2505.16793 | link |
| 2025-05-22 | Single Domain Generalization for Few-Shot Counting via Universal Representation Matching | Xinghao Chen Team | 2505.16778 | link |
| 2025-05-22 | IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models | AiTi Aw Team | 2505.16774 | link |
| 2025-05-22 | Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation | Jianbing Shen Team | 2505.16763 | null |
| 2025-05-22 | SD-MAD: Sign-Driven Few-shot Multi-Anomaly Detection in Medical Images | Mahsa Baktashmotlagh Team | 2505.16659 | null |
| 2025-05-22 | Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models | Pål Halvorsen Team | 2505.16647 | null |
| 2025-05-22 | MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation | Zongqing Lu Team | 2505.16602 | null |
| 2025-05-22 | ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models | Xiuying Chen Team | 2505.16517 | null |
| 2025-05-22 | Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models | Yaochu Jin Team | 2505.16446 | null |
| 2025-05-22 | Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models | Kai Han Team | 2505.16416 | link |
| 2025-05-22 | Mitigating Hallucinations in Vision-Language Models through Image-Guided Head Suppression | Souvik Kundu Team | 2505.16411 | link |
| 2025-05-22 | VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving | Samuel Labi Team | 2505.16377 | null |
| 2025-05-22 | MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing | Xinhan Di Team | 2505.16279 | null |
| 2025-05-22 | When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification | Jiaheng Wei Team | 2505.16149 | null |
| 2025-05-22 | Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation | Junfeng Fang Team | 2505.16146 | null |
| 2025-05-21 | InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition | Xue Yang Team | 2505.15818 | null |
| 2025-05-21 | From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems | Soujanya Poria Team | 2505.15685 | null |
| 2025-05-21 | FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models | Qian Wang Team | 2505.15644 | null |
| 2025-05-21 | Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models | Ya Wang Team | 2505.15576 | link |
| 2025-05-21 | TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving | Abdallah Shami Team | 2505.15564 | null |
| 2025-05-21 | Clapper: Compact Learning and Video Representation in VLMs | Fuzheng Zhang Team | 2505.15529 | null |
| 2025-05-21 | Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | Ken Goldberg Team | 2505.15517 | null |
| 2025-05-21 | Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought | Libo Qin Team | 2505.15510 | null |
| 2025-05-21 | Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts | Soma Biswas Team | 2505.15506 | link |
| 2025-05-21 | Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification | Irwin King Team | 2505.15504 | null |
| 2025-05-21 | Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models | Bryan Hooi Team | 2505.15489 | null |
| 2025-05-21 | Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL | Qing Li Team | 2505.15436 | null |
| 2025-05-21 | TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models | Keze Wang Team | 2505.15435 | null |
| 2025-05-21 | On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable? | Mohammad Yaqub Team | 2505.15425 | null |
| 2025-05-21 | Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study | Hwanjo Yu Team | 2505.15389 | null |
| 2025-05-21 | RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation | Farshad Khorrami Team | 2505.15373 | null |
| 2025-05-21 | Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition | Youngsook Song Team | 2505.15367 | null |
| 2025-05-21 | AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving | Diange Yang Team | 2505.15298 | null |
| 2025-05-21 | Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs | Zibin Zheng Team | 2505.15265 | null |
| 2025-05-21 | Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation | Kyomin Jung Team | 2505.15249 | null |
| 2025-05-20 | UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens | Wentao Zhang Team | 2505.14671 | null |
| 2025-05-20 | CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation | Faez Ahmed Team | 2505.14646 | null |
| 2025-05-20 | Debating for Better Reasoning: An Unsupervised Multimodal Approach | Mirella Lapata Team | 2505.14627 | null |
| 2025-05-21 | PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models | Wenjia Zhang Team | 2505.14481 | null |
| 2025-05-20 | RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding | Serge Belongie Team | 2505.14462 | link |
| 2025-05-20 | SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation | Masafumi Oyamada Team | 2505.14381 | null |
| 2025-05-20 | Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds | Agnieszka Wykowska Team | 2505.14366 | null |
| 2025-05-20 | DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning | Xing Yu Team | 2505.14362 | link |
| 2025-05-20 | Vision-Language Modeling Meets Remote Sensing: Models, Datasets and Perspectives | Gui-Song Xia Team | 2505.14361 | null |
| 2025-05-20 | Plane Geometry Problem Solving with Multi-modal Reasoning: A Survey | Dongwoo Kim Team | 2505.14340 | null |
| 2025-05-20 | Aligning Attention Distribution to Information Flow for Hallucination Mitigation in Large Vision-Language Models | Chong Feng Team | 2505.14257 | null |
| 2025-05-20 | Visual Agentic Reinforcement Fine-Tuning | Jiaqi Wang Team | 2505.14246 | link |
| 2025-05-20 | VoQA: Visual-only Question Answering | Lei Huang Team | 2505.14227 | null |
| 2025-05-20 | Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models | Matthew Purver Team | 2505.14160 | null |
| 2025-05-20 | Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent | Xuming Hu Team | 2505.14141 | null |
| 2025-05-20 | NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in Brain MRI | Benedikt Wiestler Team | 2505.14064 | null |
| 2025-05-20 | ShieldVLM: Safeguarding the Multimodal Implicit Toxicity via Deliberative Reasoning with LVLMs | Minlie Huang Team | 2505.14035 | null |
| 2025-05-20 | Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models | Yalin Wang Team | 2505.13973 | null |
| 2025-05-20 | APEX: Empowering LLMs with Physics-Based Task Planning for Real-time Insight | Ambuj Singh Team | 2505.13921 | link |
| 2025-05-20 | InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning | Jingkuan Song Team | 2505.13888 | null |
| 2025-05-19 | ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models | Greg Durrett Team | 2505.13444 | null |
| 2025-05-19 | G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning | Baobao Chang Team | 2505.13426 | link |
| 2025-05-19 | Seeing, Saying, Solving: An LLM-to-TL Framework for Cooperative Robots | Shreyas Kousik Team | 2505.13376 | null |
| 2025-05-20 | Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning | Lan-Zhe Guo Team | 2505.13317 | null |
| 2025-05-19 | I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models | R. Maria del Rio-Chanona Team | 2505.13302 | link |
| 2025-05-19 | Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts | Sashank Varma Team | 2505.13281 | null |
| 2025-05-19 | From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection | Jian Liang Team | 2505.13233 | link |
| 2025-05-19 | ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models | Pekka Marttinen Team | 2505.13180 | link |
| 2025-05-19 | Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model | Dong Yu Team | 2505.13062 | null |
| 2025-05-20 | 3D Visual Illusion Depth Estimation | Yunde Jia Team | 2505.13061 | link |
| 2025-05-19 | MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO | Ying Shan Team | 2505.13031 | link |
| 2025-05-19 | Uniformity First: Uniformity-aware Test-time Adaptation of Vision-language Models against Image Corruption | Tomoki Hamagami Team | 2505.12912 | link |
| 2025-05-19 | TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks | Jin Dong Team | 2505.12884 | null |
| 2025-05-19 | FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models | Renxin Zhong Team | 2505.12835 | null |
| 2025-05-19 | VLC Fusion: Vision-Language Conditioned Sensor Fusion for Robust Object Detection | Ransalu Senanayake Team | 2505.12715 | null |
| 2025-05-19 | TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning | Soodeh Nikan Team | 2505.12670 | null |
| 2025-05-19 | Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps | Miguel P. Eckstein Team | 2505.12660 | null |
| 2025-05-19 | AutoMat: Enabling Automated Crystal Structure Reconstruction from Microscopy via Agentic Tool Use | Fei Wei Team | 2505.12650 | link |
| 2025-05-19 | Use as Many Surrogates as You Want: Selective Ensemble Attack to Unleash Transferability without Sacrificing Resource Efficiency | Zhengyu Zhao Team | 2505.12644 | null |
| 2025-05-19 | Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents | Honglak Lee Team | 2505.12632 | null |
| 2025-05-16 | Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner | Hong Bu Team | 2505.11404 | null |
| 2025-05-16 | Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild | Guillaume Sartoretti Team | 2505.11350 | null |
| 2025-05-16 | Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models | Joyce Chai Team | 2505.11326 | null |
| 2025-05-16 | Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation | Chang D. Yoo Team | 2505.11221 | null |
| 2025-05-16 | Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing | Begüm Demir Team | 2505.11121 | null |
| 2025-05-16 | CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs | Natalia Díaz-Rodríguez Team | 2505.11060 | null |
| 2025-05-16 | Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere | Prashant Singh Team | 2505.11029 | null |
| 2025-05-16 | On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating | Alessandro Rinaldo Team | 2505.10860 | null |
| 2025-05-16 | Benchmarking performance, explainability, and evaluation strategies of vision-language models for surgery: Challenges and opportunities | Shan Lin Team | 2505.10764 | null |
| 2025-05-15 | GeoGrid-Bench: Can Foundation Models Understand Multimodal Gridded Geo-Spatial Data? | Tanwi Mallick Team | 2505.10714 | null |
| 2025-05-15 | MOSAIC: A Multi-View 2.5D Organ Slice Selector with Cross-Attentional Reasoning for Anatomically-Aware CT Localization in Medical Organ Segmentation | Muzammil Behzad Team | 2505.10672 | null |
| 2025-05-15 | CLIP Embeddings for AI-Generated Image Detection: A Few-Shot Study with Lightweight Classifier | Ziyang Ou Team | 2505.10664 | null |
| 2025-05-15 | Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding | Chong Feng Team | 2505.10634 | null |
| 2025-05-15 | MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly | Mark Steedman Team | 2505.10610 | null |
| 2025-05-18 | MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models | Vithursan Thangarasa Team | 2505.10526 | null |
| 2025-05-16 | AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges | Manoj Karkee Team | 2505.10468 | null |
| 2025-05-15 | Vision language models have difficulty recognizing virtual objects | J. G. Trafton Team | 2505.10453 | null |
| 2025-05-15 | MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models | Xiaodong Gu Team | 2505.10088 | link |
| 2025-05-15 | AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection | Chengjie Wang Team | 2505.09926 | link |
| 2025-05-14 | Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling | Nikolaus Correll Team | 2505.09731 | null |
| 2025-05-14 | ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | Daniel Seita Team | 2505.09698 | null |
| 2025-05-14 | LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models | Yanan Sun Team | 2505.09659 | link |
| 2025-05-14 | Variational Visual Question Answering | Marcus Rohrbach Team | 2505.09591 | null |
| 2025-05-14 | VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation | Shuo Wang Team | 2505.09577 | null |
| 2025-05-14 | Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput | Lin Ma Team | 2505.09498 | null |
| 2025-05-14 | Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition | Muzammil Behzad Team | 2505.09336 | null |
| 2025-05-14 | MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning | Bin-Bin Gao Team | 2505.09265 | null |
| 2025-05-14 | Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models | Ross Greer Team | 2505.09139 | null |
| 2025-05-14 | Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning | Qing Li Team | 2505.09118 | null |
| 2025-05-14 | OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions | Hao Zhou Team | 2505.09092 | link |
| 2025-05-13 | Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training | Heng Ji Team | 2505.08971 | link |
| 2025-05-15 | Behind Maya: Building a Multilingual Vision Language Model | Alham Fikri Aji Team | 2505.08910 | link |
| 2025-05-12 | Position: Restructuring of Categories and Implementation of Guidelines Essential for VLM Adoption in Healthcare | Imon Banerjee Team | 2505.08818 | null |
| 2025-05-13 | Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Xiang Bai Team | 2505.08725 | link |
| 2025-05-13 | OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning | Yu Cheng Team | 2505.08617 | link |
| 2025-05-13 | From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Jianye Hao Team | 2505.08548 | null |
| 2025-05-13 | Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? | Jimmy Huang Team | 2505.08468 | link |
| 2025-05-13 | MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos | Wei Zhang Team | 2505.08367 | null |
| 2025-05-13 | Removing Watermarks with Partial Regeneration using Semantic Information | Michael W. Mahoney Team | 2505.08234 | link |
| 2025-05-13 | CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding | Shuo Wang Team | 2505.08194 | null |
| 2025-05-13 | DSADF: Thinking Fast and Slow for Decision Making | Shufei Zhang Team | 2505.08189 | null |
| 2025-05-12 | Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models | Jia-Bin Huang Team | 2505.07815 | null |
| 2025-05-12 | Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction | Andrew Yates Team | 2505.07730 | null |
| 2025-05-12 | Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images | Vasily Konovalov Team | 2505.07704 | null |
| 2025-05-12 | Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models | Yihong Gong Team | 2505.07690 | null |
| 2025-05-12 | Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head | Sung Ju Hwang Team | 2505.07675 | null |
| 2025-05-12 | Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning | Hanwang Zhang Team | 2505.07538 | null |
| 2025-05-12 | AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography | Xiaomeng Li Team | 2505.07347 | null |
| 2025-05-12 | Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning | Yahui Zhou Team | 2505.07263 | null |
| 2025-05-12 | Incomplete In-context Learning | Yangshijie Zhang Team | 2505.07251 | null |
| 2025-05-12 | UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning | Dzmitry Tsetserukou Team | 2505.07236 | null |
| 2025-05-12 | Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection | Ningjiang Chen Team | 2505.07219 | link |
| 2025-05-12 | Internet of Agents: Fundamentals, Applications, and Challenges | Dusit Niyato Team | 2505.07176 | null |
| 2025-05-12 | Critique Before Thinking: Mitigating Hallucination through Rationale-Augmented Instruction Tuning | Weiping Wang Team | 2505.07172 | null |
| 2025-05-12 | EmoVLM-KD: Fusing Distilled Expertise with Vision-Language Models for Visual Emotion Analysis | Eunil Park Team | 2505.07164 | null |
| 2025-05-11 | A Vision-Language Foundation Model for Leaf Disease Identification | Luyl-Da Quach Team | 2505.07019 | null |
| 2025-05-11 | Hallucination-Aware Multimodal Benchmark for Gastrointestinal Image Analysis with Large Vision-Language Models | Binod Bhattarai Team | 2505.07001 | null |
| 2025-05-11 | UniDiffGrasp: A Unified Framework Integrating VLM Reasoning and VLM-Guided Part Diffusion for Open-Vocabulary Constrained Grasping with Dual Arms | Zhenze Liu Team | 2505.06832 | null |
| 2025-05-10 | STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation | Jean Oh Team | 2505.06729 | null |
| 2025-05-10 | METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection | Shuo Yang Team | 2505.06663 | link |
| 2025-05-10 | Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation | Nancy F. Chen Team | 2505.06594 | null |
| 2025-05-09 | MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks | Bo Yan Team | 2505.06152 | link |
| 2025-05-09 | Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI | Dominik Bollmann Team | 2505.05895 | null |
| 2025-05-09 | Describe Anything in Medical Images | Min Xu Team | 2505.05804 | null |
| 2025-05-09 | 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks | Farshad Khorrami Team | 2505.05800 | null |
| 2025-05-08 | Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos | Nina S. T. Hirata Team | 2505.05681 | null |
| 2025-05-08 | X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP | James Bailey Team | 2505.05528 | link |
| 2025-05-08 | Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging | Junxian He Team | 2505.05464 | link |
| 2025-05-08 | SITE: towards Spatial Intelligence Thorough Evaluation | Boqing Gong Team | 2505.05456 | null |
| 2025-05-08 | DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning | Jun Ma Team | 2505.05360 | null |
| 2025-05-08 | Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization | Joon Son Chung Team | 2505.05343 | link |
| 2025-05-08 | Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects | Matteo Matteucci Team | 2505.05318 | null |
| 2025-05-08 | Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models | Meng Zhang Team | 2505.05189 | null |
| 2025-05-08 | OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning | Qingming Huang Team | 2505.05180 | link |
| 2025-05-08 | Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models | Joachim Denzler Team | 2505.05163 | null |
| 2025-05-08 | CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models | Furao Shen Team | 2505.05130 | null |
| 2025-05-08 | X-Driver: Explainable Autonomous Driving with Vision-Language Models | Zengfeng Zeng Team | 2505.05098 | null |
| 2025-05-08 | Image-Text Relation Prediction for Multilingual Tweets | Edison Marrese-Taylor Team | 2505.05040 | null |
| 2025-05-09 | G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness | Youngjae Yu Team | 2505.05026 | null |
| 2025-05-08 | Split Matching for Inductive Zero-shot Semantic Segmentation | Daisuke Deguchi Team | 2505.05023 | null |
| 2025-05-08 | LVLM-MPC Collaboration for Autonomous Driving: A Safety-Aware and Task-Scalable Control Architecture | Tatsuya Suzuki Team | 2505.04980 | null |
| 2025-05-07 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Manoj Karkee Team | 2505.04769 | null |
| 2025-05-07 | "I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments | Xinlei He Team | 2505.04488 | null |
| 2025-05-07 | DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception | Zhuotao Tian Team | 2505.04410 | link |
| 2025-05-07 | CM1 -- A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models | Gernot A. Fink Team | 2505.04214 | null |
| 2025-05-07 | R^3-VQA: "Read the Room" by Video Social Reasoning | Lifeng Fan Team | 2505.04147 | null |
| 2025-05-06 | X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains | Hoifung Poon Team | 2505.03981 | null |
| 2025-05-06 | Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning | Victor Amblard Team | 2505.03703 | null |
| 2025-05-06 | Distribution-Conditional Generation: From Class Distribution to Creative Generation | Xin Geng Team | 2505.03667 | null |
| 2025-05-06 | Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images | Zhenan Sun Team | 2505.03611 | null |
| 2025-05-06 | Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection | Ming-Hsuan Yang Team | 2505.03610 | null |
| 2025-05-06 | Mitigating Image Captioning Hallucinations in Vision-Language Models | Xi Li Team | 2505.03420 | null |
| 2025-05-07 | Enhancing Target-unspecific Tasks through a Features Matrix | Jun Yu Team | 2505.03414 | null |
| 2025-05-06 | Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models | Aiden Doherty Team | 2505.03374 | null |
| 2025-05-06 | A Vision-Language Model for Focal Liver Lesion Classification | Chen Yen-Wei Team | 2505.03350 | null |
| 2025-05-06 | From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection | Rong Xiao Team | 2505.03334 | null |
| 2025-05-06 | Seeing the Abstract: Translating the Abstract Language for Vision Language Models | Yiming Wang Team | 2505.03242 | link |
| 2025-05-06 | VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making | Juan Carlos Niebles Team | 2505.03181 | null |
| 2025-05-06 | Robust Fairness Vision-Language Learning for Medical Image Analysis | Shu Hu Team | 2505.03153 | link |
| 2025-05-05 | Adversarial Robustness Analysis of Vision-Language Models in Medical Image Segmentation | Manish Dhakal Team | 2505.02971 | null |
| 2025-05-05 | LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery | David M. Chan Team | 2505.02829 | null |
| 2025-05-05 | HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction | Dzmitry Tsetserukou Team | 2505.02569 | null |
| 2025-05-05 | Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality | Jimmy Lin Team | 2505.02466 | null |
| 2025-05-05 | Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey | Songcan Chen Team | 2505.02448 | null |
| 2025-05-05 | SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing | Sijie Zhu Team | 2505.02370 | link |
| 2025-05-05 | TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment | Xinwei He Team | 2505.02325 | null |
| 2025-05-04 | Compositional Image-Text Matching and Retrieval by Grounding Entities | Jana Košecká Team | 2505.02278 | null |
| 2025-05-04 | Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin | Xinyang Chen Team | 2505.02056 | null |
| 2025-05-04 | A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models | Xinya Du Team | 2505.01958 | null |
| 2025-05-03 | PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications | Santosh Patapati Team | 2505.01881 | null |
| 2025-05-03 | Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos | Anett Hoppe Team | 2505.01790 | null |
| 2025-05-03 | An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding | Guoliang Xing Team | 2505.01743 | null |
| 2025-05-03 | Vision and Intention Boost Large Language Model in Long-Term Action Anticipation | Yanning Zhang Team | 2505.01713 | null |
| 2025-05-03 | RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation | Xiaodan Liang Team | 2505.01709 | null |
| 2025-05-03 | Topology-Aware CLIP Few-Shot Learning | Dazhi Huang Team | 2505.01694 | null |
| 2025-05-02 | TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | Jenq-Neng Hwang Team | 2505.01583 | null |
| 2025-05-02 | Grounding Task Assistance with Multimodal Cues from a Single Demonstration | Andrew D. Wilson Team | 2505.01578 | null |
| 2025-05-02 | Dynamic Robot Tool Use with Vision Language Models | Ahmed H. Qureshi Team | 2505.01399 | null |
| 2025-05-02 | Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages | Valerio Guarrasi Team | 2505.01096 | null |
| 2025-05-02 | Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation | Valerio Guarrasi Team | 2505.01091 | null |
| 2025-05-02 | Transferable Adversarial Attacks on Black-Box Vision-Language Models | Matt Fredrikson Team | 2505.01050 | null |
| 2025-04-30 | Entropy Heat-Mapping: Localizing GPT-Based OCR Errors with Sliding-Window Shannon Analysis | Alexei Kaltchenko Team | 2505.00746 | null |
| 2025-05-01 | Robotic Visual Instruction | Xianzheng Ma Team | 2505.00693 | null |
| 2025-05-01 | Visual Test-time Scaling for GUI Agent Grounding | Honglak Lee Team | 2505.00684 | null |
| 2025-05-01 | DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation | Yang Gao Team | 2505.00527 | null |
| 2025-05-01 | LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving | Henry X. Liu Team | 2505.00284 | null |
| 2025-05-01 | AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care | Tianming Liu Team | 2505.00275 | null |
| 2025-04-30 | V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving | Markus Lienkamp Team | 2505.00156 | null |
| 2025-04-30 | Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models | Xintao Wu Team | 2505.00150 | null |
| 2025-04-30 | Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design | Mahdi S. Hosseini Team | 2505.00134 | null |
| 2025-04-30 | Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization | Ganesh Ramakrishnan Team | 2504.21831 | null |
| 2025-04-30 | Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models | Lin Lee Cheong Team | 2504.21559 | null |
| 2025-04-30 | RoboGround: Robotic Manipulation with Grounded Vision-Language Priors | Zhou Zhao Team | 2504.21530 | null |
| 2025-04-30 | Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection | William Hsu Team | 2504.21344 | null |
| 2025-04-29 | MemeBLIP2: A novel lightweight multimodal system to detect harmful memes | Lisha Xu Team | 2504.21226 | null |
| 2025-04-29 | GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model | Yue Zhao Team | 2504.21186 | null |
| 2025-04-29 | Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization | Xiaojun Chang Team | 2504.21063 | null |
| 2025-04-29 | Real-Time Wayfinding Assistant for Blind and Low-Vision Users | Farhan Sadaf Team | 2504.20976 | null |
| 2025-04-29 | FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models | Elisa Ricci Team | 2504.20860 | null |
| 2025-04-29 | In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | Yi Yang Team | 2504.20690 | null |
| 2025-04-29 | SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data | Freda Shi Team | 2504.20648 | null |
| 2025-04-29 | PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations | Xuguang Lan Team | 2504.20520 | null |
| 2025-04-29 | Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception | Xiaoqiang Li Team | 2504.20468 | null |
| 2025-04-29 | Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks | Dimitrios K. Nasiopoulos Team | 2504.20419 | null |
| 2025-04-29 | FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding | Bo Zheng Team | 2504.20384 | null |
| 2025-04-28 | A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports | Christoph M. Friedrich Team | 2504.20220 | null |
| 2025-04-28 | Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains | Rui Yan Team | 2504.20199 | null |
| 2025-04-28 | SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | Alan Yuille Team | 2504.20024 | null |
| 2025-04-28 | EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia | Diego Marcos Team | 2504.19742 | null |
| 2025-04-28 | Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model | Guoying Zhao Team | 2504.19739 | null |
| 2025-04-28 | VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning | Xiaobo Xia Team | 2504.19627 | null |
| 2025-04-28 | LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning | Aimin Yang Team | 2504.19524 | null |
| 2025-04-27 | DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning | Shini Han Team | 2504.19127 | null |
| 2025-04-27 | Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction | Jian Liu Team | 2504.19086 | null |
| 2025-04-26 | Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation | Arif Mahmood Team | 2504.18856 | null |
| 2025-04-26 | Video CLIP Model for Multi-View Echocardiography Interpretation | Norihiko Takeda Team | 2504.18800 | null |
| 2025-04-25 | A Review of 3D Object Detection with Vision-Language Models | Manoj Karkee Team | 2504.18738 | null |
| 2025-04-25 | Proof-of-TBI -- Fine-Tuned Vision Language Model Consortium and OpenAI-o3 Reasoning LLM-Based Medical Diagnosis Support System for Mild Traumatic Brain Injury (TBI) Prediction | Donna Broshek Team | 2504.18671 | null |
| 2025-04-25 | Generalization Capability for Imitation Learning | Yixiao Wang Team | 2504.18538 | null |
| 2025-04-25 | Fast-Slow Thinking for Large Vision-Language Model Reasoning | Fei Wu Team | 2504.18458 | null |
| 2025-04-25 | Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation | Guang Yang Team | 2504.18453 | null |
| 2025-04-25 | Revisiting Data Auditing in Large Vision-Language Models | Zhuosheng Zhang Team | 2504.18349 | null |
| 2025-04-25 | A Large Vision-Language Model based Environment Perception System for Visually Impaired People | Shiguo Lian Team | 2504.18027 | null |
| 2025-04-24 | CAMU: Context Augmentation for Meme Understanding | Aditya Joshi Team | 2504.17902 | null |
| 2025-04-24 | FashionM3: Multimodal, Multitask, and Multiround Fashion Assistant based on Unified Vision-Language Model | Waikeung Wong Team | 2504.17826 | null |
| 2025-04-25 | Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction | Weiyan Wen Team | 2504.17671 | null |
| 2025-04-24 | SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting | Qingming Huang Team | 2504.17395 | null |
| 2025-04-24 | M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction | Tatsunori Mori Team | 2504.17353 | null |
| 2025-04-24 | DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model | Hao Yang Team | 2504.17315 | null |
| 2025-04-24 | Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning | Khimya Khetarpal Team | 2504.17282 | null |
| 2025-04-24 | Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation | Minhyuk Sung Team | 2504.17207 | null |
| 2025-04-23 | Distilling semantically aware orders for autoregressive image generation | Marco Pedersoli Team | 2504.17069 | null |
| 2025-04-23 | DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs | Ran Xu Team | 2504.17040 | null |
| 2025-04-24 | V | Yi R. Fung Team | 2504.16727 | null |
| 2025-04-23 | Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes | Giovanni Fusco Team | 2504.16538 | null |
| 2025-04-23 | TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Jiaya Jia Team | 2504.16505 | null |
| 2025-04-23 | FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing | Biplab Banerjee Team | 2504.16433 | null |
| 2025-04-22 | CLIP-IT: CLIP-based Pairing for Histology Images Classification | Eric Granger Team | 2504.16181 | null |
| 2025-04-22 | MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention | Lili Qiu Team | 2504.16083 | null |
| 2025-04-22 | MR. Video: "MapReduce" is the Principle for Long Video Understanding | Yu-Xiong Wang Team | 2504.16082 | null |
| 2025-04-22 | Describe Anything: Detailed Localized Image and Video Captioning | Yin Cui Team | 2504.16072 | null |
| 2025-04-22 | Vision language models are unreliable at trivial spatial cognition | J. Gregory Trafton Team | 2504.16061 | null |
| 2025-04-22 | Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation | Joyce Chai Team | 2504.16060 | null |
| 2025-04-22 | Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis | Judy Gichoya Team | 2504.16047 | null |
| 2025-04-22 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Mike Zheng Shou Team | 2504.16030 | null |
| 2025-04-24 | Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models | Tolga Çukur Team | 2504.15929 | null |
| 2025-04-21 | CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting | Mohit Bansal Team | 2504.15485 | null |
| 2025-04-21 | Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Guilin Liu Team | 2504.15271 | null |
| 2025-04-21 | KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking | Kijung Shin Team | 2504.15135 | link |
| 2025-04-21 | Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation | Serge Belongie Team | 2504.14988 | link |
| 2025-04-21 | VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform | Kun Gai Team | 2504.14904 | null |
| 2025-04-21 | Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation | Yunji Chen Team | 2504.14848 | null |
| 2025-04-20 | OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding | Zuozhu Liu Team | 2504.14692 | null |
| 2025-04-20 | NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation | Juho Kannala Team | 2504.14638 | null |
| 2025-04-20 | LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation | Yongsheng Gao Team | 2504.14467 | null |
| 2025-04-20 | Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability | Sandra Avila Team | 2504.14446 | null |
| 2025-04-19 | Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models | Nathaniel D. Bastian Team | 2504.14395 | null |
| 2025-04-19 | How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos? | James Zou Team | 2504.14391 | null |
| 2025-04-19 | A Multimodal Recaptioning Framework to Account for Perceptual Diversity in Multilingual Vision-Language Modeling | Adriana Kovashka Team | 2504.14359 | null |
| 2025-04-19 | Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses | Chau Yuen Team | 2504.14326 | null |
| 2025-04-19 | Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization | Xu Yang Team | 2504.14200 | null |
| 2025-04-19 | Bayesian Principles Improve Prompt Learning In Vision-Language Models | Mijung Park Team | 2504.14123 | null |
| 2025-04-19 | PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models | Ozlem Ozmen Garibay Team | 2504.14117 | null |
| 2025-04-21 | Analysing the Robustness of Vision-Language-Models to Common Corruptions | Umair Bin Mansoor Team | 2504.13690 | null |
| 2025-04-18 | EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model | Beng Chin Ooi Team | 2504.13650 | link |
| 2025-04-18 | PV-VLM: A Multimodal Vision-Language Approach Incorporating Sky Images for Intra-Hour Photovoltaic Power Forecasting | Miao Yu Team | 2504.13624 | null |
| 2025-04-18 | Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization | Huadong Ma Team | 2504.13460 | null |
| 2025-04-18 | Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety | Ross Greer Team | 2504.13399 | null |
| 2025-04-17 | VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture | Yanbo Huang Team | 2504.13365 | null |
| 2025-04-17 | Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models | Jacky Liang Team | 2504.13351 | null |
| 2025-04-17 | WildFireCan-MMD: A Multimodal dataset for Classification of User-generated Content During Wildfires in Canada | Marzieh Amini Team | 2504.13231 | null |
| 2025-04-17 | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | Christoph Feichtenhofer Team | 2504.13180 | null |
| 2025-04-17 | Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling | David M. Chan Team | 2504.13169 | link |
| 2025-04-17 | Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training | Zhanhui Kang Team | 2504.13123 | null |
| 2025-04-17 | Probing and Inducing Combinational Creativity in Vision-Language Models | Zilong Zheng Team | 2504.13120 | null |
| 2025-04-17 | Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration | Yong Hong Kuo Team | 2504.13119 | null |
| 2025-04-17 | Early Accessibility: Automating Alt-Text Generation for UI Icons During App Development | Christoph Csallner Team | 2504.13069 | null |
| 2025-04-17 | NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation | Michael Qizhe Shieh Team | 2504.13055 | null |
| 2025-04-17 | Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning | Wenwu Zhu Team | 2504.12680 | link |
| 2025-04-17 | VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization | Siheng Chen Team | 2504.12661 | null |
| 2025-04-16 | Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation | Éric Granger Team | 2504.12436 | link |
| 2025-04-16 | FLIP Reasoning Challenge | Roger Wattenhofer Team | 2504.12256 | null |
| 2025-04-16 | Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models - | Hanno Gottschalk Team | 2504.12137 | null |
| 2025-04-17 | Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions | Zhi-Qi Cheng Team | 2504.11967 | null |
| 2025-04-16 | Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning | Yi Chang Team | 2504.11930 | null |
| 2025-04-16 | A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification | Janis Keuper Team | 2504.11838 | null |
| 2025-04-17 | DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment | Moncef Gabbouj Team | 2504.11733 | null |
| 2025-04-16 | Interpreting the Linear Structure of Vision-language Model Embedding Spaces | Stephanie Gil Team | 2504.11695 | null |
| 2025-04-16 | VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps | Mariano Ceccato Team | 2504.11675 | null |
| 2025-04-15 | Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation | Majid Mirmehdi Team | 2504.11669 | null |
| 2025-04-17 | PATFinger: Prompt-Adapted Transferable Fingerprinting against Unauthorized Multimodal Dataset Usage | Lina Wang Team | 2504.11509 | null |
| 2025-04-15 | From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation | Jungong Han Team | 2504.11368 | null |
| 2025-04-17 | UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis | Yan Lu Team | 2504.11257 | null |
| 2025-04-15 | R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning | Ran He Team | 2504.11195 | null |
| 2025-04-15 | Benchmarking Vision Language Models on German Factual Data | Vincent Tischler Team | 2504.11108 | null |
| 2025-04-16 | Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR | Gongshen Liu Team | 2504.11101 | null |
| 2025-04-15 | QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models | Yu Wang Team | 2504.11038 | null |
| 2025-04-15 | Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles | Ross Greer Team | 2504.10873 | null |
| 2025-04-15 | LVLM_CSP: Accelerating Large Vision Language Models via Clustering, Scattering, and Pruning for Reasoning Segmentation | Mohsen Imani Team | 2504.10854 | null |
| 2025-04-15 | Enhancing Features in Long-tailed Data Using Large Vision Model | Xuesong Li Team | 2504.10852 | null |
| 2025-04-14 | ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models | Lifeng Zhou Team | 2504.10757 | null |
| 2025-04-14 | AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark | Yu-Xiong Wang Team | 2504.10568 | null |
| 2025-04-14 | Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding | Jiashi Feng Team | 2504.10465 | null |
| 2025-04-15 | GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents | Run Luo Team | 2504.10458 | null |
| 2025-04-14 | SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model | Yanning Zhang Team | 2504.10320 | null |
| 2025-04-15 | Breaking the Data Barrier -- Building GUI Agents Through Task Generalization | Junxian He Team | 2504.10127 | null |
| 2025-04-14 | AGO: Adaptive Grounding for Open World 3D Occupancy Prediction | Andreas Zell Team | 2504.10117 | null |
| 2025-04-14 | CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography | Jun-Cheng Chen Team | 2504.10090 | null |
| 2025-04-14 | Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure | Frédéric Dufaux Team | 2504.10049 | null |
| 2025-04-14 | Aligning Anime Video Generation with Human Feedback | Zuxuan Wu Team | 2504.10044 | null |
| 2025-04-14 | KeyMPs: One-Shot Vision-Language Guided Motion Generation by Sequencing DMPs for Occlusion-Rich Tasks | Takamitsu Matsubara Team | 2504.10011 | null |
| 2025-04-14 | GenTe: Generative Real-world Terrains for General Legged Robot Locomotion Control | Xiaoqiang Ji Team | 2504.09997 | null |
| 2025-04-14 | Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models | Keisuke Ozawa Team | 2504.09979 | null |
| 2025-04-14 | Can VLMs Assess Similarity Between Graph Visualizations? | Jinwook Seo Team | 2504.09859 | null |
| 2025-04-14 | VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents | Jun Suzuki Team | 2504.09795 | null |
| 2025-04-13 | A Survey on Efficient Vision-Language Models | Nirmalya Roy Team | 2504.09724 | null |
| 2025-04-13 | Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference | Tadahiro Taniguchi Team | 2504.09620 | null |
| 2025-04-13 | DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning | Mukesh Prasad Team | 2504.09598 | null |
| 2025-04-13 | Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation | Yunhong Wang Team | 2504.09480 | null |
| 2025-04-13 | Identity-Aware Vision-Language Model for Explainable Face Forgery Detection | Yu-Gang Jiang Team | 2504.09439 | null |
| 2025-04-13 | BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning | Boqing Gong Team | 2504.09426 | null |
| 2025-04-12 | PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks | Yang Liu Team | 2504.09258 | null |
| 2025-04-11 | AstroLLaVA: towards the unification of astronomical data and natural language | Dimitrios Tanoglidis Team | 2504.08583 | null |
| 2025-04-11 | EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models | Jinwoo Kim Team | 2504.08205 | null |
| 2025-04-10 | Investigating Vision-Language Model for Point Cloud-based Vehicle Classification | Camille Kamga Team | 2504.08154 | null |
| 2025-04-10 | The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | David Ha Team | 2504.08066 | null |
| 2025-04-10 | VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | Feng Zhao Team | 2504.07956 | null |
| 2025-04-10 | SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos | Yuhao Chen Team | 2504.07867 | null |
| 2025-04-10 | CollEX -- A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections | Chris Biemann Team | 2504.07643 | null |
| 2025-04-10 | VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model | Tiancheng Zhao Team | 2504.07615 | link |
| 2025-04-10 | TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs | Xuezhi Cao Team | 2504.07556 | null |
| 2025-04-10 | Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models | Xian-Sheng Hua Team | 2504.07521 | link |
| 2025-04-10 | Kimi-VL Technical Report | Ziwei Chen Team | 2504.07491 | link |
| 2025-04-09 | Perception in Reflection | Vishal M. Patel Team | 2504.07165 | null |
| 2025-04-09 | Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation | Marzieh Fadaee Team | 2504.07072 | null |
| 2025-04-09 | Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition | Aythami Morales Team | 2504.06925 | null |
| 2025-04-09 | MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Hesheng Wang Team | 2504.06863 | null |
| 2025-04-09 | ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models | Namhoon Lee Team | 2504.06838 | null |
| 2025-04-09 | LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | Bo XU Team | 2504.06835 | null |
| 2025-04-08 | PromptHMR: Promptable Human Mesh Recovery | Muhammed Kocabas Team | 2504.06397 | null |
| 2025-04-08 | SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation | Zhaozheng Yin Team | 2504.06389 | null |
| 2025-04-08 | OmniSVG: A Unified Scalable Vector Graphics Generation Model | Yu-Gang Jiang Team | 2504.06263 | null |
| 2025-04-08 | Latent Multimodal Reconstruction for Misinformation Detection | Panagiotis C. Petrantonakis Team | 2504.06010 | link |
| 2025-04-08 | Measuring Déjà vu Memorization Efficiently | Kamalika Chaudhuri Team | 2504.05651 | null |
| 2025-04-08 | A Lightweight Large Vision-language Model for Multimodal Medical Images | Navid Toosy Saidy Team | 2504.05575 | null |
| 2025-04-10 | ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering | Shafiq Joty Team | 2504.05506 | null |
| 2025-04-07 | Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models | Aliasghar Arab Team | 2504.05477 | null |
| 2025-04-07 | Taxonomy-Aware Evaluation of Vision-Language Models | Stella Frank Team | 2504.05457 | null |
| 2025-04-07 | Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly | Anamaria Crisan Team | 2504.05445 | null |
| 2025-04-07 | InteractVLM: 3D Interaction Reasoning from 2D Foundational Models | Dimitrios Tzionas Team | 2504.05303 | null |
| 2025-04-07 | SmolVLM: Redefining small and efficient multimodal models | Thomas Wolf Team | 2504.05299 | null |
| 2025-04-07 | A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text? | Ismail Ben Ayed Team | 2504.05227 | null |
| 2025-04-07 | Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation | Wei Zhang Team | 2504.05225 | null |
| 2025-04-08 | A Taxonomy of Self-Handover | Katsushi Ikeuchi Team | 2504.04939 | null |
| 2025-04-07 | SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models | Lorenz Hufe Team | 2504.04893 | null |
| 2025-04-07 | Don't Lag, RAG: Training-Free Adversarial Detection Using RAG | Ofer Hadar Team | 2504.04858 | null |
| 2025-04-07 | OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance | Xinhan Di Team | 2504.04781 | null |
| 2025-04-07 | Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding | Zahir Alsulaimawi Team | 2504.04772 | null |
| 2025-04-07 | Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions | Yue Wang Team | 2504.04744 | null |
| 2025-04-07 | Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data | Venkatesh Saligrama Team | 2504.04740 | null |
| 2025-04-06 | M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models | Ruixiang Tang Team | 2504.04633 | null |
| 2025-04-06 | Foundation Models for Software Engineering of Cyber-Physical Systems: the Road Ahead | Shaukat Ali Team | 2504.04630 | null |
| 2025-04-06 | Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection | Xiaomeng Huang Team | 2504.04517 | link |
| 2025-04-06 | OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | Jose M. Alvarez Team | 2504.04348 | null |
| 2025-04-06 | MedM-VL: What Makes a Good Medical LVLM? | Ji Wu Team | 2504.04323 | null |
| 2025-04-05 | GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill | Siyuan Huang Team | 2504.04191 | null |
| 2025-04-05 | LATTE: Lightweight Attention-based Traffic Accident Anticipation Engine | Zhenning Li Team | 2504.04103 | null |
| 2025-04-05 | TARAC: Mitigating Hallucination in LVLMs via Temporal Attention Real-time Accumulative Connection | Xiaohua Xu Team | 2504.04099 | null |
| 2025-04-04 | VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models | Anelia Angelova Team | 2504.03970 | null |
| 2025-04-04 | Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models | Matias Valdenegro-Toro Team | 2504.03440 | null |
| 2025-04-04 | SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding | Naoto Yokoya Team | 2504.03254 | null |
| 2025-04-04 | Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators | Lawson L. S. Wong Team | 2504.03245 | null |
| 2025-04-04 | Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation | Robby T. Tan Team | 2504.03193 | null |
| 2025-04-04 | NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | Zhengzhong Tu Team | 2504.03164 | null |
| 2025-04-04 | TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference | Xianpeng Lang Team | 2504.03154 | null |
| 2025-04-04 | MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories | Arvind Ramanathan Team | 2504.03153 | null |
| 2025-04-03 | QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding | Bryan Wang Team | 2504.02971 | null |
| 2025-04-03 | STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Naoufel Werghi Team | 2504.02823 | null |
| 2025-04-03 | Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models | Zeynep Akata Team | 2504.02821 | null |
| 2025-04-03 | Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence | Serena Yeung-Levy Team | 2504.02799 | null |
| 2025-04-03 | Robot-Led Vision Language Model Wellbeing Assessment of Children | Hatice Gunes Team | 2504.02765 | null |
| 2025-04-04 | Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme | Pengfei Liu Team | 2504.02587 | null |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Shibiao Xu Team | 2504.02477 | null |
| 2025-04-03 | Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation | Rui Yan Team | 2504.02438 | null |
| 2025-04-03 | ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback | Hailong Wang Team | 2504.02357 | null |
| 2025-04-03 | Large (Vision) Language Models are Unsupervised In-Context Learners | Maria Brbic Team | 2504.02349 | link |
| 2025-04-03 | Re-thinking Temporal Search for Long-Form Video Understanding | Manling Li Team | 2504.02259 | null |
| 2025-04-03 | SocialGesture: Delving into Multi-person Gesture Understanding | James M. Rehg Team | 2504.02244 | null |
| 2025-04-02 | FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs | Fatima Albreiki Team | 2504.01916 | link |
| 2025-04-02 | Is Temporal Prompting All We Need For Limited Labeled Action Recognition? | Xiaobo Jin Team | 2504.01890 | null |
| 2025-04-02 | Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images | Abdullah-Al-Zubaer Imran Team | 2504.01838 | link |
| 2025-04-02 | BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing | Leonidas Guibas Team | 2504.01786 | null |
| 2025-04-02 | AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization | Linli Xu Team | 2504.01735 | null |
| 2025-04-02 | Reasoning LLMs for User-Aware Multimodal Conversational Agents | Mohamed Chetouani Team | 2504.01700 | null |
| 2025-04-02 | CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition | Hamzah Luqman Team | 2504.01666 | link |
| 2025-04-02 | BioAtt: Anatomical Prior Driven Low-Dose CT Denoising | UiHyun Cho Team | 2504.01662 | null |
| 2025-04-02 | Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models | Ming-Hsuan Yang Team | 2504.01589 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-07-23 | InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation | Jiangmiao Pang Team | 2507.17520 | null |
| 2025-07-23 | ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents | Hesheng Wang Team | 2507.17462 | null |
| 2025-07-23 | Confidence Calibration in Vision-Language-Action Models | Richard Zemel Team | 2507.17383 | null |
| 2025-07-23 | VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback | Harold Soh Team | 2507.17294 | null |
| 2025-07-22 | ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | Fu-En Yang Team | 2507.16815 | null |
| 2025-07-21 | Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos | Zongqing Lu Team | 2507.15597 | null |
| 2025-07-22 | GR-3 Technical Report | Yichu Yang Team | 2507.15493 | null |
| 2025-07-18 | EdgeVLA: Efficient Vision-Language-Action Models | Benjamin Bolte Team | 2507.14049 | null |
| 2025-07-21 | LaViPlan : Language-Guided Visual Path Planning with RLVR | Hayeon Oh Team | 2507.12911 | null |
| 2025-07-17 | AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation | Jun Zhu Team | 2507.12768 | null |
| 2025-07-18 | EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos | Xiaolong Wang Team | 2507.12440 | null |
| 2025-07-14 | Vision Language Action Models in Robotic Manipulation: A Systematic Review | Irfan Hussain Team | 2507.10672 | null |
| 2025-07-12 | Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization | Yang Gao Team | 2507.09160 | null |
| 2025-07-09 | 3D-Generalist: Self-Improving Vision-Language-Action Models for Crafting 3D Worlds | Nick Haber Team | 2507.06484 | null |
| 2025-07-07 | NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving | Cheng Lu Team | 2507.05227 | null |
| 2025-07-10 | VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting | Yanzhi Wang Team | 2507.05116 | null |
| 2025-07-17 | DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge | Xin Jin Team | 2507.04447 | null |
| 2025-07-06 | Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties | Yunxin Liu Team | 2507.04227 | null |
| 2025-07-03 | DexVLG: Dexterous Vision-Language-Grasp Model at Scale | He Wang Team | 2507.02747 | null |
| 2025-07-02 | cVLA: Towards Efficient Camera-Space VLAs | Thomas Brox Team | 2507.02190 | null |
| 2025-07-02 | A Survey on Vision-Language-Action Models: An Action Tokenization Perspective | Yaodong Yang Team | 2507.01925 | null |
| 2025-07-02 | MoIRA: Modular Instruction Routing Architecture for Multi-Task Robotics | Nadiya Shvai Team | 2507.01843 | null |
| 2025-07-03 | TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control | Yanwei Fu Team | 2507.01424 | null |
| 2025-07-01 | VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers | Tong He Team | 2507.01016 | null |
| 2025-07-01 | Evo-0: Vision-Language-Action Model with Implicit Spatial Understanding | Bo Zhao Team | 2507.00416 | null |
| 2025-06-30 | A Survey on Vision-Language-Action Models for Autonomous Driving | Lijun Sun Team | 2506.24044 | null |
| 2025-06-27 | 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration | Li Zhang Team | 2506.22242 | null |
| 2025-06-26 | WorldVLA: Towards Autoregressive Action World Model | Hao Chen Team | 2506.21539 | null |
| 2025-06-26 | Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends | Zeng-Guang Hou Team | 2506.20966 | null |
| 2025-06-24 | Unified Vision-Language-Action Model | Zhaoxiang Zhang Team | 2506.19850 | null |
| 2025-06-24 | CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation | Jiangmiao Pang Team | 2506.19816 | null |
| 2025-07-07 | RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models | Marco Pavone Team | 2506.17811 | null |
| 2025-06-21 | RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models | Xiao Li Team | 2506.17639 | null |
| 2025-06-21 | VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models | Lin Shao Team | 2506.17561 | null |
| 2025-06-19 | CapsDT: Diffusion-Transformer for Capsule Robot Manipulation | Hongliang Ren Team | 2506.16263 | null |
| 2025-06-19 | ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models | Siyuan Huang Team | 2506.16211 | null |
| 2025-06-19 | ClutterDexGrasp: A Sim-to-Real System for General Dexterous Grasping in Cluttered Scenes | Hao Dong Team | 2506.14317 | null |
| 2025-06-16 | GRaD-Nav++: Vision-Language Model Enabled Visual Drone Navigation with Gaussian Radiance Fields and Differentiable Dynamics | Mac Schwager Team | 2506.14009 | null |
| 2025-06-16 | AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning | Jiaqi Ma Team | 2506.13757 | link |
| 2025-06-19 | LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction | Shankar Sastry Team | 2506.13751 | null |
| 2025-06-16 | CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding | Haoang Li Team | 2506.13725 | null |
| 2025-06-16 | ROSA: Harnessing Robot States for Vision-Language and Action Alignment | Xiaoyan Sun Team | 2506.13679 | null |
| 2025-06-16 | Block-wise Adaptive Caching for Accelerating Diffusion Policy | Zhi Wang Team | 2506.13456 | null |
| 2025-06-19 | A Comprehensive Survey on Continual Learning in Generative Models | Cheng-Lin Liu Team | 2506.13045 | link |
| 2025-06-19 | SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration | Wenwu Zhu Team | 2506.12723 | null |
| 2025-06-13 | RationalVLA: A Rational Vision-Language-Action Model with Dual System | Haoang Li Team | 2506.10826 | null |
| 2025-06-11 | EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models | Linfeng Zhang Team | 2506.10100 | null |
| 2025-06-11 | SAFE: Multitask Failure Detection for Vision-Language-Action Models | Florian Shkurti Team | 2506.09937 | null |
| 2025-06-11 | From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models | Chen Feng Team | 2506.09930 | null |
| 2025-06-17 | An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models | Harshvardhan Sikka Team | 2506.09172 | null |
| 2025-06-10 | FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency | Jian Tang Team | 2506.08822 | null |
| 2025-06-10 | Hybrid Reasoning for Perception, Explanation, and Autonomous Action in Manufacturing | Sebastian W. Pattinson Team | 2506.08462 | null |
| 2025-06-11 | TGRPO :Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization | Qi Wang Team | 2506.08440 | null |
| 2025-06-11 | HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation | Cong Wang Team | 2506.08296 | null |
| 2025-06-14 | Agentic Surgical AI: Surgeon Style Fingerprinting and Privacy Risk Quantification via Discrete Diffusion in a Vision-Language-Action Framework | Jason H. Moore Team | 2506.08185 | link |
| 2025-06-09 | BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | Tieniu Tan Team | 2506.07961 | null |
| 2025-06-09 | Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse | Chris Xiaoxuan Lu Team | 2506.07639 | null |
| 2025-06-09 | BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation | Xilin Chen Team | 2506.07530 | link |
| 2025-06-09 | Real-Time Execution of Action Chunking Flow Policies | Sergey Levine Team | 2506.07339 | null |
| 2025-06-12 | Robotic Policy Learning via Human-assisted Action Preference Optimization | Di Hu Team | 2506.07127 | null |
| 2025-06-07 | RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation | Si Liu Team | 2506.06677 | null |
| 2025-06-06 | MapleGrasp: Mask-guided Feature Pooling for Language-driven Efficient Robotic Grasping | Farshad Khorrami Team | 2506.06535 | null |
| 2025-06-06 | DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models | Xianpeng Lang Team | 2506.05667 | null |
| 2025-06-04 | SwitchVLA: Execution-Aware Task Switching for Vision-Language-Action Models | Jian Tang Team | 2506.03574 | null |
| 2025-06-03 | Adversarial Attacks on Robotic Vision Language Action Models | J. Zico Kolter Team | 2506.03350 | link |
| 2025-06-02 | Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning | Pheng-Ann Heng Team | 2506.01953 | null |
| 2025-06-02 | SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics | Remi Cadene Team | 2506.01844 | link |
| 2025-06-02 | MLA-Trust: Benchmarking Trustworthiness of Multimodal LLM Agents in GUI Environments | Jun Zhu Team | 2506.01616 | null |
| 2025-06-02 | ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | Huaxiu Yao Team | 2506.01300 | null |
| 2025-06-01 | OG-VLA: 3D-Aware Vision Language Action Model via Orthographic Image Generation | Valts Blukis Team | 2506.01196 | null |
| 2025-05-31 | LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks | Zhijie Deng Team | 2506.00411 | null |
| 2025-05-30 | Towards a Generalizable Bimanual Foundation Policy via Flow-based Video Prediction | Xuelong Li Team | 2505.24156 | null |
| 2025-05-29 | Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models | Hao Zhao Team | 2505.23757 | link |
| 2025-05-29 | Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better | Sergey Levine Team | 2505.23705 | null |
| 2025-05-29 | Agentic Robot: A Brain-Inspired Framework for Vision-Language-Action Models in Embodied Agents | Lichao Sun Team | 2505.23450 | null |
| 2025-05-29 | TrackVLA: Embodied Visual Tracking in the Wild | He Wang Team | 2505.23189 | null |
| 2025-05-28 | ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation | Wenqiang Zhang Team | 2505.22159 | null |
| 2025-05-29 | ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | Yi Xu Team | 2505.21906 | null |
| 2025-05-27 | EaqVLA: Encoding-aligned Quantization for Vision-Language-Action Models | Xiang Chen Team | 2505.21567 | null |
| 2025-06-02 | Hume: Introducing System-2 Thinking in Visual-Language-Action Model | Xuelong Li Team | 2505.21432 | null |
| 2025-05-27 | Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models | Tao Chen Team | 2505.21200 | null |
| 2025-05-26 | Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review | Goldie Nejat Team | 2505.20503 | null |
| 2025-05-26 | What Can RL Bring to VLA Generalization? An Empirical Study | Yu Wang Team | 2505.19789 | null |
| 2025-05-26 | RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback | Yongtao Wang Team | 2505.19767 | null |
| 2025-05-25 | ReFineVLA: Reasoning-Aware Teacher-Guided Transfer Fine-Tuning | Minh Nhat Vu Team | 2505.19080 | null |
| 2025-05-24 | Genie Centurion: Accelerating Scalable Real-World Robot Training with Human Rewind-and-Refine Guidance | Maoqing Yao Team | 2505.18793 | null |
| 2025-05-24 | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Ziwei Wang Team | 2505.18719 | link |
| 2025-05-22 | ScanBot: Towards Intelligent Surface Scanning in Embodied Robotic Systems | Farhad Imani Team | 2505.17295 | null |
| 2025-05-22 | Interactive Post-Training for Vision-Language-Action Models | Philipp Krähenbühl Team | 2505.17016 | null |
| 2025-05-22 | Perceptual Quality Assessment for Embodied AI | Guangtao Zhai Team | 2505.16815 | link |
| 2025-05-22 | BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization | Lichao Sun Team | 2505.16640 | null |
| 2025-05-22 | DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving | Junchi Yan Team | 2505.16278 | null |
| 2025-05-21 | From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems | Soujanya Poria Team | 2505.15685 | link |
| 2025-05-24 | Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization | Junwei Liang Team | 2505.15660 | link |
| 2025-05-21 | FLARE: Robot Learning with Implicit World Modeling | Linxi Fan Team | 2505.15659 | null |
| 2025-05-21 | Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control | Jungwook Choi Team | 2505.15304 | null |
| 2025-05-21 | EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy | Hongliang Ren Team | 2505.15206 | null |
| 2025-05-21 | Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation | Xiaodong He Team | 2505.15098 | null |
| 2025-05-20 | AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory | Ping Luo Team | 2505.14030 | null |
| 2025-05-22 | InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning | Jingkuan Song Team | 2505.13888 | link |
| 2025-05-25 | RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction | Bo Zhao Team | 2505.12224 | null |
| 2025-05-17 | OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning | Yang Gao Team | 2505.11917 | null |
| 2025-05-16 | Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions | Donglin Wang Team | 2505.11214 | null |
| 2025-05-16 | Conditioning Matters: Training Diffusion Policies is Faster Than You Think | Jianye Hao Team | 2505.11123 | null |
| 2025-05-14 | Real2Render2Real: Scaling Robot Data Without Dynamics Simulation or Robot Hardware | Ken Goldberg Team | 2505.09601 | null |
| 2025-05-14 | RT-cache: Efficient Robot Trajectory Retrieval System | Amir Barati Farimani Team | 2505.09040 | null |
| 2025-05-13 | From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation | Jianye Hao Team | 2505.08548 | null |
| 2025-05-17 | Training Strategies for Efficient Embodied Reasoning | Sergey Levine Team | 2505.08243 | null |
| 2025-05-12 | Pixel Motion as Universal Representation for Robot Control | Michael S Ryoo Team | 2505.07817 | null |
| 2025-05-12 | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | Donglin Wang Team | 2505.07395 | null |
| 2025-05-15 | UniVLA: Learning to Act Anywhere with Task-centric Latent Actions | Hongyang Li Team | 2505.06111 | link |
| 2025-05-09 | 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks | Farshad Khorrami Team | 2505.05800 | null |
| 2025-05-08 | Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments | Harshvardhan Sikka Team | 2505.05540 | link |
| 2025-05-07 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Manoj Karkee Team | 2505.04769 | null |
| 2025-05-06 | OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation | Donglin Wang Team | 2505.03912 | link |
| 2025-05-16 | Task Reconstruction and Extrapolation for | Quanyi Li Team | 2505.03500 | null |
| 2025-05-06 | GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data | He Wang Team | 2505.03233 | null |
| 2025-05-06 | Automated Data Curation Using GPS & NLP to Generate Instruction-Action Pairs for Autonomous Vehicle Vision-Language Navigation Datasets | Ross Greer Team | 2505.03174 | null |
| 2025-05-04 | CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation | Hao Dong Team | 2505.02166 | null |
| 2025-05-04 | Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions | Mingyu Ding Team | 2505.02152 | null |
| 2025-04-28 | NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks | Soujanya Poria Team | 2504.19854 | null |
| 2025-04-22 | | Ury Zhilinsky Team | 2504.16054 | null |
| 2025-04-22 | Few-Shot Vision-Language Action-Incremental Policy Learning | Weili Guan Team | 2504.15517 | null |
| 2025-04-18 | GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents | Xiaobo Xia Team | 2504.10458 | null |
| 2025-04-09 | OPAL: Encoding Causal Understanding of Physical Systems for Robot Learning | Tyler Fenstermaker Team | 2504.06538 | null |
| 2025-04-02 | Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning | Roozbeh Mottaghi Team | 2504.00907 | null |
| 2025-03-30 | OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | Alois C. Knoll Team | 2503.23463 | link |
| 2025-03-27 | CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | Tsung-Yi Lin Team | 2503.22020 | null |
| 2025-04-14 | MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation | Shanghang Zhang Team | 2503.20384 | null |
| 2025-03-25 | Gemini Robotics: Bringing AI into the Physical World | Yuxiang Zhou Team | 2503.20020 | null |
| 2025-03-25 | Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | Yuntao Chen Team | 2503.19757 | null |
| 2025-03-25 | DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data | Lin Ma Team | 2503.19516 | null |
| 2025-03-27 | GR00T N1: An Open Foundation Model for Generalist Humanoid Robots | Yuke Zhu Team | 2503.14734 | null |
| 2025-03-15 | ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis | Mingyu Ding Team | 2503.14526 | null |
| 2025-03-17 | MoManipVLA: Transferring Vision-language-action Models for General Mobile Manipulation | Haibin Yan Team | 2503.13446 | null |
| 2025-03-17 | HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | Shanghang Zhang Team | 2503.10631 | null |
| 2025-03-12 | CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games | Bo Zheng Team | 2503.09527 | null |
| 2025-03-11 | MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models | Zongyuan Ge Team | 2503.08007 | null |
| 2025-03-10 | PointVLA: Injecting the 3D World into Vision-Language-Action Models | Yichen Zhu Team | 2503.07511 | null |
| 2025-03-06 | Refined Policy Distillation: From VLA Generalists to RL Experts | Florian Walter Team | 2503.05833 | null |
| 2025-03-06 | VLA Model-Expert Collaboration for Bi-directional Manipulation Learning | Zeng-Guang Hou Team | 2503.04163 | null |
| 2025-03-26 | OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction | Pieter Abbeel Team | 2503.03734 | null |
| 2025-03-05 | SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning | Yaodong Yang Team | 2503.03480 | null |
| 2025-03-04 | Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding | Haoang Li Team | 2503.02310 | null |
| 2025-03-03 | CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs | Dzmitry Tsetserukou Team | 2503.01378 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-07-22 | Humanoid Robot Whole-body Geometric Calibration with Embedded Sensors and a Single Plane | Florent Lamiraux Team | 2507.16369 | null |
| 2025-07-20 | Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture | Lisa Dargasz Team | 2507.15895 | null |
| 2025-07-21 | EMP: Executable Motion Prior for Humanoid Robot Standing Upper-body Motion Imitation | Rong Xiong Team | 2507.15649 | null |
| 2025-07-16 | Robot Drummer: Learning Rhythmic Skills for Humanoid Drumming | Loris Roveda Team | 2507.11498 | null |
| 2025-07-15 | From Production Logistics to Smart Manufacturing: The Vision for a New RoboCup Industrial League | Shohei Yasuda Team | 2507.11402 | null |
| 2025-07-14 | Physics-Informed Neural Networks with Unscented Kalman Filter for Sensorless Joint Torque Estimation in Humanoid Robots | Daniele Pucci Team | 2507.10105 | null |
| 2025-07-11 | Learning Robust Motion Skills via Critical Adversarial Attacks for Humanoid Robots | Yue Gao Team | 2507.08303 | null |
| 2025-07-10 | UniTracker: Learning Universal Whole-Body Motion Tracker for Humanoid Robots | Weinan Zhang Team | 2507.07356 | null |
| 2025-07-09 | ULC: A Unified and Fine-Grained Controller for Humanoid Loco-Manipulation | Zongwu Xie Team | 2507.06905 | null |
| 2025-07-08 | Learning to Evaluate Autonomous Behaviour in Human-Robot Interaction | Alessio Del Bue Team | 2507.06404 | null |
| 2025-07-05 | Learning Humanoid Arm Motion via Centroidal Momentum Regularized Multi-Agent Reinforcement Learning | Sangbae Kim Team | 2507.04140 | null |
| 2025-07-01 | HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning | Chenjia Bai Team | 2507.00833 | null |
| 2025-06-30 | Mechanical Intelligence-Aware Curriculum Reinforcement Learning for Humanoids with Parallel Actuation | Dennis Hong Team | 2507.00273 | null |
| 2025-07-02 | DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover | Yuexin Ma Team | 2506.23152 | null |
| 2025-06-29 | Learning Motion Skills with Adaptive Assistive Curriculum Force in Humanoid Robots | Yue Gao Team | 2506.23125 | null |
| 2025-07-10 | Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation | Navid Azizan Team | 2506.22827 | null |
| 2025-06-20 | Unsupervised Discovery of Behavioral Primitives from Sensorimotor Dynamic Functional Connectivity | Matej Hoffmann Team | 2506.22473 | null |
| 2025-07-14 | Ark: An Open-source Python-based Framework for Robot Learning | Haitham Bou-Ammar Team | 2506.21628 | null |
| 2025-07-18 | A Survey of Behavior Foundation Model: Next-Generation Whole-Body Control System of Humanoid Robots | Wenjun Zeng Team | 2506.20487 | null |
| 2025-06-19 | DualTHOR: A Dual-Arm Humanoid Simulation Platform for Contingency-Aware Planning | Zongqing Lu Team | 2506.16012 | link |
| 2025-06-18 | TACT: Humanoid Whole-body Contact Manipulation through Deep Imitation Learning with Tactile Modality | Eiichi Yoshida Team | 2506.15146 | null |
| 2025-06-18 | Booster Gym: An End-to-End Reinforcement Learning Framework for Humanoid Robot Locomotion | Mingguo Zhao Team | 2506.15132 | link |
| 2025-06-17 | GMT: General Motion Tracking for Humanoid Whole-Body Control | Xiaolong Wang Team | 2506.14770 | null |
| 2025-06-17 | Whole-Body Control Framework for Humanoid Robots with Heavy Limbs: A Model-Based Approach | Yun-Hui Liu Team | 2506.14278 | null |
| 2025-06-15 | KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills | Xuelong Li Team | 2506.12851 | null |
| 2025-06-19 | From Experts to a Generalist: Toward General Whole-Body Control for Humanoid Robots | Zongqing Lu Team | 2506.12779 | null |
| 2025-06-15 | RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control | Zongqing Lu Team | 2506.12769 | null |
| 2025-06-14 | Explosive Output to Enhance Jumping Ability: A Variable Reduction Ratio Design Paradigm for Humanoid Robots Knee Joint | Qiang Huang Team | 2506.12314 | null |
| 2025-06-13 | mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity | Robert K. Katzschmann Team | 2506.11916 | null |
| 2025-06-11 | Exploring EEG Responses during Observation of Actions Performed by Human Actor and Humanoid Robot | Michelle J. Johnson Team | 2506.10170 | null |
| 2025-06-11 | Locomotion on Constrained Footholds via Layered Architectures and Model Predictive Control | Aaron D. Ames Team | 2506.09979 | null |
| 2025-06-11 | Attention-Based Map Encoding for Learning Generalized Legged Locomotion | Marco Hutter Team | 2506.09588 | null |
| 2025-06-11 | Bipedal Balance Control with Whole-body Musculoskeletal Standing and Falling Simulations | Yanan Sui Team | 2506.09383 | null |
| 2025-06-11 | SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending | Yue Wang Team | 2506.09366 | link |
| 2025-06-10 | Fast Estimation of Globally Optimal Independent Contact Regions for Robust Grasping and Manipulation | Nancy S. Pollard Team | 2506.08856 | null |
| 2025-06-12 | MoRE: Mixture of Residual Experts for Humanoid Lifelike Gaits Learning on Complex Terrains | Xuelong Li Team | 2506.08840 | null |
| 2025-06-10 | Periodic Bipedal Gait Learning Using Reward Composition Based on a Novel Gait Planner for Humanoid Robots | Lijun Zhu Team | 2506.08416 | null |
| 2025-06-05 | Realizing Text-Driven Motion Generation on NAO Robot: A Reinforcement Learning-Optimized Control Pipeline | Qijun Chen Team | 2506.05117 | link |
| 2025-06-04 | Phase-based Nonlinear Model Predictive Control for Humanoid Walking Stabilization with Single and Double Support Time Adjustments | Jaeheung Park Team | 2506.03856 | null |
| 2025-06-03 | AURA: Agentic Upskilling via Reinforced Abstractions | Dennis Hong Team | 2506.02507 | null |
| 2025-06-02 | Reinforcement Learning with Data Bootstrapping for Dynamic Subgoal Pursuit in Humanoid Robot Navigation | Ayonga Hereid Team | 2506.02206 | null |
| 2025-06-02 | Learning with pyCub: A New Simulation and Exercise Framework for Humanoid Robotics | Matej Hoffmann Team | 2506.01756 | null |
| 2025-06-05 | Hierarchical Intention-Aware Expressive Motion Generation for Humanoid Robots | Chengxu Zhou Team | 2506.01563 | null |
| 2025-06-01 | Humanoid World Models: Open World Foundation Models for Humanoid Robotics | Mohammad Al-Sharman Team | 2506.01182 | null |
| 2025-06-01 | iRonCub 3: The Jet-Powered Flying Humanoid Robot | Daniele Pucci Team | 2506.01125 | null |
| 2025-05-30 | Learning Aerodynamics for the Control of Flying Humanoid Robots | Daniele Pucci Team | 2506.00305 | null |
| 2025-05-30 | Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives -- A Survey | Rania Rayyes Team | 2506.00098 | null |
| 2025-06-05 | SignBot: Learning Human-to-Humanoid Sign Language Interaction | Guiliang Liu Team | 2505.24266 | null |
| 2025-05-30 | Humanoid Loco-Manipulations Pattern Generation and Stabilization Control | Abderrahmane Kheddar Team | 2505.24116 | null |
| 2025-05-29 | Humanoid Loco-manipulation Planning based on Graph Search and Reachability Maps | Abderrahmane Kheddar Team | 2505.23505 | null |
| 2025-05-29 | Centroidal Trajectory Generation and Stabilization based on Preview Control for Humanoid Multi-contact Motion | Fumio Kanehiro Team | 2505.23499 | link |
| 2025-06-01 | FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control | Pieter Abbeel Team | 2505.22642 | null |
| 2025-05-27 | Learning Unified Force and Position Control for Legged Loco-Manipulation | Siyuan Huang Team | 2505.20829 | null |
| 2025-05-27 | Gait-Conditioned Reinforcement Learning with Multi-Phase Curriculum for Humanoid Locomotion | Chengxu Zhou Team | 2505.20619 | null |
| 2025-05-26 | Integrating emotional intelligence, memory architecture, and gestures to achieve empathetic humanoid robot interaction in an educational setting | Paul Craig Team | 2505.19803 | null |
| 2025-05-26 | Extremum Flow Matching for Offline Goal Conditioned Reinforcement Learning | Jean-Baptiste Mouret Team | 2505.19717 | null |
| 2025-05-26 | Whole-body Multi-contact Motion Control for Humanoid Robots Based on Distributed Tactile Sensors | Eiichi Yoshida Team | 2505.19580 | link |
| 2025-05-26 | Heavy lifting tasks via haptic teleoperation of a wheeled humanoid | Joao Ramos Team | 2505.19530 | null |
| 2025-05-26 | SMAP: Self-supervised Motion Adaptation for Physically Plausible Humanoid Whole-body Control | Junting Dong Team | 2505.19463 | null |
| 2025-05-25 | Towards Humanoid Robot Autonomy: A Dynamic Architecture Integrating Continuous thought Machines (CTM) and Model Context Protocol (MCP) | Libo Wang Team | 2505.19339 | link |
| 2025-05-25 | Staircase Recognition and Location Based on Polarization Vision | Zhiying Tan Team | 2505.19026 | null |
| 2025-05-23 | DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation | Ruqi Huang Team | 2505.18078 | null |
| 2025-05-22 | Unified Multi-Rate Model Predictive Control for a Jet-Powered Humanoid Robot | Daniele Pucci Team | 2505.16478 | null |
| 2025-05-19 | TD-GRPC: Temporal Difference Learning with Group Relative Policy Constraint for Humanoid Locomotion | Minh Nhat Vu Team | 2505.13549 | null |
| 2025-05-19 | DreamGen: Unlocking Generalization in Robot Learning through Neural Trajectories | Linxi Fan Team | 2505.12705 | null |
| 2025-05-19 | Dribble Master: Learning Agile Humanoid Dribbling Through Legged Locomotion | Qi Wu Team | 2505.12679 | null |
| 2025-05-16 | Bracing for Impact: Robust Humanoid Push Recovery and Locomotion with Reduced Order Models | Aaron D. Ames Team | 2505.11495 | null |
| 2025-05-16 | X2C: A Dataset Featuring Nuanced Facial Expressions for Realistic Humanoid Imitation | Xiaohan Yu Team | 2505.11146 | link |
| 2025-05-15 | NavDP: Learning Sim-to-Real Navigation Diffusion Policy with Privileged Information Guidance | Jiangmiao Pang Team | 2505.08712 | null |
| 2025-05-13 | Rethink Repeatable Measures of Robot Performance with Statistical Query | Dylan Khor Team | 2505.08216 | null |
| 2025-05-14 | Neural Brain: A Neuroscience-inspired Framework for Embodied Agents | Lin Wang Team | 2505.07634 | link |
| 2025-05-12 | HuB: Learning Extreme Humanoid Balance | Yang Gao Team | 2505.07294 | null |
| 2025-05-11 | Dynamic Safety in Complex Environments: Synthesizing Safety Filters with Poisson's Equation | Aaron D. Ames Team | 2505.06794 | null |
| 2025-05-10 | JAEGER: Dual-Level Humanoid Whole-Body Controller | Zongqing Lu Team | 2505.06584 | null |
| 2025-05-09 | Let Humanoids Hike! Integrative Skill Development on Complex Trails | Stella X. Yu Team | 2505.06218 | null |
| 2025-05-09 | Safe-EF: Error Feedback for Nonsmooth Constrained Optimization | Ilyas Fatkhullin Team | 2505.06053 | null |
| 2025-05-09 | Human-Robot Collaboration for the Remote Control of Mobile Humanoid Robots with Torso-Arm Coordination | Zhi Li Team | 2505.05773 | null |
| 2025-05-07 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Manoj Karkee Team | 2505.04769 | null |
| 2025-05-06 | AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control | Xiaolong Wang Team | 2505.03738 | null |
| 2025-05-13 | Visual Imitation Enables Contextual Humanoid Control | Angjoo Kanazawa Team | 2505.03729 | null |
| 2025-05-05 | TWIST: Teleoperated Whole-Body Imitation System | C. Karen Liu Team | 2505.02833 | null |
| 2025-04-30 | LangWBC: Language-directed Humanoid Whole-Body Control via End-to-end Learning | Koushil Sreenath Team | 2504.21738 | null |
| 2025-04-29 | SoccerDiffusion: Toward Learning End-to-End Humanoid Robot Soccer from Gameplay Recordings | Jianwei Zhang Team | 2504.20808 | null |
| 2025-04-27 | Personalized Artificial General Intelligence (AGI) via Neuroscience-Inspired Continuous Learning Systems | Jairaj Singh Shaktawat Team | 2504.20109 | null |
| 2025-04-24 | Demonstrating Berkeley Humanoid Lite: An Open-source, Accessible, and Customizable 3D-printed Humanoid Robot | Koushil Sreenath Team | 2504.17249 | null |
| 2025-04-20 | ExFace: Expressive Facial Control for Humanoid Robots with Diffusion Transformers and Bootstrap Training | Jiahao Chen Team | 2504.14477 | null |
| 2025-04-19 | Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning | Xuelong Li Team | 2504.14305 | null |
| 2025-04-18 | Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning | Fumio Kanehiro Team | 2504.13619 | link |
| 2025-04-16 | EmoACT: a Framework to Embed Emotions into Artificial Agents Based on Affect Control Theory | Carmine Tommaso Recchiuto Team | 2504.12125 | null |
| 2025-04-14 | Teacher Motion Priors: Enhancing Robot Locomotion over Challenging Terrain | Zhengtao Zhang Team | 2504.10390 | null |
| 2025-04-14 | PreCi: Pretraining and Continual Improvement of Humanoid Locomotion via Model-Assumption-Based Regularization | Sehoon Ha Team | 2504.09833 | null |
| 2025-04-13 | Embodied Chain of Action Reasoning with Multi-Modal Foundation Model for Humanoid Loco-manipulation | Yi Fang Team | 2504.09532 | null |
| 2025-04-11 | Spectral Normalization for Lipschitz-Constrained Policies on Learning Humanoid Locomotion | Jaeheung Park Team | 2504.08246 | null |
| 2025-04-07 | MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond | Xun Cao Team | 2504.05046 | null |
| 2025-04-07 | A High-Force Gripper with Embedded Multimodal Sensing for Powerful and Perception Driven Grasping | Nikos G. Tsagarakis Team | 2504.04970 | null |
| 2025-04-06 | Public speech recognition transcripts as a configuring parameter | Christian Licoppe Team | 2504.04488 | null |
| 2025-04-02 | The Social Life of Industrial Arms: How Arousal and Attention Shape Human-Robot Interaction | Matthew K. X. J. Pan Team | 2504.01260 | null |
| 2025-04-01 | Extended Hybrid Zero Dynamics for Bipedal Walking of the Knee-less Robot SLIDER | Petar Kormushev Team | 2504.01165 | null |
| 2025-04-11 | Learning Bipedal Locomotion on Gear-Driven Humanoid Robot Using Foot-Mounted IMUs | Masaya Kinoshita Team | 2504.00614 | null |
| 2025-03-30 | Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework | Ysobel Sims Team | 2503.23601 | null |
| 2025-03-28 | Control of Humanoid Robots with Parallel Mechanisms using Kinematic Actuation Models | Nicolas Mansard Team | 2503.22459 | null |
| 2025-03-28 | FLAM: Foundation Model-Based Body Stabilization for Humanoid Locomotion and Manipulation | Debin Zhao Team | 2503.22249 | null |
| 2025-03-27 | OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation | Wanting Li Team | 2503.21257 | null |
| 2025-03-26 | Anti Robot Speciesism | Miklos Sarvary Team | 2503.20842 | null |
| 2025-03-25 | Can Vision-Language Models Answer Face to Face Questions in the Real-World? | Roland Memisevic Team | 2503.19356 | null |
| 2025-03-19 | StyleLoco: Generative Adversarial Distillation for Natural Humanoid Robot Locomotion | Siyuan Huang Team | 2503.15082 | null |
| 2025-03-27 | GR00T N1: An Open Foundation Model for Generalist Humanoid Robots | Yuke Zhu Team | 2503.14734 | null |
| 2025-03-24 | Humanoid Policy ~ Human Policy | Xiaolong Wang Team | 2503.13441 | null |
| 2025-03-17 | Humanoids in Hospitals: A Technical Study of Humanoid Surrogates for Dexterous Medical Interventions | Michael Yip Team | 2503.12725 | null |
| 2025-03-16 | Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills | Zongqing Lu Team | 2503.12533 | null |
| 2025-03-14 | Fast and Robust Localization for Humanoid Soccer Robot via Iterative Landmark Matching | Dennis W. Hong Team | 2503.11020 | null |
| 2025-03-13 | NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models | Michael Black Team | 2503.10626 | null |
| 2025-03-13 | NuExo: A Wearable Exoskeleton Covering all Upper Limb ROM for Outdoor Data Collection and Teleoperation of Humanoid Robots | Huimin Lu Team | 2503.10554 | null |
| 2025-03-12 | Natural Humanoid Robot Locomotion with Generative Motion Prior | Rong Xiong Team | 2503.09015 | null |
| 2025-03-13 | HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots | Renjing Xu Team | 2503.09010 | null |
| 2025-03-11 | LiPS: Large-Scale Humanoid Robot Reinforcement Learning with Parallel-Series Structures | Renjing Xu Team | 2503.08349 | null |

| Publish Date | Title | Authors | Code | |
|---|---|---|---|---|
| 2025-07-19 | A 21-DOF Humanoid Dexterous Hand with Hybrid SMA-Motor Actuation: CYJ Hand-0 | Erbao Dong Team | 2507.14538 | null |
| 2025-07-18 | Improving Low-Cost Teleoperation: Augmenting GELLO with Force | Kai Arulkumaran Team | 2507.13602 | null |
| 2025-07-16 | The Developments and Challenges towards Dexterous and Embodied Robotic Manipulation: A Survey | Jiming Chen Team | 2507.11840 | null |
| 2025-07-14 | Demonstrating the Octopi-1.5 Visual-Tactile-Language Model | Harold Soh Team | 2507.09985 | null |
| 2025-07-09 | Hierarchical Reinforcement Learning for Articulated Tool Manipulation with Multifingered Hand | Xinjun Sheng Team | 2507.06822 | null |
| 2025-07-07 | A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation | Russ Tedrake Team | 2507.05331 | null |
| 2025-07-06 | SimLauncher: Launching Sample-Efficient Real-world Robotic Reinforcement Learning via Simulation Pre-training | Hao Dong Team | 2507.04452 | null |
| 2025-07-03 | DexVLG: Dexterous Vision-Language-Grasp Model at Scale | He Wang Team | 2507.02747 | null |
| 2025-07-02 | TypeTele: Releasing Dexterity in Teleoperation by Dexterous Manipulation Types | Wei-Shi Zheng Team | 2507.01857 | null |
| 2025-07-01 | HumanoidGen: Data Generation for Bimanual Dexterous Manipulation via LLM Reasoning | Chenjia Bai Team | 2507.00833 | null |
| 2025-06-26 | Lightweight Fingernail Haptic Device: Unobstructed Fingerpad Force and Vibration Feedback for Enhanced Virtual Dexterous Manipulation | Shoichi Hasegawa Team | 2506.21417 | null |
| 2025-06-24 | Scaffolding Dexterous Manipulation with Vision-Language Models | Dorsa Sadigh Team | 2506.19212 | null |
| 2025-06-24 | The MOTIF Hand: A Robotic Hand for Multimodal Observations with Thermal, Inertial, and Force Sensors | Daniel Seita Team | 2506.19201 | null |
| 2025-06-21 | VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models | Lin Shao Team | 2506.17561 | null |
| 2025-06-20 | Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation | Xiaolong Wang Team | 2506.17198 | null |
| 2025-06-19 | ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation | Jitendra Malik Team | 2506.15953 | null |
| 2025-06-17 | Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation | Mustafa Mukadam Team | 2506.14754 | null |
| 2025-06-16 | CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding | Haoang Li Team | 2506.13725 | null |
| 2025-06-13 | ViTaSCOPE: Visuo-tactile Implicit Representation for In-hand Pose and Extrinsic Contact Estimation | Nima Fazeli Team | 2506.12239 | null |
| 2025-06-13 | ExoStart: Efficient learning for dexterous manipulation with sensorized exoskeleton demonstrations | Maria Bauza Villalonga Team | 2506.11775 | null |
| 2025-06-30 | Adaptive event-triggered robust tracking control of soft robots | Marios M. Polycarpou Team | 2506.09523 | null |
| 2025-06-11 | Analyzing Key Objectives in Human-to-Robot Retargeting for Dexterous Manipulation | Xiang Li Team | 2506.09384 | null |
| 2025-06-09 | TensorTouch: Calibration of Tactile Sensors for High Resolution Stress Tensor and Deformation for Dexterous Manipulation | Monroe Kennedy III Team | 2506.08291 | null |
| 2025-06-09 | RAPID Hand: A Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Generalist Robot Autonomy | Hui Cheng Team | 2506.07490 | null |
| 2025-06-05 | GEX: Democratizing Dexterity with Fully-Actuated Dexterous Hand and Exoskeleton Glove | Zelin Deng Team | 2506.04982 | link |
| 2025-06-06 | ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning | Jian Tang Team | 2506.04941 | null |
| 2025-06-03 | Reachability Weighted Offline Goal-conditioned Resampling | Joni Pajarinen Team | 2506.02577 | null |
| 2025-05-30 | Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives -- A Survey | Rania Rayyes Team | 2506.00098 | null |
| 2025-05-30 | DexMachina: Functional Retargeting for Bimanual Dexterous Manipulation | Shuran Song Team | 2505.24853 | null |
| 2025-05-28 | ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation | Wenqiang Zhang Team | 2505.22159 | null |
| 2025-05-29 | DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation | Shuran Song Team | 2505.21864 | null |
| 2025-05-27 | Learning Generalizable Robot Policy with Human Demonstration Video as a Prompt | Jianyu Chen Team | 2505.20795 | null |
| 2025-05-25 | MaskedManipulator: Versatile Whole-Body Control for Loco-Manipulation | Xue Bin Peng Team | 2505.19086 | null |
| 2025-05-24 | Beyond Domain Randomization: Event-Inspired Perception for Visually Robust Adversarial Imitation from Videos | Mario Bijelic Team | 2505.18899 | link |
| 2025-05-24 | DiffusionRL: Efficient Training of Diffusion Policies for Robotic Grasping Using RL-Adapted Large-Scale Datasets | Dzmitry Tsetserukou Team | 2505.18876 | null |
| 2025-05-27 | GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning | Ye Shi Team | 2505.18763 | null |
| 2025-05-22 | TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation | Hengdi Zhang Team | 2505.16289 | null |
| 2025-05-21 | Object-Focus Actor for Data-efficient Robot Generalization Dexterous Manipulation | Xiaodong He Team | 2505.15098 | null |
| 2025-05-20 | Adaptive Visuo-Tactile Fusion with Predictive Force Attention for Dexterous Manipulation | Hao Dong Team | 2505.13982 | null |
| 2025-05-19 | Approximating Global Contact-Implicit MPC via Sampling and Local Complementarity | Michael Posa Team | 2505.13350 | null |
| 2025-05-19 | TeleOpBench: A Simulator-Centric Benchmark for Dual-Arm Dexterous Teleoperation | Jiangmiao Pang Team | 2505.12748 | null |
| 2025-05-18 | PartDexTOG: Generating Dexterous Task-Oriented Grasping via Language-driven Part Analysis | Zhipong Cai Team | 2505.12294 | null |
| 2025-05-17 | OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning | Yang Gao Team | 2505.11917 | null |
| 2025-05-16 | EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video | Jian Zhang Team | 2505.11709 | null |
| 2025-05-16 | Self-supervised perception for tactile skin covered dexterous hands | Mustafa Mukadam Team | 2505.11420 | null |
| 2025-05-16 | Learning Multimodal AI Algorithms for Amplifying Limited User Input into High-dimensional Control Space | Reza Abiri Team | 2505.11366 | link |
| 2025-05-16 | Estimating Deformable-Rigid Contact Interactions for a Deformable Tool via Learning and Model-Based Optimization | Nima Fazeli Team | 2505.10884 | null |
| 2025-05-15 | SRT-H: A Hierarchical Framework for Autonomous Surgery via Language Conditioned Imitation Learning | Axel Krieger Team | 2505.10251 | null |
| 2025-05-13 | HandCept: A Visual-Inertial Fusion Framework for Accurate Proprioception in Dexterous Hands | Yunhui Liu Team | 2505.08213 | null |
| 2025-05-12 | DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies | Deepak Pathak Team | 2505.07813 | null |
| 2025-05-08 | Morphologically Symmetric Reinforcement Learning for Ambidextrous Bimanual Manipulation | Georgia Chalvatzaki Team | 2505.05287 | null |
| 2025-05-04 | Prompt-responsive Object Retrieval with Memory-augmented Student-Teacher Learning | Sven Behnke Team | 2505.02232 | null |
| 2025-05-04 | KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation | Yang Gao Team | 2505.01974 | null |
| 2025-05-02 | DexFlow: A Unified Approach for Dexterous Hand Pose Retargeting and Interaction | Miao Li Team | 2505.01083 | null |
| 2025-05-02 | DexCtrl: Towards Sim-to-Real Dexterity with Adaptive Controller Learning | Masayoshi Tomizuka Team | 2505.00991 | null |
| 2025-04-30 | Multi-Goal Dexterous Hand Manipulation using Probabilistic Model-based Reinforcement Learning | Yunduan Cui Team | 2504.21585 | null |
| 2025-04-27 | PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies | Edward Adelson Team | 2504.19341 | null |
| 2025-04-23 | PP-Tac: Paper Picking Using Tactile Feedback in Dexterous Robotic Hands | Ziyuan Jiao Team | 2504.16649 | null |
| 2025-04-22 | | Ury Zhilinsky Team | 2504.16054 | null |
| 2025-04-21 | LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning | Boyuan Chen Team | 2504.15472 | null |
| 2025-04-21 | SuFIA-BC: Generating High Quality Demonstration Data for Visuomotor Policy Learning in Surgical Subtasks | Animesh Garg Team | 2504.14857 | null |
| 2025-04-20 | BiDexHand: Design and Evaluation of an Open-Source 16-DoF Biomimetic Dexterous Hand | Zhengyang Kris Weng Team | 2504.14712 | null |
| 2025-04-18 | On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting | Jan Peters Team | 2504.13618 | null |
| 2025-04-17 | RUKA: Rethinking the Design of Humanoid Hands with Learning | Lerrel Pinto Team | 2504.13165 | null |
| 2025-04-17 | Adaptive Task Space Non-Singular Terminal Super-Twisting Sliding Mode Control of a 7-DOF Robotic Manipulator | E. Witrant Team | 2504.13056 | null |
| 2025-04-17 | Krysalis Hand: A Lightweight, High-Payload, 18-DoF Anthropomorphic End-Effector for Robotic Learning and Dexterous Manipulation | Iman Soltani Team | 2504.12967 | null |
| 2025-04-22 | Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration | Jeannette Bohg Team | 2504.12609 | null |
| 2025-04-14 | Look-to-Touch: A Vision-Enhanced Proximity and Tactile Sensor for Distance and Geometry Perception in Robotic Manipulation | Guoying Gu Team | 2504.10280 | null |
| 2025-04-08 | Functionally graded keratin facilitates tactile sensing in elephant whiskers | Katherine J. Kuchenbecker Team | 2504.07143 | null |
| 2025-04-08 | ViTaMIn: Learning Contact-Rich Tasks Through Robot-Free Visuo-Tactile Manipulation Interface | Rui Chen Team | 2504.06156 | null |
| 2025-04-06 | DexTOG: Learning Task-Oriented Dexterous Grasp with Language | Cewu Lu Team | 2504.04573 | null |
| 2025-04-06 | DexSinGrasp: Learning a Unified Policy for Dexterous Object Singulation and Grasping in Cluttered Environments | Lin Shao Team | 2504.04516 | null |
| 2025-04-05 | ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning | Robert K. Katzschmann Team | 2504.04259 | null |
| 2025-04-24 | Dexterous Manipulation through Imitation Learning: A Survey | Hong Zhang Team | 2504.03515 | null |
| 2025-03-29 | Dexterous Non-Prehensile Manipulation for Ungraspable Object via Extrinsic Dexterity | Yuanpei Chen Team | 2503.23120 | null |
| 2025-03-27 | ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning | Siyuan Huang Team | 2503.21860 | null |
| 2025-03-25 | G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation | Ruizhen Hu Team | 2503.19457 | null |
| 2025-03-16 | Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills | Zongqing Lu Team | 2503.12533 | null |
| 2025-03-14 | Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping | Haruki Nishimura Team | 2503.10966 | null |
| 2025-03-12 | Sequential Multi-Object Grasping with One Dexterous Hand | Daniel Seita Team | 2503.09078 | null |
| 2025-03-16 | DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness | Yuexin Ma Team | 2503.08257 | link |
| 2025-03-13 | AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems | Jianchao Zhu Team | 2503.06669 | link |
| 2025-03-08 | ReJSHand: Efficient Real-Time Hand Pose Estimation and Mesh Reconstruction Using Refined Joint and Skeleton Features | Hong Zhang Team | 2503.05995 | link |
| 2025-03-07 | Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction | Bin He Team | 2503.05231 | null |
| 2025-03-06 | Dexterous Hand Manipulation via Efficient Imitation-Bootstrapped Online Reinforcement Learning | Xiaodong He Team | 2503.04014 | null |
| 2025-03-05 | LensDFF: Language-enhanced Sparse Feature Distillation for Efficient Few-Shot Dexterous Manipulation | Alois Knoll Team | 2503.03890 | null |
| 2025-03-05 | Selective Tweezing and Immobilization of Colloids for Dexterous Manipulation of Biological Materials | Kimani C. Toussaint Jr Team | 2503.03102 | null |
| 2025-03-03 | TacCap: A Wearable FBG-Based Tactile Sensor for Seamless Human-to-Robot Skill Transfer | Mark R. Cutkosky Team | 2503.01789 | null |
| 2025-03-03 | RoboDexVLM: Visual Language Model-Enabled Task Planning and Motion Control for Dexterous Robot Manipulation | Jun Ma Team | 2503.01616 | null |
| 2025-03-03 | Exo-ViHa: A Cross-Platform Exoskeleton System with Visual and Haptic Feedback for Efficient Dexterous Skill Learning | Wenbo Ding Team | 2503.01543 | null |
| 2025-03-03 | KineSoft: Learning Proprioceptive Manipulation Policies with Soft Robot Hands | Jeffrey Ichnowski Team | 2503.01078 | null |
| 2025-02-27 | Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids | Yuke Zhu Team | 2502.20396 | null |
| 2025-02-28 | ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration | Feifei Feng Team | 2502.19250 | null |
| 2025-02-26 | Retrieval Dexterity: Efficient Object Retrieval in Clutters with Dexterous Hand | Yuanpei Chen Team | 2502.18423 | null |