| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et.al. | 2511.16639 | null |
| 2025-11-20 | WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue | Zachary Ellis et.al. | 2511.16544 | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et.al. | 2511.16114 | null |
| 2025-11-19 | Universal TT- and TQ-relations via centrally extended q-Onsager algebra | Pascal Baseilhac et.al. | 2511.15876 | null |
| 2025-11-19 | Step-Audio-R1 Technical Report | Fei Tian et.al. | 2511.15848 | null |
| 2025-11-19 | A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification | Mohit Sharma et.al. | 2511.15766 | null |
| 2025-11-19 | PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback | Sirui Chen et.al. | 2511.15253 | null |
| 2025-11-19 | Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding | Mingyue Huo et.al. | 2511.15145 | null |
| 2025-11-19 | Aligning Generative Music AI with Human Preferences: Methods and Challenges | Dorien Herremans et.al. | 2511.15038 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants | Mingkun Yu et.al. | 2511.14852 | null |
| 2025-11-18 | Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech | Nam-Gyu Kim et.al. | 2511.14824 | null |
| 2025-11-18 | Ground Truth Generation for Multilingual Historical NLP using LLMs | Clovis Gladstone et.al. | 2511.14688 | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | null |
| 2025-11-18 | Periods in equivariant and motivic contexts | Martin Gallauer et.al. | 2511.14325 | null |
| 2025-11-18 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR | Gabrial Zencha Ashungafac et.al. | 2511.14255 | null |
| 2025-11-18 | Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning | Rui Liu et.al. | 2511.14249 | link |
| 2025-11-18 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | Yifan Yang et.al. | 2511.14223 | null |
| 2025-11-18 | FxSearcher: gradient-free text-driven audio transformation | Hojoon Ki et.al. | 2511.14138 | null |
| 2025-11-17 | Human-centric Maintenance Process Through Integration of AI, Speech, and AR | Parul Khanna et.al. | 2511.13918 | null |
| 2025-11-17 | Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video | Filippo Cenacchi, Longbing Cao et.al. | 2511.13802 | null |
| 2025-11-17 | PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement | Xiaobin Rong et.al. | 2511.13300 | null |
| 2025-11-17 | Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms | Patrick Parschan et.al. | 2511.13238 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis | Zaara Zabeen Arpa et.al. | 2511.13159 | link |
| 2025-11-17 | A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning | Liuyi Jin et.al. | 2511.13078 | null |
| 2025-11-17 | CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models | Mehrab Mustafy Rahman et.al. | 2511.12964 | null |
| 2025-11-16 | Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data | Sina Rashidi et.al. | 2511.12690 | null |
| 2025-11-16 | Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans | Hongbin Huang et.al. | 2511.12662 | null |
| 2025-11-16 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | null |
| 2025-11-16 | DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions | Xiaoyu Lin et.al. | 2511.12452 | null |
| 2025-11-14 | Proactive Hearing Assistants that Isolate Egocentric Conversations | Guilin Hu et.al. | 2511.11473 | link |
| 2025-11-14 | Language-Aided State Estimation | Yuki Miyoshi et.al. | 2511.11285 | null |
| 2025-11-14 | Extended-Krylov-subspace methods for trust-region and norm-regularization subproblems | Hussam Al Daas et.al. | 2511.11135 | null |
| 2025-11-14 | Analysing Personal Attacks in U.S. Presidential Debates | Ruban Goyal et.al. | 2511.11108 | null |
| 2025-11-14 | CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation | Crystal Min Hui Poon et.al. | 2511.11104 | null |
| 2025-11-14 | CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding | Yifan Zhuang et.al. | 2511.10935 | null |
| 2025-11-14 | Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio | Guangke Chen et.al. | 2511.10913 | null |
| 2025-11-13 | Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces | Farhan Sheth et.al. | 2511.10793 | null |
| 2025-11-13 | Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning | Girish et.al. | 2511.10790 | null |
| 2025-11-13 | XSNAP: An X-ray Supernova Analysis Pipeline with Application to the Type II Supernova 2024ggi | Ferdinand et.al. | 2511.10744 | null |
| 2025-11-13 | Music Flamingo: Scaling Music Understanding in Audio Language Models | Sreyan Ghosh et.al. | 2511.10289 | null |
| 2025-11-13 | VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction | Yuhao Wang et.al. | 2511.10232 | null |
| 2025-11-13 | Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard | Yudong Yang et.al. | 2511.10222 | null |
| 2025-11-13 | Towards Leveraging Sequential Structure in Animal Vocalizations | Eklavya Sarkar et.al. | 2511.10190 | link |
| 2025-11-13 | FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features | Wenyu Wang et.al. | 2511.10112 | null |
| 2025-11-13 | Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints | Xiangyue Zhang et.al. | 2511.10076 | null |
| 2025-11-13 | Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS | Haoyu Li et.al. | 2511.09995 | null |
| 2025-11-13 | MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection | Pritish Sahu et.al. | 2511.09918 | null |
| 2025-11-12 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages | Omnilingual ASR team et.al. | 2511.09690 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Cognitive Foundations for Reasoning and Their Manifestation in LLMs | Priyanka Kargupta et.al. | 2511.16660 | null |
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et.al. | 2511.16639 | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et.al. | 2511.16114 | null |
| 2025-11-19 | Step-Audio-R1 Technical Report | Fei Tian et.al. | 2511.15848 | null |
| 2025-11-19 | A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification | Mohit Sharma et.al. | 2511.15766 | null |
| 2025-11-20 | Multimodal Evaluation of Russian-language Architectures | Artem Chervyakov et.al. | 2511.15552 | null |
| 2025-11-19 | Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models | Mehran Tamjidi et.al. | 2511.15311 | null |
| 2025-11-19 | Detection of spiking motifs of arbitrary length in neural activity using bounded synaptic delays | Thomas Kronland-Martinet et.al. | 2511.15296 | null |
| 2025-11-19 | SNAP: Low-Latency Test-Time Adaptation with Sparse Updates | Hyeongheon Cha et.al. | 2511.15276 | null |
| 2025-11-19 | LargeSHS: A large-scale dataset of music adaptation | Chih-Pin Tan et.al. | 2511.15270 | null |
| 2025-11-19 | Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding | Mingyue Huo et.al. | 2511.15145 | null |
| 2025-11-19 | Aligning Generative Music AI with Human Preferences: Methods and Challenges | Dorien Herremans et.al. | 2511.15038 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems | Jaro Meyer et.al. | 2511.14948 | null |
| 2025-11-18 | Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report | Daniel Oliveira de Brito et.al. | 2511.14939 | null |
| 2025-11-18 | A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder | Dengyun Huang et.al. | 2511.14600 | null |
| 2025-11-18 | Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning | Trishala Jayesh Ahalpara et.al. | 2511.14445 | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | null |
| 2025-11-18 | H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata | Chenyang Xu et.al. | 2511.14312 | null |
| 2025-11-18 | Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions | Marcel Gibier et.al. | 2511.14307 | null |
| 2025-11-18 | EBind: a practical approach to space binding | Jim Broadbent et.al. | 2511.14229 | null |
| 2025-11-18 | StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model | Yifan Yang et.al. | 2511.14223 | null |
| 2025-11-18 | FxSearcher: gradient-free text-driven audio transformation | Hojoon Ki et.al. | 2511.14138 | null |
| 2025-11-18 | Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services | Liuyi Jin et.al. | 2511.14119 | null |
| 2025-11-17 | Preference-Based Learning in Audio Applications: A Systematic Analysis | Aaron Broukhim et.al. | 2511.13936 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language | Zonghao Ying et.al. | 2511.13127 | null |
| 2025-11-17 | A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning | Liuyi Jin et.al. | 2511.13078 | null |
| 2025-11-16 | Open-World Test-Time Adaptation with Hierarchical Feature Aggregation and Attention Affine | Ziqiong Liu et.al. | 2511.12607 | null |
| 2025-11-16 | DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions | Xiaoyu Lin et.al. | 2511.12452 | null |
| 2025-11-16 | SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs | Shail Desai et.al. | 2511.12404 | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et.al. | 2511.12347 | null |
| 2025-11-15 | Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound | Dengming Zhang et.al. | 2511.12077 | null |
| 2025-11-15 | ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation | Jiahui Sun et.al. | 2511.12072 | null |
| 2025-11-14 | Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering | Tianyu Xu et.al. | 2511.11930 | null |
| 2025-11-14 | Proactive Hearing Assistants that Isolate Egocentric Conversations | Guilin Hu et.al. | 2511.11473 | null |
| 2025-11-14 | AV-Dialog: Spoken Dialogue Models with Audio-Visual Input | Tuochao Chen et.al. | 2511.11124 | null |
| 2025-11-14 | DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition | HongYu Liu et.al. | 2511.11000 | null |
| 2025-11-14 | Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio | Guangke Chen et.al. | 2511.10913 | null |
| 2025-11-13 | Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces | Farhan Sheth et.al. | 2511.10793 | null |
| 2025-11-13 | Panda: Test-Time Adaptation with Negative Data Augmentation | Ruxi Deng et.al. | 2511.10481 | null |
| 2025-11-13 | TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities | Yan Zhuang et.al. | 2511.10325 | null |
| 2025-11-13 | Music Flamingo: Scaling Music Understanding in Audio Language Models | Sreyan Ghosh et.al. | 2511.10289 | null |
| 2025-11-13 | OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models | Yuping Yan et.al. | 2511.10287 | null |
| 2025-11-14 | Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard | Yudong Yang et.al. | 2511.10222 | null |
| 2025-11-13 | Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization | Ashutosh Anshul et.al. | 2511.10212 | null |
| 2025-11-13 | RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo | Jueun Ko et.al. | 2511.10107 | null |
| 2025-11-13 | When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? | Qilang Ye et.al. | 2511.10059 | null |
| 2025-11-13 | Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism | Jinhong Jeong et.al. | 2511.10045 | null |
| 2025-11-13 | Reinforcing Trustworthiness in Multimodal Emotional Support Systems | Huy M. Le et.al. | 2511.10011 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty | Victor Croisfelt et.al. | 2511.16225 | null |
| 2025-11-19 | MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features | Sejuti Rahman et.al. | 2511.15675 | null |
| 2025-11-20 | Multimodal Evaluation of Russian-language Architectures | Artem Chervyakov et.al. | 2511.15552 | null |
| 2025-11-19 | A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data | Mauro Larrat et.al. | 2511.15312 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems | Jaro Meyer et.al. | 2511.14948 | null |
| 2025-11-18 | OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models | Keda Tao et.al. | 2511.14582 | null |
| 2025-11-18 | Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning | Rui Liu et.al. | 2511.14249 | null |
| 2025-11-18 | EBind: a practical approach to space binding | Jim Broadbent et.al. | 2511.14229 | null |
| 2025-11-18 | SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM | An Yu et.al. | 2511.14143 | null |
| 2025-11-18 | Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services | Liuyi Jin et.al. | 2511.14119 | null |
| 2025-11-17 | Segmenting Collision Sound Sources in Egocentric Videos | Kranti Kumar Parida et.al. | 2511.13863 | null |
| 2025-11-17 | Towards Affect-Adaptive Human-Robot Interaction: A Protocol for Multimodal Dataset Collection on Social Anxiety | Vesna Poprcova et.al. | 2511.13530 | null |
| 2025-11-17 | CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving | Enhui Ma et.al. | 2511.13297 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language | Zonghao Ying et.al. | 2511.13127 | null |
| 2025-11-17 | A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning | Liuyi Jin et.al. | 2511.13078 | null |
| 2025-11-17 | Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views | Junyi Ma et.al. | 2511.12878 | null |
| 2025-11-16 | DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions | Xiaoyu Lin et.al. | 2511.12452 | null |
| 2025-11-16 | SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs | Shail Desai et.al. | 2511.12404 | null |
| 2025-11-15 | Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound | Dengming Zhang et.al. | 2511.12077 | null |
| 2025-11-15 | ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation | Jiahui Sun et.al. | 2511.12072 | null |
| 2025-11-14 | AV-Dialog: Spoken Dialogue Models with Audio-Visual Input | Tuochao Chen et.al. | 2511.11124 | null |
| 2025-11-14 | AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization | Zhonghua Jiang et.al. | 2511.11106 | null |
| 2025-11-13 | TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities | Yan Zhuang et.al. | 2511.10325 | null |
| 2025-11-13 | OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models | Yuping Yan et.al. | 2511.10287 | null |
| 2025-11-13 | Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization | Ashutosh Anshul et.al. | 2511.10212 | null |
| 2025-11-13 | When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? | Qilang Ye et.al. | 2511.10059 | null |
| 2025-11-13 | Reinforcing Trustworthiness in Multimodal Emotional Support Systems | Huy M. Le et.al. | 2511.10011 | null |
| 2025-11-13 | Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation | Xiangyi Wei et.al. | 2511.09958 | null |
| 2025-11-14 | HI-TransPA: Hearing Impairments Translation Personal Assistant | Zhiming Ma et.al. | 2511.09915 | null |
| 2025-11-12 | Co-Designing Multimodal Systems for Accessible Remote Dance Instruction | Ujjaini Das et.al. | 2511.09658 | null |
| 2025-11-12 | MCAD: Multimodal Context-Aware Audio Description Generation For Soccer | Lipisha Chaudhary et.al. | 2511.09448 | null |
| 2025-11-12 | Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection | Anushka Sanjay Shelke et.al. | 2511.09039 | null |
| 2025-11-05 | UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions | Guozhen Zhang et.al. | 2511.03334 | null |
| 2025-10-28 | Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation | Kang Zhang et.al. | 2510.24103 | null |
| 2025-10-10 | MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation | Akira Takahashi et.al. | 2510.09065 | null |
| 2025-10-28 | Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation | Liyang Chen et.al. | 2510.08078 | null |
| 2025-10-09 | IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries | Harsh Kavediya et.al. | 2510.07837 | null |
| 2025-10-07 | FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders | Riccardo Fosco Gramaccioni et.al. | 2510.05829 | null |
| 2025-10-07 | StereoSync: Spatially-Aware Stereo Audio Generation from Video | Christian Marinoni et.al. | 2510.05828 | null |
| 2025-10-03 | SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos | Amir Dellali et.al. | 2510.02916 | null |
| 2025-10-02 | SoundReactor: Frame-level Online Video-to-Audio Generation | Koichi Saito et.al. | 2510.02110 | null |
| 2025-09-29 | Training-Free Multimodal Guidance for Video to Audio Generation | Eleonora Grassucci et.al. | 2509.24550 | null |
| 2025-09-28 | AudioMoG: Guiding Audio Generation with Mixture-of-Guidance | Junyou Wang et.al. | 2509.23727 | null |
| 2025-09-26 | WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM | Changli Tang et.al. | 2509.21990 | null |
| 2025-09-26 | Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers | Jibin Song et.al. | 2509.21893 | null |
| 2025-09-24 | MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization | Jianxuan Yang et.al. | 2509.19999 | null |
| 2025-10-05 | StereoFoley: Object-Aware Stereo Audio Generation from Video | Tornike Karchkhadze et.al. | 2509.18272 | null |
| 2025-09-19 | Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech | Xinlei Niu et.al. | 2509.15492 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Neutron star heating vs. HST observations | Luis E. Rodríguez et.al. | 2511.16507 | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et.al. | 2511.16114 | null |
| 2025-11-19 | PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback | Sirui Chen et.al. | 2511.15253 | null |
| 2025-11-18 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR | Gabrial Zencha Ashungafac et.al. | 2511.14255 | null |
| 2025-11-17 | Large cliques in graphs with forbidden semi-induced structures | Nannan Chen et.al. | 2511.13073 | null |
| 2025-11-16 | Leave-One-Out Learning with Log-Loss | Yaniv Fogel et.al. | 2511.12718 | null |
| 2025-11-16 | Sample Complexity of Agnostic Multiclass Classification: Natarajan Dimension Strikes Back | Alon Cohen et.al. | 2511.12659 | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et.al. | 2511.12347 | null |
| 2025-11-14 | Volatility in Certainty (VC): A Metric for Detecting Adversarial Perturbations During Inference in Neural Network Classifiers | Vahid Hemmati et.al. | 2511.11834 | null |
| 2025-11-14 | Vortex breakdown and its topologies in turbulent flows within a typical swirl combustor geometry | Nitesh Kumar Sahu et.al. | 2511.11420 | null |
| 2025-11-13 | FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features | Wenyu Wang et.al. | 2511.10112 | null |
| 2025-11-12 | Sample Complexity of Quadratically Regularized Optimal Transport | Alberto González-Sanz et.al. | 2511.09807 | null |
| 2025-11-13 | Reduced-Complexity Model Selection and Rate Allocation for Multiple-Model Electrical Signal Compression | Corentin Presvôts et.al. | 2511.09370 | null |
| 2025-11-12 | VC-dimension of Salem sets over finite fields | Moustapha Diallo et.al. | 2511.08963 | null |
| 2025-11-12 | HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios | Bingsong Bai et.al. | 2511.08496 | null |
| 2025-11-10 | ConvFill: Model Collaboration for Responsive Conversational Voice Agents | Vidya Srinivas et.al. | 2511.07397 | null |
| 2025-11-10 | Generating Novel and Realistic Speakers for Voice Conversion | Meiying Melissa Chen et.al. | 2511.07135 | null |
| 2025-11-10 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis | Zhisheng Zhang et.al. | 2511.07099 | null |
| 2025-11-10 | Personalizing Emotion-aware Conversational Agents? Exploring User Traits-driven Conversational Strategies for Enhanced Interaction | Yuchong Zhang et.al. | 2511.06954 | null |
| 2025-11-09 | How Founder Expertise Shapes the Impact of Generative Artificial Intelligence on Digital Ventures | Ruiqing Cao et.al. | 2511.06545 | null |
| 2025-11-06 | Vector Traits Shape Disease Persistence: A Predator Prey Approach to Dengue | Piyumi Chathurangika et.al. | 2511.04276 | null |
| 2025-11-04 | Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning | David Kattermann et.al. | 2511.02644 | null |
| 2025-10-31 | Consequences of Dependent Dividing on Burden | Yuki Takahashi et.al. | 2511.00282 | null |
| 2025-10-31 | NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion | Zongyang Du et.al. | 2511.00256 | null |
| 2025-10-30 | UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens | Chengwei Liu et.al. | 2510.26372 | null |
| 2025-10-28 | Bayesian Speech synthesizers Can Learn from Multiple Teachers | Ziyang Zhang et.al. | 2510.24372 | null |
| 2025-10-24 | StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks | Jingyue Huang et.al. | 2510.21685 | null |
| 2025-10-23 | Charge-density waves and stripes in quarter metals of graphene heterostructures | Sk Asrap Murshed et.al. | 2510.20816 | null |
| 2025-10-23 | R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion | Junjie Zheng et.al. | 2510.20677 | null |
| 2025-10-22 | VBx for End-to-End Neural and Clustering-based Diarization | Petr Pálka et.al. | 2510.19572 | null |
| 2025-10-20 | Fast Agnostic Learners in the Plane | Talya Eden et.al. | 2510.18057 | null |
| 2025-10-20 | Joint upper Banach density, VC dimensions and Euclidean point configurations | Bruno Predojević et.al. | 2510.17453 | null |
| 2025-10-23 | The Parameterized Complexity of Computing the VC-Dimension | Florent Foucaud et.al. | 2510.17451 | null |
| 2025-10-18 | Truly Subquadratic Time Algorithms for Diameter and Related Problems in Graphs of Bounded VC-dimension | Timothy M. Chan et.al. | 2510.16346 | null |
| 2025-10-22 | VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition | Kye Shimizu et.al. | 2510.16192 | null |
| 2025-10-16 | Deadlock-free routing for Full-mesh networks without using Virtual Channels | Alejandro Cano et.al. | 2510.14730 | null |
| 2025-10-15 | The VC-dimension and point configurations in | Alex Iosevich et.al. | 2510.13984 | null |
| 2025-10-16 | VC-Dimension vs Degree: An Uncertainty Principle for Boolean Functions | Fan Chang et.al. | 2510.13705 | null |
| 2025-10-15 | Model-assisted estimation for MRV: How to boost the economics of SOC sequestration projects without compromising on scientific integrity | Ahmad Awad et.al. | 2510.13609 | null |
| 2025-10-15 | Target Controllability Score | Kazuhiro Sato et.al. | 2510.13354 | link |
| 2025-10-14 | VCTR: A Transformer-Based Model for Non-parallel Voice Conversion | Maharnab Saikia et.al. | 2510.12964 | null |
| 2025-10-15 | (R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm | Kevin Krings et.al. | 2510.12364 | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et.al. | 2510.11124 | null |
| 2025-10-13 | VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents | Jiliang Hu et.al. | 2510.11098 | null |
| 2025-10-10 | A Scalable, Privacy-Preserving Decentralized Identity and Verifiable Data Sharing Framework based on Zero-Knowledge Proofs | Hui Yuan et.al. | 2510.09715 | null |
| 2025-10-10 | SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion | Zhao Guo et.al. | 2510.09245 | null |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et.al. | 2510.09061 | null |
| 2025-10-09 | MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows | Guobin Ma et.al. | 2510.08392 | null |
| 2025-10-09 | What Makes a Visualization Complex? | Mengdi Chu et.al. | 2510.08332 | null |
| 2025-10-09 | VoiceAgentBench: Are Voice Assistants ready for agentic tasks? | Dhruv Jain et.al. | 2510.07978 | null |

| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-11-20 | Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO | Junhao Cheng et.al. | 2511.16669 | null |
| 2025-11-20 | V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models | Yang Luo et.al. | 2511.16668 | null |
| 2025-11-20 | SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking | Haofeng Liu et.al. | 2511.16618 | null |
| 2025-11-20 | YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras | Fan Yang et.al. | 2511.16521 | null |
| 2025-11-20 | An analytical and experimental study of the energy transition discourse on YouTube | Aleix Bassolas et.al. | 2511.16497 | null |
| 2025-11-20 | Flow and Depth Assisted Video Prediction with Latent Transformer | Eliyas Suleyman et.al. | 2511.16484 | null |
| 2025-11-20 | PIPHEN: Physical Interaction Prediction with Hamiltonian Energy Networks | Kewei Chen et.al. | 2511.16200 | null |
| 2025-11-20 | FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos | Jeremie Ochin et.al. | 2511.16183 | null |
| 2025-11-20 | Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight | Yi Yang et.al. | 2511.16175 | null |
| 2025-11-20 | Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning | Yibin Huang et.al. | 2511.16160 | null |
| 2025-11-19 | First Frame Is the Place to Go for Video Content Customization | Jingxi Chen et.al. | 2511.15700 | null |
| 2025-11-19 | Joint Semantic-Channel Coding and Modulation for Token Communications | Jingkai Ying et.al. | 2511.15699 | null |
| 2025-11-19 | The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification | Dante Francisco Wasmuht et.al. | 2511.15622 | null |
| 2025-11-19 | Multimodal Evaluation of Russian-language Architectures | Artem Chervyakov et.al. | 2511.15552 | null |
| 2025-11-19 | Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners | Xabier Lekunberri et.al. | 2511.15468 | null |
| 2025-11-19 | ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation | Simon Boeder et.al. | 2511.15396 | null |
| 2025-11-19 | PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback | Sirui Chen et.al. | 2511.15253 | null |
| 2025-11-19 | Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation | Firdavs Nasriddinov et.al. | 2511.15159 | null |
| 2025-11-19 | Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks | Cheng Yang et.al. | 2511.15065 | null |
| 2025-11-19 | Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation | Vladimir Arkhipkin et.al. | 2511.14993 | null |
| 2025-11-18 | Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising | Yifan Wang et.al. | 2511.14719 | null |
| 2025-11-18 | FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation | Yunfeng Wu et.al. | 2511.14712 | null |
| 2025-11-18 | ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection | Mohammad Romani et.al. | 2511.14554 | null |
| 2025-11-18 | DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation | Xiangchen Yin et.al. | 2511.14530 | null |
| 2025-11-18 | FlowRoI A Fast Optical Flow Driven Region of Interest Extraction Framework for High-Throughput Image Compression in Immune Cell Migration Analysis | Xiaowei Xu et.al. | 2511.14419 | null |
| 2025-11-18 | ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries | Junfu Pu et.al. | 2511.14349 | null |
| 2025-11-18 | Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs | Yiyi Miao et.al. | 2511.14315 | null |
| 2025-11-18 | Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning | Rui Liu et.al. | 2511.14249 | null |
| 2025-11-18 | InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior | Weimin Bai et.al. | 2511.14208 | null |
| 2025-11-18 | Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion | Zhuo Li et.al. | 2511.14178 | null |
| 2025-11-17 | Segment Anything Across Shots: A Method and Benchmark | Hengrui Hu et.al. | 2511.13715 | null |
| 2025-11-17 | UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity | Junwei Yu et.al. | 2511.13714 | null |
| 2025-11-17 | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | Harold Haodong Chen et.al. | 2511.13704 | null |
| 2025-11-17 | Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting | Jiangnan Ye et.al. | 2511.13684 | null |
| 2025-11-17 | CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding | Shrenik Patel et.al. | 2511.13644 | null |
| 2025-11-17 | Computer Vision based group activity detection and action spotting | Narthana Sivalingam et.al. | 2511.13315 | null |
| 2025-11-17 | CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving | Enhui Ma et.al. | 2511.13297 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-17 | Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification | Rifen Lin et.al. | 2511.13150 | null |
| 2025-11-17 | VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language | Zonghao Ying et.al. | 2511.13127 | null |
| 2025-11-14 | Scalable Policy Evaluation with Video World Models | Wei-Cheng Tseng et.al. | 2511.11520 | null |
| 2025-11-14 | Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis | Feng-Qi Cui et.al. | 2511.11406 | null |
| 2025-11-14 | YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation | Pavel Rojtberg et.al. | 2511.11344 | null |
| 2025-11-14 | RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting | Ruocheng Wu et.al. | 2511.11213 | null |
| 2025-11-14 | VIDEOP2R: Video Understanding from Perception to Reasoning | Yifan Jiang et.al. | 2511.11113 | null |
| 2025-11-14 | LiteAttention: A Temporal Sparse Attention for Diffusion Transformers | Dor Shmilovich et.al. | 2511.11062 | null |
| 2025-11-14 | EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation | Zongyang Qiu et.al. | 2511.11002 | null |
| 2025-11-14 | Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment | Wenbin Bai et.al. | 2511.10987 | null |
| 2025-11-14 | Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition | Gunho Jung et.al. | 2511.10958 | null |
| 2025-11-14 | Language-Guided Graph Representation Learning for Video Summarization | Wenrui Li et.al. | 2511.10953 | null |

| Publish Date | Title | Authors | Code | |
|---|---|---|---|---|
| 2025-11-20 | Dataset Distillation for Pre-Trained Self-Supervised Vision Models | George Cazenavette et.al. | 2511.16674 | null |
| 2025-11-20 | EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards | Omkar Thawakar et.al. | 2511.16672 | null |
| 2025-11-20 | V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models | Yang Luo et.al. | 2511.16668 | null |
| 2025-11-20 | SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation | Zhenyuan Qin et.al. | 2511.16666 | null |
| 2025-11-20 | Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems | Elias Lumer et.al. | 2511.16654 | null |
| 2025-11-20 | Measurement incompatibility in Bayesian multiparameter quantum estimation | Francesco Albarelli et.al. | 2511.16645 | null |
| 2025-11-20 | SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction | Guolin Huang et.al. | 2511.16635 | null |
| 2025-11-20 | SAM 3D: 3Dfy Anything in Images | SAM 3D Team et.al. | 2511.16624 | null |
| 2025-11-20 | Formal Abductive Latent Explanations for Prototype-Based Networks | Jules Soria et.al. | 2511.16588 | null |
| 2025-11-20 | PolyMinHash: Efficient Area-Based MinHashing of Polygons for Approximate Nearest Neighbor Search | Alima Subedi et.al. | 2511.16576 | null |
| 2025-11-19 | GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization | Yikun Wang et.al. | 2511.15705 | null |
| 2025-11-19 | Think Visually, Reason Textually: Vision-Language Synergy in ARC | Beichen Zhang et.al. | 2511.15703 | null |
| 2025-11-19 | Joint Semantic-Channel Coding and Modulation for Token Communications | Jingkai Ying et.al. | 2511.15699 | null |
| 2025-11-19 | VisPlay: Self-Evolving Vision-Language Models from Images | Yicheng He et.al. | 2511.15661 | null |
| 2025-11-19 | When to Think and When to Look: Uncertainty-Guided Lookback | Jing Bi et.al. | 2511.15613 | null |
| 2025-11-19 | MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation | Bin Xie et.al. | 2511.15603 | null |
| 2025-11-19 | US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery | Miruna-Alexandra Gafencu et.al. | 2511.15600 | null |
| 2025-11-19 | Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector | Weiheng Zhu et.al. | 2511.15571 | null |
| 2025-11-19 | Multimodal Evaluation of Russian-language Architectures | Artem Chervyakov et.al. | 2511.15552 | null |
| 2025-11-19 | UltraDP: Generalizable Carotid Ultrasound Scanning with Force-Aware Diffusion Policy | Ruoqu Chen et.al. | 2511.15550 | null |
| 2025-11-18 | ARC Is a Vision Problem! | Keya Hu et.al. | 2511.14761 | null |
| 2025-11-18 | UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning | Rui Tian et.al. | 2511.14760 | null |
| 2025-11-18 | Cell Shape Emerges from Motion | Gautham Gopinath et.al. | 2511.14707 | null |
| 2025-11-18 | Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances | Rishu Kumar Singh et.al. | 2511.14693 | null |
| 2025-11-18 | A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases | Tao Yang et.al. | 2511.14638 | null |
| 2025-11-18 | SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction | Meiying Gu et.al. | 2511.14633 | null |
| 2025-11-18 | Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains | Qingwei Ben et.al. | 2511.14625 | null |
| 2025-11-18 | XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation | Yilin Zhang et.al. | 2511.14604 | null |
| 2025-11-18 | Task Addition and Weight Disentanglement in Closed-Vocabulary Models | Adam Hazimeh et.al. | 2511.14569 | null |
| 2025-11-18 | A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement | Yufeng Tian et.al. | 2511.14521 | null |
| 2025-11-17 | Back to Basics: Let Denoising Generative Models Denoise | Tianhong Li et.al. | 2511.13720 | null |
| 2025-11-17 | UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity | Junwei Yu et.al. | 2511.13714 | null |
| 2025-11-17 | Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine | Xincheng Shuai et.al. | 2511.13713 | null |
| 2025-11-17 | TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models | Harold Haodong Chen et.al. | 2511.13704 | null |
| 2025-11-17 | Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation | Sofia Jamil et.al. | 2511.13689 | null |
| 2025-11-17 | Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting | Jiangnan Ye et.al. | 2511.13684 | null |
| 2025-11-17 | Cross-Learning from Scarce Data via Multi-Task Constrained Optimization | Leopoldo Agorio et.al. | 2511.13680 | null |
| 2025-11-17 | PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image | Ziang Cao et.al. | 2511.13648 | null |
| 2025-11-17 | Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures | Haohui Wang et.al. | 2511.13640 | null |
| 2025-11-17 | VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping | Haotian Dong et.al. | 2511.13587 | null |
| 2025-11-14 | LARM: A Large Articulated-Object Reconstruction Model | Sylvia Yuan et.al. | 2511.11563 | null |
| 2025-11-14 | Bridging Hidden States in Vision-Language Models | Benjamin Fein-Ashley et.al. | 2511.11526 | null |
| 2025-11-14 | CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation | Luthira Abeykoon et.al. | 2511.11522 | null |
| 2025-11-14 | SynthSoM-Twin: A Multi-Modal Sensing-Communication Digital-Twin Dataset for Sim2Real Transfer via Synesthesia of Machines | Junlong Chen et.al. | 2511.11503 | null |
| 2025-11-14 | PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models | Nhat Hoang-Xuan et.al. | 2511.11502 | null |
| 2025-11-14 | Visible and Terahertz Nonlinear Responses in the Topological Noble Metal Dichalcogenide PdTe2 | George J. de Coster et.al. | 2511.11493 | null |
| 2025-11-14 | Data-efficient U-Net for Segmentation of Carbide Microstructures in SEM Images of Steel Alloys | Alinda Ezgi Gerçek et.al. | 2511.11485 | null |
| 2025-11-14 | ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation | Kaishen Wang et.al. | 2511.11483 | null |
| 2025-11-14 | Inferring response times of perceptual decisions with Poisson variational autoencoders | Hayden R. Johnson et.al. | 2511.11480 | null |
| 2025-11-14 | Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification | Qinghao Gao et.al. | 2511.11460 | null |

| Publish Date | Title | Authors | Code | |
|---|---|---|---|---|
| 2025-11-20 | Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation | Elena V. Epure et.al. | 2511.16478 | null |
| 2025-11-20 | Difficulty-Controlled Simplification of Piano Scores with Synthetic Data for Inclusive Music Education | Pedro Ramoneda et.al. | 2511.16228 | null |
| 2025-11-19 | Step-Audio-R1 Technical Report | Fei Tian et.al. | 2511.15848 | null |
| 2025-11-19 | LargeSHS: A large-scale dataset of music adaptation | Chih-Pin Tan et.al. | 2511.15270 | null |
| 2025-11-19 | Aligning Generative Music AI with Human Preferences: Methods and Challenges | Dorien Herremans et.al. | 2511.15038 | null |
| 2025-11-18 | A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder | Dengyun Huang et.al. | 2511.14600 | null |
| 2025-11-18 | MuCPT: Music-related Natural Language Model Continued Pretraining | Kai Tian et.al. | 2511.14245 | null |
| 2025-11-17 | Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases | Antonio Manuel Martínez-Heredia et.al. | 2511.13987 | null |
| 2025-11-17 | Preference-Based Learning in Audio Applications: A Systematic Analysis | Aaron Broukhim et.al. | 2511.13936 | null |
| 2025-11-17 | FoleyBench: A Benchmark For Video-to-Audio Models | Satvik Dixit et.al. | 2511.13219 | null |
| 2025-11-13 | Music Flamingo: Scaling Music Understanding in Audio Language Models | Sreyan Ghosh et.al. | 2511.10289 | null |
| 2025-11-14 | Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation | Xinyi Tong et.al. | 2511.09585 | null |
| 2025-11-12 | Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation | Shulei Ji et.al. | 2511.09090 | null |
| 2025-11-12 | Design of a Six-band, 2.4-Octave (80-420 GHz) Hierarchically Summed Phased-Array Slot-Dipole Antenna Array for NEW-MUSIC | Xiaolan Huang et.al. | 2511.08990 | null |
| 2025-11-12 | Improved Modeling of Quasi-Static Thermal and Optical Response of Lumped-Element Aluminum Manganese KIDs | Adriana Gavidia et.al. | 2511.08959 | null |
| 2025-11-12 | Low-Frequency Noise Performance of Microstrip-Coupled Lumped-Element Aluminum KIDs using Hydrogenated Amorphous Silicon Parallel-Plate Capacitors for NEW-MUSIC | Simon Hempel-Costello et.al. | 2511.08898 | null |
| 2025-11-11 | Chord-conditioned Melody and Bass Generation | Alexandra C Salem et.al. | 2511.08755 | null |
| 2025-11-14 | Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models | Yi Yang et.al. | 2511.08252 | null |
| 2025-11-11 | Automatic Music Mixing using a Generative Model of Effect Embeddings | Eloi Moliner et.al. | 2511.08040 | null |
| 2025-11-10 | Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics | Jonathan Lehmkuhl et.al. | 2511.07268 | null |
| 2025-11-06 | MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers | Ali Boudaghi et.al. | 2511.04376 | null |
| 2025-11-06 | MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation | Shih-Lun Wu et.al. | 2511.03942 | null |
| 2025-11-02 | Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures | Barathi Subramanian et.al. | 2511.00793 | null |
| 2025-10-28 | GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment | Jinting Wang et.al. | 2510.26818 | null |
| 2025-10-27 | Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders | Nathan Paek et.al. | 2510.23802 | null |
| 2025-10-25 | Streaming Generation for Music Accompaniment | Yusong Wu et.al. | 2510.22105 | null |
| 2025-10-23 | GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer | Jackson Loth et.al. | 2510.21872 | null |
| 2025-10-21 | Steering Autoregressive Music Generation with Recursive Feature Machines | Daniel Zhao et.al. | 2510.19127 | null |
| 2025-10-18 | MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding | Jingyue Huang et.al. | 2510.16273 | null |
| 2025-10-16 | Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics? | Qixin Deng et.al. | 2510.14249 | null |
| 2025-10-15 | UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE | Zhenyu Liu et.al. | 2510.13344 | null |
| 2025-10-17 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | null |
| 2025-10-11 | ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis | Stephen Ni-Hahn et.al. | 2510.10249 | null |
| 2025-10-07 | LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment | Jiahao Mei et.al. | 2510.05875 | null |
| 2025-10-02 | Bias beyond Borders: Global Inequalities in AI-Generated Music | Ahmet Solak et.al. | 2510.01963 | null |
| 2025-10-15 | SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing | Jiaye Tan et.al. | 2510.00395 | null |
| 2025-10-04 | HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling | Hung-Ying Chu et.al. | 2509.25694 | null |
| 2025-09-29 | Ethics Statements in AI Music Papers: The Effective and the Ineffective | Julia Barnett et.al. | 2509.25496 | null |
| 2025-09-29 | Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music | Tianle Wang et.al. | 2509.24603 | null |
| 2025-10-01 | An Agent-Based Framework for Automated Higher-Voice Harmony Generation | Nia D'Souza Ganapathy et.al. | 2509.24463 | null |
| 2025-09-28 | Time-Shifted Token Scheduling for Symbolic Music Generation | Ting-Kang Wang et.al. | 2509.23749 | null |
| 2025-09-28 | AudioMoG: Guiding Audio Generation with Mixture-of-Guidance | Junyou Wang et.al. | 2509.23727 | null |
| 2025-09-27 | AI-Assisted Music Production: A User Study on Text-to-Music Models | Francesca Ronchini et.al. | 2509.23364 | null |
| 2025-09-26 | Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach | Zijian Zhao et.al. | 2509.22378 | null |
| 2025-09-26 | MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan | Xuanchen Wang et.al. | 2509.21714 | null |
| 2025-09-21 | Difficulty-Aware Score Generation for Piano Sight-Reading | Pedro Ramoneda et.al. | 2509.16913 | null |
| 2025-09-17 | Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure | Shulei Ji et.al. | 2509.13658 | null |
| 2025-09-13 | A Traditional Approach to Symbolic Piano Continuation | Christian Zhou-Zheng et.al. | 2509.12267 | null |
| 2025-09-14 | Decoding Musical Origins: Distinguishing Human and AI Composers | Cheng-Yang Tsai et.al. | 2509.11369 | null |
| 2025-09-14 | STASE: A spatialized text-to-audio synthesis engine for music generation | Tutti Chi et.al. | 2509.11124 | null |

| Publish Date | Title | Authors | Code | |
|---|---|---|---|---|
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et.al. | 2511.16639 | null |
| 2025-11-20 | SUNAC: Source-aware Unified Neural Audio Codec | Ryo Aihara et.al. | 2511.16126 | null |
| 2025-11-18 | OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models | Keda Tao et.al. | 2511.14582 | null |
| 2025-11-18 | Segmentwise Pruning in Audio-Language Models | Marcel Gibier et.al. | 2511.14293 | null |
| 2025-11-18 | SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM | An Yu et.al. | 2511.14143 | null |
| 2025-11-17 | PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement | Xiaobin Rong et.al. | 2511.13300 | null |
| 2025-11-16 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et.al. | 2511.12347 | null |
| 2025-11-15 | Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound | Dengming Zhang et.al. | 2511.12077 | null |
| 2025-11-14 | Evaluation of Audio Compression Codecs | Thien T. Duong et.al. | 2511.11527 | null |
| 2025-11-14 | AV-Dialog: Spoken Dialogue Models with Audio-Visual Input | Tuochao Chen et.al. | 2511.11124 | null |
| 2025-11-14 | AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization | Zhonghua Jiang et.al. | 2511.11106 | null |
| 2025-11-14 | TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models | Hualei Wang et.al. | 2511.11039 | null |
| 2025-11-09 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment | Yan Gao et.al. | 2511.10670 | null |
| 2025-11-13 | VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction | Yuhao Wang et.al. | 2511.10232 | null |
| 2025-11-13 | Towards Leveraging Sequential Structure in Animal Vocalizations | Eklavya Sarkar et.al. | 2511.10190 | null |
| 2025-11-12 | POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation | Xuanchen Li et.al. | 2511.09232 | null |
| 2025-11-12 | HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios | Bingsong Bai et.al. | 2511.08496 | null |
| 2025-11-10 | Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | Umberto Cappellazzo et.al. | 2511.07253 | null |
| 2025-11-10 | Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection | Brage Eilertsen et.al. | 2511.07065 | null |
| 2025-11-08 | BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction | Haoran Wang et.al. | 2511.06150 | null |
| 2025-11-05 | Seeing What You Say: Expressive Image Generation from Speech | Jiyoung Lee et.al. | 2511.03423 | null |
| 2025-11-05 | Open Source State-Of-the-Art Solution for Romanian Speech Recognition | Gabriel Pirlogeanu et.al. | 2511.03361 | null |
| 2025-11-05 | audio2chart: End to End Audio Transcription into playable Guitar Hero charts | Riccardo Tripodi et.al. | 2511.03337 | null |
| 2025-11-04 | An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM | Jiawei Liu et.al. | 2511.02234 | null |
| 2025-11-03 | ADNAC: Audio Denoiser using Neural Audio Codec | Daniel Jimon et.al. | 2511.01773 | null |
| 2025-10-30 | UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens | Chengwei Liu et.al. | 2510.26372 | null |
| 2025-10-30 | Modeling strategies for speech enhancement in the latent space of a neural audio codec | Sofiene Kammoun et.al. | 2510.26299 | null |
| 2025-10-29 | PitchFlower: A flow-based neural audio codec with pitch controllability | Diego Torres et.al. | 2510.25566 | null |
| 2025-10-29 | Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR | Shreyas Gopal et.al. | 2510.25150 | null |
| 2025-10-28 | Bayesian Speech synthesizers Can Learn from Multiple Teachers | Ziyang Zhang et.al. | 2510.24372 | null |
| 2025-10-28 | Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations | Ahmad Ghannam et.al. | 2510.24247 | null |
| 2025-10-28 | Low-Resource Audio Codec (LRAC): 2025 Challenge Description | Kamil Wojcicki et.al. | 2510.23312 | null |
| 2025-10-25 | FOA Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss | Parthasaarathy Sudarsanam et.al. | 2510.22241 | null |
| 2025-10-24 | SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain | Zixiang Wan et.al. | 2510.21209 | null |
| 2025-10-24 | Robust Distortion-Free Watermark for Autoregressive Audio Generation Models | Yihan Wu et.al. | 2510.21115 | null |
| 2025-10-23 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding | Xin Zhang et.al. | 2510.20504 | null |
| 2025-10-23 | UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement | Haoyin Yan et.al. | 2510.20441 | null |
| 2025-10-19 | SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization | Wenxi Chen et.al. | 2510.16841 | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et.al. | 2510.16718 | null |
| 2025-10-17 | LDCodec: A high quality neural audio codec with low-complexity decoder | Jiawei Jiang et.al. | 2510.15364 | null |
| 2025-10-17 | Extending Audio Context for Long-Form Understanding in Large Audio-Language Models | Yuatyong Chaichana et.al. | 2510.15231 | null |
| 2025-10-20 | LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models | Xiaohan Zhao et.al. | 2510.15227 | null |
| 2025-10-16 | TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation | Ming-Hao Hsu et.al. | 2510.14934 | null |
| 2025-10-15 | Acoustic Teleportation via Disentangled Neural Audio Codec Representations | Philipp Grundhuber et.al. | 2510.13221 | null |
| 2025-10-13 | UALM: Unified Audio Language Model for Understanding, Generation and Reasoning | Jinchuan Tian et.al. | 2510.12000 | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et.al. | 2510.11646 | null |
| 2025-10-12 | FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec | Yurii Halychanskyi et.al. | 2510.10785 | null |
| 2025-10-11 | SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation | Zeyu Ling et.al. | 2510.10069 | null |
| 2025-10-11 | MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction | Jianjin Wang et.al. | 2510.10003 | null |

| Publish Date | Title | Authors | Code | |
|---|---|---|---|---|
| 2025-11-20 | Cognitive Foundations for Reasoning and Their Manifestation in LLMs | Priyanka Kargupta et.al. | 2511.16660 | null |
| 2025-11-20 | SUNAC: Source-aware Unified Neural Audio Codec | Ryo Aihara et.al. | 2511.16126 | null |
| 2025-11-20 | Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio | Mohan Shi et.al. | 2511.16046 | null |
| 2025-11-20 | Multimodal Evaluation of Russian-language Architectures | Artem Chervyakov et.al. | 2511.15552 | null |
| 2025-11-19 | Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding | Mingyue Huo et.al. | 2511.15145 | null |
| 2025-11-18 | A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder | Dengyun Huang et.al. | 2511.14600 | null |
| 2025-11-18 | OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models | Keda Tao et.al. | 2511.14582 | null |
| 2025-11-18 | Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning | Trishala Jayesh Ahalpara et.al. | 2511.14445 | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | null |
| 2025-11-18 | Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions | Marcel Gibier et.al. | 2511.14307 | null |
| 2025-11-18 | Segmentwise Pruning in Audio-Language Models | Marcel Gibier et.al. | 2511.14293 | null |
| 2025-11-18 | SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM | An Yu et.al. | 2511.14143 | null |
| 2025-11-18 | O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents | Piaohong Wang et.al. | 2511.13593 | null |
| 2025-11-17 | Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs | Zhe Sun et.al. | 2511.13273 | null |
| 2025-11-17 | You Only Look Omni Gradient Backpropagation for Moving Infrared Small Target Detection | Guoyi Zhang et.al. | 2511.13013 | null |
| 2025-11-16 | Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data | Yunxin Li et.al. | 2511.12609 | null |
| 2025-11-16 | DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions | Xiaoyu Lin et.al. | 2511.12452 | null |
| 2025-11-16 | SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs | Shail Desai et.al. | 2511.12404 | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et.al. | 2511.12347 | null |
| 2025-11-15 | Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound | Dengming Zhang et.al. | 2511.12077 | null |
| 2025-11-14 | AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization | Zhonghua Jiang et.al. | 2511.11106 | null |
| 2025-11-14 | TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models | Hualei Wang et.al. | 2511.11039 | null |
| 2025-11-14 | DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition | HongYu Liu et.al. | 2511.11000 | null |
| 2025-11-14 | Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio | Guangke Chen et.al. | 2511.10913 | null |
| 2025-11-14 | OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer | Haosong Peng et.al. | 2511.10560 | null |
| 2025-11-13 | Music Flamingo: Scaling Music Understanding in Audio Language Models | Sreyan Ghosh et.al. | 2511.10289 | null |
| 2025-11-13 | OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models | Yuping Yan et.al. | 2511.10287 | null |
| 2025-11-14 | Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard | Yudong Yang et.al. | 2511.10222 | null |
| 2025-11-13 | When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? | Qilang Ye et.al. | 2511.10059 | null |
| 2025-11-13 | Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism | Jinhong Jeong et.al. | 2511.10045 | null |
| 2025-11-13 | Reinforcing Trustworthiness in Multimodal Emotional Support Systems | Huy M. Le et.al. | 2511.10011 | null |
| 2025-11-13 | Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation | Xiangyi Wei et.al. | 2511.09958 | null |
| 2025-11-13 | HI-TransPA: Hearing Impairments Translation Personal Assistant | Zhiming Ma et.al. | 2511.09915 | null |
| 2025-11-12 | State Space Modeling of Mortgage Default Rates under Natural Hazard Shocks | Samuel J. Eschker et.al. | 2511.09698 | null |
| 2025-11-11 | Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | Umberto Cappellazzo et.al. | 2511.07253 | link |
| 2025-11-06 | CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese | Dazhong Chen et.al. | 2511.04139 | null |
| 2025-11-06 | WST: Weakly Supervised Transducer for Automatic Speech Recognition | Dongji Gao et.al. | 2511.04035 | null |
| 2025-11-05 | Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything | Huawei Lin et.al. | 2511.02834 | null |
| 2025-11-05 | The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models | Claudia Herambourg et.al. | 2511.02589 | null |
| 2025-11-03 | SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia | Chaoqun Liu et.al. | 2511.01670 | null |
| 2025-11-03 | Classification of motor faults based on transmission coefficient and reflection coefficient of omni-directional antenna using DCNN | Sagar Dutta et.al. | 2511.01371 | null |
| 2025-11-06 | OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation | Heyu Guo et.al. | 2511.01210 | null |
| 2025-11-02 | Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models | Junqi Zhao et.al. | 2511.01091 | null |
| 2025-10-31 | LongCat-Flash-Omni Technical Report | Meituan LongCat Team et.al. | 2511.00279 | null |
| 2025-10-31 | Sensor operating point calibration and monitoring of the ALICE Inner Tracking System during LHC Run 3 | D. Agguiaro et.al. | 2510.27592 | null |
| 2025-10-30 | ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models | Weifei Jin et.al. | 2510.26096 | null |
| 2025-10-29 | Convergence of a Relative-type Inexact Proximal ALM for Convex Nonlinear Programming | Lei Yang et.al. | 2510.25261 | null |
| 2025-10-28 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | null |
| 2025-10-28 | Generative View Stitching | Chonghyuk Song et.al. | 2510.24718 | null |
| 2025-10-28 | STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence | Zihan Liu et.al. | 2510.24693 | null |