ZhikangNiu/arxiv_daily

Updated on 2025.11.23

Usage instructions: here

Table of Contents
  1. Text to Speech
  2. Text to Audio
  3. Video to Audio
  4. Voice Conversion
  5. Video Generation
  6. Image Generation
  7. Music Generation
  8. Audio Codec
  9. Large Audio Language Model

Text to Speech

Publish Date Title Authors PDF Code
2025-11-20 Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs Wei-Cheng Tseng et.al. 2511.16639 null
2025-11-20 WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue Zachary Ellis et.al. 2511.16544 null
2025-11-20 SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise Rui Sang et.al. 2511.16114 null
2025-11-19 Universal TT- and TQ-relations via centrally extended q-Onsager algebra Pascal Baseilhac et.al. 2511.15876 null
2025-11-19 Step-Audio-R1 Technical Report Fei Tian et.al. 2511.15848 null
2025-11-19 A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification Mohit Sharma et.al. 2511.15766 null
2025-11-19 PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback Sirui Chen et.al. 2511.15253 null
2025-11-19 Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding Mingyue Huo et.al. 2511.15145 null
2025-11-19 Aligning Generative Music AI with Human Preferences: Methods and Challenges Dorien Herremans et.al. 2511.15038 null
2025-11-18 Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion Zanxu Wang et.al. 2511.14969 null
2025-11-18 PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants Mingkun Yu et.al. 2511.14852 null
2025-11-18 Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech Nam-Gyu Kim et.al. 2511.14824 null
2025-11-18 Ground Truth Generation for Multilingual Historical NLP using LLMs Clovis Gladstone et.al. 2511.14688 null
2025-11-18 TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation Wei Liu et.al. 2511.14410 null
2025-11-18 Periods in equivariant and motivic contexts Martin Gallauer et.al. 2511.14325 null
2025-11-18 AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR Gabrial Zencha Ashungafac et.al. 2511.14255 null
2025-11-18 Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning Rui Liu et.al. 2511.14249 link
2025-11-18 StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model Yifan Yang et.al. 2511.14223 null
2025-11-18 FxSearcher: gradient-free text-driven audio transformation Hojoon Ki et.al. 2511.14138 null
2025-11-17 Human-centric Maintenance Process Through Integration of AI, Speech, and AR Parul Khanna et.al. 2511.13918 null
2025-11-17 Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video Filippo Cenacchi, Longbing Cao et.al. 2511.13802 null
2025-11-17 PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement Xiaobin Rong et.al. 2511.13300 null
2025-11-17 Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms Patrick Parschan et.al. 2511.13238 null
2025-11-17 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-17 Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis Zaara Zabeen Arpa et.al. 2511.13159 link
2025-11-17 A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning Liuyi Jin et.al. 2511.13078 null
2025-11-17 CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models Mehrab Mustafy Rahman et.al. 2511.12964 null
2025-11-16 Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data Sina Rashidi et.al. 2511.12690 null
2025-11-16 Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans Hongbin Huang et.al. 2511.12662 null
2025-11-16 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Yunxin Li et.al. 2511.12609 null
2025-11-16 DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions Xiaoyu Lin et.al. 2511.12452 null
2025-11-14 Proactive Hearing Assistants that Isolate Egocentric Conversations Guilin Hu et.al. 2511.11473 link
2025-11-14 Language-Aided State Estimation Yuki Miyoshi et.al. 2511.11285 null
2025-11-14 Extended-Krylov-subspace methods for trust-region and norm-regularization subproblems Hussam Al Daas et.al. 2511.11135 null
2025-11-14 Analysing Personal Attacks in U.S. Presidential Debates Ruban Goyal et.al. 2511.11108 null
2025-11-14 CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation Crystal Min Hui Poon et.al. 2511.11104 null
2025-11-14 CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding Yifan Zhuang et.al. 2511.10935 null
2025-11-14 Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio Guangke Chen et.al. 2511.10913 null
2025-11-13 Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces Farhan Sheth et.al. 2511.10793 null
2025-11-13 Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning Girish et.al. 2511.10790 null
2025-11-13 XSNAP: An X-ray Supernova Analysis Pipeline with Application to the Type II Supernova 2024ggi Ferdinand et.al. 2511.10744 null
2025-11-13 Music Flamingo: Scaling Music Understanding in Audio Language Models Sreyan Ghosh et.al. 2511.10289 null
2025-11-13 VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction Yuhao Wang et.al. 2511.10232 null
2025-11-13 Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard Yudong Yang et.al. 2511.10222 null
2025-11-13 Towards Leveraging Sequential Structure in Animal Vocalizations Eklavya Sarkar et.al. 2511.10190 link
2025-11-13 FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features Wenyu Wang et.al. 2511.10112 null
2025-11-13 Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints Xiangyue Zhang et.al. 2511.10076 null
2025-11-13 Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS Haoyu Li et.al. 2511.09995 null
2025-11-13 MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection Pritish Sahu et.al. 2511.09918 null
2025-11-12 Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages Omnilingual ASR team et.al. 2511.09690 null

(back to top)

Text to Audio

Publish Date Title Authors PDF Code
2025-11-20 Cognitive Foundations for Reasoning and Their Manifestation in LLMs Priyanka Kargupta et.al. 2511.16660 null
2025-11-20 Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs Wei-Cheng Tseng et.al. 2511.16639 null
2025-11-20 SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise Rui Sang et.al. 2511.16114 null
2025-11-19 Step-Audio-R1 Technical Report Fei Tian et.al. 2511.15848 null
2025-11-19 A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification Mohit Sharma et.al. 2511.15766 null
2025-11-20 Multimodal Evaluation of Russian-language Architectures Artem Chervyakov et.al. 2511.15552 null
2025-11-19 Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models Mehran Tamjidi et.al. 2511.15311 null
2025-11-19 Detection of spiking motifs of arbitrary length in neural activity using bounded synaptic delays Thomas Kronland-Martinet et.al. 2511.15296 null
2025-11-19 SNAP: Low-Latency Test-Time Adaptation with Sparse Updates Hyeongheon Cha et.al. 2511.15276 null
2025-11-19 LargeSHS: A large-scale dataset of music adaptation Chih-Pin Tan et.al. 2511.15270 null
2025-11-19 Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding Mingyue Huo et.al. 2511.15145 null
2025-11-19 Aligning Generative Music AI with Human Preferences: Methods and Challenges Dorien Herremans et.al. 2511.15038 null
2025-11-18 Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion Zanxu Wang et.al. 2511.14969 null
2025-11-18 RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems Jaro Meyer et.al. 2511.14948 null
2025-11-18 Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report Daniel Oliveira de Brito et.al. 2511.14939 null
2025-11-18 A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder Dengyun Huang et.al. 2511.14600 null
2025-11-18 Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning Trishala Jayesh Ahalpara et.al. 2511.14445 null
2025-11-18 TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation Wei Liu et.al. 2511.14410 null
2025-11-18 H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata Chenyang Xu et.al. 2511.14312 null
2025-11-18 Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions Marcel Gibier et.al. 2511.14307 null
2025-11-18 EBind: a practical approach to space binding Jim Broadbent et.al. 2511.14229 null
2025-11-18 StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model Yifan Yang et.al. 2511.14223 null
2025-11-18 FxSearcher: gradient-free text-driven audio transformation Hojoon Ki et.al. 2511.14138 null
2025-11-18 Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services Liuyi Jin et.al. 2511.14119 null
2025-11-17 Preference-Based Learning in Audio Applications: A Systematic Analysis Aaron Broukhim et.al. 2511.13936 null
2025-11-17 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-17 VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language Zonghao Ying et.al. 2511.13127 null
2025-11-17 A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning Liuyi Jin et.al. 2511.13078 null
2025-11-16 Open-World Test-Time Adaptation with Hierarchical Feature Aggregation and Attention Affine Ziqiong Liu et.al. 2511.12607 null
2025-11-16 DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions Xiaoyu Lin et.al. 2511.12452 null
2025-11-16 SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs Shail Desai et.al. 2511.12404 null
2025-11-15 VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing Zhisheng Zheng et.al. 2511.12347 null
2025-11-15 Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound Dengming Zhang et.al. 2511.12077 null
2025-11-15 ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation Jiahui Sun et.al. 2511.12072 null
2025-11-14 Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering Tianyu Xu et.al. 2511.11930 null
2025-11-14 Proactive Hearing Assistants that Isolate Egocentric Conversations Guilin Hu et.al. 2511.11473 null
2025-11-14 AV-Dialog: Spoken Dialogue Models with Audio-Visual Input Tuochao Chen et.al. 2511.11124 null
2025-11-14 DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition HongYu Liu et.al. 2511.11000 null
2025-11-14 Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio Guangke Chen et.al. 2511.10913 null
2025-11-13 Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces Farhan Sheth et.al. 2511.10793 null
2025-11-13 Panda: Test-Time Adaptation with Negative Data Augmentation Ruxi Deng et.al. 2511.10481 null
2025-11-13 TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities Yan Zhuang et.al. 2511.10325 null
2025-11-13 Music Flamingo: Scaling Music Understanding in Audio Language Models Sreyan Ghosh et.al. 2511.10289 null
2025-11-13 OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models Yuping Yan et.al. 2511.10287 null
2025-11-14 Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard Yudong Yang et.al. 2511.10222 null
2025-11-13 Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization Ashutosh Anshul et.al. 2511.10212 null
2025-11-13 RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo Jueun Ko et.al. 2511.10107 null
2025-11-13 When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? Qilang Ye et.al. 2511.10059 null
2025-11-13 Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism Jinhong Jeong et.al. 2511.10045 null
2025-11-13 Reinforcing Trustworthiness in Multimodal Emotional Support Systems Huy M. Le et.al. 2511.10011 null

(back to top)

Video to Audio

Publish Date Title Authors PDF Code
2025-11-20 Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty Victor Croisfelt et.al. 2511.16225 null
2025-11-19 MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features Sejuti Rahman et.al. 2511.15675 null
2025-11-20 Multimodal Evaluation of Russian-language Architectures Artem Chervyakov et.al. 2511.15552 null
2025-11-19 A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data Mauro Larrat et.al. 2511.15312 null
2025-11-18 Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion Zanxu Wang et.al. 2511.14969 null
2025-11-18 RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems Jaro Meyer et.al. 2511.14948 null
2025-11-18 OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models Keda Tao et.al. 2511.14582 null
2025-11-18 Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning Rui Liu et.al. 2511.14249 null
2025-11-18 EBind: a practical approach to space binding Jim Broadbent et.al. 2511.14229 null
2025-11-18 SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM An Yu et.al. 2511.14143 null
2025-11-18 Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services Liuyi Jin et.al. 2511.14119 null
2025-11-17 Segmenting Collision Sound Sources in Egocentric Videos Kranti Kumar Parida et.al. 2511.13863 null
2025-11-17 Towards Affect-Adaptive Human-Robot Interaction: A Protocol for Multimodal Dataset Collection on Social Anxiety Vesna Poprcova et.al. 2511.13530 null
2025-11-17 CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving Enhui Ma et.al. 2511.13297 null
2025-11-17 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-17 VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language Zonghao Ying et.al. 2511.13127 null
2025-11-17 A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning Liuyi Jin et.al. 2511.13078 null
2025-11-17 Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views Junyi Ma et.al. 2511.12878 null
2025-11-16 DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions Xiaoyu Lin et.al. 2511.12452 null
2025-11-16 SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs Shail Desai et.al. 2511.12404 null
2025-11-15 Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound Dengming Zhang et.al. 2511.12077 null
2025-11-15 ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation Jiahui Sun et.al. 2511.12072 null
2025-11-14 AV-Dialog: Spoken Dialogue Models with Audio-Visual Input Tuochao Chen et.al. 2511.11124 null
2025-11-14 AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization Zhonghua Jiang et.al. 2511.11106 null
2025-11-13 TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities Yan Zhuang et.al. 2511.10325 null
2025-11-13 OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models Yuping Yan et.al. 2511.10287 null
2025-11-13 Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization Ashutosh Anshul et.al. 2511.10212 null
2025-11-13 When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? Qilang Ye et.al. 2511.10059 null
2025-11-13 Reinforcing Trustworthiness in Multimodal Emotional Support Systems Huy M. Le et.al. 2511.10011 null
2025-11-13 Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation Xiangyi Wei et.al. 2511.09958 null
2025-11-14 HI-TransPA: Hearing Impairments Translation Personal Assistant Zhiming Ma et.al. 2511.09915 null
2025-11-12 Co-Designing Multimodal Systems for Accessible Remote Dance Instruction Ujjaini Das et.al. 2511.09658 null
2025-11-12 MCAD: Multimodal Context-Aware Audio Description Generation For Soccer Lipisha Chaudhary et.al. 2511.09448 null
2025-11-12 Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection Anushka Sanjay Shelke et.al. 2511.09039 null
2025-11-05 UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions Guozhen Zhang et.al. 2511.03334 null
2025-10-28 Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation Kang Zhang et.al. 2510.24103 null
2025-10-10 MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation Akira Takahashi et.al. 2510.09065 null
2025-10-28 Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation Liyang Chen et.al. 2510.08078 null
2025-10-09 IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries Harsh Kavediya et.al. 2510.07837 null
2025-10-07 FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders Riccardo Fosco Gramaccioni et.al. 2510.05829 null
2025-10-07 StereoSync: Spatially-Aware Stereo Audio Generation from Video Christian Marinoni et.al. 2510.05828 null
2025-10-03 SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos Amir Dellali et.al. 2510.02916 null
2025-10-02 SoundReactor: Frame-level Online Video-to-Audio Generation Koichi Saito et.al. 2510.02110 null
2025-09-29 Training-Free Multimodal Guidance for Video to Audio Generation Eleonora Grassucci et.al. 2509.24550 null
2025-09-28 AudioMoG: Guiding Audio Generation with Mixture-of-Guidance Junyou Wang et.al. 2509.23727 null
2025-09-26 WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM Changli Tang et.al. 2509.21990 null
2025-09-26 Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers Jibin Song et.al. 2509.21893 null
2025-09-24 MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization Jianxuan Yang et.al. 2509.19999 null
2025-10-05 StereoFoley: Object-Aware Stereo Audio Generation from Video Tornike Karchkhadze et.al. 2509.18272 null
2025-09-19 Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech Xinlei Niu et.al. 2509.15492 null

(back to top)

Voice Conversion

Publish Date Title Authors PDF Code
2025-11-20 Neutron star heating vs. HST observations Luis E. Rodríguez et.al. 2511.16507 null
2025-11-20 SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise Rui Sang et.al. 2511.16114 null
2025-11-19 PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback Sirui Chen et.al. 2511.15253 null
2025-11-18 AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR Gabrial Zencha Ashungafac et.al. 2511.14255 null
2025-11-17 Large cliques in graphs with forbidden semi-induced structures Nannan Chen et.al. 2511.13073 null
2025-11-16 Leave-One-Out Learning with Log-Loss Yaniv Fogel et.al. 2511.12718 null
2025-11-16 Sample Complexity of Agnostic Multiclass Classification: Natarajan Dimension Strikes Back Alon Cohen et.al. 2511.12659 null
2025-11-15 VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing Zhisheng Zheng et.al. 2511.12347 null
2025-11-14 Volatility in Certainty (VC): A Metric for Detecting Adversarial Perturbations During Inference in Neural Network Classifiers Vahid Hemmati et.al. 2511.11834 null
2025-11-14 Vortex breakdown and its topologies in turbulent flows within a typical swirl combustor geometry Nitesh Kumar Sahu et.al. 2511.11420 null
2025-11-13 FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features Wenyu Wang et.al. 2511.10112 null
2025-11-12 Sample Complexity of Quadratically Regularized Optimal Transport Alberto González-Sanz et.al. 2511.09807 null
2025-11-13 Reduced-Complexity Model Selection and Rate Allocation for Multiple-Model Electrical Signal Compression Corentin Presvôts et.al. 2511.09370 null
2025-11-12 VC-dimension of Salem sets over finite fields Moustapha Diallo et.al. 2511.08963 null
2025-11-12 HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios Bingsong Bai et.al. 2511.08496 null
2025-11-10 ConvFill: Model Collaboration for Responsive Conversational Voice Agents Vidya Srinivas et.al. 2511.07397 null
2025-11-10 Generating Novel and Realistic Speakers for Voice Conversion Meiying Melissa Chen et.al. 2511.07135 null
2025-11-10 E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis Zhisheng Zhang et.al. 2511.07099 null
2025-11-10 Personalizing Emotion-aware Conversational Agents? Exploring User Traits-driven Conversational Strategies for Enhanced Interaction Yuchong Zhang et.al. 2511.06954 null
2025-11-09 How Founder Expertise Shapes the Impact of Generative Artificial Intelligence on Digital Ventures Ruiqing Cao et.al. 2511.06545 null
2025-11-06 Vector Traits Shape Disease Persistence: A Predator Prey Approach to Dengue Piyumi Chathurangika et.al. 2511.04276 null
2025-11-04 Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning David Kattermann et.al. 2511.02644 null
2025-10-31 Consequences of Dependent Dividing on Burden Yuki Takahashi et.al. 2511.00282 null
2025-10-31 NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion Zongyang Du et.al. 2511.00256 null
2025-10-30 UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens Chengwei Liu et.al. 2510.26372 null
2025-10-28 Bayesian Speech synthesizers Can Learn from Multiple Teachers Ziyang Zhang et.al. 2510.24372 null
2025-10-24 StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks Jingyue Huang et.al. 2510.21685 null
2025-10-23 Charge-density waves and stripes in quarter metals of graphene heterostructures Sk Asrap Murshed et.al. 2510.20816 null
2025-10-23 R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion Junjie Zheng et.al. 2510.20677 null
2025-10-22 VBx for End-to-End Neural and Clustering-based Diarization Petr Pálka et.al. 2510.19572 null
2025-10-20 Fast Agnostic Learners in the Plane Talya Eden et.al. 2510.18057 null
2025-10-20 Joint upper Banach density, VC dimensions and Euclidean point configurations Bruno Predojević et.al. 2510.17453 null
2025-10-23 The Parameterized Complexity of Computing the VC-Dimension Florent Foucaud et.al. 2510.17451 null
2025-10-18 Truly Subquadratic Time Algorithms for Diameter and Related Problems in Graphs of Bounded VC-dimension Timothy M. Chan et.al. 2510.16346 null
2025-10-22 VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition Kye Shimizu et.al. 2510.16192 null
2025-10-16 Deadlock-free routing for Full-mesh networks without using Virtual Channels Alejandro Cano et.al. 2510.14730 null
2025-10-15 The VC-dimension and point configurations in $\mathbb{R}^d$ Alex Iosevich et.al. 2510.13984 null
2025-10-16 VC-Dimension vs Degree: An Uncertainty Principle for Boolean Functions Fan Chang et.al. 2510.13705 null
2025-10-15 Model-assisted estimation for MRV: How to boost the economics of SOC sequestration projects without compromising on scientific integrity Ahmad Awad et.al. 2510.13609 null
2025-10-15 Target Controllability Score Kazuhiro Sato et.al. 2510.13354 link
2025-10-14 VCTR: A Transformer-Based Model for Non-parallel Voice Conversion Maharnab Saikia et.al. 2510.12964 null
2025-10-15 (R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm Kevin Krings et.al. 2510.12364 null
2025-10-13 Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker Cheng Gong et.al. 2510.11124 null
2025-10-13 VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents Jiliang Hu et.al. 2510.11098 null
2025-10-10 A Scalable, Privacy-Preserving Decentralized Identity and Verifiable Data Sharing Framework based on Zero-Knowledge Proofs Hui Yuan et.al. 2510.09715 null
2025-10-10 SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion Zhao Guo et.al. 2510.09245 null
2025-10-10 O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion Huu Tuong Tu et.al. 2510.09061 null
2025-10-09 MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows Guobin Ma et.al. 2510.08392 null
2025-10-09 What Makes a Visualization Complex? Mengdi Chu et.al. 2510.08332 null
2025-10-09 VoiceAgentBench: Are Voice Assistants ready for agentic tasks? Dhruv Jain et.al. 2510.07978 null

(back to top)

Video Generation

Publish Date Title Authors PDF Code
2025-11-20 Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO Junhao Cheng et.al. 2511.16669 null
2025-11-20 V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models Yang Luo et.al. 2511.16668 null
2025-11-20 SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking Haofeng Liu et.al. 2511.16618 null
2025-11-20 YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras Fan Yang et.al. 2511.16521 null
2025-11-20 An analytical and experimental study of the energy transition discourse on YouTube Aleix Bassolas et.al. 2511.16497 null
2025-11-20 Flow and Depth Assisted Video Prediction with Latent Transformer Eliyas Suleyman et.al. 2511.16484 null
2025-11-20 PIPHEN: Physical Interaction Prediction with Hamiltonian Energy Networks Kewei Chen et.al. 2511.16200 null
2025-11-20 FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos Jeremie Ochin et.al. 2511.16183 null
2025-11-20 Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight Yi Yang et.al. 2511.16175 null
2025-11-20 Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning Yibin Huang et.al. 2511.16160 null
2025-11-19 First Frame Is the Place to Go for Video Content Customization Jingxi Chen et.al. 2511.15700 null
2025-11-19 Joint Semantic-Channel Coding and Modulation for Token Communications Jingkai Ying et.al. 2511.15699 null
2025-11-19 The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification Dante Francisco Wasmuht et.al. 2511.15622 null
2025-11-19 Multimodal Evaluation of Russian-language Architectures Artem Chervyakov et.al. 2511.15552 null
2025-11-19 Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners Xabier Lekunberri et.al. 2511.15468 null
2025-11-19 ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation Simon Boeder et.al. 2511.15396 null
2025-11-19 PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback Sirui Chen et.al. 2511.15253 null
2025-11-19 Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation Firdavs Nasriddinov et.al. 2511.15159 null
2025-11-19 Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks Cheng Yang et.al. 2511.15065 null
2025-11-19 Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation Vladimir Arkhipkin et.al. 2511.14993 null
2025-11-18 Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising Yifan Wang et.al. 2511.14719 null
2025-11-18 FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation Yunfeng Wu et.al. 2511.14712 null
2025-11-18 ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection Mohammad Romani et.al. 2511.14554 null
2025-11-18 DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation Xiangchen Yin et.al. 2511.14530 null
2025-11-18 FlowRoI A Fast Optical Flow Driven Region of Interest Extraction Framework for High-Throughput Image Compression in Immune Cell Migration Analysis Xiaowei Xu et.al. 2511.14419 null
2025-11-18 ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries Junfu Pu et.al. 2511.14349 null
2025-11-18 Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs Yiyi Miao et.al. 2511.14315 null
2025-11-18 Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning Rui Liu et.al. 2511.14249 null
2025-11-18 InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior Weimin Bai et.al. 2511.14208 null
2025-11-18 Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion Zhuo Li et.al. 2511.14178 null
2025-11-17 Segment Anything Across Shots: A Method and Benchmark Hengrui Hu et.al. 2511.13715 null
2025-11-17 UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity Junwei Yu et.al. 2511.13714 null
2025-11-17 TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models Harold Haodong Chen et.al. 2511.13704 null
2025-11-17 Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting Jiangnan Ye et.al. 2511.13684 null
2025-11-17 CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding Shrenik Patel et.al. 2511.13644 null
2025-11-17 Computer Vision based group activity detection and action spotting Narthana Sivalingam et.al. 2511.13315 null
2025-11-17 CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving Enhui Ma et.al. 2511.13297 null
2025-11-17 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-17 Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification Rifen Lin et.al. 2511.13150 null
2025-11-17 VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language Zonghao Ying et.al. 2511.13127 null
2025-11-14 Scalable Policy Evaluation with Video World Models Wei-Cheng Tseng et.al. 2511.11520 null
2025-11-14 Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis Feng-Qi Cui et.al. 2511.11406 null
2025-11-14 YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation Pavel Rojtberg et.al. 2511.11344 null
2025-11-14 RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting Ruocheng Wu et.al. 2511.11213 null
2025-11-14 VIDEOP2R: Video Understanding from Perception to Reasoning Yifan Jiang et.al. 2511.11113 null
2025-11-14 LiteAttention: A Temporal Sparse Attention for Diffusion Transformers Dor Shmilovich et.al. 2511.11062 null
2025-11-14 EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation Zongyang Qiu et.al. 2511.11002 null
2025-11-14 Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment Wenbin Bai et.al. 2511.10987 null
2025-11-14 Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition Gunho Jung et.al. 2511.10958 null
2025-11-14 Language-Guided Graph Representation Learning for Video Summarization Wenrui Li et.al. 2511.10953 null

(back to top)

Image Generation

Publish Date Title Authors PDF Code
2025-11-20 Dataset Distillation for Pre-Trained Self-Supervised Vision Models George Cazenavette et.al. 2511.16674 null
2025-11-20 EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards Omkar Thawakar et.al. 2511.16672 null
2025-11-20 V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models Yang Luo et.al. 2511.16668 null
2025-11-20 SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation Zhenyuan Qin et.al. 2511.16666 null
2025-11-20 Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems Elias Lumer et.al. 2511.16654 null
2025-11-20 Measurement incompatibility in Bayesian multiparameter quantum estimation Francesco Albarelli et.al. 2511.16645 null
2025-11-20 SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction Guolin Huang et.al. 2511.16635 null
2025-11-20 SAM 3D: 3Dfy Anything in Images SAM 3D Team et.al. 2511.16624 null
2025-11-20 Formal Abductive Latent Explanations for Prototype-Based Networks Jules Soria et.al. 2511.16588 null
2025-11-20 PolyMinHash: Efficient Area-Based MinHashing of Polygons for Approximate Nearest Neighbor Search Alima Subedi et.al. 2511.16576 null
2025-11-19 GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization Yikun Wang et.al. 2511.15705 null
2025-11-19 Think Visually, Reason Textually: Vision-Language Synergy in ARC Beichen Zhang et.al. 2511.15703 null
2025-11-19 Joint Semantic-Channel Coding and Modulation for Token Communications Jingkai Ying et.al. 2511.15699 null
2025-11-19 VisPlay: Self-Evolving Vision-Language Models from Images Yicheng He et.al. 2511.15661 null
2025-11-19 When to Think and When to Look: Uncertainty-Guided Lookback Jing Bi et.al. 2511.15613 null
2025-11-19 MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation Bin Xie et.al. 2511.15603 null
2025-11-19 US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery Miruna-Alexandra Gafencu et.al. 2511.15600 null
2025-11-19 Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector Weiheng Zhu et.al. 2511.15571 null
2025-11-19 Multimodal Evaluation of Russian-language Architectures Artem Chervyakov et.al. 2511.15552 null
2025-11-19 UltraDP: Generalizable Carotid Ultrasound Scanning with Force-Aware Diffusion Policy Ruoqu Chen et.al. 2511.15550 null
2025-11-18 ARC Is a Vision Problem! Keya Hu et.al. 2511.14761 null
2025-11-18 UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning Rui Tian et.al. 2511.14760 null
2025-11-18 Cell Shape Emerges from Motion Gautham Gopinath et.al. 2511.14707 null
2025-11-18 Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances Rishu Kumar Singh et.al. 2511.14693 null
2025-11-18 A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases Tao Yang et.al. 2511.14638 null
2025-11-18 SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction Meiying Gu et.al. 2511.14633 null
2025-11-18 Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains Qingwei Ben et.al. 2511.14625 null
2025-11-18 XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation Yilin Zhang et.al. 2511.14604 null
2025-11-18 Task Addition and Weight Disentanglement in Closed-Vocabulary Models Adam Hazimeh et.al. 2511.14569 null
2025-11-18 A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement Yufeng Tian et.al. 2511.14521 null
2025-11-17 Back to Basics: Let Denoising Generative Models Denoise Tianhong Li et.al. 2511.13720 null
2025-11-17 UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity Junwei Yu et.al. 2511.13714 null
2025-11-17 Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine Xincheng Shuai et.al. 2511.13713 null
2025-11-17 TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models Harold Haodong Chen et.al. 2511.13704 null
2025-11-17 Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation Sofia Jamil et.al. 2511.13689 null
2025-11-17 Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting Jiangnan Ye et.al. 2511.13684 null
2025-11-17 Cross-Learning from Scarce Data via Multi-Task Constrained Optimization Leopoldo Agorio et.al. 2511.13680 null
2025-11-17 PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image Ziang Cao et.al. 2511.13648 null
2025-11-17 Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures Haohui Wang et.al. 2511.13640 null
2025-11-17 VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping Haotian Dong et.al. 2511.13587 null
2025-11-14 LARM: A Large Articulated-Object Reconstruction Model Sylvia Yuan et.al. 2511.11563 null
2025-11-14 Bridging Hidden States in Vision-Language Models Benjamin Fein-Ashley et.al. 2511.11526 null
2025-11-14 CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation Luthira Abeykoon et.al. 2511.11522 null
2025-11-14 SynthSoM-Twin: A Multi-Modal Sensing-Communication Digital-Twin Dataset for Sim2Real Transfer via Synesthesia of Machines Junlong Chen et.al. 2511.11503 null
2025-11-14 PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models Nhat Hoang-Xuan et.al. 2511.11502 null
2025-11-14 Visible and Terahertz Nonlinear Responses in the Topological Noble Metal Dichalcogenide PdTe2 George J. de Coster et.al. 2511.11493 null
2025-11-14 Data-efficient U-Net for Segmentation of Carbide Microstructures in SEM Images of Steel Alloys Alinda Ezgi Gerçek et.al. 2511.11485 null
2025-11-14 ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation Kaishen Wang et.al. 2511.11483 null
2025-11-14 Inferring response times of perceptual decisions with Poisson variational autoencoders Hayden R. Johnson et.al. 2511.11480 null
2025-11-14 Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification Qinghao Gao et.al. 2511.11460 null

(back to top)

Music Generation

Publish Date Title Authors PDF Code
2025-11-20 Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation Elena V. Epure et.al. 2511.16478 null
2025-11-20 Difficulty-Controlled Simplification of Piano Scores with Synthetic Data for Inclusive Music Education Pedro Ramoneda et.al. 2511.16228 null
2025-11-19 Step-Audio-R1 Technical Report Fei Tian et.al. 2511.15848 null
2025-11-19 LargeSHS: A large-scale dataset of music adaptation Chih-Pin Tan et.al. 2511.15270 null
2025-11-19 Aligning Generative Music AI with Human Preferences: Methods and Challenges Dorien Herremans et.al. 2511.15038 null
2025-11-18 A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder Dengyun Huang et.al. 2511.14600 null
2025-11-18 MuCPT: Music-related Natural Language Model Continued Pretraining Kai Tian et.al. 2511.14245 null
2025-11-17 Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases Antonio Manuel Martínez-Heredia et.al. 2511.13987 null
2025-11-17 Preference-Based Learning in Audio Applications: A Systematic Analysis Aaron Broukhim et.al. 2511.13936 null
2025-11-17 FoleyBench: A Benchmark For Video-to-Audio Models Satvik Dixit et.al. 2511.13219 null
2025-11-13 Music Flamingo: Scaling Music Understanding in Audio Language Models Sreyan Ghosh et.al. 2511.10289 null
2025-11-14 Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation Xinyi Tong et.al. 2511.09585 null
2025-11-12 Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation Shulei Ji et.al. 2511.09090 null
2025-11-12 Design of a Six-band, 2.4-Octave (80–420 GHz) Hierarchically Summed Phased-Array Slot-Dipole Antenna Array for NEW-MUSIC Xiaolan Huang et.al. 2511.08990 null
2025-11-12 Improved Modeling of Quasi-Static Thermal and Optical Response of Lumped-Element Aluminum Manganese KIDs Adriana Gavidia et.al. 2511.08959 null
2025-11-12 Low-Frequency Noise Performance of Microstrip-Coupled Lumped-Element Aluminum KIDs using Hydrogenated Amorphous Silicon Parallel-Plate Capacitors for NEW-MUSIC Simon Hempel-Costello et.al. 2511.08898 null
2025-11-11 Chord-conditioned Melody and Bass Generation Alexandra C Salem et.al. 2511.08755 null
2025-11-14 Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models Yi Yang et.al. 2511.08252 null
2025-11-11 Automatic Music Mixing using a Generative Model of Effect Embeddings Eloi Moliner et.al. 2511.08040 null
2025-11-10 Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics Jonathan Lehmkuhl et.al. 2511.07268 null
2025-11-06 MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers Ali Boudaghi et.al. 2511.04376 null
2025-11-06 MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation Shih-Lun Wu et.al. 2511.03942 null
2025-11-02 Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures Barathi Subramanian et.al. 2511.00793 null
2025-10-28 GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment Jinting Wang et.al. 2510.26818 null
2025-10-27 Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders Nathan Paek et.al. 2510.23802 null
2025-10-25 Streaming Generation for Music Accompaniment Yusong Wu et.al. 2510.22105 null
2025-10-23 GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer Jackson Loth et.al. 2510.21872 null
2025-10-21 Steering Autoregressive Music Generation with Recursive Feature Machines Daniel Zhao et.al. 2510.19127 null
2025-10-18 MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding Jingyue Huang et.al. 2510.16273 null
2025-10-16 Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics? Qixin Deng et.al. 2510.14249 null
2025-10-15 UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE Zhenyu Liu et.al. 2510.13344 null
2025-10-17 MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations Wenxiang Guo et.al. 2510.10396 null
2025-10-11 ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis Stephen Ni-Hahn et.al. 2510.10249 null
2025-10-07 LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment Jiahao Mei et.al. 2510.05875 null
2025-10-02 Bias beyond Borders: Global Inequalities in AI-Generated Music Ahmet Solak et.al. 2510.01963 null
2025-10-15 SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing Jiaye Tan et.al. 2510.00395 null
2025-10-04 HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling Hung-Ying Chu et.al. 2509.25694 null
2025-09-29 Ethics Statements in AI Music Papers: The Effective and the Ineffective Julia Barnett et.al. 2509.25496 null
2025-09-29 Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music Tianle Wang et.al. 2509.24603 null
2025-10-01 An Agent-Based Framework for Automated Higher-Voice Harmony Generation Nia D'Souza Ganapathy et.al. 2509.24463 null
2025-09-28 Time-Shifted Token Scheduling for Symbolic Music Generation Ting-Kang Wang et.al. 2509.23749 null
2025-09-28 AudioMoG: Guiding Audio Generation with Mixture-of-Guidance Junyou Wang et.al. 2509.23727 null
2025-09-27 AI-Assisted Music Production: A User Study on Text-to-Music Models Francesca Ronchini et.al. 2509.23364 null
2025-09-26 Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach Zijian Zhao et.al. 2509.22378 null
2025-09-26 MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan Xuanchen Wang et.al. 2509.21714 null
2025-09-21 Difficulty-Aware Score Generation for Piano Sight-Reading Pedro Ramoneda et.al. 2509.16913 null
2025-09-17 Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure Shulei Ji et.al. 2509.13658 null
2025-09-13 A Traditional Approach to Symbolic Piano Continuation Christian Zhou-Zheng et.al. 2509.12267 null
2025-09-14 Decoding Musical Origins: Distinguishing Human and AI Composers Cheng-Yang Tsai et.al. 2509.11369 null
2025-09-14 STASE: A spatialized text-to-audio synthesis engine for music generation Tutti Chi et.al. 2509.11124 null

(back to top)

Audio Codec

Publish Date Title Authors PDF Code
2025-11-20 Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs Wei-Cheng Tseng et.al. 2511.16639 null
2025-11-20 SUNAC: Source-aware Unified Neural Audio Codec Ryo Aihara et.al. 2511.16126 null
2025-11-18 OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models Keda Tao et.al. 2511.14582 null
2025-11-18 Segmentwise Pruning in Audio-Language Models Marcel Gibier et.al. 2511.14293 null
2025-11-18 SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM An Yu et.al. 2511.14143 null
2025-11-17 PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement Xiaobin Rong et.al. 2511.13300 null
2025-11-16 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Yunxin Li et.al. 2511.12609 null
2025-11-15 VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing Zhisheng Zheng et.al. 2511.12347 null
2025-11-15 Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound Dengming Zhang et.al. 2511.12077 null
2025-11-14 Evaluation of Audio Compression Codecs Thien T. Duong et.al. 2511.11527 null
2025-11-14 AV-Dialog: Spoken Dialogue Models with Audio-Visual Input Tuochao Chen et.al. 2511.11124 null
2025-11-14 AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization Zhonghua Jiang et.al. 2511.11106 null
2025-11-14 TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models Hualei Wang et.al. 2511.11039 null
2025-11-09 Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment Yan Gao et.al. 2511.10670 null
2025-11-13 VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction Yuhao Wang et.al. 2511.10232 null
2025-11-13 Towards Leveraging Sequential Structure in Animal Vocalizations Eklavya Sarkar et.al. 2511.10190 null
2025-11-12 POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation Xuanchen Li et.al. 2511.09232 null
2025-11-12 HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios Bingsong Bai et.al. 2511.08496 null
2025-11-10 Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models Umberto Cappellazzo et.al. 2511.07253 null
2025-11-10 Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection Brage Eilertsen et.al. 2511.07065 null
2025-11-08 BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction Haoran Wang et.al. 2511.06150 null
2025-11-05 Seeing What You Say: Expressive Image Generation from Speech Jiyoung Lee et.al. 2511.03423 null
2025-11-05 Open Source State-Of-the-Art Solution for Romanian Speech Recognition Gabriel Pirlogeanu et.al. 2511.03361 null
2025-11-05 audio2chart: End to End Audio Transcription into playable Guitar Hero charts Riccardo Tripodi et.al. 2511.03337 null
2025-11-04 An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM Jiawei Liu et.al. 2511.02234 null
2025-11-03 ADNAC: Audio Denoiser using Neural Audio Codec Daniel Jimon et.al. 2511.01773 null
2025-10-30 UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens Chengwei Liu et.al. 2510.26372 null
2025-10-30 Modeling strategies for speech enhancement in the latent space of a neural audio codec Sofiene Kammoun et.al. 2510.26299 null
2025-10-29 PitchFlower: A flow-based neural audio codec with pitch controllability Diego Torres et.al. 2510.25566 null
2025-10-29 Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR Shreyas Gopal et.al. 2510.25150 null
2025-10-28 Bayesian Speech synthesizers Can Learn from Multiple Teachers Ziyang Zhang et.al. 2510.24372 null
2025-10-28 Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations Ahmad Ghannam et.al. 2510.24247 null
2025-10-28 Low-Resource Audio Codec (LRAC): 2025 Challenge Description Kamil Wojcicki et.al. 2510.23312 null
2025-10-25 FOA Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss Parthasaarathy Sudarsanam et.al. 2510.22241 null
2025-10-24 SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain Zixiang Wan et.al. 2510.21209 null
2025-10-24 Robust Distortion-Free Watermark for Autoregressive Audio Generation Models Yihan Wu et.al. 2510.21115 null
2025-10-23 Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding Xin Zhang et.al. 2510.20504 null
2025-10-23 UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement Haoyin Yan et.al. 2510.20441 null
2025-10-19 SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization Wenxi Chen et.al. 2510.16841 null
2025-10-19 U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation Xusheng Yang et.al. 2510.16718 null
2025-10-17 LDCodec: A high quality neural audio codec with low-complexity decoder Jiawei Jiang et.al. 2510.15364 null
2025-10-17 Extending Audio Context for Long-Form Understanding in Large Audio-Language Models Yuatyong Chaichana et.al. 2510.15231 null
2025-10-20 LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models Xiaohan Zhao et.al. 2510.15227 null
2025-10-16 TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation Ming-Hao Hsu et.al. 2510.14934 null
2025-10-15 Acoustic Teleportation via Disentangled Neural Audio Codec Representations Philipp Grundhuber et.al. 2510.13221 null
2025-10-13 UALM: Unified Audio Language Model for Understanding, Generation and Reasoning Jinchuan Tian et.al. 2510.12000 null
2025-10-13 BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis Jingyuan Xing et.al. 2510.11646 null
2025-10-12 FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec Yurii Halychanskyi et.al. 2510.10785 null
2025-10-11 SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation Zeyu Ling et.al. 2510.10069 null
2025-10-11 MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction Jianjin Wang et.al. 2510.10003 null

(back to top)

Large Audio Language Model

Publish Date Title Authors PDF Code
2025-11-20 Cognitive Foundations for Reasoning and Their Manifestation in LLMs Priyanka Kargupta et.al. 2511.16660 null
2025-11-20 SUNAC: Source-aware Unified Neural Audio Codec Ryo Aihara et.al. 2511.16126 null
2025-11-20 Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio Mohan Shi et.al. 2511.16046 null
2025-11-20 Multimodal Evaluation of Russian-language Architectures Artem Chervyakov et.al. 2511.15552 null
2025-11-19 Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding Mingyue Huo et.al. 2511.15145 null
2025-11-18 A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder Dengyun Huang et.al. 2511.14600 null
2025-11-18 OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models Keda Tao et.al. 2511.14582 null
2025-11-18 Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning Trishala Jayesh Ahalpara et.al. 2511.14445 null
2025-11-18 TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation Wei Liu et.al. 2511.14410 null
2025-11-18 Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions Marcel Gibier et.al. 2511.14307 null
2025-11-18 Segmentwise Pruning in Audio-Language Models Marcel Gibier et.al. 2511.14293 null
2025-11-18 SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM An Yu et.al. 2511.14143 null
2025-11-18 O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents Piaohong Wang et.al. 2511.13593 null
2025-11-17 Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs Zhe Sun et.al. 2511.13273 null
2025-11-17 You Only Look Omni Gradient Backpropagation for Moving Infrared Small Target Detection Guoyi Zhang et.al. 2511.13013 null
2025-11-16 Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data Yunxin Li et.al. 2511.12609 null
2025-11-16 DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions Xiaoyu Lin et.al. 2511.12452 null
2025-11-16 SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs Shail Desai et.al. 2511.12404 null
2025-11-15 VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing Zhisheng Zheng et.al. 2511.12347 null
2025-11-15 Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound Dengming Zhang et.al. 2511.12077 null
2025-11-14 AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization Zhonghua Jiang et.al. 2511.11106 null
2025-11-14 TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models Hualei Wang et.al. 2511.11039 null
2025-11-14 DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition HongYu Liu et.al. 2511.11000 null
2025-11-14 Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio Guangke Chen et.al. 2511.10913 null
2025-11-14 OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer Haosong Peng et.al. 2511.10560 null
2025-11-13 Music Flamingo: Scaling Music Understanding in Audio Language Models Sreyan Ghosh et.al. 2511.10289 null
2025-11-13 OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models Yuping Yan et.al. 2511.10287 null
2025-11-14 Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard Yudong Yang et.al. 2511.10222 null
2025-11-13 When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? Qilang Ye et.al. 2511.10059 null
2025-11-13 Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism Jinhong Jeong et.al. 2511.10045 null
2025-11-13 Reinforcing Trustworthiness in Multimodal Emotional Support Systems Huy M. Le et.al. 2511.10011 null
2025-11-13 Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation Xiangyi Wei et.al. 2511.09958 null
2025-11-13 HI-TransPA: Hearing Impairments Translation Personal Assistant Zhiming Ma et.al. 2511.09915 null
2025-11-12 State Space Modeling of Mortgage Default Rates under Natural Hazard Shocks Samuel J. Eschker et.al. 2511.09698 null
2025-11-11 Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models Umberto Cappellazzo et.al. 2511.07253 link
2025-11-06 CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese Dazhong Chen et.al. 2511.04139 null
2025-11-06 WST: Weakly Supervised Transducer for Automatic Speech Recognition Dongji Gao et.al. 2511.04035 null
2025-11-05 Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything Huawei Lin et.al. 2511.02834 null
2025-11-05 The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models Claudia Herambourg et.al. 2511.02589 null
2025-11-03 SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia Chaoqun Liu et.al. 2511.01670 null
2025-11-03 Classification of motor faults based on transmission coefficient and reflection coefficient of omni-directional antenna using DCNN Sagar Dutta et.al. 2511.01371 null
2025-11-06 OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation Heyu Guo et.al. 2511.01210 null
2025-11-02 Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models Junqi Zhao et.al. 2511.01091 null
2025-10-31 LongCat-Flash-Omni Technical Report Meituan LongCat Team et.al. 2511.00279 null
2025-10-31 Sensor operating point calibration and monitoring of the ALICE Inner Tracking System during LHC Run 3 D. Agguiaro et.al. 2510.27592 null
2025-10-30 ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models Weifei Jin et.al. 2510.26096 null
2025-10-29 Convergence of a Relative-type Inexact Proximal ALM for Convex Nonlinear Programming Lei Yang et.al. 2510.25261 null
2025-10-28 Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation Inclusion AI et.al. 2510.24821 null
2025-10-28 Generative View Stitching Chonghyuk Song et.al. 2510.24718 null
2025-10-28 STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence Zihan Liu et.al. 2510.24693 null

(back to top)
