GitHub

Updated on 2025.11.23

Usage instructions: here

Table of Contents

Text to Speech
Text to Audio
Video to Audio
Voice Conversion
Video Generation
Image Generation
Music Generation
Audio Codec
Large Audio Language Model

Text to Speech

Publish Date	Title	Authors	PDF	Code
2025-11-20	Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs	Wei-Cheng Tseng et.al.	2511.16639	null
2025-11-20	WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue	Zachary Ellis et.al.	2511.16544	null
2025-11-20	SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise	Rui Sang et.al.	2511.16114	null
2025-11-19	Universal TT- and TQ-relations via centrally extended q-Onsager algebra	Pascal Baseilhac et.al.	2511.15876	null
2025-11-19	Step-Audio-R1 Technical Report	Fei Tian et.al.	2511.15848	null
2025-11-19	A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification	Mohit Sharma et.al.	2511.15766	null
2025-11-19	PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback	Sirui Chen et.al.	2511.15253	null
2025-11-19	Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding	Mingyue Huo et.al.	2511.15145	null
2025-11-19	Aligning Generative Music AI with Human Preferences: Methods and Challenges	Dorien Herremans et.al.	2511.15038	null
2025-11-18	Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion	Zanxu Wang et.al.	2511.14969	null
2025-11-18	PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants	Mingkun Yu et.al.	2511.14852	null
2025-11-18	Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech	Nam-Gyu Kim et.al.	2511.14824	null
2025-11-18	Ground Truth Generation for Multilingual Historical NLP using LLMs	Clovis Gladstone et.al.	2511.14688	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	null
2025-11-18	Periods in equivariant and motivic contexts	Martin Gallauer et.al.	2511.14325	null
2025-11-18	AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR	Gabrial Zencha Ashungafac et.al.	2511.14255	null
2025-11-18	Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning	Rui Liu et.al.	2511.14249	link
2025-11-18	StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model	Yifan Yang et.al.	2511.14223	null
2025-11-18	FxSearcher: gradient-free text-driven audio transformation	Hojoon Ki et.al.	2511.14138	null
2025-11-17	Human-centric Maintenance Process Through Integration of AI, Speech, and AR	Parul Khanna et.al.	2511.13918	null
2025-11-17	Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video	Filippo Cenacchi, Longbing Cao et.al.	2511.13802	null
2025-11-17	PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement	Xiaobin Rong et.al.	2511.13300	null
2025-11-17	Computational Measurement of Political Positions: A Review of Text-Based Ideal Point Estimation Algorithms	Patrick Parschan et.al.	2511.13238	null
2025-11-17	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-17	Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis	Zaara Zabeen Arpa et.al.	2511.13159	link
2025-11-17	A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning	Liuyi Jin et.al.	2511.13078	null
2025-11-17	CalibrateMix: Guided-Mixup Calibration of Image Semi-Supervised Models	Mehrab Mustafy Rahman et.al.	2511.12964	null
2025-11-16	Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data	Sina Rashidi et.al.	2511.12690	null
2025-11-16	Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans	Hongbin Huang et.al.	2511.12662	null
2025-11-16	Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data	Yunxin Li et.al.	2511.12609	null
2025-11-16	DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions	Xiaoyu Lin et.al.	2511.12452	null
2025-11-14	Proactive Hearing Assistants that Isolate Egocentric Conversations	Guilin Hu et.al.	2511.11473	link
2025-11-14	Language-Aided State Estimation	Yuki Miyoshi et.al.	2511.11285	null
2025-11-14	Extended-Krylov-subspace methods for trust-region and norm-regularization subproblems	Hussam Al Daas et.al.	2511.11135	null
2025-11-14	Analysing Personal Attacks in U.S. Presidential Debates	Ruban Goyal et.al.	2511.11108	null
2025-11-14	CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation	Crystal Min Hui Poon et.al.	2511.11104	null
2025-11-14	CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding	Yifan Zhuang et.al.	2511.10935	null
2025-11-14	Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio	Guangke Chen et.al.	2511.10913	null
2025-11-13	Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces	Farhan Sheth et.al.	2511.10793	null
2025-11-13	Towards Attribution of Generators and Emotional Manipulation in Cross-Lingual Synthetic Speech using Geometric Learning	Girish et.al.	2511.10790	null
2025-11-13	XSNAP: An X-ray Supernova Analysis Pipeline with Application to the Type II Supernova 2024ggi	Ferdinand et.al.	2511.10744	null
2025-11-13	Music Flamingo: Scaling Music Understanding in Audio Language Models	Sreyan Ghosh et.al.	2511.10289	null
2025-11-13	VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction	Yuhao Wang et.al.	2511.10232	null
2025-11-13	Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard	Yudong Yang et.al.	2511.10222	null
2025-11-13	Towards Leveraging Sequential Structure in Animal Vocalizations	Eklavya Sarkar et.al.	2511.10190	link
2025-11-13	FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features	Wenyu Wang et.al.	2511.10112	null
2025-11-13	Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints	Xiangyue Zhang et.al.	2511.10076	null
2025-11-13	Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS	Haoyu Li et.al.	2511.09995	null
2025-11-13	MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection	Pritish Sahu et.al.	2511.09918	null
2025-11-12	Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages	Omnilingual ASR team et.al.	2511.09690	null

Text to Audio

Publish Date	Title	Authors	PDF	Code
2025-11-20	Cognitive Foundations for Reasoning and Their Manifestation in LLMs	Priyanka Kargupta et.al.	2511.16660	null
2025-11-20	Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs	Wei-Cheng Tseng et.al.	2511.16639	null
2025-11-20	SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise	Rui Sang et.al.	2511.16114	null
2025-11-19	Step-Audio-R1 Technical Report	Fei Tian et.al.	2511.15848	null
2025-11-19	A Generalized Weighted Overlap-Add (WOLA) Filter Bank for Improved Subband System Identification	Mohit Sharma et.al.	2511.15766	null
2025-11-20	Multimodal Evaluation of Russian-language Architectures	Artem Chervyakov et.al.	2511.15552	null
2025-11-19	Adapt-As-You-Walk Through the Clouds: Training-Free Online Test-Time Adaptation of 3D Vision-Language Foundation Models	Mehran Tamjidi et.al.	2511.15311	null
2025-11-19	Detection of spiking motifs of arbitrary length in neural activity using bounded synaptic delays	Thomas Kronland-Martinet et.al.	2511.15296	null
2025-11-19	SNAP: Low-Latency Test-Time Adaptation with Sparse Updates	Hyeongheon Cha et.al.	2511.15276	null
2025-11-19	LargeSHS: A large-scale dataset of music adaptation	Chih-Pin Tan et.al.	2511.15270	null
2025-11-19	Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding	Mingyue Huo et.al.	2511.15145	null
2025-11-19	Aligning Generative Music AI with Human Preferences: Methods and Challenges	Dorien Herremans et.al.	2511.15038	null
2025-11-18	Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion	Zanxu Wang et.al.	2511.14969	null
2025-11-18	RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems	Jaro Meyer et.al.	2511.14948	null
2025-11-18	Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report	Daniel Oliveira de Brito et.al.	2511.14939	null
2025-11-18	A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder	Dengyun Huang et.al.	2511.14600	null
2025-11-18	Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning	Trishala Jayesh Ahalpara et.al.	2511.14445	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	null
2025-11-18	H-LDM: Hierarchical Latent Diffusion Models for Controllable and Interpretable PCG Synthesis from Clinical Metadata	Chenyang Xu et.al.	2511.14312	null
2025-11-18	Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions	Marcel Gibier et.al.	2511.14307	null
2025-11-18	EBind: a practical approach to space binding	Jim Broadbent et.al.	2511.14229	null
2025-11-18	StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model	Yifan Yang et.al.	2511.14223	null
2025-11-18	FxSearcher: gradient-free text-driven audio transformation	Hojoon Ki et.al.	2511.14138	null
2025-11-18	Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services	Liuyi Jin et.al.	2511.14119	null
2025-11-17	Preference-Based Learning in Audio Applications: A Systematic Analysis	Aaron Broukhim et.al.	2511.13936	null
2025-11-17	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-17	VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language	Zonghao Ying et.al.	2511.13127	null
2025-11-17	A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning	Liuyi Jin et.al.	2511.13078	null
2025-11-16	Open-World Test-Time Adaptation with Hierarchical Feature Aggregation and Attention Affine	Ziqiong Liu et.al.	2511.12607	null
2025-11-16	DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions	Xiaoyu Lin et.al.	2511.12452	null
2025-11-16	SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs	Shail Desai et.al.	2511.12404	null
2025-11-15	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing	Zhisheng Zheng et.al.	2511.12347	null
2025-11-15	Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound	Dengming Zhang et.al.	2511.12077	null
2025-11-15	ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation	Jiahui Sun et.al.	2511.12072	null
2025-11-14	Enhancing XR Auditory Realism via Multimodal Scene-Aware Acoustic Rendering	Tianyu Xu et.al.	2511.11930	null
2025-11-14	Proactive Hearing Assistants that Isolate Egocentric Conversations	Guilin Hu et.al.	2511.11473	null
2025-11-14	AV-Dialog: Spoken Dialogue Models with Audio-Visual Input	Tuochao Chen et.al.	2511.11124	null
2025-11-14	DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition	HongYu Liu et.al.	2511.11000	null
2025-11-14	Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio	Guangke Chen et.al.	2511.10913	null
2025-11-13	Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces	Farhan Sheth et.al.	2511.10793	null
2025-11-13	Panda: Test-Time Adaptation with Negative Data Augmentation	Ruxi Deng et.al.	2511.10481	null
2025-11-13	TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities	Yan Zhuang et.al.	2511.10325	null
2025-11-13	Music Flamingo: Scaling Music Understanding in Audio Language Models	Sreyan Ghosh et.al.	2511.10289	null
2025-11-13	OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models	Yuping Yan et.al.	2511.10287	null
2025-11-14	Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard	Yudong Yang et.al.	2511.10222	null
2025-11-13	Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization	Ashutosh Anshul et.al.	2511.10212	null
2025-11-13	RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo	Jueun Ko et.al.	2511.10107	null
2025-11-13	When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?	Qilang Ye et.al.	2511.10059	null
2025-11-13	Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism	Jinhong Jeong et.al.	2511.10045	null
2025-11-13	Reinforcing Trustworthiness in Multimodal Emotional Support Systems	Huy M. Le et.al.	2511.10011	null

Video to Audio

Publish Date	Title	Authors	PDF	Code
2025-11-20	Real-Time Inference for Distributed Multimodal Systems under Communication Delay Uncertainty	Victor Croisfelt et.al.	2511.16225	null
2025-11-19	MF-GCN: A Multi-Frequency Graph Convolutional Network for Tri-Modal Depression Detection Using Eye-Tracking, Facial, and Acoustic Features	Sejuti Rahman et.al.	2511.15675	null
2025-11-20	Multimodal Evaluation of Russian-language Architectures	Artem Chervyakov et.al.	2511.15552	null
2025-11-19	A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data	Mauro Larrat et.al.	2511.15312	null
2025-11-18	Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion	Zanxu Wang et.al.	2511.14969	null
2025-11-18	RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems	Jaro Meyer et.al.	2511.14948	null
2025-11-18	OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models	Keda Tao et.al.	2511.14582	null
2025-11-18	Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning	Rui Liu et.al.	2511.14249	null
2025-11-18	EBind: a practical approach to space binding	Jim Broadbent et.al.	2511.14229	null
2025-11-18	SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM	An Yu et.al.	2511.14143	null
2025-11-18	Real-Time Mobile Video Analytics for Pre-arrival Emergency Medical Services	Liuyi Jin et.al.	2511.14119	null
2025-11-17	Segmenting Collision Sound Sources in Egocentric Videos	Kranti Kumar Parida et.al.	2511.13863	null
2025-11-17	Towards Affect-Adaptive Human-Robot Interaction: A Protocol for Multimodal Dataset Collection on Social Anxiety	Vesna Poprcova et.al.	2511.13530	null
2025-11-17	CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving	Enhui Ma et.al.	2511.13297	null
2025-11-17	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-17	VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language	Zonghao Ying et.al.	2511.13127	null
2025-11-17	A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning	Liuyi Jin et.al.	2511.13078	null
2025-11-17	Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views	Junyi Ma et.al.	2511.12878	null
2025-11-16	DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions	Xiaoyu Lin et.al.	2511.12452	null
2025-11-16	SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs	Shail Desai et.al.	2511.12404	null
2025-11-15	Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound	Dengming Zhang et.al.	2511.12077	null
2025-11-15	ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation	Jiahui Sun et.al.	2511.12072	null
2025-11-14	AV-Dialog: Spoken Dialogue Models with Audio-Visual Input	Tuochao Chen et.al.	2511.11124	null
2025-11-14	AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization	Zhonghua Jiang et.al.	2511.11106	null
2025-11-13	TMDC: A Two-Stage Modality Denoising and Complementation Framework for Multimodal Sentiment Analysis with Missing and Noisy Modalities	Yan Zhuang et.al.	2511.10325	null
2025-11-13	OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models	Yuping Yan et.al.	2511.10287	null
2025-11-13	Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization	Ashutosh Anshul et.al.	2511.10212	null
2025-11-13	When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?	Qilang Ye et.al.	2511.10059	null
2025-11-13	Reinforcing Trustworthiness in Multimodal Emotional Support Systems	Huy M. Le et.al.	2511.10011	null
2025-11-13	Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation	Xiangyi Wei et.al.	2511.09958	null
2025-11-14	HI-TransPA: Hearing Impairments Translation Personal Assistant	Zhiming Ma et.al.	2511.09915	null
2025-11-12	Co-Designing Multimodal Systems for Accessible Remote Dance Instruction	Ujjaini Das et.al.	2511.09658	null
2025-11-12	MCAD: Multimodal Context-Aware Audio Description Generation For Soccer	Lipisha Chaudhary et.al.	2511.09448	null
2025-11-12	Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection	Anushka Sanjay Shelke et.al.	2511.09039	null
2025-11-05	UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions	Guozhen Zhang et.al.	2511.03334	null
2025-10-28	Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation	Kang Zhang et.al.	2510.24103	null
2025-10-10	MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation	Akira Takahashi et.al.	2510.09065	null
2025-10-28	Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation	Liyang Chen et.al.	2510.08078	null
2025-10-09	IsoSignVid2Aud: Sign Language Video to Audio Conversion without Text Intermediaries	Harsh Kavediya et.al.	2510.07837	null
2025-10-07	FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders	Riccardo Fosco Gramaccioni et.al.	2510.05829	null
2025-10-07	StereoSync: Spatially-Aware Stereo Audio Generation from Video	Christian Marinoni et.al.	2510.05828	null
2025-10-03	SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos	Amir Dellali et.al.	2510.02916	null
2025-10-02	SoundReactor: Frame-level Online Video-to-Audio Generation	Koichi Saito et.al.	2510.02110	null
2025-09-29	Training-Free Multimodal Guidance for Video to Audio Generation	Eleonora Grassucci et.al.	2509.24550	null
2025-09-28	AudioMoG: Guiding Audio Generation with Mixture-of-Guidance	Junyou Wang et.al.	2509.23727	null
2025-09-26	WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM	Changli Tang et.al.	2509.21990	null
2025-09-26	Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers	Jibin Song et.al.	2509.21893	null
2025-09-24	MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization	Jianxuan Yang et.al.	2509.19999	null
2025-10-05	StereoFoley: Object-Aware Stereo Audio Generation from Video	Tornike Karchkhadze et.al.	2509.18272	null
2025-09-19	Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech	Xinlei Niu et.al.	2509.15492	null

Voice Conversion

Publish Date	Title	Authors	PDF	Code
2025-11-20	Neutron star heating vs. HST observations	Luis E. Rodríguez et.al.	2511.16507	null
2025-11-20	SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise	Rui Sang et.al.	2511.16114	null
2025-11-19	PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback	Sirui Chen et.al.	2511.15253	null
2025-11-18	AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR	Gabrial Zencha Ashungafac et.al.	2511.14255	null
2025-11-17	Large cliques in graphs with forbidden semi-induced structures	Nannan Chen et.al.	2511.13073	null
2025-11-16	Leave-One-Out Learning with Log-Loss	Yaniv Fogel et.al.	2511.12718	null
2025-11-16	Sample Complexity of Agnostic Multiclass Classification: Natarajan Dimension Strikes Back	Alon Cohen et.al.	2511.12659	null
2025-11-15	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing	Zhisheng Zheng et.al.	2511.12347	null
2025-11-14	Volatility in Certainty (VC): A Metric for Detecting Adversarial Perturbations During Inference in Neural Network Classifiers	Vahid Hemmati et.al.	2511.11834	null
2025-11-14	Vortex breakdown and its topologies in turbulent flows within a typical swirl combustor geometry	Nitesh Kumar Sahu et.al.	2511.11420	null
2025-11-13	FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features	Wenyu Wang et.al.	2511.10112	null
2025-11-12	Sample Complexity of Quadratically Regularized Optimal Transport	Alberto González-Sanz et.al.	2511.09807	null
2025-11-13	Reduced-Complexity Model Selection and Rate Allocation for Multiple-Model Electrical Signal Compression	Corentin Presvôts et.al.	2511.09370	null
2025-11-12	VC-dimension of Salem sets over finite fields	Moustapha Diallo et.al.	2511.08963	null
2025-11-12	HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios	Bingsong Bai et.al.	2511.08496	null
2025-11-10	ConvFill: Model Collaboration for Responsive Conversational Voice Agents	Vidya Srinivas et.al.	2511.07397	null
2025-11-10	Generating Novel and Realistic Speakers for Voice Conversion	Meiying Melissa Chen et.al.	2511.07135	null
2025-11-10	E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis	Zhisheng Zhang et.al.	2511.07099	null
2025-11-10	Personalizing Emotion-aware Conversational Agents? Exploring User Traits-driven Conversational Strategies for Enhanced Interaction	Yuchong Zhang et.al.	2511.06954	null
2025-11-09	How Founder Expertise Shapes the Impact of Generative Artificial Intelligence on Digital Ventures	Ruiqing Cao et.al.	2511.06545	null
2025-11-06	Vector Traits Shape Disease Persistence: A Predator Prey Approach to Dengue	Piyumi Chathurangika et.al.	2511.04276	null
2025-11-04	Recursively Enumerably Representable Classes and Computable Versions of the Fundamental Theorem of Statistical Learning	David Kattermann et.al.	2511.02644	null
2025-10-31	Consequences of Dependent Dividing on Burden	Yuki Takahashi et.al.	2511.00282	null
2025-10-31	NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion	Zongyang Du et.al.	2511.00256	null
2025-10-30	UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens	Chengwei Liu et.al.	2510.26372	null
2025-10-28	Bayesian Speech synthesizers Can Learn from Multiple Teachers	Ziyang Zhang et.al.	2510.24372	null
2025-10-24	StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks	Jingyue Huang et.al.	2510.21685	null
2025-10-23	Charge-density waves and stripes in quarter metals of graphene heterostructures	Sk Asrap Murshed et.al.	2510.20816	null
2025-10-23	R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion	Junjie Zheng et.al.	2510.20677	null
2025-10-22	VBx for End-to-End Neural and Clustering-based Diarization	Petr Pálka et.al.	2510.19572	null
2025-10-20	Fast Agnostic Learners in the Plane	Talya Eden et.al.	2510.18057	null
2025-10-20	Joint upper Banach density, VC dimensions and Euclidean point configurations	Bruno Predojević et.al.	2510.17453	null
2025-10-23	The Parameterized Complexity of Computing the VC-Dimension	Florent Foucaud et.al.	2510.17451	null
2025-10-18	Truly Subquadratic Time Algorithms for Diameter and Related Problems in Graphs of Bounded VC-dimension	Timothy M. Chan et.al.	2510.16346	null
2025-10-22	VoiceMorph: How AI Voice Morphing Reveals the Boundaries of Auditory Self-Recognition	Kye Shimizu et.al.	2510.16192	null
2025-10-16	Deadlock-free routing for Full-mesh networks without using Virtual Channels	Alejandro Cano et.al.	2510.14730	null
2025-10-15	The VC-dimension and point configurations in $\mathbb{R}^d$	Alex Iosevich et.al.	2510.13984	null
2025-10-16	VC-Dimension vs Degree: An Uncertainty Principle for Boolean Functions	Fan Chang et.al.	2510.13705	null
2025-10-15	Model-assisted estimation for MRV: How to boost the economics of SOC sequestration projects without compromising on scientific integrity	Ahmad Awad et.al.	2510.13609	null
2025-10-15	Target Controllability Score	Kazuhiro Sato et.al.	2510.13354	link
2025-10-14	VCTR: A Transformer-Based Model for Non-parallel Voice Conversion	Maharnab Saikia et.al.	2510.12964	null
2025-10-15	(R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm	Kevin Krings et.al.	2510.12364	null
2025-10-13	Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker	Cheng Gong et.al.	2510.11124	null
2025-10-13	VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents	Jiliang Hu et.al.	2510.11098	null
2025-10-10	A Scalable, Privacy-Preserving Decentralized Identity and Verifiable Data Sharing Framework based on Zero-Knowledge Proofs	Hui Yuan et.al.	2510.09715	null
2025-10-10	SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion	Zhao Guo et.al.	2510.09245	null
2025-10-10	O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion	Huu Tuong Tu et.al.	2510.09061	null
2025-10-09	MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows	Guobin Ma et.al.	2510.08392	null
2025-10-09	What Makes a Visualization Complex?	Mengdi Chu et.al.	2510.08332	null
2025-10-09	VoiceAgentBench: Are Voice Assistants ready for agentic tasks?	Dhruv Jain et.al.	2510.07978	null

Video Generation

Publish Date	Title	Authors	PDF	Code
2025-11-20	Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO	Junhao Cheng et.al.	2511.16669	null
2025-11-20	V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models	Yang Luo et.al.	2511.16668	null
2025-11-20	SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking	Haofeng Liu et.al.	2511.16618	null
2025-11-20	YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras	Fan Yang et.al.	2511.16521	null
2025-11-20	An analytical and experimental study of the energy transition discourse on YouTube	Aleix Bassolas et.al.	2511.16497	null
2025-11-20	Flow and Depth Assisted Video Prediction with Latent Transformer	Eliyas Suleyman et.al.	2511.16484	null
2025-11-20	PIPHEN: Physical Interaction Prediction with Hamiltonian Energy Networks	Kewei Chen et.al.	2511.16200	null
2025-11-20	FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos	Jeremie Ochin et.al.	2511.16183	null
2025-11-20	Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight	Yi Yang et.al.	2511.16175	null
2025-11-20	Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning	Yibin Huang et.al.	2511.16160	null
2025-11-19	First Frame Is the Place to Go for Video Content Customization	Jingxi Chen et.al.	2511.15700	null
2025-11-19	Joint Semantic-Channel Coding and Modulation for Token Communications	Jingkai Ying et.al.	2511.15699	null
2025-11-19	The SA-FARI Dataset: Segment Anything in Footage of Animals for Recognition and Identification	Dante Francisco Wasmuht et.al.	2511.15622	null
2025-11-19	Multimodal Evaluation of Russian-language Architectures	Artem Chervyakov et.al.	2511.15552	null
2025-11-19	Deep Learning for Accurate Vision-based Catch Composition in Tropical Tuna Purse Seiners	Xabier Lekunberri et.al.	2511.15468	null
2025-11-19	ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation	Simon Boeder et.al.	2511.15396	null
2025-11-19	PresentCoach: Dual-Agent Presentation Coaching through Exemplars and Interactive Feedback	Sirui Chen et.al.	2511.15253	null
2025-11-19	Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation	Firdavs Nasriddinov et.al.	2511.15159	null
2025-11-19	Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks	Cheng Yang et.al.	2511.15065	null
2025-11-19	Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation	Vladimir Arkhipkin et.al.	2511.14993	null
2025-11-18	Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising	Yifan Wang et.al.	2511.14719	null
2025-11-18	FreeSwim: Revisiting Sliding-Window Attention Mechanisms for Training-Free Ultra-High-Resolution Video Generation	Yunfeng Wu et.al.	2511.14712	null
2025-11-18	ForensicFlow: A Tri-Modal Adaptive Network for Robust Deepfake Detection	Mohammad Romani et.al.	2511.14554	null
2025-11-18	DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation	Xiangchen Yin et.al.	2511.14530	null
2025-11-18	FlowRoI A Fast Optical Flow Driven Region of Interest Extraction Framework for High-Throughput Image Compression in Immune Cell Migration Analysis	Xiaowei Xu et.al.	2511.14419	null
2025-11-18	ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries	Junfu Pu et.al.	2511.14349	null
2025-11-18	Dental3R: Geometry-Aware Pairing for Intraoral 3D Reconstruction from Sparse-View Photographs	Yiyi Miao et.al.	2511.14315	null
2025-11-18	Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning	Rui Liu et.al.	2511.14249	null
2025-11-18	InstantViR: Real-Time Video Inverse Problem Solver with Distilled Diffusion Prior	Weimin Bai et.al.	2511.14208	null
2025-11-18	Towards Deploying VLA without Fine-Tuning: Plug-and-Play Inference-Time VLA Policy Steering via Embodied Evolutionary Diffusion	Zhuo Li et.al.	2511.14178	null
2025-11-17	Segment Anything Across Shots: A Method and Benchmark	Hengrui Hu et.al.	2511.13715	null
2025-11-17	UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity	Junwei Yu et.al.	2511.13714	null
2025-11-17	TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models	Harold Haodong Chen et.al.	2511.13704	null
2025-11-17	Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting	Jiangnan Ye et.al.	2511.13684	null
2025-11-17	CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding	Shrenik Patel et.al.	2511.13644	null
2025-11-17	Computer Vision based group activity detection and action spotting	Narthana Sivalingam et.al.	2511.13315	null
2025-11-17	CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving	Enhui Ma et.al.	2511.13297	null
2025-11-17	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-17	Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification	Rifen Lin et.al.	2511.13150	null
2025-11-17	VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language	Zonghao Ying et.al.	2511.13127	null
2025-11-14	Scalable Policy Evaluation with Video World Models	Wei-Cheng Tseng et.al.	2511.11520	null
2025-11-14	Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis	Feng-Qi Cui et.al.	2511.11406	null
2025-11-14	YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation	Pavel Rojtberg et.al.	2511.11344	null
2025-11-14	RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting	Ruocheng Wu et.al.	2511.11213	null
2025-11-14	VIDEOP2R: Video Understanding from Perception to Reasoning	Yifan Jiang et.al.	2511.11113	null
2025-11-14	LiteAttention: A Temporal Sparse Attention for Diffusion Transformers	Dor Shmilovich et.al.	2511.11062	null
2025-11-14	EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation	Zongyang Qiu et.al.	2511.11002	null
2025-11-14	Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment	Wenbin Bai et.al.	2511.10987	null
2025-11-14	Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition	Gunho Jung et.al.	2511.10958	null
2025-11-14	Language-Guided Graph Representation Learning for Video Summarization	Wenrui Li et.al.	2511.10953	null

(back to top)

Image Generation

Publish Date	Title	Authors	PDF	Code
2025-11-20	Dataset Distillation for Pre-Trained Self-Supervised Vision Models	George Cazenavette et.al.	2511.16674	null
2025-11-20	EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards	Omkar Thawakar et.al.	2511.16672	null
2025-11-20	V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models	Yang Luo et.al.	2511.16668	null
2025-11-20	SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation	Zhenyuan Qin et.al.	2511.16666	null
2025-11-20	Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems	Elias Lumer et.al.	2511.16654	null
2025-11-20	Measurement incompatibility in Bayesian multiparameter quantum estimation	Francesco Albarelli et.al.	2511.16645	null
2025-11-20	SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction	Guolin Huang et.al.	2511.16635	null
2025-11-20	SAM 3D: 3Dfy Anything in Images	SAM 3D Team et.al.	2511.16624	null
2025-11-20	Formal Abductive Latent Explanations for Prototype-Based Networks	Jules Soria et.al.	2511.16588	null
2025-11-20	PolyMinHash: Efficient Area-Based MinHashing of Polygons for Approximate Nearest Neighbor Search	Alima Subedi et.al.	2511.16576	null
2025-11-19	GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization	Yikun Wang et.al.	2511.15705	null
2025-11-19	Think Visually, Reason Textually: Vision-Language Synergy in ARC	Beichen Zhang et.al.	2511.15703	null
2025-11-19	Joint Semantic-Channel Coding and Modulation for Token Communications	Jingkai Ying et.al.	2511.15699	null
2025-11-19	VisPlay: Self-Evolving Vision-Language Models from Images	Yicheng He et.al.	2511.15661	null
2025-11-19	When to Think and When to Look: Uncertainty-Guided Lookback	Jing Bi et.al.	2511.15613	null
2025-11-19	MaskMed: Decoupled Mask and Class Prediction for Medical Image Segmentation	Bin Xie et.al.	2511.15603	null
2025-11-19	US-X Complete: A Multi-Modal Approach to Anatomical 3D Shape Recovery	Miruna-Alexandra Gafencu et.al.	2511.15600	null
2025-11-19	Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector	Weiheng Zhu et.al.	2511.15571	null
2025-11-19	Multimodal Evaluation of Russian-language Architectures	Artem Chervyakov et.al.	2511.15552	null
2025-11-19	UltraDP: Generalizable Carotid Ultrasound Scanning with Force-Aware Diffusion Policy	Ruoqu Chen et.al.	2511.15550	null
2025-11-18	ARC Is a Vision Problem!	Keya Hu et.al.	2511.14761	null
2025-11-18	UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning	Rui Tian et.al.	2511.14760	null
2025-11-18	Cell Shape Emerges from Motion	Gautham Gopinath et.al.	2511.14707	null
2025-11-18	Talk, Snap, Complain: Validation-Aware Multimodal Expert Framework for Fine-Grained Customer Grievances	Rishu Kumar Singh et.al.	2511.14693	null
2025-11-18	A Specialized Large Language Model for Clinical Reasoning and Diagnosis in Rare Diseases	Tao Yang et.al.	2511.14638	null
2025-11-18	SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction	Meiying Gu et.al.	2511.14633	null
2025-11-18	Gallant: Voxel Grid-based Humanoid Locomotion and Local-navigation across 3D Constrained Terrains	Qingwei Ben et.al.	2511.14625	null
2025-11-18	XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation	Yilin Zhang et.al.	2511.14604	null
2025-11-18	Task Addition and Weight Disentanglement in Closed-Vocabulary Models	Adam Hazimeh et.al.	2511.14569	null
2025-11-18	A Generative Data Framework with Authentic Supervision for Underwater Image Restoration and Enhancement	Yufeng Tian et.al.	2511.14521	null
2025-11-17	Back to Basics: Let Denoising Generative Models Denoise	Tianhong Li et.al.	2511.13720	null
2025-11-17	UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity	Junwei Yu et.al.	2511.13714	null
2025-11-17	Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine	Xincheng Shuai et.al.	2511.13713	null
2025-11-17	TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models	Harold Haodong Chen et.al.	2511.13704	null
2025-11-17	Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation	Sofia Jamil et.al.	2511.13689	null
2025-11-17	Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting	Jiangnan Ye et.al.	2511.13684	null
2025-11-17	Cross-Learning from Scarce Data via Multi-Task Constrained Optimization	Leopoldo Agorio et.al.	2511.13680	null
2025-11-17	PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image	Ziang Cao et.al.	2511.13648	null
2025-11-17	Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real-Synthetic Data Mixtures	Haohui Wang et.al.	2511.13640	null
2025-11-17	VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping	Haotian Dong et.al.	2511.13587	null
2025-11-14	LARM: A Large Articulated-Object Reconstruction Model	Sylvia Yuan et.al.	2511.11563	null
2025-11-14	Bridging Hidden States in Vision-Language Models	Benjamin Fein-Ashley et.al.	2511.11526	null
2025-11-14	CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation	Luthira Abeykoon et.al.	2511.11522	null
2025-11-14	SynthSoM-Twin: A Multi-Modal Sensing-Communication Digital-Twin Dataset for Sim2Real Transfer via Synesthesia of Machines	Junlong Chen et.al.	2511.11503	null
2025-11-14	PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models	Nhat Hoang-Xuan et.al.	2511.11502	null
2025-11-14	Visible and Terahertz Nonlinear Responses in the Topological Noble Metal Dichalcogenide PdTe2	George J. de Coster et.al.	2511.11493	null
2025-11-14	Data-efficient U-Net for Segmentation of Carbide Microstructures in SEM Images of Steel Alloys	Alinda Ezgi Gerçek et.al.	2511.11485	null
2025-11-14	ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation	Kaishen Wang et.al.	2511.11483	null
2025-11-14	Inferring response times of perceptual decisions with Poisson variational autoencoders	Hayden R. Johnson et.al.	2511.11480	null
2025-11-14	Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification	Qinghao Gao et.al.	2511.11460	null

(back to top)

Music Generation

Publish Date	Title	Authors	PDF	Code
2025-11-20	Music Recommendation with Large Language Models: Challenges, Opportunities, and Evaluation	Elena V. Epure et.al.	2511.16478	null
2025-11-20	Difficulty-Controlled Simplification of Piano Scores with Synthetic Data for Inclusive Music Education	Pedro Ramoneda et.al.	2511.16228	null
2025-11-19	Step-Audio-R1 Technical Report	Fei Tian et.al.	2511.15848	null
2025-11-19	LargeSHS: A large-scale dataset of music adaptation	Chih-Pin Tan et.al.	2511.15270	null
2025-11-19	Aligning Generative Music AI with Human Preferences: Methods and Challenges	Dorien Herremans et.al.	2511.15038	null
2025-11-18	A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder	Dengyun Huang et.al.	2511.14600	null
2025-11-18	MuCPT: Music-related Natural Language Model Continued Pretraining	Kai Tian et.al.	2511.14245	null
2025-11-17	Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases	Antonio Manuel Martínez-Heredia et.al.	2511.13987	null
2025-11-17	Preference-Based Learning in Audio Applications: A Systematic Analysis	Aaron Broukhim et.al.	2511.13936	null
2025-11-17	FoleyBench: A Benchmark For Video-to-Audio Models	Satvik Dixit et.al.	2511.13219	null
2025-11-13	Music Flamingo: Scaling Music Understanding in Audio Language Models	Sreyan Ghosh et.al.	2511.10289	null
2025-11-14	Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation	Xinyi Tong et.al.	2511.09585	null
2025-11-12	Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation	Shulei Ji et.al.	2511.09090	null
2025-11-12	Design of a Six-band, 2.4-Octave (80-420 GHz) Hierarchically Summed Phased-Array Slot-Dipole Antenna Array for NEW-MUSIC	Xiaolan Huang et.al.	2511.08990	null
2025-11-12	Improved Modeling of Quasi-Static Thermal and Optical Response of Lumped-Element Aluminum Manganese KIDs	Adriana Gavidia et.al.	2511.08959	null
2025-11-12	Low-Frequency Noise Performance of Microstrip-Coupled Lumped-Element Aluminum KIDs using Hydrogenated Amorphous Silicon Parallel-Plate Capacitors for NEW-MUSIC	Simon Hempel-Costello et.al.	2511.08898	null
2025-11-11	Chord-conditioned Melody and Bass Generation	Alexandra C Salem et.al.	2511.08755	null
2025-11-14	Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models	Yi Yang et.al.	2511.08252	null
2025-11-11	Automatic Music Mixing using a Generative Model of Effect Embeddings	Eloi Moliner et.al.	2511.08040	null
2025-11-10	Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics	Jonathan Lehmkuhl et.al.	2511.07268	null
2025-11-06	MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers	Ali Boudaghi et.al.	2511.04376	null
2025-11-06	MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation	Shih-Lun Wu et.al.	2511.03942	null
2025-11-02	Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures	Barathi Subramanian et.al.	2511.00793	null
2025-10-28	GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment	Jinting Wang et.al.	2510.26818	null
2025-10-27	Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders	Nathan Paek et.al.	2510.23802	null
2025-10-25	Streaming Generation for Music Accompaniment	Yusong Wu et.al.	2510.22105	null
2025-10-23	GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer	Jackson Loth et.al.	2510.21872	null
2025-10-21	Steering Autoregressive Music Generation with Recursive Feature Machines	Daniel Zhao et.al.	2510.19127	null
2025-10-18	MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding	Jingyue Huang et.al.	2510.16273	null
2025-10-16	Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?	Qixin Deng et.al.	2510.14249	null
2025-10-15	UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE	Zhenyu Liu et.al.	2510.13344	null
2025-10-17	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	null
2025-10-11	ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis	Stephen Ni-Hahn et.al.	2510.10249	null
2025-10-07	LARA-Gen: Enabling Continuous Emotion Control for Music Generation Models via Latent Affective Representation Alignment	Jiahao Mei et.al.	2510.05875	null
2025-10-02	Bias beyond Borders: Global Inequalities in AI-Generated Music	Ahmet Solak et.al.	2510.01963	null
2025-10-15	SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing	Jiaye Tan et.al.	2510.00395	null
2025-10-04	HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling	Hung-Ying Chu et.al.	2509.25694	null
2025-09-29	Ethics Statements in AI Music Papers: The Effective and the Ineffective	Julia Barnett et.al.	2509.25496	null
2025-09-29	Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music	Tianle Wang et.al.	2509.24603	null
2025-10-01	An Agent-Based Framework for Automated Higher-Voice Harmony Generation	Nia D'Souza Ganapathy et.al.	2509.24463	null
2025-09-28	Time-Shifted Token Scheduling for Symbolic Music Generation	Ting-Kang Wang et.al.	2509.23749	null
2025-09-28	AudioMoG: Guiding Audio Generation with Mixture-of-Guidance	Junyou Wang et.al.	2509.23727	null
2025-09-27	AI-Assisted Music Production: A User Study on Text-to-Music Models	Francesca Ronchini et.al.	2509.23364	null
2025-09-26	Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach	Zijian Zhao et.al.	2509.22378	null
2025-09-26	MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan	Xuanchen Wang et.al.	2509.21714	null
2025-09-21	Difficulty-Aware Score Generation for Piano Sight-Reading	Pedro Ramoneda et.al.	2509.16913	null
2025-09-17	Assessing Data Replication in Symbolic Music via Adapted Structural Similarity Index Measure	Shulei Ji et.al.	2509.13658	null
2025-09-13	A Traditional Approach to Symbolic Piano Continuation	Christian Zhou-Zheng et.al.	2509.12267	null
2025-09-14	Decoding Musical Origins: Distinguishing Human and AI Composers	Cheng-Yang Tsai et.al.	2509.11369	null
2025-09-14	STASE: A spatialized text-to-audio synthesis engine for music generation	Tutti Chi et.al.	2509.11124	null

(back to top)

Audio Codec

Publish Date	Title	Authors	PDF	Code
2025-11-20	Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs	Wei-Cheng Tseng et.al.	2511.16639	null
2025-11-20	SUNAC: Source-aware Unified Neural Audio Codec	Ryo Aihara et.al.	2511.16126	null
2025-11-18	OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models	Keda Tao et.al.	2511.14582	null
2025-11-18	Segmentwise Pruning in Audio-Language Models	Marcel Gibier et.al.	2511.14293	null
2025-11-18	SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM	An Yu et.al.	2511.14143	null
2025-11-17	PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement	Xiaobin Rong et.al.	2511.13300	null
2025-11-16	Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data	Yunxin Li et.al.	2511.12609	null
2025-11-15	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing	Zhisheng Zheng et.al.	2511.12347	null
2025-11-15	Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound	Dengming Zhang et.al.	2511.12077	null
2025-11-14	Evaluation of Audio Compression Codecs	Thien T. Duong et.al.	2511.11527	null
2025-11-14	AV-Dialog: Spoken Dialogue Models with Audio-Visual Input	Tuochao Chen et.al.	2511.11124	null
2025-11-14	AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization	Zhonghua Jiang et.al.	2511.11106	null
2025-11-14	TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models	Hualei Wang et.al.	2511.11039	null
2025-11-09	Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment	Yan Gao et.al.	2511.10670	null
2025-11-13	VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction	Yuhao Wang et.al.	2511.10232	null
2025-11-13	Towards Leveraging Sequential Structure in Animal Vocalizations	Eklavya Sarkar et.al.	2511.10190	null
2025-11-12	POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation	Xuanchen Li et.al.	2511.09232	null
2025-11-12	HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios	Bingsong Bai et.al.	2511.08496	null
2025-11-10	Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models	Umberto Cappellazzo et.al.	2511.07253	null
2025-11-10	Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection	Brage Eilertsen et.al.	2511.07065	null
2025-11-08	BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction	Haoran Wang et.al.	2511.06150	null
2025-11-05	Seeing What You Say: Expressive Image Generation from Speech	Jiyoung Lee et.al.	2511.03423	null
2025-11-05	Open Source State-Of-the-Art Solution for Romanian Speech Recognition	Gabriel Pirlogeanu et.al.	2511.03361	null
2025-11-05	audio2chart: End to End Audio Transcription into playable Guitar Hero charts	Riccardo Tripodi et.al.	2511.03337	null
2025-11-04	An Evaluation of Interleaved Instruction Tuning on Semantic Reasoning Performance in an Audio MLLM	Jiawei Liu et.al.	2511.02234	null
2025-11-03	ADNAC: Audio Denoiser using Neural Audio Codec	Daniel Jimon et.al.	2511.01773	null
2025-10-30	UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens	Chengwei Liu et.al.	2510.26372	null
2025-10-30	Modeling strategies for speech enhancement in the latent space of a neural audio codec	Sofiene Kammoun et.al.	2510.26299	null
2025-10-29	PitchFlower: A flow-based neural audio codec with pitch controllability	Diego Torres et.al.	2510.25566	null
2025-10-29	Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR	Shreyas Gopal et.al.	2510.25150	null
2025-10-28	Bayesian Speech synthesizers Can Learn from Multiple Teachers	Ziyang Zhang et.al.	2510.24372	null
2025-10-28	Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations	Ahmad Ghannam et.al.	2510.24247	null
2025-10-28	Low-Resource Audio Codec (LRAC): 2025 Challenge Description	Kamil Wojcicki et.al.	2510.23312	null
2025-10-25	FOA Tokenizer: Low-bitrate Neural Codec for First Order Ambisonics with Spatial Consistency Loss	Parthasaarathy Sudarsanam et.al.	2510.22241	null
2025-10-24	SpecTokenizer: A Lightweight Streaming Codec in the Compressed Spectrum Domain	Zixiang Wan et.al.	2510.21209	null
2025-10-24	Robust Distortion-Free Watermark for Autoregressive Audio Generation Models	Yihan Wu et.al.	2510.21115	null
2025-10-23	Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding	Xin Zhang et.al.	2510.20504	null
2025-10-23	UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement	Haoyin Yan et.al.	2510.20441	null
2025-10-19	SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization	Wenxi Chen et.al.	2510.16841	null
2025-10-19	U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation	Xusheng Yang et.al.	2510.16718	null
2025-10-17	LDCodec: A high quality neural audio codec with low-complexity decoder	Jiawei Jiang et.al.	2510.15364	null
2025-10-17	Extending Audio Context for Long-Form Understanding in Large Audio-Language Models	Yuatyong Chaichana et.al.	2510.15231	null
2025-10-20	LongCat-Audio-Codec: An Audio Tokenizer and Detokenizer Solution Designed for Speech Large Language Models	Xiaohan Zhao et.al.	2510.15227	null
2025-10-16	TASLA: Text-Aligned Speech Tokens with Multiple Layer-Aggregation	Ming-Hao Hsu et.al.	2510.14934	null
2025-10-15	Acoustic Teleportation via Disentangled Neural Audio Codec Representations	Philipp Grundhuber et.al.	2510.13221	null
2025-10-13	UALM: Unified Audio Language Model for Understanding, Generation and Reasoning	Jinchuan Tian et.al.	2510.12000	null
2025-10-13	BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis	Jingyuan Xing et.al.	2510.11646	null
2025-10-12	FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec	Yurii Halychanskyi et.al.	2510.10785	null
2025-10-11	SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation	Zeyu Ling et.al.	2510.10069	null
2025-10-11	MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction	Jianjin Wang et.al.	2510.10003	null

(back to top)

Large Audio Language Model

Publish Date	Title	Authors	PDF	Code
2025-11-20	Cognitive Foundations for Reasoning and Their Manifestation in LLMs	Priyanka Kargupta et.al.	2511.16660	null
2025-11-20	SUNAC: Source-aware Unified Neural Audio Codec	Ryo Aihara et.al.	2511.16126	null
2025-11-20	Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio	Mohan Shi et.al.	2511.16046	null
2025-11-20	Multimodal Evaluation of Russian-language Architectures	Artem Chervyakov et.al.	2511.15552	null
2025-11-19	Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding	Mingyue Huo et.al.	2511.15145	null
2025-11-18	A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder	Dengyun Huang et.al.	2511.14600	null
2025-11-18	OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models	Keda Tao et.al.	2511.14582	null
2025-11-18	Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning	Trishala Jayesh Ahalpara et.al.	2511.14445	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	null
2025-11-18	Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions	Marcel Gibier et.al.	2511.14307	null
2025-11-18	Segmentwise Pruning in Audio-Language Models	Marcel Gibier et.al.	2511.14293	null
2025-11-18	SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM	An Yu et.al.	2511.14143	null
2025-11-18	O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents	Piaohong Wang et.al.	2511.13593	null
2025-11-17	Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs	Zhe Sun et.al.	2511.13273	null
2025-11-17	You Only Look Omni Gradient Backpropagation for Moving Infrared Small Target Detection	Guoyi Zhang et.al.	2511.13013	null
2025-11-16	Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data	Yunxin Li et.al.	2511.12609	null
2025-11-16	DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions	Xiaoyu Lin et.al.	2511.12452	null
2025-11-16	SynthGuard: An Open Platform for Detecting AI-Generated Multimedia with Multimodal LLMs	Shail Desai et.al.	2511.12404	null
2025-11-15	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing	Zhisheng Zheng et.al.	2511.12347	null
2025-11-15	Learning to Hear by Seeing: It's Time for Vision Language Models to Understand Artistic Emotion from Sight and Sound	Dengming Zhang et.al.	2511.12077	null
2025-11-14	AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization	Zhonghua Jiang et.al.	2511.11106	null
2025-11-14	TimeAudio: Bridging Temporal Gaps in Large Audio-Language Models	Hualei Wang et.al.	2511.11039	null
2025-11-14	DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition	HongYu Liu et.al.	2511.11000	null
2025-11-14	Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio	Guangke Chen et.al.	2511.10913	null
2025-11-14	OmniVGGT: Omni-Modality Driven Visual Geometry Grounded Transformer	Haosong Peng et.al.	2511.10560	null
2025-11-13	Music Flamingo: Scaling Music Understanding in Audio Language Models	Sreyan Ghosh et.al.	2511.10289	null
2025-11-13	OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models	Yuping Yan et.al.	2511.10287	null
2025-11-14	Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard	Yudong Yang et.al.	2511.10222	null
2025-11-13	When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?	Qilang Ye et.al.	2511.10059	null
2025-11-13	Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism	Jinhong Jeong et.al.	2511.10045	null
2025-11-13	Reinforcing Trustworthiness in Multimodal Emotional Support Systems	Huy M. Le et.al.	2511.10011	null
2025-11-13	Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation	Xiangyi Wei et.al.	2511.09958	null
2025-11-13	HI-TransPA: Hearing Impairments Translation Personal Assistant	Zhiming Ma et.al.	2511.09915	null
2025-11-12	State Space Modeling of Mortgage Default Rates under Natural Hazard Shocks	Samuel J. Eschker et.al.	2511.09698	null
2025-11-11	Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models	Umberto Cappellazzo et.al.	2511.07253	link
2025-11-06	CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese	Dazhong Chen et.al.	2511.04139	null
2025-11-06	WST: Weakly Supervised Transducer for Automatic Speech Recognition	Dongji Gao et.al.	2511.04035	null
2025-11-05	Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything	Huawei Lin et.al.	2511.02834	null
2025-11-05	The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models	Claudia Herambourg et.al.	2511.02589	null
2025-11-03	SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia	Chaoqun Liu et.al.	2511.01670	null
2025-11-03	Classification of motor faults based on transmission coefficient and reflection coefficient of omni-directional antenna using DCNN	Sagar Dutta et.al.	2511.01371	null
2025-11-06	OmniVLA: Physically-Grounded Multimodal VLA with Unified Multi-Sensor Perception for Robotic Manipulation	Heyu Guo et.al.	2511.01210	null
2025-11-02	Feedback-driven Retrieval-augmented Audio Generation with Large Audio Language Models	Junqi Zhao et.al.	2511.01091	null
2025-10-31	LongCat-Flash-Omni Technical Report	Meituan LongCat Team et.al.	2511.00279	null
2025-10-31	Sensor operating point calibration and monitoring of the ALICE Inner Tracking System during LHC Run 3	D. Agguiaro et.al.	2510.27592	null
2025-10-30	ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models	Weifei Jin et.al.	2510.26096	null
2025-10-29	Convergence of a Relative-type Inexact Proximal ALM for Convex Nonlinear Programming	Lei Yang et.al.	2510.25261	null
2025-10-28	Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation	Inclusion AI et.al.	2510.24821	null
2025-10-28	Generative View Stitching	Chonghyuk Song et.al.	2510.24718	null
2025-10-28	STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence	Zihan Liu et.al.	2510.24693	null

(back to top)
