To support ongoing research in music generation, we maintain a continuously updated repository with a comprehensive list of datasets, papers, and other resources on the use of large language models in music generation. The repository aims to gather all relevant resources and links for crafting high-fidelity music from multimodal inputs such as text and images.
- Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, & Wei Han. (2023). Noise2Music: Text-conditioned Music Generation with Diffusion Models. paper
- Flavio Schneider, Ojasv Kamal, Zhijing Jin, & Bernhard Schölkopf. (2023). Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion. paper
- MubertAI link
- Riffusion link
- Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, & Alexandre Défossez. (2024). Simple and Controllable Music Generation. paper
- Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, & Christian Frank. (2023). MusicLM: Generating Music From Text. paper
- Briot, J.P., & Pachet, F. (2018). Deep learning for music generation: challenges and directions. Neural Computing and Applications, 32(4), 981–993. paper
- MAST Rhythm Dataset dataset
- Indian Classical Music Dataset dataset : contains audio samples for 8 different Raagas
| Name | Description | URL | Data Type | Total Duration | Total Audio Number | Status |
|---|---|---|---|---|---|---|
| AudioSet | A large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. Segments were nominated for annotation using YouTube metadata and content-based search, then verified by human annotators; the sound events form a subset of the AudioSet ontology (see the ICASSP 2017 paper for details on dataset construction). Contains 2,084,320 YouTube videos covering 527 labels. | Click here | class labels, video, audio | 5420hrs | 1951460 | processed |
| AudioSet Strong | Audio events from AudioSet clips with single class label annotation | Click here | 1 class label, video, audio | 625.93hrs | 1074359 | processed (@marianna13#7139) |
| BBC sound effects | 33066 sound effects with text descriptions. Type: mostly environmental sound. Each audio has a natural text description. (License still needs to be checked.) | Click here | 1 caption, audio | 463.48hrs | 15973 | processed |
| AudioCaps | 40,000 audio clips of 10 seconds each, organized in three splits: training, validation, and testing. Type: environmental sound. | Click here | 1 caption, audio | 144.94hrs | 52904 | processed |
| Audio Caption Hospital & Car Dataset | 3700 audio clips from the "Hospital" scene and around 3600 audio clips from the "Car" scene. Every audio clip is 10 seconds long and is annotated with five captions. Type: environmental sound. | Click here | 5 captions, audio | 10.64 + 20.91hrs | 3709 + 7336 | we don't need that |
| Clotho dataset | Clotho consists of 6974 audio samples, each with five captions (34,870 captions in total). Audio samples are 15 to 30 s long and captions are eight to 20 words long. Type: environmental sound. | Click here | 5 captions, audio | 37.0hrs | 5929 | processed |
| Audiostock | Royalty-free music library. 436,864 audio effects (of which 10k are available), each with a text description. | Click here | 1 caption & tags, audio | 46.30hrs | 10000 | 10k sound effects processed(@marianna13#7139) |
| ESC-50 | 2000 environmental audio recordings with 50 classes | Click here | 1 class label, audio | 2.78hrs | 2000 | processed(@marianna13#7139) |
| VGG-Sound | VGG-Sound is an audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube | Click here | 1 class label, video, audio | 560hrs | 200,000 + | processed(@marianna13#7139) |
| FUSS | The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus. | Click here | no class label, audio | 61.11hrs | 22000 | |
| UrbanSound8K | 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes | Click here | 1 class label, audio | 8.75hrs | 8732 | processed(@Yuchen Hui#8574) |
| FSD50K | 51,197 audio clips of 200 classes | Click here | class labels, audio | 108.3hrs | 51197 | processed(@Yuchen Hui#8574) |
| YFCC100M | YFCC100M contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all carrying a Creative Commons license, including 8081 hours of audio. | Click here | title, tags, audio, video, Flickr identifier, owner name, camera, geo, media source | 8081hrs | | requested access (@marianna13#7139) |
| ACAV100M | 100M video clips with audio, each 10 sec, with automatic AudioSet, Kinetics400, and ImageNet labels. Noisy, but very large. | Click here | class labels/tags, audio | 31 years | 100 million | |
| Free To Use Sounds | 10,000+ sound effects (paid library, $23) | Click here | 1 caption & tags, audio | 175.73hrs | 6370 | |
| MACS - Multi-Annotator Captioned Soundscapes | This is a dataset containing audio captions and corresponding audio tags for a number of 3930 audio files of the TAU Urban Acoustic Scenes 2019 development dataset (airport, public square, and park). The files were annotated using a web-based tool. Each file is annotated by multiple annotators that provided tags and a one-sentence description of the audio content. The data also includes annotator competence estimated using MACE (Multi-Annotator Competence Estimation). | Click here | multiple captions & tags, audio | 10.92hrs | 3930 | processed(@marianna13#7139 & @krishna#1648 & Yuchen Hui#8574) |
| Sonniss Game effects | Sound effects | no link | tags & filenames, audio | 84.6hrs | 5049 | processed |
| WeSoundEffects | Sound effects | no link | tags & filenames, audio | 12.00hrs | 488 | processed |
| Paramount Motion - Odeon Cinematic Sound Effects | Sound effects | no link | 1 tag, audio | 19.49hrs | 4420 | processed |
| Free Sound | Audio with text description (noisy) | Click here | pertinent text, audio | 3003.38hrs | 515581 | processed(@Chr0my#0173 & @Yuchen Hui#8574) |
| Sound Ideas | Sound effects library | Click here | 1 caption, audio | | | |
| Boom Library | Sound effects library | Click here | 1 caption, audio | | | assigned(@marianna13#7139) |
| Epidemic Sound (Sound effect part) | Royalty free music and sound effects | Click here | Class labels, audio | 220.41hrs | 75645 | metadata downloaded(@Chr0my#0173), processed (@Yuchen Hui#8547) |
| Audio Grounding dataset | An augmented audio captioning dataset; hard to describe concisely, please refer to the URL for details. | Click here | 1 caption, many tags, audio | 12.57hrs | 4590 | |
| Fine-grained Vocal Imitation Set | This dataset includes 763 crowd-sourced vocal imitations of 108 sound events. | Click here | 1 class label, audio | 1.55hrs | 1468 | processed(@marianna13#7139) |
| Vocal Imitation | The VocalImitationSet is a collection of crowd-sourced vocal imitations of a large set of diverse sounds collected from Freesound (https://freesound.org/), which were curated based on Google's AudioSet ontology (https://research.google.com/audioset/). | Click here | 1 class label, audio | 24.06hrs | 9100 files | processed(@marianna13#7139) |
| VocalSketch | Contains thousands of vocal imitations of a large set of diverse sounds. The dataset also contains data on hundreds of people's ability to correctly label these vocal imitations, collected via Amazon's Mechanical Turk. | Click here | 1 class label, audio | 18.86hrs | 16645 | processed(@marianna13#7139) |
| VimSketch Dataset | VimSketch combines two publicly available datasets (VocalSketch + Vocal Imitation), removing some items from each. | Click here | class labels, audio | Not important | Not important | |
| OtoMobile Dataset | A collection of recordings of failing car components, created by the Interactive Audio Lab at Northwestern University. OtoMobile consists of 65 recordings of vehicles with failing components, along with annotations. | Click here (restricted access) | class labels & tags, audio | Unknown | 59 | |
| DCASE17 Task 4 | DCASE 2017 Task 4: large-scale weakly supervised sound event detection for smart cars | Click here | | | | |
| Knocking Sound Effects With Emotional Intentions | A dataset of knocking sound effects with emotional intention, recorded at a professional foley studio. Five types of emotions are portrayed: anger, fear, happiness, neutral, and sadness. | Click here | 1 class label, audio | | 500 | processed(@marianna13#7139) |
| WavText5K | A collection consisting of 4525 audios, 4348 descriptions, 4525 audio titles, and 2058 tags. | Click here | 1 label, tags & audio | | 4525 | processed(@marianna13#7139) |
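Several of the corpora above (e.g. AudioSet and AudioSet Strong) distribute their annotations as CSV segment lists of the form `ytid, start_seconds, end_seconds, "label1,label2"`. A minimal parsing sketch in Python; the column layout follows AudioSet's released segment files, so treat it as an assumption when applying it to other datasets:

```python
import csv
import io

def parse_audioset_segments(text):
    """Parse AudioSet-style segment CSV lines into dicts.

    Expected layout (an assumption based on AudioSet's segment files):
        ytid, start_seconds, end_seconds, "label1,label2,..."
    Lines starting with '#' are treated as header/comment lines.
    """
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # skipinitialspace lets the quoted label field parse correctly
        ytid, start, end, labels = next(
            csv.reader(io.StringIO(line), skipinitialspace=True)
        )
        rows.append({
            "ytid": ytid,
            "start": float(start),
            "end": float(end),
            "labels": labels.split(","),
        })
    return rows
```

With a line like `--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/m/05zppz"`, this yields a 10-second segment with two ontology labels.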
| Name | Description | URL | Text Type | Status |
|---|---|---|---|---|
| Free Music Archive | We introduce the Free Music Archive (FMA), an open and easily accessible dataset suitable for evaluating several tasks in MIR, a field concerned with browsing, searching, and organizing large music collections. The community's growing interest in feature and end-to-end learning is however restrained by the limited availability of large audio datasets. The FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length and high-quality audio, pre-computed features, together with track- and user-level metadata, tags, and free-form text such as biographies. We here describe the dataset and how it was created, propose a train/validation/test split and three subsets, discuss some suitable MIR tasks, and evaluate some baselines for genre recognition. Code, data, and usage examples are available at https://github.com/mdeff/fma. | Click here | tags/class labels, audio | processed(@marianna13#7139) |
| MusicNet | MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; we estimate a labeling error rate of 4%. We offer the MusicNet labels to the machine learning and music communities as a resource for training models and a common benchmark for comparing results. URL: https://homes.cs.washington.edu/~thickstn/musicnet.html | Click here | class labels, audio | processed(@IYWO#9072) |
| MetaMIDI Dataset | We introduce the MetaMIDI Dataset (MMD), a large-scale collection of 436,631 MIDI files and metadata. In addition to the MIDI files, we provide artist, title, and genre metadata collected during the scraping process when available. MIDIs in MMD were matched against a collection of 32,000,000 30-second audio clips retrieved from Spotify, resulting in over 10,796,557 audio-MIDI matches. In addition, we linked 600,142 Spotify tracks with 1,094,901 MusicBrainz recordings to produce a set of 168,032 MIDI files matched to the MusicBrainz database. These links augment many files in the dataset with the extensive metadata available via the Spotify API and the MusicBrainz database. We anticipate that this collection of data will be of great use to MIR researchers addressing a variety of research topics. | Click here | tags, audio | |
| MUSDB18-HQ | MUSDB18 consists of a total of 150 full-track songs of different styles and includes both the stereo mixtures and the original sources, divided between a training subset and a test subset. | Click here | 1 class label, audio | processed(@marianna13#7139) |
| Cambridge-mt Multitrack Dataset | Here’s a list of multitrack projects which can be freely downloaded for mixing practice purposes. All these projects are presented as ZIP archives containing uncompressed WAV files (24-bit or 16-bit resolution and 44.1kHz sample rate). | Click here | 1 class label, audio | processed(@marianna13#7139) |
| Slakh | The Synthesized Lakh (Slakh) Dataset contains 2100 automatically mixed tracks and accompanying MIDI files synthesized using a professional-grade sampling engine. | Click here | 1 class label, audio | processed(krishna#1648) |
| Tunebot | The Tunebot project is an online query-by-humming system. Users sing a song to Tunebot and it returns a ranked list of song candidates available on Apple's iTunes website. The database that Tunebot compares sung queries against is crowdsourced from users as well: users contribute new songs by singing them on the Tunebot website, and the more songs people contribute, the better Tunebot works. Tunebot is no longer online, but the dataset lives on. | Click here | song name (so transcription), audio | processed(@marianna13#7139) |
| Juno | A music review website | Click here | pertinent text/class labels, audio | metadata downloaded(@dicknascarsixtynine#3885) & processed(@marianna13#7139) |
| Pitchfork | Music review website | Click here | pertinent text (long paragraphs), audio | |
| Genius | Music lyrics website | | pertinent text (long paragraphs), audio | assigned(@marianna13#7139) |
| IDMT-SMT-Audio-Effects | The IDMT-SMT-Audio-Effects database is a large database for automatic detection of audio effects in recordings of electric guitar and bass and related signal processing. | Click here | class label, audio | |
| MIDI50K | Music rendered from MIDI files using the synthesizer available at https://pypi.org/project/midi2audio/ | Temporarily unavailable, will be added soon | MIDI files, audio | Processing(@marianna13#7139) |
| MIDI130K | Music rendered from MIDI files using the synthesizer available at https://pypi.org/project/midi2audio/ | Temporarily unavailable, will be added soon | MIDI files, audio | Processing(@marianna13#7139) |
| MillionSongDataset | 72,222 hours of general music as 30-second clips, from one million different songs. | Temporarily unavailable | tags, artist names, song titles, audio | |
| synth1B1 | One million hours of audio: one billion 4-second synthesized sounds. The corpus is multi-modal: Each sound includes its corresponding synthesis parameters. Since it is faster to render synth1B1 in-situ than to download it, torchsynth includes a replicable script for generating synth1B1 within the GPU. | Click here | synthesis parameters, audio | |
| Epidemic Sound (music part) | Royalty free music and sound effects | Click here | class label, tags, audio | assigned(@chr0my#0173) |
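MusicNet (above) acquires its note labels by aligning musical scores to recordings with dynamic time warping (DTW). As a reference point, here is a minimal DTW sketch over two 1-D feature sequences in plain Python; this is an illustrative textbook implementation, not MusicNet's actual alignment code:

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic-programming DTW cost between two 1-D sequences.

    D[i][j] holds the minimum cumulative cost of aligning a[:i] with b[:j];
    each cell extends the cheapest of the three predecessor alignments
    (insertion, deletion, or match).
    """
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(
                D[i - 1][j],      # skip a frame of `a`
                D[i][j - 1],      # skip a frame of `b`
                D[i - 1][j - 1],  # match the two frames
            )
    return D[n][m]
```

Because DTW allows frames to repeat, a sequence aligns with zero cost to a time-stretched copy of itself, e.g. `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is `0.0`; score-to-audio alignment exploits exactly this tolerance to tempo variation.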