Datasets

Common Voice

Common Voice Spontaneous Speech 3.0 - Croatian

A collection of spontaneous responses to questions in Croatian.
License: CC0-1.0

Locale: hr

Task: ASR

Format: MP3

Size: 285.11 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Danish

A collection of spontaneous responses to questions in Danish.
License: CC0-1.0

Locale: da

Task: ASR

Format: MP3

Size: 61.80 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Ruuli

A collection of spontaneous responses to questions in Ruuli.
License: CC0-1.0

Locale: ruc

Task: ASR

Format: MP3

Size: 365.95 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Irish

A collection of spontaneous responses to questions in Irish.
License: CC0-1.0

Locale: ga-IE

Task: ASR

Format: MP3

Size: 3.14 MB

EELLAK - GreekFOSS

Istorima

Oral history interviews from the Istorima archive (transcriptions and metadata) in Greek, covering social, cultural, and historical topics.
License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 416.02 MB

UP EEEI - Digital Signal Processing Laboratory

UP - DSP - Philippine Languages Database (UP-DSP-PLD)

A multilingual corpus for ten Philippine languages containing over 454 hours of recordings.
License: CC-BY-NC-4.0

Locale: phi

Task: ASR

Format: WAV, LOG

Size: 45.63 GB

Community

Urdu Multi-Speaker TTS Dataset

An Urdu multi-speaker TTS dataset distributed in 36 zip files, each containing audio files and a TSV mapping file, with approximately 10 hours of speech.
License: CC-BY-NC-4.0

Locale: urd

Task: TTS

Format: WEBM, TSV

Size: 514.54 MB

Balochistan Educational and Cultural Organization

BECO Brahui Literature Corpus

A ~355k-token Brahui literary corpus of short stories, novels, and other creative works for linguistic research and NLP.
License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.19 MB

Community

Malayalam Time-Aligned Speech Corpus

A Malayalam speech dataset containing 100 audio files with time-aligned .srt transcriptions from 5 speakers.
License: CC-BY-NC-4.0

Locale: mal

Task: ASR

Format: WAV, SRT

Size: 1.50 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 8.07 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.
License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 7.68 GB

Community

TODa: Tamazight Open Dataset

Welcome to the Tamazight Open Dataset (TODa), a groundbreaking open-source project dedicated to preserving and advancing the Tamazight language. With its extensive collection of linguistic data, TODa stands as a pioneering collaborative project for Tamazight <=> English translation, specifically designed for Natural Language Processing applications. TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances. What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity. Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community. We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.
License: CC-BY-4.0

Locale: zgh

Task: NLP

Format: CSV

Size: 3.27 MB

Community

TTS Balinese Language

This TTS dataset contains Balinese as used in everyday activities.
License: CC-BY-SA-4.0

Locale: ban

Task: TTS

Format: WEBM, TSV

Size: 301.05 MB

Community

Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. (https://github.com/kaiidams/Kokoro-Speech-Dataset)
License: libribox

Locale: ja

Task: TTS

Format: FLAC

Size: 3.98 GB

Community

Sundanese TTS

This dataset uses the Priangan dialect of West Java with Indonesian code-mixing and code-switching.
License: CC-BY-SA-4.0

Locale: sun

Task: TTS

Format: WEBM, TSV

Size: 298.10 MB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.
License: GPL-3.0

Locale: es-US, en-US

Task: ASR

Format: MP3, CHA, TSV

Size: 1.12 GB

Keblagh e Azergi

Elkhani Hazargi Literature Corpus

Hazargi literary corpus (~0.5M tokens) of poetry, folklore, and prose texts representing Hazara linguistic and cultural heritage.
License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 2.46 MB

Aim Foundation

Dari Literature Corpus by Anjuman e Adabi Nayestan

A ~1M-token Dari (Afghan Persian) literary corpus compiled by Anjuman e Adabi Nayestan, covering prose, poetry, and cultural texts in Perso-Arabic script.
License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 12.67 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.
License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions totaling 450,000 words.
License: GPL-3.0

Locale: cym

Task: ASR

Format: MP3, CHA, TSV

Size: 2.13 GB

MDC Community Concierge

Bangor Patagonia Welsh-Spanish Corpus

Welsh-Spanish corpus containing around 195,000 words.
License: GPL-3.0

Locale: cym, spa

Task: ASR

Format: MP3, CHA, TSV

Size: 988.02 MB

Kaleem Art Press

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.
License: CC-BY-NC-4.0

Locale: mul

Task: MT

Format: CSV

Size: 1.92 MB

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.
License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.30 MB