Datasets

Common Voice

Common Voice Spontaneous Speech 3.0 - Croatian

A collection of spontaneous responses to questions in Croatian.

License: CC0-1.0

Locale: hr

Task: ASR

Format: MP3

Size: 285.11 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Danish

A collection of spontaneous responses to questions in Danish.

License: CC0-1.0

Locale: da

Task: ASR

Format: MP3

Size: 61.80 KB

Common Voice

Common Voice Spontaneous Speech 3.0 - Ruuli

A collection of spontaneous responses to questions in Ruuli.

License: CC0-1.0

Locale: ruc

Task: ASR

Format: MP3

Size: 365.95 MB

Common Voice

Common Voice Spontaneous Speech 3.0 - Irish

A collection of spontaneous responses to questions in Irish.

License: CC0-1.0

Locale: ga-IE

Task: ASR

Format: MP3

Size: 3.14 MB

EELLAK - GreekFOSS

Istorima

Oral history interviews from the Istorima archive (transcriptions and metadata) in Greek, covering social, cultural, and historical topics.

License: CC BY-NC-ND 4.0

Locale: gr-GR

Task: NLP

Format: PARQUET

Size: 416.02 MB

UP EEEI - Digital Signal Processing Laboratory

UP - DSP - Philippine Languages Database (UP-DSP-PLD)

A multilingual corpus for ten Philippine languages containing over 454 hours of recordings.

License: CC-BY-NC-4.0

Locale: phi

Task: ASR

Format: WAV, LOG

Size: 45.63 GB

Community

Urdu Multi-Speaker TTS Dataset

An Urdu multi-speaker TTS dataset distributed in 36 zip files, each containing audio files and a TSV mapping file, with approximately 10 hours of speech.

License: CC-BY-NC-4.0

Locale: urd

Task: TTS

Format: WEBM, TSV

Size: 514.54 MB

Balochistan Educational and Cultural Organization

BECO Brahui Literature Corpus

A ~355k-token Brahui literary corpus of short stories, novels, and other creative works for linguistic research and NLP.

License: CC-BY-NC-SA-4.0

Locale: brh

Task: NLP

Format: TXT

Size: 1.19 MB

Community

Malayalam Time-Aligned Speech Corpus

A Malayalam speech dataset containing 100 audio files with time-aligned .srt transcriptions from 5 speakers.

License: CC-BY-NC-4.0

Locale: mal

Task: ASR

Format: WAV, SRT

Size: 1.50 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part3

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 1.33 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part2

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 8.07 GB

Digital Divide Data

ddd-kenya-somali-68hrs-asr-part1

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Somali language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: som

Task: ASR

Format: WAV, TSV

Size: 7.68 GB

Community

TODa: Tamazight Open Dataset

Welcome to the Tamazight Open Dataset (TODa), a groundbreaking open-source project dedicated to preserving and advancing the Tamazight language. With its extensive collection of linguistic data, TODa stands as a pioneering collaborative project for Tamazight <=> English translation, specifically designed for Natural Language Processing applications. TODa's unique approach combines both semantic and syntactic categorization methods, offering a rich representation of words in their various contexts and forms. The dataset encompasses a comprehensive collection of linguistic elements, including detailed verb conjugations across different tenses, noun variations, and an extensive compilation of translated expressions that capture the language's nuances. What sets TODa apart is its inclusive approach to Tamazight's writing systems. The dataset thoughtfully incorporates Latin alphabets, acknowledging and preserving the diverse writing traditions practiced across Amazigh communities. This dual-script approach ensures broader accessibility and cultural authenticity. Our vision is to establish TODa as the cornerstone resource for Tamazight Natural Language Processing. Through this meticulously curated dataset, we strive to empower developers and researchers to create innovative NLP solutions that authentically serve the Amazigh-speaking community. We take pride in our current progress, yet acknowledge that language documentation is an evolving journey. We actively encourage participation from the Amazigh technology community to contribute their expertise in expanding and refining the dataset. Through collaborative effort, we can create a robust foundation for technological innovations that honor and advance Amazigh linguistic heritage.

License: CC-BY-4.0

Locale: zgh

Task: NLP

Format: CSV

Size: 3.27 MB

Community

TTS Balinese Language

This TTS dataset contains Balinese language used in daily activities.

License: CC-BY-SA-4.0

Locale: ban

Task: TTS

Format: WEBM, TSV

Size: 301.05 MB

Community

Kokoro Speech Dataset

Kokoro Speech Dataset is a public domain Japanese speech dataset. (https://github.com/kaiidams/Kokoro-Speech-Dataset)

License: libribox

Locale: ja

Task: TTS

Format: FLAC

Size: 3.98 GB

Community

Sundanese TTS

This dataset uses the Priangan dialect of West Java with Indonesian code-mixing and code-switching.

License: CC-BY-SA-4.0

Locale: sun

Task: TTS

Format: WEBM, TSV

Size: 298.10 MB

MDC Community Concierge

Bangor Miami Spanish-English Corpus

Spanish-English bilingual speech corpus with 35 hours of recorded audio and 240,000 words.

License: GPL-3.0

Locale: es-US, en-US

Task: ASR

Format: MP3, CHA, TSV

Size: 1.12 GB

Keblagh e Azergi

Elkhani Hazargi Literature Corpus

Hazargi literary corpus (~0.5M tokens) of poetry, folklore, and prose texts representing Hazara linguistic and cultural heritage.

License: CC-BY-NC-4.0

Locale: haz

Task: NLP

Format: TXT

Size: 2.46 MB

Aim Foundation

Dari Literature Corpus by Anjuman e Adabi Nayestan

A ~1 M-token Dari (Afghan Persian) literary corpus compiled by Anjuman e Adabi Nayestan, covering prose, poetry, and cultural texts in Perso-Arabic script.

License: CC-BY-NC-4.0

Locale: prs

Task: NLP

Format: TXT

Size: 12.67 MB

Collaborative Action For Research & Development (CARD)

IBT Torwali Wordlist

The IBT Torwali Wordlist contains approximately 20,000 unique entries in Torwali (ISO 639-3: trw), an under-documented Indo-Aryan language spoken in northern Pakistan. The dataset comprises standardized lexical entries covering core vocabulary, function words, and culturally salient terms, with consistent orthography and normalization suitable for linguistic and computational use. Entries are aligned with English and Urdu glosses and include part-of-speech tags.

License: CC-BY-SA-4.0

Locale: trw

Task: NLP

Format: CSV

Size: 312.87 KB

MDC Community Concierge

Bangor Siarad Welsh-English Corpus

Welsh-English bilingual speech corpus with 40 hours of recorded audio and transcriptions totaling 450,000 words.

License: GPL-3.0

Locale: cym

Task: ASR

Format: MP3, CHA, TSV

Size: 2.13 GB

MDC Community Concierge

Bangor Patagonia Welsh-Spanish Corpus

Welsh-Spanish corpus containing around 195,000 words.

License: GPL-3.0

Locale: cym, spa

Task: ASR

Format: MP3, CHA, TSV

Size: 988.02 MB

Kaleem Art Press

Saraiki-English Parallel Corpus

English–Saraiki parallel corpus: 51,447 aligned sentence pairs (~0.89M words), translated by Kaleem Art Press for MT and Saraiki NLP research.

License: CC-BY-NC-4.0

Locale: mul

Task: MT

Format: CSV

Size: 1.92 MB

Kaleem Art Press

Jhoke Publisher Multan’s Saraiki Newspaper Corpus

Jhoke Publisher Multan’s Saraiki Newspaper Corpus (~1.25M tokens) is a normalized UTF-8 collection of Saraiki newspaper text.

License: CC-BY-NC-4.0

Locale: skr

Task: NLP

Format: TXT

Size: 2.30 MB