Datasets

Community

TTS Central Javanese

This dataset consists of audio recordings and textual data in Central Javanese (Semarang dialect) including Indonesian and English code-switching.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 440.11 MB

Community

TTS Javanese-Lumajang Dialect

This dataset comprises audio recordings of scripted speech in the Lumajang dialect of Javanese, spoken in East Java, Indonesia.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 684.32 MB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.

License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.17 KB

Institute of African Digital Humanities

Ewondo-TTS-Dataset

The dataset consists of four hours of high-quality audio clips, each paired with text and read by a single speaker.

License: NOODL-1.0

Locale: ewo

Task: TTS

Format: MP3, TSV

Size: 152.70 MB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 1.1

This dataset is an updated version of the "Bamun-French Parallel Corpus", a parallel corpus of texts in Bamun (Shupament) and French.

License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 99.78 KB

Community

TTS Muna Dataset

This dataset comprises a compilation of cultural narratives and children’s stories from Southeast Sulawesi, Indonesia, presented in the Muna language.

License: CC-BY-NC-SA-4.0

Locale: mnb

Task: TTS

Format: WEBM, TSV

Size: 316.34 MB

The University of Melbourne

Hawrami Kurdish TTS dataset 1.0

This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish (Hewrami, ISO 639-3: hac), also known as the Gorani language, intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data. Hawrami is classified as Definitely Endangered by UNESCO.

License: CC-BY-4.0

Locale: hac

Task: TTS

Format: WAV

Size: 706.11 MB

Common Voice

Common Voice 7.0 - Single Word Target Segment

This dataset contains the numbers 0 to 9 and the words "yes" and "no" in 34 languages. It contains 84 validated hours of speech.

License: CC0-1.0

Locale: mul

Task: ASR

Format: TSV, MP3

Size: 3.51 GB

EELLAK - GreekFOSS

Greek PhD Theses Corpus v1.0

The Greek PhD Theses Corpus is a large, AI-ready dataset

License: CC-BY-NC-SA-4.0

Locale: gr-GR

Task: NLP

Format: JSONL

Size: 7.02 GB

EELLAK - GreekFOSS

openbook.gr v1.0

Greek digital books corpus for NLP and linguistic analysis

License: CC-BY-NC-SA-4.0

Locale: gr-GR

Task: NLP

Format: Markdown (.md)

Size: 251.63 MB

TidyVoice2026 Challenge

TidyVoiceX2_ASV

This dataset is designed for speaker verification using the Mozilla Common Voice corpus, covering approximately 40 additional languages beyond those included in TidyVoiceX_ASV. It comprises recordings from different speakers, each of whom appears in multiple languages. Leveraging this multilingual overlap, we construct trial pairs to investigate cross-lingual variation in the speaker verification task. This dataset served as the evaluation set for the TidyVoice 2026 Challenge.

License: CC0-1.0

Locale: mul

Task: OTH

Format: WAV

Size: 23.07 GB

Mozilla Data Collective

Sermon-Malaysian-English

7 minutes of Malaysian-accented English speech

License: CC-BY-NC-4.0

Locale: en-MY

Task: ASR

Format: MP4, TXT, SRT

Size: 6.63 MB

Christine

Reading Recommendations List

A reading recommendations list of mainly fiction (fantasy, literary, mystery) books read 2022-2025.

License: CC0-1.0

Locale: en-US

Task: OTH

Format: CSV

Size: 16.24 KB

OpenCSG

chinese-cosmopedia

A large-scale high-quality Chinese text dataset developed by OpenCSG, containing ~15 million entries (≈60B tokens) covering multi-domain content (encyclopedia, education, etc.). Cleaned and deduplicated to remove low-quality content, it is optimized for large language model pretraining, text generation, and other Chinese NLP downstream tasks, compatible with mainstream toolchains (Hugging Face Datasets, PyTorch).

License: Apache-2.0

Locale: zh

Task: LLM

Format: parquet

Size: 6.09 GB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.

License: Apache-2.0

Locale: zh

Task: LLM

Format: parquet

Size: 879.81 MB

Common Voice

Common Voice v24 English - en-AU subset for Everything Open 2026

Common Voice v24 English filtered on the `accent` field for Australian-related accents.

License: CC0-1.0

Locale: en-AU

Task: ASR

Format: CSV, MP3

Size: 1.92 GB

Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 16.80 MB

Amnesia

Informes de Actividades InfoCDMX (Ponencia Laura Enríquez)

The dataset consists of InfoCDMX's annual activity and results reports, which document the performance of the oversight body, its plenary, and the offices (ponencias) of the commissioners. These documents are key accountability instruments that detail institutional management, quasi-jurisdictional activity (resolution of appeals for review and complaints), and the

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: PDF, XLSX

Size: 275.85 MB

Amnesia

Ficha de Documentación de Datos: Resoluciones InfoNL (Ponencia F. Guajardo)

This dataset documents the rulings issued by the office (ponencia) of Councillor Francisco Guajardo Martínez within the transparency oversight body of Nuevo León. It covers a significant period of activity (preliminarily identified as 2018 to 2025) and reflects disputes between citizens (information requesters) and obligated entities (government).

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: PDF, XLSX

Size: 1.07 GB

ComparIA

Compar:IA conversations

French conversational AI dataset of conversations and preferences, with 396K conversations from 50+ LLMs

License: Etalab 2.0

Locale: fr

Task: NLG

Format: PARQUET

Size: 1.81 GB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs, totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Fundación Vía Libre

HESEIA Sentence Bias Dataset

This repository contains a dataset collected during the teacher training course HESEIA Sentence Bias (Tools for Exploring Biases and Artificial Intelligence), organized by Vía Libre, the Ministry of Education, and FAMAF-UNC. The course had an initial enrollment of 370 participating teachers, who also involved over 5,000 students in building a dataset that reflects stereotypes present in Argentina.

License: CC-BY-SA-4.0

Locale: es-AR

Task: OTH

Format: CSV

Size: 235.43 KB

RFERL

RFE/RL Tatar-Bashkir News Text Corpus

This dataset is a longitudinal news corpus for the Tatar and Bashkir languages, sourced from Azatliq Radiosi and covering 2001 to 2025. It contains over 105,000 articles (30M tokens).

License: CC-BY-NC-SA-4.0

Locale: tt, ba, ru

Task: NLP

Format: TXT

Size: 102.44 MB

Digital Divide Data

ddd-kenya-luhya-70hrs-asr

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Luhya language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: luy

Task: ASR

Format: WAV, XLSX, TSV

Size: 13.90 GB