Datasets

TTS Central Javanese

This dataset consists of audio recordings and textual data in Central Javanese (Semarang dialect), including Indonesian and English code-switching.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 440.11 MB

Community

TTS Javanese-Lumajang Dialect

This dataset comprises audio recordings of scripted speech in the Lumajang dialect of Javanese, spoken in East Java, Indonesia.

License: CC-BY-SA-4.0

Locale: jav

Task: TTS

Format: WEBM, TSV

Size: 684.32 MB

Institute of African Digital Humanities

Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2

This dataset is an updated version of the 'Adamawa Fulfulde–French Parallel Corpus of Narratives 1.1'.

License: NOODL-1.0

Locale: fub

Task: MT

Format: TSV

Size: 112.17 KB

Institute of African Digital Humanities

Ewondo-TTS-Dataset

The dataset consists of four hours of high-quality audio clips, each paired with text and read by a single speaker.

License: NOODL-1.0

Locale: ewo

Task: TTS

Format: MP3, TSV

Size: 152.70 MB

Institute of African Digital Humanities

Bamun-French Parallel Corpus 1.1

This dataset is an updated version of the "Bamun-French Parallel Corpus", a parallel corpus of texts in Bamun (Shupament) and French.

License: NOODL-1.0

Locale: bax

Task: MT

Format: TSV

Size: 99.78 KB

Community

TTS Muna Dataset

This dataset comprises a compilation of cultural narratives and children’s stories from Southeast Sulawesi, Indonesia, presented in the Muna language.

License: CC-BY-NC-SA-4.0

Locale: mnb

Task: TTS

Format: WEBM, TSV

Size: 316.34 MB

The University of Melbourne

Hawrami Kurdish TTS dataset 1.0

This dataset contains high-quality single-speaker audio recordings in Hawrami Kurdish (Hewrami, ISO 639-3: hac), also known as the Gorani language, intended for building Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. The dataset comprises 5 hours and 15 minutes of aligned audio and text data. Hawrami is classified as Definitely Endangered by UNESCO.

License: CC-BY-4.0

Locale: hac

Task: TTS

Format: WAV

Size: 706.11 MB

Common Voice

Common Voice 7.0 - Single Word Target Segment

This dataset contains the numbers 0 to 9 and the words "yes" and "no" in 34 languages. It contains 84 validated hours of speech.

License: CC0-1.0

Locale: mul

Task: ASR

Format: TSV, MP3

Size: 3.51 GB

EELLAK - GreekFOSS

Greek PhD Theses Corpus v1.0

The Greek PhD Theses Corpus is a large, AI-ready dataset

License: CC-BY-NC-SA-4.0

Locale: gr-GR

Task: NLP

Format: JSONL

Size: 7.02 GB

EELLAK - GreekFOSS

openbook.gr v1.0

Greek digital books corpus for NLP and linguistic analysis

License: CC-BY-NC-SA-4.0

Locale: gr-GR

Task: NLP

Format: Markdown (.md)

Size: 251.63 MB

TidyVoice2026 Challenge

TidyVoiceX2_ASV

This dataset is designed for speaker verification using the Mozilla Common Voice corpus, covering approximately 40 additional languages beyond those included in TidyVoiceX_ASV. It comprises recordings from different speakers, each of whom appears in multiple languages. Leveraging this multilingual overlap, we construct trial pairs to investigate cross-lingual variation in the speaker verification task. This dataset served as the evaluation set for the TidyVoice 2026 Challenge.

License: CC0-1.0

Locale: mul

Task: OTH

Format: WAV

Size: 23.07 GB

Mozilla Data Collective

Sermon-Malaysian-English

7 minutes of Malaysian-accented English speech

License: CC-BY-NC-4.0

Locale: en-MY

Task: ASR

Format: MP4, TXT, SRT

Size: 6.63 MB

Christine

Reading Recommendations List

A reading recommendations list of mainly fiction (fantasy, literary, mystery) books read 2022-2025.

License: CC0-1.0

Locale: en-US

Task: OTH

Format: CSV

Size: 16.24 KB

OpenCSG

chinese-cosmopedia

A large-scale high-quality Chinese text dataset developed by OpenCSG, containing ~15 million entries (≈60B tokens) covering multi-domain content (encyclopedia, education, etc.). Cleaned and deduplicated to remove low-quality content, it is optimized for large language model pretraining, text generation, and other Chinese NLP downstream tasks, compatible with mainstream toolchains (Hugging Face Datasets, PyTorch).

License: Apache-2.0

Locale: zh

Task: LLM

Format: PARQUET

Size: 6.09 GB

OpenCSG

smoltalk-chinese

SmolTalk-Chinese: A multi-task Chinese conversational dataset covering 19 typical dialogue task scenarios.

License: Apache-2.0

Locale: zh

Task: LLM

Format: PARQUET

Size: 879.81 MB

Common Voice

Common Voice v24 English - en-AU subset for Everything Open 2026

Common Voice v24 English filtered on the `accent` field for Australian-related accents.

License: CC0-1.0

Locale: en-AU

Task: ASR

Format: CSV, MP3

Size: 1.92 GB

Institute of African Digital Humanities

Ewondo_Fong_ALCAM-MultimodalDataset

A multimodal linguistic resource comprising a curated datasheet of example sentences in Ewondo (Fong variety) and their French equivalents, along with their corresponding audio recordings and a sentence–audio alignment file. It is designed to support research, documentation and pedagogy in the field of speech and language technology for under-resourced African languages.

License: NOODL-1.0

Locale: ewo

Task: NLP

Format: MP3, TSV

Size: 16.80 MB

Amnesia

Informes de Actividades InfoCDMX (Ponencia Laura Enríquez)

The dataset consists of InfoCDMX's annual activity and results reports, which document the performance of the oversight body, its plenary, and the offices (ponencias) of the commissioners. These documents are fundamental accountability instruments that detail institutional management, quasi-jurisdictional activity (resolution of appeals for review and complaints), and the

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: PDF, XLSX

Size: 275.85 MB

Amnesia

Ficha de Documentación de Datos: Resoluciones InfoNL (Ponencia F. Guajardo)

This dataset documents the rulings issued by the office (ponencia) of Commissioner Francisco Guajardo Martínez within the transparency oversight body of Nuevo León. It covers a significant period of activity (preliminarily identified as 2018 to 2025), reflecting disputes between citizens (information requesters) and obligated entities (government).

License: CC-BY-4.0

Locale: es-MX

Task: NLP

Format: PDF, XLSX

Size: 1.07 GB

ComparIA

Compar:IA conversations

French-language conversational AI dataset of conversations and preferences, with 396K conversations from 50+ LLMs

License: Etalab 2.0

Locale: fr

Task: NLG

Format: PARQUET

Size: 1.81 GB

MEDIAMEN

English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)

This parallel sentence corpus contains 30,405 aligned sentence pairs totaling approximately 0.62 million tokens, curated from the archival materials of Mediamen (Advertising Agency). The sentences were professionally translated from English into Punjabi (Shahmukhi) and are intended to support machine translation, linguistic research, and Punjabi language technology development, particularly for real-world and contemporary language use.

License: CC-BY-NC-4.0

Locale: en-PK, pnb

Task: MT

Format: CSV

Size: 1.08 MB

Fundación Vía Libre

HESEIA Sentence Bias Dataset

This repository contains a dataset collected during the teacher training course HESEIA Sentence Bias (Tools for Exploring Biases and Artificial Intelligence), organized by Vía Libre, the Ministry of Education, and FAMAF-UNC. The course had an initial enrollment of 370 participating teachers, who also involved over 5,000 students in building a dataset that reflects stereotypes present in Argentina.

License: CC-BY-SA-4.0

Locale: es-AR

Task: OTH

Format: CSV

Size: 235.43 KB

RFERL

RFE/RL Tatar-Bashkir News Text Corpus

This dataset is a longitudinal news corpus for the Tatar and Bashkir languages, sourced from Azatliq Radiosi and covering 2001 to 2025. It contains over 105,000 articles (30M tokens).

License: CC-BY-NC-SA-4.0

Locale: tt, ba, ru

Task: NLP

Format: TXT

Size: 102.44 MB

Digital Divide Data

ddd-kenya-luhya-70hrs-asr

A collection of high-quality, crowdsourced audio recordings and verified transcriptions for the Luhya language, produced by Digital Divide Data.

License: CC-BY-4.0

Locale: luy

Task: ASR

Format: WAV, XLSX, TSV

Size: 13.90 GB