Datasets

Pro Svizra Rumantscha

Vallader Newspaper Corpus

6.2 million tokens in the Vallader variety of Romansh from the daily newspaper “La Quotidiana”.

Task: OTH

Format: TSV

License: CC0-1.0

Size: 18.71 MB

Created: 1/7/2026

Locale: rm-vallader

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel-sentence corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from Kaleem A...

Task: MT

Format: CSV

License: CC-BY-SA-4.0

Size: 2.27 MB

Created: 1/5/2026

Locale: mul

Sindh Line Publishers

Sindh Line Publishers

The corpus contains 1.029 million tokens from Sindh Line, a Sindhi newspaper, published in 2024-2025. The text consists of the complete newspape...

Task: NLP

Format: TXT

License: CC-BY-SA-4.0

Size: 2.22 MB

Created: 1/5/2026

Locale: snd

Institute of African Digital Humanities

Spoken-Congolese-French-Dataset

The dataset consists of paired audio and text resources on spoken French from the Republic of the Congo. The audio files were extracted from longer recording...

Task: NLP

Format: MP3, WAV, TSV

License: NOODL-1.0

Size: 3.44 GB

Created: 1/5/2026

Locale: fr-CG

Institute of African Digital Humanities

Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the speech area known as Mbida Mbani. Each entry is accompanied by illustrati...

Task: NLP

Format: MP3, TSV

License: NOODL-1.0

Size: 19.25 MB

Created: 1/5/2026

Locale: ewo

Balochi Academy

Balochi Academy Text Corpus

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. I...

Task: NLP

Format: TXT

License: CC-BY-NC-SA-4.0

Size: 1.88 MB

Created: 1/5/2026

Locale: bgn

Institute of African Digital Humanities

Mada Narratives

This dataset contains 17 transcribed oral narratives in Mada (mxu), a language belonging to the Afro-Asiatic family that is spoken in Cameroon. The texts, de...

Task: NLP

Format: TXT

License: NOODL-1.0

Size: 65.04 KB

Created: 1/5/2026

Locale: mxu

Institute of African Digital Humanities

Bamun-French Parallel Corpus

This dataset is a parallel corpus of Bamun (Shupamem) to French texts. Texts were obtained by transcribing raw audio files. Translations were added to enri...

Task: MT

Format: TSV

License: NOODL-1.0

Size: 99.24 KB

Created: 12/24/2025

Locale: bax

Pro Svizra Rumantscha

Surmiran Newspaper Corpus

2.9 million tokens in the Surmiran variety of Romansh from the daily newspaper “La Quotidiana”.

Task: OTH

Format: TSV

License: CC0-1.0

Size: 11.89 MB

Created: 12/22/2025

Locale: rm-surmiran

Maseno Centre for Applied Artificial Intelligence (MCAAI)

DhoNam: Dholuo Speech dataset

DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of ...

Task: ASR

Format: WEBM

License: NOODL-1.0

Size: 2.49 GB

Created: 12/20/2025

Locale: Luo

Amnesia

Archivo de la Comisionada María de los Ángeles Guzmán García (COTAI Nuevo León / InfoNL)

This archive preserves the institutional and academic memory of the tenure of Dr. María de los Ángeles Guzmán García as Commissioner of the Comisión de Tra...

Task: NLP

Format: ZIP, PDF, CSV, XLSX

License: CC-BY-4.0

Size: 866.15 MB

Created: 12/19/2025

Locale: es-MX

Common Voice

Common Voice Spontaneous Speech 2.0 - Kenyah

A collection of spontaneous spoken phrases in Kenyah.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 212.06 MB

Created: 12/5/2025

Locale: xkl

Common Voice

Common Voice Spontaneous Speech 2.0 - Ushojo

A collection of spontaneous spoken phrases in Ushojo.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 102.83 MB

Created: 12/5/2025

Locale: ush

Common Voice

Common Voice Spontaneous Speech 2.0 - Kuku

A collection of spontaneous spoken phrases in Kuku.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 233.85 MB

Created: 12/5/2025

Locale: ukv

Common Voice

Common Voice Spontaneous Speech 2.0 - Rutoro

A collection of spontaneous spoken phrases in Rutoro.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 272.63 MB

Created: 12/5/2025

Locale: ttj

Common Voice

Common Voice Spontaneous Speech 2.0 - Turkish

A collection of spontaneous spoken phrases in Turkish.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 4.20 MB

Created: 12/5/2025

Locale: tr

Common Voice

Common Voice Spontaneous Speech 2.0 - Papantla Totonac

A collection of spontaneous spoken phrases in Papantla Totonac.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 205.51 MB

Created: 12/5/2025

Locale: top

Common Voice

Common Voice Spontaneous Speech 2.0 - Toba Qom

A collection of spontaneous spoken phrases in Toba Qom.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 172.41 MB

Created: 12/5/2025

Locale: tob

Common Voice

Common Voice Spontaneous Speech 2.0 - Thai

A collection of spontaneous spoken phrases in Thai.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 87.66 KB

Created: 12/5/2025

Locale: th

Common Voice

Common Voice Spontaneous Speech 2.0 - snv

A collection of spontaneous spoken phrases in snv.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 212.72 MB

Created: 12/5/2025

Locale: snv

Common Voice

Common Voice Spontaneous Speech 2.0 - Shona

A collection of spontaneous spoken phrases in Shona.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 1.53 MB

Created: 12/5/2025

Locale: sn

Common Voice

Common Voice Spontaneous Speech 2.0 - Tashlhiyt

A collection of spontaneous spoken phrases in Tashlhiyt.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 6.50 MB

Created: 12/5/2025

Locale: shi

Common Voice

Common Voice Spontaneous Speech 2.0 - Sena

A collection of spontaneous spoken phrases in Sena.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 24.57 MB

Created: 12/5/2025

Locale: seh

Common Voice

Common Voice Spontaneous Speech 2.0 - Serian Bidayuh

A collection of spontaneous spoken phrases in Serian Bidayuh.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 199.91 MB

Created: 12/5/2025

Locale: sdo