Datasets
Pro Svizra Rumantscha
Vallader Newspaper Corpus
6.2 million tokens in the Vallader variety of Romansh from the daily newspaper “La Quotidiana”.
Task: OTH
Format: TSV
License: CC0-1.0
Size: 18.71 MB
Created: 1/7/2026
Locale: rm-vallader
Kaleem Art Press
Multilingual Religious Parallel Corpus (Kaleem Art Press)
This dataset is a multilingual parallel sentence corpus containing 6,465 aligned sentence units (approximately 0.98 million words), curated from Kaleem A...
Task: MT
Format: CSV
License: CC-BY-SA-4.0
Size: 2.27 MB
Created: 1/5/2026
Locale: mul
Sindh Line Publishers
Sindh Line Publishers
The corpus contains 1.029 million tokens from Sindh Line, a Sindhi newspaper, published in 2024-2025. The text consists of the complete newspape...
Task: NLP
Format: TXT
License: CC-BY-SA-4.0
Size: 2.22 MB
Created: 1/5/2026
Locale: snd
Institute of African Digital Humanities
Spoken-Congolese-French-Dataset
The dataset consists of paired audio and text resources on spoken French from the Republic of the Congo. The audio files were extracted from longer recording...
Task: NLP
Format: MP3, WAV, TSV
License: NOODL-1.0
Size: 3.44 GB
Created: 1/5/2026
Locale: fr-CG
Institute of African Digital Humanities
Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset
This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the speech area known as Mbida Mbani. Each entry is accompanied by illustrati...
Task: NLP
Format: MP3, TSV
License: NOODL-1.0
Size: 19.25 MB
Created: 1/5/2026
Locale: ewo
Balochi Academy
Balochi Academy Text Corpus
This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. I...
Task: NLP
Format: TXT
License: CC-BY-NC-SA-4.0
Size: 1.88 MB
Created: 1/5/2026
Locale: bgn
Institute of African Digital Humanities
Mada Narratives
This dataset contains 17 transcribed oral narratives in Mada (mxu), a language belonging to the Afro-Asiatic family that is spoken in Cameroon. The texts, de...
Task: NLP
Format: TXT
License: NOODL-1.0
Size: 65.04 KB
Created: 1/5/2026
Locale: mxu
Institute of African Digital Humanities
Bamun-French Parallel Corpus
This dataset is a parallel corpus of Bamun (Shupamem) to French texts. Texts were obtained by transcribing raw audio files. Translations were added to enri...
Task: MT
Format: TSV
License: NOODL-1.0
Size: 99.24 KB
Created: 12/24/2025
Locale: bax
Pro Svizra Rumantscha
Surmiran Newspaper Corpus
2.9 million tokens in the Surmiran variety of Romansh from the daily newspaper “La Quotidiana”.
Task: OTH
Format: TSV
License: CC0-1.0
Size: 11.89 MB
Created: 12/22/2025
Locale: rm-surmiran
Maseno Centre for Applied Artificial Intelligence (MCAAI)
DhoNam: Dholuo Speech dataset
DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of ...
Task: ASR
Format: WEBM
License: NOODL-1.0
Size: 2.49 GB
Created: 12/20/2025
Locale: Luo
Amnesia
Archivo de la Comisionada María de los Ángeles Guzmán García (COTAI Nuevo León / InfoNL)
This archive preserves the institutional and academic record of Dr. María de los Ángeles Guzmán García's tenure as Commissioner of the Comisión de Tra...
Task: NLP
Format: ZIP, PDF, CSV, XLSX
License: CC-BY-4.0
Size: 866.15 MB
Created: 12/19/2025
Locale: es-MX
Common Voice
Common Voice Spontaneous Speech 2.0 - Kenyah
A collection of spontaneous spoken phrases in Kenyah.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 212.06 MB
Created: 12/5/2025
Locale: xkl
Common Voice
Common Voice Spontaneous Speech 2.0 - Ushojo
A collection of spontaneous spoken phrases in Ushojo.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 102.83 MB
Created: 12/5/2025
Locale: ush
Common Voice
Common Voice Spontaneous Speech 2.0 - Kuku
A collection of spontaneous spoken phrases in Kuku.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 233.85 MB
Created: 12/5/2025
Locale: ukv
Common Voice
Common Voice Spontaneous Speech 2.0 - Rutoro
A collection of spontaneous spoken phrases in Rutoro.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 272.63 MB
Created: 12/5/2025
Locale: ttj
Common Voice
Common Voice Spontaneous Speech 2.0 - Turkish
A collection of spontaneous spoken phrases in Turkish.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 4.20 MB
Created: 12/5/2025
Locale: tr
Common Voice
Common Voice Spontaneous Speech 2.0 - Papantla Totonac
A collection of spontaneous spoken phrases in Papantla Totonac.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 205.51 MB
Created: 12/5/2025
Locale: top
Common Voice
Common Voice Spontaneous Speech 2.0 - Toba Qom
A collection of spontaneous spoken phrases in Toba Qom.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 172.41 MB
Created: 12/5/2025
Locale: tob
Common Voice
Common Voice Spontaneous Speech 2.0 - Thai
A collection of spontaneous spoken phrases in Thai.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 87.66 KB
Created: 12/5/2025
Locale: th
Common Voice
Common Voice Spontaneous Speech 2.0 - snv
A collection of spontaneous spoken phrases in snv.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 212.72 MB
Created: 12/5/2025
Locale: snv
Common Voice
Common Voice Spontaneous Speech 2.0 - Shona
A collection of spontaneous spoken phrases in Shona.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 1.53 MB
Created: 12/5/2025
Locale: sn
Common Voice
Common Voice Spontaneous Speech 2.0 - Tashlhiyt
A collection of spontaneous spoken phrases in Tashlhiyt.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 6.50 MB
Created: 12/5/2025
Locale: shi
Common Voice
Common Voice Spontaneous Speech 2.0 - Sena
A collection of spontaneous spoken phrases in Sena.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 24.57 MB
Created: 12/5/2025
Locale: seh
Common Voice
Common Voice Spontaneous Speech 2.0 - Serian Bidayuh
A collection of spontaneous spoken phrases in Serian Bidayuh.
Task: ASR
Format: MP3
License: CC0-1.0
Size: 199.91 MB
Created: 12/5/2025
Locale: sdo
