Mozilla Data Collective is rebuilding the AI data ecosystem with communities at the centre. Access over 300 high-quality global datasets, built by and for the community in a transparent and ethical way.

Datasets

Kaleem Art Press

Multilingual Religious Parallel Corpus (Kaleem Art Press)

This dataset is a multilingual parallel corpus containing 6,465 aligned sentence units with approximately 0.98 million words, curated from Kaleem A...

Task: MT

Format: CSV

License: CC-BY-SA-4.0

Size: 2.27 MB

Created: 1/5/2026

Locale: mul

Sindh Line Publishers

Sindh Line Publishers

The corpus contains 1.029 million tokens from Sindh Line, a Sindhi newspaper, covering issues published in 2024-2025. The text consists of the complete newspape...

Task: NLP

Format: TXT

License: CC-BY-SA-4.0

Size: 2.22 MB

Created: 1/5/2026

Locale: snd

Institute of African Digital Humanities

Spoken-Congolese-French-Dataset

The dataset consists of paired audio and text resources on spoken French from the Republic of the Congo. The audio files were extracted from longer recording...

Task: NLP

Format: MP3, WAV, TSV

License: NOODL-1.0

Size: 3.44 GB

Created: 1/5/2026

Locale: fr-CG

Institute of African Digital Humanities

Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset

This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the speech area known as Mbida Mbani. Each entry is accompanied by illustrati...

Task: NLP

Format: MP3, TSV

License: NOODL-1.0

Size: 19.25 MB

Created: 1/5/2026

Locale: ewo

Balochi Academy

Balochi Academy Text Corpus

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. I...

Task: NLP

Format: TXT

License: CC-BY-NC-SA-4.0

Size: 1.88 MB

Created: 1/5/2026

Locale: bgn

Institute of African Digital Humanities

Mada Narratives

This dataset contains 17 transcribed oral narratives in Mada (mxu), a language belonging to the Afro-Asiatic family that is spoken in Cameroon. The texts, de...

Task: NLP

Format: TXT

License: NOODL-1.0

Size: 65.04 KB

Created: 1/5/2026

Locale: mxu

Institute of African Digital Humanities

Bamun-French Parallel Corpus

This dataset is a parallel corpus of Bamun (Shupamem) to French texts. The texts were obtained by transcribing raw audio files. Translations were added to enri...

Task: MT

Format: TSV

License: NOODL-1.0

Size: 99.24 KB

Created: 12/24/2025

Locale: bax

Pro Svizra Rumantscha

Surmiran Newspaper Corpus

2.9 million tokens in the Surmiran variety of Romansh from the daily newspaper “La Quotidiana”.

Task: OTH

Format: TSV

License: CC0-1.0

Size: 11.89 MB

Created: 12/22/2025

Locale: rm-surmiran

Maseno Centre for Applied Artificial Intelligence (MCAAI)

DhoNam: Dholuo Speech dataset

The DhoNam: Dholuo Speech dataset is a speech corpus designed to supercharge Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of ...

Task: ASR

Format: WEBM

License: NOODL-1.0

Size: 2.49 GB

Created: 12/20/2025

Locale: Luo

Amnesia

Archivo de la Comisionada María de los Ángeles Guzmán García (COTAI Nuevo León / InfoNL)

This archive preserves the institutional and academic record of Dr. María de los Ángeles Guzmán García's tenure as Commissioner of the Comisión de Tra...

Task: NLP

Format: ZIP, PDF, CSV, XLSX

License: CC-BY-4.0

Size: 866.15 MB

Created: 12/19/2025

Locale: es-MX

Common Voice

Common Voice Spontaneous Speech 2.0 - Kenyah

A collection of spontaneous spoken phrases in Kenyah.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 212.06 MB

Created: 12/5/2025

Locale: xkl

Common Voice

Common Voice Spontaneous Speech 2.0 - Ushojo

A collection of spontaneous spoken phrases in Ushojo.

Task: ASR

Format: MP3

License: CC0-1.0

Size: 102.83 MB

Created: 12/5/2025

Locale: ush

JOIN THE MOVEMENT

Join Mozilla Data Collective

Mozilla Data Collective wants to radically reimagine our data as power. We are anti-extractivism, anti-monopoly and deeply, profoundly pro-people. We are a collective of linguists, technologists, activists, researchers and creatives who want AI to be all it promises to be - not all it threatens to be. Here, you can share your datasets on your own terms.

FAQs

Find answers quickly

What is Mozilla Data Collective?

Mozilla Data Collective is a platform in the truest sense. It’s yours to stand on, and make of it what you will. We have roots in two Mozilla projects: Common Voice, a CC0 public dataset to help tech speak your language, and the Data Futures Lab, an experimental space for instigating new approaches to data stewardship challenges. Mozilla Data Collective works by allowing you to share your data, retain ownership of it, and control who uses it.


How does it work?

We partner with organizations and individuals to make their data available through Mozilla Data Collective. You can share openly, using existing licenses like Creative Commons, or you can build your own. You can open up your data for everyone, or just for some types of downloaders. You can set custom constraints, or ask for exchange, compensation or recognition. You can govern it as an individual, a co-operative, a trust or something else. After all, it’s your data. The people who access your datasets are authenticated and bound by legally binding contracts, and we have a number of dataset protection features. If you are interested in hosting data on Mozilla Data Collective, please reach out to us at mozilladatacollective@mozillafoundation.org.


Who is behind Mozilla Data Collective?

We are backed and stewarded by Mozilla Foundation - the non-profit, movement-building, and philanthropy arm of Mozilla.