Datasets
TTS Central Javanese
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: WEBM, TSV
Size: 440.11 MB
TTS Javanese-Lumajang Dialect
License: CC-BY-SA-4.0
Locale: jav
Task: TTS
Format: WEBM, TSV
Size: 684.32 MB
Adamawa Fulfulde-French Parallel Corpus of Narratives 1.2
License: NOODL-1.0
Locale: fub
Task: MT
Format: TSV
Size: 112.17 KB
Ewondo-TTS-Dataset
License: NOODL-1.0
Locale: ewo
Task: TTS
Format: MP3, TSV
Size: 152.70 MB
Bamun-French Parallel Corpus 1.1
License: NOODL-1.0
Locale: bax
Task: MT
Format: TSV
Size: 99.78 KB
TTS Muna Dataset
License: CC-BY-NC-SA-4.0
Locale: mnb
Task: TTS
Format: WEBM, TSV
Size: 316.34 MB
Hawrami Kurdish TTS dataset 1.0
License: CC-BY-4.0
Locale: hac
Task: TTS
Format: WAV
Size: 706.11 MB
Common Voice 7.0 - Single Word Target Segment
License: CC0-1.0
Locale: mul
Task: ASR
Format: TSV, MP3
Size: 3.51 GB
Greek PhD Theses Corpus v1.0
License: CC-BY-NC-SA-4.0
Locale: el-GR
Task: NLP
Format: JSONL
Size: 7.02 GB
openbook.gr v1.0
License: CC-BY-NC-SA-4.0
Locale: el-GR
Task: NLP
Format: Markdown (.md)
Size: 251.63 MB
TidyVoiceX2_ASV
License: CC0-1.0
Locale: mul
Task: OTH
Format: WAV
Size: 23.07 GB
Sermon-Malaysian-English
License: CC-BY-NC-4.0
Locale: en-MY
Task: ASR
Format: MP4, TXT, SRT
Size: 6.63 MB
Reading Recommendations List
License: CC0-1.0
Locale: en-US
Task: OTH
Format: CSV
Size: 16.24 KB
chinese-cosmopedia
License: Apache-2.0
Locale: zh
Task: LLM
Format: PARQUET
Size: 6.09 GB
smoltalk-chinese
License: Apache-2.0
Locale: zh
Task: LLM
Format: PARQUET
Size: 879.81 MB
Common Voice v24 English - en-AU subset for Everything Open 2026
License: CC0-1.0
Locale: en-AU
Task: ASR
Format: CSV, MP3
Size: 1.92 GB
Ewondo_Fong_ALCAM-MultimodalDataset
License: NOODL-1.0
Locale: ewo
Task: NLP
Format: MP3, TSV
Size: 16.80 MB
Informes de Actividades InfoCDMX (Ponencia Laura Enríquez)
License: CC-BY-4.0
Locale: es-MX
Task: NLP
Format: PDF, XLSX
Size: 275.85 MB
Ficha de Documentación de Datos: Resoluciones InfoNL (Ponencia F. Guajardo)
License: CC-BY-4.0
Locale: es-MX
Task: NLP
Format: PDF, XLSX
Size: 1.07 GB
Compar:IA conversations
License: Etalab 2.0
Locale: fr
Task: NLG
Format: PARQUET
Size: 1.81 GB
English–Punjabi (Shahmukhi) Parallel Sentences Corpus (Mediamen Archives)
License: CC-BY-NC-4.0
Locale: en-PK, pnb
Task: MT
Format: CSV
Size: 1.08 MB
HESEIA Sentence Bias Dataset
License: CC-BY-SA-4.0
Locale: es-AR
Task: OTH
Format: CSV
Size: 235.43 KB
RFE/RL Tatar-Bashkir News Text Corpus
License: CC-BY-NC-SA-4.0
Locale: tt, ba, ru
Task: NLP
Format: TXT
Size: 102.44 MB
ddd-kenya-luhya-70hrs-asr
License: CC-BY-4.0
Locale: luy
Task: ASR
Format: WAV, XLSX, TSV
Size: 13.90 GB