TidyVoiceX2_ASV
License:
CC0-1.0
Steward:
TidyVoice2026 Challenge
Task: OTH
Release Date: 1/26/2026
Format: WAV
Size: 23.07 GB
Description
This dataset is designed for speaker verification and is derived from the Mozilla Common Voice corpus. It covers approximately 40 languages beyond those included in TidyVoiceX_ASV. Each speaker is recorded in multiple languages, and this multilingual overlap is used to construct trial pairs for studying cross-lingual variation in speaker verification. The dataset served as the evaluation set for the TidyVoice 2026 Challenge.
Specifics
Considerations
Restrictions/Special Constraints
According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity.
Forbidden Usage
According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity. The data MUST only be used for speaker verification tasks.
Processes
Intended Use
All rules and restrictions are the same as those of the original Mozilla Common Voice datasets.
Metadata
[Update: 2 April 2026] A new version of this dataset is available with the same folder structure as TidyVoiceX_ASV. The official trial pair lists were released on 2 April 2026.
TidyVoiceX2_ASV Dataset
Overview
TidyVoiceX2_ASV is a multilingual speech corpus curated for cross-lingual speaker verification research. The dataset is organized speaker-wise, with per-speaker language subfolders and WAV utterances. This structure supports controlled experiments on speaker verification under language mismatch conditions.
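As an illustration of how the speaker-wise layout supports language-mismatch experiments, the following minimal sketch parses a file path into speaker and language and labels a pair of utterances by condition. It assumes the `speaker_id/language_code/utterance.wav` layout; the names `Utterance`, `parse_utterance`, and `trial_condition` are hypothetical helpers, and this is not the procedure used to build the official trial lists.

```python
from pathlib import Path
from typing import NamedTuple


class Utterance(NamedTuple):
    speaker: str
    language: str
    path: Path


def parse_utterance(path: Path, root: Path) -> Utterance:
    """Split root/speaker_id/language_code/utterance.wav into its parts.

    Assumes exactly three path components below root; any other depth
    raises ValueError from the tuple unpacking.
    """
    speaker, language, _filename = path.relative_to(root).parts
    return Utterance(speaker, language, path)


def trial_condition(a: Utterance, b: Utterance) -> str:
    """Label a pair by speaker match and by language match."""
    speaker_part = "target" if a.speaker == b.speaker else "nontarget"
    language_part = "same-lang" if a.language == b.language else "cross-lang"
    return f"{speaker_part}/{language_part}"
```

A pair such as `spk1/en/a.wav` and `spk1/de/b.wav` would be labeled `target/cross-lang`, which is the condition this dataset is meant to stress. For reported results, the official trial pair lists should be used instead.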
This dataset is the official evaluation set used in the TidyVoice2026 Challenge, an official challenge at Interspeech 2026, focused on advancing cross-lingual speaker verification systems.
Challenge website: https://tidyvoice2026.github.io/
It also serves as the evaluation benchmark partition for the TidyLang2026 Challenge, an official challenge at Odyssey 2026.
Challenge website: https://tidylang2026.github.io/
This dataset is disjoint from the separately released TidyVoiceX_ASV dataset: the speaker set and utterance recordings in TidyVoiceX2_ASV are different from those in TidyVoiceX_ASV.
TidyVoiceX_ASV: https://datacollective.mozillafoundation.org/datasets/cmihtsewu023so207xot1iqqw
Challenge Provenance and Evaluation Benchmarks
The evaluation benchmarks and official trial pair lists for speaker and language recognition on this dataset are publicly released by the TidyVoice/TidyLang challenges.
Official Automatic Speaker Verification (ASV) trial pair list: https://drive.google.com/file/d/1OnfzE_YFMGJKU1LOGsD1NKNaMJOvOvmM/view?usp=share_link
Official LID and low-resource language recognition trial lists: https://drive.google.com/file/d/1XjMURDIe026IapEorg1sVQ5eZ3Jjz9Dv/view?usp=share_link
Dataset Statistics
| Metric | Value |
|---|---|
| Speakers | 2,183 |
| Languages | 62 |
| Utterances (WAV files) | 205,773 |
| Total duration | 287.67 h |
| Audio Format | WAV |
| Dataset Organization | speaker_id/language_code/utterance.wav |
Total duration was computed by summing per-file duration from WAV headers (16 kHz PCM) using scripts/compute_wav_duration.py on 2026-04-02.
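The contents of scripts/compute_wav_duration.py are not reproduced here. As a rough sketch of the approach described above, header-based duration can be computed with Python's standard `wave` module by dividing the frame count by the sample rate. The function names below are hypothetical, and the code assumes uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Read duration from the WAV header without decoding samples."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_hours(root: Path) -> float:
    """Sum header durations over every .wav file under root, in hours."""
    seconds = sum(wav_duration_seconds(p) for p in root.rglob("*.wav"))
    return seconds / 3600.0
```

Calling `total_duration_hours(Path("TidyVoiceX2_ASV"))` on a local copy (the folder name is an assumption) should give a figure comparable to the total in the table, provided the same files are present.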
Language Coverage
Total languages: 62
| Language | Speakers | Utterances |
|---|---|---|
| am | 2 | 305 |
| ar | 6 | 238 |
| as | 5 | 47 |
| ba | 1 | 31 |
| be | 18 | 1400 |
| bn | 2 | 69 |
| ca | 240 | 14551 |
| ckb | 14 | 3167 |
| cs | 83 | 5708 |
| cv | 1 | 20 |
| cy | 5 | 107 |
| dav | 3 | 224 |
| de | 145 | 3319 |
| el | 3 | 328 |
| en | 1560 | 49995 |
| es | 868 | 22494 |
| et | 60 | 1875 |
| eu | 64 | 3384 |
| fa | 6 | 159 |
| fi | 43 | 1896 |
| fr | 233 | 11071 |
| gl | 35 | 1845 |
| gn | 3 | 166 |
| hi | 5 | 16 |
| hu | 78 | 5101 |
| id | 96 | 4380 |
| it | 319 | 11584 |
| ja | 17 | 526 |
| ka | 1 | 80 |
| kab | 55 | 11257 |
| kk | 19 | 389 |
| kmr | 22 | 1874 |
| ko | 9 | 456 |
| ky | 31 | 2758 |
| lg | 2 | 7 |
| lij | 1 | 300 |
| luo | 3 | 524 |
| mn | 31 | 663 |
| mt | 13 | 954 |
| myv | 2 | 100 |
| nan-tw | 11 | 896 |
| nl | 18 | 811 |
| pa-IN | 11 | 413 |
| pl | 27 | 1543 |
| pt | 49 | 1798 |
| ro | 57 | 2826 |
| ru | 137 | 4210 |
| rw | 16 | 2078 |
| sah | 3 | 107 |
| sk | 16 | 264 |
| sl | 26 | 1014 |
| sr | 16 | 715 |
| sv-SE | 122 | 6771 |
| sw | 59 | 6209 |
| ta | 2 | 104 |
| tr | 28 | 679 |
| uk | 120 | 7180 |
| ur | 42 | 4001 |
| uz | 2 | 38 |
| vi | 24 | 191 |
| yue | 1 | 12 |
| zh-CN | 14 | 545 |
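Per-language counts like those in the table above can in principle be recomputed from a local copy. The sketch below assumes the `speaker_id/language_code/utterance.wav` layout and uses a hypothetical function name, `language_stats`; it is not the script used to produce the published figures.

```python
from collections import defaultdict
from pathlib import Path


def language_stats(root: Path) -> dict[str, tuple[int, int]]:
    """Map language code -> (distinct speakers, utterance count)."""
    speakers = defaultdict(set)
    utterances = defaultdict(int)
    # Match only files exactly three levels below root.
    for wav in root.glob("*/*/*.wav"):
        speaker = wav.parent.parent.name
        language = wav.parent.name
        speakers[language].add(speaker)
        utterances[language] += 1
    return {
        lang: (len(speakers[lang]), utterances[lang])
        for lang in sorted(utterances)
    }
```

Because each speaker appears in more than one language, the per-language speaker counts add up to more than the 2,183 distinct speakers in the dataset.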
Key Features
- Multilingual coverage with 62 language codes
- Speaker-wise directory structure for verification workflows
- Cross-lingual speaker verification focus
- Pseudonymized speaker IDs
- WAV audio format
Citation
If you use this dataset in your research, please cite:
@inproceedings{farhadipour2026tidyvoice,
title={TidyVoice Challenge: Cross-Lingual Speaker Verification},
author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Vukovic, Teodora and Dellwo, Volker and Reid, Kathy and Tyers, Francis M. and Siegert, Ingo and Chodroff, Eleanor},
booktitle={Interspeech 2026},
year={2026}
}
@article{farhadipour2026tidyvoice_arxiv,
title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Chodroff, Eleanor},
journal={arXiv preprint arXiv:2601.16358},
year={2026}
}
License
This dataset is derived from Mozilla Common Voice. Please refer to the Mozilla Common Voice terms for usage conditions: https://commonvoice.mozilla.org/en/terms
Usage Restrictions
Per the stewarded usage rules for this dataset, speaker identification is not allowed.