TidyVoiceX2_ASV

License:

CC0-1.0

Steward:

TidyVoice2026 Challenge

Task: OTH

Release Date: 26 January 2026

Format: WAV

Size: 23.07 GB

Description

This dataset is designed for speaker verification and is derived from the Mozilla Common Voice corpus, covering approximately 40 additional languages beyond those included in TidyVoiceX_ASV. It comprises recordings from speakers who each appear in multiple languages. We use this multilingual overlap to construct trial pairs for studying cross-lingual variation in speaker verification. The dataset served as the evaluation set for the TidyVoice2026 Challenge.

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity.

Forbidden Usage

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity. The data MUST be used only for speaker verification tasks.

Processes

Intended Use

All rules and restrictions are the same as those of the original Mozilla Common Voice datasets.

Metadata

[Update — 2 April 2026.] A new version of this dataset is available with the same folder structure as TidyVoiceX_ASV. The official trial pair lists were released on 2 April 2026.

TidyVoiceX2_ASV Dataset

Overview

TidyVoiceX2_ASV is a multilingual speech corpus curated for cross-lingual speaker verification research. The dataset is organized speaker-wise, with per-speaker language subfolders and WAV utterances. This structure supports controlled experiments on speaker verification under language mismatch conditions.
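The speaker-wise layout described above can be loaded with a few lines of Python. The following is a minimal sketch, assuming the dataset has been extracted to a local folder that follows `speaker_id/language_code/utterance.wav`; the function names `index_dataset` and `multilingual_speakers` are illustrative and not part of any official tooling.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Map speaker_id -> language_code -> list of WAV paths.

    Assumes the layout speaker_id/language_code/utterance.wav under `root`.
    """
    index = defaultdict(lambda: defaultdict(list))
    for wav in sorted(Path(root).glob("*/*/*.wav")):
        # The last three path components are speaker, language, file name.
        speaker_id, language_code = wav.parts[-3], wav.parts[-2]
        index[speaker_id][language_code].append(wav)
    return index


def multilingual_speakers(index):
    """Return the speakers that have utterances in at least two languages."""
    return sorted(s for s, langs in index.items() if len(langs) >= 2)
```

Keeping the language level explicit in the index makes it straightforward to select same-language or language-mismatched utterance pairs for a given speaker.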

This dataset is the official evaluation set of the TidyVoice2026 Challenge, an official Interspeech 2026 challenge focused on advancing cross-lingual speaker verification systems.

Challenge website: https://tidyvoice2026.github.io/

It also serves as the evaluation benchmark partition for the TidyLang2026 Challenge, an official challenge at Odyssey 2026: https://tidylang2026.github.io/

This dataset is disjoint from the separately released TidyVoiceX_ASV dataset: the two share neither speakers nor utterance recordings.

TidyVoiceX_ASV: https://datacollective.mozillafoundation.org/datasets/cmihtsewu023so207xot1iqqw

Challenge Provenance and Evaluation Benchmarks

The evaluation benchmarks and official trial pair lists for speaker and language recognition on this dataset are publicly released by the TidyVoice/TidyLang challenges.

Official Automatic Speaker Verification (ASV) trial pair list: https://drive.google.com/file/d/1OnfzE_YFMGJKU1LOGsD1NKNaMJOvOvmM/view?usp=share_link
Official language identification (LID) and low-resource language recognition trial lists: https://drive.google.com/file/d/1XjMURDIe026IapEorg1sVQ5eZ3Jjz9Dv/view?usp=share_link
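When analysing results per condition, it can help to label each ASV trial by whether the two utterances share a speaker and a language. The sketch below relies only on the `speaker_id/language_code/utterance.wav` path layout; it does not assume anything about the file format of the official trial lists, and the function name and label strings are hypothetical.

```python
from pathlib import PurePosixPath


def trial_condition(enroll_path, test_path):
    """Label a trial from two paths shaped like speaker_id/language_code/utterance.wav."""
    # Take the speaker and language components just before the file name.
    enroll_spk, enroll_lang = PurePosixPath(enroll_path).parts[-3:-1]
    test_spk, test_lang = PurePosixPath(test_path).parts[-3:-1]
    speaker = "target" if enroll_spk == test_spk else "nontarget"
    language = "same-language" if enroll_lang == test_lang else "cross-language"
    return speaker, language
```

For example, `trial_condition("spk1/en/a.wav", "spk1/de/b.wav")` gives `("target", "cross-language")`. Deriving labels this way uses the directory names only for grouping trials, in line with the verification-only usage rules.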

Dataset Statistics

Metric	Value
Speakers	2,183
Languages	62
Utterances (WAV files)	205,773
Total duration	287.67 h
Audio Format	WAV
Dataset Organization	`speaker_id/language_code/utterance.wav`

Total duration was computed by summing per-file duration from WAV headers (16 kHz PCM) using scripts/compute_wav_duration.py on 2026-04-02.
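The header-based duration computation can be approximated as follows. This is a sketch using Python's standard `wave` module and is not the contents of `scripts/compute_wav_duration.py`, which is not reproduced here; it assumes uncompressed PCM WAV files, which is what `wave` can read.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration from the WAV header: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def total_duration_hours(root):
    """Sum header-derived durations of all WAV files under `root`, in hours."""
    total_seconds = sum(wav_duration_seconds(p) for p in Path(root).rglob("*.wav"))
    return total_seconds / 3600.0
```

Reading only the header avoids decoding the audio samples, so the pass over the whole corpus is I/O-light.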

Language Coverage

Total languages: 62

Language	Speakers	Utterances
am	2	305
ar	6	238
as	5	47
ba	1	31
be	18	1400
bn	2	69
ca	240	14551
ckb	14	3167
cs	83	5708
cv	1	20
cy	5	107
dav	3	224
de	145	3319
el	3	328
en	1560	49995
es	868	22494
et	60	1875
eu	64	3384
fa	6	159
fi	43	1896
fr	233	11071
gl	35	1845
gn	3	166
hi	5	16
hu	78	5101
id	96	4380
it	319	11584
ja	17	526
ka	1	80
kab	55	11257
kk	19	389
kmr	22	1874
ko	9	456
ky	31	2758
lg	2	7
lij	1	300
luo	3	524
mn	31	663
mt	13	954
myv	2	100
nan-tw	11	896
nl	18	811
pa-IN	11	413
pl	27	1543
pt	49	1798
ro	57	2826
ru	137	4210
rw	16	2078
sah	3	107
sk	16	264
sl	26	1014
sr	16	715
sv-SE	122	6771
sw	59	6209
ta	2	104
tr	28	679
uk	120	7180
ur	42	4001
uz	2	38
vi	24	191
yue	1	12
zh-CN	14	545
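The per-language counts above can be checked against a local copy of the data. The following is a minimal sketch, assuming the `speaker_id/language_code/utterance.wav` layout; `per_language_counts` is an illustrative name rather than an official script.

```python
from collections import defaultdict
from pathlib import Path


def per_language_counts(root):
    """Return {language_code: (n_speakers, n_utterances)} for the tree under `root`."""
    speakers = defaultdict(set)
    utterances = defaultdict(int)
    for wav in Path(root).glob("*/*/*.wav"):
        speaker_id, language_code = wav.parts[-3], wav.parts[-2]
        speakers[language_code].add(speaker_id)
        utterances[language_code] += 1
    return {lang: (len(speakers[lang]), utterances[lang]) for lang in sorted(utterances)}
```

The Utterances column sums to 205,773, matching the dataset statistics. The Speakers column sums to more than 2,183 because each speaker is counted once for every language they appear in.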

Key Features

Multilingual coverage with 62 language codes
Speaker-wise directory structure for verification workflows
Cross-lingual speaker verification focus
Pseudonymized speaker IDs
WAV audio format

Citation

If you use this dataset in your research, please cite:

@inproceedings{farhadipour2026tidyvoice,
  title={TidyVoice Challenge: Cross-Lingual Speaker Verification},
  author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Vukovic, Teodora and Dellwo, Volker and Reid, Kathy and Tyers, Francis M. and Siegert, Ingo and Chodroff, Eleanor},
  booktitle={Interspeech 2026},
  year={2026}
}

@article{farhadipour2026tidyvoice_arxiv,
  title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
  author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Chodroff, Eleanor},
  journal={arXiv preprint arXiv:2601.16358},
  year={2026}
}

License

This dataset is derived from Mozilla Common Voice. Please refer to the Mozilla Common Voice terms for usage conditions: https://commonvoice.mozilla.org/en/terms

Usage Restrictions

Per the steward's usage rules for this dataset, speaker identification is not allowed.

Contact

Email: aref.farhadipour@uzh.ch, areffarhadi@gmail.com