TidyVoiceX2_ASV

License:

CC0-1.0

Steward:

TidyVoice2026 Challenge

Task: OTH

Release Date: 1/26/2026

Format: WAV

Size: 23.07 GB


Description

This dataset is designed for speaker verification using the Mozilla Common Voice corpus, covering approximately 40 additional languages beyond those included in TidyVoiceX_ASV. It comprises recordings from different speakers, each of whom appears in multiple languages. Leveraging this multilingual overlap, we construct trial pairs to investigate cross-lingual variation in the speaker verification task. This dataset served as the evaluation set for the TidyVoice 2026 Challenge.
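To make the notion of a cross-lingual trial pair concrete, the following is a minimal sketch of how a pair of utterances might be labelled by speaker match and language match. It is an illustration only, not the challenge's trial generation procedure: the function name `label_trials`, the tuple layout, and the example IDs are all assumptions. For reported results, the official trial pair lists should be used.

```python
from itertools import combinations


def label_trials(utterances):
    """Yield (utt_a, utt_b, is_target, is_cross_lingual) for every unordered pair.

    `utterances` is a list of (utterance_id, speaker_id, language_code) tuples.
    The layout is hypothetical and chosen for illustration only.
    """
    for (utt_a, spk_a, lang_a), (utt_b, spk_b, lang_b) in combinations(utterances, 2):
        is_target = spk_a == spk_b            # same speaker: target trial
        is_cross_lingual = lang_a != lang_b   # languages differ: cross-lingual trial
        yield utt_a, utt_b, is_target, is_cross_lingual
```

Exhaustive pairing grows quadratically with the number of utterances, so a sketch like this is only practical on small subsets. Labels of this kind support verification scoring (same speaker or not) and do not attempt to recover who the speaker is.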

Specifics

Licensing

Creative Commons Zero v1.0 Universal (CC0-1.0)

https://spdx.org/licenses/CC0-1.0.html

Considerations

Restrictions/Special Constraints

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity.

Forbidden Usage

According to Mozilla’s usage rules, it is forbidden to use this dataset for speaker identification or for recovering a speaker’s identity. The data MUST only be used for speaker verification tasks.

Processes

Intended Use

All rules and restrictions are the same as those of the original Mozilla Common Voice datasets.

Metadata

[Update — 2 April 2026.] A new version of this dataset is available with the same folder structure as TidyVoiceX_ASV. The official trial pair lists were released on 2 April 2026.

TidyVoiceX2_ASV Dataset

Overview

TidyVoiceX2_ASV is a multilingual speech corpus curated for cross-lingual speaker verification research. The dataset is organized speaker-wise, with per-speaker language subfolders and WAV utterances. This structure supports controlled experiments on speaker verification under language mismatch conditions.
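As a rough illustration of how the speaker-wise layout can be traversed, the sketch below builds an in-memory index from the `speaker_id/language_code/utterance.wav` structure. The function names `index_dataset` and `summarize` are assumptions introduced here, and the code assumes the archive has been extracted so that speaker folders sit directly under the root and files use a lowercase `.wav` extension.

```python
from collections import defaultdict
from pathlib import Path


def index_dataset(root):
    """Map speaker_id -> language_code -> list of WAV paths (hypothetical helper)."""
    index = defaultdict(dict)
    # Match exactly three levels below root: speaker/language/file.wav
    for wav in sorted(Path(root).glob("*/*/*.wav")):
        speaker_id, language_code = wav.parts[-3], wav.parts[-2]
        index[speaker_id].setdefault(language_code, []).append(wav)
    return dict(index)


def summarize(index):
    """Count speakers, distinct language codes, and utterances in an index."""
    languages = {lang for langs in index.values() for lang in langs}
    utterances = sum(len(files) for langs in index.values() for files in langs.values())
    return {"speakers": len(index), "languages": len(languages), "utterances": utterances}
```

Calling `summarize(index_dataset(root))` on a local copy gives counts that can be compared with the published dataset statistics as a quick integrity check after download.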

This dataset is the official evaluation set used in the TidyVoice2026 Challenge, an official challenge at Interspeech 2026, focused on advancing cross-lingual speaker verification systems.

Challenge website: https://tidyvoice2026.github.io/

The dataset also serves as the evaluation partition for the TidyLang2026 Challenge, an official challenge at Odyssey 2026: https://tidylang2026.github.io/

TidyVoiceX2_ASV is disjoint from the separately released TidyVoiceX_ASV dataset: the two share neither speakers nor utterance recordings.
TidyVoiceX_ASV: https://datacollective.mozillafoundation.org/datasets/cmihtsewu023so207xot1iqqw

Challenge Provenance and Evaluation Benchmarks

The evaluation benchmarks and official trial pair lists for speaker and language recognition on this dataset are publicly released by the TidyVoice/TidyLang challenges.

Dataset Statistics

Speakers: 2,183
Languages: 62
Utterances (WAV files): 205,773
Total duration: 287.67 h
Audio format: WAV
Dataset organization: speaker_id/language_code/utterance.wav

Total duration was computed by summing per-file duration from WAV headers (16 kHz PCM) using scripts/compute_wav_duration.py on 2026-04-02.
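The header-based duration calculation described above can be approximated as follows. This is a sketch under stated assumptions, not the contents of scripts/compute_wav_duration.py, which is not reproduced here. It uses Python's standard `wave` module, assumes uncompressed PCM files in the `speaker_id/language_code/utterance.wav` layout, and the function names are hypothetical.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration from the WAV header: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def total_duration_hours(root):
    """Sum header-derived durations over every speaker/language/*.wav file."""
    total_seconds = sum(wav_duration_seconds(p) for p in Path(root).glob("*/*/*.wav"))
    return total_seconds / 3600.0
```

Because only the header is read, no audio samples are decoded, which keeps the pass over a large corpus fast. The result depends on the headers being accurate, so a truncated or corrupted file could report a duration that differs from its actual audio content.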

Language Coverage

Total languages: 62

LanguageSpeakersUtterances
am2305
ar6238
as547
ba131
be181400
bn269
ca24014551
ckb143167
cs835708
cv120
cy5107
dav3224
de1453319
el3328
en156049995
es86822494
et601875
eu643384
fa6159
fi431896
fr23311071
gl351845
gn3166
hi516
hu785101
id964380
it31911584
ja17526
ka180
kab5511257
kk19389
kmr221874
ko9456
ky312758
lg27
lij1300
luo3524
mn31663
mt13954
myv2100
nan-tw11896
nl18811
pa-IN11413
pl271543
pt491798
ro572826
ru1374210
rw162078
sah3107
sk16264
sl261014
sr16715
sv-SE1226771
sw596209
ta2104
tr28679
uk1207180
ur424001
uz238
vi24191
yue112
zh-CN14545

Key Features

  • Multilingual coverage with 62 language codes

  • Speaker-wise directory structure for verification workflows

  • Cross-lingual speaker verification focus

  • Pseudonymized speaker IDs

  • WAV audio format

Citation

If you use this dataset in your research, please cite:

@inproceedings{farhadipour2026tidyvoice,
  title={TidyVoice Challenge: Cross-Lingual Speaker Verification},
  author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Vukovic, Teodora and Dellwo, Volker and Reid, Kathy and Tyers, Francis M. and Siegert, Ingo and Chodroff, Eleanor},
  booktitle={Interspeech 2026},
  year={2026}
}

@article{farhadipour2026tidyvoice_arxiv,
  title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
  author={Farhadipour, Aref and Marquenie, Jan and Madikeri, Srikanth and Chodroff, Eleanor},
  journal={arXiv preprint arXiv:2601.16358},
  year={2026}
}

License

This dataset is derived from Mozilla Common Voice. Please refer to the Mozilla Common Voice terms for usage conditions: https://commonvoice.mozilla.org/en/terms

Usage Restrictions

Per the stewarded usage rules for this dataset, speaker identification is not allowed.

Contact