Ewondo_Mbida-Mbani_ALCAM-MultimodalDataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: NLP

Release Date: 1/5/2026

Format: MP3, TSV

Size: 19.25 MB


Description

This dataset comprises a datasheet of Ewondo (Ewo) lexical entries collected in the speech area known as Mbida Mbani. Each entry is accompanied by illustrative sentences, word-by-word glosses and French translations. The resource is enriched with aligned audio recordings, making it suitable for linguistic analysis and speech technology development.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

By downloading this dataset, you agree:

  • to use it for research and scientific purposes only;

  • not to re-host or re-share this dataset.

Forbidden Usage

You agree not to use the data for:

  • determining the identity of the speakers in the dataset;

  • attempting to clone the voice of the speakers in this dataset, or training models that imitate them;

  • generative AI, reproduction, duplication, modification, augmentation, copying, distribution, transmission, display, sale, transfer, publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

(a) Speech-related tasks

  • Automatic speech recognition (ASR): Audio–text alignment allows the evaluation of speech recognition models for Ewondo. However, it should be noted that the read sentences are transcribed phonetically. Ewondo has competing orthographic standards: the General Alphabet of Cameroon's Languages, which is the closest to phonetic transcription, and the Catholic missionaries' orthography, inspired by the model laid out by François Pichon in 1950.

  • Text-to-speech (TTS): As the dataset contains clean sentence–audio pairs, it can also be used to evaluate speech synthesis or text-to-speech models. Here again, it should be noted that the sentences are written in the IPA and not in the General Alphabet of Cameroon's Languages, the Protestant alphabet or the Catholic alphabet.

  • Speech–text alignment/forced alignment benchmarking: Fine-grained, word-level segmentation provides ideal ground truth for evaluating phoneme- or word-level aligners.

(b) Translation and multilingual tasks

  • Machine translation (Ewondo ↔ French): The sentence-level alignment between Ewondo and French makes the dataset a parallel corpus for evaluating translation models, subject to the limitations of the phonetic transcription used.

  • Speech translation (speech-to-text)

(c) Linguistic and lexicographic tasks

  • Morphological analysis/glossed corpus studies: The morpheme-level glosses are valuable for computational morphology, interlinear text modelling (ILTs) and grammar induction tasks.

  • Lexicon and part-of-speech tagging: The lexical entries and part-of-speech labels are useful for building linguistic resources such as dictionaries, morphological analysers or taggers for Ewondo.
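Because the transcriptions are in IPA with tone diacritics, any scoring of ASR output against them is sensitive to how accented characters are encoded. The following is a minimal sketch of a character error rate (CER) computation that normalises both strings to Unicode NFC before comparing them. The function names `edit_distance` and `cer` are illustrative and are not part of the dataset; whether tone marks should count as separate errors is a design choice this sketch does not settle.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate after NFC normalisation of both strings."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference transcription is empty")
    return edit_distance(ref, hyp) / len(ref)


# A precomposed and a decomposed "a with grave" compare as equal after NFC.
same = cer("\u00e0.s\u00f2\u014b", "a\u0300.s\u00f2\u014b")
# Dropping the tone mark on the first vowel counts as one substitution.
one_error = cer("\u00e0.s\u00f2\u014b", "a.s\u00f2\u014b")
```

Normalising to NFC keeps a precomposed vowel such as à as a single character, so a missing tone is scored as one substitution rather than one deletion.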

Metadata

Language

Ewondo is a Narrow Bantu language indigenous to a population mainly located in the Centre Region of Cameroon, with pockets of settlement in the South and East Regions. It also serves as a vehicular language for populations in the South and East Regions of Cameroon, and has developed into a creole known as Mongo Ewondo.

Variants

The term 'Ewondo' is used to describe a set of linguistic varieties whose speakers may or may not identify with the term. It is very difficult to determine with confidence which variables would allow a particular linguistic variety to be categorised as Ewondo without distorting reality. For this reason, the author of this dataset has deemed it worthwhile to refer to the specific geographical locations or particular subgroups in which the data presented in this dataset was collected. This dataset was collected in the Mbida-Mbani subgroup, located southeast of Yaoundé, primarily in the Nyong-et-Mfoumou and Nyong-et-So'o Divisions (around Akonolinga/Endom and Mbalmayo/Dzeng).

Writing System

The writing system used for the transcription of Ewondo in this dataset is the International Phonetic Alphabet (IPA).

1. Vowels

i, e, ɛ, a, ɔ, o, u, ə

2. Consonants

b, d, dz, f, g, ɣ, h, k, l, m, mb, mf, mv, n, ɲ, ŋ, nd, ŋg, ŋk, ndz, r, p, s, t, ts, v, w, y, z

3. Tone system

The datasheet shows lexical and grammatical contrastive tones, marked directly on vowels and sonorants:

  • High tone (H): á, é, í, ó, ú, ɛ́, ɔ́, ń

  • Low tone (L): à, è, ì, ò, ù, ɛ̀, ɔ̀, ǹ

  • Falling contour tone (HL): â, ê, î, ô, û, ɛ̂, ɔ̂

  • Rising contour tone (LH): ǎ, ě, ǒ, ǔ, ɛ̌, ɔ̌

  • Mid / level tone: ā, ē, ī, ō, ū, ɛ̄, ɔ̄
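The inventory above can be turned into a small processing sketch. The code below reads tone labels off the diacritics (after Unicode NFD decomposition) and splits a transcribed word into the vowel and consonant units listed above. The names `tone_sequence` and `segment`, and the greedy longest-match strategy, are assumptions made for illustration; they are not part of the dataset, and greedy matching may mis-segment sequences that straddle a morpheme boundary.

```python
import unicodedata

# Combining diacritics mapped to the tone labels used in the list above.
TONE_MARKS = {
    "\u0301": "H",   # acute
    "\u0300": "L",   # grave
    "\u0302": "HL",  # circumflex
    "\u030C": "LH",  # caron
    "\u0304": "M",   # macron
}

VOWELS = ["i", "e", "ɛ", "a", "ɔ", "o", "u", "ə"]
CONSONANTS = ["b", "d", "dz", "f", "g", "ɣ", "h", "k", "l", "m", "mb", "mf",
              "mv", "n", "ɲ", "ŋ", "nd", "ŋg", "ŋk", "ndz", "r", "p", "s",
              "t", "ts", "v", "w", "y", "z"]
# Longest units first, so that "ndz" is tried before "nd" and "n".
UNITS = sorted(VOWELS + CONSONANTS, key=len, reverse=True)


def tone_sequence(text):
    """Return the tone labels in order of appearance."""
    decomposed = unicodedata.normalize("NFD", text)
    return [TONE_MARKS[ch] for ch in decomposed if ch in TONE_MARKS]


def strip_tones(text):
    """Remove all combining marks, leaving the bare segments."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def segment(word):
    """Greedy longest-match split into inventory units; other characters pass through."""
    plain = strip_tones(word)
    units, i = [], 0
    while i < len(plain):
        for unit in UNITS:
            if plain.startswith(unit, i):
                units.append(unit)
                i += len(unit)
                break
        else:
            units.append(plain[i])  # e.g. "." or "-" separators
            i += 1
    return units


tones = tone_sequence("m\u0259\u0300.s\u00f2\u014b")  # the sample word "mə̀.sòŋ"
pieces = segment("mv\u00f9\u00e0d")                    # the stem "mvùàd"
```

Decomposing to NFD first means the same code handles both precomposed letters (such as à) and letters that only exist with a combining mark (such as ə̀).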

Source

The dataset was collected through a questionnaire that was designed to gather basic information about the Ewondo lexicon and grammar. This was done as part of the Atlas Linguistique du Cameroun (ALCAM) project.

Domain

The dataset represents a linguistic questionnaire designed to elicit the basic lexicon and grammatical information.

Size

Total size is 19.25 MB.

Structure

The dataset comprises: 1) a datasheet with 399 lines and 19 columns; 2) 378 voice clips read by a single female native speaker; 3) sentence-to-audio mapping with 383 lines and three columns.

Description of columns
  • #OrigID: original number of lexical entry on paper questionnaire

  • #EditID: modification of #OrigID

  • #FrenchRef: reference entry (originally provided in French)

  • #FrenchComm: Original comments about reference entry (#FrenchRef)

  • #French: Lexical entry in French (overlaps with #FrenchRef)

  • #Note: note of researcher on the lexical entry

  • #POS: part of speech

  • #Class: noun class (where applicable)

  • #Morph: morphological attribute (ex. plural, singular)

  • #Var: (na)

  • #Word: Lexical entry in Ewondo, Yanda variety

  • #CrossRef: Cross-referencing of lexical entry number

  • #FrenchEx: Example sentence in French

  • #LangEx: Example sentence in Ewondo

  • #LangExEdit: manual editing of #LangEx

  • #LangPars: word for word parsing in Ewondo

  • #LangParsEdit: editing of #LangPars

  • #FrenchPars: French equivalent of #LangParsEdit

  • #FrenchParsdit: editing of #FrenchPars
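As a rough illustration of how the datasheet might be read, the sketch below parses tab-separated text into dictionaries keyed by column name. It assumes that the first line of the TSV is a header row and that the header names carry the leading "#" shown above; neither is confirmed here. The function name `read_datasheet` is hypothetical, and the demo row uses made-up placeholder values for #OrigID and #POS.

```python
import csv
import io


def read_datasheet(handle):
    """Read a tab-separated datasheet into a list of dicts keyed by column name."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        rows.append({
            key.lstrip("#"): (value or "")   # drop the leading '#', if present
            for key, value in row.items()
            if key is not None               # ignore cells beyond the header width
        })
    return rows


# Demo input: column names follow the list above; the values are placeholders.
demo = io.StringIO(
    "#OrigID\t#POS\t#Word\t#LangEx\n"
    "1\tn\tà.sòŋ\té nə̀ mə̀.sòŋ mə́-mvú\n"
)
rows = read_datasheet(demo)

# Keep only the entries that come with an example sentence.
with_examples = [r for r in rows if r["LangEx"]]
```

With the real file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the TSV was saved.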

Sample

  1. 4255e148b8cee6ddd6265e51658516e6.mp3 à.ɲù; à bə̀lə́ màán á.ɲù

  2. 7dbca7b76f6fa62b106e5f195a4556c4.mp3 mə̀.ɲù;

  3. cda6cbacc562ff769013dfca139f0a96.mp3 dís; bə́ fìgì mís mábán

  4. bfd576d59d0ccec1f2600df92cc81cd9.mp3 mís;

  5. 08e5e1fa476404337a0479d10

  6. 144024b5894347c27380ee15081ef56f.mp3 ǹ.mvùàd; mì.mvùàd-míé hín / vín

  7. cb68b80d4dafce70c7ec6ab980c424a4.mp3 mì.mvùàd;

  8. 4708358936ded26934c7dacfc8df7c9e.mp3 à.sòŋ; é nə̀ mə̀.sòŋ mə́-mvú

  9. abf2cbfbac3b6d0e8a89a4f76b0f56e4.mp3 mə̀.sòŋ;

  10. abe24163f1297cee96fba5f0bcfa463d.mp3 ò.yém; à lób ó.yém - wúé

  11. 5422a654859ed74666010f1cf720ce89.mp3 à.yém;

  12. d6686a7fd1f3bc3829ecbd8a7404bc89.mp3 dzúé; à yàì dzúé díé