Spoken-Congolese-French-Dataset

License:

NOODL-1.0

Steward:

Institute of African Digital Humanities

Task: NLP

Release Date: 1/5/2026

Format: MP3, WAV, TSV

Size: 3.44 GB


Description

The dataset consists of paired audio and text resources for spoken French from the Republic of the Congo. The audio clips were extracted from longer recordings of semi-guided interviews conducted in Brazzaville, and orthographic transcriptions were added. The long audio recordings and their corresponding TRJS transcription files were segmented automatically, so that each audio clip is paired with its transcription. The dataset comprises ten folders containing audio files and ten audio/text mapping files.

Specifics

Licensing

Nwulite Obodo Open Data Licence 1.0 (NOODL-1.0)

https://licensingafricandatasets.com/nwulite-obodo-license

Considerations

Restrictions/Special Constraints

You agree:

- To use this dataset for research and scientific use only
- That you will not re-host or re-share this dataset

Forbidden Usage

You agree not to use the data for:

- Determining the identity of the speakers in the dataset
- Attempting to clone the voice or train models that imitate the speakers in this dataset
- Generative AI
- Reproduction
- Duplication
- Modification
- Augmentation
- Copying
- Distribution
- Transmission
- Display
- Sale
- Transfer
- Publication or creation of derivative works without the explicit permission of the legal owner of the dataset.

Processes

Intended Use

The audio-text alignment in this dataset allows speech recognition models to be trained or evaluated, supporting the development of more inclusive and representative ASR models for French.
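As an illustration of the evaluation use case, here is a minimal sketch of word error rate (WER), a metric commonly used to score ASR output against reference transcriptions. This card does not prescribe a metric, so the choice of WER, the function name `word_error_rate`, and the whitespace tokenisation are all assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A reference transcription from a mapping file would be passed as `reference` and the model output for the matching clip as `hypothesis`. Whether to normalise case or punctuation first is a design choice left open here.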

Metadata

Language

Global linguistic governance does not formally recognise Congolese French as a language or language variety. The locale 'fr-CG' (Congolese French) used in this dataset is an attempt to reclaim the term 'Congolese French' for this variety of French. Although the variety of French used in the Congo lacks formal recognition, French speakers from other countries, particularly in Africa, can recognise French speakers from the Republic of the Congo as speaking a specific variety of French. In some cases, it may even be possible to identify someone as Congolese on the basis of their accent and/or linguistic repertoire. Researchers at local universities and abroad have carried out numerous efforts to document and describe the Congolese variety of French.

Variants

Although Congolese French may be subject to internal variation, there is insufficient information available to substantiate the existence of particular sub-varieties or their peculiarities. Nevertheless, the audio recordings in this dataset were made in Brazzaville with people who had lived there long enough to recount the origins of their settlement.

Alphabet

The alphabet used in the transcription of audio recordings is that of standard French: a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z.

Source

This dataset was compiled from transcriptions of audio recordings made by a research group at the Université Marien Ngouabi in Brazzaville as part of a project led by Professor Edouard Ngamountsika. Researchers conducted audio recordings in some parts of Brazzaville using prompts such as 'How did you come to settle in this place?'. The recordings were then transcribed using a multimodal annotation framework, including indications of silence and noise.

Domain

The dataset consists of spontaneous spoken Congolese French collected in urban settings, primarily through informal sociolinguistic interviews. The texts cover everyday conversational interaction, interview discourse, personal and situational narration, and metacommunicative speech related to the recording process.

Size

Total size is 3.27 GB

Structure

This dataset comprises audio clips and audio/text mapping files. There are 13,344 audio clips totalling 6 hours, 8 minutes and 12.286 seconds, as well as 44 audio/text mapping files totalling 13,346 lines.
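The figures above imply a short average clip length. The sketch below converts the stated total duration to seconds and divides it by the stated clip count. The constant and function names are illustrative, and only the numbers come from this card.

```python
# Figures taken from the Structure paragraph of this card.
CLIP_COUNT = 13_344
TOTAL_SECONDS = 6 * 3600 + 8 * 60 + 12.286  # 6 h 8 min 12.286 s


def mean_clip_seconds(total_seconds: float, clip_count: int) -> float:
    """Return the average clip duration in seconds."""
    return total_seconds / clip_count


if __name__ == "__main__":
    print(f"total duration: {TOTAL_SECONDS:.3f} s")
    print(f"mean clip length: {mean_clip_seconds(TOTAL_SECONDS, CLIP_COUNT):.2f} s")
```

The total comes to 22,092.286 seconds, or about 1.66 seconds per clip on average, which is consistent with the short utterance-level segments shown in the sample.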

Sample

  1. Spoken-Congolese-French-Dataset_111_0002.wav Bonjour, madame !

  2. Spoken-Congolese-French-Dataset_111_0003.wav Comment allez-vous ?

  3. Spoken-Congolese-French-Dataset_111_0004.wav Je me porte à merveille. Et vous ?

  4. Spoken-Congolese-French-Dataset_111_0005.wav ça va bien. Merci !

  5. Spoken-Congolese-French-Dataset_111_0006.wav (Sifflotement) Euh...J'ai quelques questions à vous poser, si cela ne vous dérange pas.

  6. Spoken-Congolese-French-Dataset_111_0007.wav Cela ne me dérange pas. Allez-y !

  7. Spoken-Congolese-French-Dataset_111_0008.wav Comment vous ou vos parents avez-vous fait pour arriver ici ?

  8. Spoken-Congolese-French-Dataset_111_0009.wav (((Silence)))

  9. Spoken-Congolese-French-Dataset_111_0010.wav Mon père avait acheté une maison ici à Mfilou,

  10. Spoken-Congolese-French-Dataset_111_0011.wav et il a construit,

  11. Spoken-Congolese-French-Dataset_111_0012.wav et nous sommes ici depuis 2010.

  12. Spoken-Congolese-French-Dataset_111_0013.wav C'est pour cela que nous sommes arrivés à Mfilou.

  13. Spoken-Congolese-French-Dataset_111_0014.wav (((Silence)))

  14. Spoken-Congolese-French-Dataset_111_0015.wav Habitez-vous avec vos deux parents ?
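The sample lines above pair a file name with a transcription, and some entries hold only a parenthesised annotation such as `(((Silence)))`. The sketch below shows one way to read such lines and flag annotation-only clips. It assumes a tab between the two fields (the card lists TSV as a format but does not describe the column layout) and assumes that annotations are always wrapped in one or more parentheses. The names `Entry`, `parse_line` and `is_annotation_only` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: non-speech annotations are wrapped in one or more parentheses,
# as in "(Sifflotement)" and "(((Silence)))" in the sample.
ANNOTATION = re.compile(r"\(+[^()]*\)+")


class Entry(NamedTuple):
    audio_file: str
    transcript: str


def parse_line(line: str) -> Entry:
    """Split one mapping line into file name and transcription.

    Assumes a tab separator; falls back to the first space, since the
    file names in the sample contain no spaces.
    """
    line = line.rstrip("\n")
    if "\t" in line:
        name, text = line.split("\t", 1)
    else:
        name, _, text = line.partition(" ")
    return Entry(name.strip(), text.strip())


def is_annotation_only(transcript: str) -> bool:
    """True when the transcript holds annotations but no spoken words."""
    stripped = transcript.strip()
    return bool(stripped) and not ANNOTATION.sub("", stripped).strip()
```

For example, `is_annotation_only(parse_line(line).transcript)` can be used to skip silence-only clips when building an evaluation set, without altering the underlying files.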