Balochi Academy Text Corpus

License: CC-BY-NC-SA-4.0

Steward: Balochi Academy

Task: NLP

Release Date: 1/5/2026

Format: TXT

Size: 1.88 MB


Description

This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.

Specifics

Licensing

Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)

https://spdx.org/licenses/CC-BY-NC-SA-4.0.html

Metadata

Language

Balochi is an Iranian language (Indo-European family) spoken primarily across Balochistan (Pakistan and Iran) and parts of Afghanistan, with large diaspora communities in the Gulf and elsewhere. It is commonly written in a Perso-Arabic script and is widely used in oral traditions such as poetry, proverbs, and riddles, alongside modern writing in novels and journalism. Balochi has several major regional varieties, often grouped as Western, Eastern, and Southern, which differ in pronunciation, vocabulary, and some grammatical patterns. In and around Quetta, the variety most commonly associated with everyday use is Western Balochi (bgn).

Domains of the Text

  • Literature (Creative writing)

  • Poetry (Aesthetic / cultural expression)

  • Journalism & General Writing

  • Folklore & Oral Tradition (Textual form)

  • Everyday Social Themes (as reflected in texts)

  • Cultural Knowledge & Heritage

  • Language Variation & Style

Balochi Script

آ ا ب پ ت ٹ ج چ د ڈ ر ڑ ز ژ س ش ک گ ل ن م ۆ و ه ئ ی ێ ے َ ِ ُ ْ ص ض ط ظ خ ث ع غ ذ ف ق

Recommended Processing

Dataset structure

  • The dataset has 14 files.

  • Each file name matches the content inside (e.g., novel.txt, sentences.txt, proverbs.txt, riddles.txt).

  • Treat each file as a separate genre/domain container.
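The structure above can be sketched in Python as a loader that keeps each file as its own genre container. This is a minimal sketch, not a reference implementation: the function name load_corpus is hypothetical, and it assumes the files sit in one directory, end in .txt, and are already readable as UTF-8.

```python
from pathlib import Path


def load_corpus(root):
    """Map each file's stem (e.g. 'proverbs') to the full text of that file."""
    corpus = {}
    for path in sorted(Path(root).glob("*.txt")):
        # The file name describes the content, so the stem doubles as the genre key.
        # Assumption: the file decodes as UTF-8; a decoding error is raised otherwise.
        corpus[path.stem] = path.read_text(encoding="utf-8")
    return corpus
```

Keeping the genres in separate dictionary entries makes it easy to sample, split, or evaluate per genre instead of over one merged text.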

Keep two layers

  • Raw: original 14 files (unchanged)

  • Clean: same 14 files after normalization (same filenames)
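One way to maintain the two layers is to read from a Raw directory and write to a separate Clean directory under the same filenames, never modifying the originals. The sketch below assumes this directory layout; build_clean_layer and its normalize parameter (any function from text to text) are hypothetical names chosen for illustration.

```python
from pathlib import Path


def build_clean_layer(raw_dir, clean_dir, normalize):
    """Write a normalized copy of every raw .txt file, keeping the filenames."""
    raw_dir, clean_dir = Path(raw_dir), Path(clean_dir)
    clean_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for raw_path in sorted(raw_dir.glob("*.txt")):
        # Raw files are only read, so the Raw layer stays unchanged.
        text = raw_path.read_text(encoding="utf-8")
        out_path = clean_dir / raw_path.name  # same filename in the Clean layer
        out_path.write_text(normalize(text), encoding="utf-8")
        written.append(out_path.name)
    return written
```

Because the filenames match across layers, any cleaned file can be compared directly against its raw counterpart.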

Add file-level metadata (one row per file)

Include: file_id, file_name, content_type, language_iso639_3 (bgn), variety (Quetta), script, word_count, cleaning_level, rights_status, license, notes
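A file-level metadata table with the fields listed above could be generated along these lines. This is a sketch under several assumptions that are not part of the dataset: the file_id scheme ("bac-01", ...) is invented, content_type is taken from the file name, word_count is a plain whitespace split, script is filled with the label "Perso-Arabic", and rights_status and notes are left empty for the steward to complete.

```python
import csv
from pathlib import Path

FIELDS = [
    "file_id", "file_name", "content_type", "language_iso639_3", "variety",
    "script", "word_count", "cleaning_level", "rights_status", "license", "notes",
]


def build_metadata(root, cleaning_level="raw"):
    """Return one metadata row (a dict) per .txt file in the directory."""
    rows = []
    for index, path in enumerate(sorted(Path(root).glob("*.txt")), start=1):
        text = path.read_text(encoding="utf-8")
        rows.append({
            "file_id": f"bac-{index:02d}",      # hypothetical ID scheme
            "file_name": path.name,
            "content_type": path.stem,          # assumption: name reflects content
            "language_iso639_3": "bgn",
            "variety": "Quetta",
            "script": "Perso-Arabic",
            "word_count": len(text.split()),    # whitespace-delimited count
            "cleaning_level": cleaning_level,
            "rights_status": "",                # not stated here; fill in manually
            "license": "CC-BY-NC-SA-4.0",
            "notes": "",
        })
    return rows


def write_metadata(rows, out_path):
    """Write the rows to a UTF-8 CSV file with a header line."""
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Running build_metadata once per layer, with cleaning_level set accordingly, gives comparable rows for the Raw and Clean versions of each file.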

Cleaning (Clean layer)

  • UTF-8 encoding, Unicode normalization, whitespace/punctuation cleanup

  • Remove stray symbols/markup if needed
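The cleaning steps above might look like the following in Python. The recommendation does not name a normalization form, so the choice of NFC here is an assumption, as is the function name clean_text. The sketch covers only normalization, whitespace cleanup, and removal of a byte-order mark; it leaves punctuation and other invisible characters untouched, since some of them may be orthographically meaningful in Perso-Arabic text.

```python
import re
import unicodedata


def clean_text(text):
    """Apply conservative cleanup to already-decoded text."""
    # NFC (assumed form) composes sequences such as alef + maddah into one code point.
    text = unicodedata.normalize("NFC", text)
    # Remove byte-order marks, a common stray symbol in exported files.
    text = text.replace("\ufeff", "")
    # Unify line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00a0]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The function is idempotent, so it can be re-run over the Clean layer without changing it further.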

Sample Text

بلوچی عہدی شاعری ءِ تہا ڈرامہ ءِ کُلّیں سپت موجود انت۔ اگاں ما مروچی حانی ءُ شے مرید ءِ شعری داستان ءَ واناں یا اشکناں تہ اے پیمیں شاعری مارا ہما عہد ءُ دئور ءُ باری ءِ گوازینگ ءِ راہبندانی چپّ ءُ چاگرد ءَ پیش داریت۔ سیوی ءِ جلگہیں شہر انت، میر چاکر ءِ ماڑی ءَ رندانی کچہری ءُ دیوان انت، مُچّی ءُ مراگاہ انت،بندات ءَ شاعر میر چاکر ءِ کردار ءَ دیما کاریت کہ آ دیوان ءِ دیما یک بندے بندیت کہ دیوان ءِ نندوک ہما بند ءِ بوجگ ءُ تچک کنگ ءَ حیران انت: