Balochi Academy Text Corpus
License: CC-BY-NC-SA-4.0
Steward: Balochi Academy
Task: NLP
Release Date: 1/5/2026
Format: TXT
Size: 1.88 MB
Description
This corpus contains approximately 500k tokens of text from novels, poetry, articles, riddles, and proverbs, covering both literary and traditional genres. It is intended for linguistic research, NLP tasks (e.g., language modeling and text analysis), and cultural documentation.
Specifics
Licensing
Creative Commons Attribution Non Commercial Share Alike 4.0 International (CC-BY-NC-SA-4.0)
https://spdx.org/licenses/CC-BY-NC-SA-4.0.html
Metadata
Language
Balochi is an Iranian language (Indo-European family) spoken primarily across Balochistan (Pakistan and Iran) and parts of Afghanistan, with large diaspora communities in the Gulf and elsewhere. It is commonly written in a Perso-Arabic script, and it is widely used in oral traditions such as poetry, proverbs, and riddles, alongside modern writing in novels and journalism. Balochi has several major regional varieties, often grouped as Western, Eastern, and Southern, which differ in pronunciation, vocabulary, and some grammatical patterns. In and around Quetta, the variety most commonly associated with everyday use is Western Balochi (bgn).
Domains of the Text
Literature (Creative writing)
Poetry (Aesthetic / cultural expression)
Journalism & General Writing
Folklore & Oral Tradition (Textual form)
Everyday Social Themes (as reflected in texts)
Cultural Knowledge & Heritage
Language Variation & Style
Balochi Script
آ ا ب پ ت ٹ ج چ د ڈ ر ڑ ز ژ س ش ک گ ل ن م ۆ و ه ئ ی ێ ے َ ِ ُ ْ ص ض ط ظ خ ث ع غ ذ ف ق
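A character inventory like the one above can be used to surface characters in a file that are not on the list, for example punctuation, digits, or alternative code points for similar-looking letters. The following is a minimal sketch; the function name `out_of_inventory` is hypothetical, and the inventory is passed in by the caller rather than hard-coded.

```python
from collections import Counter


def out_of_inventory(text, inventory):
    """Count non-whitespace characters in `text` that are absent from `inventory`.

    `inventory` is a string of allowed characters; whitespace inside it
    (such as the spaces separating the letters) is ignored.
    """
    allowed = {ch for ch in inventory if not ch.isspace()}
    return Counter(
        ch for ch in text if not ch.isspace() and ch not in allowed
    )
```

Passing the script line from this card as `inventory` and the contents of one corpus file as `text` gives a frequency table of characters to review. A character that appears in the result is not necessarily an error; it only means it is not in the listed inventory.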
Recommended Processing
Dataset structure
The dataset has 14 files.
Each file name matches the content inside (e.g., novel.txt, sentences.txt, proverbs.txt, riddles.txt).
Treat each file as a separate genre/domain container.
Keep two layers
Raw: original 14 files (unchanged)
Clean: same 14 files after normalization (same filenames)
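One way to keep the two layers apart is to read from a raw directory and write to a separate clean directory under the same file names. This is a sketch only: the directory layout, the function name `build_clean_layer`, and the `normalize` callback are assumptions, not part of the dataset.

```python
from pathlib import Path


def build_clean_layer(raw_dir, clean_dir, normalize):
    """Write a normalized copy of every .txt file in raw_dir to clean_dir.

    Raw files are only read, never modified. `normalize` is any function
    that takes a string and returns the cleaned string.
    """
    raw_dir, clean_dir = Path(raw_dir), Path(clean_dir)
    if raw_dir.resolve() == clean_dir.resolve():
        raise ValueError("clean_dir must differ from raw_dir")
    clean_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for src in sorted(raw_dir.glob("*.txt")):
        text = src.read_text(encoding="utf-8")
        # Same file name in the clean layer, as recommended above.
        (clean_dir / src.name).write_text(normalize(text), encoding="utf-8")
        written.append(src.name)
    return written
```

Called as `build_clean_layer("raw", "clean", my_cleaner)`, it returns the list of file names it wrote, which can be compared against the expected 14.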
Add file-level metadata (one row per file)
Include:
file_id, file_name, content_type, language_iso639_3 (bgn), variety (Quetta), script, word_count, cleaning_level, rights_status, license, notes
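The file-level metadata could be produced as a CSV with one row per file, using the field names listed above. The sketch below makes several assumptions: `content_type` is taken from the file name without its extension, `word_count` is a plain whitespace split, `script` is recorded as "Perso-Arabic", and `cleaning_level`, `rights_status`, and `notes` are supplied by the caller. The function names are hypothetical.

```python
import csv
from pathlib import Path

FIELDS = [
    "file_id", "file_name", "content_type", "language_iso639_3", "variety",
    "script", "word_count", "cleaning_level", "rights_status", "license",
    "notes",
]


def describe_file(path, file_id, cleaning_level="raw", rights_status="", notes=""):
    """Build one metadata row (a dict keyed by FIELDS) for a corpus file."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    return {
        "file_id": file_id,
        "file_name": path.name,
        "content_type": path.stem,        # assumption: file name reflects genre
        "language_iso639_3": "bgn",
        "variety": "Quetta",
        "script": "Perso-Arabic",
        "word_count": len(text.split()),  # assumption: whitespace tokenization
        "cleaning_level": cleaning_level,
        "rights_status": rights_status,
        "license": "CC-BY-NC-SA-4.0",
        "notes": notes,
    }


def write_metadata(rows, out_csv):
    """Write the metadata rows to a UTF-8 CSV file with a header line."""
    with open(out_csv, "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A whitespace split is only a rough word count for Balochi text; a different tokenizer can be substituted without changing the table layout.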
Cleaning (Clean layer)
UTF-8 encoding, Unicode normalization, whitespace and punctuation cleanup
Remove stray symbols/markup if needed
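The cleaning steps above might look like the following in Python. This is a sketch under stated assumptions: the card does not name a normalization form, so NFC is used as a conservative default; "punctuation cleanup" is interpreted narrowly as removing spaces before common Latin and Arabic-script punctuation marks; and removal of stray symbols or markup is left out because it depends on what actually occurs in each file. The name `clean_text` is hypothetical.

```python
import re
import unicodedata

# Runs of spaces, tabs, and no-break spaces. Zero-width joiners and
# non-joiners are deliberately left alone: they can be meaningful in
# Perso-Arabic text.
_SPACE_RUN = re.compile(r"[ \t\u00a0]+")
# Spaces directly before Arabic comma, semicolon, question mark, full stop,
# or common Latin punctuation.
_SPACE_BEFORE_PUNCT = re.compile(r" +([\u060C\u061B\u061F\u06D4.,;:!?])")
_BLANK_RUN = re.compile(r"\n{3,}")


def clean_text(text, form="NFC"):
    """Normalize Unicode and tidy whitespace and punctuation spacing."""
    text = unicodedata.normalize(form, text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    lines = []
    for line in text.split("\n"):
        line = _SPACE_RUN.sub(" ", line).strip()
        line = _SPACE_BEFORE_PUNCT.sub(r"\1", line)
        lines.append(line)

    # Collapse three or more newlines into one blank line and end the
    # text with a single newline.
    return _BLANK_RUN.sub("\n\n", "\n".join(lines)).strip() + "\n"
```

NFC composes sequences such as alef followed by a combining madda into the single letter آ, so visually identical strings compare equal after cleaning. Compatibility forms (NFKC/NFKD) change more characters and are worth checking against the data before use.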
Sample Text
بلوچی عہدی شاعری ءِ تہا ڈرامہ ءِ کُلّیں سپت موجود انت۔ اگاں ما مروچی حانی ءُ شے مرید ءِ شعری داستان ءَ واناں یا اشکناں تہ اے پیمیں شاعری مارا ہما عہد ءُ دئور ءُ باری ءِ گوازینگ ءِ راہبندانی چپّ ءُ چاگرد ءَ پیش داریت۔ سیوی ءِ جلگہیں شہر انت، میر چاکر ءِ ماڑی ءَ رندانی کچہری ءُ دیوان انت، مُچّی ءُ مراگاہ انت،بندات ءَ شاعر میر چاکر ءِ کردار ءَ دیما کاریت کہ آ دیوان ءِ دیما یک بندے بندیت کہ دیوان ءِ نندوک ہما بند ءِ بوجگ ءُ تچک کنگ ءَ حیران انت:
