Sindh Line Publishers
License:
CC-BY-SA-4.0
Steward:
Sindh Line Publishers
Task: NLP
Release Date: 1/5/2026
Format: TXT
Size: 2.22 MB
Description
The corpus contains 1.029 million tokens from Sindh Line, a Sindhi-language newspaper, covering issues published in 2024-2025. The text consists of the complete newspaper content, including headlines, editorials, finance news, and advertisements. The newspaper is published daily in Karachi, Pakistan.
Specifics
Licensing
Creative Commons Attribution Share Alike 4.0 International (CC-BY-SA-4.0)
https://spdx.org/licenses/CC-BY-SA-4.0.html
Metadata
Overview
Dataset name: Sindh Line Publisher Sindhi Newspaper Corpus (2024–2025)
Language: Sindhi (may include some Urdu/English in finance, names, ads)
Location / publisher: Karachi, Pakistan (daily publication)
Time coverage: 2024–2025
Size: ~1.029 million tokens
Content included: complete newspaper text — headlines, editorials, finance news, advertisements
Language
Sindhi (سِنڌِي, Sindhī, [sɪndʱi]) is an Indo-Aryan language spoken by the Sindhi people in the province of Sindh, Pakistan. It is the official language of the province and the mother tongue of over 34 million people in Pakistan and 1.7 million people in India.
Script
ا ب ٻ ڀ ت ٿ ٽ ٺ ث پ ج ڄ جھ ڃ چ ڇ ح خ د ڌ ڏ ڊ ڍ ذ ر ڙ ز س ش ص ض ط ظ ع غ ف ڦ ق ڪ ک گ ڳ گھ ڱ ل م ن ڻ و ھ ء ي
Sample
هن جڏهن ميثاق معيشت جي ڳالهه ڪئي ته کيس توهين سان رد ڪيو ويو، اڄ به ميثاق معيشت لاءِ تيار آهن، 9 مهينن ۾ اسان وڏين چئلينجن کي منهن ڏنو
روئڻ جو ڪو به فائدو ناهي،پاليسي ريٽ ۾ وڌيڪ گهٽتائي ڪئي وڃي، مان چاهيان ٿو ته ٽيڪسن کي گهٽايو وڃي ته جيئن ٽيڪس چوري نه ٿئي: خطاب
اڏار پاڪستان جو محور برآمداتي ترقي آهي، معاشي استحڪام اچي چڪو آهي، هاڻي اسان کي ترقي ڏانهن وڌڻو آهي، برآمدات وڌائڻ لاءِ ڪاروبار دوست ماحول پيدا ڪرڻو پوندو
Why this dataset
A modern, real-world Sindhi news corpus for Sindhi NLP, linguistic research, and digital preservation, covering multiple registers (formal editorials → mixed-style ads).
Data Composition
What’s included: headlines, editorials, finance/business items, advertisements (complete textual content)
Granularity: one combined corpus file (all issues/content concatenated in a single .txt file)
Processing (recommended)
Single combined TXT file: Keep the original file as Raw (unchanged) and create a second Clean version derived from it.
Raw: the full newspaper text as collected (one combined .txt file)
Clean: preprocessing on the combined file, including:
remove or normalize alphanumeric strings, extra symbols, and non-Sindhi characters (as needed)
Unicode normalization + whitespace/punctuation cleanup
optional removal of repeated boilerplate (if present)
sentence segmentation / parsing to create training-ready units
Optional (recommended): redact or mask PII that may appear in advertisements/classifieds (phone numbers, emails, addresses) before release or model training.
Note: Because the file holds the entire newspaper content as one continuous text, it will likely need the cleaning and sentence segmentation described above before it can be used reliably for training.
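The Unicode normalization, whitespace cleanup, and sentence segmentation steps above could be sketched roughly as follows. This is a minimal illustration, not the steward's actual pipeline: the function names, the choice of NFC, and the set of sentence-ending marks (".", "!", "?", the Arabic question mark U+061F, and the Arabic full stop U+06D4) are all assumptions, and removal of non-Sindhi strings or boilerplate is not covered.

```python
import re
import unicodedata

# Assumed sentence-ending marks: Latin . ! ? plus Arabic question mark (U+061F)
# and Arabic full stop (U+06D4). Adjust after inspecting the actual corpus.
SENTENCE_END = re.compile(r"(?<=[\u06D4\u061F.!?])\s+")


def normalize_text(raw: str) -> str:
    """Apply NFC normalization and collapse redundant whitespace."""
    text = unicodedata.normalize("NFC", raw)
    # Collapse runs of spaces, tabs, and no-break spaces into one space.
    text = re.sub(r"[ \t\u00A0]+", " ", text)
    # Collapse blank lines and trim whitespace around line breaks.
    text = re.sub(r"\s*\n\s*", "\n", text)
    return text.strip()


def split_sentences(text: str) -> list[str]:
    """Split each line into sentences at whitespace that follows an end mark."""
    sentences = []
    for line in text.split("\n"):
        for part in SENTENCE_END.split(line):
            part = part.strip()
            if part:
                sentences.append(part)
    return sentences


def build_clean(raw_path: str, clean_path: str) -> int:
    """Write one sentence per line to clean_path; the Raw file is left unchanged."""
    with open(raw_path, encoding="utf-8") as src:
        sentences = split_sentences(normalize_text(src.read()))
    with open(clean_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(sentences) + "\n")
    return len(sentences)
```

Splitting only where whitespace follows an end mark keeps decimals such as 3.5 intact, but abbreviations that end in a period will still be split, so the output is worth spot-checking. Headlines without terminal punctuation are kept as one unit per line.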
Ethics & privacy
Ads/classifieds may contain personal details. Avoid releasing unredacted PII; don’t enable doxxing/targeting uses.
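As a rough starting point for masking, the sketch below replaces email addresses and phone-like digit runs with placeholder tokens. The patterns, the 9 to 14 digit length window, and the `<EMAIL>` / `<PHONE>` placeholders are illustrative assumptions rather than a vetted PII filter; postal addresses and personal names are not handled and would need manual review or a dedicated tool.

```python
import re

# Simple email pattern (assumption: ASCII-only addresses, as is typical in ads).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Digits may be ASCII, Arabic-Indic (U+0660-U+0669),
# or Extended Arabic-Indic (U+06F0-U+06F9).
DIGIT = "0-9\u0660-\u0669\u06F0-\u06F9"

# Phone-like run: optional "+", then 9 to 14 digits, each optionally preceded
# by a single space or hyphen. Shorter runs such as "2024-2025" are left alone.
PHONE = re.compile(rf"\+?[{DIGIT}](?:[ -]?[{DIGIT}]){{8,13}}")


def mask_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    print(mask_pii("رابطو: 0300-1234567 يا info@example.com"))
```

The length window trades recall for precision: it will also mask long numeric identifiers such as account or registration numbers, which is usually acceptable for a release copy but should be confirmed against a sample of the advertisements.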
Limitations
Written-news register (not speech), Karachi-centric coverage, ads can skew vocabulary, OCR may introduce systematic errors.
