Buckets:
| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| xahitya_dump | 2 items | | |
| .gitattributes | 2.59 kB (xet) | | 76381b18 |
| README.md | 2.91 kB (xet) | | 4fe1d2cc |
| backup_data.tar.gz | 2.1 GB (xet) | | 6d308847 |
| scrapping_Xahitya_org.py | 17.9 kB (xet) | | 208efb23 |
| web_scraping.py | 7.89 kB (xet) | | ff32075a |
Xahitya Assamese Corpus
A large-scale Assamese literary text corpus scraped from Xahitya.org, containing Assamese prose, essays, stories, poems, and other long-form literary writings.
This dataset is intended for:
- Assamese NLP research
- Language model pretraining
- Tokenizer training
- Text generation
- Linguistic analysis
- Low-resource language AI research
Dataset Structure
The dataset currently contains:
xahitya_dump/
├── articles.jsonl
└── corpus.txt
Files
articles.jsonl
Structured JSON Lines dataset.
Each line contains:
{
"url": "...",
"title": "...",
"date": "...",
"source": "html/wp-rest",
"text": "..."
}
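A minimal sketch of reading this file, assuming it is UTF-8 encoded with one JSON object per line as described above. The helper names `iter_articles` and `count_by_source` are illustrative and not part of the dataset, and the exact values stored in the `source` field are not documented here.

```python
import json
from collections import Counter
from typing import Dict, Iterator


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_by_source(path: str) -> Dict[str, int]:
    """Count records per value of the "source" field (missing values count as "unknown")."""
    return dict(Counter(rec.get("source", "unknown") for rec in iter_articles(path)))
```

Because `iter_articles` is a generator, records are parsed one at a time, so the file never has to fit in memory at once.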
corpus.txt
Plain-text corpus version intended for:
- tokenizer training
- language model pretraining
- raw corpus processing pipelines
The file contains cleaned Assamese literary text extracted from the website.
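For tokenizer training or other streaming pipelines, the file can be consumed in batches rather than loaded whole. The sketch below assumes `corpus.txt` is UTF-8 and line-oriented; the internal layout of the file (for example, one paragraph per line) is not documented here, and `batched_lines` is a hypothetical helper name.

```python
from typing import Iterator, List


def batched_lines(path: str, batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to batch_size non-empty lines without loading the whole file."""
    batch: List[str] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank separator lines
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch
```

An iterator of string batches like this is the shape that iterator-based tokenizer trainers generally accept, so it can be passed along without an intermediate copy of the corpus.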
Source
Primary source:
Xahitya.org is a well-known Assamese e-literature platform containing:
- stories
- essays
- poems
- interviews
- serialized writings
- literary articles
Data Collection
The dataset was collected using a custom Python scraper with:
- WordPress API extraction
- HTML parsing fallback
- text cleaning
- duplicate filtering
- Unicode normalization
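As an illustration of the WordPress API step, the sketch below builds a paged request URL for the standard WordPress REST posts endpoint and maps one returned post object onto the record schema used in articles.jsonl. It is not the repository's scraper: that the site exposes `/wp-json/wp/v2/posts` is an assumption, and `posts_url` and `post_to_record` are hypothetical names.

```python
import html
from urllib.parse import urlencode


def posts_url(base: str, page: int, per_page: int = 100) -> str:
    """Build a paged URL for the standard WordPress REST posts endpoint."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def post_to_record(post: dict) -> dict:
    """Map one WordPress REST post object onto the articles.jsonl record shape."""
    return {
        "url": post["link"],
        # Titles arrive HTML-escaped (e.g. "&amp;"), so unescape entities.
        "title": html.unescape(post["title"]["rendered"]),
        "date": post["date"],
        "source": "wp-rest",  # assumed label for records fetched through the API
        # Still rendered HTML at this point; markup removal is a separate step.
        "text": post["content"]["rendered"],
    }
```

Keeping the mapping as a pure function of the response object makes it easy to test without network access, and an HTML-parsing fallback can produce the same record shape with a different `source` value.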
The scraping pipeline removes:
- HTML markup
- navigation text
- scripts/styles
- duplicated content
- unnecessary formatting artifacts
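The cleaning steps listed above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the repository's actual code: the set of skipped tags, the choice of NFC as the normalization form, and exact-match deduplication by SHA-256 are all assumptions, and `clean_html` and `deduplicate` are hypothetical names.

```python
import hashlib
import unicodedata
from html.parser import HTMLParser
from typing import Iterable, List


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside script, style, or nav elements."""

    SKIPPED = {"script", "style", "nav"}  # assumed set of non-content tags

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(markup: str) -> str:
    """Strip markup, normalize to NFC (assumed form), and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; it can also split a
    # word that is interrupted by an inline tag, which this sketch accepts.
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return " ".join(text.split())


def deduplicate(texts: Iterable[str]) -> List[str]:
    """Drop exact duplicates, keeping the first occurrence of each text."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hash-based filtering only catches byte-identical texts after cleaning; near-duplicates, such as the same piece republished with a different footer, would need a fuzzier method.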
Language
- Assamese (as)
- Script: Bengali-Assamese script
Intended Use
This dataset is designed for:
- pretraining Assamese language models
- tokenizer development
- linguistic research
- educational and research purposes
Possible applications:
- LLM pretraining
- SLM training
- text generation
- embeddings
- language understanding systems
License
This dataset is fully open and freely available for:
- commercial use
- research use
- educational use
- modification
- redistribution
Use it as you want.
Disclaimer
This dataset was automatically collected from publicly accessible web pages.
All original content rights belong to their respective authors and publishers.
If you are a rights holder and want content removed, please open an issue or contact the repository owner.
Citation
@dataset{xahitya_assamese_corpus,
title={Xahitya Assamese Corpus},
author={Ranjit89},
year={2026},
publisher={Hugging Face}
}
Acknowledgements
Special thanks to:
- the Assamese literary community
- contributors of Xahitya.org
- open-source NLP ecosystems
- Hugging Face
Repository
Hugging Face Dataset Repository:
https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus
- Total size: 2.39 GB
- Files: 7
- Last updated: May 24