language:
- as
license: cc0-1.0
pretty_name: Xahitya Assamese Corpus
size_categories:
- unknown
task_categories:
- text-generation
- fill-mask
- token-classification
Xahitya Assamese Corpus
A large-scale Assamese literary text corpus scraped from Xahitya.org, containing Assamese prose, essays, stories, poems, and other long-form literary writings.
This dataset is intended for:
- Assamese NLP research
- Language model pretraining
- Tokenizer training
- Text generation
- Linguistic analysis
- Low-resource language AI research
Dataset Structure
The dataset currently contains:
xahitya_dump/
├── articles.jsonl
└── corpus.txt
Files
articles.jsonl
Structured JSON Lines dataset.
Each line contains:
{
"url": "...",
"title": "...",
"date": "...",
"source": "html/wp-rest",
"text": "..."
}
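As a minimal sketch of how records in this layout could be read, the snippet below parses JSON Lines one row at a time and keeps only rows that carry all five fields listed above. The helper name `iter_articles`, the choice to skip malformed rows silently, and the sample rows are assumptions made for illustration; they are not part of the dataset or its tooling.

```python
import io
import json
from typing import Iterator, TextIO

# Field names taken from the record layout shown above.
EXPECTED_FIELDS = ("url", "title", "date", "source", "text")


def iter_articles(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per valid JSON Lines row (hypothetical helper)."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: malformed rows are skipped, not raised
        if isinstance(record, dict) and all(k in record for k in EXPECTED_FIELDS):
            yield record


# Hypothetical sample rows standing in for articles.jsonl.
sample = io.StringIO(
    '{"url": "https://example.org/a", "title": "T1", '
    '"date": "2020-01-01", "source": "html", "text": "..."}\n'
    "\n"
    "not json\n"
    '{"url": "https://example.org/b", "title": "T2"}\n'
)
articles = list(iter_articles(sample))
print(len(articles), articles[0]["title"])
```

With the real file, the same function should work on a handle such as `open("xahitya_dump/articles.jsonl", encoding="utf-8")`, assuming the file is UTF-8 encoded and stored at the path shown in the directory listing.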
corpus.txt
Plain-text corpus version intended for:
- tokenizer training
- language model pretraining
- raw corpus processing pipelines
The file contains cleaned Assamese literary text extracted from the website.
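One common first step before tokenizer training is to inspect the character inventory of the raw text. The sketch below streams lines and counts non-whitespace characters. The helper name `char_frequencies` and the two-line sample are hypothetical; nothing here is taken from the author's pipeline.

```python
import io
from collections import Counter
from typing import Iterable


def char_frequencies(lines: Iterable[str]) -> Counter:
    """Count non-whitespace characters across an iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts


# Hypothetical two-line sample standing in for corpus.txt.
sample = io.StringIO("অসম\nঅসমীয়া\n")
freqs = char_frequencies(sample)
print(freqs.most_common(3))
```

Because the function only needs an iterable of lines, it could be pointed at an open handle on `corpus.txt` (assumed here to be UTF-8) without loading the whole file into memory.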
Source
Primary source: Xahitya.org, a well-known Assamese e-literature platform containing:
- stories
- essays
- poems
- interviews
- serialized writings
- literary articles
Data Collection
The dataset was collected using a custom Python scraper with:
- WordPress API extraction
- HTML parsing fallback
- text cleaning
- duplicate filtering
- Unicode normalization
The scraping pipeline removes:
- HTML markup
- navigation text
- scripts/styles
- duplicated content
- unnecessary formatting artifacts
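The steps above can be approximated with the Python standard library alone. The following is a sketch under stated assumptions, not the scraper that produced this dataset: it strips markup and script/style bodies with `html.parser`, applies NFC normalization (the normalization form actually used is not stated in this card), collapses whitespace, and drops exact duplicates by SHA-256 hash. Navigation-text removal is site-specific and is not covered. All function and class names are hypothetical.

```python
import hashlib
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Strip markup, normalize to NFC (assumed form), collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return " ".join(text.split())


def dedupe(texts):
    """Keep the first occurrence of each exact-duplicate text."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


# Hypothetical pages; the first two reduce to the same text.
pages = [
    "<p>অসম</p><script>var x = 1;</script>",
    "<div>অসম</div>",
    "<p>নতুন  লেখা</p>",
]
cleaned = dedupe(clean_html(page) for page in pages)
print(cleaned)
```

Hashing exact text only catches verbatim duplicates. Near-duplicate detection (for example, the same article with a different footer) would need a fuzzier method, which this sketch does not attempt.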
Language
- Assamese (as)
- Script: Bengali-Assamese script
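Since the text is written in the Bengali-Assamese script, a quick sanity check on any sample is the share of characters that fall in the Unicode Bengali block (U+0980 to U+09FF), which also holds the Assamese-specific letters ৰ (U+09F0) and ৱ (U+09F1). The helper below is a hypothetical illustration; what ratio counts as acceptable is left to the reader.

```python
def bengali_block_ratio(text: str) -> float:
    """Share of non-whitespace characters inside U+0980..U+09FF."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    inside = sum(1 for ch in chars if "\u0980" <= ch <= "\u09ff")
    return inside / len(chars)


# Hypothetical samples: pure Assamese text and a mixed-script string.
print(bengali_block_ratio("অসম"))
print(bengali_block_ratio("অ a"))
```

Punctuation and digits outside the block lower the ratio, so real literary text will typically score below 1.0 even when it is entirely Assamese.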
Intended Use
This dataset is designed for:
- pretraining Assamese language models
- tokenizer development
- linguistic research
- educational and research purposes
Possible applications:
- LLM pretraining
- SLM training
- text generation
- embeddings
- language understanding systems
License
The dataset is released under CC0 1.0. It is fully open and freely available for:
- commercial use
- research use
- educational use
- modification
- redistribution
Use it as you want.
Disclaimer
This dataset was automatically collected from publicly accessible web pages.
All original content rights belong to their respective authors and publishers.
If you are a rights holder and want content removed, please open an issue or contact the repository owner.
Citation
@dataset{xahitya_assamese_corpus,
title={Xahitya Assamese Corpus},
author={Ranjit89},
year={2026},
publisher={Hugging Face}
}
Acknowledgements
Special thanks to:
- the Assamese literary community
- contributors of Xahitya.org
- open-source NLP ecosystems
- Hugging Face
Repository
Hugging Face Dataset Repository:
https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus