---
language:
- as
license: cc0-1.0
pretty_name: Xahitya Assamese Corpus
size_categories:
- unknown
task_categories:
- text-generation
- fill-mask
- token-classification
---
# Xahitya Assamese Corpus
A large-scale Assamese literary text corpus scraped from [Xahitya.org](https://xahitya.org), containing Assamese prose, essays, stories, poems, and other long-form literary writings.
This dataset is intended for:
- Assamese NLP research
- Language model pretraining
- Tokenizer training
- Text generation
- Linguistic analysis
- Low-resource language AI research
---
# Dataset Structure
The dataset currently contains:
```text
xahitya_dump/
├── articles.jsonl
└── corpus.txt
```
## Files
### `articles.jsonl`
Structured JSON Lines dataset.
Each line contains:
```json
{
  "url": "...",
  "title": "...",
  "date": "...",
  "source": "html/wp-rest",
  "text": "..."
}
```
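
Records in this shape can be read one line at a time. The following is a minimal sketch, assuming the file is UTF-8 encoded and that every non-empty line is a complete JSON object; the helper names `iter_articles` and `total_characters` are illustrative and not part of the dataset.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            yield json.loads(line)


def total_characters(path: str) -> int:
    """Sum the length of the `text` field across all records."""
    # .get() is used because a missing `text` field is treated as empty here.
    return sum(len(article.get("text", "")) for article in iter_articles(path))
```

Streaming with a generator keeps memory use flat regardless of file size, which matters when the corpus does not fit comfortably in RAM.
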
---
### `corpus.txt`
Plain-text corpus version intended for:
- tokenizer training
- language model pretraining
- raw corpus processing pipelines
The file contains cleaned Assamese literary text extracted from the website.
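
As a first step before tokenizer training, it can help to inspect the character inventory of the plain-text file. The sketch below assumes `corpus.txt` is UTF-8 encoded; the function name `char_frequencies` and the chunk size are illustrative choices, not something the dataset prescribes.

```python
from collections import Counter


def char_frequencies(path: str, chunk_size: int = 1 << 20) -> Counter:
    """Count non-whitespace characters in a UTF-8 text file, chunk by chunk."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as handle:
        while True:
            # In text mode, read(n) returns up to n decoded characters,
            # so multi-byte Assamese characters are never split.
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            counts.update(ch for ch in chunk if not ch.isspace())
    return counts
```
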
---
# Source
Primary source:
- https://xahitya.org
Xahitya.org is a well-known Assamese e-literature platform containing:
- stories
- essays
- poems
- interviews
- serialized writings
- literary articles
---
# Data Collection
The dataset was collected using a custom Python scraper with:
- WordPress API extraction
- HTML parsing fallback
- text cleaning
- duplicate filtering
- Unicode normalization
The scraping pipeline removes:
- HTML markup
- navigation text
- scripts/styles
- duplicated content
- unnecessary formatting artifacts
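
The cleaning steps listed above can be sketched with the Python standard library alone. This is not the scraper used to build the dataset; it is a minimal illustration that assumes NFC as the Unicode normalization form and exact-match hashing for duplicate filtering, neither of which is stated in the original description. The names `clean_html` and `deduplicate` are hypothetical.

```python
import hashlib
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Strip markup, normalize to NFC (an assumption), collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; the trade-off is that
    # an inline tag placed mid-word would introduce a spurious space.
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return " ".join(text.split())


def deduplicate(texts: list[str]) -> list[str]:
    """Drop exact duplicates, keeping the first occurrence in order."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Exact-match hashing only catches byte-identical texts after cleaning; near-duplicates such as reposted articles with small edits would need a fuzzier method.
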
---
# Language
- Assamese (`as`)
- Script: Bengali-Assamese script
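
Assamese text is encoded in the Unicode Bengali block (U+0980 to U+09FF), which includes Assamese-specific letters such as ৰ (U+09F0) and ৱ (U+09F1). A quick sanity check on any record is to measure what share of its letters fall in that block. The sketch below is illustrative; `bengali_block_ratio` is a hypothetical helper and any filtering threshold is left to the user.

```python
BENGALI_BLOCK = range(0x0980, 0x0A00)  # Unicode "Bengali" block, U+0980..U+09FF


def bengali_block_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters inside the Bengali block."""
    # str.isalpha() is True for letters only, so combining vowel signs,
    # digits, and punctuation are excluded from both counts.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if ord(ch) in BENGALI_BLOCK)
    return in_block / len(letters)
```
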
---
# Intended Use
This dataset is designed for:
- pretraining Assamese language models
- tokenizer development
- linguistic research
- educational and research purposes
Possible applications:
- LLM pretraining
- small language model (SLM) training
- text generation
- embeddings
- language understanding systems
---
# License
This dataset is released under CC0 1.0 (`cc0-1.0`). It is fully open and freely available for:
- commercial use
- research use
- educational use
- modification
- redistribution

Use it however you want.
---
# Disclaimer
This dataset was automatically collected from publicly accessible web pages.
All original content rights belong to their respective authors and publishers.
If you are a rights holder and want content removed, please open an issue or contact the repository owner.
---
# Citation
```bibtex
@dataset{xahitya_assamese_corpus,
  title={Xahitya Assamese Corpus},
  author={Ranjit89},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus}
}
```
---
# Acknowledgements
Special thanks to:
- the Assamese literary community
- contributors of Xahitya.org
- open-source NLP ecosystems
- Hugging Face
---
# Repository
Hugging Face Dataset Repository:
https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus
