metadata

language:
  - as
license: cc0-1.0
pretty_name: Xahitya Assamese Corpus
size_categories:
  - unknown
task_categories:
  - text-generation
  - fill-mask
  - token-classification

Xahitya Assamese Corpus

A large-scale Assamese literary text corpus scraped from Xahitya.org, containing Assamese prose, essays, stories, poems, and other long-form literary writings.

This dataset is intended for:

Assamese NLP research
Language model pretraining
Tokenizer training
Text generation
Linguistic analysis
Low-resource language AI research

Dataset Structure

The dataset currently contains:

xahitya_dump/
├── articles.jsonl
└── corpus.txt

Files

`articles.jsonl`

A structured JSON Lines file: each line is one JSON object describing a single article.

Each line contains:

{
  "url": "...",
  "title": "...",
  "date": "...",
  "source": "html/wp-rest",
  "text": "..."
}
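A minimal sketch of reading this file with the Python standard library, assuming every record carries the five fields shown above. The function name `iter_articles` and the field check are illustrative and are not part of the dataset tooling.

```python
import json
from typing import Iterator

# Field names taken from the record layout shown above.
EXPECTED_FIELDS = ("url", "title", "date", "source", "text")


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = [name for name in EXPECTED_FIELDS if name not in record]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            yield record
```

Typical use would be `for article in iter_articles("xahitya_dump/articles.jsonl"): ...`, which streams the file one record at a time instead of loading it whole.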

`corpus.txt`

Plain-text corpus version intended for:

tokenizer training
language model pretraining
raw corpus processing pipelines

The file contains cleaned Assamese literary text extracted from the website.
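Before training a tokenizer on `corpus.txt`, it can help to inspect which characters actually occur in it. The sketch below counts characters while streaming the file line by line. The helper names are hypothetical, and the internal layout of `corpus.txt` (for example, one article or one paragraph per line) is not documented here, so the code makes no assumption about it.

```python
from collections import Counter
from typing import Iterable


def char_inventory(lines: Iterable[str]) -> Counter:
    """Count every non-whitespace character across the given lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts


def inventory_from_file(path: str) -> Counter:
    """Build the inventory for a text file without loading it all at once."""
    with open(path, encoding="utf-8") as handle:
        return char_inventory(handle)
```

Calling `inventory_from_file("xahitya_dump/corpus.txt").most_common(50)` would list the most frequent characters, which is a quick way to spot stray Latin text or leftover symbols.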

Source

Primary source:

https://xahitya.org

Xahitya.org is a well-known Assamese e-literature platform containing:

stories
essays
poems
interviews
serialized writings
literary articles

Data Collection

The dataset was collected using a custom Python scraper with:

WordPress API extraction
HTML parsing fallback
text cleaning
duplicate filtering
Unicode normalization
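The scraper itself is not included here, so the following is only a sketch of what the WordPress API step might look like. It relies on the standard WordPress REST route `/wp-json/wp/v2/posts` and its `page` and `per_page` parameters; whether the actual scraper used this exact route, and the choice of `"wp-rest"` as the `source` label, are assumptions. The function names are hypothetical.

```python
from urllib.parse import urlencode


def posts_url(base_url: str, page: int, per_page: int = 100) -> str:
    """Build a paginated URL for the standard WordPress REST posts route."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base_url.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def post_to_record(post: dict) -> dict:
    """Map one WordPress REST post object to the articles.jsonl layout."""
    return {
        "url": post["link"],
        "title": post["title"]["rendered"],
        "date": post["date"],
        "source": "wp-rest",  # assumed label for API-sourced records
        "text": post["content"]["rendered"],  # still HTML; needs cleaning
    }
```

The `content.rendered` field returned by WordPress is HTML, so a record built this way would still need the cleaning steps before it matches the published `text` field.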

The scraping pipeline removes:

HTML markup
navigation text
scripts/styles
duplicated content
unnecessary formatting artifacts
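The removal steps above can be sketched with the standard library alone. This is not the original pipeline: it assumes NFC as the normalization form (the form actually used is not stated), treats exact-match hashing as the duplicate filter, and does not attempt to remove navigation text, which would need site-specific rules.

```python
import hashlib
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip markup, apply NFC normalization, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFC", "".join(parser.parts))
    return " ".join(text.split())


def drop_duplicates(texts):
    """Yield each distinct text once, keeping first-seen order."""
    seen = set()
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text
```

Hashing the cleaned text keeps memory use small on a large corpus, but it only catches exact duplicates; near-duplicates would need a fuzzier method.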

Language

Assamese (as)
Script: Bengali-Assamese script
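As a quick sanity check on the script, one can measure how many letters in a text fall inside the Unicode Bengali block (U+0980 to U+09FF), which is the block used to encode the Bengali-Assamese script. The function below is an illustrative helper, not part of the dataset tooling. Because the block is shared with Bengali, a high ratio confirms the script but not the language.

```python
def bengali_block_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in U+0980..U+09FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return in_block / len(letters)
```

Only characters for which `str.isalpha()` is true are counted, so digits, punctuation, and combining vowel signs do not affect the ratio.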

Intended Use

This dataset is designed for:

pretraining Assamese language models
tokenizer development
linguistic research
educational and research purposes

Possible applications:

LLM pretraining
SLM training
text generation
embeddings
language understanding systems

License

This dataset is released under CC0 1.0 (`cc0-1.0`). It is fully open and freely available for:

commercial use
research use
educational use
modification
redistribution

Use it however you want.

Disclaimer

This dataset was automatically collected from publicly accessible web pages.

All original content rights belong to their respective authors and publishers.

If you are a rights holder and want content removed, please open an issue or contact the repository owner.

Citation

@dataset{xahitya_assamese_corpus,
  title={Xahitya Assamese Corpus},
  author={Ranjit89},
  year={2026},
  publisher={Hugging Face}
}

Acknowledgements

Special thanks to:

the Assamese literary community
contributors of Xahitya.org
open-source NLP ecosystems
Hugging Face

Repository

Hugging Face Dataset Repository:

https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus
