2.39 GB
7 files
Updated 3 days ago
Name                        Size
xahitya_dump/
.gitattributes              2.59 kB
README.md                   2.91 kB
backup_data.tar.gz          2.1 GB
scrapping_Xahitya_org.py    17.9 kB
web_scraping.py             7.89 kB
README.md

Xahitya Assamese Corpus

A large-scale Assamese literary text corpus scraped from Xahitya.org, containing Assamese prose, essays, stories, poems, and other long-form literary writings.

This dataset is intended for:

  • Assamese NLP research
  • Language model pretraining
  • Tokenizer training
  • Text generation
  • Linguistic analysis
  • Low-resource language AI research

Dataset Structure

The dataset currently contains:

xahitya_dump/
├── articles.jsonl
└── corpus.txt

Files

articles.jsonl

Structured JSON Lines dataset.

Each line is a single JSON object with the following fields:

{
  "url": "...",
  "title": "...",
  "date": "...",
  "source": "html/wp-rest",
  "text": "..."
}
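A minimal sketch of reading this file, assuming the dump has been downloaded locally and that every non-empty line is one JSON object with the fields shown above. The function name `iter_articles` is illustrative and is not part of the repository's scripts.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article record (a dict) per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

For example, `next(iter_articles("xahitya_dump/articles.jsonl"))["title"]` would return the title of the first record, with the path adjusted to wherever the file was saved. Reading lazily keeps memory use flat, which matters for a multi-gigabyte dump.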

corpus.txt

Plain-text corpus version intended for:

  • tokenizer training
  • language model pretraining
  • raw corpus processing pipelines

The file contains cleaned Assamese literary text extracted from the website.
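Before training a tokenizer on the plain-text file, it can help to inspect which characters actually occur in it. The sketch below counts non-whitespace characters over any iterable of lines; `char_frequencies` is a hypothetical helper, and the layout of `corpus.txt` (one passage per line or otherwise) is not assumed.

```python
from collections import Counter
from typing import Iterable


def char_frequencies(lines: Iterable[str]) -> Counter:
    """Count non-whitespace characters across an iterable of text lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts
```

Because an open file is itself an iterable of lines, the corpus can be streamed without loading it whole: open `xahitya_dump/corpus.txt` with `encoding="utf-8"`, pass the handle to `char_frequencies`, and call `.most_common(20)` on the result.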


Source

Primary source: Xahitya.org

Xahitya.org is a well-known Assamese e-literature platform that publishes:

  • stories
  • essays
  • poems
  • interviews
  • serialized writings
  • literary articles

Data Collection

The dataset was collected using a custom Python scraper with:

  • WordPress API extraction
  • HTML parsing fallback
  • text cleaning
  • duplicate filtering
  • Unicode normalization
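The normalization and duplicate-filtering steps can be illustrated roughly as follows. This is a sketch under assumptions, not the repository's scraper: the choice of NFC as the normalization form, SHA-256 as the fingerprint, and the helper names `dedup_key` and `drop_duplicates` are all hypothetical.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def dedup_key(text: str) -> str:
    """Fingerprint a text after NFC normalization and whitespace collapsing."""
    canonical = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(records: Iterable[dict]) -> Iterator[dict]:
    """Yield records whose fingerprint is new, with their text NFC-normalized."""
    seen = set()
    for record in records:
        key = dedup_key(record["text"])
        if key in seen:
            continue  # same content already emitted
        seen.add(key)
        yield {**record, "text": unicodedata.normalize("NFC", record["text"])}
```

Whitespace is collapsed only when computing the fingerprint, so line breaks in poems survive in the emitted text. Normalization matters for this script because several Bengali-Assamese characters have more than one canonically equivalent code point sequence, and without it identical passages could hash differently.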

The scraping pipeline removes:

  • HTML markup
  • navigation text
  • scripts/styles
  • duplicated content
  • unnecessary formatting artifacts
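One way to strip markup, navigation text, and scripts/styles with only the standard library is shown below. It is a simplified illustration rather than the scraper's actual logic; the set of skipped tags and the names `TextExtractor` and `html_to_text` are assumptions.

```python
from html.parser import HTMLParser

# Assumed set of containers to ignore; a real scraper may skip more.
SKIPPED_TAGS = {"script", "style", "nav"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped container.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(markup: str) -> str:
    """Return the visible text of an HTML fragment, one text run per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Each text run between tags becomes its own line, so inline elements such as emphasis would split a sentence across lines; a production pipeline would likely join runs per block element instead.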

Language

  • Assamese (as)
  • Script: Bengali-Assamese script

Intended Use

This dataset is designed for:

  • pretraining Assamese language models
  • tokenizer development
  • linguistic research
  • educational and research purposes

Possible applications:

  • LLM pretraining
  • SLM training
  • text generation
  • embeddings
  • language understanding systems

License

This dataset is fully open and freely available for:

  • commercial use
  • research use
  • educational use
  • modification
  • redistribution

Use it as you want.

Disclaimer

This dataset was automatically collected from publicly accessible web pages.

All original content rights belong to their respective authors and publishers.

If you are a rights holder and want content removed, please open an issue or contact the repository owner.


Citation

@dataset{xahitya_assamese_corpus,
  title={Xahitya Assamese Corpus},
  author={Ranjit89},
  year={2026},
  publisher={Hugging Face}
}

Acknowledgements

Special thanks to:

  • the Assamese literary community
  • contributors of Xahitya.org
  • open-source NLP ecosystems
  • Hugging Face

Repository

Hugging Face Dataset Repository:

https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus

Last updated: May 24