Ranjit89/Assamese-Text-Dataset-bucket / README.md

	---
	language:
	- as
	license: cc0-1.0
	pretty_name: Xahitya Assamese Corpus
	size_categories:
	- unknown
	task_categories:
	- text-generation
	- fill-mask
	- token-classification
	---

	# Xahitya Assamese Corpus

	A large-scale Assamese literary text corpus scraped from [Xahitya.org](https://xahitya.org), containing Assamese prose, essays, stories, poems, and other long-form literary writings.

	This dataset is intended for:
	- Assamese NLP research
	- Language model pretraining
	- Tokenizer training
	- Text generation
	- Linguistic analysis
	- Low-resource language AI research

	---

	# Dataset Structure

	The dataset currently contains:

	```text
	xahitya_dump/
	├── articles.jsonl
	└── corpus.txt
	```

	## Files

	### `articles.jsonl`

	A structured JSON Lines file.

	Each line is one JSON object with the following fields:

	```json
	{
	"url": "...",
	"title": "...",
	"date": "...",
	"source": "html/wp-rest",
	"text": "..."
	}
	```
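As a minimal sketch of how records with this schema might be read, the snippet below parses a JSON Lines stream one object at a time. The helper name `iter_articles` and the sample values are hypothetical, and the sample treats `html` and `wp-rest` as two separate `source` values, which is only one reading of the `html/wp-rest` placeholder above.

```python
import io
import json
from typing import Iterator, TextIO


def iter_articles(handle: TextIO) -> Iterator[dict]:
    """Yield one article dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Tiny in-memory stand-in for articles.jsonl (all values are made up).
sample = io.StringIO(
    '{"url": "https://example.org/a", "title": "A", "date": "", '
    '"source": "html", "text": "x"}\n'
    "\n"
    '{"url": "https://example.org/b", "title": "B", "date": "", '
    '"source": "wp-rest", "text": "y"}\n'
)
articles = list(iter_articles(sample))
```

With the real file, the same function could be fed `open("xahitya_dump/articles.jsonl", encoding="utf-8")`, assuming the file is UTF-8 encoded.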

	---

	### `corpus.txt`

	Plain-text corpus version intended for:
	- tokenizer training
	- language model pretraining
	- raw corpus processing pipelines

	The file contains cleaned Assamese literary text extracted from the website.
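For tokenizer training or other streaming pipelines, a plain-text corpus is often consumed in batches rather than loaded whole. The sketch below assumes `corpus.txt` is UTF-8 and line-oriented, which this README does not state; `batched_lines` is a hypothetical helper, not part of the dataset.

```python
import io
from typing import Iterable, Iterator, List


def batched_lines(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group non-empty, stripped lines into lists of at most batch_size items."""
    batch: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Stand-in for corpus.txt; the real file would be opened with encoding="utf-8".
fake_corpus = io.StringIO("অসম\n\nভাষা\nসাহিত্য\n")
batches = list(batched_lines(fake_corpus, batch_size=2))
```

Each yielded batch can then be handed to whichever tokenizer trainer accepts an iterator of strings.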

	---

	# Source

	Primary source:

	- https://xahitya.org

	Xahitya.org is a well-known Assamese e-literature platform containing:
	- stories
	- essays
	- poems
	- interviews
	- serialized writings
	- literary articles

	---

	# Data Collection

	The dataset was collected using a custom Python scraper with:
	- WordPress API extraction
	- HTML parsing fallback
	- text cleaning
	- duplicate filtering
	- Unicode normalization
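The scraper itself is not included here, so the following is only a sketch of what the WordPress API step could look like. It relies on the standard WordPress REST posts endpoint (`/wp-json/wp/v2/posts`) and its `link`, `date`, `title.rendered`, and `content.rendered` fields. Whether Xahitya.org exposes that endpoint, the helper names, and the `"wp-rest"` label are all assumptions.

```python
import html
from urllib.parse import urlencode


def posts_url(base: str, page: int, per_page: int = 100) -> str:
    """Build a paginated URL for the standard WordPress posts endpoint."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def post_to_record(post: dict) -> dict:
    """Map one WordPress REST post object onto the articles.jsonl fields."""
    return {
        "url": post["link"],
        "title": html.unescape(post["title"]["rendered"]),
        "date": post["date"],
        "source": "wp-rest",  # assumed label for API-sourced records
        "text": post["content"]["rendered"],  # still HTML; clean separately
    }


# Hand-written example shaped like a WordPress REST post (not real data).
example_post = {
    "link": "https://example.org/post-1",
    "date": "2020-01-01T00:00:00",
    "title": {"rendered": "Tom &amp; Jerry"},
    "content": {"rendered": "<p>Hello</p>"},
}
record = post_to_record(example_post)
url = posts_url("https://example.org/", page=2)
```

Fetching the URL and falling back to HTML parsing when the API is unavailable are left out, since they depend on the site and on the HTTP client used.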

	The scraping pipeline removes:
	- HTML markup
	- navigation text
	- scripts/styles
	- duplicated content
	- unnecessary formatting artifacts
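The cleaning steps listed above can be approximated with the Python standard library. This is a rough sketch, not the author's pipeline: it strips markup and script/style content, applies NFC normalization (the README does not say which normalization form was used), and drops exact duplicates by hash. Navigation-text removal is site-specific and is not covered.

```python
import hashlib
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(raw: str) -> str:
    """Strip markup, normalize to NFC, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps block elements apart, but it can also split
    # a word that spans inline tags; a real pipeline would handle that case.
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return re.sub(r"\s+", " ", text).strip()


def drop_duplicates(texts):
    """Keep the first occurrence of each text, comparing by SHA-256 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


pages = [
    "<p>অসম <b>ভাষা</b></p><script>var x = 1;</script>",
    "<div>অসম ভাষা</div><style>p {color: red}</style>",
]
cleaned = drop_duplicates(clean_html(p) for p in pages)
```

Hash-based filtering only catches exact duplicates after cleaning; near-duplicate detection would need a different technique.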

	---

	# Language

	- Assamese (`as`)
	- Script: Bengali-Assamese
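A quick sanity check on the script can be done with code-point ranges. The Bengali-Assamese script lives in the Unicode Bengali block (U+0980 to U+09FF), and the letters ৰ (U+09F0) and ৱ (U+09F1) are used in Assamese but not in standard Bengali. The helpers below are a heuristic sketch with hypothetical names, not a language identifier.

```python
def bengali_block_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Unicode Bengali block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Note: the danda (U+0964) sits in the Devanagari block, so real
    # Assamese text will usually score slightly below 1.0.
    in_block = sum(1 for c in chars if "\u0980" <= c <= "\u09ff")
    return in_block / len(chars)


def has_assamese_letters(text: str) -> bool:
    """True if text contains the Assamese letters U+09F0 or U+09F1."""
    return any(c in "\u09f0\u09f1" for c in text)


ratio_mixed = bengali_block_ratio("অসম abc")  # three of six characters match
assamese_ra = has_assamese_letters("\u09f0\u09be\u09ae")  # starts with U+09F0
bengali_ra = has_assamese_letters("\u09b0\u09be\u09ae")  # starts with U+09B0
```

A document with a low ratio, or with none of the Assamese-specific letters across a long text, may be worth inspecting by hand.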

	---

	# Intended Use

	This dataset is designed for:
	- pretraining Assamese language models
	- tokenizer development
	- linguistic research
	- educational and research purposes

	Possible applications:
	- large language model (LLM) pretraining
	- small language model (SLM) training
	- text generation
	- embeddings
	- language understanding systems

	---

	# License


	This dataset is released under the CC0 1.0 license (`cc0-1.0`). It is fully open and freely available for:
	- commercial use
	- research use
	- educational use
	- modification
	- redistribution

	Use it however you want.

	---

	# Disclaimer

	This dataset was automatically collected from publicly accessible web pages.

	All original content rights belong to their respective authors and publishers.

	If you are a rights holder and want content removed, please open an issue or contact the repository owner.

	---

	# Citation

	```bibtex
	@dataset{xahitya_assamese_corpus,
	title={Xahitya Assamese Corpus},
	author={Ranjit89},
	year={2026},
	publisher={Hugging Face}
	}
	```

	---

	# Acknowledgements

	Special thanks to:
	- the Assamese literary community
	- contributors of Xahitya.org
	- open-source NLP ecosystems
	- Hugging Face

	---

	# Repository

	Hugging Face Dataset Repository:

	https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus
