| language: | |
| - as | |
| license: cc0-1.0 | |
| pretty_name: Xahitya Assamese Corpus | |
| size_categories: | |
| - unknown | |
| task_categories: | |
| - text-generation | |
| - fill-mask | |
| - token-classification | |
| # Xahitya Assamese Corpus | |
| A large-scale Assamese literary text corpus scraped from [Xahitya.org](https://xahitya.org), containing Assamese prose, essays, stories, poems, and other long-form literary writings. | |
| This dataset is intended for: | |
| - Assamese NLP research | |
| - Language model pretraining | |
| - Tokenizer training | |
| - Text generation | |
| - Linguistic analysis | |
| - Low-resource language AI research | |
| --- | |
| # Dataset Structure | |
| The dataset currently contains: | |
| ```text | |
| xahitya_dump/ | |
| ├── articles.jsonl | |
| └── corpus.txt | |
| ``` | |
| ## Files | |
| ### `articles.jsonl` | |
| Structured JSON Lines dataset. | |
| Each line is a single JSON object with the following fields: | |
| ```json | |
| { | |
| "url": "...", | |
| "title": "...", | |
| "date": "...", | |
| "source": "html/wp-rest", | |
| "text": "..." | |
| } | |
| ``` | |
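A minimal sketch of reading this file, assuming the field names shown above and a local copy at a path of your choosing. The helper names `iter_articles` and `summarize` are illustrative and not part of the dataset.

```python
import json
from typing import Iterator


def iter_articles(path: str) -> Iterator[dict]:
    """Yield one article record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line instead of failing silently.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def summarize(path: str) -> dict:
    """Count records and the total number of characters in their `text` fields."""
    articles = 0
    text_chars = 0
    for record in iter_articles(path):
        articles += 1
        text_chars += len(record.get("text", ""))
    return {"articles": articles, "text_chars": text_chars}
```

Streaming line by line keeps memory use flat, which matters if the file is large; the card does not state its size.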
| --- | |
| ### `corpus.txt` | |
| Plain-text corpus version intended for: | |
| - tokenizer training | |
| - language model pretraining | |
| - raw corpus processing pipelines | |
| The file contains cleaned Assamese literary text extracted from the website. | |
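For tokenizer training or other streaming pipelines, the file can be consumed in batches rather than loaded whole. The sketch below assumes only that `corpus.txt` is UTF-8 text; how documents map to lines is not specified in this card, so each non-empty line is treated as one unit. The function name `iter_line_batches` and the default batch size are illustrative.

```python
from typing import Iterator, List


def iter_line_batches(path: str, batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to `batch_size` non-empty lines from a UTF-8 text file."""
    batch: List[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank separator lines
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

An iterator of string batches like this is the kind of input that trainers such as `Tokenizer.train_from_iterator` in the `tokenizers` library accept.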
| --- | |
| # Source | |
| Primary source: | |
| - https://xahitya.org | |
| Xahitya.org is a well-known Assamese e-literature platform containing: | |
| - stories | |
| - essays | |
| - poems | |
| - interviews | |
| - serialized writings | |
| - literary articles | |
| --- | |
| # Data Collection | |
| The dataset was collected using a custom Python scraper with: | |
| - WordPress API extraction | |
| - HTML parsing fallback | |
| - text cleaning | |
| - duplicate filtering | |
| - Unicode normalization | |
| The scraping pipeline removes: | |
| - HTML markup | |
| - navigation text | |
| - scripts/styles | |
| - duplicated content | |
| - unnecessary formatting artifacts | |
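The cleaning steps listed above can be illustrated with a standard-library sketch. This is not the scraper used to build the dataset: the NFC normalization form, the SHA-256 exact-match deduplication, and all names (`clean_html`, `dedupe`, `_TextExtractor`) are assumptions chosen for illustration, and navigation-text removal is not covered.

```python
import hashlib
import re
import unicodedata
from html.parser import HTMLParser
from typing import Iterable, Iterator, List, Set


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(html: str) -> str:
    """Strip markup, normalize to NFC (an assumed form), and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so adjacent blocks do not glue words together;
    # the whitespace collapse below removes any doubled spaces.
    text = unicodedata.normalize("NFC", " ".join(parser.parts))
    return re.sub(r"\s+", " ", text).strip()


def dedupe(texts: Iterable[str]) -> Iterator[str]:
    """Yield each distinct text once, keeping first-seen order (exact matches only)."""
    seen: Set[str] = set()
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield text
```

Hashing keeps the set of seen items small even for long articles. It only catches byte-identical duplicates after cleaning, so near-duplicates would need a different technique.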
| --- | |
| # Language | |
| - Assamese (`as`) | |
| - Script: Bengali-Assamese | |
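When filtering or sanity-checking text, a rough script check can help. The sketch below is a heuristic and not part of the dataset's pipeline. It relies on two Unicode facts: the Bengali block spans U+0980 to U+09FF, and the letters ৰ (U+09F0) and ৱ (U+09F1) are used in Assamese but not in standard Bengali. The function names are illustrative.

```python
def bengali_block_ratio(text: str) -> float:
    """Fraction of letters (Unicode category L*) that fall in U+0980-U+09FF."""
    # str.isalpha() is False for combining vowel signs, so only base letters count.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return in_block / len(letters)


def has_assamese_specific_letters(text: str) -> bool:
    """True if the text contains U+09F0 or U+09F1, letters specific to Assamese."""
    return any(ch in "\u09f0\u09f1" for ch in text)
```

A high ratio alone does not distinguish Assamese from Bengali; the second check is a weak extra signal, since short Assamese passages may contain neither letter.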
| --- | |
| # Intended Use | |
| This dataset is designed for: | |
| - pretraining Assamese language models | |
| - tokenizer development | |
| - linguistic research | |
| - educational and research purposes | |
| Possible applications: | |
| - LLM pretraining | |
| - SLM training | |
| - text generation | |
| - embeddings | |
| - language understanding systems | |
| --- | |
| # License | |
| This dataset is released under CC0 1.0 (`license: cc0-1.0` in the card metadata). It is fully open and freely available for: | |
| - commercial use | |
| - research use | |
| - educational use | |
| - modification | |
| - redistribution | |
| Use it however you want. | |
| --- | |
| # Disclaimer | |
| This dataset was automatically collected from publicly accessible web pages. | |
| All original content rights belong to their respective authors and publishers. | |
| If you are a rights holder and want content removed, please open an issue or contact the repository owner. | |
| --- | |
| # Citation | |
| ```bibtex | |
| @dataset{xahitya_assamese_corpus, | |
| title={Xahitya Assamese Corpus}, | |
| author={Ranjit89}, | |
| year={2026}, | |
| publisher={Hugging Face} | |
| } | |
| ``` | |
| --- | |
| # Acknowledgements | |
| Special thanks to: | |
| - the Assamese literary community | |
| - contributors of Xahitya.org | |
| - open-source NLP ecosystems | |
| - Hugging Face | |
| --- | |
| # Repository | |
| Hugging Face Dataset Repository: | |
| https://huggingface.co/datasets/Ranjit89/xahitya-assamese-corpus |