English-Doc-Topic-BERT

The English-Doc-Topic-BERT model is a BERT-base-uncased model fine-tuned on English documents from the L3Cube-IndicNews corpus ([dataset link](https://github.com/l3cube-pune/indic-nlp)).
The dataset comprises three sub-datasets, LDC (Long Document Classification), LPC (Long Paragraph Classification), and SHC (Short Headlines Classification), which differ in document length.
This model is trained on a combination of all three variants and works well across different document sizes.
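
A minimal usage sketch follows, assuming the model is loaded through the Hugging Face `transformers` text-classification pipeline. The repository id `l3cube-pune/english-topic-all-doc` is assumed to be this model's Hub identifier, and the helper names (`build_classifier`, `classify`, `demo`) and sample texts are hypothetical. Label names come from the model's own configuration and are not listed here. Inputs are truncated to 512 tokens, the BERT-base limit, so very long documents are only partially read.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed Hub repository id for this model; adjust if yours differs.
MODEL_ID = "l3cube-pune/english-topic-all-doc"


def build_classifier(model_id: str = MODEL_ID):
    """Create a text-classification pipeline (downloads weights on first use)."""
    # Imported lazily so the helpers below can be used without the dependency.
    # Requires: pip install transformers torch
    from transformers import pipeline

    return pipeline("text-classification", model=model_id, tokenizer=model_id)


def classify(texts: Iterable[str], classifier: Callable) -> List[Tuple[str, float]]:
    """Return one (label, score) pair per input text.

    Inputs longer than 512 tokens are truncated by the tokenizer.
    """
    outputs = classifier(list(texts), truncation=True, max_length=512)
    return [(out["label"], float(out["score"])) for out in outputs]


def demo() -> None:
    """Classify a short headline and a longer paragraph (illustrative inputs)."""
    classifier = build_classifier()
    samples = [
        "Stock markets close higher as tech shares rally",
        "The home side won the final by six wickets after a tense chase "
        "that went down to the last over of the match.",
    ]
    for text, (label, score) in zip(samples, classify(samples, classifier)):
        print(f"{label}\t{score:.3f}\t{text[:50]}")
```

Calling `demo()` downloads the weights on first use and prints one predicted label per sample. `classify` accepts any callable with the pipeline's call signature, which keeps it easy to exercise without network access.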

More details on the dataset, models, and baseline results can be found in our [paper](https://arxiv.org/abs/2401.02254).

Citing:

@article{mirashi2024l3cube,
  title={L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages},
  author={Mirashi, Aishwarya and Sonavane, Srushti and Lingayat, Purva and Padhiyar, Tejas and Joshi, Raviraj},
  journal={arXiv preprint arXiv:2401.02254},
  year={2024}
}

Other document topic models for different Indic languages are listed below:
- Hindi-Doc-Topic-BERT
- Marathi-Doc-Topic-BERT
- Bengali-Doc-Topic-BERT
- Telugu-Doc-Topic-BERT
- Tamil-Doc-Topic-BERT
- Gujarati-Doc-Topic-BERT
- Kannada-Doc-Topic-BERT
- Odia-Doc-Topic-BERT
- Malayalam-Doc-Topic-BERT
- Punjabi-Doc-Topic-BERT
- English-Doc-Topic-BERT
