---
language:
- bn
- en
license: apache-2.0
tags:
- bilingual
- bengali
- bangla
- model-collection
- update
- batch-update
datasets:
- KothaGPT/bilingual-corpus
- KothaGPT/bilingual_wikipedia
widget:
- text: "বাংলাদেশের রাজধানী"
- text: "The capital of Bangladesh is"
---

# KothaGPT Model Collection Update

## 📦 Model Collection Overview

This repository contains the complete collection of KothaGPT bilingual language models and tools for Bangla (Bengali) and English. All models have been updated and published to the Hugging Face Hub.

**Last Updated:** January 2026  
**Organization:** KothaGPT  
**License:** Apache 2.0

## 🚀 Available Models

### Core Language Models
- **[bilingual-lm](https://huggingface.co/KothaGPT/bilingual-lm)** - Main bilingual causal language model
- **[literary-lm](https://huggingface.co/KothaGPT/literary-lm)** - Literary text specialized model
- **[tokenizer](https://huggingface.co/KothaGPT/tokenizer)** - Bilingual tokenizer

### Classification Models
- **[readability-classifier](https://huggingface.co/KothaGPT/readability-classifier)** - Text readability assessment
- **[sentiment-tone-classifier](https://huggingface.co/KothaGPT/sentiment-tone-classifier)** - Sentiment and tone analysis
- **[text-complexity-predictor](https://huggingface.co/KothaGPT/text-complexity-predictor)** - Text complexity prediction

### Specialized Models
- **[poetic-meter-detector](https://huggingface.co/KothaGPT/poetic-meter-detector)** - Bengali poetic meter detection
- **[metaphor-simile-detector](https://huggingface.co/KothaGPT/metaphor-simile-detector)** - Literary device detection
- **[named-entity-recognizer](https://huggingface.co/KothaGPT/named-entity-recognizer)** - NER for Bangla/English
- **[cross-lingual-embed](https://huggingface.co/KothaGPT/cross-lingual-embed)** - Cross-lingual embeddings
- **[style-transfer-gpt](https://huggingface.co/KothaGPT/style-transfer-gpt)** - Text style transfer

## 🔄 Update Process

### Automated Publishing
All models are published using the automated script:
```bash
HF_TOKEN=your_token bash scripts/huggingface/publish_all.sh false
```

### Script Features
- **Modern Commands**: Uses `hf upload-large-folder` for better large file handling
- **Error Recovery**: Resumable uploads for large models
- **Validation**: Pre-upload validation checks
- **Progress Tracking**: Detailed progress bars and status reports
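The publish script itself is not reproduced here. As a rough sketch only (the function name, model list, and dry-run flag are assumptions, not the actual contents of `publish_all.sh`), a loop with the features above might look like:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of a publish loop; real script lives in
# scripts/huggingface/publish_all.sh. First argument toggles dry-run.
DRY_RUN="${1:-true}"

publish_model() {
    name="$1"
    # upload-large-folder supports resumable uploads for big model files
    cmd="hf upload-large-folder KothaGPT/${name} models/${name} --repo-type=model"
    if [ "$DRY_RUN" = "true" ]; then
        echo "[dry-run] $cmd"
    else
        $cmd
    fi
}

for m in bilingual-lm literary-lm tokenizer; do
    publish_model "$m"
done
```

Running with `true` (the default) prints the commands without uploading, which is useful for validating repo names before spending bandwidth.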

## 📊 Model Statistics

| Model | Parameters | Files | Size | Use Case |
|-------|------------|-------|------|----------|
| bilingual-lm | ~125M | 42 | ~500MB | General text generation |
| literary-lm | ~125M | 2 | ~5MB | Literary text analysis |
| readability-classifier | - | 5 | ~2MB | Text assessment |
| sentiment-tone-classifier | - | 2 | ~1MB | Sentiment analysis |
| text-complexity-predictor | - | 1 | ~505KB | Complexity scoring |
| poetic-meter-detector | - | 2 | ~1MB | Poetry analysis |
| metaphor-simile-detector | - | 2 | ~1MB | Literary analysis |
| named-entity-recognizer | - | 2 | ~1MB | Entity extraction |
| cross-lingual-embed | - | 1 | ~1MB | Embeddings |
| style-transfer-gpt | - | 2 | ~1MB | Style transfer |
| tokenizer | - | 2 | ~262KB | Tokenization |

## 🛠️ Usage Examples

### Loading Multiple Models
```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Load the main bilingual model
tokenizer = AutoTokenizer.from_pretrained("KothaGPT/bilingual-lm")
model = AutoModelForCausalLM.from_pretrained("KothaGPT/bilingual-lm")

# Load a classifier
classifier = AutoModelForSequenceClassification.from_pretrained("KothaGPT/readability-classifier")
```

### Batch Processing
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

models = {
    "sentiment": "KothaGPT/sentiment-tone-classifier",
    "readability": "KothaGPT/readability-classifier",
    "complexity": "KothaGPT/text-complexity-predictor",
}

loaded = {}
for task, model_name in models.items():
    # Load each tokenizer/model pair once, then reuse them for inference
    loaded[task] = (
        AutoTokenizer.from_pretrained(model_name),
        AutoModelForSequenceClassification.from_pretrained(model_name),
    )
```

## 📈 Performance Metrics

### Language Support
- **Bangla (Bengali)**: Full support with native tokenizer
- **English**: Full support with standard tokenizer
- **Code-switching**: Handles mixed language text
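One simple way to see why code-switched text needs special handling: Bengali occupies its own Unicode block (U+0980–U+09FF), so the script mix of an input can be measured directly. This helper is an illustration, not part of the KothaGPT API:

```python
def script_mix(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Bengali
    Unicode block (U+0980-U+09FF). Returns 0.0 for empty input."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for c in letters if "\u0980" <= c <= "\u09ff")
    return bengali / len(letters)
```

A monolingual Bangla string scores 1.0, a monolingual English string 0.0, and code-switched text lands in between, which is roughly the signal a shared bilingual tokenizer has to absorb.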

### Benchmark Results
- **Perplexity**: < 25 on bilingual test set
- **Accuracy**: > 85% on classification tasks
- **Inference Speed**: ~50 tokens/second on CPU
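The perplexity figure above is the exponential of the average per-token negative log-likelihood. As a quick illustration of the metric (not the project's evaluation code):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

For example, a model that assigns every token probability 0.5 has perplexity 2, so "perplexity < 25" means the model is, on average, about as uncertain as a uniform choice over fewer than 25 tokens.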

## 🔧 Technical Details

### Training Infrastructure
- **Framework**: PyTorch + Transformers
- **Hardware**: GPU training on T4/V100
- **Optimization**: AdamW with cosine scheduling
- **Evaluation**: Comprehensive test suite
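Cosine scheduling, mentioned above, decays the learning rate along a half cosine wave from a peak to a floor. A minimal sketch of the schedule (the base rate, warmup, and floor values here are illustrative assumptions, not the training configuration):

```python
import math

def cosine_lr(step: int, total_steps: int,
              base_lr: float = 3e-4, min_lr: float = 0.0,
              warmup: int = 0) -> float:
    """Linear warmup followed by cosine decay from base_lr to min_lr."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and reaches `min_lr` at the final step, which pairs naturally with AdamW's decoupled weight decay.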

### Model Architecture
- **Base**: GPT-2 style transformer
- **Tokenizer**: SentencePiece with bilingual vocabulary
- **Embeddings**: Cross-lingual shared space
- **Layers**: 12 transformer layers, 12 attention heads
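The ~125M figure in the statistics table is consistent with these dimensions. Assuming GPT-2-small hyperparameters for the unstated ones (hidden size 768, vocabulary ~50k, context 1024; these are assumptions, not confirmed values for bilingual-lm), the count can be estimated directly:

```python
def gpt2_param_count(d_model: int = 768, n_layer: int = 12,
                     vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Approximate parameter count of a GPT-2-style decoder
    (weights + biases, embeddings tied with the LM head)."""
    emb = vocab * d_model + n_ctx * d_model          # token + position embeddings
    attn = d_model * 3 * d_model + 3 * d_model       # fused QKV projection
    attn += d_model * d_model + d_model              # output projection
    mlp = 2 * (d_model * 4 * d_model) + 4 * d_model + d_model  # up + down proj
    ln = 2 * 2 * d_model                             # two LayerNorms per block
    per_layer = attn + mlp + ln
    return emb + n_layer * per_layer + 2 * d_model   # + final LayerNorm
```

With the defaults this gives 124,439,808 parameters, matching the well-known GPT-2-small total and the "~125M" entry in the table.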

## 📚 Documentation

- **[API Reference](docs/api/)** - Complete API documentation
- **[Examples](examples/)** - Usage examples and tutorials
- **[Dataset Cards](cards/dataset_card.md)** - Training dataset information
- **[Individual Model Cards](cards/)** - Detailed model-specific information

## 🤝 Contributing

### Model Updates
1. Train/improve model locally
2. Update model files in `models/` directory
3. Run validation tests
4. Publish with: `bash scripts/huggingface/publish_all.sh false`

### Quality Assurance
- All models pass automated tests
- Manual review of model cards
- Performance benchmarking
- Documentation updates

## 📄 License

All models in this collection are licensed under Apache 2.0. See individual model repositories for specific usage terms.

## 📞 Support

- **Issues**: [GitHub Issues](https://github.com/KothaGPT/bilingual/issues)
- **Discussions**: [GitHub Discussions](https://github.com/KothaGPT/bilingual/discussions)
- **Documentation**: [Project Docs](https://kothagpt.github.io/bilingual/)

---

**Note**: This collection represents the complete suite of KothaGPT bilingual models. Models are regularly updated with new training data and improved architectures.