|
|
--- |
|
|
language: |
|
|
- am |
|
|
- ar |
|
|
- hy |
|
|
- az |
|
|
- bn |
|
|
- my |
|
|
- zh |
|
|
- ca |
|
|
- da |
|
|
- nl |
|
|
- en |
|
|
- fil |
|
|
- fo |
|
|
- fi |
|
|
- fr |
|
|
- de |
|
|
- el |
|
|
- gu |
|
|
- ha |
|
|
- he |
|
|
- hi |
|
|
- hu |
|
|
- id |
|
|
- it |
|
|
- ja |
|
|
- jv |
|
|
- km |
|
|
- ko |
|
|
- lo |
|
|
- ms |
|
|
- mr |
|
|
- 'no' |
|
|
- ps |
|
|
- fa |
|
|
- pl |
|
|
- pt |
|
|
- pa |
|
|
- ro |
|
|
- ru |
|
|
- sr |
|
|
- sk |
|
|
- es |
|
|
- sw |
|
|
- sv |
|
|
- ta |
|
|
- te |
|
|
- th |
|
|
- bo |
|
|
- tr |
|
|
- ur |
|
|
- uz |
|
|
- vi |
|
|
- yo |
|
|
- ceb |
|
|
- cs |
|
|
license: other |
|
|
task_categories: |
|
|
- text-generation |
|
|
--- |
|
|
# Bantam Language Model |
|
|
|
|
|
This model card provides a detailed overview of the **BantamForCausalLM** model, a transformer-based architecture designed for adaptive, efficient language modeling through hybrid dense and sparse computation. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
The **Bantam** model is a 20-layer causal language model combining **dense Transformer blocks** with **Mixture-of-Experts (MoE)** layers. It features **layer-wise dynamic attention**, **progressive context scaling**, and **grouped multi-query attention**, designed for efficient large-scale language modeling with balanced compute utilization. |
|
|
|
|
|
* **Developed by:** Theoistic |
|
|
* **Lead Developer:** Theodor Solbjorg ([theo@theoistic.com](mailto:theo@theoistic.com)) |
|
|
* **Funded by:** Theoistic |
|
|
* **Shared by:** Theoistic |
|
|
* **Model type:** Causal Language Model (Transformer-based) |
|
|
* **Language(s):** Multilingual (55 languages, see dataset summary below) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
* **Paper:** Pending publication |
|
|
* **Demo:** Coming soon |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The Bantam model can be used directly for text generation, completion, summarization, and instruction following. It supports context windows up to **2048 tokens** and operates efficiently on GPUs using **bfloat16 precision**. |
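A minimal sketch of loading the model in bfloat16 for GPU inference, assuming the `bantam` package is installed (the import-registration pattern is shown in full in the getting-started section below) and a CUDA device is available:

```python
import torch
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam      # registers config/model with AutoConfig/AutoModel

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained(
    "Theoistic/Bantam-285m",
    torch_dtype=torch.bfloat16,  # bfloat16, matching the precision used in training
).to("cuda")  # assumes a CUDA-capable GPU is available
```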
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
Bantam can be fine-tuned for downstream NLP tasks, such as translation, dialogue modeling, educational content generation, or creative writing. Its multilingual and mixed-domain dataset allows flexible adaptation. |
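As a rough sketch, fine-tuning can follow the standard transformers `Trainer` workflow; the dataset file, hyperparameters, and output directory below are placeholders, not recommendations from the Bantam authors:

```python
import bantam.tokenization_bantam  # registers the tokenizer
import bantam.modeling_bantam      # registers the model

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure the collator can pad

# Placeholder corpus: replace "my_corpus.txt" with your own text data.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bantam-finetuned", bf16=True, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```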
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
The model is **not** intended for high-stakes or safety-critical domains, such as legal, medical, or financial decision-making. It should not be used for generating misinformation or biased outputs without human oversight. |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
Bantam inherits biases from its multilingual datasets, which include content from the internet, curated knowledge bases, and open-source text corpora. It may underperform on underrepresented languages or dialects. |
|
|
|
|
|
Additionally, as a relatively small model (~285M parameters), **hallucinations and factual inaccuracies are expected**, especially when reasoning beyond the scope of the training data. |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should implement output filtering, content moderation, and continuous evaluation on domain-specific benchmarks to identify and mitigate bias or performance issues. |
|
|
|
|
|
## Model Capabilities |
|
|
|
|
|
Bantam demonstrates strong **multilingual competence** across 55 languages and is capable of generating **informative, coherent, and contextually aware text** in each of them. |
|
|
|
|
|
The model was designed to leverage **many small attention heads in early layers** to capture linguistic and grammatical structures, transitioning to **larger, more abstract reasoning** in later layers. This design improves logical coherence and narrative flow across diverse languages despite the modelβs compact size. |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Before loading the model, install the Bantam CLI: |
|
|
|
|
|
```bash |
|
|
pip install bantam-cli |
|
|
``` |
|
|
|
|
|
If you want to run inference directly via the bantam-cli, you can run: |
|
|
|
|
|
```bash |
|
|
bantam-cli chat --model Theoistic/Bantam-285m |
|
|
``` |
|
|
|
|
|
or initialize the model in Python: |
|
|
|
|
|
```python |
|
|
import bantam # lazy imports |
|
|
import bantam.tokenization_bantam # registers BantamFastTokenizer with AutoTokenizer |
|
|
import bantam.modeling_bantam # registers config/model with AutoConfig/AutoModel |
|
|
|
|
|
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m") |
|
|
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m") |
|
|
|
|
|
|
|
|
prompt = "Once upon a time," |
|
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
|
|
|
|
|
|
|
# Remove unsupported keys (like token_type_ids) before generation |
|
|
if "token_type_ids" in inputs: |
|
|
del inputs["token_type_ids"] |
|
|
|
|
|
|
|
|
outputs = model.generate(**inputs, max_length=100) |
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
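Generation behaviour can be tuned with the usual transformers sampling arguments; the values below are illustrative, not recommended defaults:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.8,   # illustrative value, not a tuned recommendation
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```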
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on the **Bantam Dataset**, a multilingual and multi-domain collection of JSONL files designed to support general-purpose language modeling. It includes content from knowledge bases, refined educational text, tiny fictional stories, and curated data for linguistic diversity. |
|
|
|
|
|
#### Languages Covered |
|
|
|
|
|
The dataset spans **55 languages**, including: |
|
|
|
|
|
* **English** |
|
|
* **Chinese (Mandarin, Wu, Cantonese)** |
|
|
* **Romance:** Spanish, French, Portuguese, Italian, Romanian, Catalan |
|
|
* **Indic & Dravidian:** Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati |
|
|
* **Slavic & Germanic:** Russian, Polish, Czech, German, Danish, Swedish, Norwegian, Dutch, Faroese |
|
|
* **Others:** Arabic, Hebrew, Amharic, Turkish, Finnish, Korean, Japanese, Swahili, Vietnamese, Thai, Greek, Persian, and more. |
|
|
|
|
|
*Note: The dataset itself is not publicly released; this summary represents the linguistic and structural diversity of the data used for training.* |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
|
|
|
A large portion of the training data was deduplicated, normalized, and categorized by field of study and language, then preprocessed into feature-rich, dense articles: larger models were used to reformat the source text into concise, detailed Markdown articles. The resulting millions of articles were shuffled across languages so that larger domains or dominant linguistic features did not overshadow the rest, and so that low-resource languages were not lost to catastrophic forgetting.
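A minimal sketch of the kind of cross-language shuffling described above; the record fields (`text`, `lang`) are assumptions, since the actual dataset schema is not published:

```python
import random
from collections import Counter

def build_training_stream(records, seed=0):
    """Mix articles from all languages into a single shuffled stream.

    Assumed record shape: {"text": ..., "lang": ...}; purely illustrative.
    """
    rng = random.Random(seed)
    stream = list(records)
    rng.shuffle(stream)  # a global shuffle interleaves languages proportionally

    # Sanity check: confirm low-resource languages are represented at all.
    counts = Counter(rec["lang"] for rec in stream)
    for lang, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{lang}: {n} articles")
    return stream
```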
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* **Parameters:** 285 million |
|
|
* **Precision:** bfloat16 mixed precision |
|
|
* **Optimizer:** AdamW with weight decay |
|
|
* **Batch size:** 2048 tokens per GPU |
|
|
* **Learning rate schedule:** Cosine decay with warmup (see the sketch after this list) |
|
|
* **Context length:** 2048 tokens |
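A minimal sketch of the optimizer and learning-rate schedule listed above, built from standard PyTorch and transformers utilities; the learning rate, weight decay, and step counts are placeholders, and `model` / `train_dataloader` are assumed to be set up as in any causal-LM training loop:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder values; the actual settings used for Bantam are not published.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=100_000,
)

for step, batch in enumerate(train_dataloader):
    # bfloat16 mixed precision for the forward pass
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```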
|
|
|
|
|
#### Speeds, Sizes, Times |
|
|
|
|
|
* **Training hardware:** NVIDIA RTX 5090 |
|
|
* **Training duration:** ~50 hours |
|
|
* **Checkpoint size:** ~285M parameters |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Bantam is a **pretrained base model**; it has not been instruction-tuned or benchmarked against external metrics. Qualitatively, it exhibits: |
|
|
|
|
|
* Strong multilingual understanding and generation across 55 languages. |
|
|
* Coherent reasoning and informative responses. |
|
|
* Expected hallucinations due to small model size. |
|
|
|
|
|
No quantitative metrics or interpretability visualizations (e.g., heatmaps, probing, or evaluation suites) have been produced yet. |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
* **Hardware Type:** NVIDIA RTX 5090 |
|
|
* **Hours used:** ~50 |
|
|
* **Cloud Provider:** Local compute |
|
|
* **Compute Region:** N/A (local training) |
|
|
* **Carbon Emitted:** Estimated <0.05 tCO₂eq |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
| Layer | Type  | Query Heads | KV Heads | Head Dim | Groups | Intermediate Size | Window | MoE | Notes                           |
| ----- | ----- | ----------- | -------- | -------- | ------ | ----------------- | ------ | --- | ------------------------------- |
| 0     | Dense | 12          | 3        | 64       | 1      | 2304              | 128    | ❌  | Dense local linguistic encoding |
| 1     | Dense | 12          | 3        | 64       | 1      | 2304              | 128    | ❌  | Dense local linguistic encoding |
| 2     | Dense | 12          | 3        | 64       | 1      | 2368              | 128    | ❌  | Dense local attention           |
| 3     | Dense | 12          | 3        | 64       | 1      | 2400              | –      | ❌  | Transition layer                |
| 4     | MoE   | (6+3)       | 3        | (80/96)  | 2      | 2432              | 256    | ✅  | 6 experts, top-2 routing        |
| 5     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2368              | 256    | ❌  | Hybrid attention                |
| 6     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2432              | 256    | ❌  | Hybrid attention                |
| 7     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2368              | –      | ❌  | Expanding context               |
| 8     | Dense | 9           | 3        | 64/128   | 2      | 2304              | 256    | ❌  | Default grouped attention       |
| 9     | Dense | 9           | 3        | 64/128   | 2      | 2368              | 256    | ❌  | Default grouped attention       |
| 10    | Dense | 9           | 3        | 64/128   | 2      | 2400              | 256    | ❌  | Default grouped attention       |
| 11    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 256    | ❌  | Default grouped attention       |
| 12    | Dense | 9           | 3        | 64/128   | 2      | 2432              | –      | ❌  | Expanding context               |
| 13    | Dense | 9           | 3        | 64/128   | 2      | 2400              | 512    | ❌  | Logical attention expansion     |
| 14    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 512    | ❌  | Logical attention expansion     |
| 15    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 512    | ❌  | Logical attention expansion     |
| 16    | MoE   | 9           | 3        | 64/128   | 2      | 2432              | 512    | ✅  | 8 experts, top-2 routing        |
| 17    | MoE   | 9           | 3        | 64/128   | 2      | 2432              | –      | ✅  | 8 experts, top-2 routing        |
| 18    | Dense | 9           | 3        | 64/128   | 2      | 2368              | 512    | ❌  | Output stabilization            |
| 19    | Dense | 9           | 3        | 64/128   | 2      | 2400              | –      | ❌  | Final dense layer               |
|
|
|
|
|
### Attention Group Defaults |
|
|
|
|
|
| Group | Query Heads | KV Heads | Head Dim | |
|
|
| ------- | ----------- | -------- | -------- | |
|
|
| Group 1 | 3 | 1 | 128 | |
|
|
| Group 2 | 6 | 2 | 64 | |
|
|
|
|
|
These defaults apply to all layers unless explicitly overridden in layer-specific configurations. |
|
|
|
|
|
* **Objective:** Causal next-token prediction |
|
|
* **Routing:** Top-2 expert routing with a load-balancing loss coefficient of 0.01 (see the sketch below) |
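The routing mechanism can be illustrated with a small sketch of top-2 expert routing plus a Switch-style load-balancing auxiliary loss. This is not the Bantam implementation itself, only a generic illustration of the technique using the 0.01 coefficient from the bullet above; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden_size, intermediate_size, num_experts, aux_loss_coef=0.01):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        self.aux_loss_coef = aux_loss_coef

    def forward(self, x):                       # x: (tokens, hidden)
        logits = self.router(x)                 # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        top2_probs, top2_idx = probs.topk(2, dim=-1)
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)

        # Dispatch each token to its two selected experts and mix the outputs.
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, slot] == e
                if mask.any():
                    out[mask] += top2_probs[mask, slot, None] * expert(x[mask])

        # Load-balancing loss: encourage uniform expert usage (Switch-style).
        num_experts = len(self.experts)
        frac_tokens = F.one_hot(top2_idx, num_experts).float().mean(dim=(0, 1))
        frac_probs = probs.mean(dim=0)
        aux_loss = self.aux_loss_coef * num_experts * (frac_tokens * frac_probs).sum()
        return out, aux_loss
```

In a sketch like this, the auxiliary loss would be added to the language-modeling loss during training; how Bantam combines the two internally is not documented here.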
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
* 1 Γ NVIDIA RTX 5090 GPU |
|
|
|
|
|
#### Software |
|
|
|
|
|
* PyTorch 2.8 |
|
|
* Transformers >=4.41 |
|
|
* Bantam CLI (required for import registration) |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
* **Theodor Solbjorg** β Lead Developer, Theoistic |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For inquiries: [theo@theoistic.com](mailto:theo@theoistic.com) |
|
|
|