---
language:
  - am   # Amharic
  - ar   # Arabic
  - hy   # Armenian
  - az   # Azerbaijani
  - bn   # Bengali
  - my   # Burmese
  - zh   # Mandarin_Chinese (also includes Wu_Chinese)
  - ca   # Catalan
  - da   # Danish
  - nl   # Dutch
  - en   # English
  - fil  # Filipino_Tagalog
  - fo   # Faroese
  - fi   # Finnish
  - fr   # French
  - de   # German
  - el   # Greek
  - gu   # Gujarati
  - ha   # Hausa
  - he   # Hebrew
  - hi   # Hindi
  - hu   # Hungarian
  - id   # Indonesian
  - it   # Italian
  - ja   # Japanese
  - jv   # Javanese
  - km   # Khmer
  - ko   # Korean
  - lo   # Lao
  - ms   # Malay
  - mr   # Marathi
  - 'no' # Norwegian
  - ps   # Pashto
  - fa   # Persian
  - pl   # Polish
  - pt   # Portuguese
  - pa   # Punjabi
  - ro   # Romanian
  - ru   # Russian
  - sr   # Serbian
  - sk   # Slovak
  - es   # Spanish
  - sw   # Swahili
  - sv   # Swedish
  - ta   # Tamil
  - te   # Telugu
  - th   # Thai
  - bo   # Tibetan
  - tr   # Turkish
  - ur   # Urdu
  - uz   # Uzbek
  - vi   # Vietnamese
  - yo   # Yoruba
  - ceb  # Cebuano
  - cs   # Czech
license: other
task_categories:
- text-generation
---
# Bantam Language Model

This model card provides a detailed overview of the **BantamForCausalLM** model, a transformer-based architecture designed for adaptive, efficient language modeling through hybrid dense and sparse computation.

## Model Details

### Model Description

The **Bantam** model is a 20-layer causal language model combining **dense Transformer blocks** with **Mixture-of-Experts (MoE)** layers. It features **layer-wise dynamic attention**, **progressive context scaling**, and **grouped multi-query attention**, designed for efficient large-scale language modeling with balanced compute utilization.

* **Developed by:** Theoistic
* **Lead Developer:** Theodor Solbjorg ([theo@theoistic.com](mailto:theo@theoistic.com))
* **Funded by:** Theoistic
* **Shared by:** Theoistic
* **Model type:** Causal Language Model (Transformer-based)
* **Language(s):** Multilingual (55 languages, see dataset summary below)

### Model Sources

* **Paper:** Pending publication
* **Demo:** Coming soon

## Uses

### Direct Use

The Bantam model can be used directly for text generation, completion, summarization, and instruction following. It supports context windows up to **2048 tokens** and operates efficiently on GPUs using **bfloat16 precision**.

### Downstream Use

Bantam can be fine-tuned for downstream NLP tasks, such as translation, dialogue modeling, educational content generation, or creative writing. Its multilingual and mixed-domain dataset allows flexible adaptation.

### Out-of-Scope Use

The model is **not** intended for high-stakes or safety-critical domains, such as legal, medical, or financial decision-making. It should not be used to generate misinformation, and its outputs may be biased, so they should not be deployed without human oversight.

## Bias, Risks, and Limitations

Bantam inherits biases from its multilingual datasets, which include content from the internet, curated knowledge bases, and open-source text corpora. It may underperform on underrepresented languages or dialects.

Additionally, as a relatively small model (≈285M parameters), **hallucinations and factual inaccuracies are expected** — especially when reasoning beyond the scope of the training data.

### Recommendations

Users should implement output filtering, content moderation, and continuous evaluation on domain-specific benchmarks to identify and mitigate bias or performance issues.

## Model Capabilities

Bantam demonstrates strong **multilingual competence** across 55 languages and is capable of generating **informative, coherent, and contextually aware text** in each of them.

The model was designed to leverage **many small attention heads in early layers** to capture linguistic and grammatical structures, transitioning to **fewer, larger heads intended for more abstract reasoning** in later layers. This design is intended to improve logical coherence and narrative flow across diverse languages despite the model’s compact size.

## How to Get Started with the Model

Before loading the model, install the Bantam CLI:

```bash
pip install bantam-cli
```

To run inference directly through the Bantam CLI:

```bash
bantam-cli chat --model Theoistic/Bantam-285m
```

Or initialize the model in Python:

```python
import bantam  # lazy imports
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam  # registers config/model with AutoConfig/AutoModel

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")

prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt")

# Remove unsupported keys (like token_type_ids) before generation
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
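
In Transformers, `max_length` counts the prompt tokens as well as the generated ones, while `max_new_tokens` bounds only the continuation. The helper below is a minimal sketch for choosing a continuation budget that stays inside the 2048-token context window; `new_token_budget` is a hypothetical name and is not part of Bantam CLI or Transformers.

```python
CONTEXT_WINDOW = 2048  # context length stated in this model card


def new_token_budget(prompt_len: int, requested: int, context: int = CONTEXT_WINDOW) -> int:
    """Return how many new tokens fit without exceeding the context window."""
    if prompt_len < 0 or requested < 0:
        raise ValueError("lengths must be non-negative")
    remaining = max(context - prompt_len, 0)
    return min(requested, remaining)


# A 2000-token prompt leaves room for only 48 new tokens.
print(new_token_budget(prompt_len=2000, requested=100))  # 48
print(new_token_budget(prompt_len=5, requested=100))     # 100
```

The result can be passed as `max_new_tokens` to `model.generate` in place of `max_length`.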


## Training Details

### Training Data

The model was trained on the **Bantam Dataset**, a multilingual and multi-domain collection of JSONL files designed to support general-purpose language modeling. It includes content from knowledge bases, refined educational text, tiny fictional stories, and curated data for linguistic diversity.

#### Languages Covered

The dataset spans **55 languages**, including:

* **English**
* **Chinese (Mandarin, Wu, Cantonese)**
* **Romance:** Spanish, French, Portuguese, Italian, Romanian, Catalan
* **Indic & Dravidian:** Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati
* **Slavic & Germanic:** Russian, Polish, Czech, German, Danish, Swedish, Norwegian, Dutch, Faroese
* **Others:** Arabic, Hebrew, Amharic, Turkish, Finnish, Korean, Japanese, Swahili, Vietnamese, Thai, Greek, Persian, and more.

*Note: The dataset itself is not publicly released; this summary represents the linguistic and structural diversity of the data used for training.*

### Training Procedure

#### Preprocessing

A large portion of the training data was deduplicated, normalized, and categorized by field of study and language. Larger models were then used to preprocess it into concise, detailed, feature-rich Markdown articles.
The resulting millions of articles were shuffled across languages so that larger domains or dominant linguistic features did not overshadow low-resource languages, and so that low-resource languages were not lost to catastrophic forgetting.
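
The actual pipeline is not published, so the following is only a rough sketch of the deduplication and cross-language shuffling steps described above. The record fields (`lang`, `text`), the exact-hash deduplication, and the function names are all assumptions made for illustration.

```python
import hashlib
import random
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedupe(articles):
    """Drop articles whose normalized text has already been seen."""
    seen, unique = set(), []
    for article in articles:
        text = normalize(article["text"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append({**article, "text": text})
    return unique


def shuffle_mixed(articles, seed=0):
    """Shuffle all articles together so no language forms one contiguous block."""
    mixed = list(articles)
    random.Random(seed).shuffle(mixed)
    return mixed


corpus = [
    {"lang": "en", "text": "Hello   world"},
    {"lang": "en", "text": "Hello world"},  # duplicate after normalization
    {"lang": "sw", "text": "Habari dunia"},
    {"lang": "fo", "text": "Hey heimur"},
]
cleaned = shuffle_mixed(dedupe(corpus))
print(len(cleaned))  # 3
```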

#### Training Hyperparameters

* **Parameters:** 285 million
* **Precision:** bfloat16 mixed precision
* **Optimizer:** AdamW with weight decay
* **Batch size:** 2048 tokens per GPU
* **Learning rate schedule:** Cosine decay with warmup
* **Context length:** 2048 tokens
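
As an illustration of the schedule named above, here is a minimal sketch of linear warmup followed by cosine decay. The peak learning rate, warmup length, and floor are placeholder values chosen for the example; the card does not state the values used in training.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    peak_lr, warmup_steps and min_lr are illustrative assumptions.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 10,000-step run.
for step in (0, 999, 5500, 10000):
    print(step, f"{lr_at(step, total_steps=10000):.6f}")
```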

#### Speeds, Sizes, Times

* **Training hardware:** NVIDIA RTX 5090
* **Training duration:** ~50 hours
* **Checkpoint size:** ~285M parameters
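
For a rough sense of on-disk and in-memory size, the parameter count can be converted to bytes. This is a weights-only estimate under the assumption that every parameter is stored in a single dtype; it ignores optimizer state, activations, and file-format overhead.

```python
PARAMS = 285_000_000  # approximate parameter count from this card
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_size_gb(n_params: int, dtype: str) -> float:
    """Weights-only size in gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_size_gb(PARAMS, dtype):.2f} GB")
```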

## Evaluation

Bantam is a **pretrained base model**, not fine-tuned or benchmarked with external metrics. Qualitatively, it exhibits:

* Strong multilingual understanding and generation across 55 languages.
* Coherent reasoning and informative responses.
* Expected hallucinations due to small model size.

No quantitative metrics or interpretability visualizations (e.g., heatmaps, probing, or evaluation suites) have been produced yet.

## Environmental Impact

* **Hardware Type:** NVIDIA RTX 5090
* **Hours used:** ~50
* **Cloud Provider:** Local compute
* **Compute Region:** N/A (local training)
* **Carbon Emitted:** Estimated <0.05 tCO₂eq
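
The carbon figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the GPU ran at a 575 W board power for the whole run and uses an illustrative grid intensity of 0.4 kg CO₂eq per kWh; neither value comes from the training logs, and the rest of the system's power draw is ignored.

```python
def training_emissions_t(hours, gpu_power_kw, grid_kg_co2_per_kwh):
    """Return (energy in kWh, emissions in tonnes CO2eq) for a single GPU."""
    energy_kwh = hours * gpu_power_kw
    return energy_kwh, energy_kwh * grid_kg_co2_per_kwh / 1000.0


energy, tonnes = training_emissions_t(
    hours=50,                 # training duration from this card
    gpu_power_kw=0.575,       # assumption: constant 575 W draw
    grid_kg_co2_per_kwh=0.4,  # assumption: illustrative grid intensity
)
print(f"{energy:.2f} kWh, {tonnes:.4f} tCO2eq")
```

Under these assumptions the estimate stays below the 0.05 tCO₂eq bound reported above.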

## Technical Specifications

### Model Architecture and Objective

| Layer | Type  | Query Heads | KV Heads | Head Dim | Groups | Intermediate Size | Window | MoE | Notes                           |
| ----- | ----- | ----------- | -------- | -------- | ------ | ----------------- | ------ | --- | ------------------------------- |
| 0     | Dense | 12          | 3        | 64       | 1      | 2304              | 128    | ❌   | Dense local linguistic encoding |
| 1     | Dense | 12          | 3        | 64       | 1      | 2304              | 128    | ❌   | Dense local linguistic encoding |
| 2     | Dense | 12          | 3        | 64       | 1      | 2368              | 128    | ❌   | Dense local attention           |
| 3     | Dense | 12          | 3        | 64       | 1      | 2400              | –      | ❌   | Transition layer                |
| 4     | MoE   | (6+3)       | 3        | (80/96)  | 2      | 2432              | 256    | ✅   | 6 experts, top-2 routing        |
| 5     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2368              | 256    | ❌   | Hybrid attention                |
| 6     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2432              | 256    | ❌   | Hybrid attention                |
| 7     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2368              | –      | ❌   | Expanding context               |
| 8     | Dense | 9           | 3        | 64/128   | 2      | 2304              | 256    | ❌   | Default grouped attention       |
| 9     | Dense | 9           | 3        | 64/128   | 2      | 2368              | 256    | ❌   | Default grouped attention       |
| 10    | Dense | 9           | 3        | 64/128   | 2      | 2400              | 256    | ❌   | Default grouped attention       |
| 11    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 256    | ❌   | Default grouped attention       |
| 12    | Dense | 9           | 3        | 64/128   | 2      | 2432              | –      | ❌   | Expanding context               |
| 13    | Dense | 9           | 3        | 64/128   | 2      | 2400              | 512    | ❌   | Logical attention expansion     |
| 14    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 512    | ❌   | Logical attention expansion     |
| 15    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 512    | ❌   | Logical attention expansion     |
| 16    | MoE   | 9           | 3        | 64/128   | 2      | 2432              | 512    | ✅   | 8 experts, top-2 routing        |
| 17    | MoE   | 9           | 3        | 64/128   | 2      | 2432              | –      | ✅   | 8 experts, top-2 routing        |
| 18    | Dense | 9           | 3        | 64/128   | 2      | 2368              | 512    | ❌   | Output stabilization            |
| 19    | Dense | 9           | 3        | 64/128   | 2      | 2400              | –      | ❌   | Final dense layer               |
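
The **Window** column is read here as a sliding-window size in tokens, with "–" taken to mean full causal attention; the card does not define either explicitly, so treat both as assumptions. Under that reading, the attention mask for a layer could be sketched as follows (whether the window includes the current token is also an assumption).

```python
def causal_window_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None means full causal attention; otherwise each query sees
    itself and at most the window - 1 tokens before it.
    """
    mask = []
    for i in range(seq_len):
        start = 0 if window is None else max(0, i - window + 1)
        mask.append([start <= j <= i for j in range(seq_len)])
    return mask


local = causal_window_mask(6, window=3)
full = causal_window_mask(6)
print(sum(local[5]))  # 3 visible positions: 3, 4, 5
print(sum(full[5]))   # 6 visible positions: 0 through 5
```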

### Attention Group Defaults

| Group   | Query Heads | KV Heads | Head Dim |
| ------- | ----------- | -------- | -------- |
| Group 1 | 3           | 1        | 128      |
| Group 2 | 6           | 2        | 64       |

These defaults apply to all layers unless explicitly overridden in layer-specific configurations.
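
As a quick check of the head arithmetic, the sketch below sums the projection widths implied by the tables. Both the default two-group layout and the single-group layout of layers 0 to 3 give a query width of 768, which suggests a hidden size of 768; the card does not state the hidden size, so this is an inference.

```python
# (query_heads, kv_heads, head_dim) per attention group, from the defaults table
DEFAULT_GROUPS = [(3, 1, 128), (6, 2, 64)]
# Layers 0-3 use a single group of 12 query heads and 3 KV heads at dim 64
EARLY_GROUPS = [(12, 3, 64)]


def projection_widths(groups):
    """Return (query_width, kv_width) summed across attention groups."""
    q_width = sum(q * dim for q, _, dim in groups)
    kv_width = sum(kv * dim for _, kv, dim in groups)
    return q_width, kv_width


print(projection_widths(DEFAULT_GROUPS))  # (768, 256)
print(projection_widths(EARLY_GROUPS))    # (768, 192)
```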

* **Objective:** Causal next-token prediction
* **Routing:** Top-2 expert routing with a load-balancing loss weight of 0.01
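
The exact router is not documented here, so the following is a generic sketch of top-2 routing with a Switch-Transformer-style auxiliary loss, not Bantam's implementation. It assumes softmax gating, renormalized top-2 weights, and that 0.01 is the coefficient applied to the auxiliary loss; the example uses 4 experts for brevity rather than the 6 or 8 listed in the layer table.

```python
import math


def top2_route(logits):
    """Return ([(expert_index, weight), ...] for the top 2 experts, all probs)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2], probs


def load_balance_loss(all_probs, all_choices, num_experts):
    """Auxiliary loss N * sum_i f_i * P_i; equals 1.0 for perfectly even routing."""
    n_tokens = len(all_probs)
    assignments = sum(len(c) for c in all_choices)
    loss = 0.0
    for e in range(num_experts):
        frac = sum(c.count(e) for c in all_choices) / assignments  # f_i
        mean_prob = sum(p[e] for p in all_probs) / n_tokens        # P_i
        loss += frac * mean_prob
    return num_experts * loss


AUX_WEIGHT = 0.01  # assumed to be the coefficient of the load-balancing loss
token_logits = [[2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 3.0, 1.0]]
routed = [top2_route(logits) for logits in token_logits]
choices = [[i for i, _ in picks] for picks, _ in routed]
probs = [p for _, p in routed]
aux = AUX_WEIGHT * load_balance_loss(probs, choices, num_experts=4)
print(choices, aux)
```

In this toy batch each expert receives exactly one assignment, so the unweighted loss sits at its balanced value of 1.0 and the weighted term is 0.01.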

### Compute Infrastructure

#### Hardware

* 1 × NVIDIA RTX 5090 GPU

#### Software

* PyTorch 2.8
* Transformers >=4.41
* Bantam CLI (required for import registration)

## Model Card Authors

* **Theodor Solbjorg** — Lead Developer, Theoistic

## Model Card Contact

For inquiries: [theo@theoistic.com](mailto:theo@theoistic.com)