|
|
--- |
|
|
language: |
|
|
- am |
|
|
- ar |
|
|
- hy |
|
|
- az |
|
|
- bn |
|
|
- my |
|
|
- zh |
|
|
- ca |
|
|
- da |
|
|
- nl |
|
|
- en |
|
|
- fil |
|
|
- fo |
|
|
- fi |
|
|
- fr |
|
|
- de |
|
|
- el |
|
|
- gu |
|
|
- ha |
|
|
- he |
|
|
- hi |
|
|
- hu |
|
|
- id |
|
|
- it |
|
|
- ja |
|
|
- jv |
|
|
- km |
|
|
- ko |
|
|
- lo |
|
|
- ms |
|
|
- mr |
|
|
- 'no' |
|
|
- ps |
|
|
- fa |
|
|
- pl |
|
|
- pt |
|
|
- pa |
|
|
- ro |
|
|
- ru |
|
|
- sr |
|
|
- sk |
|
|
- es |
|
|
- sw |
|
|
- sv |
|
|
- ta |
|
|
- te |
|
|
- th |
|
|
- bo |
|
|
- tr |
|
|
- ur |
|
|
- uz |
|
|
- vi |
|
|
- yo |
|
|
- ceb |
|
|
- cs |
|
|
license: other |
|
|
task_categories: |
|
|
- text-generation |
|
|
--- |
|
|
# Bantam Language Model |
|
|
|
|
|
This model card provides a detailed overview of the **BantamForCausalLM** model, a transformer-based architecture designed for adaptive, efficient language modeling through hybrid dense and sparse computation. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
The **Bantam** model is a 20-layer causal language model combining **dense Transformer blocks** with **Mixture-of-Experts (MoE)** layers. It features **layer-wise dynamic attention**, **progressive context scaling**, and **grouped multi-query attention**, designed for efficient large-scale language modeling with balanced compute utilization. |
|
|
|
|
|
* **Developed by:** Theoistic |
|
|
* **Lead Developer:** Theodor Solbjorg ([theo@theoistic.com](mailto:theo@theoistic.com)) |
|
|
* **Funded by:** Theoistic |
|
|
* **Shared by:** Theoistic |
|
|
* **Model type:** Causal Language Model (Transformer-based) |
|
|
* **Language(s):** Multilingual (55 languages, see dataset summary below) |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
* **Paper:** Pending publication |
|
|
* **Demo:** Coming soon |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The Bantam model can be used directly for text generation, completion, summarization, and instruction following. It supports context windows up to **2048 tokens** and operates efficiently on GPUs using **bfloat16 precision**. |
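A minimal sketch of loading the model in bfloat16 for GPU inference, assuming the `bantam` package is installed (the import-registration pattern is shown in full in the getting-started section below) and a CUDA device is available:

```python
import torch
import bantam.tokenization_bantam  # registers BantamFastTokenizer with AutoTokenizer
import bantam.modeling_bantam      # registers config/model with AutoConfig/AutoModel

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained(
    "Theoistic/Bantam-285m",
    torch_dtype=torch.bfloat16,  # bfloat16, matching the precision used in training
).to("cuda")  # assumes a CUDA-capable GPU is available
```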
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
Bantam can be fine-tuned for downstream NLP tasks, such as translation, dialogue modeling, educational content generation, or creative writing. Its multilingual and mixed-domain dataset allows flexible adaptation. |
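As a rough sketch, fine-tuning can follow the standard transformers `Trainer` workflow; the dataset file, hyperparameters, and output directory below are placeholders, not recommendations from the Bantam authors:

```python
import bantam.tokenization_bantam  # registers the tokenizer
import bantam.modeling_bantam      # registers the model

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m")
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure the collator can pad

# Placeholder corpus: replace "my_corpus.txt" with your own text data.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bantam-finetuned", bf16=True, num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```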
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
The model is **not** intended for high-stakes or safety-critical domains, such as legal, medical, or financial decision-making. It should not be used for generating misinformation or biased outputs without human oversight. |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
Bantam inherits biases from its multilingual datasets, which include content from the internet, curated knowledge bases, and open-source text corpora. It may underperform on underrepresented languages or dialects. |
|
|
|
|
|
Additionally, as a relatively small model (~285M parameters), **hallucinations and factual inaccuracies are expected**, especially when reasoning beyond the scope of the training data. |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
Users should implement output filtering, content moderation, and continuous evaluation on domain-specific benchmarks to identify and mitigate bias or performance issues. |
|
|
|
|
|
## Model Capabilities |
|
|
|
|
|
Bantam demonstrates strong **multilingual competence** across 55 languages and is capable of generating **informative, coherent, and contextually aware text** in each of them. |
|
|
|
|
|
The model was designed to leverage **many small attention heads in early layers** to capture linguistic and grammatical structures, transitioning to **larger, more abstract reasoning** in later layers. This design improves logical coherence and narrative flow across diverse languages despite the modelβs compact size. |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Before loading the model, install the Bantam CLI: |
|
|
|
|
|
```bash |
|
|
pip install bantam-cli |
|
|
``` |
|
|
|
|
|
If you want to run inference directly via the bantam-cli, you can run: |
|
|
|
|
|
```bash |
|
|
bantam-cli chat --model Theoistic/Bantam-285m |
|
|
``` |
|
|
|
|
|
or initialize the model in Python: |
|
|
|
|
|
```python |
|
|
import bantam # lazy imports |
|
|
import bantam.tokenization_bantam # registers BantamFastTokenizer with AutoTokenizer |
|
|
import bantam.modeling_bantam # registers config/model with AutoConfig/AutoModel |
|
|
|
|
|
|
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained("Theoistic/Bantam-285m") |
|
|
model = AutoModelForCausalLM.from_pretrained("Theoistic/Bantam-285m") |
|
|
|
|
|
|
|
|
prompt = "Once upon a time," |
|
|
inputs = tokenizer(prompt, return_tensors="pt") |
|
|
|
|
|
|
|
|
# Remove unsupported keys (like token_type_ids) before generation |
|
|
if "token_type_ids" in inputs: |
|
|
del inputs["token_type_ids"] |
|
|
|
|
|
|
|
|
outputs = model.generate(**inputs, max_length=100) |
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
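Generation behaviour can be tuned with the usual transformers sampling arguments; the values below are illustrative, not recommended defaults:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.8,   # illustrative value, not a tuned recommendation
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```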
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on the **Bantam Dataset**, a multilingual and multi-domain collection of JSONL files designed to support general-purpose language modeling. It includes content from knowledge bases, refined educational text, tiny fictional stories, and curated data for linguistic diversity. |
|
|
|
|
|
#### Languages Covered |
|
|
|
|
|
The dataset spans **55 languages**, including: |
|
|
|
|
|
* **English** |
|
|
* **Chinese (Mandarin, Wu, Cantonese)** |
|
|
* **Romance:** Spanish, French, Portuguese, Italian, Romanian, Catalan |
|
|
* **Indic & Dravidian:** Hindi, Bengali, Tamil, Telugu, Urdu, Gujarati |
|
|
* **Slavic & Germanic:** Russian, Polish, Czech, German, Danish, Swedish, Norwegian, Dutch, Faroese |
|
|
* **Others:** Arabic, Hebrew, Amharic, Turkish, Finnish, Korean, Japanese, Swahili, Vietnamese, Thai, Greek, Persian, and more. |
|
|
|
|
|
*Note: The dataset itself is not publicly released; this summary represents the linguistic and structural diversity of the data used for training.* |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
|
|
|
A large portion of the training data was deduplicated, normalized, and categorized by field of study and language, then preprocessed into feature-rich, dense articles: larger models were used to reformat the source text into concise, detailed Markdown articles. The resulting millions of articles were shuffled across languages so that larger domains or dominant linguistic features did not overshadow the rest, and so that low-resource languages were not lost to catastrophic forgetting.
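A minimal sketch of the kind of cross-language shuffling described above; the record fields (`text`, `lang`) are assumptions, since the actual dataset schema is not published:

```python
import random
from collections import Counter

def build_training_stream(records, seed=0):
    """Mix articles from all languages into a single shuffled stream.

    Assumed record shape: {"text": ..., "lang": ...}; purely illustrative.
    """
    rng = random.Random(seed)
    stream = list(records)
    rng.shuffle(stream)  # a global shuffle interleaves languages proportionally

    # Sanity check: confirm low-resource languages are represented at all.
    counts = Counter(rec["lang"] for rec in stream)
    for lang, n in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{lang}: {n} articles")
    return stream
```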
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* **Parameters:** 285 million |
|
|
* **Precision:** bfloat16 mixed precision |
|
|
* **Optimizer:** AdamW with weight decay |
|
|
* **Batch size:** 2048 tokens per GPU |
|
|
* **Learning rate schedule:** Cosine decay with warmup (see the sketch after this list) |
|
|
* **Context length:** 2048 tokens |
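A minimal sketch of the optimizer and learning-rate schedule listed above, built from standard PyTorch and transformers utilities; the learning rate, weight decay, and step counts are placeholders, and `model` / `train_dataloader` are assumed to be set up as in any causal-LM training loop:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder values; the actual settings used for Bantam are not published.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=100_000,
)

for step, batch in enumerate(train_dataloader):
    # bfloat16 mixed precision for the forward pass
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```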
|
|
|
|
|
#### Speeds, Sizes, Times |
|
|
|
|
|
* **Training hardware:** NVIDIA RTX 5090 |
|
|
* **Training duration:** ~50 hours |
|
|
* **Checkpoint size:** ~285M parameters |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Bantam is a **pretrained base model**; it has not been instruction-tuned or benchmarked against external metrics. Qualitatively, it exhibits: |
|
|
|
|
|
* Strong multilingual understanding and generation across 55 languages. |
|
|
* Coherent reasoning and informative responses. |
|
|
* Expected hallucinations due to small model size. |
|
|
|
|
|
No quantitative metrics or interpretability visualizations (e.g., heatmaps, probing, or evaluation suites) have been produced yet. |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
* **Hardware Type:** NVIDIA RTX 5090 |
|
|
* **Hours used:** ~50 |
|
|
* **Cloud Provider:** Local compute |
|
|
* **Compute Region:** N/A (local training) |
|
|
* **Carbon Emitted:** Estimated <0.05 tCO₂eq |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
| Layer | Type  | Query Heads | KV Heads | Head Dim | Groups | Intermediate Size | Window | MoE | Notes                           |
| ----- | ----- | ----------- | -------- | -------- | ------ | ----------------- | ------ | --- | ------------------------------- |
| 0     | Dense | 12          | 3        | 64       | 1      | 2304              | 128    | ❌  | Dense local linguistic encoding |
| 1     | Dense | 12          | 3        | 64       | 1      | 2304              | 128    | ❌  | Dense local linguistic encoding |
| 2     | Dense | 12          | 3        | 64       | 1      | 2368              | 128    | ❌  | Dense local attention           |
| 3     | Dense | 12          | 3        | 64       | 1      | 2400              | –      | ❌  | Transition layer                |
| 4     | MoE   | (6+3)       | 3        | (80/96)  | 2      | 2432              | 256    | ✅  | 6 experts, top-2 routing        |
| 5     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2368              | 256    | ❌  | Hybrid attention                |
| 6     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2432              | 256    | ❌  | Hybrid attention                |
| 7     | Dense | (6+3)       | 3        | (80/96)  | 2      | 2368              | –      | ❌  | Expanding context               |
| 8     | Dense | 9           | 3        | 64/128   | 2      | 2304              | 256    | ❌  | Default grouped attention       |
| 9     | Dense | 9           | 3        | 64/128   | 2      | 2368              | 256    | ❌  | Default grouped attention       |
| 10    | Dense | 9           | 3        | 64/128   | 2      | 2400              | 256    | ❌  | Default grouped attention       |
| 11    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 256    | ❌  | Default grouped attention       |
| 12    | Dense | 9           | 3        | 64/128   | 2      | 2432              | –      | ❌  | Expanding context               |
| 13    | Dense | 9           | 3        | 64/128   | 2      | 2400              | 512    | ❌  | Logical attention expansion     |
| 14    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 512    | ❌  | Logical attention expansion     |
| 15    | Dense | 9           | 3        | 64/128   | 2      | 2432              | 512    | ❌  | Logical attention expansion     |
| 16    | MoE   | 9           | 3        | 64/128   | 2      | 2432              | 512    | ✅  | 8 experts, top-2 routing        |
| 17    | MoE   | 9           | 3        | 64/128   | 2      | 2432              | –      | ✅  | 8 experts, top-2 routing        |
| 18    | Dense | 9           | 3        | 64/128   | 2      | 2368              | 512    | ❌  | Output stabilization            |
| 19    | Dense | 9           | 3        | 64/128   | 2      | 2400              | –      | ❌  | Final dense layer               |
|
|
|
|
|
### Attention Group Defaults |
|
|
|
|
|
| Group | Query Heads | KV Heads | Head Dim | |
|
|
| ------- | ----------- | -------- | -------- | |
|
|
| Group 1 | 3 | 1 | 128 | |
|
|
| Group 2 | 6 | 2 | 64 | |
|
|
|
|
|
These defaults apply to all layers unless explicitly overridden in layer-specific configurations. |
|
|
|
|
|
* **Objective:** Causal next-token prediction |
|
|
* **Routing:** Top-2 expert routing with a load-balancing loss coefficient of 0.01 (see the sketch below) |
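The routing mechanism can be illustrated with a small sketch of top-2 expert routing plus a Switch-style load-balancing auxiliary loss. This is not the Bantam implementation itself, only a generic illustration of the technique using the 0.01 coefficient from the bullet above; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden_size, intermediate_size, num_experts, aux_loss_coef=0.01):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        self.aux_loss_coef = aux_loss_coef

    def forward(self, x):                       # x: (tokens, hidden)
        logits = self.router(x)                 # (tokens, num_experts)
        probs = logits.softmax(dim=-1)
        top2_probs, top2_idx = probs.topk(2, dim=-1)
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)

        # Dispatch each token to its two selected experts and mix the outputs.
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, slot] == e
                if mask.any():
                    out[mask] += top2_probs[mask, slot, None] * expert(x[mask])

        # Load-balancing loss: encourage uniform expert usage (Switch-style).
        num_experts = len(self.experts)
        frac_tokens = F.one_hot(top2_idx, num_experts).float().mean(dim=(0, 1))
        frac_probs = probs.mean(dim=0)
        aux_loss = self.aux_loss_coef * num_experts * (frac_tokens * frac_probs).sum()
        return out, aux_loss
```

In a sketch like this, the auxiliary loss would be added to the language-modeling loss during training; how Bantam combines the two internally is not documented here.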
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
* 1 Γ NVIDIA RTX 5090 GPU |
|
|
|
|
|
#### Software |
|
|
|
|
|
* PyTorch 2.8 |
|
|
* Transformers >=4.41 |
|
|
* Bantam CLI (required for import registration) |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
* **Theodor Solbjorg** β Lead Developer, Theoistic |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For inquiries: [theo@theoistic.com](mailto:theo@theoistic.com) |
|
|
|