---
language:
- en
tags:
- text-generation
- llama
- pytorch
- causal-lm
- small-language-model
- slm
- 142m
- educational
- fanfiction
- academic
- base-model
- english
- rtx4090
- apache-2.0
- continual-pre-training
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/sciphi-textbooks
library_name: transformers
pipeline_tag: text-generation
inference:
  parameters:
    temperature: 0.7
    top_p: 0.9
    max_new_tokens: 150
base_model:
- StentorLabs/Stentor-30M
---

# Model Card: Stentor-Big

Stentor-Big is a direct expansion of the original [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) model developed by Kai Izumoto (StentorLabs). The architecture was scaled up from 30M to 142M parameters by increasing the hidden size, the number of layers, and the intermediate dimensions while preserving the pre-trained weights where possible. This approach lets the model retain the linguistic foundations learned by its smaller counterpart while gaining additional capacity through new, randomly initialized layers.

## Model Description

Stentor-Big is a compact language model with 142 million parameters, built upon the Llama architecture. It is the result of a three-stage continual pre-training process designed to combine broad linguistic competence, narrative coherence, and structured academic style. The model is intended as a strong base for further fine-tuning or for direct use in educational text generation, creative writing assistance, and prototyping of small-scale language applications.

**Developed by:** stas122
**Model type:** Causal language model (`LlamaForCausalLM`)
**Language:** English
**Parameters:** 142,639,104 (142.6M)
**Context length:** 512 tokens
**License:** Apache 2.0 (the original base model, Stentor-30M, is also Apache 2.0)

---

## Model Details

| Hyperparameter | Value |
|---------------------------|--------------|
| Hidden size | 512 |
| Intermediate size | 2048 |
| Number of layers | 30 |
| Number of attention heads | 8 |
| Head dimension | 64 |
| Vocabulary size | 32768 |
| Max position embeddings | 512 |
| Tie word embeddings | True |
| Activation function | SiLU (Swish) |

The architecture follows the standard LLaMA design, with pre-RMSNorm, rotary positional embeddings (RoPE), and SwiGLU activation in the MLP.

---

## Training Data

The model was trained in three distinct stages, each using a different corpus to shape its capabilities.

| Stage | Dataset | Tokens | Purpose |
|-------|---------|--------|---------|
| 1 | FineWeb (educational subset) | 279M | Establish general linguistic knowledge, grammar, and basic facts. |
| 2 | Custom curated fanfiction corpus | 1.03B | Develop narrative flow, dialogue, and literary coherence. |
| 3 | Mixed educational corpus (FineWeb-Edu + Sciphi Textbooks) | 1.02B | Enhance academic style, structured exposition, and scientific reasoning. |

**Total tokens seen during training:** ~2.33 billion, roughly 81% of the Chinchilla-optimal token budget for a 142M-parameter model (≈2.88B tokens).

All datasets were pre-tokenized with a context window of 512 tokens, using the same tokenizer as Stentor-30M (vocabulary size 32768). Each training example is a chunk of 511 tokens followed by an end-of-sequence token.
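The preprocessing scripts are not published alongside this card, so the snippet below is only a minimal sketch of the chunking scheme just described; the function name, the drop-the-short-tail policy, and the use of this repository's tokenizer are illustrative assumptions.

```python
from transformers import AutoTokenizer

# Illustrative sketch of the 511-token + EOS chunking described above;
# the actual preprocessing pipeline is not part of this repository.
tokenizer = AutoTokenizer.from_pretrained("stas122/Stentor-Big")
CHUNK_LEN = 511  # leaves room for one EOS token inside the 512-token window

def chunk_document(text: str) -> list[list[int]]:
    """Tokenize a raw document and split it into fixed-length training examples."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    examples = []
    for start in range(0, len(ids), CHUNK_LEN):
        chunk = ids[start:start + CHUNK_LEN]
        if len(chunk) < CHUNK_LEN:
            break  # drop the short tail (an assumed policy; padding is another option)
        examples.append(chunk + [tokenizer.eos_token_id])  # 511 tokens + EOS = 512
    return examples
```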
---

## Training Procedure

### Stage 1 (FineWeb)
- **Learning rate:** 2e-4
- **Effective batch size:** 128
- **Steps:** ~9,500
- **Hardware:** 1× RTX 4090

### Stage 2 (Fanfiction)
- **Learning rate:** 2e-4
- **Effective batch size:** 128
- **Steps:** ~6,500
- **Hardware:** 2× RTX 4090 (DDP)

### Stage 3 (Educational Mix)
- **Learning rate:** 1.5e-4 (reduced to preserve previously learned style)
- **Effective batch size:** 192
- **Steps:** ~7,171
- **Hardware:** 2× RTX 4090 (DDP)

**Common hyperparameters:** AdamW optimizer, cosine learning rate schedule with warmup (5% of total steps), gradient clipping (1.0), weight decay (0.01), BF16 mixed precision, gradient checkpointing enabled.
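For orientation, the common settings above roughly correspond to the following Hugging Face `TrainingArguments`. The numeric values are taken from this card, but the per-device batch size and gradient-accumulation split are assumptions; the snippet is an illustrative reconstruction, not the actual training script.

```python
from transformers import TrainingArguments

# Illustrative reconstruction of the Stage 3 settings listed above;
# the real training scripts are not published with this card.
training_args = TrainingArguments(
    output_dir="stentor-big-stage3",
    per_device_train_batch_size=16,  # assumed split: 16 x 6 accumulation x 2 GPUs = 192
    gradient_accumulation_steps=6,
    learning_rate=1.5e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,               # warmup over 5% of total steps
    max_grad_norm=1.0,               # gradient clipping
    weight_decay=0.01,
    optim="adamw_torch",             # AdamW optimizer
    bf16=True,
    gradient_checkpointing=True,
)
```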
---

## Evaluation Results

The model was evaluated on a custom **master test** consisting of 100 prompts across 10 categories, designed to probe factual knowledge, reasoning, creativity, and domain adaptation. Each prompt is a sentence fragment that the model continues; temperature and the number of new tokens were varied per category to obtain meaningful outputs.

| Category | Rating | Observations |
|----------|--------|--------------|
| Science (biology, physics) | Excellent | Accurate terminology, structured explanations (e.g., photosynthesis, black holes). |
| Philosophy & abstract thought | Excellent | Coherent, essay-like discussions of the meaning of life, consciousness, justice. |
| Interdisciplinary topics | Excellent | Synthesizes concepts from multiple domains; good use of examples. |
| Literature & arts | Moderate | Surface-level knowledge, sometimes generic or repetitive. |
| Geography & facts | Moderate | Often produces plausible but imprecise statements; capitals and dates are weak. |
| Programming (Python) | Moderate | Recognises libraries and basic constructs, but generated code is often broken. |
| Mathematics | Poor | Frequent errors in elementary arithmetic and geometry; repetitive patterns. |

### Full Evaluation Results

The complete set of 100 test prompts, along with detailed generation outputs for each category and temperature setting, is available in the repository:

📊 **[Evaluation Results](master_test.txt)** – full generation log with per-prompt temperature and token limits.

### Representative Generations

**Good example – science**

*Prompt:* `Photosynthesis is the process by which plants`

*Generation:*
> convert sunlight, water, and carbon dioxide into glucose and oxygen. This process is essential for the survival of all life on Earth.
>
> 3. The light-dependent reactions of photosynthesis involve the conversion of light energy into chemical energy, which is then used to power the plant's growth and development.

**Good example – philosophy**

*Prompt:* `The meaning of life is a question that`

*Generation:*
> has been debated throughout history. It is the question that has been debated for centuries, with the philosophers of Aristotle and Plato contributing to the field.
>
> III. The Socratic method in philosophy is a method of questioning the very foundations of life. It is based on the belief that life is a form of divine truth, and that individuals must confront the unknown and make meaningful choices that shape their existence.

**Weak example – factual error**

*Prompt:* `The capital of France is`

*Generation:*
> a complex and intricate system, with a rich history that spans thousands of years. The empire was a powerful force in the world of politics, and it was during this time that the empire began

**Weak example – mathematical mistake**

*Prompt:* `The square root of 64 is`

*Generation:*
> 12.
>
> In conclusion, the 1st Grade Mathematics: Learning Addition and Subtraction through Fun Math Games chapter aims to provide a comprehensive and engaging approach to teaching these fundamental mathematical concepts.

### Comparison with Stentor-30M

On the same science prompt, the improvement in coherence and factual accuracy is evident:

**Stentor-30M:**
> Photosynthesis is the process by which plants and animals develop. It is the process of growth and development of the organism. The process of growth is the process of growth.

**Stentor-Big:**
> Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. This process is essential for the survival of all life on Earth.

---

## Capabilities and Limitations

### Capabilities
- Generate fluent, well-structured paragraphs in academic, scientific, and narrative styles.
- Produce coherent continuations for prompts requiring explanation, analysis, or creative expansion.
- Serve as a strong foundation for further fine-tuning on domain-specific tasks (e.g., question answering, summarisation, story generation).
- Operate efficiently on consumer hardware (e.g., RTX 4090) and can be quantized for edge deployment (a minimal quantization sketch follows the usage example below).

### Limitations
- **Factual accuracy is not guaranteed:** the model often produces plausible-sounding but incorrect facts, especially for dates, numbers, and proper names.
- **Weak at mathematics and programming:** elementary arithmetic is frequently wrong, and generated code is rarely executable.
- **Limited context window (512 tokens):** not suitable for tasks requiring long-range dependencies.
- **Not instruction-tuned:** prompts are treated as continuations; the model does not reliably follow commands or engage in dialogue.
- **May exhibit repetitive or degenerate outputs** on certain prompts (e.g., recursive definitions, simple primes).
- **Potential biases inherited from training data** (web-crawled and synthetic corpora); use with appropriate caution.

---

## Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "stas122/Stentor-Big"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Choose an appropriate dtype for the available hardware
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=dtype,
    attn_implementation="sdpa",  # efficient scaled-dot-product attention
)

# Move to GPU if available
if torch.cuda.is_available():
    model = model.to("cuda")

prompt = "Photosynthesis is the process by which plants"
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=80,
    temperature=0.5,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
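As noted under Capabilities, the model is small enough to be quantized for edge deployment. The snippet below is one possible illustration, assuming the optional `bitsandbytes` and `accelerate` packages are installed and a CUDA device is available; it is not tooling shipped with this repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hedged example: 8-bit loading through the optional bitsandbytes integration.
# Requires `pip install bitsandbytes accelerate` and a CUDA-capable GPU.
model_name = "stas122/Stentor-Big"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # place the quantized weights on the available device
)
```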
---

## Ethical Considerations and Recommended Use

Stentor-Big is a research prototype. It is **not** aligned with human preferences and may generate content that is factually incorrect, biased, or otherwise unsuitable for production environments. Users should validate all outputs, especially in educational or decision-making contexts. The model should not be deployed in applications where harm could arise from erroneous or misleading text.

Recommended uses:
- Educational tool for exploring language model behaviour.
- Creative writing aid (with human oversight).
- Starting point for fine-tuning on specialised corpora.
- Lightweight experimentation in resource-constrained settings.

---

## Citation

If you use this model in your work, please cite it as:

```bibtex
@misc{stentor-big,
  author       = {stas122},
  title        = {Stentor-Big: A 142M Parameter Language Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/stas122/Stentor-Big}}
}
```

---

## Acknowledgements

I would like to express my deepest gratitude to **Kai Izumoto (StentorLabs)**. His work on the original [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) model not only served as the foundation for this project but also sparked my initial interest in deep learning and ultimately inspired me to embark on this journey of language model research. This work stands on the shoulders of his contribution to the open-source community.

- Hugging Face for the transformers and datasets libraries.
- The creators of FineWeb, FineWeb-Edu, Cosmopedia v2, and Sciphi Textbooks.
- The open-source community for enabling accessible NLP research.
- DeepSeek for insightful discussions and assistance with theoretical aspects of model architecture, training strategies, and evaluation methodologies.
- [Immers.cloud](https://immers.cloud) for providing reliable GPU infrastructure (NVIDIA RTX 4090) that made the extensive training experiments possible.
- The MLP fan community for their creations.