---
license: mit
language:
- en
tags:
- text-generation
- gpt
- story-generation
- sft
- educational
library_name: custom
pipeline_tag: text-generation
---
# miniJBrain-Story-SFT-v0.1
`miniJBrain-Story-SFT-v0.1` is a small GPT-style causal language model fine-tuned to generate short, gentle, children's-story-style text.
This release is part of the `miniJBrain` learning project, which covers:
- tokenizer training
- base pretraining
- supervised fine-tuning (SFT)
- lightweight style alignment
- checkpoint export and open release
Official code repository:
- `https://github.com/chongliujia/miniJBrain`
## Model Details
- Model family: custom GPT-style causal LM
- Release checkpoint: `minij_chat_story_stage2p1`
- Main use case: short story and bedtime-style text generation
- Vocabulary size: `32,000`
- Context length: `1,024`
- Layers: `16`
- Attention heads: `16`
- Embedding size: `1,024`
- Weights format: `safetensors`
This checkpoint was selected as the most balanced story-oriented SFT result in the project. It performed better overall than narrower bedtime-only variants.
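The hyperparameters above imply a rough parameter budget. The sketch below is a back-of-envelope estimate only: it assumes a standard GPT-2-style transformer block (about 12·d² weights per layer), a learned positional embedding, and tied input/output embeddings, any of which may differ in the actual miniJBrain layout.

```python
# Rough parameter count from the model-card hyperparameters.
# Assumptions (not confirmed by the miniJBrain code): GPT-2-style blocks
# (QKV + output projection = 4*d^2, two MLP matrices = 8*d^2), learned
# positional embeddings, and tied token embedding / lm_head weights.
VOCAB = 32_000
CONTEXT = 1_024
LAYERS = 16
D_MODEL = 1_024

token_emb = VOCAB * D_MODEL          # 32,768,000 (shared with lm_head via tying)
pos_emb = CONTEXT * D_MODEL          # 1,048,576
blocks = LAYERS * 12 * D_MODEL ** 2  # 201,326,592

total = token_emb + pos_emb + blocks
print(f"~{total / 1e6:.0f}M parameters")  # ~235M parameters
```

So under these assumptions the checkpoint sits in the roughly 235M-parameter range, which is consistent with a learning-scale model.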
## Repository Contents
This directory is an exported model package. It includes:
- `model.safetensors`
- `config.json`
- `tokenizer.json`
- `generation_config.json`
- `inference.py`
- `README.md`
Important: this is not a zero-code Hugging Face `transformers` package. The weights are present and usable, but the architecture is defined by the `miniJBrain` codebase rather than a standard `AutoModelForCausalLM` config.
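Since the package must be wired into custom code, it can help to verify the export is complete before loading anything. A minimal sketch (the helper name is ours, not part of the release):

```python
from pathlib import Path

# Files listed in the "Repository Contents" section of this model card.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "generation_config.json",
    "inference.py",
]

def missing_files(package_dir="."):
    """Return the expected package files that are absent from package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

Running `missing_files()` from inside the model directory should return an empty list for a complete download.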
## How To Use
The recommended way to run this model is to use the original `miniJBrain` model code:
- `https://github.com/chongliujia/miniJBrain`
### Option 1: Run the local inference script
If you do not already have the model code, clone it first:
```bash
git clone https://github.com/chongliujia/miniJBrain.git
```
If this model directory sits next to the cloned `miniJBrain` project directory, you can run:
```bash
python inference.py \
  --device cpu \
  --prompt $'User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n' \
  --max_new_tokens 220 \
  --temperature 0.50 \
  --top_k 50 \
  --top_p 0.95 \
  --repetition_penalty 1.06
```
By default, `inference.py` loads:
- `./model.safetensors`
- `./config.json`
- `./tokenizer.json`
- `../miniJBrain` as the model-code directory
If your `miniJBrain` checkout lives elsewhere:
```bash
python inference.py --minijbrain-root /path/to/miniJBrain
```
### Option 2: Load it in your own Python code
Use the real `miniJBrain` model definition from `model/gpt.py` in the official repository:
```python
import json
import sys
from pathlib import Path

import torch
from safetensors.torch import load_file
from tokenizers import Tokenizer

# Make the miniJBrain model definition importable.
minijbrain_root = Path("/path/to/miniJBrain")
sys.path.insert(0, str(minijbrain_root))

from model.gpt import GPT, GPTConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

with open("config.json", "r", encoding="utf-8") as f:
    raw_config = json.load(f)

model = GPT(GPTConfig(**raw_config)).to(device)

state_dict = load_file("model.safetensors")
# The exported safetensors file keeps tied weights through lm_head.weight.
if "transformer.wte.weight" not in state_dict and "lm_head.weight" in state_dict:
    state_dict["transformer.wte.weight"] = state_dict["lm_head.weight"]
model.load_state_dict(state_dict)
model.eval()

tokenizer = Tokenizer.from_file("tokenizer.json")

prompt = "User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n"
input_ids = torch.tensor(
    [tokenizer.encode(prompt, add_special_tokens=False).ids],
    dtype=torch.long,
    device=device,
)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=220,
        temperature=0.50,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.06,
        eos_token_id=tokenizer.token_to_id("<eos>"),
        stop_on_eos=True,
    )

text = tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True)
print(text)
```
### What does not work out of the box
This repository does not yet support direct loading like:
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("your-repo-name")
```
That will not work yet because this repository does not provide a standard `transformers` architecture definition, `model_type`, or compatible modeling code.
## Prompt Format
The model works best with the chat-style prompt format used during SFT:
```text
User:
Tell me a warm short bedtime story before sleep.

Assistant:
```
It generally responds best when the prompt is short, explicit, and clearly story-oriented.
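For programmatic use, this format can be wrapped in a tiny helper. The function name here is illustrative, not part of the miniJBrain API; note the blank line between the user turn and `Assistant:`, which matches the prompts used in the examples above.

```python
def build_story_prompt(user_message: str) -> str:
    """Wrap a user request in the chat format used during SFT."""
    return f"User:\n{user_message}\n\nAssistant:\n"

prompt = build_story_prompt("Tell me a warm short bedtime story before sleep.")
```

The resulting string can be tokenized and fed to the model exactly as in the Option 2 example.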
## Recommended Decoding
Suggested defaults from `generation_config.json`:
```text
max_new_tokens = 220
temperature = 0.50
top_k = 50
top_p = 0.95
repetition_penalty = 1.06
```
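In custom loading code, these defaults can be read back from `generation_config.json` with a documented fallback. A sketch, assuming the file uses the same key names as listed above (the helper name is ours):

```python
import json

# Decoding defaults as documented in this model card. Key names in the
# shipped generation_config.json are assumed to match; verify before relying
# on this mapping.
DEFAULTS = {
    "max_new_tokens": 220,
    "temperature": 0.50,
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.06,
}

def load_decoding_params(path="generation_config.json"):
    """Merge any on-disk generation config over the documented defaults."""
    params = dict(DEFAULTS)
    try:
        with open(path, "r", encoding="utf-8") as f:
            on_disk = json.load(f)
        params.update({k: v for k, v in on_disk.items() if k in DEFAULTS})
    except FileNotFoundError:
        pass  # fall back to the documented defaults
    return params
```

The returned dictionary can be unpacked straight into `model.generate(**params, ...)` in the Option 2 example.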
## Intended Use
This model is intended for:
- educational demonstration of small-LLM training and release
- toy story generation experiments
- prompt-format experiments
- decoding experiments on a compact custom LM
- studying story-heavy SFT behavior
## Out-of-Scope Use
This model is not intended for:
- factual question answering
- safety-critical applications
- production child-facing systems
- high-reliability assistant behavior
- benchmark-oriented comparison with modern instruction models
## Training Summary
This release comes from the `stage2p1` story-SFT experiment in the broader `miniJBrain` project.
High-level training path:
1. train tokenizer
2. pretrain a small GPT-style base model
3. run instruction/story SFT
4. build a story-heavy second-stage SFT mixture
5. select the most balanced checkpoint for release
The final checkpoint was chosen because it retained better prompt following and more stable generation than narrower bedtime-specialized runs.
## Data Summary
The broader `miniJBrain` SFT experiments used locally prepared prompt/response data assembled from public sources, including:
- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`
- `Open-Orca/OpenOrca`
- `openai/gsm8k`
- `roneneldan/TinyStories`
For this release checkpoint, the most important SFT sources were:
- `roneneldan/TinyStories`
- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`
Approximate `stage2p1` training composition:
- story samples: `120,000`
- UltraChat-derived samples: `17,839`
- Dolly-derived samples: `3,337`
Approximate validation composition:
- story samples: `6,000`
- UltraChat-derived samples: `1,059`
This puts the final SFT mix at roughly:
- `85%` story-style data
- `15%` chat/instruction-style data
That balance was selected because pure story-only tuning narrowed prompt generalization too much, while a story-heavy mix with some chat data produced more stable behavior.
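The quoted split can be sanity-checked directly from the training sample counts listed above:

```python
# stage2p1 training composition from this model card.
train_mix = {
    "story": 120_000,      # TinyStories-derived story samples
    "ultrachat": 17_839,   # UltraChat-derived samples
    "dolly": 3_337,        # Dolly-derived samples
}

total = sum(train_mix.values())
story_pct = 100 * train_mix["story"] / total
chat_pct = 100 * (train_mix["ultrachat"] + train_mix["dolly"]) / total

print(f"story: {story_pct:.1f}%, chat: {chat_pct:.1f}%")  # story: 85.0%, chat: 15.0%
```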
Dataset note: before any formal redistribution claims, upstream dataset licenses and usage restrictions should be reviewed source by source.
## Limitations
Known limitations include:
- repeated story structure and character patterns
- frequent reuse of certain names and motifs
- generic story arcs
- style instability across prompt phrasings
- occasional abrupt endings with short decoding limits
- imperfect specialization for bedtime-only prompts
## Release Notes
This is a learning-project release, not a benchmark-optimized or production-tuned model.
The main publication goal is transparency around:
- how the data was formatted
- how story-heavy SFT was performed
- how custom GPT-style checkpoints were exported
- how a small custom model can be shared openly