---
license: mit
language:
  - en
tags:
  - text-generation
  - gpt
  - story-generation
  - sft
  - educational
library_name: custom
pipeline_tag: text-generation
---

# miniJBrain-Story-SFT-v0.1

`miniJBrain-Story-SFT-v0.1` is a small GPT-style causal language model fine-tuned to generate short, gentle, children's-story-style text.

This release is part of the `miniJBrain` learning project, which covers:

- tokenizer training
- base pretraining
- supervised fine-tuning (SFT)
- lightweight style alignment
- checkpoint export and open release

Official code repository:

- `https://github.com/chongliujia/miniJBrain`

## Model Details

- Model family: custom GPT-style causal LM
- Release checkpoint: `minij_chat_story_stage2p1`
- Main use case: short story and bedtime-style text generation
- Vocabulary size: `32,000`
- Context length: `1,024`
- Layers: `16`
- Attention heads: `16`
- Embedding size: `1,024`
- Weights format: `safetensors`

This checkpoint was selected as the most balanced story-oriented SFT result in the project. It performed better overall than narrower bedtime-only variants.

## Repository Contents

This directory is an exported model package. It includes:

- `model.safetensors`
- `config.json`
- `tokenizer.json`
- `generation_config.json`
- `inference.py`
- `README.md`

Important: this is not a zero-code Hugging Face `transformers` package. The weights are present and usable, but the architecture is defined by the `miniJBrain` codebase rather than a standard `AutoModelForCausalLM` config.

## How To Use

The recommended way to run this model is to use the original `miniJBrain` model code:

- `https://github.com/chongliujia/miniJBrain`

### Option 1: Run the local inference script

If you do not already have the model code, clone it first:

```bash
git clone https://github.com/chongliujia/miniJBrain.git
```

If this model directory sits next to the cloned `miniJBrain` project directory, you can run:

```bash
python inference.py \
  --device cpu \
  --prompt $'User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n' \
  --max_new_tokens 220 \
  --temperature 0.50 \
  --top_k 50 \
  --top_p 0.95 \
  --repetition_penalty 1.06
```

By default, `inference.py` loads:

- `./model.safetensors`
- `./config.json`
- `./tokenizer.json`
- `../miniJBrain` as the model-code directory

If your `miniJBrain` checkout lives elsewhere:

```bash
python inference.py --minijbrain-root /path/to/miniJBrain
```

### Option 2: Load it in your own Python code

Use the real `miniJBrain` model definition from `model/gpt.py` in the official repository:

```python
import json
import sys
from pathlib import Path

import torch
from safetensors.torch import load_file
from tokenizers import Tokenizer

minijbrain_root = Path("/path/to/miniJBrain")
sys.path.insert(0, str(minijbrain_root))

from model.gpt import GPT, GPTConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

with open("config.json", "r", encoding="utf-8") as f:
    raw_config = json.load(f)

model = GPT(GPTConfig(**raw_config)).to(device)
state_dict = load_file("model.safetensors")

# The exported safetensors file keeps tied weights through lm_head.weight.
if "transformer.wte.weight" not in state_dict and "lm_head.weight" in state_dict:
    state_dict["transformer.wte.weight"] = state_dict["lm_head.weight"]

model.load_state_dict(state_dict)
model.eval()

tokenizer = Tokenizer.from_file("tokenizer.json")
prompt = "User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n"
input_ids = torch.tensor(
    [tokenizer.encode(prompt, add_special_tokens=False).ids],
    dtype=torch.long,
    device=device,
)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=220,
        temperature=0.50,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.06,
        eos_token_id=tokenizer.token_to_id("<eos>"),
        stop_on_eos=True,
    )

text = tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True)
print(text)
```

### What does not work out of the box

This repository does not yet support direct loading like:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-repo-name")
```

This fails because the repository does not yet provide a standard `transformers` architecture definition, a registered `model_type`, or compatible modeling code.

## Prompt Format

The model works best with the chat-style prompt format used during SFT:

```text
User:
Tell me a warm short bedtime story before sleep.

Assistant:
```

It generally responds best when the prompt is short, explicit, and clearly story-oriented.
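For scripting, the format above can be produced with a small helper. The `build_prompt` name is just for illustration here and is not part of the `miniJBrain` codebase:

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the chat-style format used during SFT."""
    return f"User:\n{user_message}\n\nAssistant:\n"


prompt = build_prompt("Tell me a warm short bedtime story before sleep.")
print(prompt)
```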

## Recommended Decoding

Suggested defaults from `generation_config.json`:

```text
max_new_tokens = 220
temperature = 0.50
top_k = 50
top_p = 0.95
repetition_penalty = 1.06
```
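A minimal sketch of applying these defaults in your own code, assuming `generation_config.json` uses the same key names as the parameters listed above (the actual layout of the file in this package may differ, so the hard-coded fallbacks mirror the values printed here):

```python
import json
from pathlib import Path

# Fallback decoding defaults, copied from the values listed in this card.
defaults = {
    "max_new_tokens": 220,
    "temperature": 0.50,
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.06,
}

# If the shipped generation_config.json is present, let it override the
# fallbacks (assumes its keys match the parameter names above).
config_path = Path("generation_config.json")
if config_path.exists():
    defaults.update(json.loads(config_path.read_text(encoding="utf-8")))

# These kwargs can then be passed to the model's generate() call, e.g.:
# output_ids = model.generate(input_ids, **defaults)
```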

## Intended Use

This model is intended for:

- educational demonstration of small-LLM training and release
- toy story generation experiments
- prompt-format experiments
- decoding experiments on a compact custom LM
- studying story-heavy SFT behavior

## Out-of-Scope Use

This model is not intended for:

- factual question answering
- safety-critical applications
- production child-facing systems
- high-reliability assistant behavior
- benchmark-oriented comparison with modern instruction models

## Training Summary

This release comes from the `stage2p1` story-SFT experiment in the broader `miniJBrain` project.

High-level training path:

1. train tokenizer
2. pretrain a small GPT-style base model
3. run instruction/story SFT
4. build a story-heavy second-stage SFT mixture
5. select the most balanced checkpoint for release

The final checkpoint was chosen because it retained better prompt following and more stable generation than narrower bedtime-specialized runs.

## Data Summary

The broader `miniJBrain` SFT experiments used locally prepared prompt/response data assembled from public sources, including:

- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`
- `Open-Orca/OpenOrca`
- `openai/gsm8k`
- `roneneldan/TinyStories`

For this release checkpoint, the most important SFT sources were:

- `roneneldan/TinyStories`
- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`

Approximate `stage2p1` training composition:

- story samples: `120,000`
- UltraChat-derived samples: `17,839`
- Dolly-derived samples: `3,337`

Approximate validation composition:

- story samples: `6,000`
- UltraChat-derived samples: `1,059`

This puts the final SFT mix at roughly:

- `85%` story-style data
- `15%` chat/instruction-style data
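
The split above can be checked directly from the training sample counts listed earlier:

```python
# stage2p1 training composition from this card.
story = 120_000
ultrachat = 17_839
dolly = 3_337

total = story + ultrachat + dolly
story_share = story / total
chat_share = (ultrachat + dolly) / total

print(f"story: {story_share:.1%}, chat/instruction: {chat_share:.1%}")
# prints "story: 85.0%, chat/instruction: 15.0%"
```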

That balance was selected because pure story-only tuning narrowed prompt generalization too much, while a story-heavy mix with some chat data produced more stable behavior.

Dataset note: before any formal redistribution claims, upstream dataset licenses and usage restrictions should be reviewed source by source.

## Limitations

Known limitations include:

- repeated story structure and character patterns
- frequent reuse of certain names and motifs
- generic story arcs
- style instability across prompt phrasings
- occasional abrupt endings with short decoding limits
- imperfect specialization for bedtime-only prompts

## Release Notes

This is a learning-project release, not a benchmark-optimized or production-tuned model.

The main publication goal is transparency around:

- how the data was formatted
- how story-heavy SFT was performed
- how custom GPT-style checkpoints were exported
- how a small custom model can be shared openly