---
license: mit
language:
- en
tags:
- text-generation
- gpt
- story-generation
- sft
- educational
library_name: custom
pipeline_tag: text-generation
---
# miniJBrain-Story-SFT-v0.1
`miniJBrain-Story-SFT-v0.1` is a small GPT-style causal language model fine-tuned to generate short, gentle, children's-story-style text.
This release is part of the `miniJBrain` learning project, which covers:
- tokenizer training
- base pretraining
- supervised fine-tuning (SFT)
- lightweight style alignment
- checkpoint export and open release
Official code repository:
- `https://github.com/chongliujia/miniJBrain`
## Model Details
- Model family: custom GPT-style causal LM
- Release checkpoint: `minij_chat_story_stage2p1`
- Main use case: short story and bedtime-style text generation
- Vocabulary size: `32,000`
- Context length: `1,024`
- Layers: `16`
- Attention heads: `16`
- Embedding size: `1,024`
- Weights format: `safetensors`
This checkpoint was selected as the most balanced story-oriented SFT result in the project. It performed better overall than narrower bedtime-only variants.
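The hyperparameters above imply a rough parameter budget. The sketch below is a back-of-envelope estimate only: it assumes a standard GPT-2-style transformer block (about 12·d² weights per layer), a learned positional embedding, and tied input/output embeddings, any of which may differ in the actual miniJBrain layout.

```python
# Rough parameter count from the model-card hyperparameters.
# Assumptions (not confirmed by the miniJBrain code): GPT-2-style blocks
# (QKV + output projection = 4*d^2, two MLP matrices = 8*d^2), learned
# positional embeddings, and tied token embedding / lm_head weights.
VOCAB = 32_000
CONTEXT = 1_024
LAYERS = 16
D_MODEL = 1_024

token_emb = VOCAB * D_MODEL          # 32,768,000 (shared with lm_head via tying)
pos_emb = CONTEXT * D_MODEL          # 1,048,576
blocks = LAYERS * 12 * D_MODEL ** 2  # 201,326,592

total = token_emb + pos_emb + blocks
print(f"~{total / 1e6:.0f}M parameters")  # ~235M parameters
```

So under these assumptions the checkpoint sits in the roughly 235M-parameter range, which is consistent with a learning-scale model.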
## Repository Contents
This directory is an exported model package. It includes:
- `model.safetensors`
- `config.json`
- `tokenizer.json`
- `generation_config.json`
- `inference.py`
- `README.md`
Important: this is not a zero-code Hugging Face `transformers` package. The weights are present and usable, but the architecture is defined by the `miniJBrain` codebase rather than a standard `AutoModelForCausalLM` config.
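Since the package must be wired into custom code, it can help to verify the export is complete before loading anything. A minimal sketch (the helper name is ours, not part of the release):

```python
from pathlib import Path

# Files listed in the "Repository Contents" section of this model card.
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "generation_config.json",
    "inference.py",
]

def missing_files(package_dir="."):
    """Return the expected package files that are absent from package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (root / name).exists()]
```

Running `missing_files()` from inside the model directory should return an empty list for a complete download.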
## How To Use
The recommended way to run this model is to use the original `miniJBrain` model code:
- `https://github.com/chongliujia/miniJBrain`
### Option 1: Run the local inference script
If you do not already have the model code, clone it first:
```bash
git clone https://github.com/chongliujia/miniJBrain.git
```
If this model directory sits next to the cloned `miniJBrain` project directory, you can run:
```bash
python inference.py \
  --device cpu \
  --prompt $'User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n' \
  --max_new_tokens 220 \
  --temperature 0.50 \
  --top_k 50 \
  --top_p 0.95 \
  --repetition_penalty 1.06
```
By default, `inference.py` loads:
- `./model.safetensors`
- `./config.json`
- `./tokenizer.json`
- `../miniJBrain` as the model-code directory
If your `miniJBrain` checkout lives elsewhere:
```bash
python inference.py --minijbrain-root /path/to/miniJBrain
```
### Option 2: Load it in your own Python code
Use the real `miniJBrain` model definition from `model/gpt.py` in the official repository:
```python
import json
import sys
from pathlib import Path

import torch
from safetensors.torch import load_file
from tokenizers import Tokenizer

# Make the miniJBrain model definition importable.
minijbrain_root = Path("/path/to/miniJBrain")
sys.path.insert(0, str(minijbrain_root))

from model.gpt import GPT, GPTConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

with open("config.json", "r", encoding="utf-8") as f:
    raw_config = json.load(f)

model = GPT(GPTConfig(**raw_config)).to(device)

state_dict = load_file("model.safetensors")
# The exported safetensors file keeps tied weights through lm_head.weight.
if "transformer.wte.weight" not in state_dict and "lm_head.weight" in state_dict:
    state_dict["transformer.wte.weight"] = state_dict["lm_head.weight"]
model.load_state_dict(state_dict)
model.eval()

tokenizer = Tokenizer.from_file("tokenizer.json")

prompt = "User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n"
input_ids = torch.tensor(
    [tokenizer.encode(prompt, add_special_tokens=False).ids],
    dtype=torch.long,
    device=device,
)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=220,
        temperature=0.50,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.06,
        eos_token_id=tokenizer.token_to_id("<eos>"),
        stop_on_eos=True,
    )

text = tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True)
print(text)
```
### What does not work out of the box
This repository does not yet support direct loading like:
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("your-repo-name")
```
That will not work yet because this repository does not provide a standard `transformers` architecture definition, `model_type`, or compatible modeling code.
## Prompt Format
The model works best with the chat-style prompt format used during SFT:
```text
User:
Tell me a warm short bedtime story before sleep.

Assistant:
```
It generally responds best when the prompt is short, explicit, and clearly story-oriented.
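For programmatic use, this format can be wrapped in a tiny helper. The function name here is illustrative, not part of the miniJBrain API; note the blank line between the user turn and `Assistant:`, which matches the prompts used in the examples above.

```python
def build_story_prompt(user_message: str) -> str:
    """Wrap a user request in the chat format used during SFT."""
    return f"User:\n{user_message}\n\nAssistant:\n"

prompt = build_story_prompt("Tell me a warm short bedtime story before sleep.")
```

The resulting string can be tokenized and fed to the model exactly as in the Option 2 example.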
## Recommended Decoding
Suggested defaults from `generation_config.json`:
```text
max_new_tokens = 220
temperature = 0.50
top_k = 50
top_p = 0.95
repetition_penalty = 1.06
```
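In custom loading code, these defaults can be read back from `generation_config.json` with a documented fallback. A sketch, assuming the file uses the same key names as listed above (the helper name is ours):

```python
import json

# Decoding defaults as documented in this model card. Key names in the
# shipped generation_config.json are assumed to match; verify before relying
# on this mapping.
DEFAULTS = {
    "max_new_tokens": 220,
    "temperature": 0.50,
    "top_k": 50,
    "top_p": 0.95,
    "repetition_penalty": 1.06,
}

def load_decoding_params(path="generation_config.json"):
    """Merge any on-disk generation config over the documented defaults."""
    params = dict(DEFAULTS)
    try:
        with open(path, "r", encoding="utf-8") as f:
            on_disk = json.load(f)
        params.update({k: v for k, v in on_disk.items() if k in DEFAULTS})
    except FileNotFoundError:
        pass  # fall back to the documented defaults
    return params
```

The returned dictionary can be unpacked straight into `model.generate(**params, ...)` in the Option 2 example.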
## Intended Use
This model is intended for:
- educational demonstration of small-LLM training and release
- toy story generation experiments
- prompt-format experiments
- decoding experiments on a compact custom LM
- studying story-heavy SFT behavior
## Out-of-Scope Use
This model is not intended for:
- factual question answering
- safety-critical applications
- production child-facing systems
- high-reliability assistant behavior
- benchmark-oriented comparison with modern instruction models
## Training Summary
This release comes from the `stage2p1` story-SFT experiment in the broader `miniJBrain` project.
High-level training path:
1. train tokenizer
2. pretrain a small GPT-style base model
3. run instruction/story SFT
4. build a story-heavy second-stage SFT mixture
5. select the most balanced checkpoint for release
The final checkpoint was chosen because it retained better prompt following and more stable generation than narrower bedtime-specialized runs.
## Data Summary
The broader `miniJBrain` SFT experiments used locally prepared prompt/response data assembled from public sources, including:
- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`
- `Open-Orca/OpenOrca`
- `openai/gsm8k`
- `roneneldan/TinyStories`
For this release checkpoint, the most important SFT sources were:
- `roneneldan/TinyStories`
- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`
Approximate `stage2p1` training composition:
- story samples: `120,000`
- UltraChat-derived samples: `17,839`
- Dolly-derived samples: `3,337`
Approximate validation composition:
- story samples: `6,000`
- UltraChat-derived samples: `1,059`
This puts the final SFT mix at roughly:
- `85%` story-style data
- `15%` chat/instruction-style data
That balance was selected because pure story-only tuning narrowed prompt generalization too much, while a story-heavy mix with some chat data produced more stable behavior.
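The quoted split can be sanity-checked directly from the training sample counts listed above:

```python
# stage2p1 training composition from this model card.
train_mix = {
    "story": 120_000,      # TinyStories-derived story samples
    "ultrachat": 17_839,   # UltraChat-derived samples
    "dolly": 3_337,        # Dolly-derived samples
}

total = sum(train_mix.values())
story_pct = 100 * train_mix["story"] / total
chat_pct = 100 * (train_mix["ultrachat"] + train_mix["dolly"]) / total

print(f"story: {story_pct:.1f}%, chat: {chat_pct:.1f}%")  # story: 85.0%, chat: 15.0%
```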
Dataset note: before any formal redistribution claims, upstream dataset licenses and usage restrictions should be reviewed source by source.
## Limitations
Known limitations include:
- repeated story structure and character patterns
- frequent reuse of certain names and motifs
- generic story arcs
- style instability across prompt phrasings
- occasional abrupt endings with short decoding limits
- imperfect specialization for bedtime-only prompts
## Release Notes
This is a learning-project release, not a benchmark-optimized or production-tuned model.
The main publication goal is transparency around:
- how the data was formatted
- how story-heavy SFT was performed
- how custom GPT-style checkpoints were exported
- how a small custom model can be shared openly