---
license: mit
language:
- en
tags:
- text-generation
- gpt
- story-generation
- sft
- educational
library_name: custom
pipeline_tag: text-generation
---

# miniJBrain-Story-SFT-v0.1

`miniJBrain-Story-SFT-v0.1` is a small GPT-style causal language model fine-tuned to generate short, gentle, children's-story-style text.

This release is part of the `miniJBrain` learning project, which covers:

- tokenizer training
- base pretraining
- supervised fine-tuning (SFT)
- lightweight style alignment
- checkpoint export and open release

Official code repository:

- `https://github.com/chongliujia/miniJBrain`

## Model Details

- Model family: custom GPT-style causal LM
- Release checkpoint: `minij_chat_story_stage2p1`
- Main use case: short story and bedtime-style text generation
- Vocabulary size: `32,000`
- Context length: `1,024`
- Layers: `16`
- Attention heads: `16`
- Embedding size: `1,024`
- Weights format: `safetensors`

This checkpoint was selected as the most balanced story-oriented SFT result in the project: it performed better overall than narrower bedtime-only variants.

## Repository Contents

This directory is an exported model package. It includes:

- `model.safetensors`
- `config.json`
- `tokenizer.json`
- `generation_config.json`
- `inference.py`
- `README.md`

Important: this is not a zero-code Hugging Face `transformers` package. The weights are present and usable, but the architecture is defined by the `miniJBrain` codebase rather than by a standard `AutoModelForCausalLM` config.
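As a quick sanity check, the hyperparameters listed under Model Details imply a model in the low-hundreds-of-millions parameter range. The sketch below is an estimate only: it assumes a standard GPT block (roughly 12 · d_model² weights per layer for attention plus a 4× MLP), learned positional embeddings, and tied input/output embeddings, none of which is confirmed by this card.

```python
# Rough parameter-count estimate from the card's hyperparameters.
# Assumes a standard GPT block (~12 * d_model^2 params per layer),
# learned positional embeddings, and tied embeddings; the actual
# miniJBrain architecture may differ.
vocab_size = 32_000
context_length = 1_024
n_layers = 16
d_model = 1_024

token_embeddings = vocab_size * d_model         # shared with lm_head if tied
position_embeddings = context_length * d_model  # only if positions are learned
per_layer = 12 * d_model * d_model              # attention + MLP weights
total = token_embeddings + position_embeddings + n_layers * per_layer

print(f"~{total / 1e6:.0f}M parameters")  # ~235M parameters
```

Under these assumptions the checkpoint sits around 235M parameters, which is consistent with "small GPT-style" as described above.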
## How To Use

The recommended way to run this model is with the original `miniJBrain` model code:

- `https://github.com/chongliujia/miniJBrain`

### Option 1: Run the local inference script

If you do not already have the model code, clone it first:

```bash
git clone https://github.com/chongliujia/miniJBrain.git
```

If this model directory sits next to the cloned `miniJBrain` project directory, you can run:

```bash
python inference.py \
  --device cpu \
  --prompt $'User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n' \
  --max_new_tokens 220 \
  --temperature 0.50 \
  --top_k 50 \
  --top_p 0.95 \
  --repetition_penalty 1.06
```

By default, `inference.py` loads:

- `./model.safetensors`
- `./config.json`
- `./tokenizer.json`
- `../miniJBrain` as the model-code directory

If your `miniJBrain` checkout lives elsewhere:

```bash
python inference.py --minijbrain-root /path/to/miniJBrain
```

### Option 2: Load it in your own Python code

Use the real `miniJBrain` model definition from `model/gpt.py` in the official repository:

```python
import json
import sys
from pathlib import Path

import torch
from safetensors.torch import load_file
from tokenizers import Tokenizer

minijbrain_root = Path("/path/to/miniJBrain")
sys.path.insert(0, str(minijbrain_root))

from model.gpt import GPT, GPTConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

with open("config.json", "r", encoding="utf-8") as f:
    raw_config = json.load(f)

model = GPT(GPTConfig(**raw_config)).to(device)

state_dict = load_file("model.safetensors")
# The exported safetensors file keeps tied weights through lm_head.weight.
if "transformer.wte.weight" not in state_dict and "lm_head.weight" in state_dict:
    state_dict["transformer.wte.weight"] = state_dict["lm_head.weight"]

model.load_state_dict(state_dict)
model.eval()

tokenizer = Tokenizer.from_file("tokenizer.json")

prompt = "User:\nTell me a warm short bedtime story before sleep.\n\nAssistant:\n"
input_ids = torch.tensor(
    [tokenizer.encode(prompt, add_special_tokens=False).ids],
    dtype=torch.long,
    device=device,
)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=220,
        temperature=0.50,
        top_k=50,
        top_p=0.95,
        repetition_penalty=1.06,
        eos_token_id=tokenizer.token_to_id(""),
        stop_on_eos=True,
    )

text = tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True)
print(text)
```

### What does not work out of the box

This repository does not yet support direct loading like:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-repo-name")
```

That will not work yet because this repository does not provide a standard `transformers` architecture definition, `model_type`, or compatible modeling code.

## Prompt Format

The model works best with the chat-style prompt format used during SFT:

```text
User:
Tell me a warm short bedtime story before sleep.

Assistant:
```

It generally responds best when the prompt is short, explicit, and clearly story-oriented.
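If you are scripting prompts, the format above is easy to wrap in a small helper. `build_prompt` is a hypothetical name for illustration, not part of the `miniJBrain` codebase:

```python
def build_prompt(user_message: str) -> str:
    # Mirrors the chat-style SFT format: a "User:" turn followed by an
    # empty "Assistant:" turn that the model is expected to complete.
    return f"User:\n{user_message}\n\nAssistant:\n"

prompt = build_prompt("Tell me a warm short bedtime story before sleep.")
print(repr(prompt))
```

Keeping the trailing `Assistant:\n` intact matters: the model continues from exactly that point, so dropping the newline or the label tends to degrade completions.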
## Recommended Decoding

Suggested defaults from `generation_config.json`:

```text
max_new_tokens = 220
temperature = 0.50
top_k = 50
top_p = 0.95
repetition_penalty = 1.06
```

## Intended Use

This model is intended for:

- educational demonstration of small-LLM training and release
- toy story generation experiments
- prompt-format experiments
- decoding experiments on a compact custom LM
- studying story-heavy SFT behavior

## Out-of-Scope Use

This model is not intended for:

- factual question answering
- safety-critical applications
- production child-facing systems
- high-reliability assistant behavior
- benchmark-oriented comparison with modern instruction models

## Training Summary

This release comes from the `stage2p1` story-SFT experiment in the broader `miniJBrain` project.

High-level training path:

1. train tokenizer
2. pretrain a small GPT-style base model
3. run instruction/story SFT
4. build a story-heavy second-stage SFT mixture
5. select the most balanced checkpoint for release

The final checkpoint was chosen because it retained better prompt following and more stable generation than the narrower bedtime-specialized runs.
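Step 4 of the training path (building a story-heavy mixture) can be sketched as follows. This is an illustrative sketch, not the project's actual data tooling: `build_mixture` and the toy sample lists are invented names, and the real pipeline may downsample differently.

```python
import random

def build_mixture(story, chat, story_fraction=0.85, seed=0):
    # Illustrative sketch of a story-heavy SFT mixture: keep all story
    # samples and downsample chat/instruction data so the final split is
    # roughly story_fraction vs. (1 - story_fraction), then shuffle.
    n_chat = int(len(story) * (1 - story_fraction) / story_fraction)
    rng = random.Random(seed)
    mix = list(story) + rng.sample(list(chat), min(n_chat, len(chat)))
    rng.shuffle(mix)
    return mix

# Toy stand-ins for the real datasets.
story = [f"story-{i}" for i in range(850)]
chat = [f"chat-{i}" for i in range(300)]

mix = build_mixture(story, chat)
print(len(mix))  # 1000 samples: 850 story + 150 chat
```

With `story_fraction=0.85` this reproduces the roughly 85/15 story-to-chat ratio reported in the Data Summary below.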
## Data Summary

The broader `miniJBrain` SFT experiments used locally prepared prompt/response data assembled from public sources, including:

- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`
- `Open-Orca/OpenOrca`
- `openai/gsm8k`
- `roneneldan/TinyStories`

For this release checkpoint, the most important SFT sources were:

- `roneneldan/TinyStories`
- `HuggingFaceH4/ultrachat_200k`
- `databricks/databricks-dolly-15k`

Approximate `stage2p1` training composition:

- story samples: `120,000`
- UltraChat-derived samples: `17,839`
- Dolly-derived samples: `3,337`

Approximate validation composition:

- story samples: `6,000`
- UltraChat-derived samples: `1,059`

This puts the final SFT mix at roughly:

- `85%` story-style data
- `15%` chat/instruction-style data

That balance was chosen because pure story-only tuning narrowed prompt generalization too much, while a story-heavy mix with some chat data produced more stable behavior.

Dataset note: before any formal redistribution claims, upstream dataset licenses and usage restrictions should be reviewed source by source.

## Limitations

Known limitations include:

- repeated story structure and character patterns
- frequent reuse of certain names and motifs
- generic story arcs
- style instability across prompt phrasings
- occasional abrupt endings with short decoding limits
- imperfect specialization for bedtime-only prompts

## Release Notes

This is a learning-project release, not a benchmark-optimized or production-tuned model. The main publication goal is transparency around:

- how the data was formatted
- how story-heavy SFT was performed
- how custom GPT-style checkpoints were exported
- how a small custom model can be shared openly