HumanV (Transformers Integration) + Nilla-Story Checkpoint
This repository contains:
- HumanV: a lightweight, decoder-only Transformer architecture integrated into the 🤗 Transformers codebase.
- Nilla-Story: a small HumanV checkpoint trained for short story generation (TinyStories-style).
Goal: upstream the HumanV architecture into huggingface/transformers so it can be loaded with standard AutoModel* classes (without trust_remote_code=True).
Model: Nilla-Story
- Hub: nebularesearchtrain/nilla-story
- Tokenizer: GPT-2 tokenizer (gpt2), vocab size 50,257
- Context length: 1,024 (trained with sequence length 512)
Quickstart (from the Hub)
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "nebularesearchtrain/nilla-story"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
prompt = "Once upon a time,"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
If you are using a development version that still requires custom code on the Hub, load with:
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
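To make the sampling arguments in the Quickstart call more concrete, here is a minimal, dependency-free sketch of what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. It is a conceptual illustration only, not the code path Transformers runs inside generate(); the function name top_p_filter and the toy logits are hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering.

    Conceptual sketch only: the library's own implementation may treat edge
    cases (ties, minimum tokens kept) differently.
    """
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [value / temperature for value in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens so they sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -5.0]  # hypothetical scores for four tokens
    print(top_p_filter(toy_logits))
```

Lower temperature or lower top_p shrinks the set of candidate tokens, which tends to make a small story model more repetitive but less erratic.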
Architecture: HumanV
HumanV is a decoder-only Transformer inspired by modern LLaMA-style blocks:
- Causal self-attention with Rotary Position Embeddings (RoPE)
- RMSNorm
- SiLU / SwiGLU-style MLP
- Optional grouped-query attention via num_key_value_heads (can be equal to num_attention_heads for standard MHA)
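The building blocks listed above can be sketched in a few lines of plain Python. This is an illustrative sketch of the general techniques (RMSNorm, SiLU gating, and the query-to-KV head mapping used by grouped-query attention), not the HumanV modeling code; the function names and the eps default are assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then apply a learned scale."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Elementwise gating of a SwiGLU-style MLP: silu(gate) * up."""
    return [silu(g) * u for g, u in zip(gate, up)]


def kv_head_for(query_head, num_attention_heads, num_key_value_heads):
    """Map a query head index to the key/value head it shares under GQA."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
    # 8 query heads sharing 2 KV heads (hypothetical sizes).
    print([kv_head_for(h, 8, 2) for h in range(8)])
```

When num_key_value_heads equals num_attention_heads the mapping is the identity, which is the standard multi-head attention case mentioned in the last bullet.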
Precision policy (recommended)
For TPU-friendly stability and speed:
- BF16 for most matmul operations
- FP32 for numerically sensitive steps (attention softmax + attention mask add, RMSNorm, logits/loss)
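As a rough illustration of this policy, the sketch below simulates BF16 rounding with the standard library and runs the mask add and softmax at higher precision before casting the result back to BF16. It is a toy model of the idea, not the HumanV implementation: Python floats are 64-bit and merely stand in for the FP32 path, and the helper names to_bf16 and softmax_high_precision are hypothetical.

```python
import math
import struct


def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of the float32 bit pattern.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def softmax_high_precision(scores_bf16, mask):
    """Add the attention mask and apply softmax at higher precision.

    Inputs are BF16-rounded scores; only the final probabilities are
    cast back to BF16. Assumes at least one position is unmasked.
    """
    masked = [score + m for score, m in zip(scores_bf16, mask)]
    peak = max(masked)
    exps = [math.exp(value - peak) for value in masked]
    total = sum(exps)
    return [to_bf16(e / total) for e in exps]


if __name__ == "__main__":
    scores = [to_bf16(s) for s in (1.0, 2.0, 3.0)]
    causal_mask = [0.0, 0.0, float("-inf")]  # last position is masked out
    print(softmax_high_precision(scores, causal_mask))
```

Doing the exponentials and the normalizing sum at the wider precision avoids compounding BF16's roughly 8 bits of mantissa across every term; only one rounding step is applied at the end.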
Training (Nilla-Story)
- Dataset: TinyStories (subset)
- Sequence length: 512
- Precision: BF16 (with FP32 softmax/norm/loss as described above)
- Hardware: Google TPU v5e-1
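The card does not describe how training examples were built. One common way to reach a fixed sequence length of 512 is to concatenate the tokenized stories and slice the stream into equal blocks; the sketch below shows that approach purely as an assumption, with a hypothetical helper name.

```python
def pack_into_blocks(token_ids, block_size=512):
    """Split one long token stream into fixed-size blocks, dropping the remainder.

    Assumed preprocessing for illustration; not confirmed by the model card.
    """
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


if __name__ == "__main__":
    stream = list(range(1300))  # stand-in for concatenated token ids
    blocks = pack_into_blocks(stream)
    print(len(blocks), len(blocks[0]))
```

Fixed-size blocks keep tensor shapes static, which is generally convenient on TPU hardware where shape changes trigger recompilation.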
Example generation (sample)
Prompts like "Once upon a time," and "The little bird wanted to" produce short story continuations suitable for toy storytelling tasks.
Contributing / Upstreaming to Transformers
This repository is prepared for an upstream PR to 🤗 Transformers. A typical PR includes:
- src/transformers/models/humanv/ implementation (configuration_*.py, modeling_*.py)
- Auto-class registration (so AutoModelForCausalLM works)
- Unit tests in tests/models/humanv/
- Documentation page: docs/source/en/model_doc/humanv.md
Transformers recommends a modular approach for new model contributions, and CI may validate generated files when using modular modeling.
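Before opening a PR, it can help to confirm that the file-based pieces listed above exist in a local checkout. The following is a small, hypothetical helper (not part of Transformers or its CI) that checks only the paths named in this section; auto-class registration cannot be verified by path alone and is not covered.

```python
from pathlib import Path


def missing_pr_pieces(repo_root, model="humanv"):
    """Return the names of expected PR pieces that are absent under repo_root."""
    root = Path(repo_root)
    model_dir = root / "src" / "transformers" / "models" / model
    doc_page = root / "docs" / "source" / "en" / "model_doc" / f"{model}.md"

    checks = {
        "configuration file": any(model_dir.glob("configuration_*.py")),
        "modeling file": any(model_dir.glob("modeling_*.py")),
        "unit tests directory": (root / "tests" / "models" / model).is_dir(),
        "documentation page": doc_page.is_file(),
    }
    return [name for name, present in checks.items() if not present]


if __name__ == "__main__":
    # Run from the root of a local transformers checkout.
    for piece in missing_pr_pieces("."):
        print(f"missing: {piece}")
```

An empty result only means the files are present; it says nothing about whether the tests pass or the generated files are in sync.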
Limitations
- This is a small model trained on a limited dataset. It may repeat phrases, hallucinate details, or generate simplistic stories.
- Not intended for safety-critical use cases.
License
- Code: Apache-2.0 (compatible with 🤗 Transformers)
Citation
If you use this work, please cite the repository and the Hugging Face model page.