| --- |
| license: mit |
| datasets: |
| - roneneldan/TinyStories |
| language: |
| - en |
| pipeline_tag: text-generation |
| tags: |
| - text-generation-inference |
| new_version: GODELEV/Test-1-4000 |
| --- |
| # Test-1-3000 — A 190M Parameter Narrative Intelligence Engine |
|
|
|
|
| --- |
|
|
| # Overview |
|
|
| **Test-1-3000** is a compact yet remarkably capable decoder-only Transformer language model built upon the modern **Llama architecture**. |
|
|
| The project explores an important question in language model research: |
|
|
| > *How much narrative reasoning, coherence, and world understanding can emerge inside a small model when trained correctly?* |
|
|
| Despite containing only **190.55 million parameters**, Test-1-3000 demonstrates surprisingly advanced: |
|
|
| - Narrative continuity |
| - Character persistence |
| - Long-range memory consistency |
| - Emotional progression |
| - Logical event sequencing |
| - Contextual storytelling stability |
|
|
| The model was trained specifically for **short-form narrative intelligence**, focusing on coherent storytelling rather than broad internet-scale memorization. |
|
|
| Unlike many small models that generate fragmented or repetitive text, Test-1-3000 learns to maintain: |
|
|
| - causal relationships, |
| - stable story worlds, |
| - emotional trajectories, |
| - and meaningful resolutions across long contexts. |
|
|
| --- |
|
|
| # Key Highlights |
|
|
| | Feature | Description | |
| |---|---| |
| | Architecture | Llama-based Decoder-only Transformer | |
| | Parameters | 190.55 Million | |
| | Context Length | 2048 Tokens | |
| | Final Training Step | 3000 | |
| | Final Training Loss | **0.8516** | |
| | Attention Optimization | Flash Attention 2 | |
| | Compilation | `torch.compile` | |
| | Precision | bfloat16 Mixed Precision | |
| | Positional Encoding | Rotary Positional Embeddings (RoPE) | |
|
|
| --- |
|
|
| # What Makes Test-1-3000 Special? |
|
|
| Most compact language models struggle with: |
|
|
| - maintaining consistency, |
| - remembering earlier events, |
| - resolving story arcs, |
| - and avoiding repetition. |
|
|
| Test-1-3000 was trained with a different objective philosophy: |
|
|
| ## Narrative Intelligence First |
|
|
| Instead of optimizing for broad factual memorization, the model focuses on: |
|
|
| - temporal continuity, |
| - event causality, |
| - emotional logic, |
| - and narrative closure. |
|
|
| This creates a surprisingly stable storytelling engine capable of generating coherent multi-paragraph narratives with strong thematic flow. |
|
|
| --- |
|
|
| # Model Architecture |
|
|
| Test-1-3000 follows a modern efficient Transformer design optimized for both: |
|
|
| - training stability, |
| - and inference throughput. |
|
|
| The architecture borrows heavily from the proven Llama design philosophy while remaining lightweight enough for experimentation and rapid iteration. |
|
|
| --- |
|
|
| # Technical Specifications |
|
|
| | Feature | Specification | |
| |---|---| |
| | Model Type | Decoder-only Transformer | |
| | Hidden Dimension | 768 | |
| | Layers (Depth) | 12 | |
| | Attention Heads | 12 | |
| | Intermediate Size | 3072 | |
| | Activation Function | SwiGLU | |
| | Normalization | RMSNorm | |
| | Vocabulary Size | 50,257 | |
| | Tokenizer | GPT-2 Tokenizer | |
| | Context Window | 2048 Tokens | |
| | Precision | bfloat16 | |
| | Attention Backend | Flash Attention 2 | |
|
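As a rough cross-check of the reported parameter count, the figures in the table above can be plugged into the usual bias-free Llama layer layout. The sketch below is an estimate, not the author's own accounting. It assumes full multi-head attention (no grouped-query attention), no bias terms, and an output head that is not tied to the token embeddings, none of which this card states. The function name `llama_param_count` is made up for illustration.

```python
def llama_param_count(vocab_size, hidden, layers, intermediate, tie_embeddings=False):
    """Estimate the parameter count of a bias-free Llama-style decoder."""
    embed = vocab_size * hidden              # token embedding matrix
    attn = 4 * hidden * hidden               # q, k, v and output projections
    mlp = 3 * hidden * intermediate          # SwiGLU: gate, up and down projections
    norms = 2 * hidden                       # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden                      # RMSNorm before the output head
    lm_head = 0 if tie_embeddings else vocab_size * hidden
    return embed + layers * per_layer + final_norm + lm_head


# Values taken from the Technical Specifications table.
total = llama_param_count(vocab_size=50_257, hidden=768, layers=12, intermediate=3072)
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the estimate is 190,460,160, within about 0.1M of the reported 190.55M. With tied embeddings the same formula gives roughly 151.9M, so an untied output head looks like the better match. The small remaining gap presumably comes from details not listed in this card.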
|
| --- |
|
|
| # Positional Understanding with RoPE |
|
|
| Test-1-3000 uses **Rotary Positional Embeddings (RoPE)**, which encode each token's position by rotating the query and key vectors so that attention scores depend on the relative distance between tokens. |
|
|
| This allows the model to: |
|
|
| - track entities across paragraphs, |
| - preserve story continuity, |
| - maintain dialogue references, |
| - and understand long-range dependencies efficiently. |
|
|
| For a model of this scale, the 2048-token context window is long enough to hold an entire short story, so earlier events stay directly visible to attention throughout generation. |
|
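The relative-position property of RoPE can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the model's actual implementation. It assumes the adjacent-pair rotation from the original RoPE formulation (the Hugging Face Llama code pairs dimensions differently but is mathematically equivalent up to a permutation) and a frequency base of 10000, which this card does not state. The head dimension of 64 follows from 768 hidden units split over 12 heads.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


head_dim = 768 // 12                            # 64 dimensions per attention head
q = [math.sin(0.1 * i) for i in range(head_dim)]   # arbitrary toy query
k = [math.cos(0.2 * i) for i in range(head_dim)]   # arbitrary toy key

# Same offset (3 tokens apart) at very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 1005), rope(k, 1002))
print(abs(score_near - score_far) < 1e-9)
```

The two attention scores agree up to floating-point error because only the offset between the query and key positions enters the dot product. This is what lets the model refer back to an earlier entity in the same way wherever it appears in the context.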
|
| --- |
|
|
| # The Evolution of Learning |
|
|
| Training Test-1-3000 revealed clear emergent phases of cognitive development. |
|
|
| The model did not merely memorize text patterns — it progressively developed increasingly sophisticated representations of narrative structure and world dynamics. |
|
|
| --- |
|
|
| # The Lexical Phase |
| ## *(Steps 0 → 250)* |
|
|
| At the beginning of training, the model learned the statistical foundations of language. |
|
|
| It discovered: |
|
|
| - common sentence structures, |
| - punctuation behavior, |
| - frequent vocabulary patterns, |
| - and story-opening syntax. |
|
|
| During this phase, phrases such as: |
|
|
| > "Once upon a time" |
|
|
| became strong narrative anchors. |
|
|
| The model began constructing basic grammatical fluency but still lacked deeper logical understanding. |
|
|
| ### Characteristics |
|
|
| - High repetition |
| - Weak memory |
| - Poor event continuity |
| - Basic syntax acquisition |
|
|
| --- |
|
|
| # The Relational Phase |
| ## *(Steps 250 → 1000)* |
|
|
| The model started connecting concepts together into meaningful relationships. |
|
|
| It learned: |
|
|
| - object interactions, |
| - spatial reasoning, |
| - basic causality, |
| - and action consistency. |
|
|
| For example: |
|
|
| - parks imply trees and playing, |
| - rain implies umbrellas or wetness, |
| - sadness often precedes comfort or resolution. |
|
|
| The training loss rapidly decreased below **1.5**, signaling major improvements in structural reasoning. |
|
|
| ### Emergent Behaviors |
|
|
| - Scene consistency |
| - Character-action alignment |
| - Basic emotional logic |
| - Improved descriptive continuity |
|
|
| --- |
|
|
| # The Coherence Phase |
| ## *(Steps 1000 → 2000)* |
|
|
| This phase marked the emergence of true narrative stabilization. |
|
|
| The model learned: |
|
|
| - story pacing, |
| - setup/payoff relationships, |
| - conflict resolution, |
| - and multi-sentence thematic continuity. |
|
|
| Stories no longer collapsed into unrelated fragments. |
|
|
| Instead, the model began maintaining: |
|
|
| - stable goals, |
| - emotional arcs, |
| - and logical conclusions. |
|
|
| If a story introduced a problem: |
|
|
| > "Lily was lonely." |
|
|
| the model increasingly learned to produce meaningful emotional resolutions later in the text. |
|
|
| ### Major Improvements |
|
|
| - Long-range memory |
| - Reduced contradiction |
| - Better endings |
| - Stronger narrative flow |
| - Lower hallucination frequency |
|
|
| Final loss at this stage: |
|
|
| | Step | Loss | |
| |---|---| |
| | 2000 | **1.27** | |
|
|
| --- |
|
|
| # The Emergent Narrative Intelligence Phase |
| ## *(Steps 2000 → 3000)* |
|
|
| This final stage represented a major leap in generative sophistication. |
|
|
| Rather than simply maintaining coherence, the model began exhibiting signs of: |
|
|
| - implicit world modeling, |
| - narrative anticipation, |
| - emotional persistence, |
| - and latent planning behavior. |
|
|
| The model increasingly understood that stories possess: |
|
|
| - momentum, |
| - consequences, |
| - emotional gravity, |
| - and thematic closure. |
|
|
| Characters began behaving more consistently across long contexts. |
|
|
| Events early in a story influenced the text generated later more reliably. |
|
|
| The model also became significantly better at: |
|
|
| - avoiding repetitive loops, |
| - maintaining tone, |
| - preserving narrative identity, |
| - and generating cleaner transitions between scenes. |
|
|
| ### Emergent Capabilities |
|
|
| - Multi-event causal chaining |
| - Persistent emotional tone |
| - Improved dialogue continuity |
| - Better conflict resolution |
| - Reduced topic drift |
| - More natural pacing |
| - Stronger thematic stability |
|
|
| Most importantly: |
|
|
| > The model began generating stories that feel intentionally written rather than statistically assembled. |
|
|
| --- |
|
|
| # Final Training Statistics |
|
|
| | Metric | Value | |
| |---|---| |
| | Final Step | 3000 | |
| | Final Loss | **0.8516** | |
| | Training Stability | Excellent | |
| | Gradient Behavior | Stable | |
| | Divergence Events | None Observed | |
|
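To make the loss values easier to interpret, they can be converted to perplexity. The snippet below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly in this card. The result is a training-set figure, not a held-out evaluation.

```python
import math

# Losses reported in this card, keyed by training step.
reported_loss = {2000: 1.27, 3000: 0.8516}

for step, loss in reported_loss.items():
    perplexity = math.exp(loss)   # valid only for mean cross-entropy in nats
    print(f"step {step}: loss {loss:.4f} -> perplexity {perplexity:.2f}")
```

Under that assumption the final loss corresponds to a training perplexity of about 2.34, down from about 3.56 at step 2000.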
|
| --- |
|
|
| # Training Configuration |
|
|
| ## Hyperparameters |
|
|
| | Parameter | Value | |
| |---|---| |
| | Optimizer | AdamW | |
| | Betas | β₁=0.9, β₂=0.95 | |
| | Learning Rate | 5e-4 | |
| | Scheduler | OneCycleLR | |
| | Weight Decay | 0.01 | |
| | Precision | bfloat16 | |
| | Compilation | torch.compile | |
| | Attention Optimization | Flash Attention 2 | |
| | Effective Batch Size | ~262,144 Tokens / Step | |
|
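Two quick calculations help put these hyperparameters in context: the total token budget, and the shape of a one-cycle learning-rate curve. This is a sketch under stated assumptions, not the training script. The token budget assumes every step used the full effective batch of sequences packed to 2048 tokens. The schedule is a simplified plain-Python version of a two-phase cosine one-cycle policy using the default values of PyTorch's `OneCycleLR` (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) and treating 5e-4 as the peak rate. The card does not report these settings, so the curve is illustrative only.

```python
import math

TOKENS_PER_STEP = 262_144     # effective batch size from the table
CONTEXT_LENGTH = 2048
TOTAL_STEPS = 3000
MAX_LR = 5e-4                 # assumed to be the peak learning rate

sequences_per_step = TOKENS_PER_STEP // CONTEXT_LENGTH
total_tokens = TOKENS_PER_STEP * TOTAL_STEPS


def one_cycle_lr(step, max_lr=MAX_LR, total_steps=TOTAL_STEPS,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Cosine warm-up to max_lr, then cosine anneal to a very small rate."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    peak_step = pct_start * total_steps - 1
    last_step = total_steps - 1
    if step <= peak_step:
        start, end, pct = initial_lr, max_lr, step / peak_step
    else:
        start, end = max_lr, min_lr
        pct = (step - peak_step) / (last_step - peak_step)
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


print(f"{sequences_per_step} sequences per step, {total_tokens:,} tokens in total")
for step in (0, 899, 2000, 2999):
    print(f"step {step:4d}: lr = {one_cycle_lr(step):.3e}")
```

With these numbers each step covers 128 full-length sequences, and 3000 steps process about 786 million tokens. Under the assumed defaults the rate starts at 2e-5, peaks at 5e-4 around step 900, and decays to roughly 2e-9 by the final step.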
|
| --- |
|
|
| # Dataset |
|
|
| ## TinyStories (2M) |
|
|
| Test-1-3000 was trained on the **TinyStories** dataset. |
|
|
| TinyStories is uniquely valuable because it isolates: |
|
|
| - narrative structure, |
| - reasoning, |
| - consistency, |
| - and causality |
|
|
| without the overwhelming informational noise of the open web. |
|
|
| The stories are synthetic, generated by GPT-3.5 and GPT-4, and pair child-level vocabulary with well-structured narrative composition. |
|
|
| This creates an ideal environment for studying emergent reasoning inside small language models. |
|
|
| --- |
|
|
| # Training Philosophy |
|
|
| The project intentionally prioritizes: |
|
|
| - coherence over memorization, |
| - reasoning over factual retrieval, |
| - and narrative intelligence over benchmark chasing. |
|
|
| The goal is not merely to create a chatbot. |
|
|
| The goal is to study: |
|
|
| > how structured cognition emerges inside compact neural systems. |
|
|
| --- |
|
|
| # Usage: Quick Start |
|
|
| Install dependencies: |
|
|
| ```bash |
| pip install transformers torch accelerate |
| ``` |
|
|
| --- |
|
|
| ## Inference Example |
|
|
| ```python |
| import torch |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| model_path = "GODELEV/Test-1-3000" |
| |
| # Load the tokenizer and model |
| tokenizer = AutoTokenizer.from_pretrained(model_path) |
| |
| # GPT-2 tokenizers define no pad token by default; fall back to EOS |
| # so that generate() receives a valid pad_token_id. |
| if tokenizer.pad_token_id is None: |
|     tokenizer.pad_token = tokenizer.eos_token |
| |
| model = AutoModelForCausalLM.from_pretrained( |
|     model_path, |
|     torch_dtype=torch.bfloat16,  # matches the training precision |
|     device_map="auto",           # requires the accelerate package |
| ) |
| |
| # Prompt |
| prompt = "Once upon a time, Tom found a blue car." |
| |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
| |
| # Generate with the recommended sampling settings |
| output = model.generate( |
|     **inputs, |
|     max_new_tokens=200, |
|     temperature=0.7, |
|     top_p=0.9, |
|     repetition_penalty=1.1, |
|     do_sample=True, |
|     eos_token_id=tokenizer.eos_token_id, |
|     pad_token_id=tokenizer.pad_token_id, |
| ) |
| |
| print(tokenizer.decode(output[0], skip_special_tokens=True)) |
| ``` |
|
|
| --- |
|
|
| # Recommended Generation Settings |
|
|
| | Parameter | Recommended | |
| |---|---| |
| | Temperature | 0.7 | |
| | Top-p | 0.9 | |
| | Repetition Penalty | 1.1 | |
| | Max Tokens | 128–512 | |
| | Sampling | Enabled | |
|
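To show what the temperature and top-p settings above do to a next-token distribution, here is a small self-contained sketch in plain Python. It is not the `transformers` implementation. The helper name `top_p_filter` and the toy logits are made up for illustration, and the repetition penalty is left out for brevity.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]        # temperature < 1 sharpens the distribution
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]       # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}         # renormalise the survivors


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up scores for a 5-token vocabulary
print(top_p_filter(toy_logits))
```

With these toy logits and the recommended settings, only the two most likely tokens survive the filter, and sampling then draws from their renormalised probabilities. Raising the temperature or `top_p` admits more candidates and makes the stories more varied, at the cost of coherence.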
|
| --- |
|
|
| # Observed Emergent Behaviors |
|
|
| During evaluation, the model demonstrated: |
|
|
| - Character persistence |
| - Goal-oriented progression |
| - Emotional continuity |
| - Environmental consistency |
| - Contextual callbacks |
| - Story resolution awareness |
|
|
| These behaviors are especially notable given the model's relatively small parameter count. |
|
|
| --- |
|
|
| # Limitations |
|
|
| Although highly capable for its size, Test-1-3000 still has limitations: |
|
|
| - Limited factual world knowledge |
| - Occasional repetition in very long generations |
| - Reduced reasoning performance outside storytelling domains |
| - Less stable beyond trained narrative styles |
|
|
| The model is optimized specifically for: |
|
|
| > coherent short-form storytelling. |
|
|
| --- |
|
|
| # 📜 Citation |
|
|
| ```bibtex |
| @misc{test13000, |
| title={Test-1-3000: A 190M Parameter Narrative Intelligence Engine}, |
| author={GODELEV}, |
| year={2026}, |
| note={Compact narrative-focused language model trained on TinyStories} |
| } |
| ``` |
|
|
| --- |
|
|
| # License |
|
|
| This project is released under the MIT license and is intended for: |
|
|
| - research, |
| - experimentation, |
| - educational use, |
| - and open exploration of compact language models. |
|
|
| --- |
|
|
| # Final Thoughts |
|
|
| Test-1-3000 demonstrates that meaningful narrative intelligence can emerge inside surprisingly small neural systems when training is focused, clean, and structurally optimized. |
|
|
| At only **190M parameters**, the model exhibits behaviors often associated with significantly larger systems: |
|
|
| - narrative planning, |
| - emotional continuity, |
| - causal consistency, |
| - and coherent resolution generation. |
|
|
| The project serves as both: |
|
|
| - a practical storytelling model, |
| - and an experiment in emergent cognition within compact architectures. |
|
|
| --- |
|
|
| <p align="center"> |
|
|
| ### “Small models are not weak models. |
| ### They are compressed intelligence waiting to emerge.” |
|
|
| </p> |