	---
	license: mit
	datasets:
	- roneneldan/TinyStories
	language:
	- en
	pipeline_tag: text-generation
	tags:
	- text-generation-inference
	new_version: GODELEV/Test-1-4000
	---
	# Test-1-3000 — A 190M Parameter Narrative Intelligence Engine

	<p align="center">

	![Architecture](https://img.shields.io/badge/Architecture-Llama-blue)
	![Parameters](https://img.shields.io/badge/Parameters-190M-green)
	![Context](https://img.shields.io/badge/Context-2048-orange)
	![Framework](https://img.shields.io/badge/Framework-PyTorch-red)
	![Training](https://img.shields.io/badge/Training-Step_3000-purple)

	</p>

	---

	# Overview

	Test-1-3000 is a compact decoder-only Transformer language model built on the Llama architecture.

	The project explores an important question in language model research:

	> How much narrative reasoning, coherence, and world understanding can emerge inside a small model when trained correctly?

	Despite containing only 190.55 million parameters, Test-1-3000 demonstrates surprisingly advanced:

	- Narrative continuity
	- Character persistence
	- Long-range memory consistency
	- Emotional progression
	- Logical event sequencing
	- Contextual storytelling stability

	The model was trained specifically for short-form narrative intelligence, focusing on coherent storytelling rather than broad internet-scale memorization.

	Unlike many small models that generate fragmented or repetitive text, Test-1-3000 learns to maintain:

	- causal relationships,
	- stable story worlds,
	- emotional trajectories,
	- and meaningful resolutions across long contexts.

	---

	# Key Highlights

	| Feature | Description |
	|---|---|
	| Architecture | Llama-based Decoder-only Transformer |
	| Parameters | 190.55 Million |
	| Context Length | 2048 Tokens |
	| Final Training Step | 3000 |
	| Final Training Loss | 0.8516 |
	| Attention Optimization | Flash Attention 2 |
	| Compilation | `torch.compile` |
	| Precision | bfloat16 Mixed Precision |
	| Positional Encoding | Rotary Positional Embeddings (RoPE) |

	---

	# What Makes Test-1-3000 Special?

	Most compact language models struggle with:

	- maintaining consistency,
	- remembering earlier events,
	- resolving story arcs,
	- and avoiding repetition.

	Test-1-3000 was trained with a different objective philosophy:

	## Narrative Intelligence First

	Instead of optimizing for broad factual memorization, the model focuses on:

	- temporal continuity,
	- event causality,
	- emotional logic,
	- and narrative closure.

	This creates a surprisingly stable storytelling engine capable of generating coherent multi-paragraph narratives with strong thematic flow.

	---

	# Model Architecture

	Test-1-3000 follows a modern efficient Transformer design optimized for both:

	- training stability,
	- and inference throughput.

	The architecture borrows heavily from the proven Llama design philosophy while remaining lightweight enough for experimentation and rapid iteration.

	---

	# Technical Specifications

	| Feature | Specification |
	|---|---|
	| Model Type | Decoder-only Transformer |
	| Hidden Dimension | 768 |
	| Layers (Depth) | 12 |
	| Attention Heads | 12 |
	| Intermediate Size | 3072 |
	| Activation Function | SwiGLU |
	| Normalization | RMSNorm |
	| Vocabulary Size | 50,257 |
	| Tokenizer | GPT-2 Tokenizer |
	| Context Window | 2048 Tokens |
	| Precision | bfloat16 |
	| Attention Backend | Flash Attention 2 |
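The specifications above are enough to sanity-check the reported parameter count. The sketch below is a rough estimate, not the author's own accounting. It assumes a standard Llama layout with untied input and output embeddings, full multi-head attention (12 key/value heads, no grouped-query attention), and no bias terms, none of which the card states. `estimate_llama_params` is a hypothetical helper name.

```python
def estimate_llama_params(vocab, hidden, layers, intermediate, tied_embeddings=False):
    """Rough parameter count for a Llama-style decoder (assumed layout, no biases)."""
    embed = vocab * hidden              # token embedding matrix
    attn = 4 * hidden * hidden          # q, k, v, o projections (assumes no GQA)
    mlp = 3 * hidden * intermediate     # SwiGLU: gate, up, and down projections
    norms = 2 * hidden                  # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden                 # RMSNorm before the output head
    lm_head = 0 if tied_embeddings else vocab * hidden
    return embed + layers * per_layer + final_norm + lm_head


total = estimate_llama_params(vocab=50_257, hidden=768, layers=12, intermediate=3072)
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

Under these assumptions the estimate comes to about 190.46M, within roughly 0.1M of the reported 190.55M. The small gap may come from details the card does not list, such as a padded vocabulary.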

	---

	# Positional Understanding with RoPE

	Test-1-3000 uses Rotary Positional Embeddings (RoPE), which rotate query and key vectors by position-dependent angles so that attention scores depend on the relative distance between tokens.

	This allows the model to:

	- track entities across paragraphs,
	- preserve story continuity,
	- maintain dialogue references,
	- and understand long-range dependencies efficiently.

	TinyStories stories are typically much shorter than 2048 tokens, so the context window can usually hold an entire story during generation.
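To make the relative-position property concrete, here is a minimal pure-Python sketch of a rotary embedding. It is illustrative only: it rotates adjacent pairs of dimensions (Llama implementations pair dimensions differently but share the same property), and the `rope` and `attention_score` helpers, the base of 10000, and the toy vectors are assumptions rather than anything taken from this model's code.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # later pairs rotate more slowly
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def attention_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    q_rot, k_rot = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset (2 tokens apart) at very different absolute positions.
near = attention_score(q, k, q_pos=5, k_pos=3)
far = attention_score(q, k, q_pos=105, k_pos=103)
print(abs(near - far) < 1e-9)
```

The two scores agree up to floating-point error because only the offset between the query and key positions enters the dot product.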

	---

	# The Evolution of Learning

	Training Test-1-3000 revealed clear emergent phases of cognitive development.

	The model did not merely memorize text patterns — it progressively developed increasingly sophisticated representations of narrative structure and world dynamics.

	---

	# The Lexical Phase
	## (Steps 0 → 250)

	At the beginning of training, the model learned the statistical foundations of language.

	It discovered:

	- common sentence structures,
	- punctuation behavior,
	- frequent vocabulary patterns,
	- and story-opening syntax.

	During this phase, phrases such as:

	> "Once upon a time"

	became strong narrative anchors.

	The model began constructing basic grammatical fluency but still lacked deeper logical understanding.

	### Characteristics

	- High repetition
	- Weak memory
	- Poor event continuity
	- Basic syntax acquisition

	---

	# The Relational Phase
	## (Steps 250 → 1000)

	The model started connecting concepts together into meaningful relationships.

	It learned:

	- object interactions,
	- spatial reasoning,
	- basic causality,
	- and action consistency.

	For example:

	- parks imply trees and playing,
	- rain implies umbrellas or wetness,
	- sadness often precedes comfort or resolution.

	The training loss rapidly decreased below 1.5, signaling major improvements in structural reasoning.

	### Emergent Behaviors

	- Scene consistency
	- Character-action alignment
	- Basic emotional logic
	- Improved descriptive continuity

	---

	# The Coherence Phase
	## (Steps 1000 → 2000)

	This phase marked the emergence of true narrative stabilization.

	The model learned:

	- story pacing,
	- setup/payoff relationships,
	- conflict resolution,
	- and multi-sentence thematic continuity.

	Stories no longer collapsed into unrelated fragments.

	Instead, the model began maintaining:

	- stable goals,
	- emotional arcs,
	- and logical conclusions.

	If a story introduced a problem:

	> "Lily was lonely."

	the model increasingly learned to produce meaningful emotional resolutions later in the text.

	### Major Improvements

	- Long-range memory
	- Reduced contradiction
	- Better endings
	- Stronger narrative flow
	- Lower hallucination frequency

	Final loss at this stage:

	| Step | Loss |
	|---|---|
	| 2000 | 1.27 |
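Loss values are easier to compare as perplexities. The sketch below assumes the reported losses are mean cross-entropy in nats per token (the usual PyTorch convention, which the card does not confirm), in which case perplexity is simply `exp(loss)`. The labels in the dictionary are descriptive only.

```python
import math

# Loss values quoted in this card; reading them as nats per token is an assumption.
reported_losses = {
    "relational phase (below)": 1.5,
    "step 2000": 1.27,
    "step 3000": 0.8516,
}


def perplexity(cross_entropy_nats):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


for label, loss in reported_losses.items():
    print(f"{label}: loss {loss:.4f} -> perplexity {perplexity(loss):.2f}")
```

These are training losses, so the resulting perplexities describe fit to the training stream rather than held-out performance.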

	---

	# The Emergent Narrative Intelligence Phase
	## (Steps 2000 → 3000)

	This final stage represented a major leap in generative sophistication.

	Rather than simply maintaining coherence, the model began exhibiting signs of:

	- implicit world modeling,
	- narrative anticipation,
	- emotional persistence,
	- and latent planning behavior.

	The model increasingly understood that stories possess:

	- momentum,
	- consequences,
	- emotional gravity,
	- and thematic closure.

	Characters began behaving more consistently across long contexts.

	Events introduced early in a story influenced later generated text more reliably.

	The model also became significantly better at:

	- avoiding repetitive loops,
	- maintaining tone,
	- preserving narrative identity,
	- and generating cleaner transitions between scenes.

	### Emergent Capabilities

	- Multi-event causal chaining
	- Persistent emotional tone
	- Improved dialogue continuity
	- Better conflict resolution
	- Reduced topic drift
	- More natural pacing
	- Stronger thematic stability

	Most importantly:

	> The model began generating stories that feel intentionally written rather than statistically assembled.

	---

	# Final Training Statistics

	| Metric | Value |
	|---|---|
	| Final Step | 3000 |
	| Final Loss | 0.8516 |
	| Training Stability | Excellent |
	| Gradient Behavior | Stable |
	| Divergence Events | None Observed |
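Combining the final step count with the effective batch size listed in the hyperparameter table gives a rough token budget for the run. This is back-of-the-envelope arithmetic: it treats the approximate "~262,144 tokens per step" figure as exact and assumes every sequence is packed to the full 2048-token context.

```python
TOKENS_PER_STEP = 262_144   # effective batch size from the hyperparameter table (approximate)
CONTEXT_LENGTH = 2048       # context window in tokens
FINAL_STEP = 3000           # final training step
PARAMETERS = 190.55e6       # reported parameter count

sequences_per_step = TOKENS_PER_STEP // CONTEXT_LENGTH
total_tokens = TOKENS_PER_STEP * FINAL_STEP
tokens_per_parameter = total_tokens / PARAMETERS

print(f"sequences per step:   {sequences_per_step}")
print(f"total tokens seen:    {total_tokens:,}")
print(f"tokens per parameter: {tokens_per_parameter:.2f}")
```

Under these assumptions the run processes 128 full-length sequences per step and roughly 0.79 billion tokens in total, or about four tokens per parameter.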

	---

	# Training Configuration

	## Hyperparameters

	| Parameter | Value |
	|---|---|
	| Optimizer | AdamW |
	| Betas | β₁=0.9, β₂=0.95 |
	| Learning Rate | 5e-4 |
	| Scheduler | OneCycleLR |
	| Weight Decay | 0.01 |
	| Precision | bfloat16 |
	| Compilation | `torch.compile` |
	| Attention Optimization | Flash Attention 2 |
	| Effective Batch Size | ~262,144 Tokens / Step |
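As a sketch of how these settings might be wired together in PyTorch, the snippet below builds the optimizer and scheduler around a stand-in module. Several details are assumptions: that 5e-4 is the peak of the one-cycle schedule, that the default warm-up fraction and divisors of `OneCycleLR` were used, and that momentum cycling was disabled. The actual training script is not part of this card.

```python
import torch

TOTAL_STEPS = 3000
model = torch.nn.Linear(8, 8)  # stand-in module, not the real 190M network

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    betas=(0.9, 0.95),
    weight_decay=0.01,
)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=5e-4,            # assumes the listed learning rate is the cycle's peak
    total_steps=TOTAL_STEPS,
    cycle_momentum=False,   # keep beta1 fixed at 0.9 (assumption)
)

lrs = []
for step in range(TOTAL_STEPS):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()        # no gradients here, so parameters are unchanged
    if step < TOTAL_STEPS - 1:
        scheduler.step()

print(f"start {lrs[0]:.2e}, peak {max(lrs):.2e}, end {lrs[-1]:.2e}")
```

Note that `OneCycleLR` cycles the first Adam beta by default, so `cycle_momentum=False` is needed if β₁ is meant to stay at 0.9 throughout.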

	---

	# Dataset

	## TinyStories (2M)

	Test-1-3000 was trained on the TinyStories dataset.

	TinyStories is uniquely valuable because it isolates:

	- narrative structure,
	- reasoning,
	- consistency,
	- and causality

	without the overwhelming informational noise of the open web.

	The stories use:

	- child-level vocabulary,
	- but professionally structured narrative composition.

	This creates an ideal environment for studying emergent reasoning inside small language models.

	---

	# Training Philosophy

	The project intentionally prioritizes:

	- coherence over memorization,
	- reasoning over factual retrieval,
	- and narrative intelligence over benchmark chasing.

	The goal is not merely to create a chatbot.

	The goal is to study:

	> how structured cognition emerges inside compact neural systems.

	---

	# Usage: Quick Start

	Install dependencies:

	```bash
	pip install transformers torch accelerate
	```

	---

	## Inference Example

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_path = "GODELEV/Test-1-3000"

	# Load the tokenizer and the model weights in bfloat16
	tokenizer = AutoTokenizer.from_pretrained(model_path)

	model = AutoModelForCausalLM.from_pretrained(
	    model_path,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# GPT-2 style tokenizers may not define a pad token; fall back to EOS if so
	pad_token_id = tokenizer.pad_token_id
	if pad_token_id is None:
	    pad_token_id = tokenizer.eos_token_id

	# Prompt
	prompt = "Once upon a time, Tom found a blue car."

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	# Generate with the recommended sampling settings
	output = model.generate(
	    **inputs,
	    max_new_tokens=200,
	    temperature=0.7,
	    top_p=0.9,
	    repetition_penalty=1.1,
	    do_sample=True,
	    eos_token_id=tokenizer.eos_token_id,
	    pad_token_id=pad_token_id,
	)

	print(tokenizer.decode(output[0], skip_special_tokens=True))
	```

	---

	# Recommended Generation Settings

	| Parameter | Recommended |
	|---|---|
	| Temperature | 0.7 |
	| Top-p | 0.9 |
	| Repetition Penalty | 1.1 |
	| Max Tokens | 128–512 |
	| Sampling | Enabled |
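For intuition about what these three settings do at each decoding step, here is a simplified pure-Python sketch on toy logits. It mirrors the general idea behind the corresponding `generate` options but is not the library's code. The helper names (`apply_repetition_penalty`, `softmax`, `top_p_filter`) and the four-token vocabulary are hypothetical.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty=1.1):
    """Make already-generated tokens less likely (divide positive logits, multiply negative ones)."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution; above 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]                 # toy scores for a 4-token vocabulary
penalized = apply_repetition_penalty(logits, seen_ids=[0], penalty=1.1)
candidates = top_p_filter(softmax(penalized, temperature=0.7), top_p=0.9)
print(candidates)                              # token id -> renormalized probability
```

The next token would then be drawn at random from `candidates`. Lowering the temperature or `top_p` shrinks that candidate set, which tends to make stories more predictable and less varied.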

	---

	# Observed Emergent Behaviors

	During evaluation, the model demonstrated:

	- Character persistence
	- Goal-oriented progression
	- Emotional continuity
	- Environmental consistency
	- Contextual callbacks
	- Story resolution awareness

	These behaviors are especially notable given the model's relatively small parameter count.

	---

	# Limitations

	Although highly capable for its size, Test-1-3000 still has limitations:

	- Limited factual world knowledge
	- Occasional repetition in very long generations
	- Reduced reasoning performance outside storytelling domains
	- Less stable beyond trained narrative styles
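Repetition in long generations can be quantified with a simple statistic. The sketch below is one possible measure, not a metric used in this project's evaluation: it reports the fraction of word trigrams that duplicate an earlier trigram in the same text. `repeated_ngram_fraction` is a hypothetical helper.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in the same text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


looping = "the cat sat the cat sat the cat sat"
varied = "Tom found a blue car in the park"
print(round(repeated_ngram_fraction(looping), 3))
print(repeated_ngram_fraction(varied))
```

Values near zero indicate little verbatim repetition, while values that climb as `max_new_tokens` grows suggest the generation is falling into loops.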

	The model is optimized specifically for:

	> coherent short-form storytelling.

	---

	# 📜 Citation

	```bibtex
	@misc{test13000,
	title={Test-1-3000: A 190M Parameter Narrative Intelligence Engine},
	author={GODELEV},
	year={2026},
	note={Compact narrative-focused language model trained on TinyStories}
	}
	```

	---

	# License

	This project is released under the MIT license and is intended for:

	- research,
	- experimentation,
	- educational use,
	- and open exploration of compact language models.

	---

	# Final Thoughts

	Test-1-3000 demonstrates that meaningful narrative intelligence can emerge inside surprisingly small neural systems when training is focused, clean, and structurally optimized.

	At only 190M parameters, the model exhibits behaviors often associated with significantly larger systems:

	- narrative planning,
	- emotional continuity,
	- causal consistency,
	- and coherent resolution generation.

	The project serves as both:

	- a practical storytelling model,
	- and an experiment in emergent cognition within compact architectures.

	---

	<p align="center">

	### “Small models are not weak models.
	### They are compressed intelligence waiting to emerge.”

	</p>