---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random-Llama-Small

## Model Overview

**Random-Llama-Small** is a randomly initialized transformer-based language model with approximately 2 billion parameters, built using the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling. As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.

---

## Key Details

- **Architecture:** LLaMA (Causal Language Model)
- **Parameters:** ~2B
- **Hidden Size:** 2304
- **Layers:** 22
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)
- **Intermediate Size:** 9216
- **Vocabulary Size:** 128256
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- **Precision:** bfloat16
- **Max Context Length:** 131,072 tokens (with RoPE scaling)
- **License:** MIT

---

## LLaMA Architecture

The LLaMA architecture, developed by Meta AI, is a family of efficient transformer-based models optimized for research. Random-Llama-Small follows this design, incorporating several key features:

### Core Components

- **Decoder-Only Transformer:** Predicts the next token in a sequence based on prior tokens, suitable for autoregressive tasks like text generation.
- **Grouped-Query Attention (GQA):** 36 attention heads with only 9 key-value heads, improving efficiency and reducing memory/compute cost (illustrated in the sketch after this list).
- **Rotary Position Embeddings (RoPE):** Embeds positional information with scaling, enabling a context length of up to 131,072 tokens.
- **SwiGLU Activation:** Uses SiLU (Swish) activation in the FFN for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing parameter count by ~295M.
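
Below is a minimal, illustrative sketch of the GQA head-sharing pattern in plain PyTorch (not the `transformers` implementation). The head counts and head dimension match this model's configuration; the tensors themselves are random stand-ins for the learned Q/K/V projections:

```python
import torch
import torch.nn.functional as F

batch, seq_len = 1, 16
num_heads, num_kv_heads, head_dim = 36, 9, 64   # 36 * 64 = 2304 = hidden_size
groups = num_heads // num_kv_heads              # each KV head serves 4 query heads

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand the 9 KV heads so they line up with the 36 query heads.
k = k.repeat_interleave(groups, dim=1)          # -> (1, 36, 16, 64)
v = v.repeat_interleave(groups, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                                # torch.Size([1, 36, 16, 64])
```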

---

## Benefits of LLaMA Architecture

- **Efficiency:** High throughput, low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Great for exploring attention, positional encoding, and training dynamics.

---

## Random-Llama-Small Specifics

This model uses random weights and:

- Has roughly 2B parameters across its 22 layers (a back-of-the-envelope estimate follows this list).
- Uses a hidden size of 2304 and an FFN (intermediate) size of 9216.
- Uses a 128,256-token vocabulary and bfloat16 precision.
- Supports an extended context length of up to 131,072 tokens.
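
For orientation, here is a rough back-of-the-envelope parameter count derived from the configuration above (ignoring the small RMSNorm weights and assuming the standard LLaMA layer layout):

```python
hidden, layers, ffn, vocab = 2304, 22, 9216, 128256
heads, kv_heads = 36, 9

embed = vocab * hidden                                   # ~295.5M, shared with the LM head (tied)
attn = 2 * hidden * hidden + 2 * hidden * (hidden * kv_heads // heads)  # Q/O plus K/V projections
mlp = 3 * hidden * ffn                                   # gate, up, and down projections
total = embed + layers * (attn + mlp)

print(f"{total / 1e9:.2f}B parameters")                  # ~1.99B
```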

---

## Intended Use

- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.

---

## Out-of-Scope Use

- **Not for direct production deployment.**
- **Not suitable for tasks needing coherence or accuracy without training.**

---

## Usage

### Requirements

- `transformers >= 4.45.0`
- `torch >= 2.0`
- GPU with ≥ 6GB VRAM (24GB+ for training)

---

### Inference Example

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```

> Note: Outputs will be random and incoherent due to the model's untrained state.
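
If you prefer to work below the pipeline level, here is a minimal sketch using the generic auto classes (generation settings are illustrative, not recommendations):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")
model = AutoModelForCausalLM.from_pretrained(
    "reflex-ai/random-llama-small", torch_dtype=torch.bfloat16
)

inputs = tokenizer("Who are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # expect gibberish: the weights are random
```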

---

### Training Example

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # the checkpoint is stored in bfloat16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset you provide (see the sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),  # causal LM, no masking
)

trainer.train()
```
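
The `your_dataset` placeholder above must be a tokenized dataset. One hypothetical way to build it, assuming the Hugging Face `datasets` library and an illustrative public corpus:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

def tokenize(batch):
    # Truncate long lines; DataCollatorForLanguageModeling handles padding and labels.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

your_dataset = raw.map(tokenize, batched=True, remove_columns=["text"])
```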

---

## Limitations

- **Random Initialization:** Needs significant training to be useful.
- **Resource Intensive:** High computational cost.
- **No Pretraining Data:** Users must provide their own.
- **Tokenizer Constraint:** May not suit all domains.

---

## Benefits and Potential

- **Customizability:** A blank slate for full control of objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances size and research feasibility.
- **Extended Context:** Useful for long-form tasks post-training.

---

## Model Configuration

```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```
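
For researchers who want to create a comparable randomly initialized model locally rather than download this one, a minimal sketch (not necessarily how this checkpoint was produced; `rope_scaling` omitted for brevity):

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=2304,
    num_hidden_layers=22,
    num_attention_heads=36,
    num_key_value_heads=9,
    intermediate_size=9216,
    vocab_size=128256,
    max_position_embeddings=131072,
    tie_word_embeddings=True,
)

model = LlamaForCausalLM(config).to(torch.bfloat16)  # fresh random weights
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")  # ~1.99B
```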

---

## Ethical Considerations

- **Untrained Safety:** The untrained model produces no meaningful (and therefore no directly harmful) output, but data selection and safety considerations become important once training begins.
- **Environmental Impact:** Large-scale training consumes significant energy; optimize training runs and prefer greener compute where possible.
- **Accessibility:** Resource requirements may limit use by smaller research teams.

---

## Contact

For questions or issues, please open an issue on the Hugging Face repository.

> *Model card created on April 20, 2025.*