---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Random-Llama-Small

## Model Overview

**Random-Llama-Small** is a randomly initialized transformer-based language model with approximately 2 billion parameters, built on the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from `HuggingFaceTB/SmolLM2-1.7B-Instruct` and is configured for causal language modeling.

As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.

---

## Key Details

- **Architecture:** LLaMA (causal language model)
- **Parameters:** ~2B
- **Hidden Size:** 2304
- **Layers:** 22
- **Attention Heads:** 36 (with 9 key-value heads for grouped-query attention)
- **Intermediate Size:** 9216
- **Vocabulary Size:** 128,256
- **Tokenizer:** Imported from `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- **Precision:** bfloat16
- **Max Context Length:** 131,072 tokens (with RoPE scaling)
- **License:** MIT

---

## LLaMA Architecture

The LLaMA architecture, developed by Meta AI, is a family of efficient transformer-based models optimized for research. Random-Llama-Small follows this design, incorporating several key features:

### Core Components

- **Decoder-Only Transformer:** Predicts the next token in a sequence from the prior tokens, suitable for autoregressive tasks like text generation.
- **Grouped-Query Attention (GQA):** 36 attention heads share 9 key-value heads, shrinking the KV cache and reducing memory and compute cost.
- **Rotary Position Embeddings (RoPE):** Encode positional information directly in the attention computation; with RoPE scaling, the context length extends to 131,072 tokens.
- **SwiGLU Activation:** A gated feed-forward network (FFN) using the SiLU (Swish) activation for improved expressiveness.
- **RMSNorm:** Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
- **Tied Embeddings:** Input and output embeddings share weights (`tie_word_embeddings=True`), reducing the parameter count by ~295M.

---

## Benefits of LLaMA Architecture

- **Efficiency:** High throughput and low memory use.
- **Scalability:** Works well across model sizes.
- **Flexibility:** Long-context support and task adaptability.
- **Research-Friendly:** Well suited to exploring attention, positional encoding, and training dynamics.

---

## Random-Llama-Small Specifics

This model uses random weights and:

- Has ~2B parameters across 22 layers.
- Uses a hidden size of 2304 and an FFN (intermediate) size of 9216.
- Has a 128,256-token vocabulary and bfloat16 precision.
- Supports extended context lengths of up to 131,072 tokens.

---

## Intended Use

- Research on transformer dynamics, optimization, or architectural changes.
- Baseline for pretraining or task-specific fine-tuning.
- Experimentation with scaling laws or custom architectures.

---

## Out-of-Scope Use

- **Not for direct production deployment.**
- **Not suitable for tasks needing coherence or accuracy without training.**

---

## Usage

### Requirements

- `transformers >= 4.45.0`
- `torch >= 2.0`
- GPU with ≥ 6 GB VRAM (24 GB+ for training)

---

### Inference Example

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))
```

> Note: Outputs will be random and incoherent due to the model’s untrained state.
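---

### Parameter Count Sanity Check

Because the weights are random, the most useful first check is structural rather than qualitative. The following is a minimal sketch (reusing the `reflex-ai/random-llama-small` repo id from the example above) that loads the model and confirms the ~2B parameter figure and the tied-embedding setting:

```python
# Minimal sketch: load the model and verify its parameter count.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "reflex-ai/random-llama-small",
    torch_dtype=torch.bfloat16,
)

total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.2f}B")  # expected: roughly 2B

# With tied embeddings, the 128,256 x 2304 embedding matrix (~295M
# parameters) is shared between input and output, so it is counted once.
print(model.config.tie_word_embeddings)  # True
```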
---

### Training Example

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    LlamaForCausalLM,
    Trainer,
    TrainingArguments,
)

# Load the random weights and the bundled SmolLM2 tokenizer.
model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # matches the card's bfloat16 precision; use fp16=True on GPUs without bf16 support
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset; see the appendix at the end of this card
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),  # mlm=False -> causal LM
)
trainer.train()
```

> Note: `your_dataset` must be a tokenized dataset; a minimal preparation sketch appears in the appendix at the end of this card.

---

## Limitations

- **Random Initialization:** Requires substantial training before it produces useful output.
- **Resource Intensive:** Pretraining a ~2B-parameter model carries a high computational cost.
- **No Pretraining Data:** Users must supply their own training corpus.
- **Tokenizer Constraint:** The inherited SmolLM2 tokenizer may not suit all domains.

---

## Benefits and Potential

- **Customizability:** A blank slate with full control over objectives and data.
- **Research Insights:** Ideal for understanding early-stage LLM behavior.
- **Scalable Baseline:** Balances model size and research feasibility.
- **Extended Context:** Useful for long-form tasks after training.

---

## Model Configuration

```json
{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}
```

---

## Ethical Considerations

- **Untrained Safety:** An untrained model poses no immediate risk of harmful output, but the usual data and safety considerations apply once training begins.
- **Environmental Impact:** Large-scale training consumes significant energy; optimize training runs and prefer low-carbon compute where possible.
- **Accessibility:** Resource requirements may limit use by smaller research teams.

---

## Contact

For questions or issues, please open an issue on the Hugging Face repository.

> *Model card created on April 20, 2025.*
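---

## Appendix: Preparing a Training Dataset

The training example above leaves `your_dataset` undefined. The sketch below shows one way to build it; it assumes the `datasets` library is installed, and `corpus.txt` is a hypothetical placeholder for your own plain-text corpus.

```python
# Minimal sketch: turn a plain-text file into a tokenized dataset usable
# as `your_dataset` in the training example. `corpus.txt` is a
# hypothetical placeholder for your own data.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")
if tokenizer.pad_token is None:
    # Guard: the data collator needs a pad token for batching.
    tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("text", data_files={"train": "corpus.txt"})

def tokenize(batch):
    # Truncate to a modest length; DataCollatorForLanguageModeling in the
    # training example handles padding and label creation for causal LM.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

your_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
```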