---
license: mit
language:
- en
base_model:
- Qwen/Qwen3-0.6B
tags:
- loop-attention
- qwen3
- pytorch
- causal-lm
model_name: Qwen3-0.6B-Looped
---

# Open-Source Training/Implementation of Loop Attention for Qwen3-0.6B

Hello world! I'm poodle, and I wanted to share an open-source methodology for how I implemented Loop Attention in Qwen3-0.6B. I did not want to just hand you the weights, so I also included the training script written for Qwen's architecture. I hope you enjoy!

This model implements **Loop Attention** on top of Qwen3-0.6B: a custom architecture that performs two forward passes through the attention mechanism. A novel gating mechanism dynamically mixes global context (from the first pass) with local windowed attention (in the second pass), aiming to improve generation coherence and context usage.

**Repository:** `coolpoodle/Qwen3-0.6B-Looped`
**Base Model:** [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)

## Model Details

- **Architecture:** Qwen3 with a Loop Attention wrapper
- **Runs:** Each "Run" section below denotes a training run and what we tried differently in it.
- **Parameter Count:** ~0.6B (base) + ~58k (gates)
- **Trained on:** WikiText-2

### Run 1 (Notes)

For **Run 1**, I started with the following parameters:

- **Context Length:** Trained with **512** context.

### Run 2 Experiments (Notes)

For **Run 2**, we attempted the following changes:

- **Context Length:** Retrained with **1024** context (vs. 512 in Run 1).
- **Layer Norms:** Unfroze the layer norms during training (in the hope of more stable features).

## Results

Perplexity here is exp(validation loss); lower is better.

| Model | Validation Loss | Perplexity (PPL) |
| :--- | :---: | :---: |
| Baseline Qwen3-0.6B | 3.7274 | 41.57 |
| Loop Run1 (Epoch 3) | 3.5549 | 35.01 |
| Loop Run2 (Epoch 1) | 3.6434 | 38.22 |
| Loop Run2 (Epoch 2) | 3.5936 | 36.37 |
| Loop Run2 (Epoch 3) | 3.5642 | 35.31 |

## 🚀 Easy Inference

You can load this model directly using `transformers`.

**Note:** `trust_remote_code=True` is required because this model uses a custom architecture (`Qwen3LoopForCausalLM`).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coolpoodle/Qwen3-0.6B-Looped"

print("Loading model...")
# trust_remote_code=True is essential for the custom architecture
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
# use_cache=False is RECOMMENDED for Loop Attention to fully activate
# its mixing logic during generation
print("Generating...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        use_cache=False
    )

print("-" * 20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("-" * 20)
```

## How it Works

The model performs two passes for each forward step (during training or non-cached generation):

1. **Global Pass:** Standard full attention.
2. **Local/Mix Pass:** A gated combination of the cached global context and local sliding-window attention.

The gate is initialized to prioritize global attention (a bias of +5.0) to prevent initialization shock, and gradually learns to utilize local context.
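Below is a minimal sketch of the gating idea, assuming a simple per-head sigmoid gate over the two attention outputs. The names (`LoopGate`, `global_out`, `local_out`) are illustrative, not the exact code in `modeling_qwen_loop.py`, and the real gate may be shaped differently (the card reports ~58k gate parameters in total).

```python
import torch
import torch.nn as nn


class LoopGate(nn.Module):
    """Illustrative per-head gate mixing global and local attention outputs.

    A sketch of the Loop Attention gating idea, not the exact
    implementation shipped in modeling_qwen_loop.py.
    """

    def __init__(self, num_heads: int, init_bias: float = 5.0):
        super().__init__()
        # Initializing the bias at +5.0 means sigmoid(5.0) ~= 0.993, so the
        # mix starts ~99% global: the wrapped model initially behaves almost
        # exactly like base Qwen3, avoiding initialization shock.
        self.gate = nn.Parameter(torch.full((num_heads, 1, 1), init_bias))

    def forward(self, global_out: torch.Tensor, local_out: torch.Tensor) -> torch.Tensor:
        # global_out, local_out: (batch, num_heads, seq_len, head_dim)
        g = torch.sigmoid(self.gate)  # per-head mixing weight in (0, 1)
        return g * global_out + (1 - g) * local_out


# Usage with dummy tensors (shapes are illustrative):
gate = LoopGate(num_heads=16)
g_out = torch.randn(1, 16, 8, 64)  # output of the global (first) pass
l_out = torch.randn(1, 16, 8, 64)  # output of the local sliding-window pass
mixed = gate(g_out, l_out)         # starts ~99% global, learns the balance
```

Because only the gate parameters are new, training mostly has to learn how much local, windowed context each head should blend in on the second pass.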
## Files

- `Qwen3-0.6B-Looped-Run2-Final.bin`: The main model weights.
- `modeling_qwen_loop.py`: The custom model code.
- `pytorch_model.bin.index.json`: Maps the custom weight file for seamless loading.

## Todo

1. Upload HumanEval benchmarks to see whether the attention loop provides gains that transfer beyond language modeling.
2. Keep working on the math to see if I can improve the training.
3. Sleep?

## Citation

```bibtex
@misc{qwen3-looped,
  author       = {coolpoodle},
  title        = {Qwen3-0.6B-Looped},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/coolpoodle/Qwen3-0.6B-Looped}}
}
```