---
license: mit
language:
- en
base_model:
- Qwen/Qwen3-0.6B
tags:
- loop-attention
- qwen3
- pytorch
- causal-lm
model_name: Qwen3-0.6B-Looped
---

# Open-Source Training/Implementation of Loop Attention for Qwen3-0.6B

Hello world! I'm poodle, and I wanted to share an open-source methodology for how I implemented Loop Attention in Qwen3-0.6B. I did not want to just hand you the weights, so I also included the training script written for Qwen's architecture. I hope you enjoy!

This model implements **Loop Attention** on top of Qwen3-0.6B: a custom architecture that performs two forward passes through the attention mechanism. A novel gating mechanism dynamically mixes global context (from the first pass) with local windowed attention (in the second pass), aiming to improve generation coherence and context usage.

**Repository:** `coolpoodle/Qwen3-0.6B-Looped`
**Base Model:** [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)

## Model Details

- **Architecture:** Qwen3 with a Loop Attention wrapper
- **Runs:** Each "Run" section below denotes a training run and what we tried differently in it.
- **Parameter Count:** ~0.6B (base) + ~58k (gates)
- **Trained on:** WikiText-2

### Run 1 (Notes)

For **Run 1**, I started with the following parameters:

- **Context Length:** Trained with **512** context.

### Run 2 Experiments (Notes)

For **Run 2**, we attempted the following changes:

- **Context Length:** Retrained with **1024** context (vs. 512 in Run 1).
- **Layer Norms:** Unfroze the layer norms during training (in the hope of more stable features).

## Results

Perplexity here is exp(validation loss); lower is better.

| Model | Validation Loss | Perplexity (PPL) |
| :--- | :---: | :---: |
| Baseline Qwen3-0.6B | 3.7274 | 41.57 |
| Loop Run1 (Epoch 3) | 3.5549 | 35.01 |
| Loop Run2 (Epoch 1) | 3.6434 | 38.22 |
| Loop Run2 (Epoch 2) | 3.5936 | 36.37 |
| Loop Run2 (Epoch 3) | 3.5642 | 35.31 |

## 🚀 Easy Inference

You can load this model directly using `transformers`.

**Note:** `trust_remote_code=True` is required because this model uses a custom architecture (`Qwen3LoopForCausalLM`).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "coolpoodle/Qwen3-0.6B-Looped"

print("Loading model...")
# trust_remote_code=True is essential for the custom architecture
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate
# use_cache=False is RECOMMENDED for Loop Attention to fully activate
# its mixing logic during generation
print("Generating...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        use_cache=False
    )

print("-" * 20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print("-" * 20)
```

## How it Works

The model performs two passes for each forward step (during training or non-cached generation):

1. **Global Pass:** Standard full attention.
2. **Local/Mix Pass:** A gated combination of the cached global context and local sliding-window attention.

The gate is initialized to prioritize global attention (a bias of +5.0) to prevent initialization shock, and gradually learns to utilize local context.
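Below is a minimal sketch of the gating idea, assuming a simple per-head sigmoid gate over the two attention outputs. The names (`LoopGate`, `global_out`, `local_out`) are illustrative, not the exact code in `modeling_qwen_loop.py`, and the real gate may be shaped differently (the card reports ~58k gate parameters in total).

```python
import torch
import torch.nn as nn


class LoopGate(nn.Module):
    """Illustrative per-head gate mixing global and local attention outputs.

    A sketch of the Loop Attention gating idea, not the exact
    implementation shipped in modeling_qwen_loop.py.
    """

    def __init__(self, num_heads: int, init_bias: float = 5.0):
        super().__init__()
        # Initializing the bias at +5.0 means sigmoid(5.0) ~= 0.993, so the
        # mix starts ~99% global: the wrapped model initially behaves almost
        # exactly like base Qwen3, avoiding initialization shock.
        self.gate = nn.Parameter(torch.full((num_heads, 1, 1), init_bias))

    def forward(self, global_out: torch.Tensor, local_out: torch.Tensor) -> torch.Tensor:
        # global_out, local_out: (batch, num_heads, seq_len, head_dim)
        g = torch.sigmoid(self.gate)  # per-head mixing weight in (0, 1)
        return g * global_out + (1 - g) * local_out


# Usage with dummy tensors (shapes are illustrative):
gate = LoopGate(num_heads=16)
g_out = torch.randn(1, 16, 8, 64)  # output of the global (first) pass
l_out = torch.randn(1, 16, 8, 64)  # output of the local sliding-window pass
mixed = gate(g_out, l_out)         # starts ~99% global, learns the balance
```

Because only the gate parameters are new, training mostly has to learn how much local, windowed context each head should blend in on the second pass.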
## Files

- `Qwen3-0.6B-Looped-Run2-Final.bin`: The main model weights.
- `modeling_qwen_loop.py`: The custom model code.
- `pytorch_model.bin.index.json`: Maps the custom weight file for seamless loading.

## Todo

1. Upload HumanEval benchmarks to see whether the attention loop provides gains that transfer beyond language modeling.
2. Keep working on the math to see if I can improve the training.
3. Sleep?

## Citation

```bibtex
@misc{qwen3-looped,
  author       = {coolpoodle},
  title        = {Qwen3-0.6B-Looped},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/coolpoodle/Qwen3-0.6B-Looped}}
}
```