---
library_name: peft
base_model: Qwen/Qwen3.5-2B
license: apache-2.0
tags:
- qlora
- 4bit
- low-resource
- arc-challenge
- gsm8k
- science
- math
- reasoning
datasets:
- custom
language:
- en
pipeline_tag: text-generation
---

# AVA v2

AVA v2 is a QLoRA fine-tune of [Qwen/Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B) that achieves **79% on ARC-Challenge** and **48% on GSM8K** while training and running inference in under 2 GB of VRAM.

It was trained entirely on a single NVIDIA RTX A2000 Laptop GPU (4 GB VRAM), and the resulting adapter is 42 MB.
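
As a rough sanity check on the memory claim (a back-of-envelope sketch, not a measurement): NF4 stores roughly half a byte per weight, so the quantized base model alone fits comfortably under 1 GB, with the remainder of the measured peak presumably going to activations, gradients, optimizer state, and quantization overhead.

```python
params = 1_892_736_832   # Qwen3.5-2B parameter count (from Training Details below)
bytes_per_weight = 0.5   # 4-bit NF4 ~ 0.5 bytes/param (ignoring quantization constants)
adapter_mb = 42          # LoRA adapter size on disk

weights_gb = params * bytes_per_weight / 1024**3
print(f"4-bit base weights: ~{weights_gb:.2f} GB")                    # ~0.88 GB
print(f"With adapter: ~{weights_gb + adapter_mb / 1024:.2f} GB")      # ~0.92 GB
```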

## Results

| | Benchmark | Qwen3.5-2B Base | AVA v2 | Improvement | |
| |---|---|---|---| |
| ARC-Challenge (100 items) | 66.0% | **79.0%** | +13.0 pp |
| GSM8K (50 items) | 28.0% | **48.0%** | +20.0 pp |
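
GSM8K is conventionally scored by exact match on the final numeric answer. Below is a minimal sketch of that extraction heuristic; the helper is hypothetical and not necessarily the exact harness used to produce the numbers above.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a model response (a common GSM8K scoring heuristic)."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not numbers:
        return None
    return numbers[-1].replace(",", "")

response = "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. The answer is 36."
print(extract_final_number(response))  # -> 36
```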

### Comparison to Other Small Models

| | Model | Params | ARC-C | GSM8K | |
| |---|---|---|---| |
| | Gemma 2 2B | 2.0B | 55.7% | 24.3% | |
| | SmolLM2-1.7B-Instruct | 1.7B | ~52% | 48.2% | |
| | Llama 3.2 1B-Instruct | 1.0B | 59.4% | 44.4% | |
| | Llama 3.2 3B-Instruct | 3.0B | 78.6% | 77.7% | |
| | **AVA v2** | **2.0B** | **79.0%** | **48.0%** | |

At 2B parameters, AVA v2's ARC-Challenge score edges past Llama 3.2 3B-Instruct (78.6% at 3B), although the 3B model remains well ahead on GSM8K (77.7% vs. 48.0%).

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization, matching the training-time setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-2B",
    quantization_config=bnb_config,
    device_map="auto",
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-2B")

# Attach the LoRA adapter, then fold it into the base weights
model = PeftModel.from_pretrained(model, "NAME0x0/AVA-v2")
model = model.merge_and_unload()

messages = [{"role": "user", "content": "Explain why ice floats on water."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

## Training Details

- **Method**: QLoRA (4-bit NF4 base + LoRA, rank 16)
- **Base model**: Qwen3.5-2B
- **Training data**: 20,741 prompt-response pairs (math, science, reasoning, instruction following)
- **Hardware**: NVIDIA RTX A2000 Laptop GPU (4 GB VRAM)
- **Training time**: 100.5 minutes
- **Final loss**: 0.4145
- **Peak VRAM**: 1.81 GB
- **Trainable params**: 10,911,744 of 1,892,736,832 (0.58%)
- **Optimizer**: paged_adamw_8bit
- **LR schedule**: cosine, peak 1.5e-4
- **Batch size**: 1 (gradient accumulation 8, effective batch 8)
- **Max sequence length**: 384 tokens
- **Epochs**: 1
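
The trainable-parameter fraction and effective batch size above follow directly from the listed numbers:

```python
# Fraction of parameters updated by the rank-16 LoRA adapters
trainable = 10_911_744
total = 1_892_736_832
print(f"Trainable fraction: {100 * trainable / total:.2f}%")  # 0.58%

# Per-device batch of 1 with 8 gradient-accumulation steps
per_device_batch = 1
grad_accum_steps = 8
print(f"Effective batch size: {per_device_batch * grad_accum_steps}")  # 8
```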

## Limitations

- Evaluation used 100 ARC-Challenge and 50 GSM8K items, not the full test sets, so scores carry meaningful sampling error
- Evaluation protocols (shot count, prompting) differ across the model-comparison sources, so the table above is indicative rather than strictly apples-to-apples
- The model inherits Qwen3.5-2B's base capabilities and limitations
- Maximum training sequence length was 384 tokens due to VRAM constraints
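
To make the first point concrete, a normal-approximation 95% confidence interval for a proportion (a rough sketch; a Wilson interval would be slightly different) shows how wide the uncertainty is at these sample sizes:

```python
import math

def margin_95(p: float, n: int) -> float:
    """Half-width of a 95% normal-approximation CI for a proportion."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

for name, p, n in [("ARC-Challenge", 0.79, 100), ("GSM8K", 0.48, 50)]:
    print(f"{name}: {p:.0%} +/- {margin_95(p, n):.1%} (n={n})")
```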

## Citation

```bibtex
@misc{ava-v2-2026,
  title={AVA v2: QLoRA Fine-tuning Under Extreme VRAM Constraints},
  author={Afsah},
  year={2026},
  url={https://github.com/NAME0x0/AVA}
}
```