---
base_model: meta-llama/Llama-2-7b-hf
language:
  - en
license: llama2
pipeline_tag: text-generation
library_name: transformers
tags:
  - llama-2
  - quantization
  - qat
  - complex-valued
  - 2-bit
  - recursive
  - safetensors
---

# Fairy2i-W2

## 🔗 Links

Paper · GitHub · ModelScope

## Abstract

Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models.

Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity ({Β±1, Β±i}). Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation.

We demonstrate that Fairy2i-W2 restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods.

This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.

## Method

Fairy2i-W2 consists of three key components:

### Widely-Linear Transformation

We transform pre-trained real-valued linear layers into an equivalent widely-linear complex form without altering the model's behavior. Each real linear layer R (a real matrix of size 2n×2m) is reparameterized into two complex matrices U and W (each of size n×m) such that y = Ux + Wx̄, where x̄ denotes the complex conjugate of x. This transformation is lossless and unique, preserving the original forward computation before quantization.
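The identity behind this reparameterization is easy to check numerically. Below is a minimal sketch (not the repository's implementation), assuming the real layer acts on the stacked vector [x_re; x_im]; it builds U and W from the four n×m blocks of R and verifies that the widely-linear form reproduces the real forward pass:

```python
import torch

n, m = 4, 3
R = torch.randn(2 * n, 2 * m)                # real weight acting on [x_re; x_im]
A, B = R[:n, :m], R[:n, m:]                  # blocks producing the real part of y
C, D = R[n:, :m], R[n:, m:]                  # blocks producing the imaginary part of y

U = (A + D) / 2 + 1j * (C - B) / 2           # widely-linear "linear" term
W = (A - D) / 2 + 1j * (C + B) / 2           # widely-linear "conjugate" term

x_re, x_im = torch.randn(m), torch.randn(m)
y_real = R @ torch.cat([x_re, x_im])         # original real computation

x = torch.complex(x_re, x_im)
y_complex = U @ x + W @ x.conj()             # equivalent widely-linear computation

assert torch.allclose(y_real[:n], y_complex.real, atol=1e-5)
assert torch.allclose(y_real[n:], y_complex.imag, atol=1e-5)
```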

### Phase-Aware Complex Quantization

We quantize complex weights using a phase-based scheme with the codebook {±1, ±i} (the fourth roots of unity). Each complex weight is projected to the nearest codeword by angle, and axis-wise scaling factors are applied. During quantization-aware training (QAT), we maintain full-precision master weights and use their quantized copies in the forward pass, with straight-through estimator (STE) gradients.
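For illustration only, here is a minimal sketch of what such a phase-aware quantizer with an STE could look like. The per-row mean-absolute-value scale is an assumption, not necessarily the exact axis-wise scaling used in the repository:

```python
import torch

CODEBOOK = torch.tensor([1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j])  # fourth roots of unity

def phase_quantize(w: torch.Tensor) -> torch.Tensor:
    """Map each complex weight to the nearest element of {±1, ±i} by angle,
    apply a per-row scale, and keep straight-through gradients."""
    angle = torch.angle(w)                              # phase in (-pi, pi]
    idx = torch.round(angle / (torch.pi / 2)) % 4       # nearest multiple of 90 degrees
    codes = CODEBOOK.to(w.device)[idx.long()]           # codewords in {±1, ±i}
    scale = w.abs().mean(dim=-1, keepdim=True)          # assumed per-row scale
    w_q = scale * codes
    # STE: the forward pass uses w_q, the backward pass routes gradients to w.
    return w + (w_q - w).detach()
```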

### Recursive Residual Quantization

To further reduce quantization error, we recursively quantize the residual error. Each complex weight is represented as a sum of low-bit terms, W_q ≈ Σ_{t=0}^{T-1} W^(t), where each term is quantized with the same phase-aware mechanism. Fairy2i-W2 uses T = 2 stages: each codeword from the four-element codebook costs 2 bits, so two stages cost 4 bits per complex weight, and since the widely-linear form has half as many complex entries as the original real matrix has real entries, this yields an effective 2 bits per real parameter.
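Continuing the sketch above, recursive residual quantization simply re-applies the same quantizer to whatever error the previous stage left behind and sums the stages (an illustration, not the repository's code):

```python
import torch

def recursive_quantize(w: torch.Tensor, T: int = 2) -> torch.Tensor:
    """Represent w as a sum of T phase-quantized terms; T = 2 matches Fairy2i-W2."""
    residual = w
    w_q = torch.zeros_like(w)
    for _ in range(T):
        stage = phase_quantize(residual)   # one {±1, ±i} stage (see sketch above)
        w_q = w_q + stage                  # accumulate the low-bit terms
        residual = residual - stage        # error handed to the next stage
    return w_q

# Quick check: more stages leave less residual error.
w = torch.randn(256, 256, dtype=torch.complex64)
for T in (1, 2):
    err = (w - recursive_quantize(w, T)).abs().pow(2).mean()
    print(f"T={T}: mean squared error {err:.4f}")
```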

## Evaluation

### Main Results on LLaMA-2 7B

| Method | Bits | C4 PPL ↓ | ARC-e | ARC-c | HellaSwag | PIQA | Winogrande | Avg. |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2 (FP16) | 16 | 6.63 | 75.59 | 43.17 | 57.06 | 77.91 | 69.85 | 64.72 |
| Fairy2i-W2 | 2 | 7.85 | 72.73 | 39.76 | 53.33 | 76.17 | 68.03 | 62.00 |
| AQLM | 2 | 8.54 | 63.68 | 32.76 | 49.55 | 74.76 | 65.67 | 57.28 |
| QuIP# | 2 | 11.01 | 55.56 | 28.84 | 42.94 | 71.38 | 62.43 | 52.23 |
| Real-Ternary (QAT) | 1.58 | 11.06 | 55.93 | 24.15 | 38.43 | 69.80 | 55.17 | 48.70 |
| Fairy2i-W1 | 1 | 11.03 | 56.56 | 24.82 | 38.19 | 70.08 | 53.67 | 48.66 |
| Real-Binary (QAT) | 1 | 11.75 | 53.32 | 22.70 | 35.57 | 66.81 | 52.64 | 46.21 |
| GPTQ | 3 | 10.61 | 58.46 | 31.06 | 45.21 | 71.49 | 59.19 | 53.08 |

Key Results:

- Fairy2i-W2 (2-bit) achieves a C4 perplexity of 7.85, closing the gap to FP16 (6.63) while outperforming all 2-bit PTQ methods
- Fairy2i-W2 achieves 62.00% average accuracy on zero-shot tasks, highly competitive with FP16 (64.72%)
- Fairy2i-W1 (1-bit) outperforms real-valued binary and ternary baselines at the same or lower bit budgets

## 🚀 Quick Start

Fairy2i-W2 is based on the LLaMA-2 7B architecture, with only the linear layers replaced by complex-valued QAT layers. The model structure is otherwise identical to LLaMA-2.

### 📦 Installation

```bash
pip install torch transformers safetensors huggingface_hub accelerate datasets lm-eval
```

### 🔄 Loading the Model

The model can be loaded using the `model_module` package. Here's a basic example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from model_module.qat_modules import replace_modules_for_qat, convert_to_inference_mode
import torch

# Load base model
model_path = "meta-llama/Llama-2-7b-hf"  # or your local path
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Replace linear layers with QAT modules
replace_modules_for_qat(model, "complex_phase_v2", skip_lm_head=False)

# Convert to inference mode for faster inference
convert_to_inference_mode(model)

# The model is ready to use!
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
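The snippet above quantizes the base LLaMA-2 weights on the fly. To use the trained Fairy2i-W2 weights released here, you would additionally load this repository's checkpoint into the converted model. A hedged sketch, assuming the released `*.safetensors` files form a standard state dict whose keys match the QAT-converted modules (the repo id below is a placeholder):

```python
# Hedged sketch: load the released checkpoint into the QAT-converted model.
# "<org>/Fairy2i-W2" is a placeholder repo id; substitute the actual one.
import glob
import os

from huggingface_hub import snapshot_download
from safetensors.torch import load_file

ckpt_dir = snapshot_download("<org>/Fairy2i-W2")
state_dict = {}
for shard in sorted(glob.glob(os.path.join(ckpt_dir, "*.safetensors"))):
    state_dict.update(load_file(shard))

# strict=False tolerates buffers the QAT modules may add or drop.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```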

## 📊 Data Processing

The training data is processed from RedPajama-Data-1T using two sequential steps:

### Step 1: Sample 100B tokens from RedPajama-Data-1T

Use `dataset/sample.py` to sample 100B tokens from the RedPajama-Data-1T dataset:

```bash
cd dataset
python sample.py
```

This script:

- Loads the RedPajama-Data-1T dataset from Hugging Face
- Samples approximately 100B tokens using 10 parallel processes
- Saves the sampled data to `new_dataset_100B_redpajama_final_dataset{0-9}` directories

### Step 2: Process into 2048-token aligned blocks

Use `dataset/padding_and_cut.py` to chunk the sampled data into 2048-token aligned blocks:

```bash
cd dataset
python padding_and_cut.py
```

This script:

- Loads the sampled datasets from Step 1
- Processes the data into 2048-token aligned blocks using the `group_and_chunk` function
- Saves the processed data to the `dataset_100B_redpajama_2048_aligned/` directory

Note: Make sure to update the input paths in `padding_and_cut.py` to point to your sampled dataset directories.
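For orientation, a minimal sketch of what a `group_and_chunk`-style step can look like with `datasets.map` is shown below. It assumes the sampled shards are already tokenized to `input_ids`; it is not the exact logic of `padding_and_cut.py`:

```python
from datasets import load_from_disk

BLOCK_SIZE = 2048

def group_and_chunk(examples):
    # Concatenate all token ids in the batch, then cut them into fixed-size blocks.
    concatenated = [tok for ids in examples["input_ids"] for tok in ids]
    total = (len(concatenated) // BLOCK_SIZE) * BLOCK_SIZE  # drop the trailing remainder
    blocks = [concatenated[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks}

sampled = load_from_disk("new_dataset_100B_redpajama_final_dataset0")  # one Step-1 shard
aligned = sampled.map(group_and_chunk, batched=True, remove_columns=sampled.column_names)
aligned.save_to_disk("dataset_100B_redpajama_2048_aligned")
```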

### Custom DataCollator

The training uses a custom `MyDataCollatorForLanguageModeling` class defined in `train/mydatacollator.py`. This collator is specifically designed to work with the 2048-token aligned data blocks.

To use the custom DataCollator, copy `train/mydatacollator.py` into the `transformers.data.data_collator` module (this works independently of the installed transformers version). The custom collator handles:

- Proper label masking for aligned 2048-token blocks
- EOS token position handling for causal language modeling
- Compatibility with the pre-processed aligned dataset format

The custom collator is then imported in the training script via:

```python
from transformers.data.data_collator import MyDataCollatorForLanguageModeling
```
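To make the collator's role concrete, here is a minimal sketch of a collator for pre-aligned blocks. The real `MyDataCollatorForLanguageModeling` additionally handles EOS positions and may differ in detail:

```python
import torch

class AlignedBlockCollator:
    """Illustrative collator: every example is already one 2048-token block,
    so it only stacks tensors and masks padding in the labels."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        input_ids = torch.tensor([f["input_ids"] for f in features], dtype=torch.long)
        attention_mask = (input_ids != self.pad_token_id).long()
        labels = input_ids.clone()
        labels[input_ids == self.pad_token_id] = -100   # ignore padding in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```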

## 🏋️ Training

To train a model with QAT, use the training script:

```bash
cd train
bash train.sh
```

Note: For Fairy2i-W2, the training uses fixed parameters:

- `--quant_method complex_phase_v2` (phase-aware quantization with one recursive residual step, i.e. T = 2 quantization terms)
- `--skip_lm_head False` (the lm_head layer is also replaced)

The training script supports the following arguments:

- `--quant_method`: QAT quantization method (choices: `bitnet`, `complex_phase_v1`, `complex_phase_v2`, `complex_phase_v3`, `complex_phase_v4`)
- `--skip_lm_head`: Whether to skip replacement of the lm_head layer (default: `False`)
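For readers who prefer Python over the shell script, the sketch below shows roughly how the documented pieces could be wired together with the HF Trainer. `train/train.py` and `train.sh` remain the authoritative versions; the hyperparameters here are placeholders:

```python
import torch
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

from model_module.qat_modules import replace_modules_for_qat

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# The Fairy2i-W2 configuration: complex_phase_v2 with the lm_head replaced as well.
replace_modules_for_qat(model, "complex_phase_v2", skip_lm_head=False)

train_dataset = load_from_disk("dataset_100B_redpajama_2048_aligned")
args = TrainingArguments(
    output_dir="fairy2i-w2-qat",
    per_device_train_batch_size=1,    # placeholder
    gradient_accumulation_steps=16,   # placeholder
    learning_rate=2e-5,               # placeholder
    bf16=True,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=AlignedBlockCollator(tokenizer.pad_token_id),  # see collator sketch above
)
trainer.train()
```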

## ✅ Evaluation

### 📉 Perplexity Evaluation

Evaluate perplexity on the WikiText-2 and C4 datasets:

```bash
cd eval
bash eval_ppl.sh
```
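As a rough illustration of what the script computes (not the contents of `eval_ppl.py`), a strided perplexity evaluation on WikiText-2 can be sketched as follows, assuming `model` and `tokenizer` are already loaded as in the Quick Start; the window and stride values are assumptions:

```python
import torch
from datasets import load_dataset

def wikitext2_ppl(model, tokenizer, window=2048, stride=2048):
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls, n_tokens = [], 0
    for start in range(0, ids.size(1) - 1, stride):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss      # mean NLL over the chunk
        nlls.append(loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

print("WikiText-2 PPL:", wikitext2_ppl(model, tokenizer))
```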

### 🎯 Task Evaluation

Evaluate on downstream tasks using `lm-eval`:

```bash
cd eval
bash eval_task.sh
```
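Equivalently, the zero-shot suite from the results table can be run through lm-eval's Python API (v0.4+) on the already-converted model object. `eval/eval_task.py` is the authoritative script; this is only a sketch assuming `model` and `tokenizer` come from the Quick Start section:

```python
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

# Wrap the converted model so lm-eval can query it directly.
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = simple_evaluate(
    model=lm,
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```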

## ℹ️ Model Details

- Base Model: LLaMA-2 7B
- Quantization Method: Complex-Phase V2 (2-step recursive residual quantization)
- Effective Bit Width: 2 bits per real parameter
- Codebook: {±1, ±i} (fourth roots of unity)
- Training: Quantization-Aware Training (QAT) on 30B tokens from the RedPajama dataset

## 📁 Repository Structure

```
fairy2i-w2-repo-github/
├── README.md
├── model_module/
│   ├── __init__.py
│   ├── qat_modules.py          # QAT linear layer implementations
│   └── quantization.py         # Quantization functions (PhaseQuant, BitNet, etc.)
├── dataset/
│   ├── sample.py               # Sample 100B tokens from RedPajama-Data-1T
│   └── padding_and_cut.py      # Process data into 2048-token aligned blocks
├── train/
│   ├── train.py                # Training script
│   ├── train.sh                # Training launch script
│   ├── mydatacollator.py       # Custom DataCollator for aligned data
│   └── complexnet_config.yaml  # Accelerate configuration
└── eval/
    ├── eval_ppl.py             # Perplexity evaluation script
    ├── eval_ppl.sh             # Perplexity evaluation launcher
    ├── eval_task.py            # Task evaluation script
    ├── eval_task.sh            # Task evaluation launcher
    └── eval_utils.py           # Evaluation utilities
```

## 📚 Citation

If you use Fairy2i-W2 in your research, please cite:

```bibtex
@article{wang2025fairy2i,
  title={Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in {$\pm 1, \pm i$}},
  author={Wang, Feiyu and Tan, Xinyu and Huang, Bokai and Zhang, Yihao and Wang, Guoan and Cong, Peizhuang and Yang, Tong},
  journal={arXiv preprint},
  year={2025}
}
```

## ⚖️ License

This model follows the same license as LLaMA-2. Please refer to the original LLaMA-2 license for details.

## 📧 Contact

For questions or issues, please contact: tanxinyu330@gmail.com