---
base_model: meta-llama/Llama-2-7b-hf
language:
  - en
license: llama2
pipeline_tag: text-generation
library_name: transformers
tags:
  - llama-2
  - quantization
  - qat
  - complex-valued
  - 2-bit
  - recursive
  - safetensors
---

# Fairy2i-W2

## 🔗 Links

Paper · GitHub · ModelScope

## Abstract

Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models.

Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity ({Β±1, Β±i}). Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation.

We demonstrate that Fairy2i-W2 restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods.

This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.

## Method

Fairy2i-W2 consists of three key components:

### Widely-Linear Transformation

We transform pre-trained real-valued linear layers into an equivalent widely-linear complex form without altering the model's behavior. Each real linear layer R (a real matrix of size 2n×2m) is reparameterized into two complex matrices U and W (each of size n×m) such that y = Ux + Wx̄, where x̄ denotes the complex conjugate of x. This transformation is lossless and unique, preserving the original forward computation before quantization.
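The identity behind this reparameterization is easy to check numerically. Below is a minimal sketch (not the repository's implementation), assuming the real layer acts on the stacked vector [x_re; x_im]; it builds U and W from the four n×m blocks of R and verifies that the widely-linear form reproduces the real forward pass:

```python
import torch

n, m = 4, 3
R = torch.randn(2 * n, 2 * m)                # real weight acting on [x_re; x_im]
A, B = R[:n, :m], R[:n, m:]                  # blocks producing the real part of y
C, D = R[n:, :m], R[n:, m:]                  # blocks producing the imaginary part of y

U = (A + D) / 2 + 1j * (C - B) / 2           # widely-linear "linear" term
W = (A - D) / 2 + 1j * (C + B) / 2           # widely-linear "conjugate" term

x_re, x_im = torch.randn(m), torch.randn(m)
y_real = R @ torch.cat([x_re, x_im])         # original real computation

x = torch.complex(x_re, x_im)
y_complex = U @ x + W @ x.conj()             # equivalent widely-linear computation

assert torch.allclose(y_real[:n], y_complex.real, atol=1e-5)
assert torch.allclose(y_real[n:], y_complex.imag, atol=1e-5)
```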

### Phase-Aware Complex Quantization

We quantize complex weights using a phase-based scheme with the codebook {±1, ±i} (the fourth roots of unity). Each complex weight is projected to the nearest codeword by angle, and axis-wise scaling factors are applied. During quantization-aware training (QAT), we maintain full-precision master weights and use their quantized copies in the forward pass, with straight-through estimator (STE) gradients.
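For illustration only, here is a minimal sketch of what such a phase-aware quantizer with an STE could look like. The per-row mean-absolute-value scale is an assumption, not necessarily the exact axis-wise scaling used in the repository:

```python
import torch

CODEBOOK = torch.tensor([1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j])  # fourth roots of unity

def phase_quantize(w: torch.Tensor) -> torch.Tensor:
    """Map each complex weight to the nearest element of {±1, ±i} by angle,
    apply a per-row scale, and keep straight-through gradients."""
    angle = torch.angle(w)                              # phase in (-pi, pi]
    idx = torch.round(angle / (torch.pi / 2)) % 4       # nearest multiple of 90 degrees
    codes = CODEBOOK.to(w.device)[idx.long()]           # codewords in {±1, ±i}
    scale = w.abs().mean(dim=-1, keepdim=True)          # assumed per-row scale
    w_q = scale * codes
    # STE: the forward pass uses w_q, the backward pass routes gradients to w.
    return w + (w_q - w).detach()
```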

### Recursive Residual Quantization

To further reduce quantization error, we recursively quantize the residual error. Each complex weight is represented as a sum of low-bit terms, W_q ≈ Σ_{t=0}^{T-1} W^(t), where each term is quantized with the same phase-aware mechanism. Fairy2i-W2 uses T = 2 stages: each codeword from the four-element codebook costs 2 bits, so two stages cost 4 bits per complex weight, and since the widely-linear form has half as many complex entries as the original real matrix has real entries, this yields an effective 2 bits per real parameter.
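Continuing the sketch above, recursive residual quantization simply re-applies the same quantizer to whatever error the previous stage left behind and sums the stages (an illustration, not the repository's code):

```python
import torch

def recursive_quantize(w: torch.Tensor, T: int = 2) -> torch.Tensor:
    """Represent w as a sum of T phase-quantized terms; T = 2 matches Fairy2i-W2."""
    residual = w
    w_q = torch.zeros_like(w)
    for _ in range(T):
        stage = phase_quantize(residual)   # one {±1, ±i} stage (see sketch above)
        w_q = w_q + stage                  # accumulate the low-bit terms
        residual = residual - stage        # error handed to the next stage
    return w_q

# Quick check: more stages leave less residual error.
w = torch.randn(256, 256, dtype=torch.complex64)
for T in (1, 2):
    err = (w - recursive_quantize(w, T)).abs().pow(2).mean()
    print(f"T={T}: mean squared error {err:.4f}")
```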

## Evaluation

### Main Results on LLaMA-2 7B

| Method | Bits | C4 PPL ↓ | ARC-e | ARC-c | HellaSwag | PIQA | Winogrande | Avg. |
|---|---|---|---|---|---|---|---|---|
| LLaMA-2 (FP16) | 16 | 6.63 | 75.59 | 43.17 | 57.06 | 77.91 | 69.85 | 64.72 |
| Fairy2i-W2 | 2 | 7.85 | 72.73 | 39.76 | 53.33 | 76.17 | 68.03 | 62.00 |
| AQLM | 2 | 8.54 | 63.68 | 32.76 | 49.55 | 74.76 | 65.67 | 57.28 |
| QuIP# | 2 | 11.01 | 55.56 | 28.84 | 42.94 | 71.38 | 62.43 | 52.23 |
| Real-Ternary (QAT) | 1.58 | 11.06 | 55.93 | 24.15 | 38.43 | 69.80 | 55.17 | 48.70 |
| Fairy2i-W1 | 1 | 11.03 | 56.56 | 24.82 | 38.19 | 70.08 | 53.67 | 48.66 |
| Real-Binary (QAT) | 1 | 11.75 | 53.32 | 22.70 | 35.57 | 66.81 | 52.64 | 46.21 |
| GPTQ | 3 | 10.61 | 58.46 | 31.06 | 45.21 | 71.49 | 59.19 | 53.08 |

Key Results:

- Fairy2i-W2 (2-bit) achieves a C4 perplexity of 7.85, closing the gap to FP16 (6.63) while outperforming all 2-bit PTQ methods
- Fairy2i-W2 achieves 62.00% average accuracy on zero-shot tasks, highly competitive with FP16 (64.72%)
- Fairy2i-W1 (1-bit) outperforms real-valued binary and ternary baselines at the same or lower bit budgets

## 🚀 Quick Start

Fairy2i-W2 is based on the LLaMA-2 7B architecture, with only the linear layers replaced by complex-valued QAT layers. The model structure is otherwise identical to LLaMA-2.

### 📦 Installation

```bash
pip install torch transformers safetensors huggingface_hub accelerate datasets lm-eval
```

### 🔄 Loading the Model

The model can be loaded using the `model_module` package. Here's a basic example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from model_module.qat_modules import replace_modules_for_qat, convert_to_inference_mode
import torch

# Load base model
model_path = "meta-llama/Llama-2-7b-hf"  # or your local path
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Replace linear layers with QAT modules
replace_modules_for_qat(model, "complex_phase_v2", skip_lm_head=False)

# Convert to inference mode for faster inference
convert_to_inference_mode(model)

# The model is ready to use!
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
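The snippet above quantizes the base LLaMA-2 weights on the fly. To use the trained Fairy2i-W2 weights released here, you would additionally load this repository's checkpoint into the converted model. A hedged sketch, assuming the released `*.safetensors` files form a standard state dict whose keys match the QAT-converted modules (the repo id below is a placeholder):

```python
# Hedged sketch: load the released checkpoint into the QAT-converted model.
# "<org>/Fairy2i-W2" is a placeholder repo id; substitute the actual one.
import glob
import os

from huggingface_hub import snapshot_download
from safetensors.torch import load_file

ckpt_dir = snapshot_download("<org>/Fairy2i-W2")
state_dict = {}
for shard in sorted(glob.glob(os.path.join(ckpt_dir, "*.safetensors"))):
    state_dict.update(load_file(shard))

# strict=False tolerates buffers the QAT modules may add or drop.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```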

## 📊 Data Processing

The training data is processed from RedPajama-Data-1T using two sequential steps:

### Step 1: Sample 100B tokens from RedPajama-Data-1T

Use `dataset/sample.py` to sample 100B tokens from the RedPajama-Data-1T dataset:

```bash
cd dataset
python sample.py
```

This script:

- Loads the RedPajama-Data-1T dataset from Hugging Face
- Samples approximately 100B tokens using 10 parallel processes
- Saves the sampled data to `new_dataset_100B_redpajama_final_dataset{0-9}` directories

### Step 2: Process into 2048-token aligned blocks

Use `dataset/padding_and_cut.py` to chunk the sampled data into 2048-token aligned blocks:

```bash
cd dataset
python padding_and_cut.py
```

This script:

- Loads the sampled datasets from Step 1
- Processes the data into 2048-token aligned blocks using the `group_and_chunk` function
- Saves the processed data to the `dataset_100B_redpajama_2048_aligned/` directory

Note: Make sure to update the input paths in `padding_and_cut.py` to point to your sampled dataset directories.
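For orientation, a minimal sketch of what a `group_and_chunk`-style step can look like with `datasets.map` is shown below. It assumes the sampled shards are already tokenized to `input_ids`; it is not the exact logic of `padding_and_cut.py`:

```python
from datasets import load_from_disk

BLOCK_SIZE = 2048

def group_and_chunk(examples):
    # Concatenate all token ids in the batch, then cut them into fixed-size blocks.
    concatenated = [tok for ids in examples["input_ids"] for tok in ids]
    total = (len(concatenated) // BLOCK_SIZE) * BLOCK_SIZE  # drop the trailing remainder
    blocks = [concatenated[i : i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
    return {"input_ids": blocks}

sampled = load_from_disk("new_dataset_100B_redpajama_final_dataset0")  # one Step-1 shard
aligned = sampled.map(group_and_chunk, batched=True, remove_columns=sampled.column_names)
aligned.save_to_disk("dataset_100B_redpajama_2048_aligned")
```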

### Custom DataCollator

The training uses a custom `MyDataCollatorForLanguageModeling` class defined in `train/mydatacollator.py`. This collator is specifically designed to work with the 2048-token aligned data blocks.

To use the custom DataCollator, copy `train/mydatacollator.py` into the `transformers.data.data_collator` module (this works independently of the installed transformers version). The custom collator handles:

- Proper label masking for aligned 2048-token blocks
- EOS token position handling for causal language modeling
- Compatibility with the pre-processed aligned dataset format

The custom collator is then imported in the training script via:

```python
from transformers.data.data_collator import MyDataCollatorForLanguageModeling
```
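To make the collator's role concrete, here is a minimal sketch of a collator for pre-aligned blocks. The real `MyDataCollatorForLanguageModeling` additionally handles EOS positions and may differ in detail:

```python
import torch

class AlignedBlockCollator:
    """Illustrative collator: every example is already one 2048-token block,
    so it only stacks tensors and masks padding in the labels."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        input_ids = torch.tensor([f["input_ids"] for f in features], dtype=torch.long)
        attention_mask = (input_ids != self.pad_token_id).long()
        labels = input_ids.clone()
        labels[input_ids == self.pad_token_id] = -100   # ignore padding in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```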

## 🏋️ Training

To train a model with QAT, use the training script:

```bash
cd train
bash train.sh
```

Note: For Fairy2i-W2, the training uses fixed parameters:

- `--quant_method complex_phase_v2` (phase-aware quantization with one recursive residual step, i.e. T = 2 quantization terms)
- `--skip_lm_head False` (the lm_head layer is also replaced)

The training script supports the following arguments:

- `--quant_method`: QAT quantization method (choices: `bitnet`, `complex_phase_v1`, `complex_phase_v2`, `complex_phase_v3`, `complex_phase_v4`)
- `--skip_lm_head`: Whether to skip replacement of the lm_head layer (default: `False`)
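For readers who prefer Python over the shell script, the sketch below shows roughly how the documented pieces could be wired together with the HF Trainer. `train/train.py` and `train.sh` remain the authoritative versions; the hyperparameters here are placeholders:

```python
import torch
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

from model_module.qat_modules import replace_modules_for_qat

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# The Fairy2i-W2 configuration: complex_phase_v2 with the lm_head replaced as well.
replace_modules_for_qat(model, "complex_phase_v2", skip_lm_head=False)

train_dataset = load_from_disk("dataset_100B_redpajama_2048_aligned")
args = TrainingArguments(
    output_dir="fairy2i-w2-qat",
    per_device_train_batch_size=1,    # placeholder
    gradient_accumulation_steps=16,   # placeholder
    learning_rate=2e-5,               # placeholder
    bf16=True,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=AlignedBlockCollator(tokenizer.pad_token_id),  # see collator sketch above
)
trainer.train()
```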

## ✅ Evaluation

### 📉 Perplexity Evaluation

Evaluate perplexity on the WikiText-2 and C4 datasets:

```bash
cd eval
bash eval_ppl.sh
```
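As a rough illustration of what the script computes (not the contents of `eval_ppl.py`), a strided perplexity evaluation on WikiText-2 can be sketched as follows, assuming `model` and `tokenizer` are already loaded as in the Quick Start; the window and stride values are assumptions:

```python
import torch
from datasets import load_dataset

def wikitext2_ppl(model, tokenizer, window=2048, stride=2048):
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls, n_tokens = [], 0
    for start in range(0, ids.size(1) - 1, stride):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss      # mean NLL over the chunk
        nlls.append(loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

print("WikiText-2 PPL:", wikitext2_ppl(model, tokenizer))
```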

### 🎯 Task Evaluation

Evaluate on downstream tasks using `lm-eval`:

```bash
cd eval
bash eval_task.sh
```
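Equivalently, the zero-shot suite from the results table can be run through lm-eval's Python API (v0.4+) on the already-converted model object. `eval/eval_task.py` is the authoritative script; this is only a sketch assuming `model` and `tokenizer` come from the Quick Start section:

```python
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM

# Wrap the converted model so lm-eval can query it directly.
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)
results = simple_evaluate(
    model=lm,
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```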

## ℹ️ Model Details

- Base Model: LLaMA-2 7B
- Quantization Method: Complex-Phase V2 (2-step recursive residual quantization)
- Effective Bit Width: 2 bits per real parameter
- Codebook: {±1, ±i} (fourth roots of unity)
- Training: Quantization-Aware Training (QAT) on 30B tokens from the RedPajama dataset

## 📁 Repository Structure

```
fairy2i-w2-repo-github/
├── README.md
├── model_module/
│   ├── __init__.py
│   ├── qat_modules.py          # QAT linear layer implementations
│   └── quantization.py         # Quantization functions (PhaseQuant, BitNet, etc.)
├── dataset/
│   ├── sample.py               # Sample 100B tokens from RedPajama-Data-1T
│   └── padding_and_cut.py      # Process data into 2048-token aligned blocks
├── train/
│   ├── train.py                # Training script
│   ├── train.sh                # Training launch script
│   ├── mydatacollator.py       # Custom DataCollator for aligned data
│   └── complexnet_config.yaml  # Accelerate configuration
└── eval/
    ├── eval_ppl.py             # Perplexity evaluation script
    ├── eval_ppl.sh             # Perplexity evaluation launcher
    ├── eval_task.py            # Task evaluation script
    ├── eval_task.sh            # Task evaluation launcher
    └── eval_utils.py           # Evaluation utilities
```

## 📚 Citation

If you use Fairy2i-W2 in your research, please cite:

```bibtex
@article{wang2025fairy2i,
  title={Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in {$\pm 1, \pm i$}},
  author={Wang, Feiyu and Tan, Xinyu and Huang, Bokai and Zhang, Yihao and Wang, Guoan and Cong, Peizhuang and Yang, Tong},
  journal={arXiv preprint},
  year={2025}
}
```

## ⚖️ License

This model follows the same license as LLaMA-2. Please refer to the original LLaMA-2 license for details.

## 📧 Contact

For questions or issues, please contact: tanxinyu330@gmail.com