🧠 TaoNet: Hybrid State-Space Model with Efficient Quantization

TaoNet is an LLM that combines State-Space Models (SSMs) with ternary weight quantization for efficient inference. The model is designed for both high performance and computational efficiency, making it suitable for resource-constrained environments.

Try It Out

Interactive Browser Demo: Test TaoNet directly in your browser without installation!

👉 TaoNet Interactive Inference Showcase

Generate text instantly in your browser - works on desktop, tablet, and mobile devices.

📋 Model Details

TaoNet implements a hybrid architecture that strategically combines two complementary mechanisms:

1. State-Space Models (SSM) Blocks

  • Efficient parallel computation during training (convolutional mode)
  • RNN-style token-by-token inference with state caching

2. Ternary Weight Quantization (BitLinear)

  • Weights quantized to {-1, 0, +1} during inference (see the sketch below)
  • FPGA-friendly operations for hardware acceleration
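
The BitLinear implementation itself ships with the model's code, but a common ternary scheme (absmean quantization in the style of BitNet b1.58) looks roughly like the sketch below. Function and variable names are illustrative, not TaoNet's actual internals.

import torch

def ternarize(weight: torch.Tensor, eps: float = 1e-5):
    # Absmean-style ternary quantization: scale by the mean |w|,
    # then round each weight to the nearest value in {-1, 0, +1}.
    scale = weight.abs().mean().clamp(min=eps)
    w_q = torch.round(weight / scale).clamp(-1, 1)
    return w_q, scale

# A BitLinear-style layer can then compute y = (x @ w_q.T) * scale using cheap
# add/subtract operations, with activations quantized to INT8 separately.
w_q, scale = ternarize(torch.randn(512, 512))
print(w_q.unique())   # tensor([-1., 0., 1.])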

Model Specifications

| Specification | Value |
|---|---|
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |
| Model Dimension | 512 |
| State Dimension | 512 |
| Number of Layers | 8 |
| Max Sequence Length | 256 tokens |
| Dropout | 0.02 |
| Quantization | Ternary weights + INT8 activations |

⭐ Key Features

✨ Efficiency First

  • Ternary Quantization: Weights reduced to 3 values {-1, 0, +1} for a 7-10× reduction in model size (see Quantization Impact below)
  • Stateful Inference: RNN-style generation with cached SSM states eliminates redundant computation (sketched below)
  • FPGA-Optimized: BitLinear layers designed for hardware acceleration
  • SSM Blocks: Linear complexity (O(N)) for long-sequence processing
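
The O(N) and O(1)-per-token claims follow from the recurrent view of an SSM layer: with the running state cached, each new token needs only a single state update, no matter how long the prefix is. A minimal generic sketch (a diagonal linear recurrence, not TaoNet's actual layer; dimensions follow the Model Specifications table above):

import torch

class SSMStep:
    # Generic diagonal SSM recurrence: h_t = a * h_{t-1} + B x_t,  y_t = C h_t.
    # Caching h between calls makes per-token cost independent of sequence length.
    def __init__(self, d_model=512, d_state=512):
        self.a = torch.rand(d_state) * 0.9            # per-channel decay
        self.B = torch.randn(d_state, d_model) * 0.02
        self.C = torch.randn(d_model, d_state) * 0.02
        self.h = torch.zeros(d_state)                 # cached running state

    def step(self, x_t):                              # x_t: (d_model,)
        self.h = self.a * self.h + self.B @ x_t       # one state update per token
        return self.C @ self.h                        # output for this token only

layer = SSMStep()
for _ in range(5):                                    # constant work per generated token
    y = layer.step(torch.randn(512))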

📊 Data Quality Focus

  • FineWeb-Edu Dataset: High-quality educational content (1M+ documents)
  • Smart Filtering: Removes boilerplate, SEO spam, and low-quality text
  • Natural Chunking: Respects paragraph/sentence boundaries for semantic coherence
  • Perplexity-based Selection: Optional quality threshold filtering

🚀 Quick Start

RNN-Style Stateful Inference

import torch
import time
from transformers import AutoModelForCausalLM, GPT2Tokenizer

MODEL_NAME = "TaoTern/TaoNet-pico-T1"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)
    
def generate_text(prompt, model, tokenizer, max_length=512, temperature=1, top_k=50, top_p=0.95):
    inputs = tokenizer.encode(prompt, return_tensors="pt")
    
    start_time = time.time()
    outputs = model.generate(
        inputs,
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        do_sample=True
    )
    end_time = time.time()
    
    # Calculate tokens per second
    num_tokens_generated = outputs.shape[1] - inputs.shape[1]
    elapsed_time = end_time - start_time
    tokens_per_second = num_tokens_generated / elapsed_time if elapsed_time > 0 else 0

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text, tokens_per_second

def main():
    prompt = "A brown fox jumps over the lazy dog"

    generated_text, tokens_per_second = generate_text(prompt, model, tokenizer)

    print(generated_text)
    print(f"\nTokens per second: {tokens_per_second:.2f}")

if __name__ == "__main__":
    main()
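
For chat-style or autocomplete use you may want tokens printed as they are produced rather than after the full completion. The stock transformers TextStreamer works with any model that supports generate; a minimal sketch reusing the model and tokenizer loaded above:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True)   # prints tokens as they are generated
inputs = tokenizer.encode("A brown fox jumps over the lazy dog", return_tensors="pt")
model.generate(inputs, max_length=128, do_sample=True, streamer=streamer)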

📚 Training Details

💾 Training Data

Primary Dataset: FineWeb-Edu (HuggingFace)

  • High-quality educational content
  • 1M+ documents with rigorous quality curation
  • Natural language diversity across domains
  • Minimal spam, boilerplate, or low-quality text

Data Processing Pipeline:

  1. Filtering: Removes HTML tags, URLs, tracking codes, repetitive spam
  2. Quality Checks: Alphabetic ratio (>70%), symbol ratio (<10%), unique word ratio (>30%) (sketched in code below)
  3. Chunking: Respects paragraph/sentence boundaries while producing fixed-length sequences
  4. Tokenization: GPT-2 tokenizer with vocabulary size 50,257
  5. Parallel Processing: Multi-threaded data loading (16 workers)
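
The filtering code itself is not reproduced in this card, but the quality checks in step 2 map onto simple text statistics. A hypothetical sketch using the thresholds listed above (the actual pipeline may compute these ratios differently):

import re

def passes_quality_checks(text: str) -> bool:
    # Approximate step 2: >70% alphabetic characters, <10% symbols, >30% unique words.
    if not text:
        return False
    chars = len(text)
    alpha_ratio = sum(c.isalpha() for c in text) / chars
    symbol_ratio = len(re.findall(r"[^\w\s]", text)) / chars
    words = text.lower().split()
    unique_ratio = len(set(words)) / len(words) if words else 0.0
    return alpha_ratio > 0.70 and symbol_ratio < 0.10 and unique_ratio > 0.30

print(passes_quality_checks("The quick brown fox jumps over the lazy dog."))   # True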

βš™οΈ Training Hyperparameters

| Parameter | Value |
|---|---|
| Batch Size | 32 |
| Gradient Accumulation Steps | 4 |
| Learning Rate (Standard Params) | 2.5e-4 |
| Learning Rate (BitLinear) | 1.8e-3 |
| Weight Decay | 0.075 |
| Warmup Steps | 300 |
| Max Epochs | 4 |
| Sequence Length | 256 tokens |
| Optimizer | AdamW |

🎯 Training Strategy

  • Separate Learning Rates: BitLinear layers (ternary weights) use a roughly 7× higher LR than standard parameters for ternary quantization stability (see the optimizer sketch below)
  • Cosine Annealing: Learning rate schedule with linear warmup → cosine decay (50% steady phase)
  • Gradient Clipping: Max norm 1.0 to prevent explosion
  • Gradient Noise: Optional additive noise (scale: 1e-5) for stability
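
As a sketch of how the two learning rates, weight decay, and warmup/cosine schedule fit together (selecting BitLinear parameters by module name is an assumption, the published training script may differ, and the 50% steady phase is omitted for brevity):

import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, total_steps, warmup_steps=300):
    # Split parameters: BitLinear (ternary) layers get the ~7x higher learning rate.
    bitlinear, standard = [], []
    for name, param in model.named_parameters():
        (bitlinear if "bitlinear" in name.lower() else standard).append(param)
    optimizer = torch.optim.AdamW(
        [{"params": standard, "lr": 2.5e-4},
         {"params": bitlinear, "lr": 1.8e-3}],
        weight_decay=0.075,
    )
    # Linear warmup followed by cosine decay.
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)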

📊 Model Performance

⚡ Inference Characteristics

  • RNN-like Latency: O(1) per token when using state caching
  • Memory Footprint: Significantly reduced due to:
    • Ternary weights (7-10× size reduction)
    • Stateful inference for single-token processing
  • Throughput: Optimized for FPGA deployment with integer arithmetic

🌐 Browser-Based Performance Benchmarks

| Device | Tokens/Second |
|---|---|
| Phone (Mobile Browser) | ~10 |
| Computer (Desktop Browser) | ~45 |

Benchmarks measure inference running entirely in the browser via WebGPU. Actual performance varies based on device capabilities and browser optimization.

💪 Quantization Impact

  • Space: 7-10× model size reduction with ternary quantization (back-of-envelope check below)
  • Speed: Hardware acceleration potential with {-1, 0, +1} operations
  • Accuracy: Minimal degradation with separate BitLinear learning rates
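
The 7-10× figure is consistent with a simple bit-count argument. A back-of-envelope check, assuming an FP16 baseline and 2-bit packed ternary storage (both assumptions, not published details):

# Storage per quantized weight, ignoring embeddings, per-tensor scales, and
# INT8 activations, all of which pull the end-to-end ratio below the ideal.
fp16_bits_per_weight = 16
ternary_bits_per_weight = 2      # {-1, 0, +1} packs into 2 bits (~1.58 bits ideal)
print(fp16_bits_per_weight / ternary_bits_per_weight)   # 8.0, within the 7-10x range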

🎯 Use Cases

✅ Recommended For

  • Resource-constrained inference (edge devices, FPGAs)
  • Real-time token generation (chatbots, autocomplete)
  • Model compression research and hardware acceleration studies

⚠️ Limitations

  • Shorter context window (256 tokens) compared to modern transformers
  • Ternary quantization may impact nuanced reasoning tasks
  • Limited to English language training data

⚠️ Bias, Risks, and Limitations

🎭 Dataset Biases

  • Source Bias: FineWeb-Edu skews toward educational/technical content; may underrepresent creative writing, poetry
  • Language Coverage: English-only; limited multilingual capability
  • Domain Gaps: Underrepresented domains due to quality filtering (e.g., informal speech, colloquial language)

🔧 Technical Limitations

  • Ternary Quantization: Reduces expressiveness; may struggle with nuanced language patterns
  • Short Context: 256-token training context constrains long-form reasoning
  • SSM Peculiarities: State-space models have different inductive biases than transformers; may struggle with discrete counting tasks

💡 Recommendations for Users

  1. Validate outputs on your specific use case before deployment
  2. Consider fine-tuning on domain-specific data for specialized applications
  3. Monitor generation quality and implement rejection sampling for critical applications
  4. Use in a complementary ensemble with traditional transformers for robustness

📖 Citation

If you use TaoNet in your research, please cite:

@software{taonet2026,
  title={TaoNet: State-Space Model with Ternary Quantization for Efficient Language Modeling},
  author={[TaoTern]},
  year={2026},
  url={https://huggingface.co/TaoTern/TaoNet-pico-T1}
}

🔗 Related Work

  • Mamba: State-space models as an alternative to transformers (Gu & Dao, 2024)
  • BitNet: Extreme quantization for efficient LLMs (Wang et al., 2024)

βš–οΈ License

This project is licensed under the MIT License.

πŸ™ Acknowledgments

  • FineWeb dataset for high-quality training data
  • HuggingFace Transformers library for model architectures and utilities
  • The open-source ML community for foundational work on SSMs and quantization techniques