# TaoNet: Hybrid State-Space Model with Efficient Quantization
TaoNet is an LLM that combines State-Space Models (SSMs) with ternary weight quantization for efficient inference. The model is designed for both high output quality and computational efficiency, making it suitable for resource-constrained environments.
## Try It Out
Interactive Browser Demo: Test TaoNet directly in your browser without installation!
TaoNet Interactive Inference Showcase
Generate text instantly in your browser - works on desktop, tablet, and mobile devices.
## Model Details
TaoNet implements a hybrid architecture that strategically combines two complementary mechanisms:
1. **State-Space Model (SSM) Blocks**
   - Efficient parallel computation during training (convolutional mode)
   - RNN-style token-by-token inference with state caching
2. **Ternary Weight Quantization (BitLinear)**
   - Weights quantized to {-1, 0, +1} during inference
   - FPGA-friendly operations for hardware acceleration
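To make the BitLinear idea concrete, here is a minimal sketch of ternary weight quantization in PyTorch. It assumes a BitNet-style absmean rule; the exact quantization rule TaoNet uses is not specified here, so treat `ternary_quantize` as illustrative:

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """Quantize a float weight tensor to {-1, 0, +1} plus a per-tensor scale.

    Assumes a BitNet-style absmean rule; TaoNet's actual scheme may differ.
    """
    scale = w.abs().mean().clamp(min=1e-5)   # per-tensor scaling factor
    w_q = (w / scale).round().clamp(-1, 1)   # snap each weight to {-1, 0, +1}
    return w_q, scale

torch.manual_seed(0)
w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(sorted(w_q.unique().tolist()))  # a subset of [-1.0, 0.0, 1.0]
```

At inference, `(x @ w_q.T) * scale` replaces the full-precision matmul, which is where the hardware win comes from: multiplications by -1/0/+1 reduce to adds, subtracts, and skips.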
### Model Specifications
| Specification | Value |
|---|---|
| Vocabulary Size | 50,257 (GPT-2 tokenizer) |
| Model Dimension | 512 |
| State Dimension | 512 |
| Number of Layers | 8 |
| Max Sequence Length | 256 tokens |
| Dropout | 0.02 |
| Quantization | Ternary weights + INT8 activations |
## Key Features

### Efficiency First
- Ternary Quantization: Weights reduced to 3 values {-1, 0, +1}, shrinking weight storage by roughly an order of magnitude (see Quantization Impact)
- Stateful Inference: RNN-style generation with cached SSM states eliminates redundant computation
- FPGA-Optimized: BitLinear layers designed for hardware acceleration
- SSM Blocks: Linear complexity (O(N)) for long-sequence processing
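The stateful-inference claim can be illustrated with a toy diagonal SSM recurrence. The dimensions match the spec table, but the parameters `A`, `B`, `C` here are random placeholders, not TaoNet's trained weights or its actual layer parameterization:

```python
import torch

d_model, d_state = 512, 512            # from the spec table

# Illustrative diagonal-SSM parameters (assumed, not TaoNet's real ones)
A = torch.rand(d_state) * 0.9          # per-channel decay factors in (0, 0.9)
B = torch.randn(d_state, d_model) * 0.02
C = torch.randn(d_model, d_state) * 0.02

def ssm_step(x_t: torch.Tensor, h: torch.Tensor):
    """One O(1) recurrent step: update the cached state h, emit output y_t."""
    h = A * h + B @ x_t                # state update
    y_t = C @ h                        # project state to output
    return y_t, h

h = torch.zeros(d_state)               # cached state carried across tokens
for _ in range(3):                     # token-by-token generation
    x_t = torch.randn(d_model)
    y_t, h = ssm_step(x_t, h)
print(y_t.shape)                       # torch.Size([512])
```

Because each token touches only the fixed-size state `h`, per-token cost stays constant no matter how many tokens precede it, unlike a transformer's growing KV cache.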
### Data Quality Focus
- FineWeb-Edu Dataset: High-quality educational content (1M+ documents)
- Smart Filtering: Removes boilerplate, SEO spam, and low-quality text
- Natural Chunking: Respects paragraph/sentence boundaries for semantic coherence
- Perplexity-based Selection: Optional quality threshold filtering
## Quick Start

### RNN-Style Stateful Inference
```python
import time

import torch
from transformers import AutoModelForCausalLM, GPT2Tokenizer

MODEL_NAME = "TaoTern/TaoNet-pico-T1"

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)


def generate_text(prompt, model, tokenizer, max_length=512, temperature=1.0, top_k=50, top_p=0.95):
    inputs = tokenizer.encode(prompt, return_tensors="pt")

    start_time = time.time()
    outputs = model.generate(
        inputs,
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 tokenizer has no pad token
    )
    end_time = time.time()

    # Tokens generated per second (excluding the prompt tokens)
    num_tokens_generated = outputs.shape[1] - inputs.shape[1]
    elapsed_time = end_time - start_time
    tokens_per_second = num_tokens_generated / elapsed_time if elapsed_time > 0 else 0

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text, tokens_per_second


def main():
    prompt = "A brown fox jumps over the lazy dog"
    generated_text, tokens_per_second = generate_text(prompt, model, tokenizer)
    print(generated_text)
    print(f"\nTokens per second: {tokens_per_second:.2f}")


if __name__ == "__main__":
    main()
```
## Training Details

### Training Data
Primary Dataset: FineWeb-Edu (HuggingFace)
- High-quality educational content
- 1M+ documents with rigorous quality curation
- Natural language diversity across domains
- Minimal spam, boilerplate, or low-quality text
Data Processing Pipeline:
- Filtering: Removes HTML tags, URLs, tracking codes, repetitive spam
- Quality Checks: Alphabetic ratio (>70%), symbol ratio (<10%), unique word ratio (>30%)
- Chunking: Respects paragraph/sentence boundaries while producing fixed-length sequences
- Tokenization: GPT-2 tokenizer with vocabulary size 50,257
- Parallel Processing: Multi-threaded data loading (16 workers)
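The quality checks above can be sketched as simple heuristics. The thresholds come from the list (alphabetic ratio > 70%, symbol ratio < 10%, unique word ratio > 30%); the function name and exact definitions of the ratios are assumptions, since the pipeline's real code is not shown here:

```python
import re

def passes_quality_checks(text: str) -> bool:
    """Heuristic filters mirroring the stated thresholds (exact rules assumed)."""
    if not text:
        return False
    chars = len(text)
    alpha_ratio = sum(c.isalpha() for c in text) / chars
    symbol_ratio = sum(not (c.isalnum() or c.isspace()) for c in text) / chars
    words = re.findall(r"[A-Za-z']+", text.lower())
    unique_ratio = len(set(words)) / len(words) if words else 0.0
    return alpha_ratio > 0.70 and symbol_ratio < 0.10 and unique_ratio > 0.30

print(passes_quality_checks("Photosynthesis converts light energy into chemical energy."))  # True
print(passes_quality_checks("$$$ CLICK $$$ CLICK $$$ CLICK $$$ CLICK"))                     # False
```

Filters like these are cheap to run over millions of documents, which is why they sit before any optional perplexity-based selection.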
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Batch Size | 32 |
| Gradient Accumulation Steps | 4 |
| Learning Rate (Standard Params) | 2.5e-4 |
| Learning Rate (BitLinear) | 1.8e-3 |
| Weight Decay | 0.075 |
| Warmup Steps | 300 |
| Max Epochs | 4 |
| Sequence Length | 256 tokens |
| Optimization | AdamW |
### Training Strategy
- Separate Learning Rates: BitLinear layers (ternary weights) use a 5-7× higher LR than standard parameters for ternary quantization stability
- Cosine Annealing: Learning rate schedule with linear warmup → cosine decay (50% steady phase)
- Gradient Clipping: Max norm 1.0 to prevent explosion
- Gradient Noise: Optional additive noise (scale: 1e-5) for stability
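The two-learning-rate setup can be sketched with AdamW parameter groups plus a warmup-then-cosine `LambdaLR`. Here `BitLinear` is a stand-in class, the model is a toy stand-in, and the total step count is a placeholder; the sketch also omits the "50% steady phase" segment mentioned above:

```python
import math
import torch

class BitLinear(torch.nn.Linear):
    """Stand-in for the model's ternary-weight layer (name assumed)."""

model = torch.nn.Sequential(torch.nn.Linear(512, 512), BitLinear(512, 512))

# Split parameters: BitLinear layers get the higher learning rate
bit_params, std_params = [], []
for module in model.modules():
    if isinstance(module, BitLinear):
        bit_params += list(module.parameters(recurse=False))
    elif list(module.parameters(recurse=False)):
        std_params += list(module.parameters(recurse=False))

optimizer = torch.optim.AdamW(
    [{"params": std_params, "lr": 2.5e-4},   # standard parameters
     {"params": bit_params, "lr": 1.8e-3}],  # BitLinear parameters
    weight_decay=0.075,
)

def lr_lambda(step, warmup=300, total=10_000):
    """Linear warmup then cosine decay (total step count is assumed)."""
    if step < warmup:
        return step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Each group's base LR is scaled by the same `lr_lambda` factor, so the ~7× ratio between BitLinear and standard parameters is preserved throughout training.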
## Model Performance

### Inference Characteristics
- RNN-like Latency: O(1) per token when using state caching
- Memory Footprint: Significantly reduced due to:
  - Ternary weights (7-10× smaller than full precision)
  - Stateful inference for single-token processing
- Throughput: Optimized for FPGA deployment with integer arithmetic
### Browser-Based Performance Benchmarks
| Device | Tokens/Second |
|---|---|
| Phone (Mobile Browser) | ~10 tokens/sec |
| Computer (Desktop Browser) | ~45 tokens/sec |
Benchmarks measured on inference entirely in the browser via WebGPU. Actual performance varies based on device capabilities and browser optimization.
### Quantization Impact
- Space: 7-10× model size reduction with ternary quantization
- Speed: Hardware acceleration potential with {-1, 0, +1} operations
- Accuracy: Minimal degradation with separate BitLinear learning rates
## Use Cases

### Recommended For
- Resource-constrained inference (edge devices, FPGAs)
- Real-time token generation (chatbots, autocomplete)
- Model compression research and hardware acceleration studies
### Limitations
- Shorter context window (256 tokens) compared to modern transformers
- Ternary quantization may impact nuanced reasoning tasks
- Limited to English language training data
## Bias, Risks, and Limitations

### Dataset Biases
- Source Bias: FineWeb-Edu skews toward educational/technical content; may underrepresent creative writing, poetry
- Language Coverage: English-only; limited multilingual capability
- Domain Gaps: Underrepresented domains due to quality filtering (e.g., informal speech, colloquial language)
### Technical Limitations
- Ternary Quantization: Reduces expressiveness; may struggle with nuanced language patterns
- Short Context: 256-token training context constrains long-form reasoning
- SSM Peculiarities: State-space models have different inductive biases than transformers; may struggle with discrete counting tasks
### Recommendations for Users
- Validate outputs on your specific use case before deployment
- Consider fine-tuning on domain-specific data for specialized applications
- Monitor generation quality and implement rejection sampling for critical applications
- Use in complementary ensemble with traditional transformers for robustness
## Citation
If you use TaoNet in your research, please cite:
```bibtex
@software{taonet2026,
  title={TaoNet: State-Space Model with Ternary Quantization for Efficient Language Modeling},
  author={TaoTern},
  year={2026},
  url={https://huggingface.co/TaoTern/TaoNet-pico-T1}
}
```
## Related Work
- Mamba: State-space models as alternative to transformers (Gu & Dao, 2024)
- BitNet: Extreme quantization for efficient LLMs (Wang et al., 2024)
## License
This project is licensed under the MIT License.
## Acknowledgments
- FineWeb dataset for high-quality training data
- HuggingFace Transformers library for model architectures and utilities
- The open-source ML community for foundational work on SSMs and quantization techniques