# CPU-Optimized Small Language Model (SLM)
## Revolutionary CPU-First Conversational AI
This is a **fast, CPU-optimized Small Language Model** designed for efficient inference without a GPU:
### ⚡ Performance Highlights
- **893 tokens/sec** on CPU (fine-tuned, quantized model)
- **3.7MB model size** (76.6% smaller than original)
- **3.7M parameters** (tiny but powerful)
- **Q&A specialized** (fine-tuned on question-and-answer conversation patterns)
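
The README does not say how the 893 tokens/sec figure was measured. As a rough sketch of one way to check throughput on your own machine, the helper below times repeated calls to a single-token generation step. `measure_tokens_per_sec` and `generate_step` are hypothetical names used for illustration, not part of this repository.

```python
import time


def measure_tokens_per_sec(generate_step, n_tokens=100, warmup=5):
    """Call generate_step() n_tokens times and return calls per second.

    generate_step is assumed to produce exactly one token per call.
    """
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        generate_step()

    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_step()
    elapsed = time.perf_counter() - start

    return n_tokens / elapsed
```

In practice `generate_step` would wrap one iteration of the decoding loop from the Quick Start. Results will vary with CPU, thread count, and context length.
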
### 🎯 Training Speed
- **2.35 minutes** for fine-tuning
- **28 minutes** for base training (4 epochs)
- **Total time:** ~30 minutes from scratch to production
### 🔧 Technical Specs
- **Architecture:** Transformer-lite with RMSNorm, SwiGLU, Rotary embeddings
- **Quantization:** 8-bit post-training quantization
- **Optimization:** CPU-first with memory mapping and efficient batching
- **Framework:** PyTorch (CPU optimized)
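
For readers unfamiliar with two of the building blocks named above, here is a minimal, dependency-free sketch of RMSNorm and SwiGLU gating on plain Python lists. It illustrates the standard formulas only. The function names are hypothetical, and the model's actual PyTorch implementation may differ in details such as the epsilon value or how the projections are laid out.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """SwiGLU gating: SiLU(gate) * up, elementwise.

    gate and up stand for two linear projections of the same input;
    the projections themselves are omitted in this sketch.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
gated = swiglu_gate(gate=[0.0, 2.0], up=[5.0, 1.0])
```

After `rms_norm` with unit gains, the output has a root-mean-square of approximately 1, which is the property that stabilizes activations without subtracting the mean as LayerNorm does.
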
### 📱 Deployment Ready
- **Mobile-friendly:** At 3.7MB, small enough to bundle in a mobile app
- **No GPU required:** Pure CPU inference
- **Fast startup:** The small checkpoint loads quickly
- **Low memory:** Minimal RAM requirements
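
As a back-of-the-envelope check on the size claims, the sketch below estimates weight storage from the parameter count and bit width. `weight_megabytes` is a hypothetical helper, and the estimate counts weights only, so a real checkpoint (which also stores quantization scales, zero points, and metadata) may differ.

```python
def weight_megabytes(n_params, bits_per_weight):
    """Approximate weight storage in MB (10^6 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e6


int8_mb = weight_megabytes(3_700_000, 8)   # 8-bit weights: 3.7 MB
fp32_mb = weight_megabytes(3_700_000, 32)  # 32-bit floats: 14.8 MB
print(f"int8: {int8_mb:.1f} MB, fp32: {fp32_mb:.1f} MB")
```

With 3.7M parameters at one byte each, the estimate matches the reported 3.7MB model size.
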
## Usage
### Quick Start
```python
import json
import sys

import torch
from huggingface_hub import hf_hub_download

sys.path.append('src')  # Add your model code path
from model import create_model_from_config
from tokenizer import BPETokenizer
from quantize import QuantizedModel

# Download model files
repo_id = "Rahulwale12/SLM"
model_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")

# Load config
with open(config_path, 'r') as f:
    config = json.load(f)

# Create model (map Hugging Face style config keys to the model's own names)
model_config = {
    'model': {
        'vocab_size': config['vocab_size'],
        'd_model': config['hidden_size'],
        'n_layers': config['num_hidden_layers'],
        'n_heads': config['num_attention_heads'],
        'd_ff': config['intermediate_size'],
        'seq_len': config['max_position_embeddings'],
        'dropout': 0.1,
        'use_rmsnorm': True,
        'use_rotary': True,
        'use_swiglu': True
    }
}
model = create_model_from_config(model_config)

# Load quantized weights and dequantize them into the model
checkpoint = torch.load(model_path, map_location='cpu')
quantized_model = QuantizedModel(model, checkpoint['quantization_bits'])
quantized_model.quantized_weights = checkpoint['quantized_weights']
quantized_model.scales = checkpoint['scales']
quantized_model.zeros = checkpoint['zeros']
quantized_model.dequantize_weights()

# Load tokenizer
tokenizer = BPETokenizer()
tokenizer.load(tokenizer_path)

# Generate text with greedy decoding
prompt = "Question: How are you? Answer:"
input_ids = tokenizer.encode(prompt, add_special_tokens=True)
input_ids = torch.tensor([input_ids], dtype=torch.long)

model.eval()  # Disable dropout for inference
with torch.no_grad():
    for _ in range(20):  # Generate 20 new tokens
        logits = model(input_ids)[0, -1, :]  # Logits at the last position
        next_token = torch.argmax(logits, dim=-1).view(1, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

response = tokenizer.decode(input_ids[0].tolist(), skip_special_tokens=True)
print(response)
```
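
The checkpoint above stores `quantized_weights`, `scales`, and `zeros`, which suggests an affine (scale and zero-point) scheme. The README does not show how `QuantizedModel` computes them, so the following is only a sketch of a common per-tensor 8-bit affine round trip, written in plain Python with hypothetical function names.

```python
def quantize_8bit(values):
    """Map floats to integers in [0, 255] using one scale and zero point.

    Assumption: per-tensor asymmetric quantization; the repository's
    QuantizedModel may use a different scheme (e.g. per-channel).
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant inputs
    zero = round(-lo / scale)       # integer that represents 0.0
    q = [min(255, max(0, round(v / scale) + zero)) for v in values]
    return q, scale, zero


def dequantize_8bit(q, scale, zero):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero) * scale for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero = quantize_8bit(weights)
restored = dequantize_8bit(q, scale, zero)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the precision traded away for storing one byte per weight.
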
### Complete Usage Guide
Run the comprehensive usage guide:
```bash
python usage_guide.py
```
## Model Details
- **Base Model:** Trained on conversational data
- **Fine-tuning:** Specialized for Q&A conversations
- **Quantization:** 8-bit for optimal speed/size balance
- **License:** MIT
## Performance Comparison
| Model | Speed (tokens/sec) | Size | Training Time |
|-------|-------------------|------|---------------|
| Base | 942 | 45.2MB | 28 min |
| **Fine-tuned** | **893** | **3.7MB** | **2.35 min** |

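
To make the trade-off in the table concrete, this small sketch computes the relative change between the two rows. `percent_change` is a hypothetical helper; the inputs are the figures from the table.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative = decrease)."""
    return (after - before) / before * 100.0


size_change = percent_change(45.2, 3.7)  # base size -> fine-tuned size, in MB
speed_change = percent_change(942, 893)  # base speed -> fine-tuned speed
print(f"size: {size_change:.1f}%, speed: {speed_change:.1f}%")
# prints "size: -91.8%, speed: -5.2%"
```

By the table's own numbers, the fine-tuned model is roughly 92% smaller than the base model while giving up about 5% of its throughput.
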
This model shows that a small, quantized language model can make conversational AI practical on CPU-only devices, without requiring specialized hardware.