---
language:
- en
- es
- pt
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-novelty-estimation
- dataset-contamination-detection
- a100-optimized
pipeline_tag: text-classification
---

# oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT-Optimized WayraPPL

🚀 **A100-optimized TensorRT version** of WayraPPL for high-throughput perplexity prediction.

## ⚠️ Hardware Requirements

**The TensorRT engine requires an NVIDIA A100 GPU with:**

- GPU Architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+

## 🚀 Performance

- **Throughput**: ~50,000 samples/sec (A100)
- **Latency**: <1 ms per sample
- **Batch Size**: Up to 2048
- **Memory**: ~2 GB GPU memory

## 📦 Installation

```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should be 10.13.x
```

## 🔧 Usage

### Option 1: PyTorch Model (Standard)

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the PyTorch weights from the Hub
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# Tokenize a batch of texts and estimate their perplexity
texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)

print(f"PPL: {outputs['ppl']}")
```

### Option 2: TensorRT Engine (High Performance)

```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference
texts = ["Your text here"] * 1000  # Large batch
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# The engine takes NumPy arrays, so convert the PyTorch tensors first
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
```

## Files Included

- **PyTorch Model**: Standard Hugging Face format
  - `pytorch_model.bin` - Model weights
  - `config.json` - Model configuration
  - `tokenizer.json` - Tokenizer
- **TensorRT Engine**: A100-optimized
  - `wayrappl_fp16_bs2048.engine` - TensorRT engine (A100 only)
  - `tensorrt_config.json` - Engine configuration
  - `tensorrt_inference.py` - Inference code
  - `tensorrt_requirements.txt` - Dependencies

## Use Cases

- **Semantic Filtering**
- **Curriculum Learning**
- **Large-scale dataset cleaning** (millions of documents)
- **Real-time perplexity estimation**
- **High-throughput data quality assessment**
- **Production MLOps pipelines**

## Model Details

- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B
- **Architecture**: GPT2-based Transformer blocks with perplexity heads
- **Languages**: Spanish, Portuguese, English
- **Max Length**: 512 tokens
- **Precision**: FP16 (TensorRT), FP32 (PyTorch)

## ⚡ Benchmarks (A100)

| Model Type     | Throughput (samples/sec) | Latency | Memory |
|----------------|--------------------------|---------|--------|
| Llama 3.2 1B   | ~200                     | 50 ms   | 8 GB   |
| Wayra PyTorch  | ~1,000                   | 10 ms   | 4 GB   |
| Wayra TensorRT | ~50,000                  | <1 ms   | 2 GB   |

## Troubleshooting

**"TensorRT engine not compatible"**

- Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check CUDA version: `nvidia-smi` (should be 12.8+)
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should be 10.13.x)

**"CUDA out of memory"**

- Reduce batch size in inference
- Use gradient checkpointing if training

## Citation

```bibtex
@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```

## License

Apache 2.0 - See LICENSE file

---

**Note**: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.
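
The note above can be turned into a small runtime check. The following is a minimal sketch, not part of this repository: `detect_capability` and `pick_backend` are hypothetical helper names, and it assumes PyTorch is used to query the GPU. It picks the TensorRT path only when the device reports compute capability 8.0 (sm_80) and falls back to the PyTorch model otherwise. A matching compute capability is a necessary condition rather than a guarantee, since a prebuilt engine may still be rejected on a different sm_80 device or TensorRT version.

```python
from typing import Optional, Tuple

# Compute capability of the A100 (sm_80), as listed in the hardware requirements.
A100_CAPABILITY = (8, 0)


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the (major, minor) compute capability of GPU 0, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(0))


def pick_backend(capability: Optional[Tuple[int, int]]) -> str:
    """Hypothetical helper: choose 'tensorrt' on sm_80 devices, 'pytorch' otherwise."""
    if capability == A100_CAPABILITY:
        return "tensorrt"
    return "pytorch"


if __name__ == "__main__":
    backend = pick_backend(detect_capability())
    print(f"Selected backend: {backend}")
```

When the result is `"tensorrt"`, load the engine as in Option 2; otherwise load the PyTorch model as in Option 1.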