---
language:
- en
- es
- pt
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-novelty-estimation
- dataset-contamination-detection
- a100-optimized
pipeline_tag: text-classification
---
# oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT-Optimized WayraPPL
🚀 **A100-optimized TensorRT build** of WayraPPL for high-throughput perplexity estimation.
## ⚠️ Hardware Requirements
**The TensorRT engine requires an NVIDIA A100 GPU with:**
- GPU Architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+
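A quick way to sanity-check these prerequisites before loading the engine is a minimal sketch like the one below; it relies only on standard `torch` and `tensorrt` introspection calls, and the expected version strings mirror the requirements above.
```python
# Minimal environment check for the requirements above (a sketch, not exhaustive).
import torch
import tensorrt as trt

major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) == (8, 0), f"Expected sm_80 (A100), found sm_{major}{minor}"
print("GPU:", torch.cuda.get_device_name(0))
print("CUDA (PyTorch build):", torch.version.cuda)  # should report 12.8+
print("TensorRT:", trt.__version__)                 # should report 10.13.x
```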
## 🚀 Performance
- **Throughput**: ~50,000 samples/sec (A100)
- **Latency**: <1ms per sample
- **Batch Size**: Up to 2048
- **Memory**: ~2GB GPU memory
## 📦 Installation
```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt
# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)" # Should be 10.13.x
```
## 🔧 Usage
### Option 1: PyTorch Model (Standard)
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the standard HuggingFace checkpoint (FP32)
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model.eval()

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")
```
### Option 2: TensorRT Engine (High Performance)
```python
from tensorrt_inference import WayraPPLTensorRT  # shipped in this repo
from transformers import AutoTokenizer

# Load the prebuilt TensorRT engine (A100 / sm_80 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference on a large batch
texts = ["Your text here"] * 1000
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# The engine consumes NumPy arrays rather than torch tensors
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
```
## Files Included
- **PyTorch Model** (standard HuggingFace format):
  - `pytorch_model.bin` - model weights
  - `config.json` - model configuration
  - `tokenizer.json` - tokenizer
- **TensorRT Engine** (A100-optimized):
  - `wayrappl_fp16_bs2048.engine` - TensorRT engine (A100 only)
  - `tensorrt_config.json` - engine configuration
  - `tensorrt_inference.py` - inference code
  - `tensorrt_requirements.txt` - dependencies
## Use Cases
- **Semantic Filtering**
- **Curriculum Learning**
- **Large-scale dataset cleaning** (millions of documents; see the filtering sketch below)
- **Real-time perplexity estimation**
- **High-throughput data quality assessment**
- **Production MLOps pipelines**
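As a concrete illustration of the dataset-cleaning use case, here is a minimal sketch that scores documents in mini-batches with the PyTorch model and keeps those below a perplexity threshold. The `ppl` output key follows the usage example above; the threshold and batch size are hypothetical values to tune for your corpus.
```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "oflorez/Wayra-Perplexity-Estimator-55M-TensorRT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def filter_by_ppl(docs, threshold=100.0, batch_size=256):
    """Keep documents whose estimated perplexity is below `threshold`.

    `threshold` and `batch_size` are illustrative and should be tuned.
    """
    kept = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        with torch.no_grad():
            ppl = model(**inputs)["ppl"]  # one score per document (see Usage above)
        kept.extend(d for d, p in zip(batch, ppl.tolist()) if p < threshold)
    return kept

clean_docs = filter_by_ppl(["First document text", "Second document text"])
```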
## Model Details
- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B
- **Architecture**: GPT-2-based Transformer blocks with perplexity heads (see the sketch below)
- **Languages**: Spanish, Portuguese, English
- **Max Length**: 512 tokens
- **Precision**: FP16 (TensorRT), FP32 (PyTorch)
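To make the architecture line above concrete, here is a hypothetical sketch of what a GPT-2-style backbone with a scalar perplexity head can look like. This is not the actual WayraPPL implementation (the real configuration ships in `config.json`); the class and head names are illustrative.
```python
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

class WayraPPLSketch(nn.Module):
    """Illustrative only: GPT-2 Transformer blocks plus a scalar perplexity head."""

    def __init__(self, config: GPT2Config):
        super().__init__()
        self.backbone = GPT2Model(config)             # stack of GPT-2 blocks
        self.ppl_head = nn.Linear(config.n_embd, 1)   # hypothetical regression head

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Mean-pool over non-padding tokens, then predict one PPL score per text
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        else:
            pooled = hidden.mean(dim=1)
        return {"ppl": self.ppl_head(pooled).squeeze(-1)}
```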
## ⚡ Benchmarks (A100)
| Model Type     | Throughput (samples/sec) | Latency (per sample) | GPU Memory |
|----------------|--------------------------|----------------------|------------|
| Llama-3.2-1B   | ~200                     | 50 ms                | 8 GB       |
| Wayra PyTorch  | ~1,000                   | 10 ms                | 4 GB       |
| Wayra TensorRT | ~50,000                  | <1 ms                | 2 GB       |
## Troubleshooting
**"TensorRT engine not compatible"**
- Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check the CUDA version: `nvidia-smi` (should report 12.8+)
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should print 10.13.x)
**"CUDA out of memory"**
- Reduce the inference batch size (see the chunking sketch below)
- Use gradient checkpointing if training
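One simple way to reduce the batch size is to chunk the tokenized inputs before calling the engine, as in the sketch below. The `infer` signature follows the TensorRT usage example above; the chunk size of 256 is a placeholder.
```python
import numpy as np

def infer_in_chunks(model, input_ids, attention_mask, chunk_size=256):
    """Run TensorRT inference in smaller chunks to stay within GPU memory."""
    outputs = []
    for start in range(0, len(input_ids), chunk_size):
        end = start + chunk_size
        outputs.append(model.infer(input_ids[start:end], attention_mask[start:end]))
    return np.concatenate(outputs)
```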
## Citation
```bibtex
@software{WayraPPL,
  title  = {WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author = {Omar U. Florez and LatamGPT Team},
  year   = {2025},
  url    = {https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```
## License
Apache 2.0 - See LICENSE file
---
**Note**: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.