---
language:
- en
- es  
- pt
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-novelty-estimation
- dataset-contamination-detection
- a100-optimized
pipeline_tag: text-classification
---

# oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT Optimized WayraPPL

🚀 **A100-optimized TensorRT version** of WayraPPL for high-throughput perplexity estimation.

## ⚠️ Hardware Requirements

**The TensorRT engine requires an NVIDIA A100 GPU with:**
- GPU Architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+

## 🚀 Performance

- **Throughput**: ~50,000 samples/sec (A100)
- **Latency**: <1ms per sample
- **Batch Size**: Up to 2048
- **Memory**: ~2GB GPU memory

## 📦 Installation

```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should be 10.13.x
```

## 🔧 Usage

### Option 1: PyTorch Model (Standard)
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")
```

### Option 2: TensorRT Engine (High Performance)
```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load TensorRT model (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference
texts = ["Your text here"] * 1000  # Large batch
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
```
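
Corpora larger than the engine's maximum batch size (2048, per the configuration above) need to be fed in fixed-size slices. A minimal, generic sketch of that batching loop, independent of the engine API:

```python
def chunked(seq, size=2048):
    """Yield successive slices of at most `size` items; the last may be shorter."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# e.g. 5,000 inputs split into engine-sized batches of 2048, 2048, and 904
batch_sizes = [len(b) for b in chunked(list(range(5000)))]
```

Each slice would then be tokenized and passed to `model.infer` as in the example above.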

## Files Included

- **PyTorch Model**: Standard HuggingFace format
  - `pytorch_model.bin` - Model weights
  - `config.json` - Model configuration  
  - `tokenizer.json` - Tokenizer

- **TensorRT Engine**: A100-optimized
  - `wayrappl_fp16_bs2048.engine` - TensorRT engine (A100 only)
  - `tensorrt_config.json` - Engine configuration
  - `tensorrt_inference.py` - Inference code
  - `tensorrt_requirements.txt` - Dependencies

## Use Cases

- **Semantic Filtering**
- **Curriculum Learning**
- **Large-scale dataset cleaning** (millions of documents)
- **Real-time perplexity estimation**
- **High-throughput data quality assessment**
- **Production MLOps pipelines**
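
The dataset-cleaning use case reduces to thresholding the estimated perplexities. A minimal sketch, assuming the `ppl` scores come from either model above; the threshold of 100 is an illustrative value, not a recommendation from this card:

```python
def filter_by_perplexity(docs, ppl_scores, max_ppl=100.0):
    """Keep documents whose estimated perplexity is below max_ppl.

    `ppl_scores` would come from the model's `ppl` output; the default
    threshold is a placeholder to be tuned on your own corpus.
    """
    return [doc for doc, ppl in zip(docs, ppl_scores) if ppl < max_ppl]

docs = ["a clean sentence", "garbled ###@@@ text"]
scores = [42.0, 350.0]  # e.g. taken from outputs['ppl'] above
kept = filter_by_perplexity(docs, scores)
```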

## Model Details

- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B
- **Architecture**: GPT2-based Transformer blocks with perplexity heads
- **Languages**: Spanish, Portuguese, English
- **Max Length**: 512 tokens
- **Precision**: FP16 (TensorRT), FP32 (PyTorch)

## ⚡ Benchmarks (A100)

| Model Type       | Throughput | Latency | Memory |
|------------------|------------|---------|--------|
| Llama 3.2 1B     |   ~200/sec | 50ms    | 8GB    |
| Wayra PyTorch    | ~1,000/sec | 10ms    | 4GB    |
| Wayra TensorRT   | ~50,000/sec| <1ms    | 2GB    |

## Troubleshooting

**"TensorRT engine not compatible"**
- Ensure you're using A100-SXM4-80GB GPU (sm_80 architecture)
- Check CUDA version: `nvidia-smi` (should be 12.8+)
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should be 10.13.x)
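
The sm_80 requirement can be checked programmatically: sm_80 corresponds to CUDA compute capability (8, 0). A small helper, with the PyTorch call to obtain the capability shown as a comment (it assumes PyTorch is installed alongside TensorRT):

```python
def is_sm80(capability):
    """True when the reported compute capability matches sm_80 (A100)."""
    return tuple(capability) == (8, 0)

# With PyTorch and a visible GPU, the capability can be queried as:
#   import torch
#   assert is_sm80(torch.cuda.get_device_capability(0))
ok = is_sm80((8, 0))       # A100 passes
not_ok = is_sm80((8, 6))   # e.g. an RTX 30-series GPU fails
```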

**"CUDA out of memory"**
- Reduce batch size in inference
- Use gradient checkpointing if training

## Citation

```bibtex
@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```

## License

Apache 2.0 - See LICENSE file

---

**Note**: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.