|
|
---
language:
- en
- es
- pt
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-novelty-estimation
- dataset-contamination-detection
- a100-optimized
pipeline_tag: text-classification
---
|
|
|
|
|
# oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT Optimized WayraPPL |
|
|
|
|
|
🚀 **A100-optimized TensorRT version** of WayraPPL for high-throughput perplexity estimation.
|
|
|
|
|
## ⚠️ Hardware Requirements |
|
|
|
|
|
**The TensorRT engine requires an NVIDIA A100 GPU with:**
|
|
- GPU Architecture: sm_80 (A100-80GB) |
|
|
- CUDA: 12.8+ |
|
|
- TensorRT: 10.13.x |
|
|
- Driver: 570.124.06+ |
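
To confirm the environment meets these requirements before loading the engine, a quick check like the following can help (a minimal sketch; it assumes a CUDA-enabled PyTorch build is installed):

```python
import torch

# Check for an A100-class GPU (compute capability sm_80) and the CUDA build.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
print(f"CUDA (PyTorch build): {torch.version.cuda}")
assert (major, minor) == (8, 0), "The bundled engine targets sm_80 (A100)"
```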
|
|
|
|
|
## 🚀 Performance |
|
|
|
|
|
- **Throughput**: ~50,000+ samples/sec (A100) |
|
|
- **Latency**: <1ms per sample |
|
|
- **Batch Size**: Up to 2048 |
|
|
- **Memory**: ~2GB GPU memory |
|
|
|
|
|
## 📦 Installation |
|
|
|
|
|
```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt

# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)"  # Should print 10.13.x
```
|
|
|
|
|
## 🔧 Usage |
|
|
|
|
|
### Option 1: PyTorch Model (Standard) |
|
|
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")  # one perplexity estimate per input text
```
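
Since the model returns one estimate per input, a batch of texts can be ranked by estimated novelty directly. A minimal sketch continuing the snippet above (it assumes `outputs['ppl']` is a 1-D tensor aligned with the input order):

```python
texts = [
    "El clima en los Andes cambia rápidamente.",
    "The quick brown fox jumps over the lazy dog.",
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
ppl = model(**inputs)["ppl"].detach().flatten().tolist()

# Higher estimated PPL suggests more novel (less familiar) text.
for score, text in sorted(zip(ppl, texts), reverse=True):
    print(f"{score:8.2f}  {text}")
```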
|
|
|
|
|
### Option 2: TensorRT Engine (High Performance) |
|
|
```python
from tensorrt_inference import WayraPPLTensorRT
from transformers import AutoTokenizer

# Load the TensorRT engine (A100 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference on a large batch
texts = ["Your text here"] * 1000
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
```
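
The bundled engine is built for batches of up to 2048 sequences, so larger corpora should be scored in engine-sized chunks. A minimal sketch (the `ppl_for_corpus` helper is hypothetical, and it assumes `model.infer` returns one value per input row):

```python
import numpy as np

def ppl_for_corpus(texts, batch_size=2048):
    """Score an arbitrarily large list of texts in engine-sized chunks."""
    scores = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        inputs = tokenizer(chunk, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        out = model.infer(inputs['input_ids'].numpy(),
                          inputs['attention_mask'].numpy())
        scores.append(np.asarray(out).flatten())
    return np.concatenate(scores)

ppl_scores = ppl_for_corpus(["Your text here"] * 10_000)
```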
|
|
|
|
|
## Files Included |
|
|
|
|
|
- **PyTorch Model**: Standard HuggingFace format
  - `pytorch_model.bin` - Model weights
  - `config.json` - Model configuration
  - `tokenizer.json` - Tokenizer
- **TensorRT Engine**: A100-optimized
  - `wayrappl_fp16_bs2048.engine` - TensorRT engine (A100 only)
  - `tensorrt_config.json` - Engine configuration
  - `tensorrt_inference.py` - Inference code
  - `tensorrt_requirements.txt` - Dependencies
|
|
|
|
|
## Use Cases |
|
|
|
|
|
- **Semantic Filtering** |
|
|
- **Curriculum Learning** |
|
|
- **Large-scale dataset cleaning** (millions of documents; see the filtering sketch below)
|
|
- **Real-time perplexity estimation** |
|
|
- **High-throughput data quality assessment** |
|
|
- **Production MLOps pipelines** |
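
For cleaning and contamination screening, one common pattern is to score every document and keep only those inside a PPL band: unusually low scores can indicate memorized or duplicated text, and unusually high scores can indicate noise. A hedged sketch reusing the hypothetical `ppl_for_corpus` helper from the TensorRT example above (thresholds are illustrative, not tuned):

```python
docs = load_documents()  # hypothetical loader for your corpus

scores = ppl_for_corpus(docs)

# Illustrative band; calibrate LOW/HIGH on a held-out sample of your data.
LOW, HIGH = 5.0, 200.0
kept = [d for d, s in zip(docs, scores) if LOW <= s <= HIGH]
flagged = [d for d, s in zip(docs, scores) if s < LOW]  # possible contamination
print(f"kept {len(kept)}/{len(docs)}, flagged {len(flagged)} as suspect")
```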
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B |
|
|
- **Architecture**: GPT2-based Transformer blocks with perplexity heads |
|
|
- **Languages**: Spanish, Portuguese, English |
|
|
- **Max Length**: 512 tokens (longer inputs are truncated; see the example below)
|
|
- **Precision**: FP16 (TensorRT), FP32 (PyTorch) |
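
Because the context is capped at 512 tokens, longer documents are truncated at tokenization time; passing `max_length` explicitly makes that limit visible. A minimal sketch:

```python
inputs = tokenizer(
    ["a very long document " * 500],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512,  # model maximum; anything beyond this is dropped
)
print(inputs["input_ids"].shape)  # torch.Size([1, 512])
```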
|
|
|
|
|
## ⚡ Benchmarks (A100) |
|
|
|
|
|
| Model Type     | Throughput (samples/sec) | Latency | Memory |
|----------------|--------------------------|---------|--------|
| Llama 3.2 1B   | ~200                     | 50ms    | 8GB    |
| Wayra PyTorch  | ~1,000                   | 10ms    | 4GB    |
| Wayra TensorRT | ~50,000                  | <1ms    | 2GB    |
|
|
|
|
|
## Troubleshooting |
|
|
|
|
|
**"TensorRT engine not compatible"** |
|
|
- Ensure you're using A100-SXM4-80GB GPU (sm_80 architecture) |
|
|
- Check CUDA version: `nvidia-smi` (should be 12.8+) |
|
|
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should print 10.13.x)
|
|
|
|
|
**"CUDA out of memory"** |
|
|
- Reduce the inference batch size (see the backoff sketch below)
|
|
- Use gradient checkpointing if training |
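
For the PyTorch path, batch-size reduction can be automated by retrying a failed batch at half the size. A minimal sketch (the `score_with_backoff` helper is hypothetical; it assumes the tokenizer and model from the usage example, with the model already on GPU):

```python
import torch

def score_with_backoff(texts, batch_size=2048):
    """Score texts, halving the batch size whenever CUDA memory runs out."""
    if batch_size < 1:
        raise RuntimeError("Out of memory even at batch size 1")
    try:
        scores = []
        for start in range(0, len(texts), batch_size):
            inputs = tokenizer(texts[start:start + batch_size],
                               return_tensors="pt", padding=True, truncation=True)
            inputs = {k: v.to(model.device) for k, v in inputs.items()}
            scores.extend(model(**inputs)["ppl"].detach().flatten().tolist())
        return scores
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        return score_with_backoff(texts, batch_size // 2)
```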
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@software{WayraPPL,
  title={WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author={Omar U. Florez and LatamGPT Team},
  year={2025},
  url={https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 - See LICENSE file |
|
|
|
|
|
--- |
|
|
|
|
|
**Note**: This model is optimized for A100 GPUs. On other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.
|
|
|