---
language:
- en
- es
- pt
license: apache-2.0
library_name: transformers
tags:
- perplexity-estimation
- tensorrt
- data-novelty-estimation
- dataset-contamination-detection
- a100-optimized
pipeline_tag: text-classification
---
# oflorez/Wayra-Perplexity-Estimator-55M-TensorRT: TensorRT-Optimized WayraPPL
🚀 **A100-optimized TensorRT build** of WayraPPL for high-throughput perplexity estimation.
## ⚠️ Hardware Requirements
**The TensorRT engine requires an NVIDIA A100 GPU with:**
- GPU Architecture: sm_80 (A100-80GB)
- CUDA: 12.8+
- TensorRT: 10.13.x
- Driver: 570.124.06+
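A quick way to sanity-check these prerequisites before loading the engine is a minimal sketch like the one below; it relies only on standard `torch` and `tensorrt` introspection calls, and the expected version strings mirror the requirements above.
```python
# Minimal environment check for the requirements above (a sketch, not exhaustive).
import torch
import tensorrt as trt

major, minor = torch.cuda.get_device_capability(0)
assert (major, minor) == (8, 0), f"Expected sm_80 (A100), found sm_{major}{minor}"
print("GPU:", torch.cuda.get_device_name(0))
print("CUDA (PyTorch build):", torch.version.cuda)  # should report 12.8+
print("TensorRT:", trt.__version__)                 # should report 10.13.x
```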
## 🚀 Performance
- **Throughput**: ~50,000 samples/sec (A100)
- **Latency**: <1ms per sample
- **Batch Size**: Up to 2048
- **Memory**: ~2GB GPU memory
## 📦 Installation
```bash
# Install requirements (A100 + CUDA 12.8+ required)
pip install -r tensorrt_requirements.txt
# Verify TensorRT installation
python -c "import tensorrt; print(tensorrt.__version__)" # Should be 10.13.x
```
## 🔧 Usage
### Option 1: PyTorch Model (Standard)
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load the standard HuggingFace checkpoint (FP32)
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model = AutoModel.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")
model.eval()

texts = ["Your text here"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
print(f"PPL: {outputs['ppl']}")
```
### Option 2: TensorRT Engine (High Performance)
```python
from tensorrt_inference import WayraPPLTensorRT  # shipped in this repo
from transformers import AutoTokenizer

# Load the prebuilt TensorRT engine (A100 / sm_80 required)
model = WayraPPLTensorRT("wayrappl_fp16_bs2048.engine")
tokenizer = AutoTokenizer.from_pretrained("oflorez/Wayra-Perplexity-Estimator-55M-TensorRT")

# High-throughput inference on a large batch
texts = ["Your text here"] * 1000
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# The engine consumes NumPy arrays rather than torch tensors
outputs = model.infer(inputs['input_ids'].numpy(), inputs['attention_mask'].numpy())
```
## Files Included
- **PyTorch Model** (standard HuggingFace format):
  - `pytorch_model.bin` - model weights
  - `config.json` - model configuration
  - `tokenizer.json` - tokenizer
- **TensorRT Engine** (A100-optimized):
  - `wayrappl_fp16_bs2048.engine` - TensorRT engine (A100 only)
  - `tensorrt_config.json` - engine configuration
  - `tensorrt_inference.py` - inference code
  - `tensorrt_requirements.txt` - dependencies
## Use Cases
- **Semantic Filtering**
- **Curriculum Learning**
- **Large-scale dataset cleaning** (millions of documents; see the filtering sketch below)
- **Real-time perplexity estimation**
- **High-throughput data quality assessment**
- **Production MLOps pipelines**
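As a concrete illustration of the dataset-cleaning use case, here is a minimal sketch that scores documents in mini-batches with the PyTorch model and keeps those below a perplexity threshold. The `ppl` output key follows the usage example above; the threshold and batch size are hypothetical values to tune for your corpus.
```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "oflorez/Wayra-Perplexity-Estimator-55M-TensorRT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

def filter_by_ppl(docs, threshold=100.0, batch_size=256):
    """Keep documents whose estimated perplexity is below `threshold`.

    `threshold` and `batch_size` are illustrative and should be tuned.
    """
    kept = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512)
        with torch.no_grad():
            ppl = model(**inputs)["ppl"]  # one score per document (see Usage above)
        kept.extend(d for d, p in zip(batch, ppl.tolist()) if p < threshold)
    return kept

clean_docs = filter_by_ppl(["First document text", "Second document text"])
```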
## Model Details
- **Base**: Knowledge distillation from meta-llama/Llama-3.2-1B
- **Architecture**: GPT-2-based Transformer blocks with perplexity heads (see the sketch below)
- **Languages**: Spanish, Portuguese, English
- **Max Length**: 512 tokens
- **Precision**: FP16 (TensorRT), FP32 (PyTorch)
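To make the architecture line above concrete, here is a hypothetical sketch of what a GPT-2-style backbone with a scalar perplexity head can look like. This is not the actual WayraPPL implementation (the real configuration ships in `config.json`); the class and head names are illustrative.
```python
import torch.nn as nn
from transformers import GPT2Config, GPT2Model

class WayraPPLSketch(nn.Module):
    """Illustrative only: GPT-2 Transformer blocks plus a scalar perplexity head."""

    def __init__(self, config: GPT2Config):
        super().__init__()
        self.backbone = GPT2Model(config)             # stack of GPT-2 blocks
        self.ppl_head = nn.Linear(config.n_embd, 1)   # hypothetical regression head

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Mean-pool over non-padding tokens, then predict one PPL score per text
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        else:
            pooled = hidden.mean(dim=1)
        return {"ppl": self.ppl_head(pooled).squeeze(-1)}
```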
## ⚡ Benchmarks (A100)
| Model Type     | Throughput (samples/sec) | Latency (per sample) | GPU Memory |
|----------------|--------------------------|----------------------|------------|
| Llama-3.2-1B   | ~200                     | 50 ms                | 8 GB       |
| Wayra PyTorch  | ~1,000                   | 10 ms                | 4 GB       |
| Wayra TensorRT | ~50,000                  | <1 ms                | 2 GB       |
## Troubleshooting
**"TensorRT engine not compatible"**
- Ensure you're using an A100-SXM4-80GB GPU (sm_80 architecture)
- Check the CUDA version: `nvidia-smi` (should report 12.8+)
- Verify TensorRT: `python -c "import tensorrt; print(tensorrt.__version__)"` (should print 10.13.x)
**"CUDA out of memory"**
- Reduce the inference batch size (see the chunking sketch below)
- Use gradient checkpointing if training
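One simple way to reduce the batch size is to chunk the tokenized inputs before calling the engine, as in the sketch below. The `infer` signature follows the TensorRT usage example above; the chunk size of 256 is a placeholder.
```python
import numpy as np

def infer_in_chunks(model, input_ids, attention_mask, chunk_size=256):
    """Run TensorRT inference in smaller chunks to stay within GPU memory."""
    outputs = []
    for start in range(0, len(input_ids), chunk_size):
        end = start + chunk_size
        outputs.append(model.infer(input_ids[start:end], attention_mask[start:end]))
    return np.concatenate(outputs)
```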
## Citation
```bibtex
@software{WayraPPL,
  title  = {WayraPPL: High-Performance Perplexity Estimation of Data Novelty},
  author = {Omar U. Florez and LatamGPT Team},
  year   = {2025},
  url    = {https://huggingface.co/latam-gpt/Wayra-Perplexity-Estimator-55M}
}
```
## License
Apache 2.0 - See LICENSE file
---
**Note**: This model is optimized for A100 GPUs. For other GPUs, use the PyTorch version or rebuild the TensorRT engine for your specific hardware.