# Embedl all-MiniLM-L6-v2 (Quantized for TensorRT)
A deployable INT8-quantized version of `sentence-transformers/all-MiniLM-L6-v2`, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. It produces the same L2-normalised sentence embedding as the upstream encoder.

Upstream model: `sentence-transformers/all-MiniLM-L6-v2`
## Highlights

- Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
- Drop-in replacement for `sentence-transformers/all-MiniLM-L6-v2` in TensorRT pipelines: same input pair (`input_ids`, `attention_mask`) at `seq_len=128`, same output embedding semantics (mean-pooled, L2-normalised).
- Validated accuracy within 0.0026 of the FP32 Spearman ρ on stsb (see Accuracy table below).
- Faster than `trtexec --best` on supported NVIDIA hardware (see Performance table below).
- Includes both ONNX (for TensorRT) and PT2 (`torch.export`-loadable) artifacts plus runnable inference scripts.
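The "mean-pooled, L2-normalised" output semantics above can be reproduced from raw token embeddings with a short NumPy sketch. The shapes are illustrative (128 tokens, 384 hidden dims matching MiniLM-L6); the shipped artifacts already perform this pooling internally.

```python
import numpy as np

def mean_pool_l2(token_embeddings, attention_mask):
    """Mean-pool token embeddings over non-padding tokens, then L2-normalise."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # (hidden,)
    count = max(float(mask.sum()), 1e-9)                           # avoid divide-by-zero
    pooled = summed / count
    return pooled / np.linalg.norm(pooled)

# Toy input: 128 tokens x 384 hidden dims, last 28 positions padded out
emb = np.random.default_rng(0).normal(size=(128, 384))
mask = np.zeros(128, dtype=np.int64)
mask[:100] = 1
sentence_vec = mean_pool_l2(emb, mask)  # unit-length (384,) sentence embedding
```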
## Quick Start

```shell
pip install huggingface_hub transformers numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/all-MiniLM-L6-v2-quantized-trt', local_dir='.')"
python infer_pt2.py --sentence "A man is eating food."  # pure PyTorch via torch.export
# or
python infer_trt.py --sentence "A man is eating food."  # TensorRT (requires pycuda + tensorrt)
```
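Because the embeddings are L2-normalised, comparing two sentences by cosine similarity reduces to a dot product. A minimal sketch with placeholder unit vectors (real embeddings would come from the inference scripts above):

```python
import numpy as np

# Placeholder unit vectors standing in for two sentence embeddings
emb_a = np.array([0.6, 0.8])
emb_b = np.array([0.8, 0.6])

# For unit-norm vectors, the dot product equals cosine similarity
similarity = float(emb_a @ emb_b)  # close to 0.96
```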
## Files

| File | Purpose |
|---|---|
| `embedl_all-MiniLM-L6-v2_int8.onnx` | INT8-quantized ONNX with Q/DQ nodes; feed to TensorRT. |
| `embedl_all-MiniLM-L6-v2_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
| `infer_trt.py` | Builds a TRT engine from the ONNX and runs sample inference. |
| `infer_pt2.py` | Loads the `.pt2` with `torch.export.load` and runs sample inference. |
## Performance

Latency measured with TensorRT `trtexec`, GPU compute time only (`--noDataTransfers`), CUDA Graph and spin-wait enabled, clocks locked (`nvpmodel -m 0 && jetson_clocks` on Jetson).

### NVIDIA Jetson AGX Orin
| Configuration | Mean Latency | Speedup vs FP16 |
|---|---|---|
| TensorRT FP16 | 0.41 ms | 1.00x |
| TensorRT `--best` (unconstrained) | 0.41 ms | 1.01x |
| Embedl Deploy INT8 | 0.38 ms | 1.07x |
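The measurement setup described above corresponds roughly to a `trtexec` invocation along these lines; the flags are the ones named in the text, while the engine output name is illustrative.

```shell
# Build an engine from the shipped INT8 Q/DQ ONNX and time GPU compute only.
# model.plan is an assumed output filename.
trtexec --onnx=embedl_all-MiniLM-L6-v2_int8.onnx \
        --saveEngine=model.plan \
        --int8 --fp16 \
        --noDataTransfers --useCudaGraph --useSpinWait
```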
## Accuracy

Evaluated on the stsb validation split. The quantized model retains nearly all of the FP32 accuracy (Spearman ρ within 0.0026).

| Model | Spearman ρ |
|---|---|
| `sentence-transformers/all-MiniLM-L6-v2` FP32 (ours) | 0.8672 |
| Embedl all-MiniLM-L6-v2 INT8 | 0.8646 |
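Spearman ρ here is the rank correlation between predicted cosine similarities and the gold STS-B scores. It can be computed without SciPy as the Pearson correlation of ranks; a minimal sketch, assuming no tied values (real STS-B scoring uses proper tie handling):

```python
import numpy as np

def spearman_rho(pred, gold):
    """Spearman rank correlation (no tie handling) of two score lists."""
    rank = lambda x: np.argsort(np.argsort(np.asarray(x))).astype(float)
    return float(np.corrcoef(rank(pred), rank(gold))[0, 1])

# Toy example: perfectly monotone predictions give rho close to 1.0
rho = spearman_rho([0.1, 0.4, 0.9], [1.0, 2.5, 4.0])
```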
## Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch-to-TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.
## License

| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0; no redistribution as a hosted service |
| Upstream architecture and weights | all-MiniLM-L6-v2 license |
## Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.