---
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- sql
- llama
- nvfp4
- quantized
- vllm
- blackwell
- llmcompressor
library_name: transformers
pipeline_tag: text-generation
---
# Llama 3.1 8B SQL — NVFP4 Quantized (Blackwell)
A Llama 3.1 8B Instruct model fine-tuned for text-to-SQL generation, quantized to NVFP4 for NVIDIA Blackwell GPUs (RTX 50-series) using `llm-compressor`.
## Quantization Details
| Component | Format | Notes |
|-----------|----------|--------------------------------------------|
| Weights | NVFP4 | ~4.5GB — Blackwell 5th-gen Tensor Core native |
| KV-Cache | FP8 | ~50% of cache memory vs FP16 — configured via vLLM |
| Activations | FP16 | Unquantized; `lm_head` also kept in FP16 for output quality |
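The card does not publish the compression recipe itself. As a hedged sketch, a typical `llm-compressor` one-shot NVFP4 recipe consistent with the table above (all `Linear` layers quantized, `lm_head` excluded; the output directory name is illustrative) looks like:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # base model from the card header

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4: 4-bit FP (E2M1) weights with block-wise FP8 scales.
# lm_head stays in higher precision, matching the table above.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-8B-SQL-NVFP4")
tokenizer.save_pretrained("Llama-3.1-8B-SQL-NVFP4")
```

Running this requires a GPU and the base model weights; treat it as a recipe sketch, not the exact command used to produce this checkpoint.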
## vLLM Inference (RTX 5090)
```bash
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```
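The FP8 KV cache is what makes the 131,072-token window affordable. A quick back-of-envelope check, assuming Llama 3.1 8B's public architecture (32 layers, 8 KV heads via GQA, head dim 128):

```python
# KV-cache footprint for Llama 3.1 8B at the --max-model-len used above.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # from the public model config

def kv_bytes_per_token(bytes_per_elem: int) -> int:
    # K and V each store one HEAD_DIM vector per KV head per layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem

CTX = 131072  # --max-model-len
fp8_gib = kv_bytes_per_token(1) * CTX / 2**30   # FP8: 1 byte/element
fp16_gib = kv_bytes_per_token(2) * CTX / 2**30  # FP16: 2 bytes/element
print(fp8_gib, fp16_gib)  # → 8.0 16.0
```

At full context the FP8 cache needs 8 GiB instead of 16 GiB, which is the 50% saving the quantization table refers to.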
## Performance Targets (Dual RTX 5090 Pod — 8 Replicas)
| Metric | Target |
|-------------------------|---------------|
| Time to First Token | < 15ms |
| Throughput (1 replica) | ~200 tok/s |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max Concurrency | 100+ users |
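Once serving, vLLM exposes the standard OpenAI-compatible REST API. A minimal stdlib-only client sketch (the helper names are illustrative; the endpoint and payload follow the `/v1/completions` schema, and the port matches `--port 8000` above):

```python
import json
import urllib.request

def build_sql_request(prompt: str, max_tokens: int = 200) -> dict:
    # Request body for vLLM's OpenAI-compatible /v1/completions endpoint
    return {
        "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # greedy decoding for deterministic SQL
    }

def query(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_sql_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# query("-- List users who signed up in 2024\nSELECT")  # needs the server running
```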
## Example Usage (Python)
```python
from vllm import LLM, SamplingParams
llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",        # FP8 KV cache halves cache memory vs FP16
    max_model_len=131072,
    enable_prefix_caching=True,  # reuse KV for shared prompt prefixes
)

# Greedy decoding for deterministic SQL output
sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```