# Llama 3.1 8B SQL – NVFP4 Quantized (Blackwell)

SQL generation model fine-tuned on text-to-SQL tasks, quantized for NVIDIA Blackwell (RTX 50-series) GPUs using llm-compressor.
## Quantization Details
| Component | Format | Notes |
|---|---|---|
| Weights | NVFP4 | ~4.5 GB – native to Blackwell 5th-gen Tensor Cores |
| KV-cache | FP8 | 50% memory vs. FP16 – configured via vLLM |
| Activations | FP16 | lm_head kept in FP16 for output quality |
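The ~4.5 GB weight figure follows from NVFP4's layout: 4-bit E2M1 values plus one 8-bit scale per 16-element block. A toy sketch of that arithmetic and of per-block FP4 rounding (illustrative only — this is not the llm-compressor kernel):

```python
# E2M1 (FP4) representable magnitudes; NVFP4 pairs each signed 4-bit value
# with an FP8 scale per 16-element block (illustrative sketch, not the real kernel).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = [-v for v in E2M1[:0:-1]] + E2M1  # signed grid with a single zero

def quantize_fp4_block(xs, block=16):
    """Toy per-block FP4 round-to-nearest: the block's |max| maps to 6.0."""
    out = []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        m = max(abs(v) for v in chunk)
        scale = m / 6.0 if m > 0 else 1.0
        out.extend(min(GRID, key=lambda g: abs(g * scale - v)) * scale
                   for v in chunk)
    return out

# Memory estimate for 8B params: 4 bits per weight + one 8-bit scale per 16 weights.
params = 8e9
weight_gb = (params * 0.5 + params / 16) / 1e9
print(f"~{weight_gb:.1f} GB")  # ~4.5 GB, matching the table
```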
## vLLM Inference (RTX 5090)
```shell
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```
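Once up, the server exposes vLLM's OpenAI-compatible REST API on port 8000. A minimal client sketch — the `/v1/completions` endpoint is standard, but the schema-then-question prompt layout here is a hypothetical example, not the model's trained format:

```python
import json
import urllib.request

def sql_request(question: str, schema: str,
                model: str = "pshashid/llama3.1B_8B_SQL_Finetuned_model") -> dict:
    """Build a /v1/completions payload. The prompt layout below is a
    hypothetical example; adapt it to your actual prompting format."""
    prompt = f"-- Schema:\n{schema}\n-- Question: {question}\nSELECT"
    return {"model": model, "prompt": prompt,
            "temperature": 0, "max_tokens": 200}

payload = sql_request("How many users signed up in 2024?",
                      "CREATE TABLE users (id INT, signup_date DATE);")

# To send it against the running server:
# req = urllib.request.Request("http://localhost:8000/v1/completions",
#                              data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["choices"][0]["text"])
```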
## Performance Targets (Dual RTX 5090 Pod – 8 Replicas)
| Metric | Target |
|---|---|
| Time to First Token (TTFT) | < 15 ms |
| Throughput (1 replica) | ~200 tok/s |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max Concurrency | 100+ users |
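As a sanity check, the aggregate and per-user figures follow from the per-replica target. A back-of-envelope sketch using the table's numbers (targets, not measurements):

```python
# Capacity arithmetic from the targets table (not a benchmark).
per_replica_tps = 200              # ~200 tok/s per replica
replicas = 8
aggregate_tps = per_replica_tps * replicas
print(aggregate_tps)               # 1600 -> clears the 1,500+ aggregate target

max_users = 100
per_user_tps = aggregate_tps / max_users
print(per_user_tps)                # 16.0 tok/s per user at full concurrency
```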
## Example Usage (Python)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",
    max_model_len=131072,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```
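`enable_prefix_caching` pays off when many prompts share a long common prefix — e.g. the same schema header across different questions, so the header's KV-cache blocks are prefilled once and reused. A sketch of building such prompts (the header format is a hypothetical example):

```python
# Prompts sharing a common prefix benefit from vLLM's prefix caching:
# the schema header is prefilled once and its KV blocks are reused.
SCHEMA_HEADER = (
    "-- Schema:\n"
    "CREATE TABLE orders (id INT, user_id INT, total DECIMAL, created DATE);\n"
)

questions = [
    "What is the total revenue per user?",
    "How many orders were placed in March?",
]
prompts = [f"{SCHEMA_HEADER}-- Question: {q}\nSELECT" for q in questions]

# outputs = llm.generate(prompts, sampling)  # llm/sampling from the example above
print(len(prompts))  # 2
```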
## Model Tree

- Base model: meta-llama/Llama-3.1-8B
- Fine-tuned from: meta-llama/Llama-3.1-8B-Instruct