---
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- sql
- llama
- nvfp4
- quantized
- vllm
- blackwell
- llmcompressor
library_name: transformers
pipeline_tag: text-generation
---

# Llama 3.1 8B SQL — NVFP4 Quantized (Blackwell)

SQL generation model fine-tuned on text-to-SQL tasks and quantized for NVIDIA Blackwell GPUs (RTX 50-series) using `llm-compressor`.
## Quantization Details

| Component   | Format | Notes                                             |
|-------------|--------|---------------------------------------------------|
| Weights     | NVFP4  | ~4.5 GB; native to Blackwell 5th-gen Tensor Cores |
| KV cache    | FP8    | 50% memory vs FP16; configured via vLLM           |
| Activations | FP16   | `lm_head` kept in FP16 for output quality         |
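The 50% KV-cache saving follows directly from the element width (1 byte for FP8 vs 2 for FP16). A quick back-of-the-envelope check, using the standard Llama 3.1 8B attention shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) at the full 131,072-token context:

```python
# Llama 3.1 8B attention geometry (from the upstream config)
num_layers = 32
num_kv_heads = 8     # grouped-query attention
head_dim = 128

def kv_cache_bytes(num_tokens: int, bytes_per_elem: int) -> int:
    # 2 tensors (K and V) per layer, each [num_kv_heads, head_dim] per token
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

ctx = 131_072  # matches --max-model-len below
fp16 = kv_cache_bytes(ctx, 2)
fp8 = kv_cache_bytes(ctx, 1)
print(f"FP16 KV cache @ {ctx} tokens: {fp16 / 2**30:.0f} GiB")  # 16 GiB
print(f"FP8  KV cache @ {ctx} tokens: {fp8 / 2**30:.0f} GiB")   # 8 GiB
```

At full context the FP8 cache halves KV memory from 16 GiB to 8 GiB, which is what makes the long context viable alongside the ~4.5 GB NVFP4 weights on a single 32 GB RTX 5090.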
## vLLM Inference (RTX 5090)

```bash
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```
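`vllm serve` exposes an OpenAI-compatible API, so the server above can be queried at `/v1/completions`. A minimal sketch of the request payload (the schema/question prompt shape here is illustrative, not a documented fine-tuning format):

```python
import json

# Payload for vLLM's OpenAI-compatible /v1/completions endpoint;
# the server launched above listens on http://localhost:8000.
payload = {
    "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
    "prompt": (
        "-- Schema: CREATE TABLE orders (id INT, total REAL);\n"
        "-- Question: What is the total revenue?\n"
        "SELECT"
    ),
    "max_tokens": 200,
    "temperature": 0,
}
body = json.dumps(payload)
print(body)

# With the server running, POST it (stdlib only):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

Greedy decoding (`temperature = 0`) is a sensible default for SQL generation, where determinism matters more than diversity.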

## Performance Targets (Dual RTX 5090 Pod — 8 Replicas)

| Metric                 | Target       |
|------------------------|--------------|
| Time to first token    | < 15 ms      |
| Throughput (1 replica) | ~200 tok/s   |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max concurrency        | 100+ users   |

## Example Usage (Python)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",
    max_model_len=131072,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```
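The bare `"SELECT"` prompt above exercises raw completion. For instruction-style use, the card does not document the fine-tuning prompt format, so a reasonable default is the stock Llama 3.1 instruct chat template; a hand-rolled sketch (the system/user wording is an assumption):

```python
def build_sql_prompt(schema: str, question: str) -> str:
    """Format a text-to-SQL request using the stock Llama 3.1 instruct
    template. NOTE: the prompt format used during fine-tuning is not
    documented on this card; this is a reasonable default, not the
    confirmed training format."""
    system = (
        "You are a text-to-SQL assistant. Given a database schema and a "
        "question, reply with a single SQL query."
    )
    user = f"Schema:\n{schema}\n\nQuestion: {question}"
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_sql_prompt(
    "CREATE TABLE users (id INT, name TEXT, created_at DATE);",
    "How many users signed up in 2024?",
)
print(prompt)
```

In practice, prefer `transformers.AutoTokenizer.apply_chat_template` (or vLLM's `LLM.chat`) over hand-built strings, since it pulls the template shipped with the checkpoint.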