---
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- sql
- llama
- nvfp4
- quantized
- vllm
- blackwell
- llmcompressor
library_name: transformers
pipeline_tag: text-generation
---

# Llama 3.1 8B SQL — NVFP4 Quantized (Blackwell)

SQL generation model fine-tuned on text-to-SQL tasks, quantized for NVIDIA Blackwell GPUs (RTX 50-series) using `llm-compressor`.

## Quantization Details

| Component   | Format | Notes                                          |
|-------------|--------|------------------------------------------------|
| Weights     | NVFP4  | ~4.5 GB — Blackwell 5th-gen Tensor Core native |
| KV cache    | FP8    | 50% memory vs. FP16 — configured via vLLM      |
| Activations | FP16   | `lm_head` kept in FP16 for output quality      |

## vLLM Inference (RTX 5090)

```bash
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```

## Performance Targets (Dual RTX 5090 Pod — 8 Replicas)

| Metric                 | Target       |
|------------------------|--------------|
| Time to First Token    | < 15 ms      |
| Throughput (1 replica) | ~200 tok/s   |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max Concurrency        | 100+ users   |

## Example Usage (Python)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",
    max_model_len=131072,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```
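## KV-Cache Sizing (Sketch)

The 50% KV-cache saving from FP8 follows directly from element width: one byte per cached value instead of two. A back-of-envelope check at the full 131,072-token context, using Llama 3.1 8B's published architecture constants (32 layers, 8 KV heads via grouped-query attention, head dim 128):

```python
# Back-of-envelope KV-cache sizing for Llama 3.1 8B at max context length.
# Architecture constants come from the published Llama 3.1 8B config.
NUM_LAYERS = 32     # transformer blocks
NUM_KV_HEADS = 8    # grouped-query attention KV heads
HEAD_DIM = 128      # per-head dimension
SEQ_LEN = 131072    # matches --max-model-len above

def kv_cache_bytes(bytes_per_value: int) -> int:
    """KV-cache size in bytes for one max-length sequence."""
    # Factor of 2 covers the separate K and V tensors in every layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * SEQ_LEN * bytes_per_value

fp16 = kv_cache_bytes(2)  # FP16: 2 bytes per value
fp8 = kv_cache_bytes(1)   # FP8:  1 byte per value
print(f"FP16: {fp16 / 2**30:.0f} GiB, FP8: {fp8 / 2**30:.0f} GiB")
# → FP16: 16 GiB, FP8: 8 GiB
```

At full context a single sequence's cache is substantial either way, which is why FP8 KV cache matters for fitting many concurrent users within `--gpu-memory-utilization 0.85`.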
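## Querying the Server (Sketch)

Once `vllm serve` is running, it exposes vLLM's OpenAI-compatible REST API on the configured port. A minimal stdlib-only client sketch — the prompt template, schema, and question below are illustrative assumptions, not the model's trained prompt format:

```python
# Hypothetical client for the vLLM server started above, using only the
# standard library against the OpenAI-compatible /v1/completions endpoint.
import json
import urllib.request

def build_sql_prompt(schema: str, question: str) -> str:
    """Compose a text-to-SQL prompt (illustrative template, not the trained one)."""
    return (
        f"### Schema:\n{schema}\n\n"
        f"### Question:\n{question}\n\n"
        f"### SQL:\n"
    )

def generate_sql(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send a completion request and return the generated SQL text."""
    payload = {
        "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
        "prompt": prompt,
        "max_tokens": 200,
        "temperature": 0,   # deterministic decoding for SQL
        "stop": [";"],      # stop after one complete statement
    }
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"].strip()

prompt = build_sql_prompt(
    "CREATE TABLE orders (id INT, amount DECIMAL, created_at DATE);",
    "Total order amount per day?",
)
# sql = generate_sql(prompt)  # requires the server to be running
```

Setting `temperature=0` mirrors the offline example; `stop=[";"]` is an assumption that keeps the model from generating past the first statement.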