---
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- sql
- llama
- nvfp4
- quantized
- vllm
- blackwell
- llmcompressor
library_name: transformers
pipeline_tag: text-generation
---

# Llama 3.1 8B SQL — NVFP4 Quantized (Blackwell)

SQL generation model fine-tuned on text-to-SQL tasks and quantized for NVIDIA Blackwell GPUs (RTX 50-series) using `llm-compressor`.
## Quantization Details

| Component   | Format | Notes                                             |
|-------------|--------|---------------------------------------------------|
| Weights     | NVFP4  | ~4.5 GB; native to Blackwell 5th-gen Tensor Cores |
| KV cache    | FP8    | 50% memory vs FP16; configured via vLLM           |
| Activations | FP16   | `lm_head` kept in FP16 for output quality         |
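The 50% KV-cache saving follows directly from the element width (1 byte for FP8 vs 2 for FP16). A quick back-of-the-envelope check, using the standard Llama 3.1 8B attention shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) at the full 131,072-token context:

```python
# Llama 3.1 8B attention geometry (from the upstream config)
num_layers = 32
num_kv_heads = 8     # grouped-query attention
head_dim = 128

def kv_cache_bytes(num_tokens: int, bytes_per_elem: int) -> int:
    # 2 tensors (K and V) per layer, each [num_kv_heads, head_dim] per token
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

ctx = 131_072  # matches --max-model-len below
fp16 = kv_cache_bytes(ctx, 2)
fp8 = kv_cache_bytes(ctx, 1)
print(f"FP16 KV cache @ {ctx} tokens: {fp16 / 2**30:.0f} GiB")  # 16 GiB
print(f"FP8  KV cache @ {ctx} tokens: {fp8 / 2**30:.0f} GiB")   # 8 GiB
```

At full context the FP8 cache halves KV memory from 16 GiB to 8 GiB, which is what makes the long context viable alongside the ~4.5 GB NVFP4 weights on a single 32 GB RTX 5090.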
## vLLM Inference (RTX 5090)

```bash
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```
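`vllm serve` exposes an OpenAI-compatible API, so the server above can be queried at `/v1/completions`. A minimal sketch of the request payload (the schema/question prompt shape here is illustrative, not a documented fine-tuning format):

```python
import json

# Payload for vLLM's OpenAI-compatible /v1/completions endpoint;
# the server launched above listens on http://localhost:8000.
payload = {
    "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
    "prompt": (
        "-- Schema: CREATE TABLE orders (id INT, total REAL);\n"
        "-- Question: What is the total revenue?\n"
        "SELECT"
    ),
    "max_tokens": 200,
    "temperature": 0,
}
body = json.dumps(payload)
print(body)

# With the server running, POST it (stdlib only):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```

Greedy decoding (`temperature = 0`) is a sensible default for SQL generation, where determinism matters more than diversity.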

## Performance Targets (Dual RTX 5090 Pod — 8 Replicas)

| Metric                 | Target       |
|------------------------|--------------|
| Time to first token    | < 15 ms      |
| Throughput (1 replica) | ~200 tok/s   |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max concurrency        | 100+ users   |

## Example Usage (Python)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",
    max_model_len=131072,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```
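The bare `"SELECT"` prompt above exercises raw completion. For instruction-style use, the card does not document the fine-tuning prompt format, so a reasonable default is the stock Llama 3.1 instruct chat template; a hand-rolled sketch (the system/user wording is an assumption):

```python
def build_sql_prompt(schema: str, question: str) -> str:
    """Format a text-to-SQL request using the stock Llama 3.1 instruct
    template. NOTE: the prompt format used during fine-tuning is not
    documented on this card; this is a reasonable default, not the
    confirmed training format."""
    system = (
        "You are a text-to-SQL assistant. Given a database schema and a "
        "question, reply with a single SQL query."
    )
    user = f"Schema:\n{schema}\n\nQuestion: {question}"
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_sql_prompt(
    "CREATE TABLE users (id INT, name TEXT, created_at DATE);",
    "How many users signed up in 2024?",
)
print(prompt)
```

In practice, prefer `transformers.AutoTokenizer.apply_chat_template` (or vLLM's `LLM.chat`) over hand-built strings, since it pulls the template shipped with the checkpoint.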