```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pshashid/llama3.1B_8B_SQL_Finetuned_model")
model = AutoModelForCausalLM.from_pretrained("pshashid/llama3.1B_8B_SQL_Finetuned_model")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping special tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
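For text-to-SQL use, the user message would typically carry the schema and the question rather than free chat. A minimal sketch of prompt construction (the `build_sql_messages` helper, schema, and question are illustrative assumptions, not part of the model card):

```python
# Hypothetical helper: package a schema and a question into chat messages
# suitable for tokenizer.apply_chat_template (illustrative, not from the card).
def build_sql_messages(schema: str, question: str) -> list[dict]:
    system = "You are a text-to-SQL assistant. Answer with a single SQL query."
    user = f"Schema:\n{schema}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_sql_messages(
    "CREATE TABLE users (id INT, name TEXT, created_at DATE);",
    "How many users signed up in 2024?",
)
print(messages[1]["content"])
```

The resulting `messages` list drops into `apply_chat_template` exactly like the "Who are you?" example above.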
# Llama 3.1 8B SQL · NVFP4 Quantized (Blackwell)

A SQL generation model fine-tuned on text-to-SQL tasks and quantized for NVIDIA Blackwell GPUs (RTX 50-series) using llm-compressor.
## Quantization Details

| Component | Format | Notes |
|---|---|---|
| Weights | NVFP4 | ~4.5 GB; native to Blackwell 5th-gen Tensor Cores |
| KV-Cache | FP8 | 50% of the FP16 memory footprint; configured via vLLM |
| Activations | FP16 | lm_head kept in FP16 for output quality |
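The 50% KV-cache saving follows directly from element width: FP8 stores one byte per value versus two for FP16. Using Llama 3.1 8B's published attention shape (32 layers, 8 KV heads, head dim 128), the per-token cost can be checked directly:

```python
# Per-token KV-cache size for Llama 3.1 8B (32 layers, 8 KV heads, head_dim 128).
# Each token stores one key vector and one value vector per layer.
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2 = K and V

fp16 = kv_bytes_per_token(bytes_per_elem=2)  # 131072 B = 128 KiB per token
fp8 = kv_bytes_per_token(bytes_per_elem=1)   # 65536 B = 64 KiB per token
print(fp8 / fp16)  # 0.5 -> the 50% figure in the table
```

At the full 131,072-token context this is the difference between 16 GiB and 8 GiB of KV cache per sequence, which is what makes the long context viable on a 32 GB RTX 5090.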
## vLLM Inference (RTX 5090)

```shell
vllm serve pshashid/llama3.1B_8B_SQL_Finetuned_model \
  --dtype float16 \
  --quantization fp4 \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --enable-prefix-caching \
  --port 8000
```
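`vllm serve` exposes an OpenAI-compatible HTTP API on the configured port. A request can be sketched as below (the endpoint path and payload shape follow the OpenAI completions convention; the prompt text is illustrative):

```python
import json

# Payload for vLLM's OpenAI-compatible /v1/completions endpoint
# (assumes the `vllm serve` command above is running on localhost:8000).
payload = {
    "model": "pshashid/llama3.1B_8B_SQL_Finetuned_model",
    "prompt": "-- List all customers from Berlin\nSELECT",
    "max_tokens": 128,
    "temperature": 0,
}
body = json.dumps(payload)
print(body)
# To send it against a running server:
#   curl http://localhost:8000/v1/completions \
#     -H "Content-Type: application/json" -d "$BODY"
```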
## Performance Targets (Dual RTX 5090 Pod, 8 Replicas)
| Metric | Target |
|---|---|
| Time to First Token | < 15ms |
| Throughput (1 replica) | ~200 tok/s |
| Aggregate (8 replicas) | 1,500+ tok/s |
| Max Concurrency | 100+ users |
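The aggregate line is consistent with the per-replica number: 8 replicas at ~200 tok/s each gives ~1,600 tok/s, clearing the 1,500+ target with modest headroom.

```python
# Sanity-check the throughput targets from the table above.
per_replica = 200          # tok/s, single-replica target
replicas = 8
aggregate = per_replica * replicas
print(aggregate)           # 1600 tok/s, above the 1,500+ aggregate target
```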
## Example Usage (Python)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="pshashid/llama3.1B_8B_SQL_Finetuned_model",
    quantization="fp4",
    kv_cache_dtype="fp8",
    max_model_len=131072,
    enable_prefix_caching=True,
)

sampling = SamplingParams(temperature=0, max_tokens=200)
outputs = llm.generate(["SELECT"], sampling)
print(outputs[0].outputs[0].text)
```
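With `temperature=0` decoding is greedy, but generation still runs until `max_tokens` or an EOS token, so the raw completion can trail past the query itself. A small post-processing step (an illustrative sketch, not part of the card) can truncate at the first statement terminator:

```python
# Truncate a generated completion at the first ';' so only one SQL
# statement survives (illustrative post-processing, not from the card).
def first_statement(completion: str) -> str:
    head, sep, _ = completion.partition(";")
    return (head + sep).strip() if sep else completion.strip()

raw = "SELECT * FROM users WHERE city = 'Berlin'; -- trailing tokens"
print(first_statement(raw))  # SELECT * FROM users WHERE city = 'Berlin';
```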
## Model Tree

- Base model: meta-llama/Llama-3.1-8B
- Fine-tuned from: meta-llama/Llama-3.1-8B-Instruct
Alternatively, use a pipeline as a high-level helper:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="pshashid/llama3.1B_8B_SQL_Finetuned_model")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```