Instructions to use hxia7/Llama-3.1-8B-Block-FT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use hxia7/Llama-3.1-8B-Block-FT with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="hxia7/Llama-3.1-8B-Block-FT")# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("hxia7/Llama-3.1-8B-Block-FT") model = AutoModelForCausalLM.from_pretrained("hxia7/Llama-3.1-8B-Block-FT") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use hxia7/Llama-3.1-8B-Block-FT with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "hxia7/Llama-3.1-8B-Block-FT" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "hxia7/Llama-3.1-8B-Block-FT", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/hxia7/Llama-3.1-8B-Block-FT
- SGLang
How to use hxia7/Llama-3.1-8B-Block-FT with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "hxia7/Llama-3.1-8B-Block-FT" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "hxia7/Llama-3.1-8B-Block-FT", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "hxia7/Llama-3.1-8B-Block-FT" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "hxia7/Llama-3.1-8B-Block-FT", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use hxia7/Llama-3.1-8B-Block-FT with Docker Model Runner:
docker model run hf.co/hxia7/Llama-3.1-8B-Block-FT
Llama-3.1-8B-Block-FT
Block-Attention fine-tuned meta-llama/Llama-3.1-8B for efficient RAG inference.
Overview
This model is fine-tuned using the Block-Attention mechanism from Block-Attention for Efficient Prefilling. Block-Attention divides the input context into independent blocks during the prefill phase, enabling KV cache reuse across different queries on the same documents — a key optimization for RAG serving.
Training Data Control Variable: This model was fine-tuned on an 8K subset of the Tulu3-Block-FT-RAG dataset, matching the data volume used for the companion Qwen3-8B model. A companion Llama-3.2-1B model uses the full 80K samples.
Evaluation Results
Comparison with Other Block-FT Models (on Unseen TriviaQA, 100 clean samples)
Questions and evidence passages from TriviaQA RC validation split, excluded from training data. Substr-EM checks whether the correct answer appears as a substring in the model's response.
| Model | Params | Train Data | Substr-EM | F1 Score |
|---|---|---|---|---|
| meta-llama/Llama-3.2-1B (base) | 1B | - | 56.00% | 12.51% |
| meta-llama/Llama-3.2-1B-Instruct | 1B | - | 86.00% | 23.62% |
| hxia7/Llama-3.2-1B-block-FT (full) | 1B | 80K | 87.00% | 26.59% |
| hxia7/Llama-3.2-1B-block-FT (block) | 1B | 80K | 88.00% | 27.53% |
| hxia7/Qwen3-8B-block-FT (full) | 8B | 8K | 91.00% | 25.18% |
| hxia7/Qwen3-8B-block-FT (block) | 8B | 8K | 90.00% | 23.71% |
| hxia7/Llama-3.1-8B-block-FT | 8B | 8K | TBD | TBD |
Evaluation results for this model will be added once GPU resources are available.
Block-Attention Mechanism
In Block-Attention, the context is split into N blocks:
- Blocks 1..N-1 (document blocks): Use local attention — each block attends only to itself
- Block N (query block): Uses global attention — attends to all previous blocks
This isolation allows document blocks' KV states to be computed once and reused across multiple queries.
Training Details
- Base Model: meta-llama/Llama-3.1-8B
- Training Data: Tulu3-Block-FT-RAG (8K subset)
- Epochs: 1
- Learning Rate: 2e-6
- Optimizer: AdamW (fused)
- Precision: BF16
- DeepSpeed: ZeRO Stage 2 with CPU optimizer offload
- Loss Reduction: sum (over non-masked tokens)
During training, each sample produces two variants:
- Full-attention version (standard causal mask)
- Block-attention version (with
[Block-Attention]prefix token and 4D block mask)
Both variants contribute to the loss, teaching the model to handle both inference modes.
Inference
Block-Attention Inference (recommended for RAG)
Important: Block-Attention uses a 4D attention mask [1, 1, seq_len, seq_len] during prefill. model.generate() only accepts 2D masks, so inference requires manual prefill + autoregressive decode:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.data.block import build_attention_mask, convert_attention_mask_to_model_required
model = AutoModelForCausalLM.from_pretrained("hxia7/Llama-3.1-8B-block-FT", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("hxia7/Llama-3.1-8B-block-FT")
blocks = [
"<|start_header_id|>system\nYou are an AI assistant. Below are reference documents.\n\n",
"- Title: Document 1\nContent of document 1...\n",
"- Title: Document 2\nContent of document 2...\n",
"Answer the question using the documents.\nQuestion: What is X?\n\n",
]
@torch.no_grad()
def block_generate(model, tokenizer, blocks, max_new_tokens=128):
block_token_counts = []
all_ids = []
for b in blocks:
ids = tokenizer.encode(b, add_special_tokens=False)
all_ids.extend(ids)
block_token_counts.append(len(ids))
input_ids = torch.tensor([all_ids], dtype=torch.int64, device=model.device)
total_len = len(all_ids)
helper = torch.tril(torch.ones(total_len + 64, total_len + 64, dtype=torch.bool))
attn_mask = build_attention_mask(
local_attention_block_tokens=torch.tensor(block_token_counts[:-1], dtype=torch.long),
global_attention_block_tokens=torch.tensor(block_token_counts[-1], dtype=torch.long),
lower_triangular_matrix=helper,
)
attn_mask = convert_attention_mask_to_model_required(attn_mask)
attn_mask = attn_mask.unsqueeze(0).unsqueeze(0).to(model.device)
outputs = model(input_ids=input_ids, attention_mask=attn_mask, use_cache=True)
past_kv = outputs.past_key_values
next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1, keepdim=True)
generated = []
for _ in range(max_new_tokens - 1):
if next_token.item() == tokenizer.eos_token_id:
break
generated.append(next_token.item())
outputs = model(input_ids=next_token, past_key_values=past_kv, use_cache=True)
past_kv = outputs.past_key_values
next_token = torch.argmax(outputs.logits[:, -1, :], dim=-1, keepdim=True)
if next_token.item() != tokenizer.eos_token_id:
generated.append(next_token.item())
return tokenizer.decode(generated, skip_special_tokens=True).strip()
answer = block_generate(model, tokenizer, blocks)
print(answer)
Full-Attention Inference (standard)
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("hxia7/Llama-3.1-8B-block-FT", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("hxia7/Llama-3.1-8B-block-FT")
prompt = "Your full RAG prompt here..."
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=3968).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False, pad_token_id=tokenizer.eos_token_id)
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(answer)
References
- Downloads last month
- 7
Model tree for hxia7/Llama-3.1-8B-Block-FT
Base model
meta-llama/Llama-3.1-8B