Pylox Customer Service 8B (cs-bitext)
A LoRA adapter for meta-llama/Llama-3.1-8B-Instruct, fine-tuned on the Bitext customer support conversation dataset. Built end-to-end on a single NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB unified memory) using the same stack as the rest of the Pylox Forge portfolio: NF4 training, NVFP4 serving, and EAGLE-3 speculative decoding.
This is a small-dataset seed adapter intended as a portfolio demonstration of the pipeline. In pairwise LLM-judge evaluation it wins 32.0% of comparisons against the base model, below the 50% parity line at this volume of data (see Evaluation). A larger refresh round on the operator's own ticket history is the recommended path to production-grade tone match.
Model details
- Adapter: LoRA (PEFT, rank 32, alpha 64, dropout 0.1)
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Base model: meta-llama/Llama-3.1-8B-Instruct
- Recommended serving base: nvidia/Llama-3.1-8B-Instruct-NVFP4
- Speculative head: RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3
- License: Llama 3.1 Community License
- Hardware: NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB UMA)
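To give a sense of the adapter's size, the rank and target modules listed above can be turned into a trainable-parameter count. The sketch below assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 decoder layers, 8 key/value heads of dimension 128); those figures come from the base model's configuration, not from this card.

```python
# Assumed Llama 3.1 8B dimensions (from the base model config, not this card).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
RANK = 32  # LoRA rank stated in this card

# (in_features, out_features) for each targeted projection in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per wrapped linear layer."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in MODULE_SHAPES.values())
    return per_layer * NUM_LAYERS


print(f"{lora_param_count():,} trainable parameters")
```

Under these assumptions the adapter carries roughly 84M trainable parameters, about 1% of the 8B base.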
Training data and technique
- Source: bitext/Bitext-customer-support-llm-chatbot-training-dataset (synthetic-style customer support conversations)
- Examples after preprocessing: 378
- Format: Standard chat messages format with assistant-only loss
- Method: NF4 QLoRA SFT (4-bit NormalFloat base, bfloat16 compute, double quantization)
- Hyperparameters: 3 epochs, cosine LR (2e-4 peak), max_seq_length 2048, sequence packing enabled, NEFTune noise alpha 5, paged_adamw_8bit optimizer, gradient accumulation 16
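The settings above map fairly directly onto the usual QLoRA configuration objects. The following is a minimal sketch of that mapping, not the actual training script: the values are taken from this card, the key names follow `transformers.BitsAndBytesConfig`, `peft.LoraConfig`, and `transformers.TrainingArguments`, and the dictionary names themselves are hypothetical. Plain dictionaries are used so the snippet runs without a GPU or the libraries installed.

```python
# Keyword arguments for transformers.BitsAndBytesConfig (NF4 base, double quantization).
quantization_kwargs = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}

# Keyword arguments for peft.LoraConfig (values from Model details).
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.1,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Keyword arguments for transformers.TrainingArguments.
training_kwargs = {
    "num_train_epochs": 3,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "optim": "paged_adamw_8bit",
    "gradient_accumulation_steps": 16,
    "neftune_noise_alpha": 5,
}

# Effective LoRA scaling applied to the adapter update (alpha / rank).
print(lora_kwargs["lora_alpha"] / lora_kwargs["r"])
```

Each dictionary can be unpacked into its class, for example `LoraConfig(**lora_kwargs)`. Sequence packing, the 2048-token limit, and assistant-only loss are set on the SFT trainer; their argument names differ between TRL releases, so they are left out here rather than guessed.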
Evaluation
Numbers are pulled directly from the local benchmark JSON. No invented values.
Throughput on Grace Blackwell GB10
| Metric | Value |
|---|---|
| Sustained throughput | 27.9 tok/s |
| Single-user throughput | 39.0 tok/s |
| Concurrent batch-8 throughput | 260.0 tok/s |
| TTFT p50 | 127.9 ms |
| TTFT p95 | 136.8 ms |
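Two derived figures help when reading this table: per-stream speed under batching and approximate end-to-end latency for one reply. The sketch below assumes the batch-8 number is aggregate throughput across all eight streams and uses a hypothetical 256-token reply; neither assumption is stated in the benchmark.

```python
SINGLE_USER_TOK_S = 39.0  # from the table
BATCH8_TOK_S = 260.0      # from the table, assumed to be aggregate
TTFT_P50_S = 0.1279       # 127.9 ms


def per_stream_throughput(aggregate_tok_s: float, streams: int) -> float:
    """Tokens per second seen by each stream when throughput is shared evenly."""
    return aggregate_tok_s / streams


def reply_latency_s(new_tokens: int, tok_s: float, ttft_s: float) -> float:
    """Approximate latency: time to first token plus steady-state decode time."""
    return ttft_s + new_tokens / tok_s


print(per_stream_throughput(BATCH8_TOK_S, 8))
print(round(reply_latency_s(256, SINGLE_USER_TOK_S, TTFT_P50_S), 2))
```

On these assumptions each of eight concurrent users would see about 32.5 tok/s, and a 256-token reply for a single user would take a little under 7 seconds.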
Cost (NVFP4 serving)
| Metric | Value |
|---|---|
| Cost per 1M output tokens | $0.4978 |
| Comparable OpenAI GPT-4o output | $10 per 1M |
| Savings factor | 20.1x cheaper than GPT-4o |
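The savings factor is the ratio of the two per-million prices in the table. A short sketch of that arithmetic follows; the monthly volume in the second call is a hypothetical figure chosen only to illustrate scaling.

```python
PYLOX_COST_PER_M = 0.4978  # USD per 1M output tokens, from the table
GPT4O_COST_PER_M = 10.0    # USD per 1M output tokens, from the table


def savings_factor(own_cost: float, reference_cost: float) -> float:
    """How many times cheaper the local stack is than the reference price."""
    return reference_cost / own_cost


def monthly_cost(tokens_per_month: int, cost_per_m: float) -> float:
    """Output-token spend for a given monthly volume."""
    return tokens_per_month / 1_000_000 * cost_per_m


print(round(savings_factor(PYLOX_COST_PER_M, GPT4O_COST_PER_M), 1))
# Hypothetical volume: 50M output tokens per month.
print(round(monthly_cost(50_000_000, PYLOX_COST_PER_M), 2))
```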
Capability preservation (academic, lm-evaluation-harness, 500 samples each)
| Benchmark | Score |
|---|---|
| HellaSwag (common sense) | 76.0% |
| TruthfulQA (hallucination resistance) | 30.0% |
| MMLU-Pro | not run |
Fine-tune quality vs base (LLM judge, pairwise)
| Metric | Value |
|---|---|
| Win rate vs meta-llama/Llama-3.1-8B-Instruct | 32.0% |
Honest read: at 378 training examples the adapter does not consistently beat the base model in pairwise quality. The pipeline runs end-to-end and the artifacts ship, but production tone match for a real customer service deployment requires a refresh round on the operator's own ticket history (typically 2,000 to 10,000 examples). This is documented rather than hidden.
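How firmly a 32.0% win rate sits below parity depends on how many pairwise comparisons the judge scored, which this card does not report. As a rough aid, the sketch below computes a 95% Wilson score interval for several hypothetical comparison counts; the counts are illustrative assumptions, not benchmark data.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical numbers of judged pairs; the real count is not given in the card.
for n in (25, 50, 200):
    wins = round(0.32 * n)
    low, high = wilson_interval(wins, n)
    print(f"n={n:3d} wins={wins:3d} interval=({low:.3f}, {high:.3f})")
```

With these hypothetical sizes the interval still reaches 50% at 25 comparisons but excludes it at 50 and above, so the reading "below parity" is only as strong as the evaluation set is large.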
Safety (with Pylox safety gateway, 50-prompt red team)
| Metric | Value |
|---|---|
| Adversarial block rate | 77.78% |
| False positive rate on benign controls | 20.0% |
The 20% false positive rate on benign controls indicates the gateway is over-conservative for general customer support traffic. A DPO alignment pass with refusal-behavior pairs would reduce this; that pass is the recommended next step before this adapter routes real production traffic.
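Both rates come from small counts, so it is worth checking which prompt splits could produce them. The sketch below enumerates every way to divide the 50 prompts into adversarial and benign sets and keeps those whose rounded rates match the table. It assumes the 50-prompt total covers both sets, which the card does not state explicitly.

```python
def consistent_splits(total: int = 50, block_pct: float = 77.78, fp_pct: float = 20.0):
    """Return (n_adversarial, blocked, n_benign, false_positives) tuples
    whose rounded rates equal the reported percentages."""
    matches = []
    for n_adv in range(1, total):
        n_benign = total - n_adv
        for blocked in range(n_adv + 1):
            if round(100 * blocked / n_adv, 2) != block_pct:
                continue
            for false_pos in range(n_benign + 1):
                if round(100 * false_pos / n_benign, 1) == fp_pct:
                    matches.append((n_adv, blocked, n_benign, false_pos))
    return matches


print(consistent_splits())
```

Under that assumption the only matching split is 45 adversarial prompts with 35 blocked and 5 benign controls with 1 flagged, in which case the 20% false positive figure rests on a single benign prompt.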
Quickstart
PEFT (research / batch)
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "pyloxsystems/cs-bitext-llama-3.1-8b-lora"

# Load the bf16 base model, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are a polite customer support assistant. Resolve the issue or escalate."},
    {"role": "user", "content": "My order #4482 was supposed to arrive yesterday and the tracking is stuck."},
]

# add_generation_prompt=True appends the assistant header so the model
# answers the user turn instead of continuing it.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
vLLM with NVFP4 base and EAGLE-3 speculative decoding
vllm serve nvidia/Llama-3.1-8B-Instruct-NVFP4 \
  --enable-lora \
  --lora-modules cs-bitext=pyloxsystems/cs-bitext-llama-3.1-8b-lora \
  --speculative-config '{
    "method": "eagle3",
    "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
    "num_speculative_tokens": 5
  }'
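Once the server is running, the adapter is addressed by the name given to `--lora-modules` (`cs-bitext`) through vLLM's OpenAI-compatible API. The following is a minimal client sketch using only the standard library. It assumes the server listens on vLLM's default `localhost:8000`; the helper names `build_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port

SYSTEM_PROMPT = (
    "You are a polite customer support assistant. Resolve the issue or escalate."
)


def build_request(user_message: str, model: str = "cs-bitext") -> urllib.request.Request:
    """Build a chat-completions request that targets the LoRA adapter by name."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "max_tokens": 256,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_message: str) -> str:
    """Send the request and return the reply text; needs the server running."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("My order #4482 was supposed to arrive yesterday and the tracking is stuck.")` returns the adapter's reply as a string. Passing the base model name instead of `cs-bitext` would route the same request to the unadapted base.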
Intended use
- Draft customer support replies in standard B2C tone
- Triage and categorize incoming tickets
- Suggest resolutions for common L1 inquiries
- Pipeline demonstration for evaluating the Pylox Forge stack end-to-end
Out of scope
- Tone calibration outside the Bitext synthetic distribution (use a refresh round on real ticket history)
- Multilingual support (English only)
- Multi-turn agentic loops without an external orchestrator
- Replacing human review in regulated escalations (refunds, compliance, legal)
- Production routing without the recommended DPO alignment pass
Limitations
- 378 training examples is small relative to production-grade fine-tunes. Fine-tune quality vs base is below the 50/50 line.
- Bitext is synthetic-style: drift expected on real-world ticket idiom without a refresh.
- 20% false positive rate on benign safety controls indicates the gateway is over-conservative; DPO refresh recommended.
- Sequence length capped at 2048; longer ticket histories must be truncated or summarized externally.
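For the sequence-length limit, one simple external strategy is to keep the system prompt and drop the oldest turns until the conversation fits. The sketch below illustrates that idea only: `truncate_history` and its `reserve` parameter are hypothetical names, and the default whitespace-based token counter is a stand-in that should be replaced with the real tokenizer's count.

```python
from typing import Callable

Message = dict[str, str]


def truncate_history(
    messages: list[Message],
    max_tokens: int = 2048,
    reserve: int = 256,  # hypothetical headroom left for the generated reply
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[Message]:
    """Keep the leading system message (if any) plus the newest turns that fit."""
    budget = max_tokens - reserve
    system = [m for m in messages[:1] if m["role"] == "system"]
    rest = messages[len(system):]
    used = sum(count_tokens(m["content"]) for m in system)

    kept: list[Message] = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + kept[::-1]  # restore chronological order
```

The result can begin with an assistant turn if an older user message was dropped; summarizing the dropped turns into a single message is the alternative when that context matters.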
License
Inherits the Llama 3.1 Community License from the base model.
Citation
@misc{pylox_cs_bitext_2026,
author = {Girard, Emilio},
title = {Pylox Customer Service 8B (cs-bitext)},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/pyloxsystems/cs-bitext-llama-3.1-8b-lora}}
}
Pylox Forge is a solo-operated LLM fine-tuning lab on NVIDIA Grace Blackwell. Site: pyloxforge.com. Other adapters: pyloxsystems on Hugging Face.