Pylox Customer Service 8B (cs-bitext)

A LoRA adapter for meta-llama/Llama-3.1-8B-Instruct, fine-tuned on the Bitext customer support conversation dataset. Built end-to-end on a single NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB unified memory) using the same stack as the rest of the Pylox Forge portfolio: NF4 training, NVFP4 serving, and EAGLE-3 speculative decoding.

This is a small-dataset seed adapter intended as a portfolio demonstration of the pipeline. At this data volume the adapter wins 32.0% of pairwise comparisons against the base model, below the 50% parity line (see Evaluation); that result is documented honestly. A larger refresh round on the operator's own ticket history is the recommended path to production-grade tone match.

Model details

Adapter: LoRA (PEFT, rank 32, alpha 64, dropout 0.1)
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Base model: meta-llama/Llama-3.1-8B-Instruct
Recommended serving base: nvidia/Llama-3.1-8B-Instruct-NVFP4
Speculative head: RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3
License: Llama 3.1 Community License
Hardware: NVIDIA Grace Blackwell GB10 (DGX Spark, 128 GB UMA)
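
As a rough sanity check on adapter size, the sketch below counts the trainable parameters implied by rank 32 on the seven target modules. The layer shapes (hidden size 4096, MLP width 14336, 32 layers, 8 key/value heads of dimension 128) are assumed from the published Llama 3.1 8B configuration, not taken from this card, and the helper names are illustrative only.

```python
# Sketch: trainable-parameter count for a LoRA adapter on Llama-3.1-8B.
# Shapes below are ASSUMED from the public Llama 3.1 8B config, not from this card.
HIDDEN = 4096          # model hidden size
INTERMEDIATE = 14336   # MLP intermediate size
KV_DIM = 8 * 128       # 8 key/value heads x head dim 128 (grouped-query attention)
NUM_LAYERS = 32

RANK = 32              # from "Adapter: LoRA (PEFT, rank 32, alpha 64, ...)"
ALPHA = 64

# (in_features, out_features) for each targeted projection in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per wrapped linear layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA factor applied to the low-rank update

print(f"per layer: {per_layer:,}")
print(f"total trainable: {total_trainable:,}")
print(f"LoRA scaling (alpha / rank): {scaling}")
```

Under those assumed shapes the adapter carries on the order of 84 million trainable parameters, about 1% of the 8B base.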

Training data and technique

Source: bitext/Bitext-customer-support-llm-chatbot-training-dataset (synthetic-style customer support conversations)
Examples after preprocessing: 378
Format: Standard chat messages format with assistant-only loss
Method: NF4 QLoRA SFT (4-bit NormalFloat base, bfloat16 compute, double quantization)
Hyperparameters: 3 epochs, cosine LR (2e-4 peak), max_seq_length 2048, sequence packing enabled, NEFTune noise alpha 5, paged_adamw_8bit optimizer, gradient accumulation 16
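
To make the learning-rate entry concrete, here is a minimal sketch of a cosine schedule peaking at 2e-4. It assumes no warmup and decay to zero, neither of which the card states, and `TOTAL_STEPS` is a hypothetical value because the optimizer step count is not listed.

```python
import math

PEAK_LR = 2e-4  # from the hyperparameters above


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 60  # hypothetical; the card does not state the number of optimizer steps
for step in (0, 15, 30, 45, 60):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.2e}")
```

The schedule starts at the peak, passes through half the peak at the midpoint, and reaches zero on the final step.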

Evaluation

Numbers are pulled directly from the local benchmark JSON. No invented values.

Throughput on Grace Blackwell GB10

Metric	Value
Sustained throughput	27.9 tok/s
Single-user throughput	39.0 tok/s
Concurrent batch-8 throughput	260.0 tok/s
TTFT p50	127.9 ms
TTFT p95	136.8 ms
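
The table can be turned into rough per-request numbers. The sketch below assumes the batch-8 figure is an aggregate across eight concurrent streams and that decode speed is steady after the first token; both are simplifying assumptions rather than statements from the benchmark JSON.

```python
# Figures copied from the throughput table above.
SINGLE_USER_TPS = 39.0
BATCH8_TPS = 260.0
TTFT_P50_S = 0.1279

# ASSUMPTION: 260.0 tok/s is the aggregate across 8 concurrent streams.
per_stream_tps = BATCH8_TPS / 8
aggregate_gain = BATCH8_TPS / SINGLE_USER_TPS


def reply_latency_s(n_tokens: int, tps: float = SINGLE_USER_TPS,
                    ttft_s: float = TTFT_P50_S) -> float:
    """Estimated wall time for one reply: time to first token plus steady decode."""
    return ttft_s + n_tokens / tps


print(f"per-stream rate at batch 8: {per_stream_tps:.1f} tok/s")
print(f"aggregate gain over single user: {aggregate_gain:.2f}x")
print(f"256-token reply, single user: {reply_latency_s(256):.2f} s")
```

On these assumptions, batching trades a modest per-stream slowdown for several times the total output.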

Cost (NVFP4 serving)

Metric	Value
Cost per 1M output tokens	$0.4978
OpenAI GPT-4o output price, for comparison	$10 per 1M output tokens
Savings factor	20.1x cheaper than GPT-4o
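
The savings factor follows directly from the two prices. The card does not state how the per-token cost was derived; the sketch below uses one common model (hourly operating cost divided by hourly token output). The $0.05 per hour figure is a hypothetical value, chosen only because it lands on the reported number at the sustained 27.9 tok/s.

```python
GPT4O_PER_M = 10.00    # USD per 1M output tokens, from the table above
LOCAL_PER_M = 0.4978   # USD per 1M output tokens, from the table above
SUSTAINED_TPS = 27.9   # from the throughput table

savings_factor = GPT4O_PER_M / LOCAL_PER_M


def cost_per_million(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """USD per 1M output tokens for a box costing hourly_cost_usd to run."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# HYPOTHETICAL hourly cost; not stated anywhere in the card.
ASSUMED_HOURLY_USD = 0.05
modelled_cost = cost_per_million(ASSUMED_HOURLY_USD, SUSTAINED_TPS)

print(f"savings factor: {savings_factor:.1f}x")
print(f"modelled cost per 1M tokens: ${modelled_cost:.4f}")
```

Operators should substitute their own hourly cost (power, amortized hardware) and measured throughput.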

Capability preservation (academic, lm-evaluation-harness, 500 samples each)

Benchmark	Score
HellaSwag (common sense)	76.0%
TruthfulQA (hallucination resistance)	30.0%
MMLU-Pro	not run

Fine-tune quality vs base (LLM judge, pairwise)

Metric	Value
Win rate vs `meta-llama/Llama-3.1-8B-Instruct`	32.0%

Honest read: at 378 training examples the adapter wins only 32.0% of pairwise comparisons, so it does not beat the base model on judged quality. The pipeline runs end-to-end and the artifacts ship, but production tone match for a real customer service deployment requires a refresh round on the operator's own ticket history (typically 2,000 to 10,000 examples). This is documented rather than hidden.
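
How firmly a 32.0% win rate sits below parity depends on how many comparisons were judged, which the card does not report. As an illustration only, the sketch below computes a Wilson 95% interval for a hypothetical 16 wins out of 50 comparisons; both counts are assumptions picked to match 32.0%.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# HYPOTHETICAL counts: 16 wins in 50 judged pairs gives the reported 32.0%.
low, high = wilson_interval(16, 50)
print(f"win rate 32.0%, 95% interval: {low:.3f} to {high:.3f}")
```

With the real comparison count substituted in, an upper bound below 0.5 would support the reading that the adapter underperforms the base model rather than merely ties it.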

Safety (with Pylox safety gateway, 50-prompt red team)

Metric	Value
Adversarial block rate	77.78%
False positive rate on benign controls	20.0%

The 20.0% false positive rate on benign controls indicates the gateway is over-conservative for general customer support traffic. A DPO alignment pass with refusal-behavior pairs should reduce this; that pass is the recommended next step before this adapter routes real production traffic.
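
The card reports the two rates but not the underlying counts. Assuming every one of the 50 prompts is either adversarial or a benign control, and that each rate is a simple fraction of its group, the sketch below searches for splits whose whole-number counts reproduce both percentages. This is an inference from the published figures, not data from the benchmark JSON.

```python
TOTAL_PROMPTS = 50  # from "50-prompt red team"

consistent_splits = []
for adversarial in range(1, TOTAL_PROMPTS):
    benign = TOTAL_PROMPTS - adversarial
    for blocked in range(adversarial + 1):
        # Adversarial block rate must round to the reported 77.78%.
        if round(100 * blocked / adversarial, 2) != 77.78:
            continue
        for false_positives in range(benign + 1):
            # Benign false positive rate must round to the reported 20.0%.
            if round(100 * false_positives / benign, 1) == 20.0:
                consistent_splits.append(
                    (adversarial, blocked, benign, false_positives)
                )

for adv, blk, ben, fp in consistent_splits:
    print(f"{blk}/{adv} adversarial blocked, {fp}/{ben} benign flagged")
```

If the matching split reflects the actual test design, the false positive rate rests on a small number of benign controls, so a larger benign set would give a steadier estimate before and after any alignment pass.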

Quickstart

PEFT (research / batch)

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_id = "pyloxsystems/cs-bitext-llama-3.1-8b-lora"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)

messages = [
    {"role": "system", "content": "You are a polite customer support assistant. Resolve the issue or escalate."},
    {"role": "user", "content": "My order #4482 was supposed to arrive yesterday and the tracking is stuck."},
]
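# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn; return_dict=True also yields the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))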

vLLM with NVFP4 base and EAGLE-3 speculative decoding

# --max-lora-rank must cover the adapter's rank (32); vLLM's default limit is 16.
vllm serve nvidia/Llama-3.1-8B-Instruct-NVFP4 \
    --enable-lora \
    --max-lora-rank 32 \
    --lora-modules cs-bitext=pyloxsystems/cs-bitext-llama-3.1-8b-lora \
    --speculative-config '{
        "method": "eagle3",
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator.eagle3",
        "num_speculative_tokens": 5
    }'
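
Once the server is up, the adapter can be queried through vLLM's OpenAI-compatible chat endpoint by passing the name given to `--lora-modules` as the model. The sketch below is a minimal standard-library client; the host and port (`localhost:8000`, vLLM's default) and the helper names `build_chat_request` and `ask` are assumptions for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # ASSUMED: vLLM default host and port

SYSTEM_PROMPT = (
    "You are a polite customer support assistant. Resolve the issue or escalate."
)


def build_chat_request(user_text: str, model: str = "cs-bitext") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,  # adapter name registered via --lora-modules
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": 256,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_text: str) -> str:
    """Send one chat turn to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_text), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("My order #4482 was supposed to arrive yesterday and the tracking is stuck."))` returns the adapter's reply; passing the base model name as `model` instead queries the base without the adapter.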

Intended use

Draft customer support replies in standard B2C tone
Triage and categorize incoming tickets
Suggest resolutions for common L1 inquiries
Pipeline demonstration for evaluating the Pylox Forge stack end-to-end

Out of scope

Tone calibration outside the Bitext synthetic distribution (use a refresh round on real ticket history)
Multilingual support (English only)
Multi-turn agentic loops without an external orchestrator
Replacement for human agents in regulated escalations (refunds, compliance, legal)
Production routing without the recommended DPO alignment pass

Limitations

378 training examples is small relative to production-grade fine-tunes. Pairwise win rate vs base is 32.0%, below the 50% parity line.
Bitext is synthetic-style: drift expected on real-world ticket idiom without a refresh.
20% false positive rate on benign safety controls indicates the gateway is over-conservative; DPO refresh recommended.
Sequence length capped at 2048; longer ticket histories must be truncated or summarized externally.
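
For the sequence-length limit, one simple external truncation strategy is to keep the system prompt and as many of the newest turns as fit. The sketch below illustrates that idea; `approx_tokens` is a crude whitespace stand-in for the real tokenizer, and the function names are hypothetical. A real deployment would count tokens with the model's tokenizer and reserve part of the 2048 budget for the reply.

```python
from typing import Callable

Message = dict[str, str]


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated word count."""
    return len(text.split())


def truncate_history(
    messages: list[Message],
    budget: int,
    count: Callable[[str], int] = approx_tokens,
) -> list[Message]:
    """Keep system messages plus the newest turns that fit within `budget` tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(count(m["content"]) for m in system)
    kept: list[Message] = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = count(message["content"])
        if used + cost > budget:
            break  # older turns are dropped once the budget is exhausted
        kept.append(message)
        used += cost
    return system + kept[::-1]  # restore chronological order
```

Dropping the oldest turns first keeps the customer's latest message intact; summarizing the dropped turns is an alternative when early context matters.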

License

Inherits the Llama 3.1 Community License from the base model.

Citation

@misc{pylox_cs_bitext_2026,
  author       = {Girard, Emilio},
  title        = {Pylox Customer Service 8B (cs-bitext)},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pyloxsystems/cs-bitext-llama-3.1-8b-lora}}
}

Pylox Forge is a solo-operated LLM fine-tuning lab on NVIDIA Grace Blackwell. Site: pyloxforge.com. Other adapters: pyloxsystems on Hugging Face.
