DKSplit Qwen 3.5 9B LoRA r128: Domain Name Segmentation

LoRA adapter fine-tuned on Qwen 3.5 9B for domain name segmentation (splitting concatenated strings into words).

This is the research and evaluation companion to DKSplit, the production BiLSTM-CRF segmenter. The LLM is used as a teacher model for labeling and cross-validation, not as the production runtime.

Performance

5,000-sample benchmark (primary)

Model	Strict EM	Lenient EM
BiLSTM-CRF (DKSplit v1.0.0)	86.9%	90.4%
Qwen 3.5 9B LoRA r128 (this model)	84.96%	88.82%
Qwen 3.5 9B zero-shot (detailed prompt)	63.82%	67.16%

1,000-sample benchmark

Model	Strict EM	Lenient EM
BiLSTM-CRF (DKSplit v1.0.0)	86.5%	91.5%
Qwen 3.5 9B LoRA r128 (this model)	85.8%	90.3%

Strict EM counts only exact matches against the ground-truth segmentation. Lenient EM also accepts the might_right alternative, which is provided for genuinely ambiguous cases.
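The two metrics can be sketched as follows. This is a minimal illustration, not the project's evaluation script: the function name, the whitespace normalization, and the sample alternative are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def _normalize(segmentation: str) -> str:
    # Assumption: collapse runs of whitespace so "a  b" and "a b" compare equal.
    return " ".join(segmentation.split())


def exact_match_scores(
    predictions: Sequence[str],
    truths: Sequence[str],
    alternatives: Sequence[Optional[str]],
) -> Tuple[float, float]:
    """Return (strict_em, lenient_em) as fractions in [0, 1]."""
    if not (len(predictions) == len(truths) == len(alternatives)):
        raise ValueError("all inputs must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")

    strict = lenient = 0
    for pred, truth, alt in zip(predictions, truths, alternatives):
        pred = _normalize(pred)
        if pred == _normalize(truth):
            # An exact match counts toward both metrics.
            strict += 1
            lenient += 1
        elif alt is not None and pred == _normalize(alt):
            # Only Lenient EM accepts the might_right alternative.
            lenient += 1
    n = len(predictions)
    return strict / n, lenient / n


predictions = ["chatgpt login", "robert deniro", "merci beau coup"]
truths = ["chatgpt login", "robert de niro", "merci beaucoup"]
# Hypothetical might_right value, used here only to illustrate the lenient case.
alternatives = [None, "robert deniro", None]

strict, lenient = exact_match_scores(predictions, truths, alternatives)
print(f"strict={strict:.2%} lenient={lenient:.2%}")
```

In this toy input, one of three predictions matches the truth exactly and a second matches only the alternative, so Lenient EM is higher than Strict EM.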

The BiLSTM-CRF outperforms this LLM on both benchmarks while being ~1000x cheaper to run (9 MB, CPU-only, ~800 samples/s single-thread).

Character mutation rate (100K real domains)

Configuration	Mutation rate
Zero-shot	5.62%
This model (trained, epoch 3)	0.25%

A mutation is an output whose characters differ from the input once spaces are removed. Training cuts the mutation rate roughly 22x (from 5.62% to 0.25%).
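The mutation check is simple to reproduce. The sketch below assumes an exact, case-sensitive comparison and strips all whitespace rather than only spaces. The helper names and sample pairs are illustrative, not taken from the project's code.

```python
from typing import Iterable, Tuple


def is_mutated(source: str, segmented: str) -> bool:
    """True if the output's characters differ from the input once whitespace is removed."""
    return "".join(segmented.split()) != source


def mutation_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (input, output) pairs where the output altered the characters."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return sum(is_mutated(src, out) for src, out in pairs) / len(pairs)


samples = [
    ("chatgptlogin", "chatgpt login"),     # characters preserved
    ("chatgptlogin", "chat gpt log in"),   # a different split, but still preserved
    ("mercibeaucoup", "merci beaucoups"),  # an added "s" counts as a mutation
]
rate = mutation_rate(samples)

# Ratio of the two rates reported in the table above.
reduction = 5.62 / 0.25
print(f"mutation rate={rate:.2%}, reduction={reduction:.1f}x")
```

A wrong split is not a mutation: the second sample segments differently but keeps every character, so only the third is counted.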

Cross-prompt robustness (5,000-sample, Lenient EM)

Model x Inference Prompt	new_prompt	adv_prompt	detailed_prompt
r128_new (trained on simple prompt)	87.90%	87.56%	87.44%
r128_adv (trained on advanced prompt)	88.38%	88.62%	88.82%

After training, the inference prompt has little effect on accuracy: for each model, Lenient EM varies by less than 0.5pp across the three prompts. The behavior is largely baked into the weights.
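The per-model spread can be checked directly from the table. The snippet below copies the Lenient EM values and computes the gap between the best and worst inference prompt for each model; the variable and function names are chosen for this example.

```python
# Lenient EM (%) on the 5,000-sample benchmark, copied from the table above.
lenient_em = {
    "r128_new": {"new_prompt": 87.90, "adv_prompt": 87.56, "detailed_prompt": 87.44},
    "r128_adv": {"new_prompt": 88.38, "adv_prompt": 88.62, "detailed_prompt": 88.82},
}


def prompt_spread(scores: dict) -> float:
    """Gap in percentage points between the best and worst inference prompt."""
    return round(max(scores.values()) - min(scores.values()), 2)


spreads = {model: prompt_spread(scores) for model, scores in lenient_em.items()}
for model, spread in spreads.items():
    print(f"{model}: {spread:.2f}pp")
```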

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "ABTdomain/dksplit-qwen-lora")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("ABTdomain/dksplit-qwen-lora", trust_remote_code=True)

# Inference
system = "You are a domain name segmentation tool. Given a concatenated string that might be in any language, split it into separate words in the most accurate way. Do not add or remove any characters. Output ONLY the segmented result, nothing else."

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "chatgptlogin"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

result = tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(result.strip())
# chatgpt login

Examples

Input	Output
chatgptlogin	chatgpt login
spotifywrapped	spotify wrapped
ethereumwallet	ethereum wallet
whatsappstatus	whatsapp status
escribirenvozalta	escribir en voz alta
candidiasenuncamais	candidiase nunca mais
mercibeaucoup	merci beaucoup
robertdeniro	robert de niro

Training Details

Parameter	Value
Base model	Qwen 3.5 9B
Method	LoRA
Rank	128
Alpha	256
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable params	~300M (3.3% of 8,950M)
Training data	5M labeled domain segmentation samples
Training prompt	Advanced (multilingual, character-preserving)
Epochs	3
Batch size	32 effective (4 x 2 x 4 GPUs)
Learning rate	2e-4, cosine schedule, 5% warmup
Distributed	DeepSpeed ZeRO-1
GPU hours	~209h
Infrastructure	4x A100-SXM-64GB, Leonardo Booster (CINECA, Italy)
Framework	PEFT 0.18.1
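For readers who want to set up a comparable adapter, the LoRA hyperparameters above map onto a small configuration. The sketch below uses a plain dictionary whose keys follow the argument names of `peft.LoraConfig` (`r`, `lora_alpha`, `lora_dropout`, `target_modules`, `task_type`). The `task_type` value is an assumption, since the table does not state it, and this is not a copy of the shipped adapter_config.json.

```python
import math

# Hyperparameters from the table above, keyed by peft.LoraConfig argument names.
lora_config = {
    "r": 128,
    "lora_alpha": 256,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the table
}

# With standard LoRA scaling, the adapter update is multiplied by lora_alpha / r.
scaling = lora_config["lora_alpha"] / lora_config["r"]

# Batch factors as listed in the table. Which factor is the per-GPU batch and
# which is gradient accumulation is not stated, so they are left unnamed here.
batch_factors = (4, 2, 4)
effective_batch = math.prod(batch_factors)

print(f"scaling={scaling}, effective batch size={effective_batch}")
```

With `peft` installed, the same dictionary could be passed as `LoraConfig(**lora_config)`.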

Key Findings

Parameter capacity matters: LoRA r64 (116M trainable) saturates at 82.1%; r128 (300M trainable) reaches 88.82% Lenient EM on the 5,000-sample benchmark
Training bakes behavior into weights: swapping the inference prompt after SFT changes Lenient EM by less than 1pp
Training nearly eliminates character hallucination: mutation rate drops from 5.62% to 0.25%
Full fine-tune is not worth it: 4xA100 yields only 8 samples/s for full FT (ETA 40 days); LoRA r128 is sufficient
The BiLSTM-CRF is still better for production: 9 MB, CPU-only, faster, and higher accuracy

When to Use This Model

Cross-validating BiLSTM-CRF labels during benchmark construction
Research into LLM segmentation behavior on novel domains
Offline batch evaluation where latency is not a constraint
Generating alternative segmentations for ambiguous inputs
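The cross-validation use case can be sketched as a comparison between two label sources. The function below separates inputs where both segmenters agree from inputs that need manual review. The function name and sample outputs are hypothetical, and it is not the pipeline used to build the benchmark.

```python
from typing import Dict, Sequence, Tuple


def cross_validate(
    inputs: Sequence[str],
    primary: Sequence[str],
    secondary: Sequence[str],
) -> Tuple[Dict[str, str], Dict[str, Tuple[str, str]]]:
    """Split inputs into agreed labels and disagreements that need review."""
    if not (len(inputs) == len(primary) == len(secondary)):
        raise ValueError("all inputs must have the same length")

    agreed: Dict[str, str] = {}
    disputed: Dict[str, Tuple[str, str]] = {}
    for text, a, b in zip(inputs, primary, secondary):
        # Normalize whitespace so spacing differences do not count as disagreement.
        a_norm, b_norm = " ".join(a.split()), " ".join(b.split())
        if a_norm == b_norm:
            agreed[text] = a_norm
        else:
            disputed[text] = (a_norm, b_norm)
    return agreed, disputed


domains = ["chatgptlogin", "robertdeniro"]
# Hypothetical outputs from the two segmenters, for illustration only.
bilstm_labels = ["chatgpt login", "robert deniro"]
llm_labels = ["chatgpt login", "robert de niro"]

agreed, disputed = cross_validate(domains, bilstm_labels, llm_labels)
print("agreed:", agreed)
print("needs review:", disputed)
```

Disputed pairs are natural candidates for a ground-truth label plus a might_right alternative.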

For production use, install the BiLSTM-CRF:

pip install dksplit

Adapter Files

File	Size or contents
adapter_model.safetensors	444 MB
adapter_config.json	LoRA r128, alpha 256
tokenizer.json	Qwen 3.5 tokenizer

Acknowledgements

Trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (EHPC-AIF-2026PG01-281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

License

CC BY 4.0. Attribution required: credit "DKSplit by ABTdomain" in your README, documentation, about page, or API response metadata.
