DKSplit Qwen3.5 9B LoRA r128 - Domain Name Segmentation

LoRA adapter for splitting concatenated domain names into component words. Fine-tuned on 5.4M labeled domain segmentation samples.

Given a concatenated string like chatgptlogin, the model outputs chatgpt login.

Performance

Evaluated on 1,000 domains randomly sampled from the Newly Registered Domains Database (NRDS) April 2026 .com feed, scored against human-audited ground truth:

| Model | Benchmark | Real-World |
| --- | --- | --- |
| This model (r128, 5.4M) | 90.1% | 85.0% |
| DKSplit v0.3.1 (BiLSTM, 9.47M) | 87.6% | 85.0% |
| Qwen3 9B LoRA r64 (95K) | 85.2% | 82.8% |
| Gemma 4 31B zero-shot | 72.8% | 72.8% |
| Qwen3 9B zero-shot | 58.1% | 58.2% |

Full benchmark details: ABTdomain/dksplit-benchmark
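
The card does not spell out the scoring rule here, so the following is only a minimal sketch of how an accuracy figure like the ones above could be computed, assuming exact match on the full segmentation. The function name and the sample data are hypothetical and are not taken from the benchmark repository.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the reference segmentation.

    Assumption: a prediction counts as correct only if every split matches.
    Case and repeated whitespace are normalized before comparing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        return " ".join(text.lower().split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative data only, not the benchmark set.
references = ["chatgpt login", "spotify wrapped", "robert de niro", "merci beaucoup"]
predictions = ["chatgpt login", "spotify wrapped", "robert deniro", "merci beaucoup"]

print(f"{exact_match_accuracy(predictions, references):.1%}")  # 75.0%
```

Under this rule a partially correct split such as `robert deniro` scores zero, which is the strictest reasonable reading of a single accuracy percentage.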

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "ABTdomain/dksplit-qwen-lora")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("ABTdomain/dksplit-qwen-lora", trust_remote_code=True)

# Inference
system = "You are a domain name segmentation tool. Given a concatenated string, split it into separate words with spaces. Output ONLY the segmented result, nothing else."

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "chatgptlogin"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)

result = tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(result.strip())
# chatgpt login

Examples

| Input | Output |
| --- | --- |
| chatgptlogin | chatgpt login |
| spotifywrapped | spotify wrapped |
| ethereumwallet | ethereum wallet |
| whatsappstatus | whatsapp status |
| escribirenvozalta | escribir en voz alta |
| candidiasenuncamais | candidiase nunca mais |
| mercibeaucoup | merci beaucoup |
| robertdeniro | robert de niro |

Training Details

| Parameter | Value |
| --- | --- |
| Base model | Qwen3.5 9B |
| Method | LoRA |
| Rank | 128 |
| Alpha | 256 |
| Trainable params | 300M |
| Training data | 5.4M labeled domain segmentation samples |
| Epochs | 2 (best checkpoint) |
| Batch size | 32 (effective) |
| Learning rate | 2e-4 |
| Infrastructure | Leonardo Booster, NVIDIA A100 (EuroHPC JU) |
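
To make the rank and alpha values concrete, here is a small sketch of two standard LoRA quantities: the scaling factor applied to the low-rank update and the number of trainable parameters added per adapted weight matrix. It assumes the conventional alpha / rank scaling (not rank-stabilized LoRA, which the card does not mention), and the 4096 x 4096 projection size is a hypothetical example, not a dimension confirmed for Qwen3.5 9B.

```python
def lora_scaling(alpha, rank):
    """Conventional LoRA scaling factor applied to the update B @ A."""
    return alpha / rank


def lora_params_per_matrix(d_in, d_out, rank):
    """Trainable parameters added to one adapted weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank),
    so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 128, 256  # values from the Training Details table

print(lora_scaling(ALPHA, RANK))  # 2.0

# Hypothetical square projection; the real layer shapes are not given in this card.
print(lora_params_per_matrix(4096, 4096, RANK))  # 1048576
```

Which modules were adapted is not stated in this card, so the sketch stops at the per-matrix count and does not try to reproduce the 300M total.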

Adapter Details

| File | Details |
| --- | --- |
| adapter_model.safetensors | 445 MB |
| adapter_config.json | LoRA r128, alpha 256 |
| tokenizer.json | Qwen3.5 tokenizer |

Known Limitations

  • Over-segmentation: Tends to split unfamiliar strings into too many pieces (e.g., carlitad becomes carl it ad)
  • Input corruption: May slightly alter character sequences due to pre-trained language priors (e.g., changing misspellings to different misspellings)
  • Speed: ~4 samples/sec on A100 (vs ~900/sec for BiLSTM ONNX)
  • Requires GPU: A100 or equivalent for bfloat16 inference
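
The input-corruption limitation can be caught mechanically, because a valid segmentation only inserts spaces. The sketch below is one possible guard, not part of the released adapter; the helper names and the misspelling example are hypothetical.

```python
def preserves_input(domain, segmented):
    """True if removing all whitespace from the output gives back the input."""
    return "".join(segmented.split()) == domain


def safe_segment(domain, segmented):
    """Return the normalized model output, or the unsplit input if characters changed."""
    if preserves_input(domain, segmented):
        return " ".join(segmented.split())
    return domain  # fall back rather than pass on altered characters


print(safe_segment("chatgptlogin", "chatgpt login"))  # chatgpt login
print(safe_segment("recievenow", "receive now"))      # recievenow
print(safe_segment("carlitad", "carl it ad"))         # carl it ad
```

As the last call shows, this check does nothing for over-segmentation: `carl it ad` preserves every character and passes unchanged.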

For production use, we recommend DKSplit (BiLSTM-CRF, CPU, 9 MB), which achieves the same real-world accuracy at roughly 200x the speed.
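
As a rough worked example of what the throughput gap means in practice, the arithmetic below uses the approximate rates quoted in the limitations list. The one-million-domain batch size is a hypothetical workload chosen for illustration.

```python
LORA_RATE = 4      # samples/sec on A100, approximate figure from this card
BILSTM_RATE = 900  # samples/sec for BiLSTM ONNX, approximate figure from this card


def hours_to_process(n_samples, samples_per_sec):
    """Wall-clock hours to process n_samples at a steady rate."""
    return n_samples / samples_per_sec / 3600


print(BILSTM_RATE / LORA_RATE)  # 225.0

n = 1_000_000  # hypothetical batch size
print(round(hours_to_process(n, LORA_RATE), 1))    # 69.4
print(round(hours_to_process(n, BILSTM_RATE), 2))  # 0.31
```

At these rates the adapter needs close to three days of A100 time for a batch the BiLSTM finishes in under twenty minutes.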

Acknowledgements

Trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281).

License

Apache 2.0

Please attribute as: DKsplit by ABTdomain
