DKSplit Qwen3.5 9B LoRA r128 - Domain Name Segmentation
LoRA adapter for splitting concatenated domain names into component words. Fine-tuned on 5.4M labeled domain segmentation samples.
Given a concatenated string like chatgptlogin, the model outputs chatgpt login.
Performance
Evaluated on 1,000 randomly sampled domains from the Newly Registered Domains Database (NRDS) (April 2026 .com feed), with human-audited ground truth:
| Model | Benchmark | Real-World |
|---|---|---|
| This model (r128, 5.4M) | 90.1% | 85.0% |
| DKSplit v0.3.1 (BiLSTM, 9.47M) | 87.6% | 85.0% |
| Qwen3 9B LoRA r64 (95K) | 85.2% | 82.8% |
| Gemma 4 31B zero-shot | 72.8% | 72.8% |
| Qwen3 9B zero-shot | 58.1% | 58.2% |
Full benchmark details: ABTdomain/dksplit-benchmark
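The card does not say how the accuracy figures are scored. As a rough sketch, assuming exact-match scoring (a prediction counts only if the whole segmentation equals the audited reference), the metric could be computed as below. The function name and the toy data are hypothetical and are not taken from the benchmark.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the reference segmentation exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        # Collapse runs of whitespace so "chatgpt  login" equals "chatgpt login".
        return " ".join(text.split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Toy data: the third prediction is a made-up error for illustration only.
refs = ["chatgpt login", "spotify wrapped", "robert de niro"]
preds = ["chatgpt login", "spotify wrapped", "robert deniro"]
score = exact_match_accuracy(preds, refs)
print(f"{score:.3f}")
```

Under this assumption a single misplaced space makes the whole sample count as wrong, which is stricter than a per-boundary score.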
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load base model + adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "ABTdomain/dksplit-qwen-lora")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("ABTdomain/dksplit-qwen-lora", trust_remote_code=True)

# Inference
system = "You are a domain name segmentation tool. Given a concatenated string, split it into separate words with spaces. Output ONLY the segmented result, nothing else."
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "chatgptlogin"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
result = tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(result.strip())
# chatgpt login
```
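When segmenting many strings, it can help to build the chat messages in one place so every request uses the same system prompt. The sketch below reuses the prompt text from the snippet above. `build_messages` is a hypothetical helper, not part of the adapter or of any library, and stripping surrounding whitespace is an assumption about reasonable input cleanup.

```python
SYSTEM_PROMPT = (
    "You are a domain name segmentation tool. Given a concatenated string, "
    "split it into separate words with spaces. Output ONLY the segmented "
    "result, nothing else."
)


def build_messages(label):
    """Return the system + user messages for one concatenated string."""
    text = label.strip()  # assumption: surrounding whitespace is never meaningful
    if not text:
        raise ValueError("input string is empty")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": text},
    ]


messages = build_messages("  spotifywrapped ")
print(messages[1]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages` list.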
Examples
| Input | Output |
|---|---|
| chatgptlogin | chatgpt login |
| spotifywrapped | spotify wrapped |
| ethereumwallet | ethereum wallet |
| whatsappstatus | whatsapp status |
| escribirenvozalta | escribir en voz alta |
| candidiasenuncamais | candidiase nunca mais |
| mercibeaucoup | merci beaucoup |
| robertdeniro | robert de niro |
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen3.5 9B |
| Method | LoRA |
| Rank | 128 |
| Alpha | 256 |
| Trainable params | 300M |
| Training data | 5.4M labeled domain segmentation samples |
| Epochs | 2 (best checkpoint) |
| Batch size | 32 (effective) |
| Learning rate | 2e-4 |
| Infrastructure | Leonardo Booster, NVIDIA A100 (EuroHPC JU) |
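To make the rank and alpha values concrete: in standard LoRA the low-rank update is scaled by alpha / rank, and each adapted weight matrix adds rank × (d_in + d_out) trainable parameters. The sketch below works through that arithmetic. It assumes the standard scaling (not a rank-stabilized variant, which the card does not mention), and the 4096 × 4096 layer shape is a hypothetical example, not a dimension taken from Qwen3.5.

```python
def lora_scaling(rank, alpha):
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params_per_matrix(d_in, d_out, rank):
    """Trainable parameters added to one adapted weight matrix.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


scale = lora_scaling(rank=128, alpha=256)
print(scale)  # 2.0

# Hypothetical square projection, used only to illustrate the formula.
extra = lora_params_per_matrix(d_in=4096, d_out=4096, rank=128)
print(extra)  # 1048576
```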
Adapter Details
| File | Size |
|---|---|
| adapter_model.safetensors | 445 MB |
| adapter_config.json | LoRA r128, alpha 256 |
| tokenizer.json | Qwen3.5 tokenizer |
Known Limitations
- Over-segmentation: Tends to split unfamiliar strings into too many pieces (e.g., carlitad becomes carl it ad)
- Input corruption: May slightly alter character sequences due to pre-trained language priors (e.g., changing misspellings to different misspellings)
- Speed: ~4 samples/sec on A100 (vs ~900/sec for BiLSTM ONNX)
- Requires GPU: A100 or equivalent for bfloat16 inference
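The input-corruption issue can be detected after the fact, because a valid segmentation only inserts spaces. A minimal sketch of such a check follows. `is_faithful_split` is a hypothetical helper name, and the comparison is case-sensitive by assumption.

```python
def is_faithful_split(original, segmented):
    """True if removing all whitespace from the output reproduces the input."""
    return "".join(segmented.split()) == original.strip()


print(is_faithful_split("chatgptlogin", "chatgpt login"))  # True
print(is_faithful_split("chatgptlogin", "chatgpt logon"))  # False
```

Outputs that fail this check have had characters changed, added, or dropped and can be discarded or retried. The check says nothing about over-segmentation, since `carl it ad` still joins back to `carlitad`.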
For production use, we recommend DKSplit (BiLSTM-CRF, CPU, 9 MB), which achieves the same real-world accuracy at roughly 200x the speed.
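To put the throughput figures in perspective, the arithmetic below estimates wall-clock time at the quoted rates (~4 samples/sec for this adapter, ~900 samples/sec for the BiLSTM ONNX model). The one-million-sample workload is a hypothetical example, and real throughput will vary with hardware and batching.

```python
def hours_to_process(n_samples, samples_per_sec):
    """Wall-clock hours needed at a constant throughput."""
    return n_samples / samples_per_sec / 3600


n = 1_000_000  # hypothetical workload size
lora_hours = hours_to_process(n, 4)      # rate quoted for this adapter on an A100
bilstm_hours = hours_to_process(n, 900)  # rate quoted for BiLSTM ONNX
speedup = 900 / 4

print(f"LoRA: {lora_hours:.1f} h, BiLSTM: {bilstm_hours:.2f} h, speedup: {speedup:.0f}x")
```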
Links
- DKSplit (Python): pypi.org/project/dksplit
- GitHub: github.com/ABTdomain/dksplit
- Benchmark: huggingface.co/datasets/ABTdomain/dksplit-benchmark
- Blog: Training Domain Segmentation on EuroHPC
Acknowledgements
Trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (project AIFAC_P02_281).
License
Apache 2.0
Please attribute as: DKsplit by ABTdomain