---
language:
- en
- de
- fr
- es
- it
tags:
- word-segmentation
- onnx
- bilstm-crf
- text-processing
- domain-names
library_name: onnxruntime
pipeline_tag: token-classification
---

# DKSplit

Word segmentation model for concatenated text. Split domain names, brand names, and phrases into words.

## Model Description

- **Architecture:** BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
- **Format:** ONNX with INT8 quantization
- **Size:** ~9MB
- **Input:** Lowercase a-z, 0-9 (max 64 characters)

## Usage

### Install

```bash
pip install dksplit
```

### Python

```python
import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]
```

### Direct ONNX

```python
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("dksplit-int8.onnx")
# See GitHub for full inference code
```

## Files

- `dksplit-int8.onnx` - ONNX model (INT8 quantized)
- `dksplit.npz` - CRF parameters

## Limitations

- Input: a-z, 0-9 only
- Max length: 64 characters
- Non-Latin scripts: use Romanized form

## Links

- [PyPI](https://pypi.org/project/dksplit/)
- [GitHub](https://github.com/ABTdomain/dksplit)
- [DomainKits](https://domainkits.com)
- [ABTdomain](https://ABTdomain.com)

## License

MIT
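
## Input Preparation Sketch

The input constraints listed under Limitations (lowercase a-z and 0-9 only, at most 64 characters) can be enforced before calling the model. The snippet below is a minimal sketch of one way to do that. The helper name `prepare_input` and the choice to drop disallowed characters rather than reject them are assumptions made for illustration, not part of the `dksplit` API, and the package may already apply its own normalization.

```python
import re

MAX_LEN = 64  # maximum input length stated in this model card
_DISALLOWED = re.compile(r"[^a-z0-9]")  # anything outside a-z, 0-9


def prepare_input(text: str) -> str:
    """Hypothetical helper: lowercase the text and drop unsupported characters.

    Raises ValueError if the cleaned string is empty or exceeds MAX_LEN.
    """
    cleaned = _DISALLOWED.sub("", text.lower())
    if not cleaned:
        raise ValueError("no supported characters (a-z, 0-9) in input")
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input is {len(cleaned)} characters; limit is {MAX_LEN}")
    return cleaned


print(prepare_input("Microsoft_Office 365"))
```

The cleaned string can then be passed to `dksplit.split`. Raising an error on over-long input, instead of truncating silently, keeps the caller aware that part of the text would otherwise be lost.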