---
language:
  - en
  - de
  - fr
  - es
  - it
tags:
  - word-segmentation
  - onnx
  - bilstm-crf
  - text-processing
  - domain-names
library_name: onnxruntime
pipeline_tag: token-classification
---

# DKSplit

DKSplit is a word segmentation model for concatenated text. It splits domain names, brand names, and run-together phrases into their component words.

## Model Description

- **Architecture:** BiLSTM-CRF (embedding size 384, hidden size 768, 3 layers)
- **Format:** ONNX with INT8 quantization
- **Size:** ~9 MB
- **Input:** Lowercase a-z, 0-9 (max 64 characters)
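
Segmentation is framed as per-character token classification, so a predicted label sequence has to be turned back into words. The sketch below illustrates that step under an assumed label scheme (1 = the character starts a new word, 0 = it continues the current word). The function name `labels_to_words` and the label scheme are hypothetical and are not taken from the DKSplit code; the model's real tag set may differ.

```python
def labels_to_words(text, starts):
    """Cut `text` before every character whose label marks a word start.

    `starts` is a hypothetical per-character label sequence (an assumption,
    not DKSplit's documented tag set): 1 begins a new word, 0 continues one.
    """
    if len(text) != len(starts):
        raise ValueError("need exactly one label per character")
    words = []
    for ch, is_start in zip(text, starts):
        if is_start or not words:
            words.append(ch)   # open a new word
        else:
            words[-1] += ch    # extend the current word
    return words


print(labels_to_words("chatgptlogin", [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]))
# ['chatgpt', 'login']
```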

## Usage

### Install
```bash
pip install dksplit
```

### Python
```python
import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]
```

### Direct ONNX
```python
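import onnxruntime as ort
import numpy as np

# Load the quantized model and the CRF parameters stored next to it.
session = ort.InferenceSession("dksplit-int8.onnx")
crf_params = np.load("dksplit.npz")

# Inspect the model inputs and the stored arrays; see GitHub for full inference code.
print([(i.name, i.shape, i.type) for i in session.get_inputs()])
print(crf_params.files)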
```

## Files

- `dksplit-int8.onnx` - ONNX model (INT8 quantized)
- `dksplit.npz` - CRF parameters
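
Because the CRF parameters ship in a separate file, decoding presumably happens outside the ONNX graph. The following is a generic Viterbi decoder in NumPy, shown only to illustrate how CRF decoding works in general. It assumes a `(T, K)` emission matrix and a `(K, K)` transition matrix; the array names inside `dksplit.npz`, the tag set, and any start or end scores are not documented here, so treat this as a sketch rather than DKSplit's actual decoding code.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for one input.

    emissions:   (T, K) array, score of tag k at position t (assumed layout)
    transitions: (K, K) array, score of moving from tag i to tag j (assumed layout)
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=np.int64)
    for t in range(1, T):
        # candidate[i, j] = best score ending in tag i at t-1, plus i -> j
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]
    # Follow the back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]


# Toy example with two tags and made-up scores.
emissions = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(viterbi_decode(emissions, np.zeros((2, 2))))
# [0, 1, 1]
```

With a strong penalty on the 0 -> 1 transition, the same emissions decode to `[0, 0, 0]`, which shows how the transition scores can override locally preferred tags.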

## Limitations

- Input: a-z, 0-9 only
- Max length: 64 characters
- Non-Latin scripts: not supported directly; use a Romanized form
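
Raw strings often contain uppercase letters, hyphens, or spaces, so it can help to clean them before segmentation. Below is a minimal sketch of such a pre-processing step. The helper `normalize` and the constant `MAX_LEN` are hypothetical and not part of the `dksplit` package, which may already handle some of this itself.

```python
import re

MAX_LEN = 64  # maximum input length listed above
_ALLOWED = re.compile(r"[a-z0-9]+")


def normalize(text):
    """Lowercase `text` and drop every character outside a-z, 0-9.

    Hypothetical helper for illustration; not part of the dksplit API.
    """
    cleaned = "".join(_ALLOWED.findall(text.lower()))
    if not cleaned:
        raise ValueError("no a-z or 0-9 characters left after cleaning")
    if len(cleaned) > MAX_LEN:
        raise ValueError(f"input is {len(cleaned)} characters; the limit is {MAX_LEN}")
    return cleaned


print(normalize("Chat-GPT Login"))
# chatgptlogin
```

Note that this drops accented and non-Latin characters rather than transliterating them, so Romanization still has to happen before this step.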

## Links

- [PyPI](https://pypi.org/project/dksplit/)
- [GitHub](https://github.com/ABTdomain/dksplit)
- [DomainKits](https://domainkits.com)
- [ABTdomain](https://ABTdomain.com)

## License

MIT