metadata
language:
  - en
  - de
  - fr
  - es
  - it
tags:
  - word-segmentation
  - onnx
  - bilstm-crf
  - text-processing
  - domain-names
library_name: onnxruntime
pipeline_tag: token-classification

DKSplit

Word segmentation model for concatenated text. It splits domain names, brand names, and phrases into their component words.

Model Description

  • Architecture: BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
  • Format: ONNX with INT8 quantization
  • Size: ~9MB
  • Input: Lowercase a-z, 0-9 (max 64 characters)

Usage

Install

pip install dksplit

Python

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]

Direct ONNX

import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("dksplit-int8.onnx")
# See GitHub for full inference code
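
Before writing inference code against the raw model, it can help to check which tensors the session expects. The sketch below uses the standard onnxruntime `get_inputs()` and `get_outputs()` calls. The helper name `describe_io` is hypothetical and not part of dksplit, and this README does not state the model's actual tensor names or shapes, so none are shown here.

```python
def describe_io(session):
    """Return (kind, name, shape, type) tuples for a session's inputs and outputs.

    `session` is expected to be an onnxruntime.InferenceSession; the helper
    only relies on its get_inputs() and get_outputs() methods.
    """
    rows = []
    for kind, nodes in (("input", session.get_inputs()),
                        ("output", session.get_outputs())):
        for node in nodes:
            # Each node exposes .name, .shape and .type in onnxruntime.
            rows.append((kind, node.name, node.shape, node.type))
    return rows

# Usage, assuming the model file is in the working directory:
#   import onnxruntime as ort
#   session = ort.InferenceSession("dksplit-int8.onnx")
#   for row in describe_io(session):
#       print(row)
```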

Files

  • dksplit-int8.onnx - ONNX model (INT8 quantized)
  • dksplit.npz - CRF parameters
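
The CRF parameters are stored separately because, in a BiLSTM-CRF, the network produces per-character label scores and a CRF layer then picks the best overall label sequence. Below is a minimal sketch of that decoding step (Viterbi) in plain Python. It is a generic illustration, not the dksplit implementation: the function name, the argument layout, and the omission of start/end transition scores are all assumptions, and the array names inside `dksplit.npz` are not documented here.

```python
from typing import List, Sequence

def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t (assumed layout)
    transitions[i][j] : score of moving from label i to label j (assumed layout)
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    scores = list(emissions[0])   # best score of any path ending in each label
    backpointers = []             # best previous label, per position and label

    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            pointers.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j]
                              + emissions[t][j])
        scores = new_scores
        backpointers.append(pointers)

    # Walk back from the best final label to recover the full path.
    best = max(range(num_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path
```

With all-zero transitions the result is simply the per-position argmax; a strongly negative transition score can override a locally better label.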

Limitations

  • Input: a-z, 0-9 only
  • Max length: 64 characters
  • Non-Latin scripts: use Romanized form
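
The constraints above can be enforced before calling the model. The following is a minimal sketch, assuming that unsupported characters should be dropped and over-long input truncated; `normalize_input` and `MAX_LEN` are hypothetical names, not part of the dksplit API. Dropping characters is not romanization, so non-Latin text still needs to be romanized separately first.

```python
import re

MAX_LEN = 64  # maximum input length stated in this README

def normalize_input(text: str) -> str:
    """Lowercase, drop anything outside a-z / 0-9, and truncate to MAX_LEN."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9]", "", lowered)
    return cleaned[:MAX_LEN]

print(normalize_input("Chat-GPT Login"))  # chatgptlogin
```

Depending on the application, rejecting over-long input may be preferable to silently truncating it.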

Links

License

MIT