dksplit / README.md

ABTdomain

metadata

language:
  - en
  - de
  - fr
  - es
  - it
tags:
  - word-segmentation
  - onnx
  - bilstm-crf
  - text-processing
  - domain-names
library_name: onnxruntime
pipeline_tag: token-classification

DKSplit

DKSplit is a word segmentation model for concatenated text. It splits domain names, brand names, and phrases into their component words.

Model Description

Architecture: BiLSTM-CRF (384 embedding, 768 hidden, 3 layers)
Format: ONNX with INT8 quantization
Size: ~9 MB
Input: Lowercase a-z, 0-9 (max 64 characters)

Usage

Install

pip install dksplit

Python

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split_batch(["openaikey", "microsoftoffice"])
# [['openai', 'key'], ['microsoft', 'office']]

Direct ONNX

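import onnxruntime as ort
import numpy as np  # needed to build the input arrays passed to session.run

# Load the INT8-quantized model listed under Files
session = ort.InferenceSession("dksplit-int8.onnx")

# Print the name, shape, and dtype of each input the model expects
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
# See GitHub for full inference code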

Files

dksplit-int8.onnx - ONNX model (INT8 quantized)
dksplit.npz - CRF parameters
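
Because the CRF parameters ship in a separate file, the CRF decoding step presumably runs outside the ONNX graph. The sketch below shows generic Viterbi decoding for a linear-chain CRF, followed by turning per-character tags into words. It is not the library's actual code: the two-tag scheme (`B`/`I`), the function names `viterbi` and `tags_to_words`, and the toy scores are all assumptions made for illustration, and the real label set and the array layout inside `dksplit.npz` are not documented here.

```python
from typing import List

# Hypothetical tag scheme: B = first character of a word, I = continuation.
TAGS = ["B", "I"]


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at character position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])          # best score ending in each tag so far
    backpointers: List[List[int]] = []  # best previous tag for each (step, tag)

    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)

    # Walk the backpointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


def tags_to_words(text: str, path: List[int]) -> List[str]:
    """Start a new word at every B tag; append the character otherwise."""
    words: List[str] = []
    for ch, tag in zip(text, path):
        if TAGS[tag] == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy scores (made up, not model output) that favour the split "go" + "up".
toy_emissions = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]
toy_transitions = [[-1.0, 0.5], [0.0, 0.0]]  # B->B is penalised

toy_path = viterbi(toy_emissions, toy_transitions)
print(tags_to_words("goup", toy_path))  # ['go', 'up']
```

In a real pipeline the emission scores would come from `session.run(...)` and the transition scores from the `.npz` file; refer to the GitHub code for the actual tensor names and label scheme.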

Limitations

Input: a-z, 0-9 only
Max length: 64 characters
Non-Latin scripts: not supported directly; use the romanized form
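
The constraints above can be enforced before calling the model. The following is a minimal sketch, assuming the caller is responsible for cleaning the input; whether `dksplit.split` already normalizes its argument is not stated here. The helper name `normalize_input` and the choice to raise on over-long input (rather than truncate) are illustrative assumptions, not part of the library.

```python
import re

MAX_LEN = 64  # maximum input length listed under Limitations
_DISALLOWED = re.compile(r"[^a-z0-9]")  # anything outside a-z and 0-9


def normalize_input(text: str) -> str:
    """Lowercase `text` and drop every character outside a-z and 0-9.

    Hypothetical helper: raises ValueError if the cleaned string is
    longer than MAX_LEN instead of silently truncating it.
    """
    cleaned = _DISALLOWED.sub("", text.lower())
    if len(cleaned) > MAX_LEN:
        raise ValueError(
            f"input is {len(cleaned)} characters after cleaning; the limit is {MAX_LEN}"
        )
    return cleaned


print(normalize_input("Chat-GPT Login"))  # chatgptlogin
```

The cleaned string can then be passed to `dksplit.split`. Raising on over-long input keeps the failure visible; truncating would instead cut a word in half without warning.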

License

MIT
