tags:
  - multilingual
  - embeddings
  - retrieval
  - transformer-free
  - safetensors
license: mit

PipeOwl-1.3-multilingual (Geometric Embedding)

A transformer-free semantic retrieval engine.

PipeOwl performs deterministic vocabulary scoring over a static embedding field:

score = α⋅base + β⋅Δfield

where:

base = cosine similarity in embedding space
Δfield = static scalar field bias
α, β = tunable weights
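
As a rough illustration of the formula above, the following NumPy sketch scores one query vector against a tiny table. The function name `score_vocabulary`, the default weights `alpha=1.0` and `beta=0.1`, and the toy data are all assumptions made for this example; they are not taken from the PipeOwl source.

```python
import numpy as np

def score_vocabulary(query, table, delta_field, alpha=1.0, beta=0.1):
    """Return alpha * cosine(query, row) + beta * delta_field for every row.

    The alpha and beta defaults are placeholders, not PipeOwl's real settings.
    """
    # Normalize both sides so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    rows = table / np.linalg.norm(table, axis=1, keepdims=True)
    base = rows @ q  # shape (V,): one cosine similarity per token
    return alpha * base + beta * delta_field

# Toy data: 3 tokens in a 2-dimensional space (illustrative only).
table = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
delta_field = np.array([0.05, 0.0, -0.05])
scores = score_vocabulary(np.array([1.0, 0.0]), table, delta_field)
```

Because the bias term is added on top of the cosine similarity, the score of an exact match need not equal exactly 1.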

Features:

Scoring cost is O(n) in the vocabulary size n (one similarity per token).
No attention.
No transformer weights.

Architecture

Static embedding table (V × D)
Aligned vocabulary index
Optional scalar bias field
Linear scoring
Pluggable decoder stage

Targets CPU environments and low-latency systems such as input method editors (IMEs).
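
The components listed above could fit together roughly as follows. This is a hypothetical sketch: `ToyEngine`, `best_match`, the field names, and the toy vectors are invented for illustration and do not describe the actual contents of engine.py.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

import numpy as np

@dataclass
class ToyEngine:
    """Hypothetical stand-in for the listed components."""
    table: np.ndarray                  # static embedding table, shape (V, D), unit-norm rows
    vocab: List[str]                   # aligned index: vocab[i] names row i of the table
    bias: Optional[np.ndarray] = None  # optional scalar bias field, shape (V,)
    alpha: float = 1.0                 # placeholder weight
    beta: float = 1.0                  # placeholder weight

    def scores(self, token: str) -> np.ndarray:
        # Linear scoring: a single matrix-vector product, no attention layers.
        # list.index is a linear scan; a dict lookup would suit a large vocabulary.
        q = self.table[self.vocab.index(token)]
        s = self.alpha * (self.table @ q)
        if self.bias is not None:
            s = s + self.beta * self.bias
        return s

    def query(self, token: str, decoder: Callable[[np.ndarray, List[str]], object]):
        # Pluggable decoder: any callable that turns raw scores into an output.
        return decoder(self.scores(token), self.vocab)

def best_match(scores, vocab):
    """Simplest possible decoder: return the single highest-scoring token."""
    return vocab[int(np.argmax(scores))]

engine = ToyEngine(
    table=np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]),
    vocab=["owl", "sled", "bird"],
)
result = engine.query("owl", best_match)
```

Keeping the decoder as a plain callable means a ranking, thresholding, or IME candidate-list stage can be swapped in without touching the scoring code.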

Model Specs

item	value
vocab size	495,090
embedding dim	1024
storage format	safetensors
model size	~2.03 GB
languages	multilingual (Chinese / English dominant)
startup time	~30s
query latency	~103-104 ms
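
The reported model size is consistent with storing the embedding table as 32-bit floats. The dtype is an assumption here, since the card does not state it. A quick arithmetic check:

```python
vocab_size = 495_090
embedding_dim = 1024
bytes_per_value = 4  # assumption: float32; the card does not state the dtype

table_bytes = vocab_size * embedding_dim * bytes_per_value
table_gb = table_bytes / 1e9  # decimal gigabytes

print(f"{table_bytes:,} bytes = {table_gb:.2f} GB")
# prints: 2,027,888,640 bytes = 2.03 GB
```

By the same count, a brute-force pass over the table touches about 507 million values per query.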

Attribution

See DATA_SOURCES.md for data sources and attribution.

Quickstart

git clone https://huggingface.co/WangKaiLin/PipeOwl-1.3-multilingual
cd PipeOwl-1.3-multilingual

pip install numpy safetensors

python quickstart.py

See full experimental notes here:

https://hackmd.io/@galaxy4552/SyWQ92cFWx

Example:

Please enter words： 雪鴞

Top-K Tokens:
1.004 | 雪鴞
0.823 | 鴟鴞
0.820 | 鴞
0.700 | 長耳鴞
0.686 | 雪橇

Please enter words： happy

Top-K Tokens:
0.998 | happy
0.888 | happiness
0.863 | heureux
0.857 | happyness
0.854 | gelukkig
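
A ranked list like the ones above can be produced from a score vector with a partial sort. The sketch below is generic NumPy and not necessarily how quickstart.py does it. The helper name `top_k` is an assumption, and the scores for "sad" and "table" are invented filler.

```python
import numpy as np

def top_k(scores, vocab, k=5):
    """Return the k highest-scoring (score, token) pairs, best first."""
    k = min(k, len(scores))
    # argpartition finds the k largest entries in O(V) without a full sort.
    idx = np.argpartition(scores, -k)[-k:]
    # Only the k survivors are sorted, in descending score order.
    idx = idx[np.argsort(scores[idx])[::-1]]
    return [(float(scores[i]), vocab[i]) for i in idx]

vocab = ["happy", "sad", "happiness", "table", "heureux", "gelukkig"]
# Scores for "sad" and "table" are made up; the rest echo the example output.
scores = np.array([0.998, 0.210, 0.888, 0.050, 0.863, 0.854])

for score, token in top_k(scores, vocab, k=3):
    print(f"{score:.3f} | {token}")
```

Selecting with `argpartition` before sorting keeps the ranking step linear in the vocabulary size, which matters when the table holds hundreds of thousands of tokens.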

Repository Structure

PipeOwl-1.3-multilingual/
 ├ README.md
 ├ config.json
 ├ DATA_SOURCES.md
 ├ LICENSE
 ├ quickstart.py
 ├ engine.py
 ├ vocabulary.json
 └ pipeowl.safetensors

Multilingual Vocabulary

PipeOwl-1.3 uses a mixed multilingual vocabulary containing:

Chinese words
English words
Mathematical symbols
Symbolic / byte fallback tokens

Total vocabulary size: 495,090 tokens (about 495k)

All tokens share the same embedding field.

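PipeOwl is a geometric retrieval system built on a static semantic field.

Core formula:

score = α⋅base + β⋅Δfield

where:

base = embedding cosine similarity
Δfield = static field offset
α, β = tunable weights

It provides a lightweight O(n) semantic scoring method suited to low-latency environments such as input methods (IMEs).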

LICENSE

MIT