metadata
tags:
- multilingual
- embeddings
- retrieval
- transformer-free
- safetensors
license: mit
PipeOwl-1.3-multilingual (Geometric Embedding)
A transformer-free semantic retrieval engine.
PipeOwl performs deterministic vocabulary scoring over a static embedding field:
score = α⋅base + β⋅Δfield
where:
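- base = cosine similarity between the query and each vocabulary entry in embedding space
- Δfield = static scalar field bias, one value per vocabulary entry
- α, β = tunable weights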
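The scoring rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in engine.py: the function name `score_vocabulary`, the argument names, and the toy arrays are all assumptions made for the example, and α and β are treated as plain tunable scalars.

```python
import numpy as np

def score_vocabulary(query_vec, emb, delta_field, alpha=1.0, beta=0.0):
    """Return one score per vocabulary row: alpha * cosine + beta * delta_field."""
    # Normalise the rows and the query so the dot product equals cosine similarity.
    emb_unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q_unit = query_vec / np.linalg.norm(query_vec)
    base = emb_unit @ q_unit                      # shape (V,)
    return alpha * base + beta * delta_field      # shape (V,)

# Toy data (hypothetical, not taken from the model): V = 3 rows, D = 2 dims.
emb = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
delta_field = np.array([0.0, 0.5, -0.5])
scores = score_vocabulary(np.array([1.0, 0.0]), emb, delta_field,
                          alpha=1.0, beta=0.1)
```

Because the bias term is added to a cosine, a score is not confined to [-1, 1]. That may be why a top score slightly above 1 can appear in practice.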
Features:
- O(n) scoring per query, where n is the vocabulary size.
- No attention.
- No transformer weights.
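Ranking can stay close to linear as well if only the best few entries are needed. The sketch below assumes a top-k query and uses `np.argpartition` for selection. Whether engine.py ranks this way is not stated, and `top_k` is a hypothetical helper name.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    k = min(k, scores.shape[0])
    # argpartition selects the k largest in linear time without a full sort.
    part = np.argpartition(-scores, k - 1)[:k]
    # Only the k survivors are sorted, which costs O(k log k).
    return part[np.argsort(-scores[part])]

# Toy scores (hypothetical values).
scores = np.array([0.10, 0.90, 0.40, 0.75, 0.20])
best = top_k(scores, 3)
```

For a vocabulary of several hundred thousand entries and a small k, this avoids sorting the whole score vector on every query.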
Architecture
- Static embedding table (V × D)
- Aligned vocabulary index
- Optional scalar bias field
- Linear scoring
- Pluggable decoder stage
- Intended for CPU environments and low-latency systems such as input method editors (IME).
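One way these components might fit together is sketched below. Everything here is an assumption for illustration: the class name `StaticFieldIndex`, its fields, the `decoder` callback signature, and the toy tokens and vectors are hypothetical and do not describe the actual engine.py.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

import numpy as np

Candidates = List[Tuple[str, float]]

@dataclass
class StaticFieldIndex:
    tokens: List[str]                    # aligned index: emb[i] belongs to tokens[i]
    emb: np.ndarray                      # static table, shape (V, D), unit-length rows
    delta: Optional[np.ndarray] = None   # optional scalar bias field, shape (V,)

    def __post_init__(self) -> None:
        self.row_of: Dict[str, int] = {t: i for i, t in enumerate(self.tokens)}

    def query(self, token: str, k: int = 5, alpha: float = 1.0, beta: float = 0.0,
              decoder: Optional[Callable[[Candidates], Candidates]] = None) -> Candidates:
        q = self.emb[self.row_of[token]]
        scores = alpha * (self.emb @ q)          # cosine, since rows are unit length
        if self.delta is not None:
            scores = scores + beta * self.delta  # linear scoring with the bias field
        order = np.argsort(-scores)[:k]
        ranked = [(self.tokens[i], float(scores[i])) for i in order]
        # The decoder stage is pluggable: any callable that post-processes candidates.
        return decoder(ranked) if decoder else ranked

# Toy table (hypothetical tokens and vectors).
emb = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
index = StaticFieldIndex(tokens=["owl", "bird", "sled"], emb=emb)
result = index.query("owl", k=2)
filtered = index.query("owl", k=3,
                       decoder=lambda cands: [c for c in cands if c[0] != "owl"])
```

Keeping the decoder as a plain callable means a caller can filter, re-rank, or reformat candidates without touching the scoring path.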
Model Specs
| item | value |
|---|---|
| vocab size | 495,090 |
| embedding dim | 1024 |
| storage format | safetensors |
| model size | ~2.03 GB |
| languages | multilingual (Chinese / English dominant) |
| startup time | ~30s |
| query latency | ~103-104 ms |
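The size figure can be sanity-checked from the vocabulary size and embedding dimension. The arithmetic below assumes the table is stored as float32, which the specs do not state, and ignores the small safetensors header and any bias field.

```python
VOCAB_SIZE = 495_090
EMBEDDING_DIM = 1024
BYTES_PER_FLOAT32 = 4  # assumption: float32 storage

table_bytes = VOCAB_SIZE * EMBEDDING_DIM * BYTES_PER_FLOAT32
table_gb = table_bytes / 1e9                   # decimal gigabytes
macs_per_query = VOCAB_SIZE * EMBEDDING_DIM    # multiply-adds for one full scan

print(f"{table_bytes:,} bytes = {table_gb:.2f} GB")
print(f"{macs_per_query:,} multiply-adds per query")
```

Under that assumption the table comes to 2,027,888,640 bytes, which agrees with the listed ~2.03 GB, and one full scan costs about 507 million multiply-adds.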
Attribution
Quickstart
git clone https://huggingface.co/WangKaiLin/PipeOwl-1.3-multilingual
cd PipeOwl-1.3-multilingual
pip install numpy safetensors
python quickstart.py
See full experimental notes here:
https://hackmd.io/@galaxy4552/SyWQ92cFWx
Example:
Please enter words: 雪鴞
Top-K Tokens:
1.004 | 雪鴞
0.823 | 鴟鴞
0.820 | 鴞
0.700 | 長耳鴞
0.686 | 雪橇
Please enter words: happy
Top-K Tokens:
0.998 | happy
0.888 | happiness
0.863 | heureux
0.857 | happyness
0.854 | gelukkig
Repository Structure
pipeowl-1.3-multilingual/
├ README.md
├ config.json
├ DATA_SOURCES.md
├ LICENSE
├ quickstart.py
├ engine.py
├ vocabulary.json
└ pipeowl.safetensors
Multilingual Vocabulary
PipeOwl-1.3 uses a mixed multilingual vocabulary containing:
- Chinese words
- English words
- Mathematical symbols
- Symbolic / byte fallback tokens
Total vocabulary size: 495,090 tokens (about 495k)
All tokens share the same embedding field.
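PipeOwl is a geometric retrieval system built on a static semantic field.
Core formula:
score = α⋅base + β⋅Δfield
where:
- base = embedding cosine similarity
- Δfield = static field offset
- α / β are tunable weights
It provides a lightweight O(n) semantic scoring method suited to low-latency environments such as input method editors.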
LICENSE
MIT