Pipeowl-1.8.3-jp-Whitebox (Geometric Embedding)

A transformer-free semantic retrieval engine.

PipeOwl performs deterministic vocabulary scoring over a static embedding field:

score = α⋅base + (1 - α⋅base)⋅Δfield

where:

base = cosine similarity in embedding space
Δfield = static scalar field bias

Evaluation units:

BPB (bits per byte): measured per byte
token NLL: measured per token

token NLL: 12.943284891453972
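
The scoring rule score = α⋅base + (1 - α⋅base)⋅Δfield can be sketched in a few lines of Python. This is a minimal illustration only: the function name `pipeowl_score` and the default `alpha=1.0` are assumptions made for the example, not identifiers or values taken from the repository's engine.py.

```python
def pipeowl_score(base: float, delta_field: float, alpha: float = 1.0) -> float:
    """Hypothetical helper: score = alpha*base + (1 - alpha*base) * delta_field."""
    scaled = alpha * base
    # delta_field closes a fraction of the remaining gap between `scaled` and 1.
    return scaled + (1.0 - scaled) * delta_field


# With alpha = 1 (assumed), an exact match (base = 1) scores 1 regardless of
# the field bias, because the (1 - alpha*base) factor vanishes.
print(pipeowl_score(1.0, 0.3))   # 1.0
# With no field bias, the score reduces to alpha*base.
print(pipeowl_score(0.5, 0.0))   # 0.5
```

Read this way, Δfield = 0 leaves the cosine term untouched and Δfield = 1 pushes the score to 1 for any base.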

Features:

O(n) scoring over the vocabulary (n = vocabulary size).
No attention.
No transformer weights.
CPU-friendly (<16 MB model).

Architecture

Static embedding table (V × D)
Aligned vocabulary index
Optional scalar bias field (Δfield)
Linear scoring
Pluggable decoder stage

Intended for CPU environments and low-latency systems such as input method editors (IME).
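
As a rough sketch of how these components could fit together, the example below scores every vocabulary entry with a single matrix-vector product and returns the top k. It uses random toy data rather than the released weights. The function name `top_k`, the tiny 5-word vocabulary, the 8-dimensional vectors, and `alpha=1.0` are all assumptions for illustration; the actual engine.py may be organized differently.

```python
import numpy as np


def top_k(query_vec, emb, delta_field, vocab, k=5, alpha=1.0):
    """Hypothetical full-vocabulary scan: cosine base, field bias, linear score."""
    # Normalise rows and query so the dot product equals cosine similarity.
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    base = emb_n @ q                                  # shape (V,), cost O(V * D)
    scaled = alpha * base
    final = scaled + (1.0 - scaled) * delta_field     # score from the formula
    idx = np.argsort(-final)[: min(k, len(vocab))]    # best scores first
    return [(vocab[i], float(base[i]), float(delta_field[i]), float(final[i]))
            for i in idx]


# Toy stand-ins for the static embedding table (V x D) and the scalar field.
rng = np.random.default_rng(0)
vocab = ["東京", "大阪", "パリ", "名古屋", "は"]
emb = rng.standard_normal((len(vocab), 8)).astype(np.float32)
delta_field = rng.uniform(0.0, 0.5, size=len(vocab)).astype(np.float32)

# Query with the embedding of the first word; it should rank itself first.
results = top_k(emb[0], emb, delta_field, vocab, k=3)
for rank, (word, base, delta, final) in enumerate(results, start=1):
    print(f"{rank} {word} | base={base:.3f} | delta={delta:.3f} | final={final:.3f}")
```

Because the whole scan is one dense product followed by elementwise arithmetic, the cost grows linearly with vocabulary size and needs no attention layers.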

Model Specs

item	value
vocab size	26155
embedding dim	256
storage format	safetensors (FP16)
model size	~13.2 MB
languages	Japanese
startup time	<1s
query latency	~1 ms (CPU, full vocabulary scan)

Quickstart

git clone https://huggingface.co/WangKaiLin/Pipeowl-1.8.3-jp-Whitebox
cd Pipeowl-1.8.3-jp-Whitebox

pip install numpy safetensors

python debug.py

Example semantic retrieval results:

Please enter words： 東京

Top-K Debug:
1 東京 | base=1.000 | delta=0.478 | final=1.000
2 は | base=-0.294 | delta=0.907 | final=0.880
3 大阪 | base=0.679 | delta=0.346 | final=0.790
4 パリ | base=0.597 | delta=0.419 | final=0.766
5 名古屋 | base=0.646 | delta=0.284 | final=0.747

Please enter words： 大阪

Top-K Debug:
1 大阪 | base=1.000 | delta=0.346 | final=1.000
2 は | base=-0.200 | delta=0.907 | final=0.889
3 東京 | base=0.679 | delta=0.478 | final=0.832
4 関西 | base=0.756 | delta=0.252 | final=0.817
5 尼崎 | base=0.710 | delta=0.367 | final=0.816
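
The debug rows above can be checked against the scoring formula. The sketch below recomputes `final` from the printed `base` and `delta` values for the 東京 query, assuming α = 1. That value is an inference from the printed numbers, not a documented setting, and the tolerance only allows for the three-decimal rounding in the output.

```python
# (word, base, delta, final) copied from the 東京 query output above.
rows = [
    ("東京", 1.000, 0.478, 1.000),
    ("は", -0.294, 0.907, 0.880),
    ("大阪", 0.679, 0.346, 0.790),
    ("パリ", 0.597, 0.419, 0.766),
    ("名古屋", 0.646, 0.284, 0.747),
]

alpha = 1.0  # assumed; not stated in the README
max_err = 0.0
for word, base, delta, final in rows:
    recomputed = alpha * base + (1.0 - alpha * base) * delta
    max_err = max(max_err, abs(recomputed - final))
    print(f"{word}: recomputed={recomputed:.3f} printed={final:.3f}")

# Every row agrees with the formula to within display rounding.
assert max_err < 2e-3
```

The same arithmetic shows why は ranks second despite a negative cosine similarity: its large Δfield (0.907) closes most of the gap between base and 1.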

Repository Structure

Pipeowl-1.8.3-jp-Whitebox/
 ├ README.md
 ├ config.json
 ├ DATA_SOURCES.md
 ├ debug.py
 ├ LICENSE
 ├ quickstart.py
 ├ engine.py
 ├ vocabulary.json
 └ pipeowl_fp16.safetensors