metadata
language:
- ja
tags:
- embeddings
- retrieval
- transformer-free
- safetensors
- edge-ai
license: mit
PipeOwl-1.5-Japanese (Geometric Embedding)
A transformer-free semantic retrieval engine.
PipeOwl performs deterministic vocabulary scoring over a static embedding field:
score = α⋅base + β⋅Δfield
where:
- base = cosine similarity in embedding space
- Δfield = static scalar field bias
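
The scoring rule above can be sketched in NumPy as follows. This is a minimal illustration, not the code shipped in `engine.py`: the function name `pipeowl_scores`, its parameter names, and the default values of `alpha` and `beta` are assumptions made for the example.

```python
import numpy as np

def pipeowl_scores(query_vec, emb, delta_field=None, alpha=1.0, beta=0.0):
    """Return alpha * base + beta * delta_field for every vocabulary row.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    emb = np.asarray(emb, dtype=np.float32)      # (V, D) static embedding table
    q = np.asarray(query_vec, dtype=np.float32)  # (D,) query embedding

    # base: cosine similarity, computed as a dot product of L2-normalised vectors
    emb_unit = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    q_unit = q / max(float(np.linalg.norm(q)), 1e-12)
    base = emb_unit @ q_unit                     # (V,)

    if delta_field is None:                      # the scalar bias field is optional
        return alpha * base
    return alpha * base + beta * np.asarray(delta_field, dtype=np.float32)
```

Because the score is a single matrix-vector product plus an element-wise add, the cost grows linearly with the number of vocabulary rows.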
Features:
- O(n) scoring per query, where n is the vocabulary size.
- No attention.
- No transformer weights.
- CPU-friendly (<40MB model)
Architecture
- Static embedding table (V × D)
- Aligned vocabulary index
- Optional scalar bias field (Δfield)
- Linear scoring
- Pluggable decoder stage

Targeted at CPU environments and low-latency systems (e.g. IME).
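
As a rough sketch of how the aligned vocabulary index and the pluggable decoder stage might fit together, the helper below takes a precomputed score vector, selects the top-k entries, and optionally hands them to a decoder callable. The name `top_k_tokens`, the `(score, token)` candidate format, and the decoder signature are assumptions for illustration and are not taken from `engine.py`.

```python
import numpy as np

def top_k_tokens(scores, vocab, k=5, decoder=None):
    """Pick the k best-scoring tokens and optionally pass them to a decoder.

    Hypothetical helper: `vocab[i]` is assumed to be the token for row i of the
    embedding table, and `decoder` is any callable taking a list of
    (score, token) pairs.
    """
    scores = np.asarray(scores, dtype=np.float32)
    if len(vocab) != scores.shape[0]:
        raise ValueError("vocabulary and scores must be aligned row by row")

    k = min(k, scores.shape[0])
    if k <= 0:
        return []

    # argpartition finds the k largest scores in O(V) without a full sort
    idx = np.argpartition(-scores, k - 1)[:k]
    # sort only the k survivors, best first
    idx = idx[np.argsort(-scores[idx], kind="stable")]

    candidates = [(float(scores[i]), vocab[i]) for i in idx]
    return decoder(candidates) if decoder is not None else candidates
```

A decoder here could be anything that re-ranks or filters candidates, for example `lambda c: [tok for _, tok in c]` to keep only the token strings.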
Model Specs
| item | value |
|---|---|
| vocab size | 26155 |
| embedding dim | 768 |
| storage format | safetensors (FP16) |
| model size | ~38.7 MB |
| languages | Japanese |
| startup time | <1s |
| query latency | ~3-4 ms (CPU, full vocabulary scan) |
Attribution
Quickstart
git clone https://huggingface.co/WangKaiLin/PipeOwl-1.5-jp
cd PipeOwl-1.5-jp
pip install numpy safetensors
python quickstart.py
Example semantic retrieval results:
Please enter words: 日
Top-K Tokens:
0.974 | 日
0.794 | 日の
0.789 | 翌日
0.777 | 週
0.775 | 週間
Please enter words: 行
Top-K Tokens:
0.961 | 行
0.794 | 行こ
0.787 | 執り行
0.787 | 入
0.784 | 起
Please enter words: 東京
Top-K Tokens:
0.979 | 東京
0.872 | 大阪
0.868 | 名古屋
0.849 | 横浜
0.848 | 目黒
Benchmark (CPU)
Environment:
- Vocab size: 26,155
- Embedding dimension: 768
- Hardware: CPU
Average query latency:
- PipeOwl: 0.0036 sec
- BM25: 0.0421 sec
- Embedding: 0.0283 sec
- FAISS Flat: 0.0324 sec
- FAISS HNSW: 0.0230 sec
| Comparison | Speedup |
|---|---|
| vs BM25 | 11.7× faster |
| vs Embedding | 7.9× faster |
| vs FAISS Flat | 9.0× faster |
| vs FAISS HNSW | 6.4× faster |
PipeOwl shows 6–12× lower latency compared with common retrieval baselines in this setup.
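
The speedup figures can be reproduced from the average latencies listed above by dividing each baseline latency by PipeOwl's. The snippet below is only a worked check of that arithmetic; the dictionary and function names are illustrative.

```python
# Average query latencies in seconds, copied from the benchmark list above
latency_sec = {
    "PipeOwl": 0.0036,
    "BM25": 0.0421,
    "Embedding": 0.0283,
    "FAISS Flat": 0.0324,
    "FAISS HNSW": 0.0230,
}

def speedups(latencies, reference="PipeOwl"):
    """Return baseline latency divided by the reference latency, rounded to 0.1."""
    ref = latencies[reference]
    return {
        name: round(t / ref, 1)
        for name, t in latencies.items()
        if name != reference
    }

result = speedups(latency_sec)
for name, factor in result.items():
    print(f"vs {name}: {factor}x faster")
```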
Benchmark repository: https://huggingface.co/datasets/WangKaiLin/pipeowl-1.5-jp-benchmark
Repository Structure
pipeowl-1.5-jp/
├ README.md
├ config.json
├ DATA_SOURCES.md
├ LICENSE
├ quickstart.py
├ engine.py
├ vocabulary.json
└ pipeowl_fp16.safetensors
LICENSE
MIT