PipeOwl-1.4-multilingual (Geometric Embedding)

This release introduces FP16 storage to reduce model size and startup time.

A transformer-free semantic retrieval engine.

PipeOwl performs deterministic vocabulary scoring over a static embedding field:

score = α⋅base + β⋅Δfield

where:

  • base = cosine similarity in embedding space
  • Δfield = static scalar field bias
  • α, β = tunable weights
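
The formula above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy table, not the code in engine.py: the function names `score_vocabulary` and `top_k`, the weights `alpha=1.0` and `beta=0.1`, and the three-token vocabulary are all assumptions made for the example.

```python
import numpy as np

def score_vocabulary(query_vec, table, delta_field, alpha=1.0, beta=0.1):
    """Return alpha * cosine(query, row) + beta * delta_field for every row."""
    # Upcast to float32 so FP16 storage does not limit the arithmetic.
    rows = table.astype(np.float32)
    rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    q = query_vec.astype(np.float32)
    q = q / np.linalg.norm(q)
    base = rows @ q  # cosine similarity against every row, shape (V,)
    return alpha * base + beta * delta_field.astype(np.float32)

def top_k(scores, vocab, k=5):
    """Pick the k best rows with a linear-time selection, then sort those k."""
    k = min(k, len(scores))
    idx = np.argpartition(-scores, k - 1)[:k]
    idx = idx[np.argsort(-scores[idx])]
    return [(float(scores[i]), vocab[i]) for i in idx]

# Toy data (hypothetical): 3 tokens, 4 dimensions, stored as FP16.
vocab = ["owl", "snowy owl", "sled"]
table = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.8, 0.6, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]], dtype=np.float16)
delta_field = np.array([0.02, 0.0, 0.05], dtype=np.float16)

ranked = top_k(score_vocabulary(table[0], table, delta_field), vocab, k=3)
for score, token in ranked:
    print(f"{score:.3f} | {token}")
```

Every row is scored exactly once per query, which is where the linear cost in the vocabulary size comes from.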

Features:

  • O(n) scoring per query, where n is the vocabulary size.
  • No attention.
  • No transformer weights.

Changes in 1.4

  • Embedding storage converted from FP32 to FP16
  • Model size reduced from ~1.9 GB to ~1.01 GB
  • Startup time reduced from ~30 s to ~2 s
  • Scoring pipeline unchanged
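
To give a feel for what the FP16 conversion trades off, the sketch below casts a randomly generated toy table (not the released weights) from FP32 to FP16 and compares cosine similarities before and after. The table shape, the random seed, and the helper `cosine_to_all` are assumptions for illustration; how the release itself performed the conversion is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for the embedding table: 1000 rows, 64 dimensions.
table32 = rng.standard_normal((1000, 64)).astype(np.float32)
table16 = table32.astype(np.float16)  # half the bytes per value

def cosine_to_all(query_vec, table):
    """Cosine similarity between one query and every row, computed in float32."""
    rows = table.astype(np.float32)
    rows = rows / np.linalg.norm(rows, axis=1, keepdims=True)
    q = query_vec.astype(np.float32)
    return rows @ (q / np.linalg.norm(q))

query = table32[0]
drift = np.abs(cosine_to_all(query, table32) - cosine_to_all(query, table16)).max()
size_ratio = table16.nbytes / table32.nbytes

print(f"size ratio FP16/FP32: {size_ratio}")  # 0.5
print(f"largest change in cosine similarity: {drift:.2e}")
```

On this toy table the storage halves while the similarities move only slightly; whether the same holds for a given real table is something to measure rather than assume.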

Architecture

  • Static embedding table (V × D)
  • Aligned vocabulary index
  • Optional scalar bias field
  • Linear scoring
  • Pluggable decoder stage
  • Targets CPU environments and low-latency systems, e.g. input method editors (IMEs).

Model Specs

item             value
vocab size       495,090
embedding dim    1024
storage format   safetensors (FP16)
model size       ~1.01 GB
languages        multilingual (Chinese / English dominant)
startup time     ~2 s
query latency    ~101-105 ms (CPU)
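
As a sanity check on the specs above, the raw size of a V × D table can be computed directly. The short calculation below uses only the vocabulary size and embedding dimension from the table; it ignores the safetensors header and any bias field, and the reading of the FP32 figure as binary gigabytes (GiB) is an inference, not something the release states.

```python
VOCAB_SIZE = 495_090     # from the specs table
EMBEDDING_DIM = 1024     # from the specs table

def table_bytes(bytes_per_value):
    """Raw bytes for a VOCAB_SIZE x EMBEDDING_DIM table at the given precision."""
    return VOCAB_SIZE * EMBEDDING_DIM * bytes_per_value

fp16_bytes = table_bytes(2)
fp32_bytes = table_bytes(4)

print(f"FP16: {fp16_bytes / 1e9:.2f} GB")
# FP16: 1.01 GB
print(f"FP32: {fp32_bytes / 1e9:.2f} GB ({fp32_bytes / 2**30:.2f} GiB)")
# FP32: 2.03 GB (1.89 GiB)
```

The FP16 result agrees with the listed ~1.01 GB model size, and the FP32 result agrees with ~1.9 if that figure is read in GiB.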

Attribution

See DATA_SOURCES.md.

Quickstart

git clone https://huggingface.co/WangKaiLin/PipeOwl-1.4-multilingual
cd PipeOwl-1.4-multilingual

pip install numpy safetensors

python quickstart.py

See full experimental notes here:

https://hackmd.io/@galaxy4552/SyWQ92cFWx

Example:

Please enter words: 雪鴞

Top-K Tokens:
1.004 | 雪鴞
0.823 | 鴟鴞
0.820 | 鴞
0.700 | 長耳鴞
0.686 | 雪橇

Please enter words: happy

Top-K Tokens:
0.998 | happy
0.888 | happiness
0.863 | heureux
0.857 | happyness
0.854 | gelukkig

Repository Structure

pipeowl-1.4-multilingual/
 ├ README.md
 ├ config.json
 ├ DATA_SOURCES.md 
 ├ LICENSE
 ├ quickstart.py
 ├ engine.py
 ├ vocabulary.json
 └ pipeowl_fp16.safetensors

Multilingual Vocabulary

PipeOwl-1.4 uses a mixed multilingual vocabulary containing:

  • Chinese words
  • English words
  • Mathematical symbols
  • Symbolic / byte fallback tokens

Total vocabulary size: 495,090 tokens (~495k)

All tokens share the same embedding field.

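PipeOwl is a geometric retrieval system built on a static semantic field.

Core formula:

score = α⋅base + β⋅Δfield

where:

  • base = embedding cosine similarity
  • Δfield = static field offset
  • α and β are tunable weights

It provides a lightweight O(n) semantic scoring method suited to low-latency environments such as input method editors.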

LICENSE

MIT
