---
language: zh
tags:
- embeddings
- retrieval
- numpy
- transformer-free
license: mit
---

# PipeOwl-1.0 (Geometric Embedding)

PipeOwl is a transformer-free geometric embedding package built on a **static embedding field** stored as NumPy arrays.

This repo provides:
- `L1_base_embeddings.npy`: float32 (V, 1024) embedding table (unit-normalized)
- `L1_base_vocab.json`: list of vocab strings aligned to embedding rows
- `delta_base_scalar.npy`: float32 (V,) optional scalar bias field
- a minimal inference engine (`engine.py`) and a usage script (`quickstart.py`)

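As a rough sketch of how these files fit together, the snippet below loads them directly with NumPy and checks that the embedding rows line up with the vocab. `load_field` is a hypothetical helper, not part of `engine.py`. It assumes the files sit in a `data/` directory and treats `delta_base_scalar.npy` as optional in the sense that the file may be absent. Calling `emb, vocab, delta = load_field("data")` should then give a `(V, 1024)` table with `len(vocab) == V`.

```python
import json
from pathlib import Path

import numpy as np


def load_field(data_dir="data"):
    """Load the embedding table, vocab, and (if present) scalar field.

    Hypothetical helper for illustration; file names follow the list above.
    """
    data_dir = Path(data_dir)

    # mmap_mode="r" avoids reading the whole table into RAM up front.
    emb = np.load(data_dir / "L1_base_embeddings.npy", mmap_mode="r")

    with open(data_dir / "L1_base_vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)

    # Assumption: "optional" means the file may be missing.
    delta_path = data_dir / "delta_base_scalar.npy"
    delta = np.load(delta_path) if delta_path.exists() else None

    # Row i of `emb` should correspond to vocab[i].
    if emb.shape[0] != len(vocab):
        raise ValueError(
            f"{emb.shape[0]} embedding rows but {len(vocab)} vocab entries"
        )
    if delta is not None and delta.shape != (emb.shape[0],):
        raise ValueError(f"unexpected delta shape {delta.shape}")
    return emb, vocab, delta
```
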
---

## Attribution
The base embedding vectors are outputs of **BGE (Apache-2.0)**, generated by running the model in inference mode.
This repository **does not redistribute any original BGE model weights**.

---

## Quickstart

```bash
pip install numpy
python quickstart.py
```

Or, for minimal usage:

```python
from engine import PipeOwlEngine, PipeOwlConfig

engine = PipeOwlEngine(PipeOwlConfig())
q = engine.encode("雪鴞好可愛")  # "Snowy owls are so cute"
# use q for similarity / retrieval
```

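Because the embedding table is unit-normalized, cosine similarity against it reduces to a dot product with a normalized query. The sketch below shows one way to rank rows for a query vector. `top_k_cosine` is a hypothetical helper, and it assumes `engine.encode` returns a 1-D vector with the same dimension as the table rows, which this README does not state explicitly.

```python
import numpy as np


def top_k_cosine(query, table, k=5):
    """Return (indices, scores) of the k rows of `table` closest to `query`.

    Assumes the rows of `table` are unit-normalized, so the dot product
    with a normalized query equals cosine similarity.
    """
    query = np.asarray(query, dtype=np.float32)
    norm = np.linalg.norm(query)
    if norm == 0:
        raise ValueError("query vector has zero norm")

    scores = table @ (query / norm)   # shape (V,)
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]     # indices by descending score
    return top, scores[top]
```

With the real data, `table` would be the array stored in `data/L1_base_embeddings.npy`, and the returned indices map into the vocab list.
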
## Files

- `data/L1_base_embeddings.npy`: embedding table (float32, V×1024)
- `data/L1_base_vocab.json`: vocab aligned with rows
- `data/delta_base_scalar.npy`: scalar bias (float32, V)
- `engine.py`: minimal runtime
- `quickstart.py`: example script

## Notes

No `safetensors` or `pytorch_model.bin` file is included, because this model is distributed as a static NumPy embedding field.

---

## Parameter Size
~165M embedding parameters (static matrix)

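For a sense of scale, the parameter count can be turned into an approximate vocab size and memory footprint. This is back-of-the-envelope arithmetic that assumes the ~165M figure refers to the `(V, 1024)` float32 table alone. Since that figure is rounded, the row count is only approximate.

```python
def field_footprint(num_params, dim=1024, bytes_per_value=4):
    """Return (approx. number of rows, size in MiB) for a dense table."""
    rows = num_params / dim
    size_mib = num_params * bytes_per_value / 2**20
    return rows, size_mib


rows, size_mib = field_footprint(165_000_000)
print(f"~{rows:,.0f} rows, ~{size_mib:,.0f} MiB")
# ~161,133 rows, ~629 MiB
```
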
## Intended Use
- Semantic similarity
- Lightweight retrieval
- Geometric experimentation

## Limitations
- No contextual modeling
- No token interaction modeling
- Domain performance varies

---

## Stress Test Results (Hard Retrieval Setting)

- corpus size = 1200
- eval size = 200
- OOD ratio = 0.28

| Model   | In-domain MRR@10 | OOD MRR@10 |
|---------|------------------|------------|
| MiniLM  | 0.019            | 0.026      |
| BGE     | 0.026            | 0.009      |
| PipeOwl | 0.013            | 0.023      |

Note: This test uses a harder corpus and adversarial-style queries.
Absolute scores are low for all models because the task difficulty was scaled up.

See the full experimental notes here:
<https://hackmd.io/@galaxy4552/BkpUEnTwbl>

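For reference, MRR@10 averages the reciprocal rank of the first relevant document within the top 10 results, counting 0 when none appears. A minimal sketch follows. `mrr_at_k` is a hypothetical helper, and the one-gold-document-per-query setup is an assumption, since the evaluation script is not part of this README. For example, three queries whose gold documents are ranked 1st, 2nd, and not retrieved give (1 + 0.5 + 0) / 3 = 0.5.

```python
def mrr_at_k(ranked_ids, gold_ids, k=10):
    """Mean reciprocal rank at cutoff k.

    ranked_ids: one ranked list of document ids per query (best first).
    gold_ids:   the single relevant document id per query (assumption).
    """
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("need exactly one gold id per query")

    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(gold_ids) if gold_ids else 0.0
```
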
---

```text
pipeowl/
│
├─ README.md
├─ LICENSE
│
├─ engine.py
├─ quickstart.py
│
└─ data/
   ├─ L1_base_embeddings.npy
   ├─ delta_base_scalar.npy
   └─ L1_base_vocab.json
```