---
language: zh
tags:
- embeddings
- retrieval
- numpy
- transformer-free
license: mit
---
# PipeOwl-1.0 (Geometric Embedding)
PipeOwl is a transformer-free geometric embedding package built on a **static embedding field** stored as NumPy arrays.
This repo provides:
- `L1_base_embeddings.npy`: float32 (V, 1024) embedding table (unit-normalized)
- `L1_base_vocab.json`: list of vocab strings aligned to embedding rows
- `delta_base_scalar.npy`: float32 (V,) optional scalar bias field
- minimal inference engine (`engine.py`) and usage script (`quickstart.py`)
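The three data files can be inspected directly with NumPy and `json`. The sketch below builds a tiny stand-in copy of the files (the real vocabulary is much larger) and reloads it to show the shape contract; the loading code is illustrative, not taken from `engine.py`.

```python
import json
import os
import tempfile

import numpy as np

# Tiny stand-in for the distributed files; the real table is float32
# with shape (V, 1024) for a much larger V. Values here are random.
tmp = tempfile.mkdtemp()
V, D = 4, 1024
rng = np.random.default_rng(0)
emb = rng.standard_normal((V, D)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
np.save(os.path.join(tmp, "L1_base_embeddings.npy"), emb)
np.save(os.path.join(tmp, "delta_base_scalar.npy"),
        np.zeros(V, dtype=np.float32))
with open(os.path.join(tmp, "L1_base_vocab.json"), "w", encoding="utf-8") as f:
    json.dump(["a", "b", "c", "d"], f)

# Loading mirrors the expected layout: embedding rows align one-to-one
# with vocab entries, and the scalar field has one value per row.
table = np.load(os.path.join(tmp, "L1_base_embeddings.npy"))
bias = np.load(os.path.join(tmp, "delta_base_scalar.npy"))
with open(os.path.join(tmp, "L1_base_vocab.json"), encoding="utf-8") as f:
    vocab = json.load(f)

assert table.shape == (len(vocab), D)
assert bias.shape == (len(vocab),)
```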
---
## Attribution
The base embedding vectors were generated using **BGE (Apache-2.0)** via inference (model outputs).
This repository **does not redistribute any original BGE model weights**.
---
## Quickstart
```bash
pip install numpy
python quickstart.py
```
Or minimal usage:
```python
from engine import PipeOwlEngine, PipeOwlConfig
engine = PipeOwlEngine(PipeOwlConfig())
q = engine.encode("雪鴞好可愛")  # "Snowy owls are so cute"
# use q for similarity / retrieval
```
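Because the embeddings are unit-normalized, cosine similarity reduces to a dot product. Below is a minimal retrieval sketch over random stand-in document vectors; `top_k` is a hypothetical helper, and in practice the query and document vectors would come from `engine.encode`.

```python
import numpy as np

def top_k(query: np.ndarray, docs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar rows of `docs` to `query`.

    Assumes both query and document vectors are unit-normalized,
    so the dot product equals cosine similarity.
    """
    scores = docs @ query                 # (N,) cosine similarities
    return np.argsort(scores)[::-1][:k]   # best-scoring indices first

# Random stand-ins with the model's 1024-dim geometry.
rng = np.random.default_rng(0)
docs = rng.standard_normal((100, 1024)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42]                          # query identical to doc 42
print(top_k(query, docs, k=3)[0])         # → 42 (self-match ranks first)
```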
## Files
- `data/L1_base_embeddings.npy`: embedding table (float32, V×1024)
- `data/L1_base_vocab.json`: vocab aligned with rows
- `data/delta_base_scalar.npy`: scalar bias (float32, V)
- `engine.py`: minimal runtime
- `quickstart.py`: example script
## Notes
No `safetensors` or `pytorch_model.bin` file is included, because this model is distributed as a static NumPy embedding field.
---
## Parameter Size
Approximately 165M parameters, all in the static embedding matrix (V × 1024, float32).
## Intended Use
- Semantic similarity
- Lightweight retrieval
- Geometric experimentation
## Limitations
- No contextual modeling
- No token interaction modeling
- Domain performance varies
---
## Stress Test Results (Hard Retrieval Setting)
- corpus size = 1200
- eval size = 200
- ood ratio = 0.28
| Model | in-domain MRR@10 | OOD MRR@10 |
|--------|-----------------|------------|
| MiniLM | 0.019 | 0.026 |
| BGE | 0.026 | 0.009 |
| PipeOwl | 0.013 | 0.023 |
Note: this test uses a harder corpus and adversarial-style queries, so absolute scores are low for all models; the numbers are best read as relative comparisons.
See full experimental notes here:
<https://hackmd.io/@galaxy4552/BkpUEnTwbl>
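For reference, MRR@10 scores like those in the table can be computed in a few lines. `mrr_at_10` below is a hypothetical helper (not code from this repo) assuming exactly one relevant document per query.

```python
def mrr_at_10(ranked_ids, gold_ids):
    """Mean reciprocal rank, truncated at depth 10.

    ranked_ids: one ranked list of candidate ids per query.
    gold_ids:   the single relevant id for each query.
    """
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, cand in enumerate(ranking[:10], start=1):
            if cand == gold:
                total += 1.0 / rank
                break  # only the first hit contributes
    return total / len(gold_ids)

# Gold doc at rank 1 and rank 4 -> (1/1 + 1/4) / 2 = 0.625
print(mrr_at_10([[7, 2, 3], [9, 8, 5, 7]], [7, 7]))  # → 0.625
```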
---
## Repository Structure
```text
pipeowl/
│
├─ README.md
├─ LICENSE
│
├─ engine.py
├─ quickstart.py
│
└─ data/
├─ L1_base_embeddings.npy
├─ delta_base_scalar.npy
└─ L1_base_vocab.json
``` |