---
language: zh
tags:
- embeddings
- retrieval
- numpy
- transformer-free
license: mit
---

# PipeOwl-1.0 (Geometric Embedding)

PipeOwl is a transformer-free geometric embedding package built on a **static embedding field** stored as NumPy arrays.

This repo provides:
- `L1_base_embeddings.npy`: float32 (V, 1024) embedding table (unit-normalized)
- `L1_base_vocab.json`: list of vocab strings aligned to embedding rows
- `delta_base_scalar.npy`: float32 (V,) optional scalar bias field
- minimal inference engine (`engine.py`) and usage script (`quickstart.py`)
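The invariants implied by the file list above can be sketched with toy stand-ins (small random arrays in place of the real `.npy` files; the row-for-row alignment between vocab and embeddings is as described above):

```python
import numpy as np

# Toy stand-in for the real files: 4 vocab entries, 1024-dim rows.
rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 1024)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize rows
vocab = ["a", "b", "c", "d"]                        # aligned to emb rows
bias = np.zeros(len(vocab), dtype=np.float32)       # optional scalar field

# Invariants the package description implies:
assert emb.shape == (len(vocab), 1024)
assert bias.shape == (len(vocab),)
assert np.allclose(np.linalg.norm(emb, axis=1), 1.0, atol=1e-5)
```

With the real files, `np.load("data/L1_base_embeddings.npy")` and `json.load` on the vocab file would replace the toy arrays.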

---

## Attribution
The base embedding vectors were generated by running inference with **BGE (Apache-2.0)**; only model outputs are stored here.
This repository **does not redistribute any original BGE model weights**.

---

## Quickstart

```bash
pip install numpy
python quickstart.py
```
Or minimal usage:
```python
from engine import PipeOwlEngine, PipeOwlConfig

engine = PipeOwlEngine(PipeOwlConfig())
q = engine.encode("雪鴞好可愛")
# use q for similarity / retrieval
```
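Because the embedding rows are unit-normalized, retrieval over an encoded query is a single matrix product. A minimal sketch in plain NumPy (random toy vectors stand in for `engine.encode` output):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(x):
    # Normalize vectors to unit length along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

corpus = unit(rng.standard_normal((100, 1024)).astype(np.float32))  # doc vectors
q = unit(rng.standard_normal(1024).astype(np.float32))              # query vector

scores = corpus @ q                  # dot product == cosine (unit-norm rows)
top10 = np.argsort(-scores)[:10]     # indices of the 10 nearest documents
```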

## Files

- `data/L1_base_embeddings.npy`: embedding table (float32, V×1024)
- `data/L1_base_vocab.json`: vocab aligned with embedding rows
- `data/delta_base_scalar.npy`: scalar bias field (float32, V)
- `engine.py`: minimal runtime
- `quickstart.py`: example script

## Notes

No `safetensors` or `pytorch_model.bin` file is included, because this model is distributed as a static NumPy embedding field.

---

## Parameter Size
~165M embedding parameters (static matrix; at 1024 dimensions per row, this corresponds to roughly 161k vocabulary rows)

## Intended Use
- Semantic similarity
- Lightweight retrieval
- Geometric experimentation

## Limitations
- No contextual modeling
- No token interaction modeling
- Domain performance varies

---

## Stress Test Results (Hard Retrieval Setting)

- corpus size = 1200
- eval size = 200
- ood ratio = 0.28

| Model | in-domain MRR@10 | OOD MRR@10 |
|--------|-----------------|------------|
| MiniLM | 0.019 | 0.026 |
| BGE | 0.026 | 0.009 |
| PipeOwl | 0.013 | 0.023 |

Note: This test uses a harder corpus and adversarial-style queries.
Absolute scores are low due to difficulty scaling.
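For reference, MRR@10 as reported above can be computed as follows (a generic sketch; the evaluation harness itself is not part of this repo, and one relevant document per query is assumed):

```python
def mrr_at_10(ranked_ids, gold_ids):
    """Mean reciprocal rank, cut off at rank 10.

    ranked_ids: one ranked list of doc ids per query.
    gold_ids:   the single relevant doc id per query.
    """
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, doc in enumerate(ranking[:10], start=1):
            if doc == gold:
                total += 1.0 / rank   # reciprocal rank of the first hit
                break                 # gold beyond rank 10 contributes 0
    return total / len(gold_ids)

# Two queries: gold at rank 1 and rank 2 -> (1.0 + 0.5) / 2 = 0.75
print(mrr_at_10([[3, 1, 2], [5, 4, 6]], [3, 4]))
```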

See full experimental notes here:
<https://hackmd.io/@galaxy4552/BkpUEnTwbl>

---

## Repository Layout

```
pipeowl/
├─ README.md
├─ LICENSE
├─ engine.py
├─ quickstart.py
└─ data/
    ├─ L1_base_embeddings.npy
    ├─ delta_base_scalar.npy
    └─ L1_base_vocab.json
```