WangKaiLin committed · Commit 61af7c0 · verified · 1 Parent(s): 374ff6e

Update README.md

Files changed (1): README.md +89 -12
README.md CHANGED
@@ -1,15 +1,92 @@
- ## Model Type
- Geometric embedding field (non-neural, transformer-free)
-
- ## Parameter Size
- ~165M embedding parameters (static matrix)
-
- ## Intended Use
- - Semantic similarity
- - Lightweight retrieval
- - Geometric experimentation
-
- ## Limitations
- - No contextual modeling
- - No token interaction modeling
- - Domain performance varies
+ ---
+ language: zh
+ tags:
+ - embeddings
+ - retrieval
+ - numpy
+ - transformer-free
+ license: mit
+ ---
+
+ # PipeOwl-1.0 (Geometric Embedding)
 
+
+ PipeOwl is a transformer-free geometric embedding package built on a **static embedding field** stored as NumPy arrays.
+
+ This repo provides:
+ - `L1_base_embeddings.npy`: float32 (V, 1024) embedding table (unit-normalized)
+ - `L1_base_vocab.json`: list of vocab strings aligned to embedding rows
+ - `delta_base_scalar.npy`: float32 (V,) optional scalar bias field
+ - a minimal inference engine (`engine.py`) and usage script (`quickstart.py`)
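Since the embedding rows are unit-normalized, cosine similarity reduces to a plain dot product. A self-contained sketch of that property with a synthetic table (real usage would `np.load` the files listed above; shapes and dtypes follow that list):

```python
import numpy as np

# Synthetic stand-in for L1_base_embeddings.npy; real usage:
#   emb = np.load("data/L1_base_embeddings.npy")
rng = np.random.default_rng(0)
V, D = 100, 1024
emb = rng.standard_normal((V, D)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

# With unit rows, the full cosine-similarity matrix is a single matmul.
sim = emb @ emb.T  # shape (V, V); diagonal ≈ 1.0
```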
+
+ ---
+
+ ## Attribution
+ The base embedding vectors were generated using **BGE (Apache-2.0)** via inference (model outputs).
+ This repository **does not redistribute any original BGE model weights**.
+
+ ---
+
+ ## Quickstart
+
+ ```bash
+ pip install numpy
+ python quickstart.py
+ ```
+
+ Or minimal usage:
+
+ ```python
+ from engine import PipeOwlEngine, PipeOwlConfig
+
+ engine = PipeOwlEngine(PipeOwlConfig())
+ q = engine.encode("雪鴞好可愛")  # "snowy owls are so cute"
+ # use q for similarity / retrieval
+ ```
+
+ ---
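The encoded vector `q` can then be scored against a corpus matrix. A hypothetical sketch with random unit vectors standing in for `engine.encode` outputs:

```python
import numpy as np

# Random unit vectors standing in for encoded corpus entries (hypothetical;
# in practice each row would come from engine.encode).
rng = np.random.default_rng(1)
corpus = rng.standard_normal((1200, 1024)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query vector close to corpus entry 42.
q = corpus[42] + 0.01 * rng.standard_normal(1024)
q /= np.linalg.norm(q)

scores = corpus @ q                    # cosine scores (all vectors are unit-norm)
top10 = np.argsort(scores)[::-1][:10]  # indices of the 10 best matches
```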
+
+ ## Files
+
+ - `data/L1_base_embeddings.npy`: embedding table (float32, V×1024)
+ - `data/L1_base_vocab.json`: vocab aligned with rows
+ - `data/delta_base_scalar.npy`: scalar bias (float32, V)
+ - `engine.py`: minimal runtime
+ - `quickstart.py`: example script
+
+ ## Notes
+
+ No `safetensors` / `pytorch_model.bin` is included because this model is distributed as a static NumPy embedding field.
+
+ ---
+
+ ## Stress Test Results (Hard Retrieval Setting)
+
+ - corpus size = 1200
+ - eval size = 200
+ - OOD ratio = 0.28
+
+ | Model   | In-domain MRR@10 | OOD MRR@10 |
+ |---------|------------------|------------|
+ | MiniLM  | 0.019            | 0.026      |
+ | BGE     | 0.026            | 0.009      |
+ | PipeOwl | 0.013            | 0.023      |
+
+ Note: this test uses a harder corpus and adversarial-style queries; absolute scores are low due to difficulty scaling.
+
+ See full experimental notes here:
+ <https://hackmd.io/@galaxy4552/BkpUEnTwbl>
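For reference, MRR@10 (the metric in the table above) averages the reciprocal rank of each query's gold document, counting ranks beyond 10 as zero. A minimal sketch of the metric itself, not the repo's evaluation code:

```python
def mrr_at_10(ranks):
    """ranks: 1-based rank of the gold document per query; None if not retrieved."""
    total = sum(1.0 / r for r in ranks if r is not None and r <= 10)
    return total / len(ranks)

print(mrr_at_10([1, 3, None, 12]))  # (1 + 1/3 + 0 + 0) / 4 ≈ 0.333
```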
75
+
76
+ ---
77
+
78
+ ```bash
79
+ pipeowl/
80
+
81
+ ├─ README.md
82
+ ├─ model_card.md
83
+ ├─ LICENSE
84
+
85
+ ├─ engine.py
86
+ ├─ quickstart.py
87
+
88
+ └─ data/
89
+ ├─ L1_base_embeddings.npy
90
+ ├─ delta_base_scalar.npy
91
+ └─ L1_base_vocab.json
92
+ ```