WangKaiLin/PipeOwl

transformer-free


WangKaiLin committed 15 days ago

Commit 61af7c0 · verified · 1 Parent(s): 374ff6e

Update README.md

Files changed (1)

README.md +89 -12


@@ -1,15 +1,92 @@
-## Model Type
-Geometric embedding field (non-neural, transformer-free)
-## Parameter Size
-~165M embedding parameters (static matrix)
-## Intended Use
-- Semantic similarity
-- Lightweight retrieval
-- Geometric experimentation
-## Limitations
-- No contextual modeling
-- No token interaction modeling
-- Domain performance varies

+---
+language: zh
+tags:
+- embeddings
+- retrieval
+- numpy
+- transformer-free
+license: mit
+---
+# PipeOwl-1.0 (Geometric Embedding)
+PipeOwl is a transformer-free geometric embedding package built on a **static embedding field** stored as NumPy arrays.
+This repo provides:
+- `data/L1_base_embeddings.npy`: float32 (V, 1024) embedding table (unit-normalized)
+- `data/L1_base_vocab.json`: list of vocab strings aligned to embedding rows
+- `data/delta_base_scalar.npy`: float32 (V,) optional scalar bias field
+- minimal inference engine (`engine.py`) and usage script (`quickstart.py`)
+---
+## Attribution
+The base embedding vectors were generated using **BGE (Apache-2.0)** via inference (model outputs).
+This repository **does not redistribute any original BGE model weights**.
+---
+## Quickstart
+```bash
+pip install numpy
+python quickstart.py
+```
+Or minimal usage:
+```python
+from engine import PipeOwlEngine, PipeOwlConfig
+engine = PipeOwlEngine(PipeOwlConfig())
+q = engine.encode("雪鴞好可愛")  # "Snowy owls are so cute"
+# use q for similarity / retrieval
+```
+---
+## Files
+- `data/L1_base_embeddings.npy`: embedding table (float32, V×1024)
+- `data/L1_base_vocab.json`: vocab strings aligned with embedding rows
+- `data/delta_base_scalar.npy`: scalar bias (float32, V)
+- `engine.py`: minimal runtime
+- `quickstart.py`: example script
+## Notes
+No safetensors / pytorch_model.bin is included because this model is distributed as a static NumPy embedding field.
+---
+## Stress Test Results (Hard Retrieval Setting)
+- corpus size = 1200
+- eval size = 200
+- OOD ratio = 0.28
+| Model | in-domain MRR@10 | OOD MRR@10 |
+|--------|-----------------|------------|
+| MiniLM | 0.019 | 0.026 |
+| BGE | 0.026 | 0.009 |
+| PipeOwl | 0.013 | 0.023 |
+Note: This test uses a harder corpus and adversarial-style queries.
+Absolute scores are low for all three models because of this added difficulty.
+See full experimental notes here:
+<https://hackmd.io/@galaxy4552/BkpUEnTwbl>
+---
+```text
+pipeowl/
+│
+├─ README.md
+├─ model_card.md
+├─ LICENSE
+│
+├─ engine.py
+├─ quickstart.py
+│
+└─ data/
+    ├─ L1_base_embeddings.npy
+    ├─ delta_base_scalar.npy
+    └─ L1_base_vocab.json
+```
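
The README in this commit describes a unit-normalized embedding table with a row-aligned vocab list, and reports MRR@10. It does not show how `engine.py` scores candidates. The following is a minimal sketch of how such a table could be queried and how MRR@k is conventionally computed. It is not taken from the repository. The function names `top_k` and `mrr_at_k` and the tiny synthetic table are assumptions for illustration. With the real files, `table` would presumably come from `np.load("data/L1_base_embeddings.npy")` and `vocab` from `json.load` on `data/L1_base_vocab.json`.

```python
import numpy as np


def top_k(query_vec, table, vocab, k=10):
    """Return the k (vocab string, score) pairs with the highest cosine similarity.

    Assumes every row of `table` is unit-normalized, as the README states,
    so a dot product with a normalized query equals cosine similarity.
    """
    q = query_vec / np.linalg.norm(query_vec)
    scores = table @ q
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(vocab[i], float(scores[i])) for i in order]


def mrr_at_k(ranked_lists, gold, k=10):
    """Mean reciprocal rank: 1/rank of the first correct item in the top k, else 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, gold):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(gold)


# Synthetic stand-in for the real (V, 1024) table: 4 rows, 8 dimensions.
rng = np.random.default_rng(0)
vocab = ["owl", "snow", "cute", "pipe"]
table = rng.normal(size=(4, 8)).astype(np.float32)
table /= np.linalg.norm(table, axis=1, keepdims=True)

# Querying with a row of the table should rank that row's own entry first.
hits = top_k(table[2], table, vocab, k=2)
print(hits)

# Two toy queries: the first finds its target at rank 2, the second misses.
score = mrr_at_k([["a", "b"], ["c", "d"]], ["b", "x"], k=10)
print(score)
```

A brute-force matrix product like this is usually adequate at the corpus size reported in the stress test (1200 items). Whether the optional scalar bias in `delta_base_scalar.npy` is added to these scores is not stated in the README, so it is left out here.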