---
language: zh
tags:
- embeddings
- retrieval
- numpy
- transformer-free
license: mit
---
# PipeOwl-1.0 (Geometric Embedding)
PipeOwl is a transformer-free geometric embedding package built on a **static embedding field** stored as NumPy arrays.
This repo provides:
- `L1_base_embeddings.npy`: float32 (V, 1024) embedding table (unit-normalized)
- `L1_base_vocab.json`: list of vocab strings aligned to embedding rows
- `delta_base_scalar.npy`: float32 (V,) optional scalar bias field
- minimal inference engine (`engine.py`) and usage script (`quickstart.py`)
---
## Attribution
The base embedding vectors were generated using **BGE (Apache-2.0)** via inference (model outputs).
This repository **does not redistribute any original BGE model weights**.
---
## Quickstart
```bash
pip install numpy
python quickstart.py
```
Or use the engine directly in Python:
```python
from engine import PipeOwlEngine, PipeOwlConfig
engine = PipeOwlEngine(PipeOwlConfig())
q = engine.encode("雪鴞好可愛")
# use q for similarity / retrieval
```
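Since the embedding table is unit-normalized, similarity search reduces to a dot product; assuming `encode` also returns unit-normalized vectors, a minimal retrieval sketch in plain NumPy:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Return indices and scores of the k most similar rows.

    Assumes query_vec (D,) and doc_matrix (N, D) are unit-normalized,
    so the dot product equals cosine similarity.
    """
    scores = doc_matrix @ query_vec   # (N,) cosine similarities
    idx = np.argsort(-scores)[:k]     # highest score first
    return idx, scores[idx]
```

Here `doc_matrix` would be a stack of document vectors produced by `engine.encode`; the function itself is a generic sketch, not part of `engine.py`.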
## Files
- `data/L1_base_embeddings.npy`: embedding table (float32, V×1024)
- `data/L1_base_vocab.json`: vocab list aligned with embedding rows
- `data/delta_base_scalar.npy`: scalar bias (float32, V)
- `engine.py`: minimal runtime
- `quickstart.py`: example script
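The data files can also be loaded without the engine; a minimal sketch assuming the file layout listed above:

```python
import json
import numpy as np

def load_field(data_dir: str = "data"):
    """Load the static embedding field; rows align with the vocab list."""
    emb = np.load(f"{data_dir}/L1_base_embeddings.npy")    # (V, 1024) float32
    with open(f"{data_dir}/L1_base_vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)                               # list of V strings
    bias = np.load(f"{data_dir}/delta_base_scalar.npy")    # (V,) float32
    assert emb.shape[0] == len(vocab) == bias.shape[0]     # rows align with vocab
    return emb, vocab, bias
```

Row `i` of the embedding table corresponds to `vocab[i]`, so a vocab lookup gives the row index for a token's vector.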
## Notes
No `safetensors` or `pytorch_model.bin` file is included, because this model is distributed as a static NumPy embedding field.
---
## Parameter Size
~165M embedding parameters (static matrix)
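The figure follows from the table shape, V rows × 1024 dimensions, plus the V-entry scalar field. A back-of-envelope check, assuming a vocab size of roughly 161k (not stated in this README):

```python
V = 161_000                    # approximate vocab size (assumption)
dim = 1024
params = V * dim + V           # embedding table + scalar bias field
print(f"{params / 1e6:.0f}M")  # prints 165M
```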
## Intended Use
- Semantic similarity
- Lightweight retrieval
- Geometric experimentation
## Limitations
- No contextual modeling
- No token interaction modeling
- Domain performance varies
---
## Stress Test Results (Hard Retrieval Setting)
- corpus size = 1200
- eval size = 200
- OOD ratio = 0.28
| Model | in-domain MRR@10 | OOD MRR@10 |
|--------|-----------------|------------|
| MiniLM | 0.019 | 0.026 |
| BGE | 0.026 | 0.009 |
| PipeOwl | 0.013 | 0.023 |
Note: This test uses a harder corpus and adversarial-style queries.
Absolute scores are low due to difficulty scaling.
See full experimental notes here:
<https://hackmd.io/@galaxy4552/BkpUEnTwbl>
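For reference on the metric: MRR@10 credits each query with the reciprocal rank of its first relevant document within the top 10 results, and zero if it does not appear. A minimal sketch, assuming one gold document per query:

```python
def mrr_at_10(ranked_ids, gold_ids):
    """Mean Reciprocal Rank truncated at rank 10.

    ranked_ids: one ranked list of candidate ids per query.
    gold_ids:   the single relevant id per query.
    """
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id == gold:
                total += 1.0 / rank
                break
    return total / len(gold_ids)
```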
---
## Repository Structure
```text
pipeowl/
├─ README.md
├─ LICENSE
├─ engine.py
├─ quickstart.py
└─ data/
├─ L1_base_embeddings.npy
├─ delta_base_scalar.npy
└─ L1_base_vocab.json
```