OpenOneRec
/

OneRec-tokenizer

Model card Files Files and versions

OpenOneRec commited on Jan 5

Commit

4d4af6d

·

verified ·

1 Parent(s): caee226

Upload 1.7B Open

Files changed (2) hide show

README.md +57 -3
model.pt +3 -0

README.md CHANGED Viewed

@@ -1,3 +1,57 @@
----
-license: apache-2.0
----

+# Residual K-Means Tokenizer
+A residual K-means model for vector quantization. It encodes continuous embeddings into discrete codes through hierarchical clustering.
+## Files
+- `res_kmeans.py` - Model definition
+- `train_res_kmeans.py` - Training script
+- `infer_res_kmeans.py` - Inference script
+## Installation
+```bash
+pip install torch numpy pandas pyarrow faiss tqdm
+```
+## Usage
+### Training
+```bash
+python train_res_kmeans.py \
+    --data_path ./data/embeddings.parquet \
+    --model_path ./checkpoints \
+    --n_layers 3 \
+    --codebook_size 8192 \
+    --dim 4096
+```
+**Arguments:**
+- `--data_path`: Path to parquet file(s) with `embedding` column
+- `--model_path`: Directory to save the model
+- `--n_layers`: Number of residual layers (default: 3)
+- `--codebook_size`: Size of each codebook (default: 8192)
+- `--dim`: Embedding dimension (default: 4096)
+- `--seed`: Random seed (default: 42)
+### Inference
+```bash
+python infer_res_kmeans.py \
+    --model_path ./checkpoints/model.pt \
+    --emb_path ./data/embeddings.parquet \
+    --output_path ./output/codes.parquet
+```
+**Arguments:**
+- `--model_path`: Path to trained model checkpoint
+- `--emb_path`: Path to parquet file with `pid` and `embedding` columns
+- `--output_path`: Output path (default: `{emb_path}_codes.parquet`)
+- `--batch_size`: Inference batch size (default: 10000)
+- `--device`: Device to use (default: cuda if available)
+- `--n_layers`: Number of layers to use (default: all)
+**Input format:** Parquet with columns `pid`, `embedding`
+**Output format:** Parquet with columns `pid`, `codes`

model.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5a019612c8c85e27351eb3e19d9663c387faa8f6b7d5cea22c6e76f9a6e134e
+size 402655183