Commit 4d4af6d by OpenOneRec · verified · 1 Parent(s): caee226

Upload 1.7B Open

Files changed (2)
  1. README.md +57 -3
  2. model.pt +3 -0
README.md CHANGED
@@ -1,3 +1,57 @@
- ---
- license: apache-2.0
- ---
+ # Residual K-Means Tokenizer
+
+ A residual K-means model for vector quantization. It encodes continuous embeddings into discrete codes through hierarchical clustering.
+
+ ## Files
+
+ - `res_kmeans.py` - Model definition
+ - `train_res_kmeans.py` - Training script
+ - `infer_res_kmeans.py` - Inference script
+
+ ## Installation
+
+ ```bash
+ pip install torch numpy pandas pyarrow faiss tqdm
+ ```
+
+ ## Usage
+
+ ### Training
+
+ ```bash
+ python train_res_kmeans.py \
+     --data_path ./data/embeddings.parquet \
+     --model_path ./checkpoints \
+     --n_layers 3 \
+     --codebook_size 8192 \
+     --dim 4096
+ ```
+
+ **Arguments:**
+ - `--data_path`: Path to parquet file(s) with `embedding` column
+ - `--model_path`: Directory to save the model
+ - `--n_layers`: Number of residual layers (default: 3)
+ - `--codebook_size`: Size of each codebook (default: 8192)
+ - `--dim`: Embedding dimension (default: 4096)
+ - `--seed`: Random seed (default: 42)
+
+ ### Inference
+
+ ```bash
+ python infer_res_kmeans.py \
+     --model_path ./checkpoints/model.pt \
+     --emb_path ./data/embeddings.parquet \
+     --output_path ./output/codes.parquet
+ ```
+
+ **Arguments:**
+ - `--model_path`: Path to trained model checkpoint
+ - `--emb_path`: Path to parquet file with `pid` and `embedding` columns
+ - `--output_path`: Output path (default: `{emb_path}_codes.parquet`)
+ - `--batch_size`: Inference batch size (default: 10000)
+ - `--device`: Device to use (default: cuda if available)
+ - `--n_layers`: Number of layers to use (default: all)
+
+ **Input format:** Parquet with columns `pid`, `embedding`
+
+ **Output format:** Parquet with columns `pid`, `codes`
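The README added above describes encoding embeddings into discrete codes layer by layer, but the diff does not include `res_kmeans.py` itself. As a rough illustration of the general residual K-means idea (not the repository's implementation), the sketch below assigns each vector to its nearest centroid, subtracts that centroid, and repeats on the residual for the next layer. The function names `residual_encode` and `residual_decode` are hypothetical, and the Euclidean nearest-centroid rule is an assumption.

```python
import numpy as np

def residual_encode(x, codebooks):
    """Hypothetical helper: return one code per layer for each row of x.

    x:         (n, dim) array of embeddings
    codebooks: list of (codebook_size, dim) centroid arrays, one per layer
    """
    residual = np.asarray(x, dtype=np.float32).copy()
    codes = []
    for centroids in codebooks:
        # Squared Euclidean distance: ||r||^2 - 2 r.c + ||c||^2
        dist = (
            (residual ** 2).sum(axis=1, keepdims=True)
            - 2.0 * residual @ centroids.T
            + (centroids ** 2).sum(axis=1)
        )
        idx = dist.argmin(axis=1)
        codes.append(idx)
        # The next layer quantizes whatever this layer failed to explain.
        residual = residual - centroids[idx]
    return np.stack(codes, axis=1)  # shape (n, n_layers)

def residual_decode(codes, codebooks):
    """Hypothetical helper: rebuild an approximation by summing chosen centroids."""
    return sum(cb[codes[:, i]] for i, cb in enumerate(codebooks))

# Toy example with two layers and 2-dimensional vectors.
toy_codebooks = [
    np.array([[0.0, 0.0], [10.0, 10.0]], dtype=np.float32),
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], dtype=np.float32),
]
toy_x = np.array([[10.0, 11.0], [1.0, 0.0]], dtype=np.float32)
toy_codes = residual_encode(toy_x, toy_codebooks)
```

In this toy case the first vector takes centroid 1 in layer one and centroid 2 in layer two, so `toy_codes` has one row of layer indices per input, which matches the shape suggested by the `codes` output column.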
model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f5a019612c8c85e27351eb3e19d9663c387faa8f6b7d5cea22c6e76f9a6e134e
+ size 402655183
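As a quick sanity check on the pointer's `size` field, the arithmetic below compares it with the raw size of the codebooks under two assumptions that the commit does not confirm: that the checkpoint was trained with the README's documented defaults (3 layers, codebook size 8192, dimension 4096) and that centroids are stored as 4-byte floats. The helper name `codebook_bytes` is hypothetical.

```python
def codebook_bytes(n_layers=3, codebook_size=8192, dim=4096, bytes_per_value=4):
    """Raw size of all centroid tables, assuming float32 storage (an assumption)."""
    return n_layers * codebook_size * dim * bytes_per_value

pointer_size = 402655183          # `size` from the Git LFS pointer
payload = codebook_bytes()        # bytes needed for the centroids alone
overhead = pointer_size - payload

print(payload)   # 402653184
print(overhead)  # 1999
```

Under those assumptions the file is about 2 KB larger than the bare centroid data, which would be plausible for serialization metadata. This is only a consistency check; other combinations of layers, codebook size, dimension, and dtype could produce a similar total.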