512-dimensional GloVe word embeddings trained on English Wikipedia, where each "word" is an
OpenAI cl100k_base BPE token id.
- Corpus: jsanzolac/ga_wikipedia (English Wikipedia dump 2023-11-01)
- Tokenizer: tiktoken.cl100k_base

| File | Purpose |
|---|---|
| `vectors.txt` | GloVe text format: `<bpe_id> v1 v2 ... v512` |
| `vectors.bin` | Binary format (`-binary 2`) |
| `vocab.txt` | BPE id and its corpus count |
| `token_id_to_string.json` | Mapping from BPE id → decoded cl100k_base string |
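
The two lookup files can be read with the standard library alone. The sketch below assumes that `vocab.txt` uses GloVe's usual one-entry-per-line `<bpe_id> <count>` layout and that `token_id_to_string.json` is a single JSON object keyed by the id written as a string (JSON object keys are always strings). Neither layout is confirmed here, and `load_vocab_counts` and `load_id_to_string` are hypothetical helper names.

```python
import json


def load_vocab_counts(path):
    """Read `<bpe_id> <count>` lines into a {bpe_id: count} dict (assumed layout)."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2 or not parts[0].isdigit():
                continue  # ignore blank lines and rows that do not start with an id
            counts[int(parts[0])] = int(parts[1])
    return counts


def load_id_to_string(path):
    """Read the JSON mapping and convert its string keys back to integer ids."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    return {int(key): value for key, value in raw.items()}
```

Converting the keys to `int` keeps them compatible with the ids returned by `tiktoken`'s `encode`.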
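import numpy as np
import tiktoken
from huggingface_hub import hf_hub_download

vec_path = hf_hub_download("jsanzolac/bpe_glove_512", "vectors.txt")
enc = tiktoken.get_encoding("cl100k_base")

# Load the vectors into a dict keyed by integer BPE id.
embeddings = {}
with open(vec_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if not parts[0].isdigit():
            continue  # skip any non-id row, e.g. an <unk> row that GloVe can append
        embeddings[int(parts[0])] = np.asarray(parts[1:], dtype=np.float32)

def embed(text):
    """Average the vectors of the text's BPE tokens; tokens without a vector are skipped."""
    vecs = [embeddings[i] for i in enc.encode(text) if i in embeddings]
    if not vecs:
        return np.zeros(512, dtype=np.float32)  # avoid NaN when no token has a vector
    return np.mean(vecs, axis=0)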
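
A common way to sanity-check embeddings like these is a nearest-neighbour lookup by cosine similarity. The following is a minimal sketch, not part of the released code: `nearest_tokens` is a hypothetical helper that takes any `{bpe_id: vector}` dict shaped like `embeddings` above and does a brute-force search, which should be adequate for a vocabulary of this size but is not optimized.

```python
import numpy as np


def nearest_tokens(embeddings, query, k=5):
    """Return the k (bpe_id, cosine_similarity) pairs most similar to `query`."""
    ids = list(embeddings.keys())
    matrix = np.stack([embeddings[i] for i in ids]).astype(np.float32)
    # Normalise the rows and the query so a dot product equals cosine similarity.
    row_norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    matrix = matrix / np.maximum(row_norms, 1e-12)
    q = np.asarray(query, dtype=np.float32)
    q = q / max(float(np.linalg.norm(q)), 1e-12)
    sims = matrix @ q
    top = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return [(ids[j], float(sims[j])) for j in top]
```

With the objects defined above, `nearest_tokens(embeddings, embed(" Paris"))` returns candidate ids, each of which can be turned back into text with `enc.decode([token_id])`.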