bpe_glove_512

GloVe embeddings trained on English Wikipedia in which each "word" is an OpenAI cl100k_base BPE token id, so vectors are keyed by token id rather than by surface word.
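
To make the "word = token id" idea concrete, the sketch below turns raw text into the space-separated line of ids that a GloVe-style trainer would read as words. The helper name `to_glove_line` and the injected `encode` callable are illustrative assumptions, not the preprocessing script actually used for this model; in practice `encode` would likely be `tiktoken.get_encoding("cl100k_base").encode`.

```python
from typing import Callable, List


def to_glove_line(text: str, encode: Callable[[str], List[int]]) -> str:
    """Encode text to BPE ids and join them with spaces (one "word" per id)."""
    return " ".join(str(token_id) for token_id in encode(text))


# Possible usage, assuming tiktoken is installed:
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   line = to_glove_line("Hello world", enc.encode)
```

Passing the encoder in as an argument keeps the helper independent of any particular tokenizer, which also makes it easy to try with a stand-in function.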

Training

  • Corpus: jsanzolac/ga_wikipedia (English Wikipedia dump 2023-11-01)
  • Tokenizer: tiktoken.cl100k_base
  • Implementation: stanfordnlp/GloVe
  • Vector size: 512
  • Min vocab count: 1
  • Window size: 15
  • Iterations: 15
  • x_max: 10
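
For context on the last setting: x_max is the cutoff in GloVe's co-occurrence weighting function, f(x) = (x / x_max)^alpha for x < x_max and 1 otherwise. The sketch below evaluates it with x_max = 10 from the list above. The exponent alpha = 0.75 is GloVe's usual default and is an assumption here, since this card does not state it; the function name `glove_weight` is illustrative.

```python
def glove_weight(x: float, x_max: float = 10.0, alpha: float = 0.75) -> float:
    """GloVe weighting: down-weights rare co-occurrences, caps frequent ones at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


# Co-occurrence counts at or above x_max all receive the full weight of 1.0,
# so a small x_max means most observed pairs are weighted equally.
for count in (1, 5, 10, 100):
    print(count, glove_weight(count))
```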

Files

File                      Purpose
vectors.txt               GloVe text format: <bpe_id> v1 v2 ... v512
vectors.bin               Binary format (-binary 2)
vocab.txt                 BPE id and its corpus count
token_id_to_string.json   Mapping from BPE id → decoded cl100k_base string
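
The two auxiliary files can be read with a few lines of Python. This is a minimal sketch under two assumptions that the card does not confirm: that vocab.txt holds one `<bpe_id> <count>` pair per line separated by whitespace, and that token_id_to_string.json is a single JSON object keyed by the id as a string. The function names are hypothetical and the sample values in the usage comment are invented.

```python
import json
from typing import Dict, Iterable


def parse_vocab(lines: Iterable[str]) -> Dict[int, int]:
    """Parse '<bpe_id> <count>' lines into {bpe_id: count} (assumed layout)."""
    counts: Dict[int, int] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        token_id, count = fields
        counts[int(token_id)] = int(count)
    return counts


def parse_id_to_string(json_text: str) -> Dict[int, str]:
    """Parse a JSON object of {"<bpe_id>": "<string>"} into {bpe_id: string}."""
    return {int(key): value for key, value in json.loads(json_text).items()}


# Possible usage with downloaded files (paths are placeholders):
#   with open("vocab.txt", encoding="utf-8") as f:
#       counts = parse_vocab(f)
#   with open("token_id_to_string.json", encoding="utf-8") as f:
#       id_to_string = parse_id_to_string(f.read())
```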

Quick start

import numpy as np, tiktoken
from huggingface_hub import hf_hub_download

vec_path = hf_hub_download("jsanzolac/bpe_glove_512", "vectors.txt")
enc = tiktoken.get_encoding("cl100k_base")

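# Map each BPE token id to its 512-dimensional vector.
embeddings = {}
with open(vec_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if not parts[0].isdigit():
            continue  # skip non-id rows, e.g. an <unk> row that GloVe may append
        embeddings[int(parts[0])] = np.asarray(parts[1:], dtype=np.float32)

def embed(text):
    """Return the mean vector of the text's BPE tokens (zeros if none are known)."""
    ids = enc.encode(text)
    vecs = [embeddings[i] for i in ids if i in embeddings]
    if not vecs:
        return np.zeros(512, dtype=np.float32)
    return np.mean(vecs, axis=0)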
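
One common way to use such vectors is cosine similarity between tokens. The sketch below is a dependency-free illustration, not part of the released code: `cosine` and `nearest` are hypothetical helper names, and `vectors` stands for any mapping from token id to a sequence of floats, such as the `embeddings` dict built in the Quick start.

```python
import math
from typing import List, Mapping, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_id: int, vectors: Mapping[int, Sequence[float]],
            k: int = 5) -> List[Tuple[int, float]]:
    """Return the k token ids most similar to query_id, best first."""
    query = vectors[query_id]
    scored = [(cosine(query, vec), token_id)
              for token_id, vec in vectors.items() if token_id != query_id]
    scored.sort(reverse=True)
    return [(token_id, score) for score, token_id in scored[:k]]
```

The pure-Python loop is slow over a full vocabulary; stacking the vectors into one NumPy matrix and using a single matrix product would be the faster choice for real use.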