bpe_glove_512

512-dimensional GloVe embeddings trained on English Wikipedia, where each "word" is an OpenAI cl100k_base BPE token id rather than a whitespace-delimited word.
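To make the "token id as word" idea concrete, here is a minimal sketch of how one document could be rendered as a line of space-separated ids, which is the kind of whitespace-delimited input GloVe's tools read. The exact preprocessing used for this model is not documented here, so treat this as an assumption. `to_glove_line` and `toy_encode` are hypothetical names, and `toy_encode` is only a stand-in so the example runs offline.

```python
from typing import Callable, List


def to_glove_line(text: str, encode: Callable[[str], List[int]]) -> str:
    """Render text as space-separated token ids, one 'word' per id."""
    return " ".join(str(token_id) for token_id in encode(text))


def toy_encode(text: str) -> List[int]:
    # Stand-in encoder (one id per character); this is NOT cl100k_base.
    return [ord(ch) for ch in text]


line = to_glove_line("hi!", toy_encode)
print(line)
```

With the real tokenizer, `tiktoken.get_encoding("cl100k_base").encode` can be passed as `encode` in place of `toy_encode`.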

Training

Corpus: jsanzolac/ga_wikipedia (English Wikipedia dump 2023-11-01)
Tokenizer: tiktoken.cl100k_base
Implementation: stanfordnlp/GloVe
Vector size: 512
Min vocab count: 1
Window size: 15
Iterations: 15
x_max: 10
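The `x_max` setting controls how GloVe weights each co-occurrence count: counts below `x_max` are down-weighted, and counts at or above it get full weight. A small sketch of that weighting function follows. The exponent `alpha` is not listed above; 0.75 is GloVe's default and is an assumption about this run.

```python
def glove_weight(x: float, x_max: float = 10.0, alpha: float = 0.75) -> float:
    """GloVe weighting f(x) = (x / x_max) ** alpha for x < x_max, else 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0


# Counts of 10 or more are treated alike; rarer pairs contribute less.
weights = {x: glove_weight(x) for x in (1, 5, 10, 100)}
print(weights)
```

With `x_max` at 10, any pair of token ids that co-occurs at least ten times within the window contributes to the loss at full weight.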

Files

File	Purpose
`vectors.txt`	GloVe text format: `<bpe_id> v1 v2 ... v512`
`vectors.bin`	Binary format (`-binary 2`)
`vocab.txt`	BPE id and its corpus count
`token_id_to_string.json`	Mapping from BPE id → decoded `cl100k_base` string
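As an illustration of working with `vocab.txt`, the sketch below parses lines of the form `<bpe_id> <count>` into a dictionary. It assumes the file follows the two-column, space-separated layout that stanfordnlp/GloVe's vocabulary step writes. `parse_vocab` is a hypothetical helper, and the sample rows are made up for the example.

```python
from typing import Dict, Iterable


def parse_vocab(lines: Iterable[str]) -> Dict[int, int]:
    """Map each BPE token id to its corpus count, skipping malformed rows."""
    counts: Dict[int, int] = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or unexpected row
        token, count = fields
        if token.isdigit():  # keep only numeric token ids
            counts[int(token)] = int(count)
    return counts


# Made-up sample rows, not real counts from this model.
sample = ["279 1000\n", "315 800\n", "\n"]
counts = parse_vocab(sample)
print(counts)
```

For the real file, an open file handle can be passed directly, for example `parse_vocab(open(vocab_path, encoding="utf-8"))`, where `vocab_path` is the local path to `vocab.txt`.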

Quick start

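import numpy as np
import tiktoken
from huggingface_hub import hf_hub_download

# Download the text-format vectors and load the matching tokenizer.
vec_path = hf_hub_download("jsanzolac/bpe_glove_512", "vectors.txt")
enc = tiktoken.get_encoding("cl100k_base")

# Map each BPE token id to its 512-dimensional vector.
embeddings = {}
with open(vec_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if not parts[0].isdigit():
            continue  # skip non-id rows (GloVe can append an <unk> vector)
        embeddings[int(parts[0])] = np.asarray(parts[1:], dtype=np.float32)

def embed(text):
    """Return the mean vector of the text's tokens that have an embedding."""
    ids = enc.encode(text)
    vecs = [embeddings[i] for i in ids if i in embeddings]
    if not vecs:  # empty text or no known tokens: avoid averaging an empty list
        return np.zeros(512, dtype=np.float32)
    return np.mean(vecs, axis=0)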
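Once vectors are loaded into a dictionary keyed by token id, a common next step is a nearest-neighbour lookup by cosine similarity. The sketch below is one way to do that. `nearest` is a hypothetical helper, and the two-dimensional `toy` vectors are invented so the example runs without downloading anything.

```python
import numpy as np


def nearest(query, embeddings, k=3):
    """Return the k (token_id, cosine_similarity) pairs closest to query."""
    ids = list(embeddings)
    matrix = np.stack([embeddings[i] for i in ids])
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = matrix @ query / np.maximum(norms, 1e-12)
    order = np.argsort(-sims)[:k]
    return [(ids[j], float(sims[j])) for j in order]


# Toy vectors for illustration only, not real model output.
toy = {
    1: np.array([1.0, 0.0]),
    2: np.array([0.9, 0.1]),
    3: np.array([0.0, 1.0]),
}
result = nearest(toy[1], toy, k=2)
print(result)
```

With the real `embeddings` dictionary, the returned ids can be turned back into text through `token_id_to_string.json` or `enc.decode([token_id])`.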
