Update vocab size

#42

by mathemakitten - opened Nov 28, 2022

base: refs/heads/main ← from: refs/pr/42

mathemakitten

Nov 28, 2022

Per https://huggingface.co/bigscience/bloom-560m/blob/main/config.json, vocab size is 250880 not 250680.

Update vocab size (5dca9466)

mathemakitten

Nov 28, 2022

I can't figure out how to update my PR in this interface, but perhaps there should be a note somewhere indicating that the vocab is padded by 200 and the actual vocab size is 250,680. The model's config.json says the vocab size is 250,880 but the card says 250,680. This is confusing to newcomers to BLOOM, because 256,901,120 embedding parameters / 1024 embedding dim = 250,880, not 250,680.
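For anyone who wants to check the arithmetic in the comment above, here is a minimal sketch. The two input numbers come straight from the comment; the variable names are only illustrative.

```python
# Figures quoted in the comment above (not read from the checkpoint itself).
EMBEDDING_PARAMS = 256_901_120  # total parameters in the embedding matrix
EMBEDDING_DIM = 1024            # embedding dimension

# Rows in the embedding matrix = parameters / dimension.
# divmod also confirms that the division is exact.
embedding_rows, remainder = divmod(EMBEDDING_PARAMS, EMBEDDING_DIM)

print(f"embedding rows: {embedding_rows:,} (remainder {remainder})")
```

The row count matches the `vocab_size` in config.json rather than the figure on the model card.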

mathemakitten changed pull request status to closed Nov 28, 2022

TimeRobber

BigScience Workshop org Nov 29, 2022

Okay, so 250,880 is the number of rows in the embedding matrix. However, the tokenizer only generates 250,680 different tokens. I think config.json sets the value to 250,880 because the embedding matrix has that number of rows.
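A rough sketch of how the two numbers could be compared is below. The helper names `padding_rows` and `inspect_checkpoint` are hypothetical, and it is an assumption that `len(tokenizer)` and `config.vocab_size` from the `transformers` library report the tokenizer size and the embedding row count for this checkpoint. `inspect_checkpoint` downloads files from the Hub, so it is defined but not called here.

```python
def padding_rows(tokenizer_size: int, embedding_rows: int) -> int:
    """Return how many embedding rows no token id ever points at."""
    if embedding_rows < tokenizer_size:
        # Every token id must be a valid row index in the embedding matrix.
        raise ValueError("embedding matrix is smaller than the tokenizer vocab")
    return embedding_rows - tokenizer_size


def inspect_checkpoint(name: str = "bigscience/bloom-560m") -> tuple[int, int]:
    """Fetch (tokenizer size, config vocab_size) from the Hub; needs network access."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoConfig, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    config = AutoConfig.from_pretrained(name)
    return len(tokenizer), config.vocab_size


# With the figures quoted in this thread:
print(padding_rows(250_680, 250_880))
```

Calling `padding_rows(*inspect_checkpoint())` would run the same comparison against the published files instead of the numbers quoted here.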

christopher

BigScience Workshop org Nov 29, 2022

@mathemakitten You can click on the PR label/button thingy and it will take you to that branch so you can update the PR in the GUI

julien-c

BigScience Workshop org Dec 3, 2022

(or using git command line if that is an option @mathemakitten )
