Dataset

Can you just reveal the dataset size for now?

We used approximately 3T tokens. The detailed number and its construction will be described in the technical report.

Thanks, that sounds interesting. I guess you might have used generalising data too.

deleted

Nov 5, 2023

Can you share your email or Discord? I'd like to have a talk with you.

FancyZhao

Nov 5, 2023

Sure, you can reach us by email at yi@01.ai.

markding

Feb 15, 2024

Any update on the datasets? We're keeping track of LLM openness at https://opening-up-chatgpt.github.io and Yi 34B Chat is currently in the bottom 5 (out of >30 'open' instruction-tuned models) by degree of openness because so little of the source code, training data, instruction tuning, etc. is shared or documented.

ehartford changed discussion status to closed Mar 10, 2024
