Paper: Deduplicating Training Data Makes Language Models Better (arXiv:2107.06499)
This model was developed through an LLM research consortium between (주)미디어그룹사람과숲 and (주)마커.
The license is cc-by-nc-sa.
Model Developers Seungyoo Lee (DopeorNope)
Input Models input text only.
Output Models generate text only.
Model Architecture
pub-llama-13b-v6 is an auto-regressive language model based on the LLaMA 2 transformer architecture.
Training Dataset
The DopeorNope/OpenOrca-near-dedup-v1 dataset was created with a near-deduplication (NearDup) algorithm to reduce similarity among training examples. We will release it publicly soon.
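For illustration, here is a minimal sketch of MinHash-based near-deduplication in the spirit of the paper referenced above. The shingle size, hash count, and similarity threshold are illustrative assumptions, not the settings used to build the dataset, and real pipelines replace the all-pairs comparison with LSH bucketing.

```python
import hashlib

def shingles(text, n=5):
    """Character n-gram shingles of a document."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the
    minimum hash value over all shingles of the document."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the
    Jaccard similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def near_dedup(docs, threshold=0.8):
    """Greedily keep a document only if its estimated similarity to
    every previously kept document is below the threshold.
    (O(n^2) toy version; production systems bucket signatures with
    locality-sensitive hashing instead of comparing all pairs.)"""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept
```

Near-duplicates (e.g. the same sentence with trailing punctuation added) share almost all shingles, so their signatures agree in most slots and the second copy is dropped, while genuinely different documents pass through.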