---
license: apache-2.0
tags: [gpt2]
language: ko
---

# KoGPT2-small

| Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
| :---: | :---: | :---: | :---: | :---: | :---: |
| GPT2 | 64 | BPE | 30,000 | 1024 | 108M |

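The parameter size in the table can be sanity-checked from the standard GPT-2 small hyperparameters (12 layers, hidden size 768) together with the 30,000-token vocabulary. A rough back-of-the-envelope sketch, assuming tied input/output embeddings so the LM head adds no extra weights:

```python
# Rough parameter count for a GPT-2 small stack with a 30,000-token vocab.
# Assumes tied input/output embeddings (the usual GPT-2 setup).
n_layer, n_embd, n_ctx, n_vocab = 12, 768, 1024, 30_000

embeddings = n_vocab * n_embd + n_ctx * n_embd  # token + position embeddings
per_block = 12 * n_embd**2 + 13 * n_embd        # attention + MLP weights, plus biases/LayerNorms
final_ln = 2 * n_embd                           # final LayerNorm

total = embeddings + n_layer * per_block + final_ln
print(f"{total / 1e6:.1f}M parameters")  # 108.9M, consistent with the 108M above
```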
# Dataset
- AIHub web-data-based Korean corpus (4.8M)
- KoWiki dump 230701 (1.4M)

# Inference Example

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

text = "운동이 힘들면?"  # "What if exercising feels hard?"

tokenizer = AutoTokenizer.from_pretrained('dataslab/GPT2-small')
model = GPT2LMHeadModel.from_pretrained('dataslab/GPT2-small')

inputs = tokenizer(text, return_tensors='pt', add_special_tokens=False)

outputs = model.generate(inputs['input_ids'],
                         max_length=128,
                         do_sample=True,  # required for temperature to take effect
                         temperature=0.5,
                         repetition_penalty=2.0,
                         pad_token_id=tokenizer.pad_token_id,
                         eos_token_id=tokenizer.eos_token_id,
                         bos_token_id=tokenizer.bos_token_id,
                         use_cache=True)
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Sample output: '운동이 힘들면 운동을 하지 않는 것이 좋다. 하지만 운동 시간을 늦추는 것은 오히려 건강에 좋지 않다. 특히나 장시간의 운동으로 인해 피로가 쌓이고 면역력이 떨어지면, 피로감이 심해져서 잠들기 어려운 경우가 많다. 이런 경우라면 평소보다 더 많은 양으로 과식을 하거나 무리한 다이어트를 할 수 있다. 따라서 식단 조절과 함께 영양 보충에 신경 써야 한다. 또한 과도한 영양이 체중 감량에 도움을 주므로 적절한 운동량을 유지하는 것도 중요하다.'
# (Roughly: "If exercising feels hard, it is better not to exercise. But putting off exercise
#  is actually bad for your health. In particular, when long workouts build up fatigue and
#  weaken the immune system, the exhaustion often makes it hard to fall asleep. In such cases
#  one may overeat or diet excessively, so mind nutrition along with diet control; keeping an
#  appropriate amount of exercise also matters for weight control.")
```
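A note on the generation settings: in `transformers`, `temperature` only influences the result when sampling is enabled (`do_sample=True`); under greedy decoding the rescaling cancels out in the argmax. Conceptually, temperature divides the logits before the softmax, so a value below 1 sharpens the next-token distribution. A minimal plain-Python sketch of that effect (illustrative only, not the `transformers` internals):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, then softmax; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_default = softmax_with_temperature(logits, temperature=1.0)
p_sharp = softmax_with_temperature(logits, temperature=0.5)

# Lower temperature puts more probability mass on the top logit
print(p_default[0], p_sharp[0])
```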