WONBKIM committed on
Commit 4e59e66 · verified · 1 Parent(s): aa89e3f

Update README.md

Files changed (1)
  1. README.md +40 -40

README.md CHANGED
@@ -1,41 +1,41 @@
- ---
- license: apache-2.0
- tags: [gpt2]
- language: ko
- ---
-
- # KoGPT2-small
-
- | Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
- | :---: | :---: | :---: | :---: | :---: | :---: |
- | GPT2 | 64 | BPE | 30,000 | 1024 | 108M |
-
- # Dataset
- - AIHub Korean corpus built from web data (4.8M)
- - KoWiki dump 230701 (1.4M)
-
- # Inference Example
-
- ```python
- from transformers import AutoTokenizer, GPT2LMHeadModel
-
- text = "출근이 힘들면"
-
- tokenizer = AutoTokenizer.from_pretrained('dataslab/GPT2-small')
- model = GPT2LMHeadModel.from_pretrained('dataslab/GPT2-small')
-
- inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=False)
-
- outputs = model.generate(inputs['input_ids'], max_length=128,
-                          repetition_penalty=2.0,
-                          pad_token_id=tokenizer.pad_token_id,
-                          eos_token_id=tokenizer.eos_token_id,
-                          bos_token_id=tokenizer.bos_token_id,
-                          use_cache=True,
-                          temperature=0.5)
- outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
-
- # Output: '출근이 힘들면 출근을 하지 않는 것이 좋다. 하지만 출퇴근 시간을 늦추는 것은 오히려 건강에 좋지 않다.. 특히나 장시간의 업무로 인해 피로가 쌓이고 면역력이 떨어지면, 피로감이 심해져서 잠들기 어려운 경우가 많다. 이런 경우라면 평소보다 더 많은 양으로 과식을 하거나 무리한 다이어트를 할 수 있다. 따라서 식단 조절과 함께 영양 보충에 신경 써야 한다. 또한 과도한 음식이 체중 감량에 도움을 주므로 적절한 운동량을 유지하는 것도 중요하다.'
- # (Roughly: "If commuting is hard, it is better not to commute. But pushing back your
- #  commute is actually bad for your health..." followed by a fluent but loosely coherent
- #  continuation about fatigue, overeating, and dieting.)
 ```
 
+ ---
+ license: apache-2.0
+ tags: [gpt2]
+ language: ko
+ ---
+
+ # KoGPT2-small
+
+ | Model | Batch Size | Tokenizer | Vocab Size | Max Length | Parameter Size |
+ | :---: | :---: | :---: | :---: | :---: | :---: |
+ | GPT2 | 64 | BPE | 30,000 | 1024 | 108M |
+
+ # Dataset
+ - AIHub Korean corpus built from web data (4.8M)
+ - KoWiki dump 230701 (1.4M)
+
+ # Inference Example
+
+ ```python
+ from transformers import AutoTokenizer, GPT2LMHeadModel
+
+ text = "운동이 힘들면?"
+
+ tokenizer = AutoTokenizer.from_pretrained('dataslab/GPT2-small')
+ model = GPT2LMHeadModel.from_pretrained('dataslab/GPT2-small')
+
+ inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=False)
+
+ # Note: do_sample=True is required for temperature to take effect;
+ # without it, generate() falls back to greedy decoding and silently
+ # ignores the temperature argument.
+ outputs = model.generate(inputs['input_ids'], max_length=128,
+                          do_sample=True,
+                          temperature=0.5,
+                          repetition_penalty=2.0,
+                          pad_token_id=tokenizer.pad_token_id,
+                          eos_token_id=tokenizer.eos_token_id,
+                          bos_token_id=tokenizer.bos_token_id,
+                          use_cache=True)
+ outputs = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ # Output: '운동이 힘들면 운동을 하지 않는 것이 좋다. 하지만 운동 시간을 늦추는 것은 오히려 건강에 좋지 않다.. 특히나 장시간의 운동으로 인해 피로가 쌓이고 면역력이 떨어지면, 피로감이 심해져서 잠들기 어려운 경우가 많다. 이런 경우라면 평소보다 더 많은 양으로 과식을 하거나 무리한 다이어트를 할 수 있다. 따라서 식단 조절과 함께 영양 보충에 신경 써야 한다. 또한 과도한 음식이 체중 감량에 도움을 주므로 적절한 운동량을 유지하는 것도 중요하다.'
+ # (Roughly: "If exercise is hard, it is better not to exercise. But delaying exercise
+ #  is actually bad for your health..." followed by a fluent but loosely coherent
+ #  continuation about fatigue, overeating, and dieting.)
 ```
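
As a sanity check on the 108M figure in the spec table, the parameter count can be reproduced by hand. This is a sketch that assumes the stock GPT-2 small architecture as implemented by Hugging Face's `GPT2LMHeadModel` (12 layers, 12 heads, hidden size 768, biases on all linear layers, and an LM head tied to the token embedding), combined with the 30,000-token vocabulary and 1024-position context stated above:

```python
# Back-of-the-envelope parameter count for a GPT-2 small model with a
# 30,000-token vocabulary, assuming the standard Hugging Face layout.
d_model, n_layer, n_ctx, vocab = 768, 12, 1024, 30_000

# Token and position embedding tables.
embeddings = vocab * d_model + n_ctx * d_model

# One transformer block: two LayerNorms, fused QKV projection,
# attention output projection, and the 4x MLP (all with biases).
per_layer = (
    2 * 2 * d_model                          # ln_1 and ln_2 (weight + bias each)
    + (d_model * 3 * d_model + 3 * d_model)  # c_attn: fused Q/K/V projection
    + (d_model * d_model + d_model)          # c_proj: attention output
    + (d_model * 4 * d_model + 4 * d_model)  # mlp.c_fc: up-projection
    + (4 * d_model * d_model + d_model)      # mlp.c_proj: down-projection
)

final_ln = 2 * d_model                       # ln_f after the last block

# The LM head shares weights with the token embedding, so it adds nothing.
total = embeddings + n_layer * per_layer + final_ln
print(f"{total:,}")  # 108,882,432, i.e. ~108M as stated in the table
```

With the usual 50,257-token GPT-2 vocabulary this would come to roughly 124M; the smaller 30,000-token BPE vocabulary is what brings the model down to the 108M reported here.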