Commit fc41744 by Jisu80609 · 1 Parent(s): 803f1c4

Initial commit for the custom_summarization_dataset

README.md ADDED
@@ -0,0 +1,37 @@
+
+ # Dataset Card for Custom Text Dataset
+
+ ## Dataset Name
+ Custom CNN/DailyMail Extractive Summarization Dataset
+
+ ## Overview
+ This dataset consists of sentences extracted from CNN/DailyMail news articles, each paired with a summary of that sentence.
+ It contains small custom subsets for training and testing.
+
+ ## Composition
+ - Training data: a sample consisting of one sentence and its summary.
+ - Test data: 100 samples extracted from the original test set of the CNN/DailyMail dataset.
+
+ ## Collection Process
+ The training data was created manually; the test data was extracted from the test set of the `cnn_dailymail` dataset.
+
+ ## Preprocessing
+ The data was preprocessed with the Hugging Face `datasets` library. The training and test datasets were saved in a format that Hugging Face tooling can load.
+
+ ## How to Use
+ ```python
+ from datasets import load_from_disk
+
+ train_dataset = load_from_disk('./results/custom_dataset/train')
+ test_dataset = load_from_disk('./results/custom_dataset/test')
+ ```
+
+ ## Evaluation
+ Models trained or tested on this dataset can be evaluated with traditional summarization metrics such as ROUGE.
+
+ ## Limitations
+ The training dataset is very small, so models trained on it may generalize poorly. The test data comes from an external source and may carry biases present in the original dataset.
+
+ ## Ethical Considerations
+ This dataset contains content related to sensitive political topics. Users should watch for misunderstandings or biases that may arise in the summaries.
+
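The Evaluation section of the README names ROUGE as a suitable metric. To give a feel for what it measures, here is a simplified ROUGE-1 F1 in pure Python. It is a sketch under stated assumptions (whitespace tokenization, lowercasing, no stemming), not a substitute for an established ROUGE implementation, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between reference and candidate.

    Assumptions (not from the README): lowercase, split on whitespace,
    no stemming or stopword handling.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each unigram counts at most as often as it
    # appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from the dataset.
    print(rouge1_f1("the cat sat on the mat", "the cat sat"))
```

In this sketch, the `labels` column would play the role of the reference and a model's output the role of the candidate.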
test/dataset_dict.json ADDED
@@ -0,0 +1 @@
+ {"splits": ["test"]}
test/test/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1e6aa13a3e10a33624931f6c220c9618528323886bd7b7ac334af681b8dc0646
+ size 346576
test/test/dataset_info.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "citation": "",
+   "description": "",
+   "features": {
+     "sentence": {
+       "feature": {
+         "dtype": "string",
+         "_type": "Value"
+       },
+       "_type": "Sequence"
+     },
+     "labels": {
+       "feature": {
+         "dtype": "string",
+         "_type": "Value"
+       },
+       "_type": "Sequence"
+     }
+   },
+   "homepage": "",
+   "license": ""
+ }
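The `features` block in this file declares both columns as sequences of strings, so each test row holds lists rather than single strings. The following minimal sketch, using only the standard library, summarizes such a schema. The helper `describe_features` is hypothetical and not part of the `datasets` library; the JSON is copied from the file above with the empty metadata fields omitted.

```python
import json

# "features" block copied from test/test/dataset_info.json in this commit.
TEST_INFO = json.loads("""
{
  "features": {
    "sentence": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"},
    "labels": {"feature": {"dtype": "string", "_type": "Value"}, "_type": "Sequence"}
  }
}
""")


def describe_features(info: dict) -> dict:
    """Map each column name to a short type description (hypothetical helper).

    Handles only the two shapes that appear in this commit:
    a plain Value, or a Sequence wrapping a Value.
    """
    described = {}
    for name, spec in info["features"].items():
        if spec["_type"] == "Sequence":
            described[name] = f"Sequence[{spec['feature']['dtype']}]"
        else:
            described[name] = spec["dtype"]
    return described


if __name__ == "__main__":
    print(describe_features(TEST_INFO))
```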
test/test/state.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "_data_files": [
+     {
+       "filename": "data-00000-of-00001.arrow"
+     }
+   ],
+   "_fingerprint": "a966e5e39a3a551f",
+   "_format_columns": null,
+   "_format_kwargs": {},
+   "_format_type": null,
+   "_output_all_columns": false,
+   "_split": null
+ }
train/dataset_dict.json ADDED
@@ -0,0 +1 @@
+ {"splits": ["train"]}
train/train/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c3b84a293ed7afd9641f578c760558feab774e12174775ffef3bd6d130873903
+ size 1400
train/train/dataset_info.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "citation": "",
+   "description": "",
+   "features": {
+     "sentence": {
+       "dtype": "string",
+       "_type": "Value"
+     },
+     "labels": {
+       "dtype": "string",
+       "_type": "Value"
+     }
+   },
+   "homepage": "",
+   "license": ""
+ }
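The train split stores `sentence` and `labels` as plain strings, whereas `test/test/dataset_info.json` in this commit declares them as sequences of strings. Code that consumes both splits may therefore need to bring rows to a single shape. Below is a sketch, assuming that joining list elements with a space is acceptable for the task; `to_text` and `normalize_row` are hypothetical names, and the sample rows are invented for illustration.

```python
def to_text(value, sep: str = " ") -> str:
    """Return a plain string whether `value` is a string or a list of strings."""
    if isinstance(value, str):
        return value
    return sep.join(value)


def normalize_row(row: dict) -> dict:
    """Coerce the `sentence` and `labels` columns of one row to plain strings."""
    return {
        "sentence": to_text(row["sentence"]),
        "labels": to_text(row["labels"]),
    }


if __name__ == "__main__":
    # Invented rows: one shaped like the train schema, one like the test schema.
    train_like = {"sentence": "A short article sentence.", "labels": "A summary."}
    test_like = {"sentence": ["First part.", "Second part."], "labels": ["A summary."]}
    print(normalize_row(train_like))
    print(normalize_row(test_like))
```

A function of this shape could be applied row by row after loading, for example through the `map` method of a `datasets.Dataset`.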
train/train/state.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "_data_files": [
+     {
+       "filename": "data-00000-of-00001.arrow"
+     }
+   ],
+   "_fingerprint": "a1df46296853828f",
+   "_format_columns": null,
+   "_format_kwargs": {},
+   "_format_type": null,
+   "_output_all_columns": false,
+   "_split": null
+ }