compendious committed · Commit f179148 · 1 Parent(s): f71ba81

More data cleaning. Tuning the data and then tuning the model is next.
.github/README.md CHANGED

@@ -133,11 +133,11 @@ Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenari
 
 ### SQuALITY (Long-Document QA)
 
+This dataset contains stories ("long documents", each thousands of words long) from Project Gutenberg, along with human-written summaries and question-answer pairs. It is designed to test a model's ability to understand and summarize long-form content. GitHub repo: [https://github.com/nyu-mll/SQuALITY](https://github.com/nyu-mll/SQuALITY)
+
 Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)
 
-<details>
-
-<summary>BibTeX</summary>
+<details> <summary>BibTeX</summary>
 
 ```bibtex
 @article{wang2022squality,
@@ -157,10 +157,11 @@ Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: B
 
 ### MS MARCO (Concise QA)
 
+This is a large dataset of real user queries from Bing, paired with relevant passages from web documents.
+
 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.
 
-<details>
-<summary>BibTeX</summary>
+<details><summary>BibTeX</summary>
 
 ```bibtex
 @inproceedings{nguyen2016msmarco,
@@ -174,6 +175,25 @@ Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng,
 
 </details>
 
+### QMSum
+
+This dataset targets query-based summarization of meeting transcripts: models take in a transcript and answer questions about it. The GitHub repo for the dataset, with other details, is [here](https://github.com/Yale-LILY/QMSum).
+
+Zhong, M., Yin, D., Yu, T., Zaidi, A., Mutuma, M., Jha, R., Awadallah, A. H., Celikyilmaz, A., Liu, Y., Qiu, X., & Radev, D. (2021). *QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization*. NAACL 2021. [https://arxiv.org/abs/2104.05938](https://arxiv.org/abs/2104.05938)
+
+<details><summary>BibTeX</summary>
+
+```bibtex
+@inproceedings{zhong2021qmsum,
+  title={{QMS}um: {A} {N}ew {B}enchmark for {Q}uery-based {M}ulti-domain {M}eeting {S}ummarization},
+  author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan Awadallah, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir},
+  booktitle={North American Association for Computational Linguistics (NAACL)},
+  year={2021}
+}
+```
+
+</details>
+
 ## License
 
 [GPL-3.0](LICENSE.md)
README.md CHANGED

@@ -87,14 +87,18 @@ Runs on `http://localhost:8000`. Interactive docs at `/docs`.
 
 ### Run the Frontend
 
+In another terminal, run:
+
 ```bash
 cd frontend
-npm install # or whatever replacement for npm you may be using
+npm install # or use any npm alternative
 npm run dev
 ```
 
 Runs on `http://localhost:5173`.
 
+**Development Setup**: The frontend dev server automatically proxies API calls to the backend; just open the app at `http://localhost:5173` during development.
+
 ## Data
 
 <!-- markdownlint-disable MD033 -->
@@ -146,11 +150,11 @@ Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenari
 
 ### SQuALITY (Long-Document QA)
 
+This dataset contains stories ("long documents", each thousands of words long) from Project Gutenberg, along with human-written summaries and question-answer pairs. It is designed to test a model's ability to understand and summarize long-form content. GitHub repo: [https://github.com/nyu-mll/SQuALITY](https://github.com/nyu-mll/SQuALITY)
+
 Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)
 
-<details>
-
-<summary>BibTeX</summary>
+<details> <summary>BibTeX</summary>
 
 ```bibtex
 @article{wang2022squality,
@@ -170,10 +174,11 @@ Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: B
 
 ### MS MARCO (Concise QA)
 
+This is a large dataset of real user queries from Bing, paired with relevant passages from web documents.
+
 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.
 
-<details>
-<summary>BibTeX</summary>
+<details><summary>BibTeX</summary>
 
 ```bibtex
 @inproceedings{nguyen2016msmarco,
@@ -187,6 +192,25 @@ Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng,
 
 </details>
 
+### QMSum
+
+This dataset targets query-based summarization of meeting transcripts: models take in a transcript and answer questions about it. The GitHub repo for the dataset, with other details, is [here](https://github.com/Yale-LILY/QMSum).
+
+Zhong, M., Yin, D., Yu, T., Zaidi, A., Mutuma, M., Jha, R., Awadallah, A. H., Celikyilmaz, A., Liu, Y., Qiu, X., & Radev, D. (2021). *QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization*. NAACL 2021. [https://arxiv.org/abs/2104.05938](https://arxiv.org/abs/2104.05938)
+
+<details><summary>BibTeX</summary>
+
+```bibtex
+@inproceedings{zhong2021qmsum,
+  title={{QMS}um: {A} {N}ew {B}enchmark for {Q}uery-based {M}ulti-domain {M}eeting {S}ummarization},
+  author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan Awadallah, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir},
+  booktitle={North American Association for Computational Linguistics (NAACL)},
+  year={2021}
+}
+```
+
+</details>
+
 ## License
 
 [GPL-3.0](LICENSE.md)
backend/app.py CHANGED

@@ -32,7 +32,10 @@ app.add_middleware(
     allow_headers=["Content-Type", "X-API-Key"],
 )
 
-app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="static")
+# Only mount the frontend in production, when dist/ exists
+import os
+if os.path.isdir("frontend/dist"):
+    app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="static")
 
 def verify_api_key(x_api_key: Optional[str] = Header(default=None, alias="X-API-Key")):
     if not API_KEY:
backend/ollama.py CHANGED

@@ -28,7 +28,7 @@ def build_prompt(title: Optional[str], text: str) -> str:
     return (
         f"{instructions}\n"
         "Do not add opinions, commentary, or filler phrases like 'The article discusses' or 'This document provides',\n"
-        "or ANYTHING along those lines whether it be in meaning or phrasing. "
+        "or any similar phrasing, whether the similarity is in meaning or wording. Get straight to the point. "
         "Output the summary sentence only. The sentence should be no longer than 200 characters. Nothing else should be included.\n\n"
         f"Article:\n{text}\n\n"
         "Summary:"
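Editor's note on the hunk above: `build_prompt` relies on Python's implicit concatenation of adjacent string literals, which joins fragments with no separator, so each fragment must end with its own space or `\n` (the new line gained a trailing space for exactly this reason). A minimal illustration:

```python
# Adjacent string literals inside parentheses are joined verbatim,
# so omitting a trailing space would run two sentences together.
prompt = (
    "Output the summary sentence only. "  # trailing space is required
    "Nothing else should be included."
)
print(prompt)
```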
scripts/clean.py CHANGED

@@ -9,14 +9,22 @@ import os
 def run_script(script_path):
     subprocess.run(["python", script_path], cwd=os.path.dirname(__file__))
 
-# Run both cleaning scripts in parallel for speed
 t1 = threading.Thread(target=run_script, args=("cleaners/clean_ms.py",))
 t2 = threading.Thread(target=run_script, args=("cleaners/clean_ds.py",))
+t3 = threading.Thread(target=run_script, args=("cleaners/clean_msm.py",))
+t4 = threading.Thread(target=run_script, args=("cleaners/clean_qmsum.py",))
+t5 = threading.Thread(target=run_script, args=("cleaners/clean_squality.py",))
 
 t1.start()
 t2.start()
+t3.start()
+t4.start()
+t5.start()
 
 t1.join()
 t2.join()
+t3.join()
+t4.join()
+t5.join()
 
 print("All cleaning scripts completed.")
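The thread-per-script fan-out above now needs three new lines per added cleaner. A hedged alternative sketch using `concurrent.futures` (same start-all/join-all behavior, assuming the repo's `run_script` helper; a stand-in callable is used here so the sketch is self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

SCRIPTS = [
    "cleaners/clean_ms.py",
    "cleaners/clean_ds.py",
    "cleaners/clean_msm.py",
    "cleaners/clean_qmsum.py",
    "cleaners/clean_squality.py",
]

def run_all(fn, args, max_workers=5):
    """Run fn once per argument in parallel and wait for all to finish."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and the `with` block joins all workers
        return list(pool.map(fn, args))

# Demo with a stand-in for run_script:
done = run_all(lambda path: f"ran {path}", SCRIPTS)
```

Adding a sixth cleaner then means appending one string to `SCRIPTS` instead of three `t6` lines.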
scripts/cleaners/clean_msm.py ADDED

@@ -0,0 +1,52 @@
+"""
+Randomly samples 10,000 rows from ../raw_data/raw_data_msmarco_train.csv and
+1,000 rows from ../raw_data/raw_data_msmarco_val.csv, then converts each sample to JSONL.
+"""
+import random
+import json
+import csv
+import os
+
+def reservoir_sample_csv(file_path, k):
+    # Loads the full CSV, then draws k rows uniformly (simple, not a streaming reservoir).
+    rows = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            rows.append(row)
+    if len(rows) <= k:
+        return rows
+    return random.sample(rows, k)
+
+def write_jsonl(rows, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, row in enumerate(rows):
+            new_data = {
+                "id": i,
+                "original_source": "MSMarco",
+                "query": row.get("query", ""),
+                "answers": row.get("answers", ""),
+                "passage": row.get("finalpassage", "")
+            }
+            json.dump(new_data, f)  # no indent: keep each record on a single line (JSONL)
+            f.write('\n')
+
+print("Cleaning MSMarco dataset...")
+
+ta = '../raw_data/raw_data_msmarco_train.csv'
+tb = '../raw_data/raw_data_msmarco_val.csv'
+
+train_loc = '../clean1/msm/msmarco_train_10k.jsonl'
+test_loc = '../clean1/msm/msmarco_val_1k.jsonl'
+
+print("Sampling rows from raw data CSV files...")
+
+train_rows = reservoir_sample_csv(ta, 10000)
+test_rows = reservoir_sample_csv(tb, 1000)
+
+print("Collected samples. Writing to JSONL files...")
+
+write_jsonl(train_rows, train_loc)
+write_jsonl(test_rows, test_loc)
+
+print("Done")
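One caveat on the `write_jsonl` helpers in these cleaners: for the output to parse as JSONL, each record must stay on a single physical line, so `json.dump` has to be called without `indent` (pretty-printing would spread one object across many lines). A quick round-trip sketch with hypothetical records:

```python
import io
import json

records = [{"id": 0, "query": "q0"}, {"id": 1, "query": "q1"}]  # hypothetical rows

buf = io.StringIO()
for rec in records:
    json.dump(rec, buf)  # no indent: the whole object stays on one line
    buf.write("\n")

# Each line now parses back into exactly one record.
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
```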
scripts/cleaners/clean_qmsum.py ADDED

@@ -0,0 +1,65 @@
+"""
+Randomly samples 10,000 lines from ../raw_data/raw_qmsum_train.jsonl, 1,000 lines from
+../raw_data/raw_qmsum_test.jsonl, and 1,000 lines from ../raw_data/raw_qmsum_val.jsonl,
+then converts each sample to simplified JSONL.
+"""
+import random
+import json
+import os
+
+def reservoir_sample(file_path, k):
+    # Algorithm R: uniform k-line sample in one streaming pass.
+    reservoir = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i < k:
+                reservoir.append(line.strip())
+            else:
+                j = random.randint(0, i)
+                if j < k:
+                    reservoir[j] = line.strip()
+    return reservoir
+
+def write_jsonl(lines, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, line in enumerate(lines):
+            data = json.loads(line)
+            # Extract the first general query and answer, if available
+            general_query = ""
+            general_answer = ""
+            if "general_query_list" in data and len(data["general_query_list"]) > 0:
+                general_query = data["general_query_list"][0].get("query", "")
+                general_answer = data["general_query_list"][0].get("answer", "")
+
+            new_data = {
+                "id": i,
+                "original_source": "QMSum",
+                "general_query": general_query,
+                "general_answer": general_answer,
+                "topic_list": data.get("topic_list", [])
+            }
+            json.dump(new_data, f)  # no indent: keep each record on a single line (JSONL)
+            f.write('\n')
+
+print("Cleaning QMSum dataset...")
+
+ta = '../raw_data/raw_qmsum_train.jsonl'
+tb = '../raw_data/raw_qmsum_test.jsonl'
+vc = '../raw_data/raw_qmsum_val.jsonl'
+
+train_loc = '../clean1/qmsum/qmsum_train_10k.jsonl'
+test_loc = '../clean1/qmsum/qmsum_test_1k.jsonl'
+val_loc = '../clean1/qmsum/qmsum_val_1k.jsonl'
+
+print("Sampling lines from raw data files...")
+
+train_lines = reservoir_sample(ta, 10000)
+test_lines = reservoir_sample(tb, 1000)
+val_lines = reservoir_sample(vc, 1000)
+
+print("Collected samples. Writing to JSONL files...")
+
+write_jsonl(train_lines, train_loc)
+write_jsonl(test_lines, test_loc)
+write_jsonl(val_lines, val_loc)
+
+print("Done")
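The `reservoir_sample` function in this cleaner is Algorithm R: it keeps a uniform k-line sample in O(k) memory while streaming the file exactly once. A quick self-contained check (rewritten over a plain iterable so no file is needed) that it returns exactly k lines, and every line when the input is shorter than k:

```python
import random

def reservoir_sample_lines(lines, k):
    """Algorithm R over any iterable of lines (mirrors reservoir_sample above)."""
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)           # fill the reservoir first
        else:
            j = random.randint(0, i)         # keep line i with probability k/(i+1)
            if j < k:
                reservoir[j] = line
    return reservoir

random.seed(0)
short = reservoir_sample_lines(["a", "b"], k=5)                 # fewer lines than k: keep all
full = reservoir_sample_lines([str(i) for i in range(1000)], k=10)
```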
scripts/cleaners/clean_squality.py ADDED

@@ -0,0 +1,65 @@
+"""
+Randomly samples 10,000 lines from ../raw_data/raw_squality_train.jsonl, 1,000 lines from
+../raw_data/raw_squality_test.jsonl, and 1,000 lines from ../raw_data/raw_squality_val.jsonl,
+then converts each sample to simplified JSONL.
+"""
+import random
+import json
+import os
+
+def reservoir_sample(file_path, k):
+    # Algorithm R: uniform k-line sample in one streaming pass.
+    reservoir = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i < k:
+                reservoir.append(line.strip())
+            else:
+                j = random.randint(0, i)
+                if j < k:
+                    reservoir[j] = line.strip()
+    return reservoir
+
+def write_jsonl(lines, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, line in enumerate(lines):
+            data = json.loads(line)
+            # Extract the key fields
+            source_type = data.get("source_type", "")
+            query_synthesized = data.get("query_synthesized", "")
+            summary = data.get("summary", "")
+            document = data.get("document", "")
+
+            new_data = {
+                "id": i,
+                "original_source": "SQuALITY",
+                "source_type": source_type,
+                "query": query_synthesized,
+                "summary": summary,
+                "document": document[:500] if document else ""  # truncate long documents
+            }
+            json.dump(new_data, f)  # no indent: keep each record on a single line (JSONL)
+            f.write('\n')
+
+print("Cleaning SQuALITY dataset...")
+
+ta = '../raw_data/raw_squality_train.jsonl'
+tb = '../raw_data/raw_squality_test.jsonl'
+vc = '../raw_data/raw_squality_val.jsonl'
+
+train_loc = '../clean1/squality/squality_train_10k.jsonl'
+test_loc = '../clean1/squality/squality_test_1k.jsonl'
+val_loc = '../clean1/squality/squality_val_1k.jsonl'
+
+print("Sampling lines from raw data files...")
+
+train_lines = reservoir_sample(ta, 10000)
+test_lines = reservoir_sample(tb, 1000)
+val_lines = reservoir_sample(vc, 1000)
+
+print("Collected samples. Writing to JSONL files...")
+
+write_jsonl(train_lines, train_loc)
+write_jsonl(test_lines, test_loc)
+write_jsonl(val_lines, val_loc)
+
+print("Done")