Spaces: Build error

Commit f179148
Parent(s): f71ba81

more data cleaning. Tuning data and then tuning the model is next
Browse files

- .github/README.md +25 -5
- README.md +30 -6
- backend/app.py +4 -1
- backend/ollama.py +1 -1
- scripts/clean.py +9 -1
- scripts/cleaners/clean_msm.py +52 -0
- scripts/cleaners/clean_qmsum.py +65 -0
- scripts/cleaners/clean_squality.py +65 -0
.github/README.md
CHANGED

@@ -133,11 +133,11 @@ Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenari

 ### SQuALITY (Long-Document QA)

+This dataset contains around 6000 stories ("long documents") from Project Gutenberg, along with human-written summaries and question-answer pairs. The dataset is designed to test the ability of models to understand and summarize long-form content. GitHub repo: [https://github.com/nyu-mll/SQuALITY](https://github.com/nyu-mll/SQuALITY)
+
 Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)

-<details>
-
-<summary>BibTeX</summary>
+<details> <summary>BibTeX</summary>

 ```bibtex
 @article{wang2022squality,
@@ -157,10 +157,11 @@ Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: B

 ### MS MARCO (Concise QA)

+This is a massive dataset of real user queries from Bing, along with passages from web documents that are relevant to those queries.
+
 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.

-<details>
-<summary>BibTeX</summary>
+<details><summary>BibTeX</summary>

 ```bibtex
 @inproceedings{nguyen2016msmarco,
@@ -174,6 +175,25 @@ Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng,

 </details>

+### QMSum
+
+This dataset is for specifically taking in transcripts and answering questions about them. The GitHub repo for the dataset [and other details is here](https://github.com/Yale-LILY/QMSum).
+
+Zhong, M., Yin, D., Yu, T., Zaidi, A., Mutuma, M., Jha, R., Awadallah, A. H., Celikyilmaz, A., Liu, Y., Qiu, X., & Radev, D. (2021). *QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization*. NAACL 2021. [https://arxiv.org/abs/2104.05938](https://arxiv.org/abs/2104.05938)
+
+<details><summary>BibTeX</summary>
+
+```bibtex
+@inproceedings{zhong2021qmsum,
+  title={{QMS}um: {A} {N}ew {B}enchmark for {Q}uery-based {M}ulti-domain {M}eeting {S}ummarization},
+  author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan Awadallah, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir},
+  booktitle={North American Association for Computational Linguistics (NAACL)},
+  year={2021}
+}
+```
+
+</details>
+
 ## License

 [GPL-3.0](LICENSE.md)
README.md
CHANGED

@@ -87,14 +87,18 @@ Runs on `http://localhost:8000`. Interactive docs at `/docs`.

 ### Run the Frontend

+In another terminal, run:
+
 ```bash
 cd frontend
-npm install # or
+npm install # or use any npm alternative
 npm run dev
 ```

 Runs on `http://localhost:5173`.

+**Development Setup**: The frontend dev server will automatically proxy API calls to the backend. Just access the app at `http://localhost:5173` during development.
+
 ## Data

 <!-- markdownlint-disable MD033 -->
@@ -146,11 +150,11 @@ Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenari

 ### SQuALITY (Long-Document QA)

+This dataset contains around 6000 stories ("long documents") from Project Gutenberg, along with human-written summaries and question-answer pairs. The dataset is designed to test the ability of models to understand and summarize long-form content. GitHub repo: [https://github.com/nyu-mll/SQuALITY](https://github.com/nyu-mll/SQuALITY)
+
 Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)

-<details>
-
-<summary>BibTeX</summary>
+<details> <summary>BibTeX</summary>

 ```bibtex
 @article{wang2022squality,
@@ -170,10 +174,11 @@ Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: B

 ### MS MARCO (Concise QA)

+This is a massive dataset of real user queries from Bing, along with passages from web documents that are relevant to those queries.
+
 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.

-<details>
-<summary>BibTeX</summary>
+<details><summary>BibTeX</summary>

 ```bibtex
 @inproceedings{nguyen2016msmarco,
@@ -187,6 +192,25 @@ Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng,

 </details>

+### QMSum
+
+This dataset is for specifically taking in transcripts and answering questions about them. The GitHub repo for the dataset [and other details is here](https://github.com/Yale-LILY/QMSum).
+
+Zhong, M., Yin, D., Yu, T., Zaidi, A., Mutuma, M., Jha, R., Awadallah, A. H., Celikyilmaz, A., Liu, Y., Qiu, X., & Radev, D. (2021). *QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization*. NAACL 2021. [https://arxiv.org/abs/2104.05938](https://arxiv.org/abs/2104.05938)
+
+<details><summary>BibTeX</summary>
+
+```bibtex
+@inproceedings{zhong2021qmsum,
+  title={{QMS}um: {A} {N}ew {B}enchmark for {Q}uery-based {M}ulti-domain {M}eeting {S}ummarization},
+  author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan Awadallah, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir},
+  booktitle={North American Association for Computational Linguistics (NAACL)},
+  year={2021}
+}
+```
+
+</details>
+
 ## License

 [GPL-3.0](LICENSE.md)
backend/app.py
CHANGED

@@ -32,7 +32,10 @@ app.add_middleware(
     allow_headers=["Content-Type", "X-API-Key"],
 )

-
+# Only mount frontend in production when dist/ exists
+import os
+if os.path.isdir("frontend/dist"):
+    app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="static")

 def verify_api_key(x_api_key: Optional[str] = Header(default=None, alias="X-API-Key")):
     if not API_KEY:
backend/ollama.py
CHANGED

@@ -28,7 +28,7 @@ def build_prompt(title: Optional[str], text: str) -> str:
     return (
         f"{instructions}\n"
         "Do not add opinions, commentary, or filler phrases like 'The article discusses' or 'This document provides'.\n"
-        "or
+        "or any similar phrasing, whether the similarity be in meaning or otherwise. Get straight to the point."
         "Output the summary sentence only. The sentence should be no longer than 200 characetrs long. Nothing else should be included.\n\n"
         f"Article:\n{text}\n\n"
         "Summary:"
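The prompt in `build_prompt` is assembled from adjacent string literals, which Python joins with no separator. Worth noting: the literal added in this commit ends with `"point."` and no trailing `\n`, so it fuses directly with the `"Output the summary..."` literal that follows. A minimal sketch of the effect (the `instructions` value here is hypothetical, not the app's actual text):

```python
instructions = "Summarize the article in one sentence."  # hypothetical value

prompt = (
    f"{instructions}\n"
    "Do not add opinions, commentary, or filler phrases.\n"
    # no trailing "\n" here, so this literal fuses with the next one
    "Get straight to the point."
    "Output the summary sentence only.\n\n"
)
```

Whether that fused sentence boundary matters depends on the model; adding `\n` at the end of each literal keeps the instruction lines separate.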
scripts/clean.py
CHANGED

@@ -9,14 +9,22 @@ import os
 def run_script(script_path):
     subprocess.run(["python", script_path], cwd=os.path.dirname(__file__))

-# Run both cleaning scripts in parallel for speed
 t1 = threading.Thread(target=run_script, args=("cleaners/clean_ms.py",))
 t2 = threading.Thread(target=run_script, args=("cleaners/clean_ds.py",))
+t3 = threading.Thread(target=run_script, args=("cleaners/clean_msm.py",))
+t4 = threading.Thread(target=run_script, args=("cleaners/clean_qmsum.py",))
+t5 = threading.Thread(target=run_script, args=("cleaners/clean_squality.py",))

 t1.start()
 t2.start()
+t3.start()
+t4.start()
+t5.start()

 t1.join()
 t2.join()
+t3.join()
+t4.join()
+t5.join()

 print("All cleaning scripts completed.")
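With five cleaners, the one-thread-variable-per-script pattern above starts to repeat itself; a `ThreadPoolExecutor` expresses the same "start all, wait for all" logic and scales as more cleaners are added. A sketch under the assumption that each script is launched the same way as `run_script` (the `run_all` helper and the runner below are illustrative, not part of the repo):

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(scripts, runner):
    # one worker per script; leaving the with-block waits for every task
    with ThreadPoolExecutor(max_workers=len(scripts)) as pool:
        list(pool.map(runner, scripts))

# illustrative runner that just records each script name instead of spawning it
done = []
run_all(["cleaners/clean_ms.py", "cleaners/clean_ds.py", "cleaners/clean_msm.py"],
        done.append)
```

In clean.py itself, `runner` would be `run_script` and the list would name all five cleaner paths.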
scripts/cleaners/clean_msm.py
ADDED

@@ -0,0 +1,52 @@
+"""
+RANDOMLY Takes 10,000 lines from ../raw_data/raw_data_msmarco_train.csv, 1,000 lines from ../raw_data/raw_data_msmarco_val.csv. Then converts each one to JSONL.
+
+"""
+import random
+import json
+import csv
+import os
+
+def reservoir_sample_csv(file_path, k):
+    rows = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            rows.append(row)
+    if len(rows) <= k:
+        return rows
+    return random.sample(rows, k)
+
+def write_jsonl(rows, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, row in enumerate(rows):
+            new_data = {
+                "id": i,
+                "original_source": "MSMarco",
+                "query": row.get("query", ""),
+                "answers": row.get("answers", ""),
+                "passage": row.get("finalpassage", "")
+            }
+            json.dump(new_data, f, indent=2)
+            f.write('\n')
+
+print("Cleaning MSMarco dataset...")
+
+ta = '../raw_data/raw_data_msmarco_train.csv'
+tb = '../raw_data/raw_data_msmarco_val.csv'
+
+train_loc = '../clean1/msm/msmarco_train_10k.jsonl'
+test_loc = '../clean1/msm/msmarco_val_1k.jsonl'
+
+print("Sampling rows from raw data CSV files...")
+
+train_rows = reservoir_sample_csv(ta, 10000)
+test_rows = reservoir_sample_csv(tb, 1000)
+
+print("Collected Samples. Writing to JSONL files...")
+
+write_jsonl(train_rows, train_loc)
+write_jsonl(test_rows, test_loc)
+
+print("Done")
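One caveat about `write_jsonl` as committed: `json.dump(new_data, f, indent=2)` pretty-prints each record across several lines, so the output files are not strict one-object-per-line JSONL, and any consumer that calls `json.loads` on each line will fail. A sketch of the usual JSONL convention, written to an in-memory buffer with hypothetical records:

```python
import io
import json

rows = [{"id": 0, "query": "q0"}, {"id": 1, "query": "q1"}]  # hypothetical records

buf = io.StringIO()
for row in rows:
    # no indent argument: the whole record stays on a single line
    buf.write(json.dumps(row) + "\n")

# line-by-line parsing round-trips every record
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
```

Dropping `indent=2` in the cleaners (i.e. `json.dump(new_data, f)`) would give the same one-line-per-record output.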
scripts/cleaners/clean_qmsum.py
ADDED

@@ -0,0 +1,65 @@
+"""
+RANDOMLY Takes 10,000 lines from ../raw_data/raw_qmsum_train.jsonl, 1,000 lines from ../raw_data/raw_qmsum_test.jsonl, and 1,000 lines from ../raw_data/raw_qmsum_val.jsonl. Then converts each one to simplified JSONL.
+
+"""
+import random
+import json
+import os
+
+def reservoir_sample(file_path, k):
+    reservoir = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i < k:
+                reservoir.append(line.strip())
+            else:
+                j = random.randint(0, i)
+                if j < k:
+                    reservoir[j] = line.strip()
+    return reservoir
+
+def write_jsonl(lines, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, line in enumerate(lines):
+            data = json.loads(line)
+            # Extract first general query and answer if available
+            general_query = ""
+            general_answer = ""
+            if "general_query_list" in data and len(data["general_query_list"]) > 0:
+                general_query = data["general_query_list"][0].get("query", "")
+                general_answer = data["general_query_list"][0].get("answer", "")
+
+            new_data = {
+                "id": i,
+                "original_source": "QMSum",
+                "general_query": general_query,
+                "general_answer": general_answer,
+                "topic_list": data.get("topic_list", [])
+            }
+            json.dump(new_data, f, indent=2)
+            f.write('\n')
+
+print("Cleaning QMSum dataset...")
+
+ta = '../raw_data/raw_qmsum_train.jsonl'
+tb = '../raw_data/raw_qmsum_test.jsonl'
+vc = '../raw_data/raw_qmsum_val.jsonl'
+
+train_loc = '../clean1/qmsum/qmsum_train_10k.jsonl'
+test_loc = '../clean1/qmsum/qmsum_test_1k.jsonl'
+val_loc = '../clean1/qmsum/qmsum_val_1k.jsonl'
+
+print("Sampling lines from raw data files...")
+
+train_lines = reservoir_sample(ta, 10000)
+test_lines = reservoir_sample(tb, 1000)
+val_lines = reservoir_sample(vc, 1000)
+
+print("Collected Samples. Writing to JSONL files...")
+
+write_jsonl(train_lines, train_loc)
+write_jsonl(test_lines, test_loc)
+write_jsonl(val_lines, val_loc)
+
+print("Done")
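Unlike `reservoir_sample_csv` in clean_msm.py, which reads every row into memory before calling `random.sample`, `reservoir_sample` here is reservoir sampling proper (Algorithm R): a single pass that keeps at most k items in memory, with each input line ending up in the sample with probability k/n. A standalone sketch of the same routine over any iterable of lines:

```python
import random

def reservoir_sample_lines(lines, k):
    # Algorithm R: after seeing n items, each one survives with probability k/n
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)          # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)        # uniform index over everything seen so far
            if j < k:
                reservoir[j] = line         # replace a reservoir slot with probability k/(i+1)
    return reservoir

random.seed(0)  # deterministic for the example
sample = reservoir_sample_lines([str(n) for n in range(1000)], 10)
```

When the input has fewer than k lines, the loop never reaches the replacement branch and the whole input is returned, which matches the cleaner's behavior.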
scripts/cleaners/clean_squality.py
ADDED

@@ -0,0 +1,65 @@
+"""
+RANDOMLY Takes 10,000 lines from ../raw_data/raw_squality_train.jsonl, 1,000 lines from ../raw_data/raw_squality_test.jsonl, and 1,000 lines from ../raw_data/raw_squality_val.jsonl. Then converts each one to simplified JSONL.
+
+"""
+import random
+import json
+import os
+
+def reservoir_sample(file_path, k):
+    reservoir = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i < k:
+                reservoir.append(line.strip())
+            else:
+                j = random.randint(0, i)
+                if j < k:
+                    reservoir[j] = line.strip()
+    return reservoir
+
+def write_jsonl(lines, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, line in enumerate(lines):
+            data = json.loads(line)
+            # Extract key fields from the data
+            source_type = data.get("source_type", "")
+            query_synthesized = data.get("query_synthesized", "")
+            summary = data.get("summary", "")
+            document = data.get("document", "")
+
+            new_data = {
+                "id": i,
+                "original_source": "SQuALITY",
+                "source_type": source_type,
+                "query": query_synthesized,
+                "summary": summary,
+                "document": document[:500] if document else ""  # Truncate long documents
+            }
+            json.dump(new_data, f, indent=2)
+            f.write('\n')
+
+print("Cleaning SQuALITY dataset...")
+
+ta = '../raw_data/raw_squality_train.jsonl'
+tb = '../raw_data/raw_squality_test.jsonl'
+vc = '../raw_data/raw_squality_val.jsonl'
+
+train_loc = '../clean1/squality/squality_train_10k.jsonl'
+test_loc = '../clean1/squality/squality_test_1k.jsonl'
+val_loc = '../clean1/squality/squality_val_1k.jsonl'
+
+print("Sampling lines from raw data files...")
+
+train_lines = reservoir_sample(ta, 10000)
+test_lines = reservoir_sample(tb, 1000)
+val_lines = reservoir_sample(vc, 1000)
+
+print("Collected Samples. Writing to JSONL files...")
+
+write_jsonl(train_lines, train_loc)
+write_jsonl(test_lines, test_loc)
+write_jsonl(val_lines, val_loc)
+
+print("Done")
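The per-record mapping inside the SQuALITY cleaner's `write_jsonl` (defaults for absent keys, 500-character document truncation) can be pulled into a small helper, which makes those behaviors easy to check in isolation. A sketch mirroring the cleaner's field names; the `simplify_record` name and `max_doc` parameter are illustrative, not part of the repo:

```python
def simplify_record(i, data, max_doc=500):
    # mirrors the cleaner's mapping; absent keys fall back to ""
    document = data.get("document", "")
    return {
        "id": i,
        "original_source": "SQuALITY",
        "source_type": data.get("source_type", ""),
        "query": data.get("query_synthesized", ""),
        "summary": data.get("summary", ""),
        "document": document[:max_doc] if document else "",  # truncate long documents
    }

rec = simplify_record(0, {"document": "x" * 1000, "summary": "short"})
```

Given that this commit message says tuning the data is next, factoring the mapping out this way would let each cleaner's schema be adjusted and tested without rerunning the sampling.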