Spaces: Build error

Commit f179148
Parent(s): f71ba81

more data cleaning. Tuning data and then tuning the model is next
Browse files

- .github/README.md +25 -5
- README.md +30 -6
- backend/app.py +4 -1
- backend/ollama.py +1 -1
- scripts/clean.py +9 -1
- scripts/cleaners/clean_msm.py +52 -0
- scripts/cleaners/clean_qmsum.py +65 -0
- scripts/cleaners/clean_squality.py +65 -0
.github/README.md
CHANGED

@@ -133,11 +133,11 @@ Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenari

 ### SQuALITY (Long-Document QA)

+This dataset contains around 6000 stories ("long documents") from Project Gutenberg, along with human-written summaries and question-answer pairs. The dataset is designed to test the ability of models to understand and summarize long-form content. GitHub repo: [https://github.com/nyu-mll/SQuALITY](https://github.com/nyu-mll/SQuALITY)
+
 Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)

-<details>
-
-<summary>BibTeX</summary>
+<details> <summary>BibTeX</summary>

 ```bibtex
 @article{wang2022squality,
@@ -157,10 +157,11 @@ Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: B

 ### MS MARCO (Concise QA)

+This is a massive dataset of real user queries from Bing, along with passages from web documents that are relevant to those queries.
+
 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.

-<details>
-<summary>BibTeX</summary>
+<details><summary>BibTeX</summary>

 ```bibtex
 @inproceedings{nguyen2016msmarco,
@@ -174,6 +175,25 @@ Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng,

 </details>

+### QMSum
+
+This dataset is for specifically taking in transcripts and answering questions about them. The GitHub repo for the dataset [and other details is here](https://github.com/Yale-LILY/QMSum).
+
+Zhong, M., Yin, D., Yu, T., Zaidi, A., Mutuma, M., Jha, R., Awadallah, A. H., Celikyilmaz, A., Liu, Y., Qiu, X., & Radev, D. (2021). *QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization*. NAACL 2021. [https://arxiv.org/abs/2104.05938](https://arxiv.org/abs/2104.05938)
+
+<details><summary>BibTeX</summary>
+
+```bibtex
+@inproceedings{zhong2021qmsum,
+  title={{QMS}um: {A} {N}ew {B}enchmark for {Q}uery-based {M}ulti-domain {M}eeting {S}ummarization},
+  author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan Awadallah, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir},
+  booktitle={North American Association for Computational Linguistics (NAACL)},
+  year={2021}
+}
+```
+
+</details>
+
 ## License

 [GPL-3.0](LICENSE.md)
README.md
CHANGED

@@ -87,14 +87,18 @@ Runs on `http://localhost:8000`. Interactive docs at `/docs`.

 ### Run the Frontend

+In another terminal, run:
+
 ```bash
 cd frontend
-npm install # or
+npm install # or use any npm alternative
 npm run dev
 ```

 Runs on `http://localhost:5173`.

+**Development Setup**: The frontend dev server will automatically proxy API calls to the backend. Just access the app at `http://localhost:5173` during development.
+
 ## Data

 <!-- markdownlint-disable MD033 -->
@@ -146,11 +150,11 @@ Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenari

 ### SQuALITY (Long-Document QA)

+This dataset contains around 6000 stories ("long documents") from Project Gutenberg, along with human-written summaries and question-answer pairs. The dataset is designed to test the ability of models to understand and summarize long-form content. GitHub repo: [https://github.com/nyu-mll/SQuALITY](https://github.com/nyu-mll/SQuALITY)
+
 Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)

-<details>
-
-<summary>BibTeX</summary>
+<details> <summary>BibTeX</summary>

 ```bibtex
 @article{wang2022squality,
@@ -170,10 +174,11 @@ Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: B

 ### MS MARCO (Concise QA)

+This is a massive dataset of real user queries from Bing, along with passages from web documents that are relevant to those queries.
+
 Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.

-<details>
-<summary>BibTeX</summary>
+<details><summary>BibTeX</summary>

 ```bibtex
 @inproceedings{nguyen2016msmarco,
@@ -187,6 +192,25 @@ Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng,

 </details>

+### QMSum
+
+This dataset is for specifically taking in transcripts and answering questions about them. The GitHub repo for the dataset [and other details is here](https://github.com/Yale-LILY/QMSum).
+
+Zhong, M., Yin, D., Yu, T., Zaidi, A., Mutuma, M., Jha, R., Awadallah, A. H., Celikyilmaz, A., Liu, Y., Qiu, X., & Radev, D. (2021). *QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization*. NAACL 2021. [https://arxiv.org/abs/2104.05938](https://arxiv.org/abs/2104.05938)
+
+<details><summary>BibTeX</summary>
+
+```bibtex
+@inproceedings{zhong2021qmsum,
+  title={{QMS}um: {A} {N}ew {B}enchmark for {Q}uery-based {M}ulti-domain {M}eeting {S}ummarization},
+  author={Zhong, Ming and Yin, Da and Yu, Tao and Zaidi, Ahmad and Mutuma, Mutethia and Jha, Rahul and Hassan Awadallah, Ahmed and Celikyilmaz, Asli and Liu, Yang and Qiu, Xipeng and Radev, Dragomir},
+  booktitle={North American Association for Computational Linguistics (NAACL)},
+  year={2021}
+}
+```
+
+</details>
+
 ## License

 [GPL-3.0](LICENSE.md)
backend/app.py
CHANGED

@@ -32,7 +32,10 @@ app.add_middleware(
     allow_headers=["Content-Type", "X-API-Key"],
 )

-
+# Only mount frontend in production when dist/ exists
+import os
+if os.path.isdir("frontend/dist"):
+    app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="static")

 def verify_api_key(x_api_key: Optional[str] = Header(default=None, alias="X-API-Key")):
     if not API_KEY:
backend/ollama.py
CHANGED

@@ -28,7 +28,7 @@ def build_prompt(title: Optional[str], text: str) -> str:
     return (
         f"{instructions}\n"
         "Do not add opinions, commentary, or filler phrases like 'The article discusses' or 'This document provides'.\n"
-        "or
+        "or any similar phrasing, whether the similarity be in meaning or otherwise. Get straight to the point."
         "Output the summary sentence only. The sentence should be no longer than 200 characetrs long. Nothing else should be included.\n\n"
         f"Article:\n{text}\n\n"
         "Summary:"
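The prompt in `build_prompt` is assembled from adjacent string literals, which Python joins with no separator. Worth noting: the literal added in this commit ends with `"point."` and no trailing `\n`, so it fuses directly with the `"Output the summary..."` literal that follows. A minimal sketch of the effect (the `instructions` value here is hypothetical, not the app's actual text):

```python
instructions = "Summarize the article in one sentence."  # hypothetical value

prompt = (
    f"{instructions}\n"
    "Do not add opinions, commentary, or filler phrases.\n"
    # no trailing "\n" here, so this literal fuses with the next one
    "Get straight to the point."
    "Output the summary sentence only.\n\n"
)
```

Whether that fused sentence boundary matters depends on the model; adding `\n` at the end of each literal keeps the instruction lines separate.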
scripts/clean.py
CHANGED

@@ -9,14 +9,22 @@ import os
 def run_script(script_path):
     subprocess.run(["python", script_path], cwd=os.path.dirname(__file__))

-# Run both cleaning scripts in parallel for speed
 t1 = threading.Thread(target=run_script, args=("cleaners/clean_ms.py",))
 t2 = threading.Thread(target=run_script, args=("cleaners/clean_ds.py",))
+t3 = threading.Thread(target=run_script, args=("cleaners/clean_msm.py",))
+t4 = threading.Thread(target=run_script, args=("cleaners/clean_qmsum.py",))
+t5 = threading.Thread(target=run_script, args=("cleaners/clean_squality.py",))

 t1.start()
 t2.start()
+t3.start()
+t4.start()
+t5.start()

 t1.join()
 t2.join()
+t3.join()
+t4.join()
+t5.join()

 print("All cleaning scripts completed.")
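With five cleaners, the one-thread-variable-per-script pattern above starts to repeat itself; a `ThreadPoolExecutor` expresses the same "start all, wait for all" logic and scales as more cleaners are added. A sketch under the assumption that each script is launched the same way as `run_script` (the `run_all` helper and the runner below are illustrative, not part of the repo):

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(scripts, runner):
    # one worker per script; leaving the with-block waits for every task
    with ThreadPoolExecutor(max_workers=len(scripts)) as pool:
        list(pool.map(runner, scripts))

# illustrative runner that just records each script name instead of spawning it
done = []
run_all(["cleaners/clean_ms.py", "cleaners/clean_ds.py", "cleaners/clean_msm.py"],
        done.append)
```

In clean.py itself, `runner` would be `run_script` and the list would name all five cleaner paths.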
scripts/cleaners/clean_msm.py
ADDED

@@ -0,0 +1,52 @@
+"""
+RANDOMLY Takes 10,000 lines from ../raw_data/raw_data_msmarco_train.csv, 1,000 lines from ../raw_data/raw_data_msmarco_val.csv. Then converts each one to JSONL.
+
+"""
+import random
+import json
+import csv
+import os
+
+def reservoir_sample_csv(file_path, k):
+    rows = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            rows.append(row)
+    if len(rows) <= k:
+        return rows
+    return random.sample(rows, k)
+
+def write_jsonl(rows, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, row in enumerate(rows):
+            new_data = {
+                "id": i,
+                "original_source": "MSMarco",
+                "query": row.get("query", ""),
+                "answers": row.get("answers", ""),
+                "passage": row.get("finalpassage", "")
+            }
+            json.dump(new_data, f, indent=2)
+            f.write('\n')
+
+print("Cleaning MSMarco dataset...")
+
+ta = '../raw_data/raw_data_msmarco_train.csv'
+tb = '../raw_data/raw_data_msmarco_val.csv'
+
+train_loc = '../clean1/msm/msmarco_train_10k.jsonl'
+test_loc = '../clean1/msm/msmarco_val_1k.jsonl'
+
+print("Sampling rows from raw data CSV files...")
+
+train_rows = reservoir_sample_csv(ta, 10000)
+test_rows = reservoir_sample_csv(tb, 1000)
+
+print("Collected Samples. Writing to JSONL files...")
+
+write_jsonl(train_rows, train_loc)
+write_jsonl(test_rows, test_loc)
+
+print("Done")
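One caveat about `write_jsonl` as committed: `json.dump(new_data, f, indent=2)` pretty-prints each record across several lines, so the output files are not strict one-object-per-line JSONL, and any consumer that calls `json.loads` on each line will fail. A sketch of the usual JSONL convention, written to an in-memory buffer with hypothetical records:

```python
import io
import json

rows = [{"id": 0, "query": "q0"}, {"id": 1, "query": "q1"}]  # hypothetical records

buf = io.StringIO()
for row in rows:
    # no indent argument: the whole record stays on a single line
    buf.write(json.dumps(row) + "\n")

# line-by-line parsing round-trips every record
parsed = [json.loads(line) for line in buf.getvalue().splitlines()]
```

Dropping `indent=2` in the cleaners (i.e. `json.dump(new_data, f)`) would give the same one-line-per-record output.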
scripts/cleaners/clean_qmsum.py
ADDED

@@ -0,0 +1,65 @@
+"""
+RANDOMLY Takes 10,000 lines from ../raw_data/raw_qmsum_train.jsonl, 1,000 lines from ../raw_data/raw_qmsum_test.jsonl, and 1,000 lines from ../raw_data/raw_qmsum_val.jsonl. Then converts each one to simplified JSONL.
+
+"""
+import random
+import json
+import os
+
+def reservoir_sample(file_path, k):
+    reservoir = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i < k:
+                reservoir.append(line.strip())
+            else:
+                j = random.randint(0, i)
+                if j < k:
+                    reservoir[j] = line.strip()
+    return reservoir
+
+def write_jsonl(lines, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, line in enumerate(lines):
+            data = json.loads(line)
+            # Extract first general query and answer if available
+            general_query = ""
+            general_answer = ""
+            if "general_query_list" in data and len(data["general_query_list"]) > 0:
+                general_query = data["general_query_list"][0].get("query", "")
+                general_answer = data["general_query_list"][0].get("answer", "")
+
+            new_data = {
+                "id": i,
+                "original_source": "QMSum",
+                "general_query": general_query,
+                "general_answer": general_answer,
+                "topic_list": data.get("topic_list", [])
+            }
+            json.dump(new_data, f, indent=2)
+            f.write('\n')
+
+print("Cleaning QMSum dataset...")
+
+ta = '../raw_data/raw_qmsum_train.jsonl'
+tb = '../raw_data/raw_qmsum_test.jsonl'
+vc = '../raw_data/raw_qmsum_val.jsonl'
+
+train_loc = '../clean1/qmsum/qmsum_train_10k.jsonl'
+test_loc = '../clean1/qmsum/qmsum_test_1k.jsonl'
+val_loc = '../clean1/qmsum/qmsum_val_1k.jsonl'
+
+print("Sampling lines from raw data files...")
+
+train_lines = reservoir_sample(ta, 10000)
+test_lines = reservoir_sample(tb, 1000)
+val_lines = reservoir_sample(vc, 1000)
+
+print("Collected Samples. Writing to JSONL files...")
+
+write_jsonl(train_lines, train_loc)
+write_jsonl(test_lines, test_loc)
+write_jsonl(val_lines, val_loc)
+
+print("Done")
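Unlike `reservoir_sample_csv` in clean_msm.py, which reads every row into memory before calling `random.sample`, `reservoir_sample` here is reservoir sampling proper (Algorithm R): a single pass that keeps at most k items in memory, with each input line ending up in the sample with probability k/n. A standalone sketch of the same routine over any iterable of lines:

```python
import random

def reservoir_sample_lines(lines, k):
    # Algorithm R: after seeing n items, each one survives with probability k/n
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)          # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)        # uniform index over everything seen so far
            if j < k:
                reservoir[j] = line         # replace a reservoir slot with probability k/(i+1)
    return reservoir

random.seed(0)  # deterministic for the example
sample = reservoir_sample_lines([str(n) for n in range(1000)], 10)
```

When the input has fewer than k lines, the loop never reaches the replacement branch and the whole input is returned, which matches the cleaner's behavior.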
scripts/cleaners/clean_squality.py
ADDED

@@ -0,0 +1,65 @@
+"""
+RANDOMLY Takes 10,000 lines from ../raw_data/raw_squality_train.jsonl, 1,000 lines from ../raw_data/raw_squality_test.jsonl, and 1,000 lines from ../raw_data/raw_squality_val.jsonl. Then converts each one to simplified JSONL.
+
+"""
+import random
+import json
+import os
+
+def reservoir_sample(file_path, k):
+    reservoir = []
+    with open(file_path, 'r', encoding='utf-8') as f:
+        for i, line in enumerate(f):
+            if i < k:
+                reservoir.append(line.strip())
+            else:
+                j = random.randint(0, i)
+                if j < k:
+                    reservoir[j] = line.strip()
+    return reservoir
+
+def write_jsonl(lines, output_path):
+    os.makedirs(os.path.dirname(output_path), exist_ok=True)
+    with open(output_path, 'w', encoding='utf-8') as f:
+        for i, line in enumerate(lines):
+            data = json.loads(line)
+            # Extract key fields from the data
+            source_type = data.get("source_type", "")
+            query_synthesized = data.get("query_synthesized", "")
+            summary = data.get("summary", "")
+            document = data.get("document", "")
+
+            new_data = {
+                "id": i,
+                "original_source": "SQuALITY",
+                "source_type": source_type,
+                "query": query_synthesized,
+                "summary": summary,
+                "document": document[:500] if document else ""  # Truncate long documents
+            }
+            json.dump(new_data, f, indent=2)
+            f.write('\n')
+
+print("Cleaning SQuALITY dataset...")
+
+ta = '../raw_data/raw_squality_train.jsonl'
+tb = '../raw_data/raw_squality_test.jsonl'
+vc = '../raw_data/raw_squality_val.jsonl'
+
+train_loc = '../clean1/squality/squality_train_10k.jsonl'
+test_loc = '../clean1/squality/squality_test_1k.jsonl'
+val_loc = '../clean1/squality/squality_val_1k.jsonl'
+
+print("Sampling lines from raw data files...")
+
+train_lines = reservoir_sample(ta, 10000)
+test_lines = reservoir_sample(tb, 1000)
+val_lines = reservoir_sample(vc, 1000)
+
+print("Collected Samples. Writing to JSONL files...")
+
+write_jsonl(train_lines, train_loc)
+write_jsonl(test_lines, test_loc)
+write_jsonl(val_lines, val_loc)
+
+print("Done")
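The per-record mapping inside the SQuALITY cleaner's `write_jsonl` (defaults for absent keys, 500-character document truncation) can be pulled into a small helper, which makes those behaviors easy to check in isolation. A sketch mirroring the cleaner's field names; the `simplify_record` name and `max_doc` parameter are illustrative, not part of the repo:

```python
def simplify_record(i, data, max_doc=500):
    # mirrors the cleaner's mapping; absent keys fall back to ""
    document = data.get("document", "")
    return {
        "id": i,
        "original_source": "SQuALITY",
        "source_type": data.get("source_type", ""),
        "query": data.get("query_synthesized", ""),
        "summary": data.get("summary", ""),
        "document": document[:max_doc] if document else "",  # truncate long documents
    }

rec = simplify_record(0, {"document": "x" * 1000, "summary": "short"})
```

Given that this commit message says tuning the data is next, factoring the mapping out this way would let each cleaner's schema be adjusted and tested without rerunning the sampling.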