Spaces · Commit 830b470 (build error)
Parent(s): 81760e6
Commit message: "data stuff"

Files changed:
- .github/README.md (+86, -45)
- Dockerfile (+43, -0)
- README.md (+87, -43)
- backend/app.py (+2, -0)
- scripts/pull.py (+32, -0)
.github/README.md
CHANGED

@@ -1,7 +1,5 @@ (removes the comment `<!-- This version of the README is created just for HuggingFace to work -->` from below the title, leaving):

# Précis

A system for compressing long-form content into clear, structured summaries. Précis is designed for videos, articles, and papers. Paste a YouTube link, drop in an article, or upload a text file. Précis pulls the key facts into a single sentence using a local LLM via [Ollama](https://ollama.com).

## Features

@@ -36,7 +34,32 @@ All `/summarize/*` endpoints accept an optional `model` field to override the default (expands the fine-tuning instructions):

### Run the Fine-Tuning

Follow the scripts in `scripts/`, using any model you prefer. This project has been primarily tested with phi4-mini (from Microsoft) and Qwen3-4B (from Alibaba).

You can pull the raw models with:

```bash
ollama pull phi4-mini:latest
ollama pull qwen3:4b
# And any other models you may want
```

<!--
You can also download the fine-tuned versions directly from HuggingFace by running the following script, which downloads the fine-tuned models from my HuggingFace space:

```bash

```
-->

### Test the Quality of the Fine-Tuning

Run the following script on the `test` split to get a sense of how accurately the model summarizes the context. The script uses the BERTScore metric (which compares contextual embeddings of the generated summary with those of the reference summary) to give a score out of 1.0, where higher is better. BERTScore is the most appropriate metric for this task, since we want the generated summary to capture the same key facts as the reference summary without penalizing different wording.

```bash
# Make sure you have the appropriate libraries installed (see requirements.txt and the instructions for running the backend).
python -m scripts.test --model phi4-mini:latest
```

### Start the Backend

@@ -59,70 +82,87 @@ Runs on `http://localhost:5173`. (replaces a commented-out block of raw BibTeX entries with a structured "Data" section):

## Data

<!-- markdownlint-disable MD033 -->

References for datasets/papers used in this project (with BibTeX available if you need to cite them formally).

### MediaSum (Interview Summarization)

Zhu, C., Liu, Y., Mei, J., & Zeng, M. (2021). *MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization*. arXiv:2103.06410. [https://arxiv.org/abs/2103.06410](https://arxiv.org/abs/2103.06410)

<details>
<summary>BibTeX</summary>

```bibtex
@article{zhu2021mediasum,
  title   = {MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
  author  = {Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
  journal = {arXiv preprint arXiv:2103.06410},
  year    = {2021}
}
```

</details>

### DialogSum (Dialogue Summarization)

Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenario Dialogue Summarization Dataset*. Findings of ACL-IJCNLP 2021. [https://aclanthology.org/2021.findings-acl.449](https://aclanthology.org/2021.findings-acl.449)

<details>
<summary>BibTeX</summary>

```bibtex
@inproceedings{chen-etal-2021-dialogsum,
  title     = {{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset},
  author    = {Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue},
  booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
  month     = aug,
  year      = {2021},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2021.findings-acl.449},
  doi       = {10.18653/v1/2021.findings-acl.449},
  pages     = {5062--5074}
}
```

</details>

### SQuALITY (Long-Document QA)

Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)

<details>
<summary>BibTeX</summary>

```bibtex
@article{wang2022squality,
  title         = {SQuALITY: Building a Long-Document Summarization Dataset the Hard Way},
  author        = {Wang, Alex and Pang, Richard Yuanzhe and Chen, Angelica and Phang, Jason and Bowman, Samuel R.},
  journal       = {arXiv preprint arXiv:2205.11465},
  year          = {2022},
  archivePrefix = {arXiv},
  eprint        = {2205.11465},
  primaryClass  = {cs.CL},
  doi           = {10.48550/arXiv.2205.11465},
  url           = {https://doi.org/10.48550/arXiv.2205.11465}
}
```

</details>

### MS MARCO (Concise QA)

Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.

<details>
<summary>BibTeX</summary>

```bibtex
@inproceedings{nguyen2016msmarco,
  title     = {MS MARCO: A Human Generated Machine Reading Comprehension Dataset},
  author    = {Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
  year      = {2016},
  publisher = {CEUR-WS.org}
}
```

</details>

## License
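The BERTScore metric used in the testing section can be pictured as greedy cosine matching between token embeddings, with precision and recall combined into an F1 score. A toy sketch with stand-in unit vectors (the real metric uses contextual BERT embeddings via the `bert-score` package; `greedy_bertscore_f1` is a hypothetical helper, not part of this repo):

```python
import numpy as np

def greedy_bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Toy BERTScore-style F1 over rows of unit-norm token embeddings."""
    sim = cand_emb @ ref_emb.T          # cosine similarity of every candidate/reference token pair
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return float(2 * precision * recall / (precision + recall))

# Identical embeddings give a perfect score
emb = np.eye(3)
print(greedy_bertscore_f1(emb, emb))  # → 1.0
```

Because matching happens in embedding space rather than on surface strings, paraphrases that land near the reference tokens still score highly, which is why the README prefers it over exact-overlap metrics.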
Dockerfile
ADDED

@@ -0,0 +1,43 @@

```dockerfile
FROM ubuntu:22.04

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    NODE_ENV=production

# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3.11 \
    python3.11-venv \
    python3-pip \
    nodejs \
    npm \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set Python 3.11 as default
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1

WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy frontend
COPY frontend ./frontend
WORKDIR /app/frontend
RUN npm install && npm run build

# Copy backend
WORKDIR /app
COPY backend ./backend
COPY backend/app.py .

# Expose port (HF Spaces uses 7860)
EXPOSE 7860

# Start the FastAPI server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
```
README.md
CHANGED

@@ -9,6 +9,7 @@ python_version: "3.11" (adds a markdownlint directive after the YAML front matter):

app_file: app.py
pinned: false
---
<!-- markdownlint-disable MD025 -->

# Précis

The remaining hunks (@@ -46,7 +47,32 @@, @@ -69,70 +95,87 @@, and @@ -140,8 +183,9 @@) mirror the changes made to .github/README.md above: the expanded fine-tuning instructions, the new "Test the Quality of the Fine-Tuning" section, and the structured "## Data" references.
backend/app.py
CHANGED

@@ -4,6 +4,7 @@ from typing import Optional

 import httpx
 from fastapi import FastAPI, HTTPException, UploadFile, File, Header, Request
 from fastapi.middleware.cors import CORSMiddleware
+from fastapi.staticfiles import StaticFiles

 from config import (
     OLLAMA_BASE_URL,

@@ -31,6 +32,7 @@ app.add_middleware(

     allow_headers=["Content-Type", "X-API-Key"],
 )

+app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="static")

 def verify_api_key(x_api_key: Optional[str] = Header(default=None, alias="X-API-Key")):
     if not API_KEY:
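The `app.mount("/", StaticFiles(...))` line added above depends on registration order: Starlette checks routes in the order they are attached, so a catch-all mount at `/` competes with the API routes. A stand-alone, pure-Python sketch of first-match dispatch (not FastAPI code; the names here are illustrative only):

```python
# Minimal first-match router: (prefix, handler) pairs checked in registration order
routes = []

def mount(prefix, handler):
    routes.append((prefix, handler))

def dispatch(path):
    for prefix, handler in routes:
        if path.startswith(prefix):
            return handler(path)
    return "404"

mount("/summarize", lambda p: "api")  # specific API routes registered first
mount("/", lambda p: "static")        # catch-all static mount registered last
print(dispatch("/summarize/text"))    # → api
print(dispatch("/index.html"))        # → static
```

If the catch-all were registered first, every request would hit the static handler, which is worth keeping in mind when a `/` mount and API routes share one app.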
scripts/pull.py
ADDED

@@ -0,0 +1,32 @@

```python
"""
Pulls raw samples of 10k each from the [cited in README] datasets used in this project.
In the final version of the training data, a lot of the example outputs are tuned, and they are all merged into a single

HuggingFace seems to have disabled this functionality.
Currently trying to see how to work around it.
"""

import json

from datasets import load_dataset

targets = {
    "mediasum": ("nbroad/mediasum", None, "train"),                 # Parquet-exported version, no loader script needed
    "dialogsum": ("knkarthick/dialogsum", None, "train"),           # CSV on HF
    "squality": ("mattercalm/squality", None, "train"),             # assumed generic supported format
    "msmarco_corpus": ("Hyukkyu/beir-msmarco", "corpus", "train"),  # Parquet-migrated version
}

for name, (repo, config, split) in targets.items():
    # Load with the generic loader (no trust_remote_code)
    if config:
        ds = load_dataset(repo, config, split=split)
    else:
        ds = load_dataset(repo, split=split)

    # Take the first 10k after an in-memory shuffle
    small = ds.shuffle(seed=42).select(range(10_000))

    out = f"{name}_10k.jsonl"
    with open(out, "w", encoding="utf-8") as f:
        for example in small:
            f.write(json.dumps(example) + "\n")
```
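The docstring above mentions that the per-dataset samples are eventually merged into a single training file. A minimal sketch of such a merge step (the `merge_jsonl` helper and the `source` provenance field are assumptions, not part of the repo; filenames follow the `{name}_10k.jsonl` pattern the script writes):

```python
import json
import os

def merge_jsonl(paths, out_path):
    """Concatenate several JSONL files into one, tagging each record with its source dataset."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            # "mediasum_10k.jsonl" -> "mediasum"
            source = os.path.basename(path).rsplit("_", 1)[0]
            with open(path, encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    record["source"] = source  # keep provenance for later filtering
                    out.write(json.dumps(record) + "\n")

# Example (assuming the pull script has already run):
# merge_jsonl(["mediasum_10k.jsonl", "dialogsum_10k.jsonl"], "train_all.jsonl")
```

Keeping a `source` field per record makes it easy to rebalance or drop one dataset later without re-pulling anything.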