compendious committed on
Commit 830b470
· 1 Parent(s): 81760e6

data stuff
Files changed (5)

1. .github/README.md +86 -45
2. Dockerfile +43 -0
3. README.md +87 -43
4. backend/app.py +2 -0
5. scripts/pull.py +32 -0
.github/README.md CHANGED
@@ -1,7 +1,5 @@
 # Précis
 
-<!-- This version of the README is created just for HuggingFace to work -->
-
 A system for compressing long-form content into clear, structured summaries. Précis is designed for videos, articles, and papers. Paste a YouTube link, drop in an article, or upload a text file. Précis pulls the key facts into a single sentence using a local LLM via [Ollama](https://ollama.com).
 
 ## Features
@@ -36,7 +34,32 @@ All `/summarize/*` endpoints accept an optional `model` field to override the de
 
 ### Run the Fine-Tuning
 
-Follow the scripts in `scripts/`, using any model you prefer. This project has been primarily tested with phi4-mini (from Microsoft) and Qwen 3-4b (from Alibaba) (`ollama pull qwen3:4b` to pull it).
+Follow the scripts in `scripts/`, using any model you prefer. This project has been primarily tested with phi4-mini (from Microsoft) and Qwen 3-4b (from Alibaba).
+
+You can pull the raw models with:
+
+```bash
+ollama pull phi4-mini:latest
+ollama pull qwen3:4b
+# And any other models you may want
+```
+
+<!--
+You can also download the fine-tuned versions directly from my HuggingFace space by running the following script:
+
+```bash
+
+```
+-->
+
+### Test the Quality of the Fine-Tuning
+
+Run the following script on the `test` split to get a sense of how accurately the model summarizes the input. The script uses the BERTScore metric, which compares contextual token embeddings of the generated summary with those of the reference summary, to produce a score out of 1.0, where higher is better. BERTScore suits this task because it checks that the generated summary captures the same key facts as the reference summary without penalizing different wording.
+
+```bash
+# Make sure you have the appropriate libraries installed (see requirements.txt and the instructions for running the backend).
+python -m scripts.test --model phi4-mini:latest
+```
 
 ### Start the Backend
 
@@ -59,70 +82,87 @@ npm run dev
 
 Runs on `http://localhost:5173`.
 
-<!-- ## Data -->
+## Data
 
-<!-- Later, for fine-tuning data details -->
+<!-- markdownlint-disable MD033 -->
 
-<!-- Interview Dataset -->
-<!--
+References for datasets/papers used in this project (with BibTeX available if you need to cite them formally).
+
+### MediaSum (Interview Summarization)
+
+Zhu, C., Liu, Y., Mei, J., & Zeng, M. (2021). *MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization*. arXiv:2103.06410. [https://arxiv.org/abs/2103.06410](https://arxiv.org/abs/2103.06410)
+
+<details>
+<summary>BibTeX</summary>
+
+```bibtex
 @article{zhu2021mediasum,
-  title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
-  author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
-  journal={arXiv preprint arXiv:2103.06410},
-  year={2021}
-}
+  title = {MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
+  author = {Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
+  journal = {arXiv preprint arXiv:2103.06410},
+  year = {2021}
+}
+```
 
--->
+</details>
 
-<!--------------------------------------------------------------------------------------------------->
+### DialogSum (Dialogue Summarization)
 
-<!--
+Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenario Dialogue Summarization Dataset*. Findings of ACL-IJCNLP 2021. [https://aclanthology.org/2021.findings-acl.449](https://aclanthology.org/2021.findings-acl.449)
+
+<details>
+
+<summary>BibTeX</summary>
+
+```bibtex
 @inproceedings{chen-etal-2021-dialogsum,
-  title = "{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset",
-  author = "Chen, Yulong and
-    Liu, Yang and
-    Chen, Liang and
-    Zhang, Yue",
-  booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
-  month = aug,
-  year = "2021",
-  address = "Online",
-  publisher = "Association for Computational Linguistics",
-  url = "https://aclanthology.org/2021.findings-acl.449",
-  doi = "10.18653/v1/2021.findings-acl.449",
-  pages = "5062--5074",
+  title = {{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset},
+  author = {Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue},
+  booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
+  month = aug,
+  year = {2021},
+  address = {Online},
+  publisher = {Association for Computational Linguistics},
+  url = {https://aclanthology.org/2021.findings-acl.449},
+  doi = {10.18653/v1/2021.findings-acl.449},
+  pages = {5062--5074}
 }
+```
 
--->
+</details>
 
-<!------------------------------------------------------------------------------------------------->
+### SQuALITY (Long-Document QA)
 
-<!-- "Single question followed by an answer" dataset -->
+Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)
 
-<!--
+<details>
+
+<summary>BibTeX</summary>
+
+```bibtex
 @article{wang2022squality,
   title = {SQuALITY: Building a Long-Document Summarization Dataset the Hard Way},
   author = {Wang, Alex and Pang, Richard Yuanzhe and Chen, Angelica and Phang, Jason and Bowman, Samuel R.},
   journal = {arXiv preprint arXiv:2205.11465},
   year = {2022},
   archivePrefix = {arXiv},
   eprint = {2205.11465},
   primaryClass = {cs.CL},
   doi = {10.48550/arXiv.2205.11465},
   url = {https://doi.org/10.48550/arXiv.2205.11465}
 }
+```
 
--->
+</details>
 
-<!------------------------------------------------------------------------------------------------->
+### MS MARCO (Concise QA)
 
-<!-- High Quality Query-Answer (concise) examples -->
+Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.
 
-<!--
+<details>
+<summary>BibTeX</summary>
 
+```bibtex
 @inproceedings{nguyen2016msmarco,
   title = {MS MARCO: A Human Generated Machine Reading Comprehension Dataset},
   author = {Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
@@ -130,8 +170,9 @@ Runs on `http://localhost:5173`.
   year = {2016},
   publisher = {CEUR-WS.org}
 }
+```
 
--->
+</details>
 
 ## License
 
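The quality-testing section above leans on BERTScore's greedy token matching. As a rough illustration of how that score is assembled (toy vectors standing in for contextual embeddings; this is a sketch of the idea, not the `bert-score` package itself):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_bertscore(cand_vecs, ref_vecs):
    """Toy BERTScore: greedy-match each token embedding to its most similar counterpart."""
    # Precision: each candidate token matched to its best reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token matched to its best candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Identical "embeddings" score a perfect 1.0; an extra unmatched token lowers precision.
p, r, f1 = greedy_bertscore([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (0.0, 1.0)])
assert abs(f1 - 1.0) < 1e-9
p, r, f1 = greedy_bertscore([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0)])
# precision = 0.5, recall = 1.0, f1 = 2/3
```

Because matching is per-token in embedding space, a paraphrase of the reference can still score well, which is why the README prefers it over exact-wording metrics.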
Dockerfile ADDED
@@ -0,0 +1,43 @@
+FROM ubuntu:22.04
+
+# Set environment variables
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    NODE_ENV=production
+
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    python3.11 \
+    python3.11-venv \
+    python3-pip \
+    nodejs \
+    npm \
+    git \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set Python 3.11 as default
+RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1 && \
+    update-alternatives --install /usr/bin/pip pip /usr/bin/pip3 1
+
+WORKDIR /app
+
+# Copy requirements and install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Copy frontend
+COPY frontend ./frontend
+WORKDIR /app/frontend
+RUN npm install && npm run build
+
+# Copy backend
+WORKDIR /app
+COPY backend ./backend
+COPY backend/app.py .
+
+# Expose port (HF Spaces uses 7860)
+EXPOSE 7860
+
+# Start the FastAPI server
+CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -9,6 +9,7 @@ python_version: "3.11"
 app_file: app.py
 pinned: false
 ---
+<!-- markdownlint-disable MD025 -->
 
 # Précis
 
@@ -46,7 +47,32 @@ All `/summarize/*` endpoints accept an optional `model` field to override the de
 
 ### Run the Fine-Tuning
 
-Follow the scripts in `scripts/`, using any model you prefer. This project has been primarily tested with phi4-mini (from Microsoft) and Qwen 3-4b (from Alibaba) (`ollama pull qwen3:4b` to pull it).
+Follow the scripts in `scripts/`, using any model you prefer. This project has been primarily tested with phi4-mini (from Microsoft) and Qwen 3-4b (from Alibaba).
+
+You can pull the raw models with:
+
+```bash
+ollama pull phi4-mini:latest
+ollama pull qwen3:4b
+# And any other models you may want
+```
+
+<!--
+You can also download the fine-tuned versions directly from my HuggingFace space by running the following script:
+
+```bash
+
+```
+-->
+
+### Test the Quality of the Fine-Tuning
+
+Run the following script on the `test` split to get a sense of how accurately the model summarizes the input. The script uses the BERTScore metric, which compares contextual token embeddings of the generated summary with those of the reference summary, to produce a score out of 1.0, where higher is better. BERTScore suits this task because it checks that the generated summary captures the same key facts as the reference summary without penalizing different wording.
+
+```bash
+# Make sure you have the appropriate libraries installed (see requirements.txt and the instructions for running the backend).
+python -m scripts.test --model phi4-mini:latest
+```
 
 ### Start the Backend
 
@@ -69,70 +95,87 @@ npm run dev
 
 Runs on `http://localhost:5173`.
 
-<!-- ## Data -->
+## Data
 
-<!-- Later, for fine-tuning data details -->
+<!-- markdownlint-disable MD033 -->
 
-<!-- Interview Dataset -->
-<!--
+References for datasets/papers used in this project (with BibTeX available if you need to cite them formally).
+
+### MediaSum (Interview Summarization)
+
+Zhu, C., Liu, Y., Mei, J., & Zeng, M. (2021). *MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization*. arXiv:2103.06410. [https://arxiv.org/abs/2103.06410](https://arxiv.org/abs/2103.06410)
+
+<details>
+<summary>BibTeX</summary>
+
+```bibtex
 @article{zhu2021mediasum,
-  title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
-  author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
-  journal={arXiv preprint arXiv:2103.06410},
-  year={2021}
-}
+  title = {MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
+  author = {Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
+  journal = {arXiv preprint arXiv:2103.06410},
+  year = {2021}
+}
+```
 
--->
+</details>
 
-<!--------------------------------------------------------------------------------------------------->
+### DialogSum (Dialogue Summarization)
 
-<!--
+Chen, Y., Liu, Y., Chen, L., & Zhang, Y. (2021). *DialogSum: A Real-Life Scenario Dialogue Summarization Dataset*. Findings of ACL-IJCNLP 2021. [https://aclanthology.org/2021.findings-acl.449](https://aclanthology.org/2021.findings-acl.449)
+
+<details>
+
+<summary>BibTeX</summary>
+
+```bibtex
 @inproceedings{chen-etal-2021-dialogsum,
-  title = "{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset",
-  author = "Chen, Yulong and
-    Liu, Yang and
-    Chen, Liang and
-    Zhang, Yue",
-  booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
-  month = aug,
-  year = "2021",
-  address = "Online",
-  publisher = "Association for Computational Linguistics",
-  url = "https://aclanthology.org/2021.findings-acl.449",
-  doi = "10.18653/v1/2021.findings-acl.449",
-  pages = "5062--5074",
+  title = {{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset},
+  author = {Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue},
+  booktitle = {Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},
+  month = aug,
+  year = {2021},
+  address = {Online},
+  publisher = {Association for Computational Linguistics},
+  url = {https://aclanthology.org/2021.findings-acl.449},
+  doi = {10.18653/v1/2021.findings-acl.449},
+  pages = {5062--5074}
 }
+```
 
--->
+</details>
 
-<!------------------------------------------------------------------------------------------------->
+### SQuALITY (Long-Document QA)
 
-<!-- "Single question followed by an answer" dataset -->
+Wang, A., Pang, R. Y., Chen, A., Phang, J., & Bowman, S. R. (2022). *SQuALITY: Building a Long-Document Summarization Dataset the Hard Way*. arXiv:2205.11465. [https://arxiv.org/abs/2205.11465](https://arxiv.org/abs/2205.11465)
 
-<!--
+<details>
+
+<summary>BibTeX</summary>
+
+```bibtex
 @article{wang2022squality,
   title = {SQuALITY: Building a Long-Document Summarization Dataset the Hard Way},
   author = {Wang, Alex and Pang, Richard Yuanzhe and Chen, Angelica and Phang, Jason and Bowman, Samuel R.},
   journal = {arXiv preprint arXiv:2205.11465},
   year = {2022},
   archivePrefix = {arXiv},
   eprint = {2205.11465},
   primaryClass = {cs.CL},
   doi = {10.48550/arXiv.2205.11465},
   url = {https://doi.org/10.48550/arXiv.2205.11465}
 }
+```
 
--->
+</details>
 
-<!------------------------------------------------------------------------------------------------->
+### MS MARCO (Concise QA)
 
-<!-- High Quality Query-Answer (concise) examples -->
+Nguyen, T., Rosenberg, M., Song, X., Gao, J., Tiwary, S., Majumder, R., & Deng, L. (2016). *MS MARCO: A Human Generated Machine Reading Comprehension Dataset*.
 
-<!--
+<details>
+<summary>BibTeX</summary>
 
+```bibtex
 @inproceedings{nguyen2016msmarco,
   title = {MS MARCO: A Human Generated Machine Reading Comprehension Dataset},
   author = {Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li},
@@ -140,8 +183,9 @@ Runs on `http://localhost:5173`.
   year = {2016},
   publisher = {CEUR-WS.org}
 }
+```
 
--->
+</details>
 
 ## License
 
backend/app.py CHANGED
@@ -4,6 +4,7 @@ from typing import Optional
 import httpx
 from fastapi import FastAPI, HTTPException, UploadFile, File, Header, Request
 from fastapi.middleware.cors import CORSMiddleware
+from fastapi.staticfiles import StaticFiles
 
 from config import (
     OLLAMA_BASE_URL,
@@ -31,6 +32,7 @@ app.add_middleware(
     allow_headers=["Content-Type", "X-API-Key"],
 )
 
+app.mount("/", StaticFiles(directory="frontend/dist", html=True), name="static")
 
 def verify_api_key(x_api_key: Optional[str] = Header(default=None, alias="X-API-Key")):
     if not API_KEY:
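One caveat worth noting about the new `app.mount("/", StaticFiles(...))` line: Starlette matches routes in registration order, so a catch-all mount at the root that is registered before the API routes can shadow them. A stdlib sketch of first-match routing illustrates the ordering concern (the names here are illustrative, not the FastAPI API):

```python
# First-match routing, as Starlette does it: the earliest registered
# pattern that matches the path wins.
routes = []

def add_route(prefix, name):
    routes.append((prefix, name))

def resolve(path):
    for prefix, name in routes:
        if path.startswith(prefix):
            return name
    return "404"

# Registering the static catch-all first shadows the API route:
add_route("/", "static")
add_route("/summarize", "api")
print(resolve("/summarize/url"))  # -> "static", not "api"
```

If the API endpoints stop responding after this change, moving the `app.mount(...)` call below the route definitions is the usual fix.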
scripts/pull.py ADDED
@@ -0,0 +1,32 @@
+"""
+Pulls raw samples of 10k each from the [cited in README] datasets used in this project.
+In the final version of the training data, a lot of the example outputs are tuned, and they are all merged into a single
+
+HuggingFace seems to have disabled this functionality.
+Currently trying to see how to work around it.
+"""
+
+import json
+
+from datasets import load_dataset
+
+targets = {
+    "mediasum": ("nbroad/mediasum", None, "train"),  # Parquet-exported version, no loader script needed
+    "dialogsum": ("knkarthick/dialogsum", None, "train"),  # CSV on HF
+    "squality": ("mattercalm/squality", None, "train"),  # assumed generic supported format
+    "msmarco_corpus": ("Hyukkyu/beir-msmarco", "corpus", "train"),  # Parquet-migrated version
+}
+
+for name, (repo, config, split) in targets.items():
+    # Load with the generic loader (no trust_remote_code)
+    if config:
+        ds = load_dataset(repo, config, split=split)
+    else:
+        ds = load_dataset(repo, split=split)
+
+    # Take the first 10k after an in-memory shuffle
+    small = ds.shuffle(seed=42).select(range(10_000))
+
+    out = f"{name}_10k.jsonl"
+    with open(out, "w", encoding="utf-8") as f:
+        for example in small:
+            f.write(json.dumps(example) + "\n")
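The per-dataset `*_10k.jsonl` files written by the script can be read back with the standard library alone. A minimal round-trip sketch (the file name and fields here are illustrative, not the real dataset schema):

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Load a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip a tiny sample the same way scripts/pull.py writes its output:
rows = [{"id": 1, "summary": "short"}, {"id": 2, "summary": "shorter"}]
path = os.path.join(tempfile.gettempdir(), "sample_10k.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

assert read_jsonl(path) == rows
```

One object per line keeps the files streamable, so downstream fine-tuning code can process them without loading a whole dataset into memory.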