algorembrant committed (verified)
Commit d2bfe97 · 1 Parent(s): a299596

Upload 12 files

Files changed (12)
  1. .gitignore +2 -0
  2. GUIDE.md +252 -0
  3. LICENSE +21 -0
  4. README.md +244 -0
  5. ai_client.py +170 -0
  6. cleaner.py +77 -0
  7. config.py +48 -0
  8. fetcher.py +200 -0
  9. main.py +353 -0
  10. pipeline.py +173 -0
  11. requirements.txt +2 -0
  12. summarizer.py +125 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
__pycache__/
.venv/
GUIDE.md ADDED
@@ -0,0 +1,252 @@
# Step-by-Step Setup and Usage Guide

Author: algorembrant

---

## Prerequisites

| Requirement       | Minimum Version | Notes                                                  |
|-------------------|-----------------|--------------------------------------------------------|
| Python            | 3.8             | 3.10+ recommended                                      |
| pip               | 21.0            |                                                        |
| Anthropic API Key | --              | Required for `clean`, `summarize`, `pipeline` commands |

You need an Anthropic API key to use the `clean`, `summarize`, and `pipeline` commands.
Obtain one at: https://console.anthropic.com

---

## Step 1 — Get the Code

**Option A: Git clone**
```bash
git clone https://github.com/algorembrant/youtube-transcript-toolkit.git
cd youtube-transcript-toolkit
```

**Option B: Download ZIP**
Download and unzip, then open a terminal inside the project folder.

---

## Step 2 — Create a Virtual Environment

**macOS / Linux**
```bash
python3 -m venv .venv
source .venv/bin/activate
```

**Windows (Command Prompt)**
```cmd
python -m venv .venv
.venv\Scripts\activate.bat
```

**Windows (PowerShell)**
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
```

You should see `(.venv)` at the start of your terminal prompt.

---

## Step 3 — Install Dependencies

```bash
pip install -r requirements.txt
```

Verify:
```bash
pip show anthropic
pip show youtube-transcript-api
```

---

## Step 4 — Set Your Anthropic API Key

**macOS / Linux (current session)**
```bash
export ANTHROPIC_API_KEY="sk-ant-your-key-here"
```

**macOS / Linux (permanent — add to shell profile)**
```bash
echo 'export ANTHROPIC_API_KEY="sk-ant-your-key-here"' >> ~/.zshrc
source ~/.zshrc
```

**Windows (Command Prompt)**
```cmd
set ANTHROPIC_API_KEY=sk-ant-your-key-here
```

**Windows (PowerShell)**
```powershell
$env:ANTHROPIC_API_KEY = "sk-ant-your-key-here"
```

**Windows (permanent via System Settings)**
1. Search "Environment Variables" in Start Menu
2. Click "Edit the system environment variables"
3. Add a new variable: `ANTHROPIC_API_KEY` = your key

The `fetch` and `list` commands do NOT require an API key.
Only `clean`, `summarize`, and `pipeline` need it.
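If you call the toolkit from your own scripts, failing fast when the key is missing beats a confusing API error mid-run. A minimal check, as an illustrative helper (not part of the toolkit itself):

```python
import os

def require_api_key() -> str:
    """Return the Anthropic key from the environment, or exit with a hint."""
    key = os.environ.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise SystemExit(
            "ANTHROPIC_API_KEY is not set. See Step 4 of this guide."
        )
    return key
```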

---

## Step 5 — Run Your First Commands

### Fetch a raw transcript (no API key needed)

```bash
python main.py fetch "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```

### See what languages are available

```bash
python main.py list dQw4w9WgXcQ
```

### Clean the transcript into paragraphs

```bash
python main.py clean dQw4w9WgXcQ
```

### Summarize the transcript

```bash
python main.py summarize dQw4w9WgXcQ -m brief
python main.py summarize dQw4w9WgXcQ -m detailed
python main.py summarize dQw4w9WgXcQ -m bullets
python main.py summarize dQw4w9WgXcQ -m outline
```

### Run the full pipeline (fetch + clean + summarize)

```bash
python main.py pipeline dQw4w9WgXcQ -m bullets
```

---

## Step 6 — Save Output to Files

### Single video — specify a file path

```bash
python main.py clean dQw4w9WgXcQ -o cleaned.txt
python main.py summarize dQw4w9WgXcQ -m detailed -o summary.txt
```

### Pipeline — specify a directory (creates 3 files per video)

```bash
python main.py pipeline dQw4w9WgXcQ -o ./output/
```

Files created:
```
./output/
  dQw4w9WgXcQ_transcript.txt
  dQw4w9WgXcQ_cleaned.txt
  dQw4w9WgXcQ_summary.txt
```

### Batch — multiple videos at once

```bash
python main.py pipeline VIDEO_ID_1 VIDEO_ID_2 VIDEO_ID_3 -o ./batch_output/
```
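The pipeline names its per-video files as `<video_id>_<stage>.txt`, as shown in Step 6 above. If you post-process batch output in your own scripts, a helper like the following reproduces that naming (the function name and layout are assumptions for illustration, not toolkit API):

```python
from pathlib import Path

def output_paths(video_id: str, out_dir: str) -> dict[str, Path]:
    """Build the three per-video output paths the pipeline writes."""
    base = Path(out_dir)
    return {
        "transcript": base / f"{video_id}_transcript.txt",
        "cleaned": base / f"{video_id}_cleaned.txt",
        "summary": base / f"{video_id}_summary.txt",
    }
```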

---

## Step 7 — Advanced Options

### Use the higher-quality model

```bash
python main.py clean dQw4w9WgXcQ --quality
python main.py summarize dQw4w9WgXcQ -m detailed --quality
```

Default model: `claude-haiku-4-5` (fast, cost-efficient)
Quality model: `claude-sonnet-4-6` (better for complex or long transcripts)

### Disable streaming (show output only after completion)

```bash
python main.py clean dQw4w9WgXcQ --no-stream
```

### Request a non-English transcript

```bash
python main.py clean dQw4w9WgXcQ -l ja     # Japanese only
python main.py clean dQw4w9WgXcQ -l es en  # Spanish, fall back to English
```

### Fetch raw transcript as SRT, JSON, or WebVTT

```bash
python main.py fetch dQw4w9WgXcQ -f srt -o captions.srt
python main.py fetch dQw4w9WgXcQ -f json -o transcript.json
python main.py fetch dQw4w9WgXcQ -f vtt -o captions.vtt
```

### Fetch with timestamps

```bash
python main.py fetch dQw4w9WgXcQ -t
python main.py pipeline dQw4w9WgXcQ -t -o ./output/
```

### Pipeline — skip individual steps

```bash
# Fetch and summarize without cleaning
python main.py pipeline dQw4w9WgXcQ --skip-clean -m bullets

# Fetch and clean without summarizing
python main.py pipeline dQw4w9WgXcQ --skip-summary
```

---

## Troubleshooting

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| `TranscriptsDisabled` error | Video owner disabled captions | Use a different video |
| `VideoUnavailable` error | Private, deleted, or region-locked | Check URL; try VPN if region-locked |
| `NoTranscriptFound` | Requested language missing | Run `list` to see available languages |
| `AuthenticationError` | API key missing or wrong | Check the `ANTHROPIC_API_KEY` env variable |
| `ModuleNotFoundError` | Dependencies not installed | Run `pip install -r requirements.txt` |
| Chunking messages in stderr | Transcript very long | Normal — multi-pass processing is automatic |
| Output cuts off mid-sentence | max_tokens limit hit | This is rare; open an issue if it occurs |

---

## Project File Reference

```
main.py           CLI entry point — all five commands
fetcher.py        YouTube direct caption API (no scraping)
cleaner.py        AI paragraph reformatter
summarizer.py     AI summarizer (4 modes)
pipeline.py       Orchestrates the full fetch -> clean -> summarize chain
ai_client.py      Anthropic API wrapper with chunking and streaming
config.py         Constants: model names, chunk size, summary modes
requirements.txt  Two dependencies
README.md         Full project documentation
GUIDE.md          This file
LICENSE           MIT License
```
LICENSE ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Rembrant Oyangoren Albeos

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md ADDED
@@ -0,0 +1,244 @@
license: mit
sdk: static
colorFrom: blue
colorTo: red
tags:
- youtube
- transcript
- api
- fetch
- clean
- summarize
- python
- tools

![Python](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat-square&logo=python&logoColor=white)
![Anthropic](https://img.shields.io/badge/Powered%20by-Anthropic%20Claude-blueviolet?style=flat-square)
![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)
![No Scraping](https://img.shields.io/badge/No%20Scraping-Direct%20API-brightgreen?style=flat-square)
![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square)
![Author](https://img.shields.io/badge/Author-algorembrant-orange?style=flat-square)

---

# YouTube Transcript Toolkit

A fast, zero-scraping command-line toolkit that fetches YouTube transcripts
directly via the caption API, then uses the Anthropic Claude API to reformat
them into clean paragraphs and produce multi-mode summaries.

No Selenium. No BeautifulSoup. No headless browsers. Two AI-powered
post-processing features built on top of direct caption API access.

---

## Architecture

```
main.py        CLI entry point — five commands (fetch, list, clean, summarize, pipeline)
fetcher.py     Direct YouTube caption API — no HTML parsing
cleaner.py     AI paragraph reformatter (Anthropic Claude)
summarizer.py  AI summarizer with 4 output modes (Anthropic Claude)
pipeline.py    Orchestrates fetch -> clean -> summarize in one pass
ai_client.py   Shared Anthropic API wrapper with chunking and streaming
config.py      Model names, limits, summary modes, defaults
```

---

## Features

- Direct caption API — transcript fetch is near-instant regardless of video length
- Paragraph Cleaner — reformats fragmented auto-captions into readable prose (no content removed)
- Summarizer — four modes: brief, detailed, bullet points, hierarchical outline
- Full pipeline — fetch + clean + summarize in a single command
- Token streaming — see AI output in real time as it generates
- Automatic chunking — handles transcripts of any length by splitting and merging
- Fast model by default (claude-haiku), quality model available via the `--quality` flag
- Batch processing — multiple video IDs/URLs in one command
- Output formats — plain text, JSON, SRT, WebVTT for the raw transcript

---

## Installation

```bash
git clone https://github.com/algorembrant/youtube-transcript-toolkit.git
cd youtube-transcript-toolkit
python -m venv .venv
source .venv/bin/activate              # Windows: .venv\Scripts\activate
pip install -r requirements.txt
export ANTHROPIC_API_KEY="sk-ant-..."  # Windows: set ANTHROPIC_API_KEY=sk-ant-...
```

---

## Commands

### fetch — raw transcript only (no AI)

```bash
python main.py fetch "https://www.youtube.com/watch?v=VIDEO_ID"
python main.py fetch VIDEO_ID -f srt -o transcript.srt
python main.py fetch VIDEO_ID -f json -o transcript.json
python main.py fetch VIDEO_ID -t        # with timestamps
python main.py fetch VIDEO_ID -l es en  # Spanish, fall back to English
```

### list — available languages

```bash
python main.py list VIDEO_ID
```

### clean — reformat into paragraphs

```bash
python main.py clean VIDEO_ID
python main.py clean VIDEO_ID -o cleaned.txt
python main.py clean VIDEO_ID --quality    # use higher-quality model
python main.py clean VIDEO_ID --no-stream  # disable live token output
```

### summarize — AI-generated summary

```bash
python main.py summarize VIDEO_ID              # brief (default)
python main.py summarize VIDEO_ID -m detailed
python main.py summarize VIDEO_ID -m bullets
python main.py summarize VIDEO_ID -m outline
python main.py summarize VIDEO_ID -m detailed --quality -o summary.txt
```

### pipeline — fetch + clean + summarize

```bash
python main.py pipeline VIDEO_ID
python main.py pipeline VIDEO_ID -m bullets -o ./output/
python main.py pipeline VIDEO_ID --skip-clean    # fetch + summarize only
python main.py pipeline VIDEO_ID --skip-summary  # fetch + clean only
python main.py pipeline ID1 ID2 ID3 -o ./batch/  # batch
```

---

## Summary Modes

| Mode | Description |
|------------|---------------------------------------------------|
| `brief` | 3-5 sentence executive summary |
| `detailed` | Multi-section prose: Overview, Key Points, etc. |
| `bullets` | Key takeaways grouped under bold thematic headers |
| `outline` | Hierarchical Roman-numeral topic outline |

---

## Model Selection

| Flag | Model Used | Best For |
|-------------|-------------------|------------------------------------|
| (default) | claude-haiku-4-5 | Speed, short-to-medium transcripts |
| `--quality` | claude-sonnet-4-6 | Long transcripts, deep summaries |

---

## CLI Reference

```
usage: main.py {fetch,list,clean,summarize,pipeline} [options] video [video ...]

commands:
  fetch      Fetch raw transcript (no AI)
  list       List available transcript languages
  clean      Fetch + AI paragraph formatting
  summarize  Fetch + AI summarization
  pipeline   Fetch + clean + summarize in one pass

shared options:
  -l, --languages LANG [LANG ...]  Language codes, in order of preference
  -o, --output PATH                Output file (single) or directory (batch)
  --quality                        Use higher-quality Claude model
  --no-stream                      Disable live token streaming

fetch / pipeline options:
  -f, --format {text,json,srt,vtt}  Raw transcript format (default: text)
  -t, --timestamps                  Add timestamps to plain-text output

clean / summarize / pipeline options:
  -m, --mode {brief,detailed,bullets,outline}  Summary mode (default: brief)

pipeline options:
  --skip-clean    Skip paragraph cleaning step
  --skip-summary  Skip summarization step
```

---

## Output Files (pipeline with -o)

When using `pipeline -o ./output/`, three files are saved per video:

```
./output/
  VIDEO_ID_transcript.txt  Raw transcript
  VIDEO_ID_cleaned.txt     Paragraph-cleaned transcript
  VIDEO_ID_summary.txt     Summary
```

---

## Chunking Strategy

Transcripts larger than 60,000 characters are automatically split into chunks
at paragraph or sentence boundaries. Each chunk is processed independently,
then the partial results are merged in a final synthesis pass. This allows
the toolkit to handle full-length lecture recordings, long-form interviews,
and documentary transcripts without hitting token limits.
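The boundary-aware split can be sketched as follows. This mirrors the `_split_into_chunks` helper in `ai_client.py` below; a small `chunk_size` would be used only for illustration, the toolkit's default is 60,000:

```python
def split_into_chunks(text: str, chunk_size: int = 60_000) -> list[str]:
    """Split text into chunks, preferring paragraph, then sentence,
    then word boundaries, with a hard split as the last resort."""
    if len(text) <= chunk_size:
        return [text]
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):
            chunks.append(text[start:])
            break
        split_at = text.rfind("\n\n", start, end)   # paragraph boundary
        if split_at == -1:
            split_at = text.rfind(". ", start, end)  # sentence boundary
        if split_at == -1:
            split_at = text.rfind(" ", start, end)   # word boundary
        if split_at == -1:
            split_at = end                           # hard split
        chunks.append(text[start:split_at + 1])
        start = split_at + 1
    return chunks
```

Concatenating the chunks recovers the original text exactly, so nothing is lost before the per-chunk AI passes.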

---

## Supported URL Formats

```
https://www.youtube.com/watch?v=VIDEO_ID
https://youtu.be/VIDEO_ID
https://www.youtube.com/shorts/VIDEO_ID
https://www.youtube.com/embed/VIDEO_ID
VIDEO_ID (raw 11-character ID)
```
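All of these forms resolve to the same 11-character ID. The extraction logic in `fetcher.py` amounts to a few regexes; a self-contained sketch:

```python
import re

# Patterns mirror fetcher.py in this repository.
_ID_PATTERNS = [
    r"(?:youtube\.com/watch\?.*v=)([a-zA-Z0-9_-]{11})",
    r"(?:youtu\.be/)([a-zA-Z0-9_-]{11})",
    r"(?:youtube\.com/shorts/)([a-zA-Z0-9_-]{11})",
    r"(?:youtube\.com/embed/)([a-zA-Z0-9_-]{11})",
]

def extract_video_id(url_or_id: str) -> str:
    """Return the 11-character video ID from any supported URL form."""
    for pattern in _ID_PATTERNS:
        match = re.search(pattern, url_or_id)
        if match:
            return match.group(1)
    # Already a bare ID?
    if re.fullmatch(r"[a-zA-Z0-9_-]{11}", url_or_id):
        return url_or_id
    raise ValueError(f"Cannot extract a YouTube video ID from {url_or_id!r}")
```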

---

## Error Reference

| Error | Cause |
|-------------------------|----------------------------------------------|
| `TranscriptsDisabled` | Video owner has disabled captions |
| `VideoUnavailable` | Video is private, deleted, or region-locked |
| `NoTranscriptFound` | Requested language does not exist |
| `NoTranscriptAvailable` | No captions of any kind exist for this video |
| `AuthenticationError` | ANTHROPIC_API_KEY is missing or invalid |

---

## Dependencies

| Package | Version | Purpose |
|------------------------|----------|-----------------------------------|
| anthropic | >=0.40.0 | Claude API (clean + summarize) |
| youtube-transcript-api | 0.6.2 | Direct YouTube caption API access |

---

## License

MIT License. See `LICENSE` for details.

---

## Disclaimer

This tool uses YouTube's publicly accessible caption endpoint and the Anthropic
API for personal, educational, and research use. An Anthropic API key is required
for the clean and summarize features. Review YouTube's Terms of Service before
using this tool in a production or commercial context.
ai_client.py ADDED
@@ -0,0 +1,170 @@
"""
ai_client.py
Thin wrapper around the Anthropic API with chunked processing and streaming.
Author: algorembrant
"""

from __future__ import annotations

import sys
from typing import Iterator, Optional

import anthropic

from config import DEFAULT_MODEL, MAX_TOKENS, CHUNK_SIZE


# ---------------------------------------------------------------------------
# Module-level client (lazy init, reused across calls)
# ---------------------------------------------------------------------------
_client: Optional[anthropic.Anthropic] = None


def _get_client() -> anthropic.Anthropic:
    global _client
    if _client is None:
        _client = anthropic.Anthropic()
    return _client


# ---------------------------------------------------------------------------
# Core helpers
# ---------------------------------------------------------------------------

def complete(
    system: str,
    user: str,
    model: str = DEFAULT_MODEL,
    max_tokens: int = MAX_TOKENS,
    stream: bool = True,
) -> str:
    """
    Run a single completion and return the full response text.
    Streams tokens to stderr if `stream=True` so the user sees progress.
    """
    client = _get_client()

    if stream:
        result_parts: list[str] = []
        with client.messages.stream(
            model=model,
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}],
        ) as stream_ctx:
            for text in stream_ctx.text_stream:
                print(text, end="", flush=True, file=sys.stderr)
                result_parts.append(text)
        print(file=sys.stderr)  # newline after stream
        return "".join(result_parts)
    else:
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=system,
            messages=[{"role": "user", "content": user}],
        )
        return response.content[0].text


def _split_into_chunks(text: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """
    Split text into chunks of at most `chunk_size` characters,
    breaking on paragraph or sentence boundaries where possible.
    """
    if len(text) <= chunk_size:
        return [text]

    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        if end >= len(text):
            chunks.append(text[start:])
            break

        # Try to break at a paragraph boundary (\n\n)
        split_at = text.rfind("\n\n", start, end)
        if split_at == -1:
            # Fall back to sentence boundary
            split_at = text.rfind(". ", start, end)
        if split_at == -1:
            # Fall back to whitespace
            split_at = text.rfind(" ", start, end)
        if split_at == -1:
            split_at = end  # hard split

        chunks.append(text[start : split_at + 1])
        start = split_at + 1

    return chunks


def complete_long(
    system: str,
    user_prefix: str,
    text: str,
    user_suffix: str = "",
    model: str = DEFAULT_MODEL,
    max_tokens: int = MAX_TOKENS,
    merge_system: Optional[str] = None,
    stream: bool = True,
) -> str:
    """
    Process a potentially long text by splitting it into chunks,
    running a completion on each, then optionally merging the results.

    Args:
        system: System prompt.
        user_prefix: Text prepended before each chunk in the user message.
        text: The main content to process (may be chunked).
        user_suffix: Text appended after each chunk in the user message.
        model: Anthropic model identifier.
        max_tokens: Max output tokens per call.
        merge_system: If provided and there are multiple chunks, a final
            merge pass is run with this system prompt.
        stream: Whether to stream tokens to stderr.

    Returns:
        Final processed text (merged if multi-chunk).
    """
    chunks = _split_into_chunks(text)
    n = len(chunks)

    if n == 1:
        user_msg = f"{user_prefix}\n\n{chunks[0]}"
        if user_suffix:
            user_msg += f"\n\n{user_suffix}"
        return complete(system, user_msg, model=model, max_tokens=max_tokens, stream=stream)

    # Multi-chunk processing
    print(
        f"[info] Text is large ({len(text):,} chars). Processing in {n} chunks.",
        file=sys.stderr,
    )
    partial_results: list[str] = []
    for i, chunk in enumerate(chunks, 1):
        print(f"\n[chunk {i}/{n}]", file=sys.stderr)
        user_msg = (
            f"{user_prefix}\n\n"
            f"[Part {i} of {n}]\n\n{chunk}"
        )
        if user_suffix:
            user_msg += f"\n\n{user_suffix}"
        result = complete(system, user_msg, model=model, max_tokens=max_tokens, stream=stream)
        partial_results.append(result)

    combined = "\n\n".join(partial_results)

    # Optional merge/synthesis pass
    if merge_system and n > 1:
        print(f"\n[merging {n} chunks into final output]", file=sys.stderr)
        combined = complete(
            merge_system,
            f"Merge and unify the following {n} sections into a single cohesive output:\n\n{combined}",
            model=model,
            max_tokens=max_tokens,
            stream=stream,
        )

    return combined
cleaner.py ADDED
@@ -0,0 +1,77 @@
"""
cleaner.py
Reformats raw YouTube transcript text into clean, readable paragraphs.
Author: algorembrant
"""

from __future__ import annotations

from config import DEFAULT_MODEL, MAX_TOKENS
from ai_client import complete_long

# ---------------------------------------------------------------------------
# Prompts
# ---------------------------------------------------------------------------

_CLEAN_SYSTEM = """You are a professional transcript editor.
Your task is to reformat raw, fragmented YouTube transcript text into clean,
readable paragraphs that preserve the speaker's words and intent exactly.

Rules:
- Do NOT paraphrase, summarize, or omit any content.
- Fix only punctuation, capitalization, and paragraph breaks.
- Group related sentences into coherent paragraphs of 3-6 sentences each.
- Remove filler words only when they impede readability (e.g. repeated "um", "uh", "like").
- Remove duplicate lines caused by auto-captioning overlap.
- Preserve proper nouns, technical terms, and speaker style.
- Output clean, flowing prose — no bullet points, no headers, no markdown.
- Do not add any commentary, preamble, or notes of your own.
"""

_CLEAN_USER_PREFIX = (
    "Reformat the following raw YouTube transcript into clean, readable paragraphs. "
    "Preserve all content. Fix punctuation and capitalization only.\n\n"
    "RAW TRANSCRIPT:"
)

_CLEAN_MERGE_SYSTEM = """You are a professional transcript editor.
You will receive several already-cleaned transcript sections.
Merge them into a single, seamless, well-paragraphed document.
Do not summarize or omit any content. Output clean flowing prose only.
"""


# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------

def clean(
    raw_text: str,
    model: str = DEFAULT_MODEL,
    max_tokens: int = MAX_TOKENS,
    stream: bool = True,
) -> str:
    """
    Reformat a raw transcript into clean paragraphs.

    Args:
        raw_text: Plain-text transcript (output of fetcher.TranscriptResult.plain_text).
        model: Anthropic model to use.
        max_tokens: Max output tokens per API call.
        stream: Whether to stream progress tokens to stderr.

    Returns:
        Cleaned, paragraph-formatted transcript as a string.
    """
    if not raw_text or not raw_text.strip():
        raise ValueError("Cannot clean an empty transcript.")

    return complete_long(
        system=_CLEAN_SYSTEM,
        user_prefix=_CLEAN_USER_PREFIX,
        text=raw_text.strip(),
        model=model,
        max_tokens=max_tokens,
        merge_system=_CLEAN_MERGE_SYSTEM,
        stream=stream,
    )
config.py ADDED
@@ -0,0 +1,48 @@
"""
config.py
Central configuration for the YouTube Transcript Toolkit.
Author: algorembrant
"""

# ---------------------------------------------------------------------------
# Model settings
# ---------------------------------------------------------------------------
# claude-haiku-4-5 is used by default for speed.
# Switch to claude-sonnet-4-6 for higher quality at the cost of latency.
DEFAULT_MODEL = "claude-haiku-4-5-20251001"
QUALITY_MODEL = "claude-sonnet-4-6"

MAX_TOKENS = 8192    # Maximum tokens to request from the model
CHUNK_SIZE = 60_000  # Characters per chunk for very long transcripts

# ---------------------------------------------------------------------------
# Transcript defaults
# ---------------------------------------------------------------------------
DEFAULT_LANGUAGES = ["en"]

# ---------------------------------------------------------------------------
# Summary modes
# ---------------------------------------------------------------------------
SUMMARY_MODES = {
    "brief": {
        "label": "Brief",
        "description": "3-5 sentence executive summary",
    },
    "detailed": {
        "label": "Detailed",
        "description": "Comprehensive multi-section breakdown",
    },
    "bullets": {
        "label": "Bullet Points",
        "description": "Key takeaways as a structured bullet list",
    },
    "outline": {
        "label": "Outline",
        "description": "Hierarchical topic outline",
    },
}

# ---------------------------------------------------------------------------
# Output formats
# ---------------------------------------------------------------------------
OUTPUT_FORMATS = ["text", "json", "srt", "vtt"]
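main.py (whose contents are not shown on this page) presumably derives the `-m` choices from `SUMMARY_MODES`; a minimal sketch of that pattern, with the mode names inlined here since this argparse wiring is an assumption rather than the actual main.py code:

```python
import argparse

# Mode names inlined from config.SUMMARY_MODES (labels omitted for brevity).
SUMMARY_MODES = ["brief", "detailed", "bullets", "outline"]

parser = argparse.ArgumentParser(prog="main.py")
parser.add_argument(
    "-m", "--mode",
    choices=SUMMARY_MODES,
    default="brief",
    help="Summary mode (default: brief)",
)

args = parser.parse_args(["-m", "bullets"])
print(args.mode)  # prints "bullets"
```

Driving `choices` from the config dict keeps the CLI and the summarizer's supported modes from drifting apart.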
fetcher.py ADDED
@@ -0,0 +1,200 @@
"""
fetcher.py
Fetches YouTube transcripts directly via the caption API — no HTML parsing.
Author: algorembrant
"""

from __future__ import annotations

import re
import sys
from typing import Optional

from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import (
    JSONFormatter,
    SRTFormatter,
    TextFormatter,
    WebVTTFormatter,
)
from youtube_transcript_api._errors import (
    NoTranscriptAvailable,
    NoTranscriptFound,
    TranscriptsDisabled,
    VideoUnavailable,
)

from config import DEFAULT_LANGUAGES


# ---------------------------------------------------------------------------
# URL / ID helpers
# ---------------------------------------------------------------------------

_ID_PATTERNS = [
    r"(?:youtube\.com/watch\?.*v=)([a-zA-Z0-9_-]{11})",
    r"(?:youtu\.be/)([a-zA-Z0-9_-]{11})",
    r"(?:youtube\.com/shorts/)([a-zA-Z0-9_-]{11})",
    r"(?:youtube\.com/embed/)([a-zA-Z0-9_-]{11})",
]


def extract_video_id(url_or_id: str) -> str:
    """Return the 11-character YouTube video ID from a URL or raw ID."""
    for pattern in _ID_PATTERNS:
        match = re.search(pattern, url_or_id)
        if match:
            return match.group(1)

    if re.fullmatch(r"[a-zA-Z0-9_-]{11}", url_or_id):
        return url_or_id

    raise ValueError(
        f"Cannot extract a valid YouTube video ID from: {url_or_id!r}\n"
        "Accepted: full YouTube URL, youtu.be link, Shorts URL, embed URL, or raw 11-char ID."
    )


# ---------------------------------------------------------------------------
# Language listing
# ---------------------------------------------------------------------------

def list_available_transcripts(video_id: str) -> None:
    """Print all available transcript languages for a video."""
    tlist = YouTubeTranscriptApi.list_transcripts(video_id)

    manual = list(tlist._manually_created_transcripts.values())
    auto = list(tlist._generated_transcripts.values())

    print(f"\nAvailable transcripts -- video: {video_id}\n")
    if manual:
        print("Manually created:")
        for t in manual:
            print(f"  [{t.language_code:8s}] {t.language}")
    if auto:
        print("Auto-generated:")
        for t in auto:
            print(f"  [{t.language_code:8s}] {t.language}")
    if not manual and not auto:
        print("  (none found)")


# ---------------------------------------------------------------------------
# Core fetch
# ---------------------------------------------------------------------------

class TranscriptResult:
    """Container for a fetched transcript."""

    def __init__(
        self,
        video_id: str,
        raw_data: list[dict],
        language_code: str,
        language: str,
        is_generated: bool,
    ) -> None:
        self.video_id = video_id
        self.raw_data = raw_data  # list of {text, start, duration}
        self.language_code = language_code
        self.language = language
        self.is_generated = is_generated

    # ------------------------------------------------------------------
    # Convenience properties
    # ------------------------------------------------------------------

    @property
    def plain_text(self) -> str:
        """Plain transcript text without timestamps."""
        return TextFormatter().format_transcript(self.raw_data)

    def timestamped_text(self) -> str:
        """Plain text with [MM:SS.ss] prefixes."""
        lines = []
        for entry in self.raw_data:
            m = int(entry["start"] // 60)
            s = entry["start"] % 60
            lines.append(f"[{m:02d}:{s:05.2f}] {entry['text']}")
        return "\n".join(lines)

    def as_json(self) -> str:
        return JSONFormatter().format_transcript(self.raw_data, indent=2)

    def as_srt(self) -> str:
        return SRTFormatter().format_transcript(self.raw_data)

    def as_vtt(self) -> str:
        return WebVTTFormatter().format_transcript(self.raw_data)

    def formatted(self, fmt: str, timestamps: bool = False) -> str:
        """Return transcript in the requested format string."""
132
+ if fmt == "json":
133
+ return self.as_json()
134
+ if fmt == "srt":
135
+ return self.as_srt()
136
+ if fmt == "vtt":
137
+ return self.as_vtt()
138
+ # default: text
139
+ return self.timestamped_text() if timestamps else self.plain_text
140
+
141
+ def __len__(self) -> int:
142
+ return len(self.plain_text)
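The `[MM:SS.ss]` prefix produced by `timestamped_text()` is just integer minutes plus a zero-padded fractional-seconds remainder. A standalone sketch of that formatting, using made-up caption entries:

```python
# Sample entries in the {text, start, duration} shape used by raw_data;
# the values are invented for illustration.
entries = [
    {"text": "hello world", "start": 0.0, "duration": 1.5},
    {"text": "second line", "start": 75.25, "duration": 2.0},
]

lines = []
for entry in entries:
    m = int(entry["start"] // 60)   # whole minutes
    s = entry["start"] % 60         # remaining seconds, fractional
    lines.append(f"[{m:02d}:{s:05.2f}] {entry['text']}")

print("\n".join(lines))
# [00:00.00] hello world
# [01:15.25] second line
```

The `:05.2f` spec pads seconds to five characters total (two digits, a dot, two decimals), keeping columns aligned.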
143
+
144
+
145
+ def fetch(
146
+ video_id: str,
147
+ languages: Optional[list[str]] = None,
148
+ ) -> TranscriptResult:
149
+ """
150
+ Fetch a YouTube transcript directly via the caption API.
151
+
152
+ Args:
153
+ video_id: 11-character YouTube video ID.
154
+ languages: Ordered list of preferred language codes.
155
+
156
+ Returns:
157
+ TranscriptResult instance.
158
+
159
+ Raises:
160
+ SystemExit on unrecoverable errors (TranscriptsDisabled, VideoUnavailable, etc.)
161
+ """
162
+ if languages is None:
163
+ languages = DEFAULT_LANGUAGES
164
+
165
+ try:
166
+ tlist = YouTubeTranscriptApi.list_transcripts(video_id)
167
+
168
+ try:
169
+ transcript_obj = tlist.find_transcript(languages)
170
+ except NoTranscriptFound:
171
+ all_t = (
172
+ list(tlist._manually_created_transcripts.values())
173
+ + list(tlist._generated_transcripts.values())
174
+ )
175
+ if not all_t:
176
+ raise NoTranscriptAvailable(video_id)
177
+ transcript_obj = all_t[0]
178
+ print(
179
+ f"[warn] Requested language(s) not found. "
180
+ f"Using [{transcript_obj.language_code}] {transcript_obj.language}.",
181
+ file=sys.stderr,
182
+ )
183
+
184
+ raw = transcript_obj.fetch()
185
+ return TranscriptResult(
186
+ video_id=video_id,
187
+ raw_data=raw,
188
+ language_code=transcript_obj.language_code,
189
+ language=transcript_obj.language,
190
+ is_generated=transcript_obj.is_generated,
191
+ )
192
+
193
+ except TranscriptsDisabled:
194
+ sys.exit(f"[error] Transcripts are disabled for video '{video_id}'.")
195
+ except VideoUnavailable:
196
+ sys.exit(f"[error] Video '{video_id}' is unavailable (private, deleted, or region-locked).")
197
+ except NoTranscriptAvailable:
198
+ sys.exit(f"[error] No transcript found for video '{video_id}'.")
199
+ except Exception as exc:
200
+ sys.exit(f"[error] Unexpected error while fetching transcript: {exc}")
main.py ADDED
@@ -0,0 +1,353 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ main.py
4
+ YouTube Transcript Toolkit — CLI entry point.
5
+
6
+ Commands:
7
+ fetch Fetch and print/save raw transcript
8
+ clean Fetch transcript and reformat into paragraphs
9
+ summarize Fetch transcript and summarize
10
+ pipeline Fetch, clean, and summarize in one pass
11
+ list List available transcript languages for a video
12
+
13
+ Author: algorembrant
14
+ """
15
+
16
+ from __future__ import annotations
17
+
18
+ import argparse
19
+ import sys
20
+
21
+ from config import DEFAULT_MODEL, QUALITY_MODEL, SUMMARY_MODES, OUTPUT_FORMATS
22
+ from fetcher import extract_video_id, list_available_transcripts, fetch
23
+ from cleaner import clean
24
+ from summarizer import summarize
25
+ from pipeline import run, run_batch
26
+
27
+
28
+ # ---------------------------------------------------------------------------
29
+ # Shared argument groups
30
+ # ---------------------------------------------------------------------------
31
+
32
+ def _add_video_args(p: argparse.ArgumentParser) -> None:
33
+ p.add_argument(
34
+ "video",
35
+ nargs="+",
36
+ help="YouTube video URL(s) or ID(s).",
37
+ )
38
+
39
+ def _add_lang_args(p: argparse.ArgumentParser) -> None:
40
+ p.add_argument(
41
+ "-l", "--languages",
42
+ nargs="+",
43
+ default=["en"],
44
+ metavar="LANG",
45
+ help="Language codes in order of preference (default: en). Example: --languages en es",
46
+ )
47
+
48
+ def _add_output_args(p: argparse.ArgumentParser) -> None:
49
+ p.add_argument(
50
+ "-o", "--output",
51
+ metavar="PATH",
52
+ help="Output file (single video) or directory (multiple videos).",
53
+ )
54
+
55
+ def _add_ai_args(p: argparse.ArgumentParser) -> None:
56
+ p.add_argument(
57
+ "--quality",
58
+ action="store_true",
59
+ help=f"Use the higher-quality model ({QUALITY_MODEL}) instead of the default fast model.",
60
+ )
61
+ p.add_argument(
62
+ "--no-stream",
63
+ action="store_true",
64
+ help="Disable token streaming (collect full response before printing).",
65
+ )
66
+
67
+ def _add_format_args(p: argparse.ArgumentParser) -> None:
68
+ p.add_argument(
69
+ "-f", "--format",
70
+ choices=OUTPUT_FORMATS,
71
+ default="text",
72
+ help="Raw transcript output format (default: text).",
73
+ )
74
+ p.add_argument(
75
+ "-t", "--timestamps",
76
+ action="store_true",
77
+ help="Include timestamps in plain-text transcript output.",
78
+ )
79
+
80
+
81
+ # ---------------------------------------------------------------------------
82
+ # Argument parser
83
+ # ---------------------------------------------------------------------------
84
+
85
+ def build_parser() -> argparse.ArgumentParser:
86
+ parser = argparse.ArgumentParser(
87
+ prog="yttool",
88
+ description=(
89
+ "YouTube Transcript Toolkit\n"
90
+ "Fetch, clean, and summarize YouTube transcripts. No HTML parsing.\n"
91
+ "Author: algorembrant"
92
+ ),
93
+ formatter_class=argparse.RawTextHelpFormatter,
94
+ )
95
+
96
+ subparsers = parser.add_subparsers(dest="command", required=True)
97
+
98
+ # ---- fetch ----
99
+ p_fetch = subparsers.add_parser(
100
+ "fetch",
101
+ help="Fetch the raw transcript of a YouTube video.",
102
+ formatter_class=argparse.RawTextHelpFormatter,
103
+ )
104
+ _add_video_args(p_fetch)
105
+ _add_lang_args(p_fetch)
106
+ _add_format_args(p_fetch)
107
+ _add_output_args(p_fetch)
108
+
109
+ # ---- list ----
110
+ p_list = subparsers.add_parser(
111
+ "list",
112
+ help="List all available transcript languages for a video.",
113
+ )
114
+ _add_video_args(p_list)
115
+
116
+ # ---- clean ----
117
+ p_clean = subparsers.add_parser(
118
+ "clean",
119
+ help="Fetch a transcript and reformat it into clean paragraphs.",
120
+ formatter_class=argparse.RawTextHelpFormatter,
121
+ )
122
+ _add_video_args(p_clean)
123
+ _add_lang_args(p_clean)
124
+ _add_ai_args(p_clean)
125
+ _add_output_args(p_clean)
126
+
127
+ # ---- summarize ----
128
+ p_sum = subparsers.add_parser(
129
+ "summarize",
130
+ help="Fetch a transcript and summarize it.",
131
+ formatter_class=argparse.RawTextHelpFormatter,
132
+ )
133
+ _add_video_args(p_sum)
134
+ _add_lang_args(p_sum)
135
+ p_sum.add_argument(
136
+ "-m", "--mode",
137
+ choices=list(SUMMARY_MODES.keys()),
138
+ default="brief",
139
+ help=(
140
+ "Summary mode (default: brief):\n"
141
+ + "\n".join(
142
+ f" {k:10s} {v['description']}"
143
+ for k, v in SUMMARY_MODES.items()
144
+ )
145
+ ),
146
+ )
147
+ _add_ai_args(p_sum)
148
+ _add_output_args(p_sum)
149
+
150
+ # ---- pipeline ----
151
+ p_pipe = subparsers.add_parser(
152
+ "pipeline",
153
+ help="Fetch, clean, and summarize in one pass.",
154
+ formatter_class=argparse.RawTextHelpFormatter,
155
+ )
156
+ _add_video_args(p_pipe)
157
+ _add_lang_args(p_pipe)
158
+ _add_format_args(p_pipe)
159
+ p_pipe.add_argument(
160
+ "-m", "--mode",
161
+ choices=list(SUMMARY_MODES.keys()),
162
+ default="brief",
163
+ help="Summary mode (default: brief).",
164
+ )
165
+ p_pipe.add_argument(
166
+ "--skip-clean",
167
+ action="store_true",
168
+ help="Skip the cleaning step; summarize raw transcript directly.",
169
+ )
170
+ p_pipe.add_argument(
171
+ "--skip-summary",
172
+ action="store_true",
173
+ help="Skip the summarization step; only fetch and clean.",
174
+ )
175
+ _add_ai_args(p_pipe)
176
+ _add_output_args(p_pipe)
177
+
178
+ return parser
179
+
180
+
181
+ # ---------------------------------------------------------------------------
182
+ # Command handlers
183
+ # ---------------------------------------------------------------------------
184
+
185
+ def cmd_list(args: argparse.Namespace) -> None:
186
+ for v in args.video:
187
+ vid = extract_video_id(v)
188
+ list_available_transcripts(vid)
189
+
190
+
191
+ def cmd_fetch(args: argparse.Namespace) -> None:
192
+ import os
193
+
194
+ video_ids = [extract_video_id(v) for v in args.video]
195
+ single = len(video_ids) == 1
196
+
197
+ for vid in video_ids:
198
+ result = fetch(vid, languages=args.languages)
199
+ text = result.formatted(args.format, timestamps=args.timestamps)
200
+
201
+ if args.output:
202
+ if single:
203
+ out_path = args.output
204
+ else:
205
+ ext_map = {"text": "txt", "json": "json", "srt": "srt", "vtt": "vtt"}
206
+ os.makedirs(args.output, exist_ok=True)
207
+ out_path = os.path.join(args.output, f"{vid}.{ext_map.get(args.format, 'txt')}")
208
+
209
+ with open(out_path, "w", encoding="utf-8") as f:
210
+ f.write(text)
211
+ print(f"[saved] {out_path}", file=sys.stderr)
212
+ else:
213
+ if not single:
214
+ print(f"\n{'='*60}\nVideo: {vid}\n{'='*60}")
215
+ print(text)
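In batch mode, `cmd_fetch` derives one output path per video from the format's extension map. A standalone sketch of that path construction (the `out` directory name is just an example):

```python
import os

# Extension map as used in cmd_fetch; unknown formats fall back to .txt.
ext_map = {"text": "txt", "json": "json", "srt": "srt", "vtt": "vtt"}

def output_path(output_dir: str, video_id: str, fmt: str) -> str:
    return os.path.join(output_dir, f"{video_id}.{ext_map.get(fmt, 'txt')}")

print(output_path("out", "dQw4w9WgXcQ", "srt"))
print(output_path("out", "dQw4w9WgXcQ", "unknown"))  # falls back to .txt
```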
216
+
217
+
218
+ def cmd_clean(args: argparse.Namespace) -> None:
219
+ import os
220
+
221
+ video_ids = [extract_video_id(v) for v in args.video]
222
+ single = len(video_ids) == 1
223
+ model = QUALITY_MODEL if args.quality else DEFAULT_MODEL
224
+ stream = not args.no_stream
225
+
226
+ for vid in video_ids:
227
+ result = fetch(vid, languages=args.languages)
228
+ cleaned = clean(result.plain_text, model=model, stream=stream)
229
+
230
+ if args.output:
231
+ if single:
232
+ out_path = args.output
233
+ else:
234
+ os.makedirs(args.output, exist_ok=True)
235
+ out_path = os.path.join(args.output, f"{vid}_cleaned.txt")
236
+ with open(out_path, "w", encoding="utf-8") as f:
237
+ f.write(cleaned)
238
+ print(f"\n[saved] {out_path}", file=sys.stderr)
239
+ else:
240
+ if not single:
241
+ print(f"\n{'='*60}\nVideo: {vid}\n{'='*60}")
242
+ print(cleaned)
243
+
244
+
245
+ def cmd_summarize(args: argparse.Namespace) -> None:
246
+ import os
247
+
248
+ video_ids = [extract_video_id(v) for v in args.video]
249
+ single = len(video_ids) == 1
250
+ model = QUALITY_MODEL if args.quality else DEFAULT_MODEL
251
+ stream = not args.no_stream
252
+
253
+ for vid in video_ids:
254
+ result = fetch(vid, languages=args.languages)
255
+ summary = summarize(result.plain_text, mode=args.mode, model=model, stream=stream)
256
+
257
+ if args.output:
258
+ if single:
259
+ out_path = args.output
260
+ else:
261
+ os.makedirs(args.output, exist_ok=True)
262
+ out_path = os.path.join(args.output, f"{vid}_summary.txt")
263
+ with open(out_path, "w", encoding="utf-8") as f:
264
+ f.write(summary)
265
+ print(f"\n[saved] {out_path}", file=sys.stderr)
266
+ else:
267
+ if not single:
268
+ print(f"\n{'='*60}\nVideo: {vid}\n{'='*60}")
269
+ print(summary)
270
+
271
+
272
+ def cmd_pipeline(args: argparse.Namespace) -> None:
273
+ video_ids = [extract_video_id(v) for v in args.video]
274
+ model = QUALITY_MODEL if args.quality else DEFAULT_MODEL
275
+ stream = not args.no_stream
276
+
277
+ kwargs = dict(
278
+ languages=args.languages,
279
+ do_clean=not args.skip_clean,
280
+ do_summarize=not args.skip_summary,
281
+ summary_mode=args.mode,
282
+ model=model,
283
+ quality=args.quality,
284
+ stream=stream,
285
+ output_dir=args.output,
286
+ transcript_format=args.format,
287
+ timestamps=args.timestamps,
288
+ )
289
+
290
+ if len(video_ids) == 1:
291
+ r = run(video_ids[0], **kwargs)
292
+ if not args.output:
293
+ _print_pipeline_result(r)
294
+ else:
295
+ results = run_batch(video_ids, **kwargs)
296
+ if not args.output:
297
+ for r in results:
298
+ print(f"\n{'='*60}\nVideo: {r.video_id}\n{'='*60}")
299
+ _print_pipeline_result(r)
300
+
301
305
+
306
+
307
+ def _print_pipeline_result(r) -> None:
308
+ sections = []
309
+ if r.raw:
310
+ sections.append(("RAW TRANSCRIPT", r.raw))
311
+ if r.cleaned:
312
+ sections.append(("CLEANED TRANSCRIPT", r.cleaned))
313
+ if r.summary:
314
+ sections.append(("SUMMARY", r.summary))
315
+
316
+ for title, content in sections:
317
+ print(f"\n{'='*60}")
318
+ print(f" {title}")
319
+ print(f"{'='*60}\n")
320
+ print(content)
321
+
322
+ if r.errors:
323
+ print("\n[errors]", file=sys.stderr)
324
+ for err in r.errors:
325
+ print(f" {err}", file=sys.stderr)
326
+
327
+
328
+ # ---------------------------------------------------------------------------
329
+ # Entry point
330
+ # ---------------------------------------------------------------------------
331
+
332
+ def main() -> None:
333
+ parser = build_parser()
334
+ args = parser.parse_args()
335
+
336
+ dispatch = {
337
+ "list": cmd_list,
338
+ "fetch": cmd_fetch,
339
+ "clean": cmd_clean,
340
+ "summarize": cmd_summarize,
341
+ "pipeline": cmd_pipeline,
342
+ }
343
+
344
+ handler = dispatch.get(args.command)
345
+ if handler:
346
+ handler(args)
347
+ else:
348
+ parser.print_help()
349
+ sys.exit(1)
350
+
351
+
352
+ if __name__ == "__main__":
353
+ main()
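The subcommand wiring in `main.py` follows a common argparse pattern: a subparser per command plus a dispatch dict keyed by `args.command`. A minimal standalone sketch of the pattern ("greet" is a toy command, not part of the toolkit):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command", required=True)
    p_greet = sub.add_parser("greet")
    p_greet.add_argument("name")
    return parser

def cmd_greet(args: argparse.Namespace) -> str:
    return f"hello {args.name}"

# Map command name -> handler, exactly as main() does with its dispatch dict.
dispatch = {"greet": cmd_greet}

args = build_parser().parse_args(["greet", "world"])
print(dispatch[args.command](args))  # hello world
```

Because `required=True` is set on the subparsers, argparse itself rejects a missing command, so the dispatch lookup can assume `args.command` is always one of the registered keys.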
pipeline.py ADDED
@@ -0,0 +1,173 @@
1
+ """
2
+ pipeline.py
3
+ Orchestrates fetch -> clean -> summarize in a single pipeline call.
4
+ Author: algorembrant
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import os
10
+ import sys
11
+ from dataclasses import dataclass, field
12
+ from typing import Optional
13
+
14
+ from fetcher import TranscriptResult, fetch, extract_video_id
15
+ from cleaner import clean
16
+ from summarizer import summarize
17
+ from config import DEFAULT_MODEL, QUALITY_MODEL
18
+
19
+
20
+ # ---------------------------------------------------------------------------
21
+ # Result container
22
+ # ---------------------------------------------------------------------------
23
+
24
+ @dataclass
25
+ class PipelineResult:
26
+ video_id: str
27
+ raw: str = ""
28
+ cleaned: str = ""
29
+ summary: str = ""
30
+ errors: list[str] = field(default_factory=list)
31
+
32
+ @property
33
+ def success(self) -> bool:
34
+ return not self.errors
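The `success` contract is simply "no accumulated errors". A minimal standalone copy of the dataclass, shown here only to illustrate how callers can branch on it (the example values are invented):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    video_id: str
    raw: str = ""
    cleaned: str = ""
    summary: str = ""
    errors: list = field(default_factory=list)

    @property
    def success(self) -> bool:
        # A result succeeds exactly when its error list is empty.
        return not self.errors

ok = PipelineResult(video_id="dQw4w9WgXcQ", raw="some transcript text")
bad = PipelineResult(video_id="dQw4w9WgXcQ", errors=["Cleaner error: timeout"])
print(ok.success, bad.success)  # True False
```

Collecting errors in a list rather than raising lets `run_batch()` keep processing the remaining videos and report failures at the end.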
35
+
36
+
37
+ # ---------------------------------------------------------------------------
38
+ # Single-video pipeline
39
+ # ---------------------------------------------------------------------------
40
+
41
+ def run(
42
+ url_or_id: str,
43
+ languages: list[str] | None = None,
44
+ do_clean: bool = False,
45
+ do_summarize: bool = False,
46
+ summary_mode: str = "brief",
47
+ model: str = DEFAULT_MODEL,
48
+ quality: bool = False,
49
+ stream: bool = True,
50
+ output_dir: str | None = None,
51
+ transcript_format: str = "text",
52
+ timestamps: bool = False,
53
+ ) -> PipelineResult:
54
+ """
55
+ Full pipeline for one video.
56
+
57
+ Args:
58
+ url_or_id: YouTube URL or video ID.
59
+ languages: Language preference list.
60
+ do_clean: Run paragraph cleaner.
61
+ do_summarize: Run summarizer.
62
+ summary_mode: One of 'brief', 'detailed', 'bullets', 'outline'.
63
+ model: Anthropic model identifier.
64
+ quality: Use the higher-quality model instead of the default fast one.
65
+ stream: Stream AI tokens to stderr.
66
+ output_dir: Directory to write output files (optional).
67
+ transcript_format: Raw transcript format: 'text', 'json', 'srt', 'vtt'.
68
+ timestamps: Include timestamps in plain-text transcript.
69
+
70
+ Returns:
71
+ PipelineResult with all produced artifacts.
72
+ """
73
+ chosen_model = QUALITY_MODEL if quality else model
74
+ result = PipelineResult(video_id="")
75
+
76
+ # 1. Extract ID
77
+ try:
78
+ video_id = extract_video_id(url_or_id)
79
+ result.video_id = video_id
80
+ except ValueError as exc:
81
+ result.errors.append(str(exc))
82
+ return result
83
+
84
+ # 2. Fetch
85
+ print(f"\n[fetch] {video_id}", file=sys.stderr)
86
+ transcript: TranscriptResult = fetch(video_id, languages=languages)
87
+ result.raw = transcript.formatted(transcript_format, timestamps=timestamps)
88
+ plain_text = transcript.plain_text # always used as AI input
89
+
90
+ # 3. Clean
91
+ if do_clean:
92
+ print("\n[clean] Running paragraph cleaner...", file=sys.stderr)
93
+ try:
94
+ result.cleaned = clean(plain_text, model=chosen_model, stream=stream)
95
+ except Exception as exc:
96
+ result.errors.append(f"Cleaner error: {exc}")
97
+
98
+ # 4. Summarize
99
+ if do_summarize:
100
+ print(f"\n[summarize] Mode: {summary_mode}", file=sys.stderr)
101
+ # Prefer cleaned text if available
102
+ source_text = result.cleaned if result.cleaned else plain_text
103
+ try:
104
+ result.summary = summarize(
105
+ source_text, mode=summary_mode, model=chosen_model, stream=stream
106
+ )
107
+ except Exception as exc:
108
+ result.errors.append(f"Summarizer error: {exc}")
109
+
110
+ # 5. Save to disk
111
+ if output_dir:
112
+ _save(result, output_dir, transcript_format)
113
+
114
+ return result
115
+
116
+
117
+ def _save(result: PipelineResult, output_dir: str, fmt: str) -> None:
118
+ """Write all non-empty artifacts to output_dir."""
119
+ os.makedirs(output_dir, exist_ok=True)
120
+ vid = result.video_id
121
+
122
+ ext_map = {"text": "txt", "json": "json", "srt": "srt", "vtt": "vtt"}
123
+ ext = ext_map.get(fmt, "txt")
124
+
125
+ files_written = []
126
+
127
+ if result.raw:
128
+ p = os.path.join(output_dir, f"{vid}_transcript.{ext}")
129
+ _write(p, result.raw)
130
+ files_written.append(p)
131
+
132
+ if result.cleaned:
133
+ p = os.path.join(output_dir, f"{vid}_cleaned.txt")
134
+ _write(p, result.cleaned)
135
+ files_written.append(p)
136
+
137
+ if result.summary:
138
+ p = os.path.join(output_dir, f"{vid}_summary.txt")
139
+ _write(p, result.summary)
140
+ files_written.append(p)
141
+
142
+ for path in files_written:
143
+ print(f"[saved] {path}", file=sys.stderr)
144
+
145
+
146
+ def _write(path: str, content: str) -> None:
147
+ with open(path, "w", encoding="utf-8") as f:
148
+ f.write(content)
149
+
150
+
151
+ # ---------------------------------------------------------------------------
152
+ # Batch pipeline
153
+ # ---------------------------------------------------------------------------
154
+
155
+ def run_batch(
156
+ urls_or_ids: list[str],
157
+ **kwargs,
158
+ ) -> list[PipelineResult]:
159
+ """
160
+ Run the pipeline for multiple videos sequentially.
161
+ All keyword arguments are forwarded to `run()`.
162
+
163
+ Returns a list of PipelineResult, one per video.
164
+ """
165
+ results = []
166
+ total = len(urls_or_ids)
167
+ for i, url_or_id in enumerate(urls_or_ids, 1):
168
+ print(f"\n{'='*60}", file=sys.stderr)
169
+ print(f"[{i}/{total}] Processing: {url_or_id}", file=sys.stderr)
170
+ print(f"{'='*60}", file=sys.stderr)
171
+ r = run(url_or_id, **kwargs)
172
+ results.append(r)
173
+ return results
requirements.txt ADDED
@@ -0,0 +1,2 @@
1
+ anthropic
2
+ youtube-transcript-api
summarizer.py ADDED
@@ -0,0 +1,125 @@
1
+ """
2
+ summarizer.py
3
+ Summarizes YouTube transcript text in multiple modes via the Anthropic API.
4
+ Author: algorembrant
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ from config import DEFAULT_MODEL, MAX_TOKENS, QUALITY_MODEL
10
+ from ai_client import complete_long
11
+
12
+ # ---------------------------------------------------------------------------
13
+ # Per-mode prompts
14
+ # ---------------------------------------------------------------------------
15
+
16
+ _SYSTEM_BASE = """You are an expert content analyst specializing in video transcripts.
17
+ Your summaries are accurate, concise, and written in clear professional prose.
18
+ Never hallucinate or add information not present in the transcript.
19
+ Do not add a preamble or closing statement — output only the requested summary.
20
+ """
21
+
22
+ _MODE_PROMPTS: dict[str, dict[str, str]] = {
23
+
24
+ "brief": {
25
+ "system": _SYSTEM_BASE + (
26
+ "Write a brief 3-5 sentence executive summary that captures the core message, "
27
+ "key argument, and main conclusion of the transcript."
28
+ ),
29
+ "user_prefix": (
30
+ "Write a brief 3-5 sentence executive summary of the following transcript.\n\n"
31
+ "TRANSCRIPT:"
32
+ ),
33
+ },
34
+
35
+ "detailed": {
36
+ "system": _SYSTEM_BASE + (
37
+ "Write a detailed, multi-section summary with clearly labeled sections. "
38
+ "Sections should include: Overview, Key Points, Supporting Details, and Conclusion. "
39
+ "Each section should be written as flowing prose paragraphs — no bullet points."
40
+ ),
41
+ "user_prefix": (
42
+ "Write a detailed multi-section summary (Overview, Key Points, Supporting Details, Conclusion) "
43
+ "of the following transcript. Use flowing prose — no bullet points.\n\n"
44
+ "TRANSCRIPT:"
45
+ ),
46
+ },
47
+
48
+ "bullets": {
49
+ "system": _SYSTEM_BASE + (
50
+ "Extract the most important takeaways as a structured bullet list. "
51
+ "Group bullets under 3-5 thematic headings. Each bullet should be one clear sentence. "
52
+ "Use markdown bold for headings."
53
+ ),
54
+ "user_prefix": (
55
+ "Extract the key takeaways from the following transcript as a structured bullet list "
56
+ "grouped under bold thematic headings.\n\n"
57
+ "TRANSCRIPT:"
58
+ ),
59
+ },
60
+
61
+ "outline": {
62
+ "system": _SYSTEM_BASE + (
63
+ "Create a hierarchical topic outline of the transcript. "
64
+ "Use Roman numerals for top-level topics, capital letters for sub-topics, "
65
+ "and Arabic numerals for specific points. Keep entries concise (one line each)."
66
+ ),
67
+ "user_prefix": (
68
+ "Create a hierarchical outline (Roman numerals, sub-letters, sub-numbers) "
69
+ "of the following transcript.\n\n"
70
+ "TRANSCRIPT:"
71
+ ),
72
+ },
73
+ }
74
+
75
+ _MERGE_SYSTEM = """You are an expert content analyst.
76
+ You will receive several summary sections from different parts of a long transcript.
77
+ Merge them into a single cohesive, unified summary in the same format.
78
+ Remove duplicate points. Maintain a logical flow. Output only the final merged summary.
79
+ """
80
+
81
+
82
+ # ---------------------------------------------------------------------------
83
+ # Public API
84
+ # ---------------------------------------------------------------------------
85
+
86
+ def summarize(
87
+ text: str,
88
+ mode: str = "brief",
89
+ model: str = DEFAULT_MODEL,
90
+ max_tokens: int = MAX_TOKENS,
91
+ stream: bool = True,
92
+ ) -> str:
93
+ """
94
+ Summarize a transcript in the specified mode.
95
+
96
+ Args:
97
+ text: Transcript text (raw or already cleaned).
98
+ mode: One of 'brief', 'detailed', 'bullets', 'outline'.
99
+ model: Anthropic model to use.
100
+ max_tokens: Max output tokens per API call.
101
+ stream: Stream progress tokens to stderr.
102
+
103
+ Returns:
104
+ Formatted summary string.
105
+ """
106
+ if not text or not text.strip():
107
+ raise ValueError("Cannot summarize an empty transcript.")
108
+
109
+ if mode not in _MODE_PROMPTS:
110
+ valid = ", ".join(_MODE_PROMPTS.keys())
111
+ raise ValueError(f"Unknown summary mode: {mode!r}. Valid modes: {valid}")
112
+
113
+ prompts = _MODE_PROMPTS[mode]
114
+
115
+ # Detailed and outline summaries benefit from higher-quality model
116
+ # but we keep the user's choice; they can override via --quality flag
117
+ return complete_long(
118
+ system=prompts["system"],
119
+ user_prefix=prompts["user_prefix"],
120
+ text=text.strip(),
121
+ model=model,
122
+ max_tokens=max_tokens,
123
+ merge_system=_MERGE_SYSTEM,
124
+ stream=stream,
125
+ )
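The mode validation at the top of `summarize()` can be sketched standalone. The prompt bodies below are placeholders, not the real prompts defined above:

```python
# Placeholder prompt table mirroring the shape of _MODE_PROMPTS.
MODE_PROMPTS = {
    "brief":    {"system": "...", "user_prefix": "..."},
    "detailed": {"system": "...", "user_prefix": "..."},
    "bullets":  {"system": "...", "user_prefix": "..."},
    "outline":  {"system": "...", "user_prefix": "..."},
}

def validate_mode(mode: str) -> dict:
    # Reject unknown modes with the full list of valid choices,
    # matching the error message format used by summarize().
    if mode not in MODE_PROMPTS:
        valid = ", ".join(MODE_PROMPTS.keys())
        raise ValueError(f"Unknown summary mode: {mode!r}. Valid modes: {valid}")
    return MODE_PROMPTS[mode]

print(sorted(MODE_PROMPTS))  # ['brief', 'bullets', 'detailed', 'outline']
```

Validating early means a typo in `--mode` fails before any tokens are spent on an API call.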