Commit c9f8d95 (verified) by algorembrant · 1 Parent(s): 297ee7b

Update README.md

  1. README.md +213 -198
README.md CHANGED
@@ -1,199 +1,214 @@
- ---
- license: mit
- sdk: static
- colorFrom: blue
- colorTo: red
- tags:
- - youtube
- - transcript
- - api
- - python
- - tools
- ---
-
- ![Python](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat-square&logo=python&logoColor=white)
- ![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)
- ![Dependencies](https://img.shields.io/badge/Dependencies-1-orange?style=flat-square)
- ![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square)
- ![No Scraping](https://img.shields.io/badge/No%20Scraping-Direct%20API-brightgreen?style=flat-square)
-
- ---
-
- # YouTube Transcript Fetcher
-
- A fast, zero-scraping Python command-line tool that pulls transcripts directly
- from YouTube videos using the official caption delivery API.
-
- No Selenium. No BeautifulSoup. No headless browsers. Just the raw transcript
- data returned by YouTube's own caption endpoint — in milliseconds.
-
- ---
-
- ## How It Works
-
- YouTube serves captions through a dedicated timedtext API endpoint. The
- `youtube-transcript-api` library calls that endpoint directly, bypassing all
- HTML parsing entirely. This makes fetches nearly instant regardless of video length.
-
- ---
-
- ## Features
-
- - Direct API access — no HTML parsing, no browser automation
- - Supports full YouTube URLs, short youtu.be links, Shorts URLs, embed URLs, and raw video IDs
- - Output formats: plain text, JSON, SRT (SubRip), WebVTT
- - Optional timestamp preservation in plain-text output
- - Language selection with ordered fallback (e.g. try Japanese, then English)
- - Batch processing fetch transcripts for multiple videos in one command
- - Auto-saves to file or directory with correct file extension
- - Lists all available transcript languages for any video
-
- ---
-
- ## Installation
-
- ```bash
- git clone https://github.com/your-username/youtube-transcript-fetcher.git
- cd youtube-transcript-fetcher
- python -m venv .venv
- source .venv/bin/activate # Windows: .venv\Scripts\activate
- pip install -r requirements.txt
- ```
-
- ---
-
- ## Quick Start
-
- ```bash
- # Print transcript to terminal
- python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
-
- # Save as plain text
- python main.py dQw4w9WgXcQ -o transcript.txt
-
- # Save as SRT subtitles
- python main.py dQw4w9WgXcQ -f srt -o transcript.srt
-
- # Save as JSON (includes start time + duration per segment)
- python main.py dQw4w9WgXcQ -f json -o transcript.json
-
- # Include timestamps in plain-text output
- python main.py dQw4w9WgXcQ -t
-
- # Request Spanish transcript, fall back to English if unavailable
- python main.py dQw4w9WgXcQ -l es en
-
- # List every available language for a video
- python main.py dQw4w9WgXcQ --list
-
- # Batch: fetch three videos and save each to ./transcripts/
- python main.py ID1 ID2 ID3 -o ./transcripts/
- ```
-
- ---
-
- ## CLI Reference
-
- ```
- usage: main.py [-h] [-l LANG [LANG ...]] [-f {text,json,srt,vtt}]
- [-t] [-o PATH] [--list]
- video [video ...]
-
- positional arguments:
- video YouTube video URL(s) or video ID(s)
-
- optional arguments:
- -h, --help show this help message and exit
- -l, --languages Language codes in order of preference (default: en)
- -f, --format Output format: text, json, srt, vtt (default: text)
- -t, --timestamps Add timestamps to plain-text output
- -o, --output Output file (single video) or directory (batch)
- --list List all available transcript languages and exit
- ```
-
- ---
-
- ## JSON Output Structure
-
- Each entry in the JSON array contains:
-
- ```json
- [
- {
- "text": "Never gonna give you up",
- "start": 43.08,
- "duration": 2.16
- }
- ]
- ```
-
- | Field | Type | Description |
- |------------|-------|----------------------------------|
- | `text` | str | Caption text for the segment |
- | `start` | float | Start time in seconds |
- | `duration` | float | Duration of the segment in seconds |
-
- ---
-
- ## Supported URL Formats
-
- ```
- https://www.youtube.com/watch?v=VIDEO_ID
- https://youtu.be/VIDEO_ID
- https://www.youtube.com/shorts/VIDEO_ID
- https://www.youtube.com/embed/VIDEO_ID
- VIDEO_ID (raw 11-character ID)
- ```
-
- ---
-
- ## Error Reference
-
- | Exception | Meaning |
- |------------------------|----------------------------------------------------|
- | `TranscriptsDisabled` | The video owner disabled captions |
- | `VideoUnavailable` | Video is private, deleted, or region-locked |
- | `NoTranscriptFound` | Requested language(s) do not exist for this video |
- | `NoTranscriptAvailable`| No captions exist at all for this video |
-
- ---
-
- ## Dependencies
-
- | Package | Version | Purpose |
- |--------------------------|---------|---------------------------------------|
- | youtube-transcript-api | 1.2.4 | Direct YouTube caption API access |
-
- No other dependencies. The standard library handles everything else.
-
- ---
-
- ## License
-
- MIT License. See `LICENSE` for details.
-
- ---
-
- ## Citation
-
- If you use this tool in your research or project, please cite it as follows:
-
- ```bibtex
- @software{albeos2026yttfetcher,
- author = {Rembrant Oyangoren Albeos},
- title = {YouTube Transcript Fetcher: High-speed, Zero-scraping Caption Extraction},
- year = {2026},
- publisher = {Hugging Face},
- journal = {Hugging Face Repository},
- howpublished = {\url{https://huggingface.co/algorembrant/youtube-transcript-fetcher}},
- version = {1.2.4}
- }
- ```
-
- ---
-
- ## Disclaimer
-
- This tool uses YouTube's publicly accessible caption endpoint for personal,
- educational, and research use. Review YouTube's Terms of Service before
  using this tool in a production or commercial context.

+ ---
+ license: mit
+ sdk: static
+ colorFrom: blue
+ colorTo: red
+ tags:
+ - youtube
+ - transcript
+ - api
+ - python
+ - tools
+ ---
+
+ ![Python](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat-square&logo=python&logoColor=white)
+ ![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)
+ ![Dependencies](https://img.shields.io/badge/Dependencies-1-orange?style=flat-square)
+ ![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square)
+ ![No Scraping](https://img.shields.io/badge/No%20Scraping-Direct%20API-brightgreen?style=flat-square)
+
+ ---
+
+ # YouTube Transcript Fetcher
+
+ A fast, zero-scraping Python command-line tool that pulls transcripts directly
+ from YouTube videos using the official caption delivery API.
+
+ No Selenium. No BeautifulSoup. No headless browsers. Just the raw transcript
+ data returned by YouTube's own caption endpoint — in milliseconds.
+
+ ---
+
+ ## How It Works
+
+ YouTube serves captions through a dedicated timedtext API endpoint. The
+ `youtube-transcript-api` library calls that endpoint directly, bypassing all
+ HTML parsing entirely. This makes fetches nearly instant regardless of video length.
+
+ ---
+
+ ## System Overview
+
+ ```mermaid
+ graph TD
+ A[User Commands] --> B[main.py CLI Handler]
+ B --> C[YouTubeTranscriptApi Instance]
+ C --> D[YouTube timedtext Endpoint]
+ D -- XML/JSON Data --> C
+ C -- List of Snippets --> B
+ B --> E{Output Mode}
+ E -->|Write to File| F[Exported Transcript]
+ E -->|Terminal| G[Standard Output]
+ ```
+
+ ---
+
+ ## Features
+
+ - Direct API access — no HTML parsing, no browser automation
+ - Supports full YouTube URLs, short youtu.be links, Shorts URLs, embed URLs, and raw video IDs
+ - Output formats: plain text, JSON, SRT (SubRip), WebVTT
+ - Optional timestamp preservation in plain-text output
+ - Language selection with ordered fallback (e.g. try Japanese, then English)
+ - Batch processing — fetch transcripts for multiple videos in one command
+ - Auto-saves to file or directory with correct file extension
+ - Lists all available transcript languages for any video
+
+ ---
+
+ ## Installation
+
+ ```bash
+ git clone https://github.com/your-username/youtube-transcript-fetcher.git
+ cd youtube-transcript-fetcher
+ python -m venv .venv
+ source .venv/bin/activate # Windows: .venv\Scripts\activate
+ pip install -r requirements.txt
+ ```
+
+ ---
+
+ ## Quick Start
+
+ ```bash
+ # Print transcript to terminal
+ python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+
+ # Save as plain text
+ python main.py dQw4w9WgXcQ -o transcript.txt
+
+ # Save as SRT subtitles
+ python main.py dQw4w9WgXcQ -f srt -o transcript.srt
+
+ # Save as JSON (includes start time + duration per segment)
+ python main.py dQw4w9WgXcQ -f json -o transcript.json
+
+ # Include timestamps in plain-text output
+ python main.py dQw4w9WgXcQ -t
+
+ # Request Spanish transcript, fall back to English if unavailable
+ python main.py dQw4w9WgXcQ -l es en
+
+ # List every available language for a video
+ python main.py dQw4w9WgXcQ --list
+
+ # Batch: fetch three videos and save each to ./transcripts/
+ python main.py ID1 ID2 ID3 -o ./transcripts/
+ ```
+
+ ---
+
+ ## CLI Reference
+
+ ```
+ usage: main.py [-h] [-l LANG [LANG ...]] [-f {text,json,srt,vtt}]
+ [-t] [-o PATH] [--list]
+ video [video ...]
+
+ positional arguments:
+ video YouTube video URL(s) or video ID(s)
+
+ optional arguments:
+ -h, --help show this help message and exit
+ -l, --languages Language codes in order of preference (default: en)
+ -f, --format Output format: text, json, srt, vtt (default: text)
+ -t, --timestamps Add timestamps to plain-text output
+ -o, --output Output file (single video) or directory (batch)
+ --list List all available transcript languages and exit
+ ```
+
+ ---
+
+ ## JSON Output Structure
+
+ Each entry in the JSON array contains:
+
+ ```json
+ [
+ {
+ "text": "Never gonna give you up",
+ "start": 43.08,
+ "duration": 2.16
+ }
+ ]
+ ```
+
+ | Field | Type | Description |
+ |------------|-------|----------------------------------|
+ | `text` | str | Caption text for the segment |
+ | `start` | float | Start time in seconds |
+ | `duration` | float | Duration of the segment in seconds |
+
+ ---
+
+ ## Supported URL Formats
+
+ ```
+ https://www.youtube.com/watch?v=VIDEO_ID
+ https://youtu.be/VIDEO_ID
+ https://www.youtube.com/shorts/VIDEO_ID
+ https://www.youtube.com/embed/VIDEO_ID
+ VIDEO_ID (raw 11-character ID)
+ ```
+
+ ---
+
+ ## Error Reference
+
+ | Exception | Meaning |
+ |------------------------|----------------------------------------------------|
+ | `TranscriptsDisabled` | The video owner disabled captions |
+ | `VideoUnavailable` | Video is private, deleted, or region-locked |
+ | `NoTranscriptFound` | Requested language(s) do not exist for this video |
+ | `NoTranscriptAvailable`| No captions exist at all for this video |
+
+ ---
+
+ ## Dependencies
+
+ | Package | Version | Purpose |
+ |--------------------------|---------|---------------------------------------|
+ | youtube-transcript-api | 1.2.4 | Direct YouTube caption API access |
+
+ No other dependencies. The standard library handles everything else.
+
+ ---
+
+ ## License
+
+ MIT License. See `LICENSE` for details.
+
+ ---
+
+ ## Citation
+ If you use this tool in your research or project, please cite it as follows:
+
+ ```bibtex
+ @software{albeos2026yttfetcher,
+ author = {Rembrant Oyangoren Albeos},
+ title = {YouTube Transcript Fetcher: High-speed, Zero-scraping Caption Extraction},
+ year = {2026},
+ publisher = {Hugging Face},
+ journal = {Hugging Face Repository},
+ howpublished = {\url{https://huggingface.co/algorembrant/youtube-transcript-fetcher}},
+ version = {1.2.4}
+ }
+ ```
+
+ ---
+
+ ## Disclaimer
+
+ This tool uses YouTube's publicly accessible caption endpoint for personal,
+ educational, and research use. Review YouTube's Terms of Service before
  using this tool in a production or commercial context.
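
The "Supported URL Formats" section of the README lists five input shapes that the CLI accepts. The source of `main.py` is not part of this commit, so the following is only a minimal sketch of how such inputs could be normalised to a video ID. The function name `extract_video_id` and the regular expression for the 11-character ID are assumptions made for illustration, not the tool's actual code.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: a video ID is 11 characters drawn from letters, digits, "_" and "-".
_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(value: str) -> str:
    """Hypothetical helper: return the video ID from a URL or raw ID.

    Covers only the five shapes listed in the README; raises ValueError otherwise.
    """
    value = value.strip()
    if _ID_RE.fullmatch(value):
        return value  # already a raw 11-character ID

    parsed = urlparse(value)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = ""
    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com":
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=VIDEO_ID
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        else:
            # https://www.youtube.com/shorts/VIDEO_ID or /embed/VIDEO_ID
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("shorts", "embed"):
                candidate = parts[1]

    if _ID_RE.fullmatch(candidate):
        return candidate
    raise ValueError(f"Could not extract a video ID from {value!r}")
```

Checking the raw-ID case first keeps the common batch invocation (`python main.py ID1 ID2 ID3`) cheap, and rejecting anything that does not end in a well-formed ID avoids sending malformed requests to the caption endpoint.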