algorembrant committed · Commit 297ee7b · verified · 1 Parent(s): 04a092d

Upload 6 files

Files changed (6)
  1. .gitignore +2 -0
  2. GUIDE.md +159 -0
  3. LICENSE +21 -0
  4. README.md +199 -0
  5. main.py +314 -0
  6. requirements.txt +1 -0
.gitignore ADDED
@@ -0,0 +1,2 @@
+ .venv/
+ venv/
GUIDE.md ADDED
@@ -0,0 +1,159 @@
+ # Step-by-Step Setup Guide
+
+ ---
+
+ ## Prerequisites
+
+ | Requirement | Minimum Version |
+ |-------------|-----------------|
+ | Python | 3.8 |
+ | pip | 21.0 |
+
+ ---
+
+ ## Step 1: Clone or Download the Project
+
+ If you have Git installed:
+
+ ```bash
+ git clone https://github.com/your-username/youtube-transcript-fetcher.git
+ cd youtube-transcript-fetcher
+ ```
+
+ Or download and unzip the archive, then open a terminal inside the folder.
+
+ ---
+
+ ## Step 2: Create a Virtual Environment (Recommended)
+
+ **macOS / Linux**
+ ```bash
+ python3 -m venv .venv
+ source .venv/bin/activate
+ ```
+
+ **Windows (Command Prompt)**
+ ```cmd
+ python -m venv .venv
+ .venv\Scripts\activate.bat
+ ```
+
+ **Windows (PowerShell)**
+ ```powershell
+ python -m venv .venv
+ .venv\Scripts\Activate.ps1
+ ```
+
+ You should see `(.venv)` at the start of your terminal prompt when the environment is active.
+
+ ---
+
+ ## Step 3: Install Dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ Verify the install:
+
+ ```bash
+ pip show youtube-transcript-api
+ ```
+
+ ---
+
+ ## Step 4: Run the Script
+
+ ### Basic usage: print transcript to terminal
+
+ ```bash
+ python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+ ```
+
+ ### Save to a file
+
+ ```bash
+ python main.py "https://youtu.be/dQw4w9WgXcQ" -o transcript.txt
+ ```
+
+ ### Export as SRT subtitles
+
+ ```bash
+ python main.py dQw4w9WgXcQ -f srt -o transcript.srt
+ ```
+
+ ### Export as JSON
+
+ ```bash
+ python main.py dQw4w9WgXcQ -f json -o transcript.json
+ ```
+
+ ### Include timestamps in plain-text output
+
+ ```bash
+ python main.py dQw4w9WgXcQ -t
+ ```
+
+ ### Request a specific language
+
+ ```bash
+ python main.py dQw4w9WgXcQ -l es          # Spanish
+ python main.py dQw4w9WgXcQ -l ja ko en    # Japanese, then Korean, then English
+ ```
+
+ ### List all available languages for a video
+
+ ```bash
+ python main.py dQw4w9WgXcQ --list
+ ```
+
+ ### Batch: multiple videos saved to a directory
+
+ ```bash
+ python main.py VIDEO_ID_1 VIDEO_ID_2 VIDEO_ID_3 -o ./transcripts/
+ ```
+
+ ---
+
+ ## Step 5: Deactivate the Virtual Environment (When Done)
+
+ ```bash
+ deactivate
+ ```
+
+ ---
+
+ ## Troubleshooting
+
+ | Error | Cause | Fix |
+ |-------|-------|-----|
+ | `TranscriptsDisabled` | The video owner turned off transcripts | Nothing can be done; try another video |
+ | `VideoUnavailable` | Video is private, deleted, or region-locked | Check the URL; use a VPN if region-locked |
+ | `NoTranscriptFound` | Requested language does not exist | Run `--list` and pick an available language |
+ | `ModuleNotFoundError` | Dependencies not installed | Run `pip install -r requirements.txt` |
+ | Empty output | Video has no speech or very short content | Confirm the video has captions enabled |
+
+ ---
+
+ ## Input Formats Accepted
+
+ All of the following point to the same video and are equally valid input:
+
+ ```
+ https://www.youtube.com/watch?v=dQw4w9WgXcQ
+ https://youtu.be/dQw4w9WgXcQ
+ https://www.youtube.com/shorts/dQw4w9WgXcQ
+ https://www.youtube.com/embed/dQw4w9WgXcQ
+ dQw4w9WgXcQ
+ ```
+
+ ---
+
+ ## Output Formats
+
+ | `-f` value | Format | Best For |
+ |------------|--------|----------|
+ | `text` (default) | Plain text, one line per caption segment | Reading, summarization, NLP |
+ | `json` | JSON array with `text`, `start`, `duration` fields | Programmatic processing |
+ | `srt` | SubRip subtitle format | Video players, Premiere, DaVinci |
+ | `vtt` | WebVTT subtitle format | HTML5 video, browsers |
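Review note on the Output Formats table: to make the relationship between the `json` and `srt` rows concrete, here is a minimal sketch of how one `text`/`start`/`duration` segment could be turned into a SubRip cue. This is a hand-written illustration, not the `SRTFormatter` from `youtube-transcript-api` (which the script actually uses); the helper names `srt_timestamp` and `to_srt` are assumptions made for this example.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments (dicts with text/start/duration) as SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = seg["start"]
        end = start + seg["duration"]
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{seg['text']}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# The single segment shown in the README's JSON example.
segments = [{"text": "Never gonna give you up", "start": 43.08, "duration": 2.16}]
print(to_srt(segments))
```

Working in integer milliseconds avoids float rounding producing values such as `59,1000`. A real formatter may also need to handle overlapping cues, which this sketch ignores.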
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Rembrant Oyangoren Albeos
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ license: mit
+ sdk: static
+ colorFrom: blue
+ colorTo: red
+ tags:
+ - youtube
+ - transcript
+ - api
+ - python
+ - tools
+ ---
+
+ ![Python](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat-square&logo=python&logoColor=white)
+ ![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)
+ ![Dependencies](https://img.shields.io/badge/Dependencies-1-orange?style=flat-square)
+ ![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square)
+ ![No Scraping](https://img.shields.io/badge/No%20Scraping-Direct%20API-brightgreen?style=flat-square)
+
+ ---
+
+ # YouTube Transcript Fetcher
+
+ A fast Python command-line tool that pulls transcripts directly from YouTube
+ videos through the caption endpoint that YouTube's own web client calls
+ (an undocumented API, not an officially supported one).
+
+ No Selenium. No BeautifulSoup. No headless browsers. Just the raw transcript
+ data returned by YouTube's own caption endpoint, fetched over plain HTTP.
+
+ ---
+
+ ## How It Works
+
+ YouTube serves captions through a dedicated timedtext endpoint. The
+ `youtube-transcript-api` library requests caption data from it over plain
+ HTTP, so no browser automation is involved and fetches are typically fast.
+
+ ---
+
+ ## Features
+
+ - Direct API access: no browser automation
+ - Supports full YouTube URLs, short youtu.be links, Shorts URLs, embed URLs, and raw video IDs
+ - Output formats: plain text, JSON, SRT (SubRip), WebVTT
+ - Optional timestamp preservation in plain-text output
+ - Language selection with ordered fallback (e.g. try Japanese, then English)
+ - Batch processing: fetch transcripts for multiple videos in one command
+ - Auto-saves to file or directory with correct file extension
+ - Lists all available transcript languages for any video
+
+ ---
+
+ ## Installation
+
+ ```bash
+ git clone https://github.com/your-username/youtube-transcript-fetcher.git
+ cd youtube-transcript-fetcher
+ python -m venv .venv
+ source .venv/bin/activate        # Windows: .venv\Scripts\activate
+ pip install -r requirements.txt
+ ```
+
+ ---
+
+ ## Quick Start
+
+ ```bash
+ # Print transcript to terminal
+ python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+
+ # Save as plain text
+ python main.py dQw4w9WgXcQ -o transcript.txt
+
+ # Save as SRT subtitles
+ python main.py dQw4w9WgXcQ -f srt -o transcript.srt
+
+ # Save as JSON (includes start time + duration per segment)
+ python main.py dQw4w9WgXcQ -f json -o transcript.json
+
+ # Include timestamps in plain-text output
+ python main.py dQw4w9WgXcQ -t
+
+ # Request Spanish transcript, fall back to English if unavailable
+ python main.py dQw4w9WgXcQ -l es en
+
+ # List every available language for a video
+ python main.py dQw4w9WgXcQ --list
+
+ # Batch: fetch three videos and save each to ./transcripts/
+ python main.py ID1 ID2 ID3 -o ./transcripts/
+ ```
+
+ ---
+
+ ## CLI Reference
+
+ ```
+ usage: main.py [-h] [-l LANG [LANG ...]] [-f {text,json,srt,vtt}]
+                [-t] [-o PATH] [--list]
+                video [video ...]
+
+ positional arguments:
+   video             YouTube video URL(s) or video ID(s)
+
+ optional arguments:
+   -h, --help        show this help message and exit
+   -l, --languages   Language codes in order of preference (default: en)
+   -f, --format      Output format: text, json, srt, vtt (default: text)
+   -t, --timestamps  Add timestamps to plain-text output
+   -o, --output      Output file (single video) or directory (batch)
+   --list            List all available transcript languages and exit
+ ```
+
+ ---
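Review note on the CLI Reference: both `video` and `-l/--languages` accept one or more values, so argument order matters. The sketch below rebuilds only the relevant part of the parser from `main.py` (the `build_parser` name is an assumption for this example) and shows how a typical command line is split.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Subset of the main.py parser: positional videos plus -l and -f."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("video", nargs="+")
    parser.add_argument("-l", "--languages", nargs="+", default=["en"], metavar="LANG")
    parser.add_argument("-f", "--format", choices=["text", "json", "srt", "vtt"], default="text")
    return parser


# Video IDs first, then the language list: each value lands where expected.
args = build_parser().parse_args(["dQw4w9WgXcQ", "-l", "ja", "ko", "en", "-f", "srt"])
print(args.video, args.languages, args.format)
```

Because `-l` takes every following value up to the next flag, a command such as `main.py -l es dQw4w9WgXcQ` leaves no positional `video` and argparse exits with a usage error. Placing the video arguments before `-l`, as the Quick Start examples do, avoids this.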
+
+ ## JSON Output Structure
+
+ Each entry in the JSON array contains:
+
+ ```json
+ [
+   {
+     "text": "Never gonna give you up",
+     "start": 43.08,
+     "duration": 2.16
+   }
+ ]
+ ```
+
+ | Field      | Type  | Description                        |
+ |------------|-------|------------------------------------|
+ | `text`     | str   | Caption text for the segment       |
+ | `start`    | float | Start time in seconds              |
+ | `duration` | float | Duration of the segment in seconds |
+
+ ---
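Review note on the JSON Output Structure: a short sketch of consuming that structure, assuming a file produced with `-f json`. The first segment is the one from the README; the second segment and the `segments_between` helper are illustrative assumptions, not part of the tool.

```python
import json

# Inline stand-in for the contents of transcript.json.
sample = """
[
  {"text": "Never gonna give you up", "start": 43.08, "duration": 2.16},
  {"text": "Never gonna let you down", "start": 45.24, "duration": 2.0}
]
"""


def segments_between(segments, t0: float, t1: float):
    """Return segments whose [start, start + duration) span overlaps [t0, t1)."""
    return [
        seg for seg in segments
        if seg["start"] < t1 and seg["start"] + seg["duration"] > t0
    ]


segments = json.loads(sample)
for seg in segments_between(segments, 40.0, 44.0):
    print(f"{seg['start']:.2f}s  {seg['text']}")
```

The end time of a segment is not stored; it is derived as `start + duration`.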
+
+ ## Supported URL Formats
+
+ ```
+ https://www.youtube.com/watch?v=VIDEO_ID
+ https://youtu.be/VIDEO_ID
+ https://www.youtube.com/shorts/VIDEO_ID
+ https://www.youtube.com/embed/VIDEO_ID
+ VIDEO_ID                                   (raw 11-character ID)
+ ```
+
+ ---
+
+ ## Error Reference
+
+ | Exception               | Meaning                                            |
+ |-------------------------|----------------------------------------------------|
+ | `TranscriptsDisabled`   | The video owner disabled captions                  |
+ | `VideoUnavailable`      | Video is private, deleted, or region-locked        |
+ | `NoTranscriptFound`     | Requested language(s) do not exist for this video  |
+ | `NoTranscriptAvailable` | No captions exist at all for this video            |
+
+ ---
+
+ ## Dependencies
+
+ | Package                  | Version | Purpose                               |
+ |--------------------------|---------|---------------------------------------|
+ | youtube-transcript-api   | 1.2.4   | Direct YouTube caption API access     |
+
+ No other dependencies. The standard library handles everything else.
+
+ ---
+
+ ## License
+
+ MIT License. See `LICENSE` for details.
+
+ ---
+
+ ## Citation
+
+ If you use this tool in your research or project, please cite it as follows:
+
+ ```bibtex
+ @software{albeos2026yttfetcher,
+   author = {Rembrant Oyangoren Albeos},
+   title = {YouTube Transcript Fetcher: High-speed, Zero-scraping Caption Extraction},
+   year = {2026},
+   publisher = {Hugging Face},
+   journal = {Hugging Face Repository},
+   howpublished = {\url{https://huggingface.co/algorembrant/youtube-transcript-fetcher}},
+   version = {1.2.4}
+ }
+ ```
+
+ ---
+
+ ## Disclaimer
+
+ This tool uses YouTube's publicly accessible caption endpoint for personal,
+ educational, and research use. Review YouTube's Terms of Service before
+ using this tool in a production or commercial context.
main.py ADDED
@@ -0,0 +1,314 @@
+ #!/usr/bin/env python3
+ """
+ YouTube Transcript Fetcher
+ Fetches transcripts directly from YouTube videos using the YouTube Transcript API.
+ No HTML parsing or scraping involved.
+ """
+
+ import argparse
+ import json
+ import sys
+ import re
+ from typing import Optional
+
+ from youtube_transcript_api import YouTubeTranscriptApi
+ from youtube_transcript_api.formatters import (
+     TextFormatter,
+     JSONFormatter,
+     SRTFormatter,
+     WebVTTFormatter,
+ )
+ from youtube_transcript_api._errors import (
+     TranscriptsDisabled,
+     NoTranscriptFound,
+     VideoUnavailable,
+     CouldNotRetrieveTranscript,
+ )
+
+
+ def extract_video_id(url_or_id: str) -> str:
+     """
+     Extract the video ID from a YouTube URL or return it directly if already an ID.
+
+     Supports formats:
+     - https://www.youtube.com/watch?v=VIDEO_ID
+     - https://youtu.be/VIDEO_ID
+     - https://www.youtube.com/shorts/VIDEO_ID
+     - https://www.youtube.com/embed/VIDEO_ID
+     - VIDEO_ID (raw)
+     """
+     patterns = [
+         r"(?:youtube\.com/watch\?.*v=)([a-zA-Z0-9_-]{11})",
+         r"(?:youtu\.be/)([a-zA-Z0-9_-]{11})",
+         r"(?:youtube\.com/shorts/)([a-zA-Z0-9_-]{11})",
+         r"(?:youtube\.com/embed/)([a-zA-Z0-9_-]{11})",
+     ]
+     for pattern in patterns:
+         match = re.search(pattern, url_or_id)
+         if match:
+             return match.group(1)
+
+     # Assume raw video ID if it looks like one
+     if re.fullmatch(r"[a-zA-Z0-9_-]{11}", url_or_id):
+         return url_or_id
+
+     raise ValueError(
+         f"Could not extract a valid YouTube video ID from: {url_or_id}\n"
+         "Accepted formats: full URL, youtu.be short link, or raw 11-character video ID."
+     )
+
+
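Review note on `extract_video_id`: the regex patterns use `re.search`, so they match the `youtube.com/...` text anywhere in the input, including inside another host name. As an alternative sketch (not the committed implementation), the same five input formats can be handled with `urllib.parse` and an explicit host check. The name `video_id_from_url` is hypothetical, and unlike the regex version this sketch requires URLs to include a scheme such as `https://`.

```python
import re
from urllib.parse import parse_qs, urlparse

ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def video_id_from_url(url_or_id: str) -> str:
    """Return the 11-character video ID from a YouTube URL or a raw ID."""
    if ID_RE.fullmatch(url_or_id):
        return url_or_id

    parsed = urlparse(url_or_id)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        # https://youtu.be/VIDEO_ID
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host == "youtube.com":
        if parsed.path == "/watch":
            # https://www.youtube.com/watch?v=VIDEO_ID (any parameter order)
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        else:
            # https://www.youtube.com/shorts/VIDEO_ID or /embed/VIDEO_ID
            parts = parsed.path.strip("/").split("/")
            if len(parts) >= 2 and parts[0] in ("shorts", "embed"):
                candidate = parts[1]

    if candidate and ID_RE.fullmatch(candidate):
        return candidate
    raise ValueError(f"Could not extract a valid YouTube video ID from: {url_or_id}")


print(video_id_from_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```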
+ def list_available_transcripts(video_id: str) -> None:
+     """List all available transcript languages for a video."""
+     api = YouTubeTranscriptApi()
+     transcript_list = api.list(video_id)
+
+     print(f"\nAvailable transcripts for video: {video_id}\n")
+
+     # Split on the public `is_generated` flag rather than the library's private dicts.
+     transcripts = list(transcript_list)
+     manually_created = [t for t in transcripts if not t.is_generated]
+     auto_generated = [t for t in transcripts if t.is_generated]
+
+     if manually_created:
+         print("Manually created:")
+         for t in manually_created:
+             print(f"  [{t.language_code}] {t.language}")
+
+     if auto_generated:
+         print("Auto-generated:")
+         for t in auto_generated:
+             print(f"  [{t.language_code}] {t.language} (auto)")
+
+     if not manually_created and not auto_generated:
+         print("  No transcripts found.")
+
+
+ def fetch_transcript(
+     video_id: str,
+     languages: Optional[list] = None,
+     output_format: str = "text",
+     preserve_timestamps: bool = False,
+     output_file: Optional[str] = None,
+ ) -> str:
+     """
+     Fetch transcript for a given video ID.
+
+     Args:
+         video_id: YouTube video ID.
+         languages: Ordered list of language codes to try (e.g. ['en', 'es']).
+             Defaults to ['en'] if None. If none of them is available,
+             the first transcript listed for the video is used instead.
+         output_format: One of 'text', 'json', 'srt', 'vtt'.
+         preserve_timestamps: Include timestamps in plain-text output.
+         output_file: If provided, write transcript to this file path.
+
+     Returns:
+         The transcript as a formatted string.
+     """
+     if languages is None:
+         languages = ["en"]
+
+     try:
+         api = YouTubeTranscriptApi()
+         transcript_list = api.list(video_id)
+
+         # Try requested languages first; fall back to any available transcript
+         try:
+             transcript = transcript_list.find_transcript(languages)
+         except NoTranscriptFound:
+             # Grab whatever is available
+             all_transcripts = list(transcript_list)
+
+             if not all_transcripts:
+                 print(f"Error: No transcript is available for video '{video_id}'.", file=sys.stderr)
+                 sys.exit(1)
+
+             transcript = all_transcripts[0]
+             print(
+                 f"Warning: None of the requested languages found. "
+                 f"Using [{transcript.language_code}] {transcript.language} instead.",
+                 file=sys.stderr,
+             )
+
+         transcript_data = transcript.fetch()
+
+         # Format
+         if output_format == "json":
+             formatter = JSONFormatter()
+             result = formatter.format_transcript(transcript_data, indent=2)
+
+         elif output_format == "srt":
+             formatter = SRTFormatter()
+             result = formatter.format_transcript(transcript_data)
+
+         elif output_format == "vtt":
+             formatter = WebVTTFormatter()
+             result = formatter.format_transcript(transcript_data)
+
+         else:  # default: plain text
+             if preserve_timestamps:
+                 lines = []
+                 for entry in transcript_data:
+                     # With the pinned 1.x library, fetch() yields snippet objects,
+                     # so fields are attributes rather than dict keys.
+                     minutes = int(entry.start // 60)
+                     seconds = entry.start % 60
+                     lines.append(f"[{minutes:02d}:{seconds:05.2f}] {entry.text}")
+                 result = "\n".join(lines)
+             else:
+                 formatter = TextFormatter()
+                 result = formatter.format_transcript(transcript_data)
+
+         if output_file:
+             with open(output_file, "w", encoding="utf-8") as f:
+                 f.write(result)
+             print(f"Transcript saved to: {output_file}")
+
+         return result
+
+     except TranscriptsDisabled:
+         print(f"Error: Transcripts are disabled for video '{video_id}'.", file=sys.stderr)
+         sys.exit(1)
+     except VideoUnavailable:
+         print(f"Error: Video '{video_id}' is unavailable or does not exist.", file=sys.stderr)
+         sys.exit(1)
+     except CouldNotRetrieveTranscript as e:
+         print(f"Error for video '{video_id}': {e}", file=sys.stderr)
+         sys.exit(1)
+     except Exception as e:
+         print(f"Unexpected error: {e}", file=sys.stderr)
+         sys.exit(1)
+
+
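Review note on the `-t` timestamp formatting in `fetch_transcript`: splitting a float into minutes and a seconds remainder and then rounding only the remainder can print a seconds field of `60.00`. The sketch below reproduces that expression as `naive_timestamp` and shows one possible alternative, `format_timestamp`, which rounds to whole centiseconds before splitting. Both function names are assumptions for this example.

```python
def naive_timestamp(seconds: float) -> str:
    """Same arithmetic as the plain-text branch in fetch_transcript."""
    minutes = int(seconds // 60)
    secs = seconds % 60
    return f"[{minutes:02d}:{secs:05.2f}]"


def format_timestamp(seconds: float) -> str:
    """Round to centiseconds first so the seconds field never reaches 60."""
    total_cs = round(seconds * 100)
    minutes, cs_rem = divmod(total_cs, 6000)
    secs, cs = divmod(cs_rem, 100)
    return f"[{minutes:02d}:{secs:02d}.{cs:02d}]"


print(naive_timestamp(59.999))   # seconds field rounds up to 60.00
print(format_timestamp(59.999))  # carries into the minutes field instead
```

For ordinary values both functions agree; they differ only when the start time sits within half a centisecond below a minute boundary.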
+ def fetch_multiple(
+     video_ids: list,
+     languages: Optional[list] = None,
+     output_format: str = "text",
+     preserve_timestamps: bool = False,
+     output_dir: Optional[str] = None,
+ ) -> dict:
+     """
+     Fetch transcripts for multiple video IDs.
+
+     Args:
+         video_ids: List of YouTube video IDs.
+         languages: Language preference list.
+         output_format: Output format string.
+         preserve_timestamps: Include timestamps.
+         output_dir: Directory to save individual transcript files.
+
+     Returns:
+         Dictionary mapping video_id -> transcript string (or error message).
+     """
+     import os
+
+     results = {}
+     for vid in video_ids:
+         print(f"Fetching: {vid}", file=sys.stderr)
+         try:
+             out_file = None
+             if output_dir:
+                 ext_map = {"text": "txt", "json": "json", "srt": "srt", "vtt": "vtt"}
+                 ext = ext_map.get(output_format, "txt")
+                 os.makedirs(output_dir, exist_ok=True)
+                 out_file = os.path.join(output_dir, f"{vid}.{ext}")
+
+             transcript = fetch_transcript(
+                 video_id=vid,
+                 languages=languages,
+                 output_format=output_format,
+                 preserve_timestamps=preserve_timestamps,
+                 output_file=out_file,
+             )
+             results[vid] = {"status": "ok", "transcript": transcript}
+         except SystemExit:
+             results[vid] = {"status": "error", "transcript": None}
+
+     return results
+
+
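Review note on `fetch_multiple`: the function returns a per-video status dictionary, but the script does not turn failures into a non-zero exit code in batch mode. A minimal sketch of how a caller might summarize that dictionary is shown below; `summarize_batch` and the `ID1`/`ID2` keys are hypothetical and only illustrate the result shape that `fetch_multiple` builds.

```python
def summarize_batch(results: dict):
    """Return (number of successful videos, list of failed video IDs)."""
    failed = [vid for vid, data in results.items() if data["status"] != "ok"]
    return len(results) - len(failed), failed


# Same shape as the dictionary built by fetch_multiple.
results = {
    "ID1": {"status": "ok", "transcript": "example transcript text"},
    "ID2": {"status": "error", "transcript": None},
}

ok_count, failed = summarize_batch(results)
exit_code = 1 if failed else 0
print(f"{ok_count} succeeded, {len(failed)} failed: {failed}")
```

A caller could pass `exit_code` to `sys.exit` so that shell scripts can detect partial failures.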
+ def parse_args():
+     parser = argparse.ArgumentParser(
+         description="Fetch YouTube video transcripts directly, no scraping required.",
+         formatter_class=argparse.RawTextHelpFormatter,
+     )
+
+     parser.add_argument(
+         "video",
+         nargs="+",
+         help="YouTube video URL(s) or video ID(s).",
+     )
+
+     parser.add_argument(
+         "-l", "--languages",
+         nargs="+",
+         default=["en"],
+         metavar="LANG",
+         help="Language codes in order of preference (default: en).\nExample: --languages en es fr",
+     )
+
+     parser.add_argument(
+         "-f", "--format",
+         choices=["text", "json", "srt", "vtt"],
+         default="text",
+         help="Output format (default: text).",
+     )
+
+     parser.add_argument(
+         "-t", "--timestamps",
+         action="store_true",
+         help="Include timestamps in plain-text output.",
+     )
+
+     parser.add_argument(
+         "-o", "--output",
+         metavar="PATH",
+         help="Output file path (single video) or directory (multiple videos).",
+     )
+
+     parser.add_argument(
+         "--list",
+         action="store_true",
+         help="List all available transcript languages for the video(s) and exit.",
+     )
+
+     return parser.parse_args()
+
+
+ def main():
+     args = parse_args()
+
+     # Report malformed URLs/IDs as a normal CLI error instead of a traceback.
+     try:
+         video_ids = [extract_video_id(v) for v in args.video]
+     except ValueError as e:
+         print(f"Error: {e}", file=sys.stderr)
+         sys.exit(2)
+
+     if args.list:
+         for vid in video_ids:
+             list_available_transcripts(vid)
+         return
+
+     if len(video_ids) == 1:
+         transcript = fetch_transcript(
+             video_id=video_ids[0],
+             languages=args.languages,
+             output_format=args.format,
+             preserve_timestamps=args.timestamps,
+             output_file=args.output,
+         )
+         if not args.output:
+             print(transcript)
+     else:
+         results = fetch_multiple(
+             video_ids=video_ids,
+             languages=args.languages,
+             output_format=args.format,
+             preserve_timestamps=args.timestamps,
+             output_dir=args.output,
+         )
+         if not args.output:
+             for vid, data in results.items():
+                 print(f"\n{'='*60}")
+                 print(f"Video ID: {vid}")
+                 print(f"{'='*60}")
+                 if data["status"] == "ok":
+                     print(data["transcript"])
+                 else:
+                     print("Failed to retrieve transcript.")
+
+
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1 @@
+ youtube-transcript-api==1.2.4
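Review note on the pinned requirement: since `main.py` depends on the behaviour of this exact release, it can be useful to confirm which version is actually installed in the active environment. The sketch below does the same job as `pip show youtube-transcript-api`, using only the standard library (`importlib.metadata`, available from Python 3.8). The `installed_version` helper and the `PINNED` constant are assumptions for this example.

```python
from importlib import metadata
from typing import Optional

PINNED = "1.2.4"  # version pinned in requirements.txt


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version("youtube-transcript-api")
if found is None:
    print("Not installed. Run: pip install -r requirements.txt")
elif found != PINNED:
    print(f"Installed {found}, but requirements.txt pins {PINNED}")
else:
    print(f"youtube-transcript-api {found} matches the pin")
```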