Upload 6 files

Browse files

Files changed (6) hide show

.gitignore +2 -0
GUIDE.md +159 -0
LICENSE +21 -0
README.md +199 -0
main.py +314 -0
requirements.txt +1 -0

.gitignore ADDED Viewed

	@@ -0,0 +1,2 @@


1	+ .venv/
2	+ venv/

GUIDE.md ADDED Viewed

	@@ -0,0 +1,159 @@

+# Step-by-Step Setup Guide
+---
+## Prerequisites
+| Requirement | Minimum Version |
+|-------------|-----------------|
+| Python      | 3.8             |
+| pip         | 21.0            |
+---
+## Step 1 — Clone or Download the Project
+If you have Git installed:
+```bash
+git clone https://github.com/your-username/youtube-transcript-fetcher.git
+cd youtube-transcript-fetcher
+```
+Or download and unzip the archive, then open a terminal inside the folder.
+---
+## Step 2 — Create a Virtual Environment (Recommended)
+**macOS / Linux**
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+```
+**Windows (Command Prompt)**
+```cmd
+python -m venv .venv
+.venv\Scripts\activate.bat
+```
+**Windows (PowerShell)**
+```powershell
+python -m venv .venv
+.venv\Scripts\Activate.ps1
+```
+You should see `(.venv)` at the start of your terminal prompt when the environment is active.
+---
+## Step 3 — Install Dependencies
+```bash
+pip install -r requirements.txt
+```
+Verify the install:
+```bash
+pip show youtube-transcript-api
+```
+---
+## Step 4 — Run the Script
+### Basic usage — print transcript to terminal
+```bash
+python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+```
+### Save to a file
+```bash
+python main.py "https://youtu.be/dQw4w9WgXcQ" -o transcript.txt
+```
+### Export as SRT subtitles
+```bash
+python main.py dQw4w9WgXcQ -f srt -o transcript.srt
+```
+### Export as JSON
+```bash
+python main.py dQw4w9WgXcQ -f json -o transcript.json
+```
+### Include timestamps in plain-text output
+```bash
+python main.py dQw4w9WgXcQ -t
+```
+### Request a specific language
+```bash
+python main.py dQw4w9WgXcQ -l es          # Spanish
+python main.py dQw4w9WgXcQ -l ja ko en    # Japanese, then Korean, then English
+```
+### List all available languages for a video
+```bash
+python main.py dQw4w9WgXcQ --list
+```
+### Batch — multiple videos saved to a directory
+```bash
+python main.py VIDEO_ID_1 VIDEO_ID_2 VIDEO_ID_3 -o ./transcripts/
+```
+---
+## Step 5 — Deactivate the Virtual Environment (When Done)
+```bash
+deactivate
+```
+---
+## Troubleshooting
+| Error | Cause | Fix |
+|-------|-------|-----|
+| `TranscriptsDisabled` | The video owner turned off transcripts | Nothing can be done; try another video |
+| `VideoUnavailable` | Video is private, deleted, or region-locked | Check the URL; use a VPN if region-locked |
+| `NoTranscriptFound` | Requested language does not exist | Run `--list` and pick an available language |
+| `ModuleNotFoundError` | Dependencies not installed | Run `pip install -r requirements.txt` |
+| Empty output | Video has no speech or very short content | Confirm the video has captions enabled |
+---
+## Input Formats Accepted
+All of the following point to the same video and are equally valid input:
+```
+https://www.youtube.com/watch?v=dQw4w9WgXcQ
+https://youtu.be/dQw4w9WgXcQ
+https://www.youtube.com/shorts/dQw4w9WgXcQ
+https://www.youtube.com/embed/dQw4w9WgXcQ
+dQw4w9WgXcQ
+```
+---
+## Output Formats
+| Flag | Format | Best For |
+|------|--------|----------|
+| `text` (default) | Plain text, one line per caption segment | Reading, summarization, NLP |
+| `json` | JSON array with `text`, `start`, `duration` fields | Programmatic processing |
+| `srt` | SubRip subtitle format | Video players, Premiere, DaVinci |
+| `vtt` | WebVTT subtitle format | HTML5 video, browsers |

LICENSE ADDED Viewed

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 Rembrant Oyangoren Albeos
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED Viewed

	@@ -0,0 +1,199 @@

+---
+license: mit
+sdk: static
+colorFrom: blue
+colorTo: red
+tags:
+- youtube
+- transcript
+- api
+- python
+- tools
+---
+![Python](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat-square&logo=python&logoColor=white)
+![License](https://img.shields.io/badge/License-MIT-green?style=flat-square)
+![Dependencies](https://img.shields.io/badge/Dependencies-1-orange?style=flat-square)
+![Platform](https://img.shields.io/badge/Platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey?style=flat-square)
+![No Scraping](https://img.shields.io/badge/No%20Scraping-Direct%20API-brightgreen?style=flat-square)
+---
+# YouTube Transcript Fetcher
+A fast, zero-scraping Python command-line tool that pulls transcripts directly
+from YouTube videos using the official caption delivery API.
+No Selenium. No BeautifulSoup. No headless browsers. Just the raw transcript
+data returned by YouTube's own caption endpoint — in milliseconds.
+---
+## How It Works
+YouTube serves captions through a dedicated timedtext API endpoint. The
+`youtube-transcript-api` library calls that endpoint directly, bypassing all
+HTML parsing entirely. This makes fetches nearly instant regardless of video length.
+---
+## Features
+- Direct API access — no HTML parsing, no browser automation
+- Supports full YouTube URLs, short youtu.be links, Shorts URLs, embed URLs, and raw video IDs
+- Output formats: plain text, JSON, SRT (SubRip), WebVTT
+- Optional timestamp preservation in plain-text output
+- Language selection with ordered fallback (e.g. try Japanese, then English)
+- Batch processing — fetch transcripts for multiple videos in one command
+- Auto-saves to file or directory with correct file extension
+- Lists all available transcript languages for any video
+---
+## Installation
+```bash
+git clone https://github.com/your-username/youtube-transcript-fetcher.git
+cd youtube-transcript-fetcher
+python -m venv .venv
+source .venv/bin/activate      # Windows: .venv\Scripts\activate
+pip install -r requirements.txt
+```
+---
+## Quick Start
+```bash
+# Print transcript to terminal
+python main.py "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
+# Save as plain text
+python main.py dQw4w9WgXcQ -o transcript.txt
+# Save as SRT subtitles
+python main.py dQw4w9WgXcQ -f srt -o transcript.srt
+# Save as JSON (includes start time + duration per segment)
+python main.py dQw4w9WgXcQ -f json -o transcript.json
+# Include timestamps in plain-text output
+python main.py dQw4w9WgXcQ -t
+# Request Spanish transcript, fall back to English if unavailable
+python main.py dQw4w9WgXcQ -l es en
+# List every available language for a video
+python main.py dQw4w9WgXcQ --list
+# Batch: fetch three videos and save each to ./transcripts/
+python main.py ID1 ID2 ID3 -o ./transcripts/
+```
+---
+## CLI Reference
+```
+usage: main.py [-h] [-l LANG [LANG ...]] [-f {text,json,srt,vtt}]
+               [-t] [-o PATH] [--list]
+               video [video ...]
+positional arguments:
+  video                 YouTube video URL(s) or video ID(s)
+optional arguments:
+  -h, --help            show this help message and exit
+  -l, --languages       Language codes in order of preference (default: en)
+  -f, --format          Output format: text, json, srt, vtt (default: text)
+  -t, --timestamps      Add timestamps to plain-text output
+  -o, --output          Output file (single video) or directory (batch)
+  --list                List all available transcript languages and exit
+```
+---
+## JSON Output Structure
+Each entry in the JSON array contains:
+```json
+[
+  {
+    "text": "Never gonna give you up",
+    "start": 43.08,
+    "duration": 2.16
+  }
+]
+```
+| Field      | Type  | Description                      |
+|------------|-------|----------------------------------|
+| `text`     | str   | Caption text for the segment     |
+| `start`    | float | Start time in seconds            |
+| `duration` | float | Duration of the segment in seconds |
+---
+## Supported URL Formats
+```
+https://www.youtube.com/watch?v=VIDEO_ID
+https://youtu.be/VIDEO_ID
+https://www.youtube.com/shorts/VIDEO_ID
+https://www.youtube.com/embed/VIDEO_ID
+VIDEO_ID  (raw 11-character ID)
+```
+---
+## Error Reference
+| Exception              | Meaning                                            |
+|------------------------|----------------------------------------------------|
+| `TranscriptsDisabled`  | The video owner disabled captions                  |
+| `VideoUnavailable`     | Video is private, deleted, or region-locked        |
+| `NoTranscriptFound`    | Requested language(s) do not exist for this video  |
+| `NoTranscriptAvailable`| No captions exist at all for this video            |
+---
+## Dependencies
+| Package                  | Version | Purpose                               |
+|--------------------------|---------|---------------------------------------|
+| youtube-transcript-api   | 1.2.4   | Direct YouTube caption API access     |
+No other dependencies. The standard library handles everything else.
+---
+## License
+MIT License. See `LICENSE` for details.
+---
+## Citation
+If you use this tool in your research or project, please cite it as follows:
+```bibtex
+@software{albeos2026yttfetcher,
+  author = {Rembrant Oyangoren Albeos},
+  title = {YouTube Transcript Fetcher: High-speed, Zero-scraping Caption Extraction},
+  year = {2026},
+  publisher = {Hugging Face},
+  journal = {Hugging Face Repository},
+  howpublished = {\url{https://huggingface.co/algorembrant/youtube-transcript-fetcher}},
+  version = {1.2.4}
+}
+```
+---
+## Disclaimer
+This tool uses YouTube's publicly accessible caption endpoint for personal,
+educational, and research use. Review YouTube's Terms of Service before
+using this tool in a production or commercial context.

main.py ADDED Viewed

	@@ -0,0 +1,314 @@

+#!/usr/bin/env python3
+"""
+YouTube Transcript Fetcher
+Fetches transcripts directly from YouTube videos using the YouTube Transcript API.
+No HTML parsing or scraping involved.
+"""
+import argparse
+import json
+import sys
+import re
+from typing import Optional
+from youtube_transcript_api import YouTubeTranscriptApi
+from youtube_transcript_api.formatters import (
+    TextFormatter,
+    JSONFormatter,
+    SRTFormatter,
+    WebVTTFormatter,
+)
+from youtube_transcript_api._errors import (
+    TranscriptsDisabled,
+    NoTranscriptFound,
+    VideoUnavailable,
+    CouldNotRetrieveTranscript,
+)
+def extract_video_id(url_or_id: str) -> str:
+    """
+    Extract the video ID from a YouTube URL or return it directly if already an ID.
+    Supports formats:
+        - https://www.youtube.com/watch?v=VIDEO_ID
+        - https://youtu.be/VIDEO_ID
+        - https://www.youtube.com/shorts/VIDEO_ID
+        - https://www.youtube.com/embed/VIDEO_ID
+        - VIDEO_ID (raw)
+    """
+    patterns = [
+        r"(?:youtube\.com/watch\?.*v=)([a-zA-Z0-9_-]{11})",
+        r"(?:youtu\.be/)([a-zA-Z0-9_-]{11})",
+        r"(?:youtube\.com/shorts/)([a-zA-Z0-9_-]{11})",
+        r"(?:youtube\.com/embed/)([a-zA-Z0-9_-]{11})",
+    ]
+    for pattern in patterns:
+        match = re.search(pattern, url_or_id)
+        if match:
+            return match.group(1)
+    # Assume raw video ID if it looks like one
+    if re.fullmatch(r"[a-zA-Z0-9_-]{11}", url_or_id):
+        return url_or_id
+    raise ValueError(
+        f"Could not extract a valid YouTube video ID from: {url_or_id}\n"
+        "Accepted formats: full URL, youtu.be short link, or raw 11-character video ID."
+    )
+def list_available_transcripts(video_id: str) -> None:
+    """List all available transcript languages for a video."""
+    api = YouTubeTranscriptApi()
+    transcript_list = api.list(video_id)
+    print(f"\nAvailable transcripts for video: {video_id}\n")
+    manually_created = list(transcript_list._manually_created_transcripts.values())
+    auto_generated = list(transcript_list._generated_transcripts.values())
+    if manually_created:
+        print("Manually created:")
+        for t in manually_created:
+            print(f"  [{t.language_code}] {t.language}")
+    if auto_generated:
+        print("Auto-generated:")
+        for t in auto_generated:
+            print(f"  [{t.language_code}] {t.language} (auto)")
+    if not manually_created and not auto_generated:
+        print("  No transcripts found.")
+def fetch_transcript(
+    video_id: str,
+    languages: Optional[list] = None,
+    output_format: str = "text",
+    preserve_timestamps: bool = False,
+    output_file: Optional[str] = None,
+) -> str:
+    """
+    Fetch transcript for a given video ID.
+    Args:
+        video_id:            YouTube video ID.
+        languages:           Ordered list of language codes to try (e.g. ['en', 'es']).
+                             Falls back to the first available transcript if None.
+        output_format:       One of 'text', 'json', 'srt', 'vtt'.
+        preserve_timestamps: Include timestamps in plain-text output.
+        output_file:         If provided, write transcript to this file path.
+    Returns:
+        The transcript as a formatted string.
+    """
+    if languages is None:
+        languages = ["en"]
+    try:
+        api = YouTubeTranscriptApi()
+        transcript_list = api.list(video_id)
+        # Try requested languages first; fall back to any available transcript
+        try:
+            transcript = transcript_list.find_transcript(languages)
+        except NoTranscriptFound:
+            # Grab whatever is available
+            all_transcripts = list(transcript_list)
+            if not all_transcripts:
+                print(f"Error: No transcript is available for video '{video_id}'.", file=sys.stderr)
+                sys.exit(1)
+            transcript = all_transcripts[0]
+            print(
+                f"Warning: None of the requested languages found. "
+                f"Using [{transcript.language_code}] {transcript.language} instead.",
+                file=sys.stderr,
+            )
+        transcript_data = transcript.fetch()
+        # Format
+        if output_format == "json":
+            formatter = JSONFormatter()
+            result = formatter.format_transcript(transcript_data, indent=2)
+        elif output_format == "srt":
+            formatter = SRTFormatter()
+            result = formatter.format_transcript(transcript_data)
+        elif output_format == "vtt":
+            formatter = WebVTTFormatter()
+            result = formatter.format_transcript(transcript_data)
+        else:  # default: plain text
+            if preserve_timestamps:
+                lines = []
+                for entry in transcript_data:
+                    minutes = int(entry["start"] // 60)
+                    seconds = entry["start"] % 60
+                    lines.append(f"[{minutes:02d}:{seconds:05.2f}] {entry['text']}")
+                result = "\n".join(lines)
+            else:
+                formatter = TextFormatter()
+                result = formatter.format_transcript(transcript_data)
+        if output_file:
+            with open(output_file, "w", encoding="utf-8") as f:
+                f.write(result)
+            print(f"Transcript saved to: {output_file}")
+        return result
+    except TranscriptsDisabled:
+        print(f"Error: Transcripts are disabled for video '{video_id}'.", file=sys.stderr)
+        sys.exit(1)
+    except VideoUnavailable:
+        print(f"Error: Video '{video_id}' is unavailable or does not exist.", file=sys.stderr)
+        sys.exit(1)
+    except CouldNotRetrieveTranscript as e:
+        print(f"Error for video '{video_id}': {e}", file=sys.stderr)
+        sys.exit(1)
+    except Exception as e:
+        print(f"Unexpected error: {e}", file=sys.stderr)
+        sys.exit(1)
+def fetch_multiple(
+    video_ids: list,
+    languages: Optional[list] = None,
+    output_format: str = "text",
+    preserve_timestamps: bool = False,
+    output_dir: Optional[str] = None,
+) -> dict:
+    """
+    Fetch transcripts for multiple video IDs.
+    Args:
+        video_ids:    List of YouTube video IDs.
+        languages:    Language preference list.
+        output_format: Output format string.
+        preserve_timestamps: Include timestamps.
+        output_dir:   Directory to save individual transcript files.
+    Returns:
+        Dictionary mapping video_id -> transcript string (or error message).
+    """
+    import os
+    results = {}
+    for vid in video_ids:
+        print(f"Fetching: {vid}", file=sys.stderr)
+        try:
+            out_file = None
+            if output_dir:
+                ext_map = {"text": "txt", "json": "json", "srt": "srt", "vtt": "vtt"}
+                ext = ext_map.get(output_format, "txt")
+                os.makedirs(output_dir, exist_ok=True)
+                out_file = os.path.join(output_dir, f"{vid}.{ext}")
+            transcript = fetch_transcript(
+                video_id=vid,
+                languages=languages,
+                output_format=output_format,
+                preserve_timestamps=preserve_timestamps,
+                output_file=out_file,
+            )
+            results[vid] = {"status": "ok", "transcript": transcript}
+        except SystemExit:
+            results[vid] = {"status": "error", "transcript": None}
+    return results
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Fetch YouTube video transcripts directly — no scraping required.",
+        formatter_class=argparse.RawTextHelpFormatter,
+    )
+    parser.add_argument(
+        "video",
+        nargs="+",
+        help="YouTube video URL(s) or video ID(s).",
+    )
+    parser.add_argument(
+        "-l", "--languages",
+        nargs="+",
+        default=["en"],
+        metavar="LANG",
+        help="Language codes in order of preference (default: en).\nExample: --languages en es fr",
+    )
+    parser.add_argument(
+        "-f", "--format",
+        choices=["text", "json", "srt", "vtt"],
+        default="text",
+        help="Output format (default: text).",
+    )
+    parser.add_argument(
+        "-t", "--timestamps",
+        action="store_true",
+        help="Include timestamps in plain-text output.",
+    )
+    parser.add_argument(
+        "-o", "--output",
+        metavar="PATH",
+        help="Output file path (single video) or directory (multiple videos).",
+    )
+    parser.add_argument(
+        "--list",
+        action="store_true",
+        help="List all available transcript languages for the video(s) and exit.",
+    )
+    return parser.parse_args()
+def main():
+    args = parse_args()
+    video_ids = [extract_video_id(v) for v in args.video]
+    if args.list:
+        for vid in video_ids:
+            list_available_transcripts(vid)
+        return
+    if len(video_ids) == 1:
+        transcript = fetch_transcript(
+            video_id=video_ids[0],
+            languages=args.languages,
+            output_format=args.format,
+            preserve_timestamps=args.timestamps,
+            output_file=args.output,
+        )
+        if not args.output:
+            print(transcript)
+    else:
+        results = fetch_multiple(
+            video_ids=video_ids,
+            languages=args.languages,
+            output_format=args.format,
+            preserve_timestamps=args.timestamps,
+            output_dir=args.output,
+        )
+        if not args.output:
+            for vid, data in results.items():
+                print(f"\n{'='*60}")
+                print(f"Video ID: {vid}")
+                print(f"{'='*60}")
+                if data["status"] == "ok":
+                    print(data["transcript"])
+                else:
+                    print("Failed to retrieve transcript.")
+if __name__ == "__main__":
+    main()

requirements.txt ADDED Viewed

	@@ -0,0 +1 @@


1	+ youtube-transcript-api==1.2.4