Eyal Rosenthal · Web scraping at scale

How to Scrape YouTube Videos, Transcripts, and Channel Data in 2026

How to Scrape YouTube in 2026

YouTube is one of the easier-to-extract major platforms — three solid tools cover virtually every use case. The tricky part is YouTube's IP-rate-limiting at scale and a few format/auth gotchas.

What you might want, ranked

GoalTool
Download video filesyt-dlp
Get transcripts (auto-captions or manual subs)youtube-transcript-api or yt-dlp --write-auto-subs
List videos in a channelyt-dlp (extract_flat)
Get view counts, likes, channel statsYouTube Data API v3
Real-time monitoring (new uploads)YouTube Data API v3 + polling
Live-stream chat transcriptspytchat library

Tool 1: yt-dlp (the swiss-army knife)

yt-dlp is the active fork of youtube-dl. CLI + Python library. Maintained, extensible, supports 1,000+ sites.

brew install yt-dlp     # macOS
pip install yt-dlp      # Python library
brew install deno       # JS runtime — yt-dlp needs this for current YouTube

Download a single video

yt-dlp "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Outputs to dQw4w9WgXcQ - .mp4</code> (best available format) by default.</p> <h3 id="download-an-entire-channel">Download an entire channel</h3> <pre><code>yt-dlp -o "%(playlist_index)03d - %(title)s.%(ext)s" \ "https://www.youtube.com/@CHANNEL_NAME/videos"</code></pre> <p>The <code>-o</code> template numbers files in upload order with the video title. Channel-scale downloads work but YouTube will throttle you after ~200-500 videos in a session.</p> <h3 id="just-the-audio-mp3">Just the audio (mp3)</h3> <pre><code>yt-dlp --extract-audio --audio-format mp3 \ "https://www.youtube.com/watch?v=VIDEO_ID"</code></pre> <h3 id="just-the-metadata-no-download">Just the metadata, no download</h3> <pre><code>yt-dlp --skip-download --print "%(id)s|%(title)s|%(view_count)s|%(duration)s" \ "https://www.youtube.com/@CHANNEL/videos"</code></pre> <p>Faster than fetching each video, useful for "list every video on a channel" extraction.</p> <h3 id="programmatic-use">Programmatic use</h3> <pre><code>import yt_dlp with yt_dlp.YoutubeDL({"extract_flat": True, "quiet": True}) as ydl: info = ydl.extract_info("https://www.youtube.com/@JohnWatsonRooney/videos", download=False) for entry in info["entries"]: print(entry["id"], entry["title"])</code></pre> <h2 id="tool-2-youtube-transcript-api-transcripts-fast">Tool 2: <code>youtube-transcript-api</code> (transcripts, fast)</h2> <p>When you want the spoken-text transcript without downloading audio, use this library directly.</p> <pre><code>pip install youtube-transcript-api</code></pre> <pre><code>from youtube_transcript_api import YouTubeTranscriptApi api = YouTubeTranscriptApi() fetched = api.fetch("dQw4w9WgXcQ", languages=["en", "en-US", "en-GB"]) for snippet in fetched: print(f"[{snippet.start:.1f}s] {snippet.text}") # Or get the full text in one string full_text = " ".join(s.text for s in fetched)</code></pre> <p>YouTube's auto-captions are ~95% accurate on clean audio. Punctuation is iffy but usable for search, RAG, LLM analysis.</p> <p><strong>Gotcha</strong>: at scale (hundreds of videos in quick succession), YouTube IP-rate-limits transcript fetches. Add <code>time.sleep(1.5)</code> between calls. For very high volume, route through residential proxies (Webshare $3-15/mo).</p> <h2 id="tool-3-youtube-data-api-v3-engagement-metrics">Tool 3: YouTube Data API v3 (engagement metrics)</h2> <p>For view counts, likes, comments, channel statistics — <code>yt-dlp</code> and transcript-api can't give you these reliably. Use the official API.</p> <p>Setup:</p> <ol><li>Create a Google Cloud project: <a href="https://console.cloud.google.com">console.cloud.google.com</a></li><li>Enable "YouTube Data API v3"</li><li>Create an API key (no OAuth needed for read-only public data)</li></ol> <pre><code>import requests API_KEY = "YOUR_KEY" def video_stats(video_id: str) -> dict: r = requests.get("https://www.googleapis.com/youtube/v3/videos", params={ "part": "snippet,statistics", "id": video_id, "key": API_KEY, }) return r.json()["items"][0] stats = video_stats("dQw4w9WgXcQ") print(stats["snippet"]["title"]) print(stats["statistics"]["viewCount"], "views") print(stats["statistics"]["likeCount"], "likes")</code></pre> <p>Free quota: 10,000 units/day. Each video lookup costs ~3 units, so ~3,000 video stats queries/day on the free tier.</p> <h2 id="combining-all-three-a-real-world-pattern">Combining all three: a real-world pattern</h2> <p>For "give me every video in this channel with title + views + transcript":</p> <pre><code>import yt_dlp, requests from youtube_transcript_api import YouTubeTranscriptApi import time API_KEY = "YOUR_KEY" CHANNEL = "https://www.youtube.com/@JohnWatsonRooney" # Step 1 — list videos via yt-dlp (fast, no API quota) with yt_dlp.YoutubeDL({"extract_flat": True, "quiet": True}) as ydl: info = ydl.extract_info(CHANNEL + "/videos", download=False) videos = info["entries"] print(f"Found {len(videos)} videos") # Step 2 — for each video, get transcript + stats in parallel api = YouTubeTranscriptApi() out = [] for i, v in enumerate(videos[:50]): # first 50 for demo vid = v["id"] # Stats from Data API stats_r = requests.get("https://www.googleapis.com/youtube/v3/videos", params={ "part": "statistics", "id": vid, "key": API_KEY, }).json() stats = stats_r["items"][0]["statistics"] if stats_r.get("items") else {} # Transcript try: fetched = api.fetch(vid, languages=["en"]) transcript = " ".join(s.text for s in fetched) except Exception: transcript = None out.append({ "id": vid, "title": v["title"], "views": stats.get("viewCount"), "likes": stats.get("likeCount"), "transcript_words": len(transcript.split()) if transcript else 0, "transcript": transcript, }) time.sleep(1.5) # rate-limit politeness if i % 10 == 0: print(f" [{i}/50]") # Save import json with open("channel_dataset.json", "w") as f: json.dump(out, f, indent=2) print(f"Saved {len(out)} records")</code></pre> <p>This is the pattern behind <a href="https://github.com/luba-media/freelance/tree/main/portfolio_demos/youtube_channel_monitor"><code>portfolio_demos/youtube_channel_monitor/</code></a> in the repo, plus the channel-transcripts demo I built recently.</p> <h2 id="common-gotchas">Common gotchas</h2> <h3 id="gotcha-1-yt-dlp-needs-deno-or-another-js-runtime">Gotcha 1: yt-dlp needs deno or another JS runtime</h3> <p>YouTube's web client requires JavaScript challenge solving. yt-dlp uses <code>deno</code> by default. Without it: "Requested format is not available" errors.</p> <pre><code>brew install deno</code></pre> <h3 id="gotcha-2-ip-rate-limiting-at-scale">Gotcha 2: IP rate-limiting at scale</h3> <p>Both transcript-api and yt-dlp share the underlying YouTube infrastructure. After ~80-200 fast requests, YouTube will return 429s for ~30 min to several hours.</p> <p>Fixes (in order):</p> <ol><li>Add inter-request delay (<code>time.sleep(1.5)</code>)</li><li>Use <code>--cookies-from-browser chrome</code> to send authenticated session (much higher rate limit)</li><li>Route through residential proxies</li></ol> <h3 id="gotcha-3-some-videos-have-transcripts-disabled">Gotcha 3: Some videos have transcripts disabled</h3> <p>Auto-captions are off on:</p> <ul><li>Music videos (rights reasons)</li><li>Live streams (until ~30 min after they end)</li><li>Videos in languages without auto-captioning support</li><li>Videos where the creator manually disabled</li></ul> <p><code>youtube-transcript-api</code> raises <code>TranscriptsDisabled</code> or <code>NoTranscriptFound</code>. Catch and skip.</p> <h3 id="gotcha-4-youtube-shorts-have-separate-urls">Gotcha 4: YouTube Shorts have separate URLs</h3> <p>Shorts use <code>/shorts/<id></code> instead of <code>/watch?v=<id></code>. Both <code>yt-dlp</code> and the transcript API handle them transparently — same video IDs work for both.</p> <h2 id="legal-tos">Legal & ToS</h2> <p>YouTube's ToS forbids "downloading content unless YouTube provides a download link." Auto-caption and metadata extraction sit in a grey zone — community practice has been permissive for individual / non-commercial use; commercial republishing is unambiguously not OK.</p> <p>Practical rules:</p> <ol><li><strong>Personal/research use is broadly fine</strong> — every academic paper using YouTube data scraped it</li><li><strong>Republishing video files is copyright infringement</strong> — don't, ever, even with attribution</li><li><strong>Republishing transcripts</strong> is grey — fair use claim depends on transformation/excerpting</li><li><strong>Channel-stats analysis</strong> is broadly fine — that's what the Data API is for</li></ol> <h2 id="what-to-read-next">What to read next</h2> <ul><li><a href="/tutorials/how-to-scrape-reddit/">How to Scrape Reddit</a> — the other API-friendly major platform</li><li><a href="/tutorials/self-healing-ai-extractors/">Self-Healing AI Web Extractors</a> — useful for parsing transcript text afterward</li><li>The repo: <a href="https://github.com/luba-media/freelance/tree/main/portfolio_demos/youtube_channel_monitor"><code>portfolio_demos/youtube_channel_monitor/</code></a></li></ul> <p>For a YouTube-data-shaped brief, send to <strong>info@luba.media</strong>. Quick gigs ($150-500) ship in 1-3 days.</p> <div class="cta"> <h2>Hire me to build this for your site</h2> <p>I quote fixed-price and ship in 7-10 days. Send a brief to <a href="mailto:info@luba.media">info@luba.media</a>.</p> <a class="btn" href="mailto:info@luba.media">Send a brief</a> </div> </article> <aside class="toc"><h4>On this page</h4><ul><li><a href="#what-you-might-want-ranked">What you might want, ranked</a></li><li><a href="#tool-1-yt-dlp-the-swiss-army-knife">Tool 1: `yt-dlp` (the swiss-army knife)</a></li><li><a href="#tool-2-youtube-transcript-api-transcripts-fast">Tool 2: `youtube-transcript-api` (transcripts, fast)</a></li><li><a href="#tool-3-youtube-data-api-v3-engagement-metrics">Tool 3: YouTube Data API v3 (engagement metrics)</a></li><li><a href="#combining-all-three-a-real-world-pattern">Combining all three: a real-world pattern</a></li><li><a href="#common-gotchas">Common gotchas</a></li><li><a href="#legal-tos">Legal & ToS</a></li><li><a href="#what-to-read-next">What to read next</a></li></ul></aside> </div> </div> </main> <footer> <div class="container"> <div class="footer-grid"> <div> <h4>One-stop shop for web scraping</h4> <p style="font-size:15px;line-height:1.5;color:var(--muted);margin:0 0 12px">46 production demos, 10 deep tutorials, written by someone who runs a €500K/yr data business in Spain on his own scrapers. Native English, async-only, fixed-price preferred.</p> <p style="font-size:15px;color:var(--muted)"><a href="mailto:info@luba.media" style="color:var(--accent)">info@luba.media</a></p> </div> <div> <h4>Site</h4> <ul> <li><a href="/demos/">Demos</a></li> <li><a href="/tutorials/">Tutorials</a></li> <li><a href="/tools/">Free tools</a></li> <li><a href="/resources/">Resources index</a></li> <li><a href="/about/">About</a></li> </ul> </div> <div> <h4>For machines</h4> <ul> <li><a href="/llms.txt">llms.txt</a></li> <li><a href="/sitemap.xml">sitemap.xml</a></li> <li><a href="/rss.xml">rss.xml</a></li> <li><a href="/robots.txt">robots.txt</a></li> <li><a href="https://github.com/luba-media/freelance">Source on GitHub</a></li> </ul> </div> </div> <div style="border-top:1px solid var(--rule);padding-top:24px;font-size:13px"> © Eyal Rosenthal 2026. Madrid, Spain. Hosted on Vercel. </div> </div> </footer> </body> </html>