How to Scrape YouTube Videos, Transcripts, and Channel Data in 2026
How to Scrape YouTube in 2026
YouTube is one of the easier-to-extract major platforms — three solid tools cover virtually every use case. The tricky part is YouTube's IP-rate-limiting at scale and a few format/auth gotchas.
What you might want, ranked
| Goal | Tool |
|---|---|
| Download video files | yt-dlp |
| Get transcripts (auto-captions or manual subs) | youtube-transcript-api or yt-dlp --write-auto-subs |
| List videos in a channel | yt-dlp (extract_flat) |
| Get view counts, likes, channel stats | YouTube Data API v3 |
| Real-time monitoring (new uploads) | YouTube Data API v3 + polling |
| Live-stream chat transcripts | pytchat library |
Tool 1: yt-dlp (the swiss-army knife)
yt-dlp is the active fork of youtube-dl. CLI + Python library. Maintained, extensible, supports 1,000+ sites.
brew install yt-dlp # macOS
pip install yt-dlp # Python library
brew install deno # JS runtime — yt-dlp needs this for current YouTube
Download a single video
yt-dlp "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
Outputs to dQw4w9WgXcQ - (best available format) by default.
Download an entire channel
yt-dlp -o "%(playlist_index)03d - %(title)s.%(ext)s" \
"https://www.youtube.com/@CHANNEL_NAME/videos"
The -o template numbers files in upload order with the video title. Channel-scale downloads work but YouTube will throttle you after ~200-500 videos in a session.
Just the audio (mp3)
yt-dlp --extract-audio --audio-format mp3 \
"https://www.youtube.com/watch?v=VIDEO_ID"
Just the metadata, no download
yt-dlp --skip-download --print "%(id)s|%(title)s|%(view_count)s|%(duration)s" \
"https://www.youtube.com/@CHANNEL/videos"
Faster than fetching each video, useful for "list every video on a channel" extraction.
Programmatic use
import yt_dlp
with yt_dlp.YoutubeDL({"extract_flat": True, "quiet": True}) as ydl:
info = ydl.extract_info("https://www.youtube.com/@JohnWatsonRooney/videos",
download=False)
for entry in info["entries"]:
print(entry["id"], entry["title"])
Tool 2: youtube-transcript-api (transcripts, fast)
When you want the spoken-text transcript without downloading audio, use this library directly.
pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi
api = YouTubeTranscriptApi()
fetched = api.fetch("dQw4w9WgXcQ", languages=["en", "en-US", "en-GB"])
for snippet in fetched:
print(f"[{snippet.start:.1f}s] {snippet.text}")
# Or get the full text in one string
full_text = " ".join(s.text for s in fetched)
YouTube's auto-captions are ~95% accurate on clean audio. Punctuation is iffy but usable for search, RAG, LLM analysis.
Gotcha: at scale (hundreds of videos in quick succession), YouTube IP-rate-limits transcript fetches. Add time.sleep(1.5) between calls. For very high volume, route through residential proxies (Webshare $3-15/mo).
Tool 3: YouTube Data API v3 (engagement metrics)
For view counts, likes, comments, channel statistics — yt-dlp and transcript-api can't give you these reliably. Use the official API.
Setup:
- Create a Google Cloud project: console.cloud.google.com
- Enable "YouTube Data API v3"
- Create an API key (no OAuth needed for read-only public data)
import requests
API_KEY = "YOUR_KEY"
def video_stats(video_id: str) -> dict:
r = requests.get("https://www.googleapis.com/youtube/v3/videos", params={
"part": "snippet,statistics", "id": video_id, "key": API_KEY,
})
return r.json()["items"][0]
stats = video_stats("dQw4w9WgXcQ")
print(stats["snippet"]["title"])
print(stats["statistics"]["viewCount"], "views")
print(stats["statistics"]["likeCount"], "likes")
Free quota: 10,000 units/day. Each video lookup costs ~3 units, so ~3,000 video stats queries/day on the free tier.
Combining all three: a real-world pattern
For "give me every video in this channel with title + views + transcript":
import yt_dlp, requests
from youtube_transcript_api import YouTubeTranscriptApi
import time
API_KEY = "YOUR_KEY"
CHANNEL = "https://www.youtube.com/@JohnWatsonRooney"
# Step 1 — list videos via yt-dlp (fast, no API quota)
with yt_dlp.YoutubeDL({"extract_flat": True, "quiet": True}) as ydl:
info = ydl.extract_info(CHANNEL + "/videos", download=False)
videos = info["entries"]
print(f"Found {len(videos)} videos")
# Step 2 — for each video, get transcript + stats in parallel
api = YouTubeTranscriptApi()
out = []
for i, v in enumerate(videos[:50]): # first 50 for demo
vid = v["id"]
# Stats from Data API
stats_r = requests.get("https://www.googleapis.com/youtube/v3/videos", params={
"part": "statistics", "id": vid, "key": API_KEY,
}).json()
stats = stats_r["items"][0]["statistics"] if stats_r.get("items") else {}
# Transcript
try:
fetched = api.fetch(vid, languages=["en"])
transcript = " ".join(s.text for s in fetched)
except Exception:
transcript = None
out.append({
"id": vid, "title": v["title"],
"views": stats.get("viewCount"), "likes": stats.get("likeCount"),
"transcript_words": len(transcript.split()) if transcript else 0,
"transcript": transcript,
})
time.sleep(1.5) # rate-limit politeness
if i % 10 == 0:
print(f" [{i}/50]")
# Save
import json
with open("channel_dataset.json", "w") as f:
json.dump(out, f, indent=2)
print(f"Saved {len(out)} records")
This is the pattern behind portfolio_demos/youtube_channel_monitor/ in the repo, plus the channel-transcripts demo I built recently.
Common gotchas
Gotcha 1: yt-dlp needs deno or another JS runtime
YouTube's web client requires JavaScript challenge solving. yt-dlp uses deno by default. Without it: "Requested format is not available" errors.
brew install deno
Gotcha 2: IP rate-limiting at scale
Both transcript-api and yt-dlp share the underlying YouTube infrastructure. After ~80-200 fast requests, YouTube will return 429s for ~30 min to several hours.
Fixes (in order):
- Add inter-request delay (
time.sleep(1.5)) - Use
--cookies-from-browser chrometo send authenticated session (much higher rate limit) - Route through residential proxies
Gotcha 3: Some videos have transcripts disabled
Auto-captions are off on:
- Music videos (rights reasons)
- Live streams (until ~30 min after they end)
- Videos in languages without auto-captioning support
- Videos where the creator manually disabled
youtube-transcript-api raises TranscriptsDisabled or NoTranscriptFound. Catch and skip.
Gotcha 4: YouTube Shorts have separate URLs
Shorts use /shorts/ instead of /watch?v=. Both yt-dlp and the transcript API handle them transparently — same video IDs work for both.
Legal & ToS
YouTube's ToS forbids "downloading content unless YouTube provides a download link." Auto-caption and metadata extraction sit in a grey zone — community practice has been permissive for individual / non-commercial use; commercial republishing is unambiguously not OK.
Practical rules:
- Personal/research use is broadly fine — every academic paper using YouTube data scraped it
- Republishing video files is copyright infringement — don't, ever, even with attribution
- Republishing transcripts is grey — fair use claim depends on transformation/excerpting
- Channel-stats analysis is broadly fine — that's what the Data API is for
What to read next
- How to Scrape Reddit — the other API-friendly major platform
- Self-Healing AI Web Extractors — useful for parsing transcript text afterward
- The repo:
portfolio_demos/youtube_channel_monitor/
For a YouTube-data-shaped brief, send to info@luba.media. Quick gigs ($150-500) ship in 1-3 days.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
Send a brief