Eyal Rosenthal — Web Scraping & Data Pipelines

https://eyalrosenthal.online/
Production web scraping tutorials, interactive tools, and 46 working demos. Updated regularly.
en-us | Tue, 05 May 2026 18:43:24 +0000

Getting Started with Web Scraping in 2026: From Zero to First Working Scraper in 30 Minutes
https://eyalrosenthal.online/tutorials/getting-started-web-scraping/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
If you've never scraped a website before, start here. The minimum tools, the first working script, and the three things that will trip you up. Written for total beginners; no Python experience assumed.

How to Scrape eBay Listings in 2026
https://eyalrosenthal.online/tutorials/how-to-scrape-ebay/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
eBay is the friendliest major e-commerce scraping target: light anti-bot, generous official API (5k requests/day free), and CSS structures that haven't drifted much in years. Here's the working stack.

How to Scrape Reddit in 2026: Use the Official API (It's Cheap and the Workarounds Aren't)
https://eyalrosenthal.online/tutorials/how-to-scrape-reddit/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Reddit closed public scraping in 2023 but kept the API affordable. PRAW + free OAuth tier handles 90% of use cases. The DIY scraping route exists but is brittle, ToS-risky, and unnecessary.

How to Scrape Wikipedia (The Easy Target Everyone Overcomplicates)
https://eyalrosenthal.online/tutorials/how-to-scrape-wikipedia/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Wikipedia is the simplest legitimate scraping target on the public internet. CC-BY-SA license, official APIs, no anti-bot. Here are the four ways to extract data and which to use when.

How to Scrape YouTube Videos, Transcripts, and Channel Data in 2026
https://eyalrosenthal.online/tutorials/how-to-scrape-youtube/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
YouTube has three things you might want to extract: video files, transcripts, and metadata. Each has its own toolchain. yt-dlp + youtube-transcript-api + the Data API v3 cover 99% of use cases.

Web Scraping FAQ: Every Question I Get Asked
https://eyalrosenthal.online/tutorials/web-scraping-faq/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Direct answers to the 25 most-common web scraping questions: legality, costs, tools, anti-bot, languages, time-to-build, what to do when sites change. No vendor weasel-language.

Web Scraping Glossary: Every Term Defined Plainly
https://eyalrosenthal.online/tutorials/web-scraping-glossary/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
If you're new to web scraping you'll see jargon everywhere: TLS fingerprinting, headless browsers, user agents, rate limits, residential proxies. This is every term you'll encounter, defined in one or two sentences each, in plain English.

Web Scraping Legal & Ethics: 2026 State of Play
https://eyalrosenthal.online/tutorials/web-scraping-legal-ethics/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
What's legal, what's not, and where the gray zones are. The hiQ ruling, GDPR, CFAA, ToS, robots.txt, and the practical rules I follow on every job. Not legal advice, but the realistic landscape.

100 Production Web Scrapers, One Repo: The Patterns That Repeat
https://eyalrosenthal.online/tutorials/100-production-scrapers-one-repo/
2026-05-04T00:00:00Z | info@luba.media (Eyal Rosenthal)
After shipping 100 scrapers across 40+ brief classes, the patterns are obvious. The full taxonomy of web-scraping work, with a real example for each.

Why a $5/mo VPS Beats a $1,200/mo ScrapingBee Plan
https://eyalrosenthal.online/tutorials/5-vps-vs-scrapingbee/
2026-05-04T00:00:00Z | info@luba.media (Eyal Rosenthal)
The actual stack, the actual cost math, and the operational discipline that turns a $5 Hetzner box into a 100k-page-per-day scraping pipeline. Pipeline-as-product, not script-as-deliverable.

Bypassing Cloudflare, DataDome, and PerimeterX in 2026: A Working Playbook
https://eyalrosenthal.online/tutorials/anti-bot-bypass-2026/
2026-05-04T00:00:00Z | info@luba.media (Eyal Rosenthal)
How to scrape sites behind modern anti-bot stacks without paying $1,200/mo for ScrapingBee. curl_cffi, nodriver, residential rotation, headers, and the math behind each choice.

Best Residential Proxy Services 2026: Honest Comparison (Webshare vs Bright Data vs Oxylabs vs IPRoyal)
https://eyalrosenthal.online/tutorials/best-residential-proxies-2026/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Tested all four for production scraping work. The pricing is opaque, the per-GB math is misleading, and the 'best' depends entirely on your volume and country mix. Here's the math, the gotchas, and which one I actually use.

How to Scrape Amazon Product Data in 2026 (And Whether You Should)
https://eyalrosenthal.online/tutorials/how-to-scrape-amazon/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Amazon is the hardest mainstream scraping target: Cloudflare-equivalent anti-bot, aggressive ToS enforcement, and a paid official API. Here's what actually works, what gets you blocked, and when you should just pay Apify $50 instead.

How to Scrape Google Search Results in 2026 (and the Two Real Alternatives)
https://eyalrosenthal.online/tutorials/how-to-scrape-google-search/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Google search is the highest-friction scraping target on the public web: TLS fingerprinting, CAPTCHA escalation, IP-rotation requirements. Here's what works at small scale, what works at large scale, and the two cheap alternatives that solve 90% of use cases.

How to Scrape Indeed Job Listings in 2026
https://eyalrosenthal.online/tutorials/how-to-scrape-indeed/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Indeed has Cloudflare Turnstile, aggressive anti-bot, and no public API for non-employers. Here's the working DIY approach for low volume, the official ATS partner path for serious work, and the public-data alternatives.

How to Scrape LinkedIn in 2026 (Honest: You Can't Do It Safely)
https://eyalrosenthal.online/tutorials/how-to-scrape-linkedin/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
LinkedIn has actively litigated scrapers since 2017 (and won most of the contract-law cases). The hiQ ruling does not protect you from the contract claim. Here's the realistic landscape and the four legitimate alternatives.

How to Scrape Twitter / X in 2026 (Honest: Don't, Use the API)
https://eyalrosenthal.online/tutorials/how-to-scrape-twitter-x/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Twitter/X aggressively litigates scrapers, broke every public scraping library in 2023, and gates content behind login. The official API is your real answer. Here's the unvarnished landscape and a working alternative for the 5% of cases where the API genuinely doesn't fit.

How to Scrape Yelp Business Listings in 2026
https://eyalrosenthal.online/tutorials/how-to-scrape-yelp/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Yelp has anti-bot at the Cloudflare-Turnstile tier and an official API ($95/mo) for the use cases people typically want. Here's the working DIY approach for low volume, the API path for serious work, and the lead-gen alternatives.

Scrapy vs Playwright vs Selenium: 2026 Decision Tree (with the Honest Verdict)
https://eyalrosenthal.online/tutorials/scrapy-vs-playwright-vs-selenium/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Three tools for three different jobs. Most tutorials mix them up. Here's when each one wins, when each one loses, and the simple flowchart that picks the right tool for any scraping brief.

SEC EDGAR + XBRL: From Filings to Clean CSV in 30 Seconds
https://eyalrosenthal.online/tutorials/sec-edgar-xbrl-extraction/
2026-05-04T00:00:00Z | info@luba.media (Eyal Rosenthal)
How to pull structured financial data from SEC filings without paying $20K/year for Bloomberg or $400/month for AlphaSense. The XBRL multi-candidate problem and the resolver that solves it.

Self-Healing AI Web Extractors: A Complete Implementation Guide
https://eyalrosenthal.online/tutorials/self-healing-ai-extractors/
2026-05-04T00:00:00Z | info@luba.media (Eyal Rosenthal)
How to build web scrapers that survive site redesigns. LLM + JSON Schema as the contract, not CSS selectors. Stress-tested against full DOM scrambles. Working code, real numbers.

Web Scraping Tools Comparison 2026: Scrapy vs Playwright vs Beautiful Soup vs ScrapingBee vs DIY
https://eyalrosenthal.online/tutorials/web-scraping-tools-comparison/
2026-05-05T00:00:00Z | info@luba.media (Eyal Rosenthal)
Honest, no-affiliate comparison of every web scraping tool you'll evaluate. When to use Scrapy, when to use Playwright, when to use Beautiful Soup, when to pay for ScrapingBee/Bright Data/Apify, and when to roll your own. Decision tree included.