Eyal Rosenthal · Web scraping at scale

SEC EDGAR Bulk Extractor

SEC EDGAR Bulk Extractor — Tickers → CIK → 10-K Filings + XBRL Financials in One CSV

Built specifically against Upwork job ~022050416, "SEC EDGAR Extraction" (US, fixed-price, 5-10 proposals, $700+ verified client, posted 2026-05-02), picked from the freelance pipeline shortlist.

The brief class: take a list of tickers, pull the most recent 10-K / 10-Q / 8-K filings plus structured XBRL financial facts (revenue, net income, balance-sheet items), and output a clean CSV.

Built 2026-05-03 as Demo #25 — the first hybrid-mode (real-Upwork-job-mapped) demo.
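A minimal sketch of the ticker-to-CIK step, not the code in extract.py. It assumes the SEC's public company_tickers.json file, whose records carry cik_str, ticker, and title keys. The helper names build_cik_index and load_ticker_json are hypothetical.

```python
import json
import urllib.request

# Assumption: SEC's public ticker file. Its JSON maps arbitrary string keys to
# records shaped like {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."}.
TICKER_URL = "https://www.sec.gov/files/company_tickers.json"


def build_cik_index(ticker_json: dict) -> dict:
    """Map upper-cased ticker -> 10-digit zero-padded CIK string."""
    return {
        row["ticker"].upper(): str(row["cik_str"]).zfill(10)
        for row in ticker_json.values()
    }


def load_ticker_json(user_agent: str) -> dict:
    """Download the ticker file; user_agent should identify the caller to the SEC."""
    req = urllib.request.Request(TICKER_URL, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

The zero-padded form matters because the data.sec.gov endpoints put the CIK in the URL as ten digits (for example CIK0000320193.json), while the ticker file stores it as a plain integer.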

Run

. ~/freelance/.venv/bin/activate
cd ~/freelance/portfolio_demos/sec_edgar_extractor
python extract.py                                  # uses tickers.txt
python extract.py --tickers AAPL,MSFT,NVDA,GOOGL   # ad hoc

Result (10 mega-cap public companies)

  • 10 / 10 tickers extracted, 0 failures ✅
  • Per-ticker structured fields: company name, CIK, SIC code, industry, state of incorporation, fiscal year end ✅
  • Latest annual XBRL facts (USD): Revenues, NetIncome, Assets, Liabilities, Equity, Cash ✅
  • Latest 10-K + 10-Q + 8-K filing dates + accession numbers + direct URLs ✅
  • 0.15s sleep between calls (well under SEC's 10 req/s rate limit) ✅
  • Tenacity retries with exponential backoff (4-15s) ✅
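The retry behaviour above can be sketched without third-party packages. This is a hand-rolled stand-in for Tenacity, not the extractor's actual code. It assumes the wait doubles per attempt and is clamped to the 4-15s window quoted above. backoff_delays and call_with_retries are hypothetical names.

```python
import time


def backoff_delays(attempts: int, floor: float = 4.0, cap: float = 15.0):
    """Yield one wait per retry: 2**n seconds, clamped to [floor, cap]."""
    for n in range(attempts - 1):
        yield min(cap, max(floor, 2.0 ** n))


def call_with_retries(fn, attempts: int = 5, sleep=time.sleep):
    """Call fn(); on any exception wait and retry, re-raising after the last attempt."""
    delays = list(backoff_delays(attempts))
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(delays[i])
```

With attempts=6 the waits come out as 4, 4, 4, 8, 15 seconds. In real use it would make sense to catch only network and HTTP 429/5xx errors rather than every exception.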

Sample output:

ticker  company         revenues  net income  10-K filed
AAPL    Apple Inc.      $265.6B   $112.0B     2025-10-31
MSFT    Microsoft Corp  $62.5B    $101.8B     2025-07-30
NVDA    NVIDIA Corp     $215.9B   $120.1B     2026-02-25
GOOGL   Alphabet Inc.   $402.8B   $132.2B     2026-02-05
AMZN    Amazon.com      $716.9B   $77.7B      2026-02-06

Why the multi-candidate XBRL field lookup

XBRL is a moving standard — companies report "revenue" under different tag names depending on adoption year and industry:

("Revenues", ["Revenues",
              "RevenueFromContractWithCustomerExcludingAssessedTax",
              "SalesRevenueNet"])

The extractor tries each candidate in order and uses the first that has FY data. This is the difference between a demo that works on AAPL and one that works on AAPL + biotechs + insurers + financial services.
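A minimal sketch of that first-match lookup, assuming the companyfacts JSON layout from data.sec.gov (facts → us-gaap → tag → units → USD, each data point carrying end, val, and fp). The function name latest_fy_value is hypothetical and this is not the extractor's own code.

```python
XBRL_FIELDS = [
    ("Revenues", ["Revenues",
                  "RevenueFromContractWithCustomerExcludingAssessedTax",
                  "SalesRevenueNet"]),
]


def latest_fy_value(companyfacts: dict, candidates: list, unit: str = "USD"):
    """Return (tag, value, period_end) for the first candidate with FY data, else None."""
    us_gaap = companyfacts.get("facts", {}).get("us-gaap", {})
    for tag in candidates:
        points = us_gaap.get(tag, {}).get("units", {}).get(unit, [])
        annual = [p for p in points if p.get("fp") == "FY"]
        if annual:
            # ISO dates sort lexicographically, so max() picks the newest period end.
            newest = max(annual, key=lambda p: p["end"])
            return tag, newest["val"], newest["end"]
    return None
```

One caveat of first-match ordering: a tag a company has stopped using still carries its old FY data points, so it can shadow a newer tag further down the list. Comparing period_end across all candidates and keeping the newest is a possible refinement.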

Adapting to a different brief

  • Different forms: edit find_filing(submissions, "10-K") — also accepts 10-Q, 8-K, S-1, DEF 14A, etc.
  • More fields: append to XBRL_FIELDS — common asks are OperatingCashFlow, CapitalExpenditures, LongTermDebt, EarningsPerShareBasic.
  • Quarterly instead of annual: change the fp == "FY" filter to fp.startswith("Q").
  • Full-text 10-K parsing: extend the pipeline to fetch the primary document URL and parse it with BeautifulSoup (HTML) or pdfplumber (PDF) for narrative sections (Risk Factors, MD&A).
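For the form-type hook in the list above, here is a sketch of what a find_filing(submissions, form) helper could look like. Only the call shape comes from this page. The body and the returned dict are assumptions. It relies on the data.sec.gov submissions JSON, where filings.recent holds parallel arrays (form, accessionNumber, filingDate, primaryDocument).

```python
def find_filing(submissions: dict, form: str):
    """Return the first filing of the given form type in filings.recent, or None."""
    recent = submissions.get("filings", {}).get("recent", {})
    # Assumption: the parallel arrays are ordered newest first, so the first
    # match is the latest filing of that type.
    for i, filed_form in enumerate(recent.get("form", [])):
        if filed_form == form:
            accession = recent["accessionNumber"][i]
            cik = int(submissions["cik"])  # archive paths use the unpadded CIK
            url = (
                f"https://www.sec.gov/Archives/edgar/data/{cik}/"
                f"{accession.replace('-', '')}/{recent['primaryDocument'][i]}"
            )
            return {
                "form": filed_form,
                "filed": recent["filingDate"][i],
                "accession": accession,
                "url": url,
            }
    return None
```

Swapping "10-K" for "10-Q", "8-K", "S-1", or "DEF 14A" only changes the string compared against the form array, so one helper covers every form type.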

SEC compliance

  • Public APIs only: data.sec.gov, www.sec.gov. No login, no auth.
  • Descriptive User-Agent header (SEC requires identification).
  • Rate limit respected: a 0.15s sleep between calls caps throughput at roughly 6.7 req/s, under SEC's 10 req/s cap.
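The two request-side rules above (identify yourself, pace your calls) can be sketched with the standard library. This is an illustration under assumptions, not the extractor's code. Throttle enforces a minimum interval rather than a fixed sleep, and the "edgar-extractor" User-Agent string and the sec_request helper are placeholders.

```python
import time
import urllib.request

MIN_INTERVAL = 0.15  # seconds between calls; 1 / 0.15 is about 6.7 req/s


class Throttle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval=MIN_INTERVAL, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def sec_request(url: str, contact: str) -> urllib.request.Request:
    """Build a request whose User-Agent names the tool and a contact address."""
    # Placeholder header layout: tool name plus contact email.
    return urllib.request.Request(
        url, headers={"User-Agent": f"edgar-extractor ({contact})"}
    )
```

Calling throttle.wait() before each urllib.request.urlopen(sec_request(...)) keeps the pacing in one place, so a slow response does not add an unnecessary extra sleep on top.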

Hire me to build this for your stack

Same patterns, your target site. Send the brief and I'll quote fixed-price within 24 hours.

info@luba.media