Eyal Rosenthal · Web scraping at scale

SEC EDGAR + XBRL: From Filings to Clean CSV in 30 Seconds

SEC EDGAR + XBRL: From Filings to Clean CSV in 30 Seconds

SEC EDGAR is one of the most valuable free data sources on the internet. Every 10-K, 10-Q, 8-K, S-1, and Form 4 filed by every US-listed company sits there, queryable, with no rate limits beyond a 10-requests-per-second polite-use guideline. Real-time, free, no API key.

So why does Bloomberg charge $24,000/year and AlphaSense charge $4,800/year for "structured financial data"?

Because parsing it is a real problem. The filings are HTML wrappers around XBRL XML, and XBRL has 10,000+ tags, with multiple naming conventions per concept. Same financial fact, three different tags depending on which CPA firm wrote the filer's books. That's the moat. Once you build a solver for it, you have Bloomberg-grade data at zero infrastructure cost.

This tutorial walks through the working solver. Code is in portfolio_demos/sec_edgar_extractor/ in the repo.

The five-minute version

The endpoint you need:

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=10-K

Returns an HTML page listing every 10-K the company has filed. Each row links to a filing-detail page. Each filing-detail page has links to:

  • The HTML 10-K (human-readable)
  • The XBRL instance document (.xml)
  • The XBRL schema (.xsd)
  • A pile of exhibits

Parsing the HTML 10-K gives you text (great for NLP). Parsing the XBRL XML gives you the financial statements (great for analysis). You usually want both.

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Eyal Rosenthal info@luba.media"}  # SEC requires this

def get_filings(cik: str, form: str = "10-K"):
    url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type={form}&dateb=&owner=include&count=40"
    soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
    return [(row.find_all("td")[3].text.strip(),  # date
             row.find_all("td")[1].find("a")["href"])  # filing-detail link
            for row in soup.select("tr.even, tr.odd")]

Twenty lines, you have every 10-K Apple ever filed. The hard part is the XBRL.

The XBRL multi-candidate problem

XBRL maps each financial fact to a concept tag. "Net Revenue" should be one tag. It isn't. It's at least twelve:

us-gaap:Revenues
us-gaap:SalesRevenueNet
us-gaap:SalesRevenueGoodsNet
us-gaap:SalesRevenueServicesNet
us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax
us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax
us-gaap:Revenue
us-gaap:NetSales
us-gaap:RevenueNetOfReturnsAllowancesAndDiscounts
us-gaap:OperatingRevenue
us-gaap:RevenueFromGoodsSold
us-gaap:RevenueFromExternalCustomers

Each filer picks one (or more) based on their accounting policy. Apple uses RevenueFromContractWithCustomerExcludingAssessedTax. Google uses Revenues. Tesla uses both, in different sections. Berkshire Hathaway is a hybrid because some of their subs file under different policies than others.

If you grep for us-gaap:Revenues, you miss 60% of the S&P 500. If you greedily take any tag with "Revenue" in the name, you double-count companies that disclose multiple revenue streams.

This is the multi-candidate problem. Solving it is what separates a working SEC extractor from a useless one.

The resolver pattern

The fix is a multi-candidate resolver. For each canonical concept (Revenue, NetIncome, TotalAssets, ...), you maintain a priority-ordered list of XBRL tags. The resolver tries each in priority order and takes the first match.

RESOLVER = {
    "Revenue": [
        "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
        "us-gaap:Revenues",
        "us-gaap:SalesRevenueNet",
        "us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax",
        "us-gaap:Revenue",
        "us-gaap:NetSales",
        "us-gaap:OperatingRevenue",
    ],
    "NetIncome": [
        "us-gaap:NetIncomeLoss",
        "us-gaap:ProfitLoss",
        "us-gaap:NetIncomeLossAvailableToCommonStockholdersBasic",
    ],
    "TotalAssets": [
        "us-gaap:Assets",
    ],
    # ... ~25 of these for full Balance / Income / Cash Flow coverage
}

def resolve(facts: dict, concept: str) -> tuple[float, str] | None:
    """Return (value, tag_used) for the first matching tag, or None."""
    for tag in RESOLVER[concept]:
        if tag in facts:
            return facts[tag], tag
    return None

The priority order matters. RevenueFromContractWithCustomerExcludingAssessedTax is the post-2018 ASC 606 standard tag, so it goes first. Older filings use Revenues or SalesRevenueNet. By ordering by recency-of-standard, the resolver picks the most accurate tag for any filing era.

You build the resolver by sampling 50 filings across industries (tech, financial, energy, retail, healthcare, real estate) and noting which tags actually appear. Takes 4-6 hours one time. Then it works for the next decade.

Reading the XBRL itself

The XBRL XML format is verbose but parseable. Each element has a contextRef attribute pointing to a block that defines when the fact applies (which fiscal period, which entity).

<us-gaap:Revenues
  contextRef="FD2024Q4YTD"
  unitRef="USD"
  decimals="-3">119575000000</us-gaap:Revenues>

<context id="FD2024Q4YTD">
  <entity>
    <identifier scheme="http://www.sec.gov/CIK">0000320193</identifier>
  </entity>
  <period>
    <startDate>2023-09-30</startDate>
    <endDate>2024-09-28</endDate>
  </period>
</context>

Apple's FY2024 revenue: $119.575B. The decimals="-3" means "rounded to thousands."

To extract every fact for a single filing:

from lxml import etree

def parse_xbrl(xml_path: str) -> dict:
    tree = etree.parse(xml_path)
    root = tree.getroot()
    nsmap = {k or "default": v for k, v in root.nsmap.items()}

    # Build context lookup: contextRef → (start, end)
    contexts = {}
    for ctx in root.iterfind(".//{http://www.xbrl.org/2003/instance}context"):
        cid = ctx.get("id")
        period = ctx.find("{http://www.xbrl.org/2003/instance}period")
        if period is not None:
            sd = period.findtext("{http://www.xbrl.org/2003/instance}startDate")
            ed = period.findtext("{http://www.xbrl.org/2003/instance}endDate")
            inst = period.findtext("{http://www.xbrl.org/2003/instance}instant")
            contexts[cid] = (sd, ed) if sd else (inst, inst)

    facts = {}  # tag → list of (value, period)
    for el in root.iter():
        if not isinstance(el.tag, str): continue
        if "}" not in el.tag: continue
        ns, _, local = el.tag[1:].partition("}")
        if ns not in ("http://fasb.org/us-gaap/2024",
                      "http://fasb.org/us-gaap/2023",
                      "http://fasb.org/us-gaap/2022"): continue
        tag = f"us-gaap:{local}"
        ctx = contexts.get(el.get("contextRef"), ("?", "?"))
        try:
            value = float(el.text)
        except (TypeError, ValueError):
            continue
        decimals = int(el.get("decimals", "0"))
        if decimals < 0:
            value *= 10 ** abs(decimals)
        facts.setdefault(tag, []).append({"value": value, "period_start": ctx[0], "period_end": ctx[1]})
    return facts

That's the unsexy part. Once you have facts as a dict-of-lists keyed by tag, the resolver above turns it into clean financial statements.

End-to-end: ticker → balance sheet CSV in 30 seconds

Putting it together:

def ticker_to_balance_sheet(ticker: str) -> dict:
    cik = ticker_to_cik(ticker)            # ~0.5s — uses SEC's company-tickers.json
    filing_url = latest_10k(cik)            # ~1s — index page parse
    xbrl_url = xbrl_link(filing_url)        # ~1s — filing-detail parse
    facts = parse_xbrl(download(xbrl_url))  # ~5-15s depending on filing size
    return {
        canonical: resolve(facts, canonical)
        for canonical in ("Revenue", "NetIncome", "TotalAssets",
                          "TotalLiabilities", "StockholdersEquity",
                          "CashAndCashEquivalents", "OperatingCashFlow")
    }

For a single ticker, ~10-20 seconds end-to-end. For a watchlist of 25 tickers, ~5 minutes if run in parallel (the bottleneck is XBRL parsing, which is CPU-bound and parallelizes cleanly with ProcessPoolExecutor).

What you can build with this

The pattern unlocks several premium product shapes:

  • Hedge fund screeners — "every S&P 500 company where Revenue grew >20% YoY but NetIncome shrunk." 30 seconds of compute.
  • Quarterly earnings briefings — auto-generated within 1h of new 10-Q being filed (poll EDGAR's RSS feed).
  • Custom watchlists — arbitrary tickers, custom field selection, push to BigQuery / Google Sheets / Slack.
  • NLP-on-MD&A — extract the textual sections (Item 7 = Management Discussion, Item 1A = Risk Factors) and run sentiment / topic analysis.
  • Insider trading alerts — Form 4 filings parse cleanly via the same XBRL pattern.

Pricing on Upwork for SEC-EDGAR work is $80-150/hr or $500-3,000 fixed-price per project. The clients who pay this aren't comparison-shopping with $30/hr Pakistani freelancers. They're hedge funds, family offices, accounting firms.

What I won't do for free

The version of this in my repo (portfolio_demos/sec_edgar_extractor/) is the working baseline. The version I ship to clients includes:

  • A 25-tag resolver covering the full balance sheet, income statement, and cash flow
  • Multi-period extraction (current period + 4 historical comparators)
  • Filing-diff alerts (notify when a new 10-Q is filed for a watchlist ticker)
  • Direct push to BigQuery / Postgres / Google Sheets

Quote on request — fixed-price, ships in 7-10 days. info@luba.media

Hire me to build this for your site

I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.

Send a brief