SEC EDGAR + XBRL: From Filings to Clean CSV in 30 Seconds
SEC EDGAR + XBRL: From Filings to Clean CSV in 30 Seconds
SEC EDGAR is one of the most valuable free data sources on the internet. Every 10-K, 10-Q, 8-K, S-1, and Form 4 filed by every US-listed company sits there, queryable, with no rate limits beyond a 10-requests-per-second polite-use guideline. Real-time, free, no API key.
So why does Bloomberg charge $24,000/year and AlphaSense charge $4,800/year for "structured financial data"?
Because parsing it is a real problem. The filings are HTML wrappers around XBRL XML, and XBRL has 10,000+ tags, with multiple naming conventions per concept. Same financial fact, three different tags depending on which CPA firm wrote the filer's books. That's the moat. Once you build a solver for it, you have Bloomberg-grade data at zero infrastructure cost.
This tutorial walks through the working solver. Code is in portfolio_demos/sec_edgar_extractor/ in the repo.
The five-minute version
The endpoint you need:
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=10-K
Returns an HTML page listing every 10-K the company has filed. Each row links to a filing-detail page. Each filing-detail page has links to:
- The HTML 10-K (human-readable)
- The XBRL instance document (
.xml) - The XBRL schema (
.xsd) - A pile of exhibits
Parsing the HTML 10-K gives you text (great for NLP). Parsing the XBRL XML gives you the financial statements (great for analysis). You usually want both.
import requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": "Eyal Rosenthal info@luba.media"} # SEC requires this
def get_filings(cik: str, form: str = "10-K"):
url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type={form}&dateb=&owner=include&count=40"
soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
return [(row.find_all("td")[3].text.strip(), # date
row.find_all("td")[1].find("a")["href"]) # filing-detail link
for row in soup.select("tr.even, tr.odd")]
Twenty lines, you have every 10-K Apple ever filed. The hard part is the XBRL.
The XBRL multi-candidate problem
XBRL maps each financial fact to a concept tag. "Net Revenue" should be one tag. It isn't. It's at least twelve:
us-gaap:Revenues
us-gaap:SalesRevenueNet
us-gaap:SalesRevenueGoodsNet
us-gaap:SalesRevenueServicesNet
us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax
us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax
us-gaap:Revenue
us-gaap:NetSales
us-gaap:RevenueNetOfReturnsAllowancesAndDiscounts
us-gaap:OperatingRevenue
us-gaap:RevenueFromGoodsSold
us-gaap:RevenueFromExternalCustomers
Each filer picks one (or more) based on their accounting policy. Apple uses RevenueFromContractWithCustomerExcludingAssessedTax. Google uses Revenues. Tesla uses both, in different sections. Berkshire Hathaway is a hybrid because some of their subs file under different policies than others.
If you grep for us-gaap:Revenues, you miss 60% of the S&P 500. If you greedily take any tag with "Revenue" in the name, you double-count companies that disclose multiple revenue streams.
This is the multi-candidate problem. Solving it is what separates a working SEC extractor from a useless one.
The resolver pattern
The fix is a multi-candidate resolver. For each canonical concept (Revenue, NetIncome, TotalAssets, ...), you maintain a priority-ordered list of XBRL tags. The resolver tries each in priority order and takes the first match.
RESOLVER = {
"Revenue": [
"us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax",
"us-gaap:Revenues",
"us-gaap:SalesRevenueNet",
"us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax",
"us-gaap:Revenue",
"us-gaap:NetSales",
"us-gaap:OperatingRevenue",
],
"NetIncome": [
"us-gaap:NetIncomeLoss",
"us-gaap:ProfitLoss",
"us-gaap:NetIncomeLossAvailableToCommonStockholdersBasic",
],
"TotalAssets": [
"us-gaap:Assets",
],
# ... ~25 of these for full Balance / Income / Cash Flow coverage
}
def resolve(facts: dict, concept: str) -> tuple[float, str] | None:
"""Return (value, tag_used) for the first matching tag, or None."""
for tag in RESOLVER[concept]:
if tag in facts:
return facts[tag], tag
return None
The priority order matters. RevenueFromContractWithCustomerExcludingAssessedTax is the post-2018 ASC 606 standard tag, so it goes first. Older filings use Revenues or SalesRevenueNet. By ordering by recency-of-standard, the resolver picks the most accurate tag for any filing era.
You build the resolver by sampling 50 filings across industries (tech, financial, energy, retail, healthcare, real estate) and noting which tags actually appear. Takes 4-6 hours one time. Then it works for the next decade.
Reading the XBRL itself
The XBRL XML format is verbose but parseable. Each element has a contextRef attribute pointing to a block that defines when the fact applies (which fiscal period, which entity).
<us-gaap:Revenues
contextRef="FD2024Q4YTD"
unitRef="USD"
decimals="-3">119575000000</us-gaap:Revenues>
<context id="FD2024Q4YTD">
<entity>
<identifier scheme="http://www.sec.gov/CIK">0000320193</identifier>
</entity>
<period>
<startDate>2023-09-30</startDate>
<endDate>2024-09-28</endDate>
</period>
</context>
Apple's FY2024 revenue: $119.575B. The decimals="-3" means "rounded to thousands."
To extract every fact for a single filing:
from lxml import etree
def parse_xbrl(xml_path: str) -> dict:
tree = etree.parse(xml_path)
root = tree.getroot()
nsmap = {k or "default": v for k, v in root.nsmap.items()}
# Build context lookup: contextRef → (start, end)
contexts = {}
for ctx in root.iterfind(".//{http://www.xbrl.org/2003/instance}context"):
cid = ctx.get("id")
period = ctx.find("{http://www.xbrl.org/2003/instance}period")
if period is not None:
sd = period.findtext("{http://www.xbrl.org/2003/instance}startDate")
ed = period.findtext("{http://www.xbrl.org/2003/instance}endDate")
inst = period.findtext("{http://www.xbrl.org/2003/instance}instant")
contexts[cid] = (sd, ed) if sd else (inst, inst)
facts = {} # tag → list of (value, period)
for el in root.iter():
if not isinstance(el.tag, str): continue
if "}" not in el.tag: continue
ns, _, local = el.tag[1:].partition("}")
if ns not in ("http://fasb.org/us-gaap/2024",
"http://fasb.org/us-gaap/2023",
"http://fasb.org/us-gaap/2022"): continue
tag = f"us-gaap:{local}"
ctx = contexts.get(el.get("contextRef"), ("?", "?"))
try:
value = float(el.text)
except (TypeError, ValueError):
continue
decimals = int(el.get("decimals", "0"))
if decimals < 0:
value *= 10 ** abs(decimals)
facts.setdefault(tag, []).append({"value": value, "period_start": ctx[0], "period_end": ctx[1]})
return facts
That's the unsexy part. Once you have facts as a dict-of-lists keyed by tag, the resolver above turns it into clean financial statements.
End-to-end: ticker → balance sheet CSV in 30 seconds
Putting it together:
def ticker_to_balance_sheet(ticker: str) -> dict:
cik = ticker_to_cik(ticker) # ~0.5s — uses SEC's company-tickers.json
filing_url = latest_10k(cik) # ~1s — index page parse
xbrl_url = xbrl_link(filing_url) # ~1s — filing-detail parse
facts = parse_xbrl(download(xbrl_url)) # ~5-15s depending on filing size
return {
canonical: resolve(facts, canonical)
for canonical in ("Revenue", "NetIncome", "TotalAssets",
"TotalLiabilities", "StockholdersEquity",
"CashAndCashEquivalents", "OperatingCashFlow")
}
For a single ticker, ~10-20 seconds end-to-end. For a watchlist of 25 tickers, ~5 minutes if run in parallel (the bottleneck is XBRL parsing, which is CPU-bound and parallelizes cleanly with ProcessPoolExecutor).
What you can build with this
The pattern unlocks several premium product shapes:
- Hedge fund screeners — "every S&P 500 company where Revenue grew >20% YoY but NetIncome shrunk." 30 seconds of compute.
- Quarterly earnings briefings — auto-generated within 1h of new 10-Q being filed (poll EDGAR's RSS feed).
- Custom watchlists — arbitrary tickers, custom field selection, push to BigQuery / Google Sheets / Slack.
- NLP-on-MD&A — extract the textual sections (Item 7 = Management Discussion, Item 1A = Risk Factors) and run sentiment / topic analysis.
- Insider trading alerts — Form 4 filings parse cleanly via the same XBRL pattern.
Pricing on Upwork for SEC-EDGAR work is $80-150/hr or $500-3,000 fixed-price per project. The clients who pay this aren't comparison-shopping with $30/hr Pakistani freelancers. They're hedge funds, family offices, accounting firms.
What I won't do for free
The version of this in my repo (portfolio_demos/sec_edgar_extractor/) is the working baseline. The version I ship to clients includes:
- A 25-tag resolver covering the full balance sheet, income statement, and cash flow
- Multi-period extraction (current period + 4 historical comparators)
- Filing-diff alerts (notify when a new 10-Q is filed for a watchlist ticker)
- Direct push to BigQuery / Postgres / Google Sheets
Quote on request — fixed-price, ships in 7-10 days. info@luba.media
What to read next
- 100 Production Web Scrapers, One Repo: The Patterns That Repeat — SEC EDGAR is Pattern 5 (filing extraction). The other patterns share a lot of infrastructure.
- Why $5/mo VPS Beats $1,200/mo ScrapingBee — the deployment model for the daily-poll version of this pipeline.
- The repo:
portfolio_demos/sec_edgar_extractor/— full working code, with the resolver, XBRL parser, and a sample CSV output for AAPL / GOOGL / MSFT / TSLA / NVDA / META / AMZN / BRK-B / JPM / JNJ.
Hire me to build this for your site
I quote fixed-price and ship in 7-10 days. Send a brief to info@luba.media.
Send a brief