Two regulators, one schema: building fundamentals pipelines for the U.S. and Hong Kong
bearsignal.ai — /engineering
Forensic mathematics has a boring dependency: it only works on clean, comparable, point-in-time fundamentals. Benford tests need raw reported digits. Accrual models need balance sheets that reconcile across periods. Every model in our engine assumes the inputs are real, aligned, and honest about what they don’t know. Getting to that state — across two regulatory regimes — is most of the actual work.
The U.S. side: structured, vast, and quietly treacherous
The SEC gives you XBRL: machine-readable, standardized, going back years, across roughly 7,900 active filers and millions of filings. It is a gift. It is also a minefield of dialects — companies extend the taxonomy with custom tags, restate prior periods without ceremony, and report the “same” concept under different elements across years.
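The dialect problem can be pictured as alias resolution: several reported elements map to one canonical concept. The sketch below is a minimal illustration under stated assumptions, not the production mapping. The `CONCEPT_ALIASES` table and the `resolve_concept` helper are hypothetical names, the alias list is deliberately incomplete, and returning `None` when aliases disagree is an assumed policy chosen for the example.

```python
from decimal import Decimal
from typing import Mapping, Optional

# Hypothetical alias table: canonical concept -> XBRL elements that filers
# have used for it across years. Illustrative only, not exhaustive.
CONCEPT_ALIASES = {
    "revenue": (
        "RevenueFromContractWithCustomerExcludingAssessedTax",
        "Revenues",
        "SalesRevenueNet",
    ),
}


def resolve_concept(concept: str, reported: Mapping[str, Decimal]) -> Optional[Decimal]:
    """Return the value for a canonical concept, or None when the tags a
    filing reported do not resolve to exactly one value."""
    candidates = {
        reported[tag]
        for tag in CONCEPT_ALIASES.get(concept, ())
        if tag in reported
    }
    # Exactly one distinct value: resolved. Zero values (unknown or custom
    # tag) or several conflicting values: unresolved, so return None.
    if len(candidates) == 1:
        return candidates.pop()
    return None
```

A filing that reports the concept only under a custom extension tag falls through to `None` here, which is the point: the mapping declines to guess.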
Our pipeline treats EDGAR as an append-only source of record and normalizes into a single fundamentals cache keyed by company and period. Two rules carry most of the weight. First: bulk ingestion goes through PostgreSQL’s COPY, never row-by-row inserts — at millions of filings, that’s the difference between an afternoon and a lost week, roughly a 30x throughput gap in our environment. Second: when a value can’t be resolved with confidence, it lands as NULL. Not zero, not a forward-fill, not an industry average. NULL is information; a fabricated value is contamination that no downstream model can detect.
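A minimal sketch of both rules together, assuming a psycopg2 connection and a hypothetical `fundamentals_cache` table with the columns shown. The table name, column names, and helper names are illustrative, not the actual schema. Unresolved values are serialized as an unquoted empty field, which PostgreSQL's CSV-format COPY reads as NULL by default.

```python
import csv
import io
from decimal import Decimal
from typing import Iterable, Optional, Tuple

# (company_id, period_end, concept, value); value is None when unresolved.
Row = Tuple[str, str, str, Optional[Decimal]]

# Hypothetical table and column names, for illustration only.
COPY_SQL = (
    "COPY fundamentals_cache (company_id, period_end, concept, value) "
    "FROM STDIN WITH (FORMAT csv)"
)


def to_copy_buffer(rows: Iterable[Row]) -> io.StringIO:
    """Serialize rows as CSV for COPY. None becomes an unquoted empty field
    (NULL on load); a genuine zero is written as '0'."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for company_id, period_end, concept, value in rows:
        writer.writerow(
            [company_id, period_end, concept, "" if value is None else str(value)]
        )
    buf.seek(0)
    return buf


def bulk_load(conn, rows: Iterable[Row]) -> None:
    """Stream all rows through a single COPY instead of per-row INSERTs.
    `conn` is assumed to be a psycopg2 connection."""
    with conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, to_copy_buffer(rows))
    conn.commit()
```

The serializer keeps zero and unknown distinct on the wire, so the distinction survives the load.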
The Hong Kong side: where the pipeline grows eyes
HKEX has no XBRL equivalent for most of what we need. Disclosure lives in PDF annual reports — thousands of them, in two languages, with financial statements formatted at each issuer’s discretion under IFRS rather than US-GAAP.
So the Hong Kong pipeline is a different animal: filing-metadata harvesting at exchange scale, then LLM-driven extraction from the PDFs themselves — pulling structured statements out of documents that were formatted for human eyes. Every extracted figure carries provenance back to its source document. Extraction at this scale is a batch-economics problem as much as an accuracy problem; we run it as supervised batch jobs with sampled human-grade verification, not as a fire-and-forget script.
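As a rough sketch of what provenance and sampled verification can look like in code: each figure carries a pointer to its source document, and a deterministic hash decides which figures go to a reviewer. Everything here is an assumption made for illustration, including the `ExtractedFigure` fields, the page-level provenance, and the hash-based sampling scheme. The extraction step itself is out of scope.

```python
import hashlib
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ExtractedFigure:
    issuer_code: str
    period_end: str
    concept: str
    value: Optional[Decimal]  # None when extraction could not resolve it
    source_doc_id: str        # provenance: which PDF the figure came from
    source_page: int          # provenance: page within that PDF


def selected_for_review(fig: ExtractedFigure, rate: float) -> bool:
    """Deterministically select roughly `rate` of figures for verification.
    Hashing the provenance key means a re-run picks the same sample."""
    key = f"{fig.source_doc_id}:{fig.source_page}:{fig.concept}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Integer comparison avoids float rounding at the rate=1.0 boundary.
    return bucket < int(rate * 2**64)


def review_sample(figs: Sequence[ExtractedFigure], rate: float) -> List[ExtractedFigure]:
    """Return the subset of figures queued for verification."""
    return [f for f in figs if selected_for_review(f, rate)]
```

Deterministic sampling is a design choice rather than a requirement: it makes a verification batch reproducible, so a disputed figure can be traced to whether it was ever reviewed.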
And then the deeper problem: IFRS and US-GAAP don’t just differ in format — accrual structures, fair-value treatment, and revenue recognition differ in substance. A forensic model naively transplanted across regimes will flag accounting differences as anomalies. Our scoring treats the two regimes as distinct calibration domains rather than pretending one mapping fits both. That adaptation is ongoing work, and we’d rather ship it right than ship it twice.
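To make "distinct calibration domains" concrete, here is a deliberately simple sketch: a metric is standardized against the distribution of its own regime, so the same raw value can be unremarkable under one regime and extreme under the other. The function names and the z-score approach are assumptions for illustration only, and the numbers in the usage note are made up.

```python
from statistics import fmean, pstdev
from typing import Dict, Iterable, Tuple


def calibrate(samples: Dict[str, Iterable[float]]) -> Dict[str, Tuple[float, float]]:
    """Compute a (mean, population std dev) baseline per accounting regime."""
    calibration = {}
    for regime, values in samples.items():
        vals = list(values)
        calibration[regime] = (fmean(vals), pstdev(vals))
    return calibration


def regime_z_score(
    value: float, regime: str, calibration: Dict[str, Tuple[float, float]]
) -> float:
    """Standardize a metric against its own regime's baseline only."""
    mean, std = calibration[regime]
    if std == 0:
        raise ValueError(f"no dispersion in calibration sample for {regime}")
    return (value - mean) / std
```

With hypothetical accrual ratios of `[0.0, 0.1, 0.2]` for US-GAAP filers and `[0.2, 0.3, 0.4]` for IFRS filers, a ratio of 0.3 sits at the IFRS mean but well above the US-GAAP one. Scoring it against the wrong baseline would flag an accounting difference as an anomaly.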
One schema, honestly versioned
Both pipelines land in the same place: one fundamentals cache, one scoring store, with the writer of each table strictly separated. Raw fundamentals have one writer, computed scores another, and the two never share a pen. Every scoring formula is versioned in the database itself, so any score ever produced can be traced to the exact formula that produced it. Every bulk write is bracketed by three checks: a CREATE TABLE AS snapshot beforehand, a candidate-set verification query, and a row-count reconciliation afterward. Boring, mechanical, non-negotiable, because the product downstream of this schema is an accusation, and accusations demand receipts.
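The snapshot, verify, reconcile sequence can be sketched end to end. This example uses an in-memory SQLite database purely so that it runs anywhere; the production store described above is PostgreSQL. The `scores` table, its columns, and the `guarded_score_update` helper are hypothetical names, not the real schema.

```python
import sqlite3
from typing import Dict


def guarded_score_update(
    conn: sqlite3.Connection, new_scores: Dict[str, float], formula_version: str
) -> None:
    """Bulk-update scores with a snapshot before, a candidate-set check,
    and a row-count reconciliation after."""
    if not new_scores:
        return
    cur = conn.cursor()

    # 1. Snapshot the table so the pre-write state can be inspected or restored.
    cur.execute("DROP TABLE IF EXISTS scores_snapshot")
    cur.execute("CREATE TABLE scores_snapshot AS SELECT * FROM scores")
    rows_before = cur.execute("SELECT COUNT(*) FROM scores").fetchone()[0]

    # 2. Candidate-set verification: every row we intend to touch must exist.
    ids = list(new_scores)
    marks = ",".join("?" for _ in ids)
    matched = cur.execute(
        f"SELECT COUNT(*) FROM scores WHERE company_id IN ({marks})", ids
    ).fetchone()[0]
    if matched != len(ids):
        raise ValueError(f"expected {len(ids)} candidate rows, found {matched}")

    # 3. The write itself, stamping each score with the formula version.
    cur.executemany(
        "UPDATE scores SET score = ?, formula_version = ? WHERE company_id = ?",
        [(score, formula_version, cid) for cid, score in new_scores.items()],
    )

    # 4. Reconciliation: an update must not change the row count.
    rows_after = cur.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
    if rows_after != rows_before:
        conn.rollback()
        raise RuntimeError(f"row count changed: {rows_before} -> {rows_after}")
    conn.commit()
```

The candidate check runs before anything is modified, so a bad input set fails without touching the table, and the snapshot remains available for comparison after a successful write.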
If wrangling two regulators’ worth of disclosure into mathematics-ready truth sounds like your kind of problem, we’re hiring in Silicon Valley.