Two regulators, one schema: building fundamentals pipelines for the U.S. and Hong Kong

[ /engineering ] · 2026-06-12 · 3 min read

bearsignal.ai — /engineering

Forensic mathematics has a boring dependency: it only works on clean, comparable, point-in-time fundamentals. Benford tests need raw reported digits. Accrual models need balance sheets that reconcile across periods. Every model in our engine assumes the inputs are real, aligned, and honest about what they don’t know. Getting to that state — across two regulatory regimes — is most of the actual work.

The U.S. side: structured, vast, and quietly treacherous

The SEC gives you XBRL: machine-readable, standardized, going back years, across roughly 7,900 active filers and millions of filings. It is a gift. It is also a minefield of dialects — companies extend the taxonomy with custom tags, restate prior periods without ceremony, and report the “same” concept under different elements across years.

Our pipeline treats EDGAR as an append-only source of record and normalizes into a single fundamentals cache keyed by company and period. Two rules carry most of the weight. First: bulk ingestion goes through PostgreSQL’s COPY, never row-by-row inserts — at millions of filings, that’s the difference between an afternoon and a lost week, roughly a 30x throughput gap in our environment. Second: when a value can’t be resolved with confidence, it lands as NULL. Not zero, not a forward-fill, not an industry average. NULL is information; a fabricated value is contamination that no downstream model can detect.
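A minimal sketch of how those two rules can meet in code, assuming Python and PostgreSQL's default COPY text format (tab-delimited, with `\N` as the NULL marker). The function name `to_copy_text` and the table and column names in the trailing comment are hypothetical, not our production code.

```python
import io
from typing import Iterable, Optional, Sequence

Row = Sequence[Optional[object]]


def _escape(value: object) -> str:
    # COPY text format: backslash-escape the backslash itself, the tab
    # delimiter, and line breaks so a field cannot split a row.
    text = str(value)
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_copy_text(rows: Iterable[Row]) -> io.StringIO:
    """Serialize rows for COPY ... FROM STDIN; None becomes \\N (NULL)."""
    buf = io.StringIO()
    for row in rows:
        # An unresolved value stays None and is written as \N.
        # It is never coerced to 0 or to an empty string.
        fields = ["\\N" if v is None else _escape(v) for v in row]
        buf.write("\t".join(fields) + "\n")
    buf.seek(0)
    return buf


# Hypothetical usage with psycopg2 (table and columns are assumptions):
#   cur.copy_expert(
#       "COPY fundamentals (company_id, period, revenue, inventory) FROM STDIN",
#       to_copy_text(rows),
#   )
```

The point of the sketch is that a reported zero and an unresolved value serialize differently (`0` versus `\N`), so the distinction survives the bulk load.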

The Hong Kong side: where the pipeline grows eyes

HKEX has no XBRL equivalent for most of what we need. Disclosure lives in PDF annual reports: thousands of them, in two languages, with financial statements formatted at each issuer’s discretion under IFRS rather than U.S. GAAP.

So the Hong Kong pipeline is a different animal: filing-metadata harvesting at exchange scale, then LLM-driven extraction from the PDFs themselves, pulling structured statements out of documents that were formatted for human eyes. Every extracted figure carries provenance back to its source document. Extraction at this scale is a batch-economics problem as much as an accuracy problem; we run it as supervised batch jobs with human verification of sampled outputs, not as a fire-and-forget script.
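One way to picture "every figure carries provenance" plus sampled verification, as a sketch only: the `ExtractedFigure` fields, the `sample_for_review` helper, and the sampling rate are all illustrative assumptions, not our actual schema or review policy.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class ExtractedFigure:
    issuer_code: str
    line_item: str
    value: Optional[float]  # None when extraction was not confident
    source_document: str    # identifier of the PDF the figure came from
    page: int               # page within that PDF


def sample_for_review(
    figures: Sequence[ExtractedFigure], rate: float, seed: int
) -> List[ExtractedFigure]:
    """Pick a reproducible random subset of a batch for human checking."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not figures:
        return []
    # Always review at least one figure per non-empty batch.
    k = max(1, round(len(figures) * rate))
    return random.Random(seed).sample(list(figures), k)
```

Seeding the sampler per batch keeps the review set reproducible, so a reviewer's findings can be tied back to the exact figures and source pages they saw.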

And then the deeper problem: IFRS and U.S. GAAP don’t just differ in format; accrual structures, fair-value treatment, and revenue recognition differ in substance. A forensic model naively transplanted across regimes will flag legitimate accounting differences as anomalies. Our scoring treats the two regimes as distinct calibration domains rather than pretending one mapping fits both. That adaptation is ongoing work, and we’d rather ship it right than ship it twice.
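As an illustration of what "distinct calibration domains" can mean in the simplest case, the sketch below scores a metric only against peers from the same accounting regime. A plain z-score stands in here for whatever the real scoring uses; the regime labels and the function name are assumptions.

```python
from statistics import mean, pstdev
from typing import Mapping, Optional, Sequence


def zscore_within_regime(
    peers_by_regime: Mapping[str, Sequence[Optional[float]]],
    regime: str,
    value: Optional[float],
) -> Optional[float]:
    """Z-score of value against peers reporting under the same regime."""
    peers = [v for v in peers_by_regime.get(regime, []) if v is not None]
    # No value, or too few peers to calibrate against: return None, not 0.
    if value is None or len(peers) < 2:
        return None
    spread = pstdev(peers)
    if spread == 0:
        return None
    return (value - mean(peers)) / spread
```

The same raw value can look extreme against one regime's peers and unremarkable against the other's, which is exactly the false positive a single shared calibration would produce.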

One schema, honestly versioned

Both pipelines land in the same place: one fundamentals cache, one scoring store, with the writer of each table strictly separated — raw fundamentals have one writer, computed scores another, and the two never share a pen. Every scoring formula is versioned in the database itself, so any score ever produced can be traced to the exact formula that produced it. Before any bulk write: a CREATE TABLE AS snapshot, a candidate-set verification query, and a row-count reconciliation after. Boring, mechanical, non-negotiable — because the product downstream of this schema is an accusation, and accusations demand receipts.
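The snapshot, verify, reconcile sequence can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the `scores` table, its columns, and the function name are hypothetical, and the sketch assumes one row per (company_id, formula_version).

```python
import sqlite3
from typing import Dict


def guarded_score_update(
    conn: sqlite3.Connection, formula_version: str, new_scores: Dict[str, float]
) -> int:
    """Bulk-update scores with a snapshot before and reconciliation after."""
    if not new_scores:
        return 0
    cur = conn.cursor()

    # 1. Snapshot the table as it stands before the write.
    cur.execute("DROP TABLE IF EXISTS scores_snapshot")
    cur.execute("CREATE TABLE scores_snapshot AS SELECT * FROM scores")
    rows_before = cur.execute("SELECT COUNT(*) FROM scores").fetchone()[0]

    # 2. Candidate-set verification: the rows we are about to touch must
    #    be exactly the rows we intend to touch.
    marks = ",".join("?" for _ in new_scores)
    candidates = cur.execute(
        "SELECT COUNT(*) FROM scores "
        f"WHERE formula_version = ? AND company_id IN ({marks})",
        [formula_version, *new_scores],
    ).fetchone()[0]
    if candidates != len(new_scores):
        raise RuntimeError("candidate set does not match the intended write")

    # 3. Write, counting the rows actually changed.
    updated = 0
    for company_id, score in new_scores.items():
        cur.execute(
            "UPDATE scores SET score = ? "
            "WHERE company_id = ? AND formula_version = ?",
            (score, company_id, formula_version),
        )
        updated += cur.rowcount

    # 4. Row-count reconciliation; roll back on any mismatch.
    rows_after = cur.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
    if updated != candidates or rows_after != rows_before:
        conn.rollback()
        raise RuntimeError("row counts do not reconcile; write rolled back")
    conn.commit()
    return updated
```

Because the formula version is part of the write's WHERE clause, a score can only be overwritten by a run that names the formula it was computed under.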

If wrangling two regulators’ worth of disclosure into mathematics-ready truth sounds like your kind of problem — we’re hiring in Silicon Valley.

NULL is a feature: the most boring rule we refuse to break

[ /engineering ] · 2026-06-12 · 2 min read

Every data team has a rule like this somewhere in a style guide, half-enforced: don’t fabricate missing values. Ours is absolute, and it has a name we use in code review, in schema comments, in the documents that govern how we work: None→NULL. If a value cannot be resolved with confidence, it is stored as NULL. Not zero. Not the previous period. Not the sector median. Not an imputation, however clever. NULL.

The argument for imputation is always reasonable. Coverage looks better. Models run on more rows. Dashboards have fewer holes. And in most ML applications, a sector-median fill is genuinely fine — the loss function averages over the error and life goes on.

We are not most applications. Our models exist to detect anomalies in financial statements. An imputed value is, by construction, maximally normal: a sector-median fill sits at the exact center of the distribution it was borrowed from. Backfill a missing inventory number with the sector median and you haven’t just added noise; you’ve injected a synthetic data point that is perfectly calibrated to look innocent, into a system whose entire job is to find things that don’t. Imputation doesn’t degrade a fraud detector. It quietly blinds it, one plausible value at a time.
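The arithmetic behind that claim is short enough to run. The sketch below uses a generic median/MAD deviation score as a stand-in for a real detector (it is not one of our models, and the peer values are made up for illustration): a value imputed from the peer median scores exactly zero.

```python
from statistics import median
from typing import Optional, Sequence


def robust_deviation(
    value: Optional[float], peers: Sequence[float]
) -> Optional[float]:
    """Distance from the peer median, in units of median absolute deviation."""
    # A missing input yields a missing score, never a zero score.
    if value is None or len(peers) < 2:
        return None
    center = median(peers)
    mad = median(abs(p - center) for p in peers)
    if mad == 0:
        return None
    return (value - center) / mad


peers = [10.0, 12.0, 15.0, 40.0, 11.0]   # illustrative values only
imputed = median(peers)                   # the sector-median fill: 12.0

print(robust_deviation(imputed, peers))   # 0.0: indistinguishable from "normal"
print(robust_deviation(40.0, peers))      # 14.0: a genuine outlier stands out
print(robust_deviation(None, peers))      # None: the hole stays visible
```

Any detector built on deviation from the peer group treats the filled row as the least suspicious one in the table, which is the blinding described above.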

The rule costs us. Real coverage numbers are lower than they’d look with imputation, and we report them anyway. Some detectors run on a fraction of the universe because their inputs are genuinely sparse — small-caps with thin market history simply get NULL where richer tickers get a number, and the gap is documented rather than papered over. When a pipeline upstream fails silently, our tables show holes instead of hallucinations, and the holes are what page us.
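A sketch of how honest coverage can be computed and watched, assuming fundamentals are available as a mapping from company to field values. The function names, field names, and the drop threshold are hypothetical.

```python
from typing import Iterable, Mapping, Optional

Fundamentals = Mapping[str, Mapping[str, Optional[float]]]


def detector_coverage(universe: Fundamentals, required_inputs: Iterable[str]) -> float:
    """Share of companies for which every required input is non-NULL."""
    required = list(required_inputs)
    if not universe:
        return 0.0
    runnable = sum(
        1
        for fields in universe.values()
        # A reported 0 counts as present; only None counts as missing.
        if all(fields.get(name) is not None for name in required)
    )
    return runnable / len(universe)


def should_page(previous: float, current: float, max_drop: float = 0.05) -> bool:
    """Flag a sudden coverage drop, e.g. from a silent upstream failure."""
    return (previous - current) > max_drop
```

Reporting the first number as-is, and alerting on the second, is what lets a silent upstream failure show up as a visible hole instead of a plausible fill.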

And that’s the actual payoff: NULL is information. A missing value tells you something — about disclosure quality, about a filer’s history, about your own pipeline — that a fabricated value actively destroys. Sometimes the pattern of what a company doesn’t report is the most interesting feature of all.

The discipline extends past data. The same documents that mandate None→NULL mandate its sibling for analysis: no fabricated conclusions. A model that can’t support a claim outputs nothing, and “we don’t know yet” is a complete sentence in our review meetings. In a business where the product is an accusation, the willingness to say nothing is what makes saying something mean anything.
