The fraud literature works. Almost nobody runs it.

[ /methodology ] · 2026-06-12 · 2 min read

bearsignal.ai — /methodology

Ask a quant about alpha factors and you’ll get a story about decay: published anomalies stop working once the paper is out, arbitraged away by everyone who read it. It’s a good story, and for return-predictive signals it’s largely true.

Forensic signals are different, and the difference is structural. A momentum factor stops working when traders crowd it. A digit-distribution anomaly doesn’t stop appearing in cooked books just because accountants read Benford. Return anomalies decay because trading on them moves prices. Forensic anomalies persist because the behavior that creates them — the mechanics of making fake numbers look real — doesn’t change when detection methods are published. Fraud has invariants. Invented numbers must still articulate: the balance sheet must balance, accruals must reconcile to cash eventually, growth must come from somewhere. Manipulation strains those joints in mathematically recurring ways.

What five decades actually produced

The literature is deeper than outsiders assume. A few landmarks we build on, and what each one measures:

Digit-level forensics. Benford’s Law gives the expected distribution of leading digits in naturally occurring financial data: digit d leads with probability log10(1 + 1/d), so about 30% of figures start with 1 and fewer than 5% start with 9. Humans inventing numbers produce detectably different distributions, a result replicated across decades, jurisdictions, and accounting regimes.
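As a rough illustration of what a digit-level measurement looks like, here is a minimal sketch that compares observed first-digit shares with the Benford expectation and reports the mean absolute deviation. The function names are ours, chosen for illustration. The choice of mean absolute deviation as the summary statistic is an assumption; the text does not say which statistic BearSignal uses.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit shares under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of x, or None for missing, zero, or non-finite input."""
    if x is None or x == 0 or not math.isfinite(x):
        return None
    # Scientific notation always puts the first significant digit first.
    return int(f"{abs(x):.17e}"[0])

def benford_mad(values):
    """Mean absolute deviation between observed and expected first-digit shares.

    Returns None when there is nothing to measure, rather than a guess.
    """
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        return None
    counts = Counter(digits)
    n = len(digits)
    expected = benford_expected()
    return sum(abs(counts.get(d, 0) / n - expected[d]) for d in range(1, 10)) / 9

# Powers of two follow Benford closely; evenly spread three-digit numbers do not.
natural = [2 ** k for k in range(1, 501)]
uniform = list(range(100, 1000))
print(benford_mad(natural), benford_mad(uniform))
```

The output is a continuous deviation, not a flag. How large a deviation matters depends on sample size and on what is normal for the filer's sector.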

Accrual decomposition. The Jones model and its modified successors separate accruals into what business conditions explain and what they don’t. Discretionary accruals — the unexplained residue — are where earnings management lives. Later work on accrual quality asks a harsher question: how well do your accruals map to cash flows you eventually report?
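For readers who want the regressions behind that paragraph, the standard textbook forms are below. This is a summary of the published specifications, not a statement of which variant BearSignal runs. TA is total accruals, A is lagged total assets, ΔREV and ΔREC are changes in revenue and receivables, PPE is gross property, plant and equipment, ΔWC is the change in working capital, and CFO is cash flow from operations (in the accrual-quality regression the variables are conventionally scaled by average total assets).

```latex
% Jones (1991): estimated on a reference sample, typically by industry and year
\frac{TA_{i,t}}{A_{i,t-1}}
  = \alpha_1 \frac{1}{A_{i,t-1}}
  + \alpha_2 \frac{\Delta REV_{i,t}}{A_{i,t-1}}
  + \alpha_3 \frac{PPE_{i,t}}{A_{i,t-1}}
  + \varepsilon_{i,t}

% Modified Jones: receivables growth is removed from the revenue term
% when computing nondiscretionary accruals
NDA_{i,t}
  = \hat{\alpha}_1 \frac{1}{A_{i,t-1}}
  + \hat{\alpha}_2 \frac{\Delta REV_{i,t} - \Delta REC_{i,t}}{A_{i,t-1}}
  + \hat{\alpha}_3 \frac{PPE_{i,t}}{A_{i,t-1}}

% Discretionary accruals: the unexplained residue
DA_{i,t} = \frac{TA_{i,t}}{A_{i,t-1}} - NDA_{i,t}

% Accrual quality (Dechow-Dichev): working-capital accruals regressed on
% past, current, and future operating cash flow; the volatility of the
% residual \eta is the quality measure
\Delta WC_{i,t}
  = \beta_0 + \beta_1 CFO_{i,t-1} + \beta_2 CFO_{i,t} + \beta_3 CFO_{i,t+1} + \eta_{i,t}
```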

Composite misstatement scores. Beneish’s M-Score and Dechow’s F-Score aggregate financial-statement features into probabilities of manipulation, trained on enforcement cases — the rare luxury of labeled fraud data.
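To make "aggregate financial-statement features" concrete, here is a minimal sketch of the eight-variable M-Score using the coefficients published in Beneish (1999). It assumes the eight indices have already been computed from two consecutive years of statements. The function name and the None-handling are our illustration, not BearSignal's code.

```python
def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Eight-variable Beneish M-Score from precomputed indices.

    dsri: days-sales-in-receivables index    gmi:  gross-margin index
    aqi:  asset-quality index                sgi:  sales-growth index
    depi: depreciation index                 sgai: SG&A index
    tata: total accruals to total assets     lvgi: leverage index
    Returns None if any input is missing, rather than imputing a value.
    """
    inputs = (dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi)
    if any(v is None for v in inputs):
        return None
    return (-4.84
            + 0.920 * dsri
            + 0.528 * gmi
            + 0.404 * aqi
            + 0.892 * sgi
            + 0.115 * depi
            - 0.172 * sgai
            + 4.679 * tata
            - 0.327 * lvgi)

# A firm whose ratios did not change year over year (all indices 1, no accruals)
steady = beneish_m_score(1, 1, 1, 1, 1, 1, 0, 1)
print(round(steady, 2))  # -2.48
```

The sketch deliberately stops at the score. Turning it into a statement about a particular company is a separate step that depends on base rates and context.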

Distress structure. Altman, Ohlson, Zmijewski, structural distance-to-default in the Merton tradition: not fraud models, but essential context. Distress is fraud’s weather system — pressure on management is where misreporting incentives concentrate, and a fraud signal means something different in a healthy company than in a drowning one.
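As one example from that family, the Merton-style distance-to-default can be sketched in a few lines. This version takes the asset value, asset volatility, and drift as given. In practice those are unobserved and have to be backed out from equity prices, a step omitted here. Parameter names are ours.

```python
import math

def distance_to_default(asset_value, debt_face, asset_drift, asset_vol, horizon_years=1.0):
    """Merton-style distance-to-default, in standard deviations.

    DD = [ln(V / D) + (mu - sigma^2 / 2) * T] / (sigma * sqrt(T))
    Returns None when the inputs cannot support an honest number.
    """
    values = (asset_value, debt_face, asset_drift, asset_vol, horizon_years)
    if any(v is None for v in values):
        return None
    if asset_value <= 0 or debt_face <= 0 or asset_vol <= 0 or horizon_years <= 0:
        return None
    numerator = (math.log(asset_value / debt_face)
                 + (asset_drift - 0.5 * asset_vol ** 2) * horizon_years)
    return numerator / (asset_vol * math.sqrt(horizon_years))

# Illustrative numbers only: assets equal to debt, 5% drift, 20% volatility.
dd = distance_to_default(100.0, 100.0, 0.05, 0.20)
print(round(dd, 2))  # 0.15
```

More leverage or more volatility pulls the value toward zero, which is the sense in which it is a measurement of pressure rather than a fraud signal.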

So why doesn’t everyone run this?

Because the gap was never knowledge. It’s three other things.

Engineering. These models assume clean, point-in-time, cross-period-consistent fundamentals at full-market scale — which, as we’ve written elsewhere, is most of the work.

Calibration. A paper reports that a score discriminated fraud from non-fraud in a sample ending years ago. Running it today, on your universe, demands knowing base rates, sector structure, and what a given score is worth as evidence now. That worth has to be checked against outcomes, continuously, or the numbers are decoration.
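A small worked example shows why base rates matter so much. The sketch below applies Bayes' rule in odds form: posterior odds equal prior odds times the likelihood ratio of the evidence. The base rates and the likelihood ratio are made-up numbers for illustration, not estimates from any dataset.

```python
def posterior_probability(base_rate, likelihood_ratio):
    """Update a base rate with evidence of a given likelihood ratio.

    likelihood_ratio = P(evidence | misstatement) / P(evidence | no misstatement)
    """
    if not 0 < base_rate < 1 or likelihood_ratio < 0:
        raise ValueError("base_rate must be in (0, 1) and likelihood_ratio >= 0")
    prior_odds = base_rate / (1 - base_rate)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Same evidence (hypothetical likelihood ratio of 5), two hypothetical universes.
low_base = posterior_probability(0.01, 5.0)   # about 0.048
high_base = posterior_probability(0.05, 5.0)  # about 0.208
print(round(low_base, 3), round(high_base, 3))
```

The same score is worth roughly a 5% probability in one universe and roughly 21% in the other. A likelihood ratio estimated on an old sample can also drift, which is why it has to be re-estimated against outcomes.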

Institutional incentive. Academia validates and moves on. Funds that build this keep it private by definition. The sell side is structurally uninterested in finding fraud. The literature sits in journals, correct and unread by machines.

That gap — between what’s published and what’s operational — is the company.

The engine produces clues, never verdicts.

[ /methodology ] · 2026-06-12 · 2 min read

Early in building BearSignal, we made a mistake that took weeks to undo. It felt natural: if three detection models flag the same company, surely the system should mark it as suspicious. So we built it — a voting ensemble, two-out-of-three, a tidy boolean at the end of the pipeline.

It was wrong. Not wrong in implementation — wrong in kind.

Here’s the problem. A Benford deviation is a fact about digit distributions. An abnormal accrual is a fact about the gap between earnings and cash. A distance-to-default is a fact about leverage and volatility. Each is a measurement — continuous, contextual, meaningful only relative to a base rate.

A verdict is none of these things. A verdict is a decision, and decisions require something measurements don’t have: knowledge of consequences. What’s the cost of a false flag? What’s the prior probability of fraud in this sector, this exchange, this year? How much should a flag from one detector discount a silence from another? None of that lives in the mathematics of any single detector. Hard-coding a verdict into the engine means answering those questions implicitly, invisibly, and permanently — with a threshold somebody picked on a Tuesday.

So we tore the verdict out. The rebuilt architecture has a name we use internally: the engine is a feature factory. Every detector produces evidence — a continuous value, its inputs, its context — and nothing else. No thresholds. No votes. No booleans. If a detector can’t produce a number honestly, it produces NULL, never a guess. The discipline sounds trivial; maintaining it under deadline pressure is not.
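The contract described above can be sketched as a small data type. This is an illustration under our own assumptions, not BearSignal's schema: the `Evidence` class, its field names, and the simple accrual-ratio detector are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Evidence:
    """What a detector emits: a measurement plus its provenance, nothing else."""
    detector: str
    value: Optional[float]                        # None plays the role of NULL
    inputs: dict = field(default_factory=dict)    # raw numbers the value came from
    context: dict = field(default_factory=dict)   # e.g. period, sector

def accrual_ratio(net_income, cash_from_ops, total_assets, period):
    """Hypothetical detector: (net income - operating cash flow) / total assets."""
    inputs = {
        "net_income": net_income,
        "cash_from_ops": cash_from_ops,
        "total_assets": total_assets,
    }
    context = {"period": period}
    # If the measurement cannot be made honestly, emit NULL instead of a guess.
    if None in (net_income, cash_from_ops, total_assets) or total_assets <= 0:
        return Evidence("accrual_ratio", None, inputs, context)
    value = (net_income - cash_from_ops) / total_assets
    return Evidence("accrual_ratio", value, inputs, context)

ok = accrual_ratio(120.0, 80.0, 1000.0, "FY2024")
missing = accrual_ratio(120.0, None, 1000.0, "FY2024")
print(ok.value, missing.value)
```

Note what the type leaves out: there is no `is_suspicious` field and no threshold argument, so a detector has nowhere to put a verdict.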

Judgment lives in a separate layer, downstream, with separate update cadence and separate accountability. That layer sees all the evidence at once, knows the base rates, and — critically — learns from outcomes. When judgment improves, the engine doesn’t change. When we add a detector, the judgment layer decides what its evidence is worth. The two evolve independently because they are epistemically different things: one measures the world, the other decides what to do about it.

The payoff shows up everywhere. Detectors are testable in isolation: a measurement either matches the literature’s definition or it doesn’t, no judgment required. Adding the eleventh model family didn’t require retuning the first ten. And when a detector turns out to carry a different sign than the literature suggested (it has happened more than once), the fix is a weight in one layer, not an excavation through the whole pipeline.
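To show why a sign correction stays local, here is a deliberately simple sketch of a downstream judgment layer: a logistic combination of detector values with weights that live outside the detectors. The logistic form, the weights, and the rule that NULL evidence contributes nothing are all our assumptions for illustration. The text does not describe how the real layer is built.

```python
import math
from typing import Mapping, Optional

def judge(evidence: Mapping[str, Optional[float]],
          weights: Mapping[str, float],
          prior_log_odds: float) -> float:
    """Combine detector values into a probability, given weights and a base rate.

    Assumption: a None (NULL) value adds nothing, and a detector without a
    weight is ignored until the judgment layer assigns it one.
    """
    log_odds = prior_log_odds
    for name, value in evidence.items():
        if value is None:
            continue
        log_odds += weights.get(name, 0.0) * value
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical detector outputs; one detector could not produce a value.
evidence = {"accrual_ratio": 0.5, "benford_mad": None}

# If outcomes show the sign was wrong, only the weight changes.
p_before = judge(evidence, {"accrual_ratio": 2.0}, prior_log_odds=0.0)
p_after = judge(evidence, {"accrual_ratio": -2.0}, prior_log_odds=0.0)
print(round(p_before, 3), round(p_after, 3))
```

The detector that produced `accrual_ratio` is untouched by the correction, and its unit tests still pass unchanged.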

If you’ve built ML systems, you’ll recognize the shape: features and models, separated. The difference is that in most ML systems the separation is a convenience. In forensic finance, where a verdict can move capital and a false one can move it wrongly, the separation is the integrity of the product. The factory must never think it’s a judge.
