The Week My Model Was Convinced Ransomware Was Good for Companies

A ransomware attack destroys a company's IT systems. The model predicts revenue will go up. A note on what went wrong, the duct tape that fixed it, and the rebuild that has now replaced most of the duct tape.


For about a week, my impact estimator was confident that ransomware attacks were good for companies. Boycotts were also good. Sanctions were good. Export controls were good. Across most clearly negative geopolitical events, the model's default answer was that the affected company would be slightly better off.

The model was not broken. It was working as designed. The design was the problem.

This is a writeup of how I noticed the bug, how I diagnosed it, what I shipped as a stopgap, and what the actual rebuild looks like. It's a piece about a small failure that pointed at a bigger architectural problem, which I think is the more interesting part of the story.

(Originally written April 2026. Updated in May 2026 to reflect the rebuild that has since been implemented.)

The bug

The impact estimator is one of four models in a geopolitical risk pipeline I'm building. Its job is to take an event description and a company and put a percentage range on the company's revenue move. The other three models classify the event, predict which business channel will be hit, and recommend strategic responses.

I was running case-by-case validation against a fixed list of historically known events. NotPetya destroying FedEx's IT systems was on the list. The expected answer was a clear negative. The actual TNT Express loss was around $400M.

The model said +$750M.

I assumed it was a one-off. It wasn't. The 2021 Lululemon Xinjiang boycott came back positive. The 2022 sanctions on Russian oligarchs, applied to firms with substantial Russian exposure, came back positive. Chip export controls on NVIDIA in 2022, with a known $5B revenue hit, came back small but positive. The pattern was clean: across maybe seventy percent of the events I tested, the model's default direction on a clearly negative event was wrong.

This was not the kind of failure I had expected. I had expected the model to be uncertain or to have wide ranges or to be off by a magnitude. I had not expected it to be confidently wrong about the sign.

The diagnosis

The diagnosis took a day to pin down, and it's a textbook regression-to-the-mean problem, except that the mean the model was regressing to sat slightly above zero.

Figure: predicted vs. actual revenue impact (illustrative). The training data clusters near zero with a slight positive average, so the model's predictions sit on a nearly flat, slightly positive line instead of tracking the perfect-prediction diagonal. FedEx / NotPetya is marked: actual roughly −5%, model said roughly +1%.

The training data was mostly small, near-zero stock and revenue reactions. Most S&P 500 companies don't react to most geopolitical events because most events don't hit most companies directly. Of the companies that do react, the average is slightly positive. Some firms benefit from supply disruptions, sanctions on competitors, and so on.

So the training data had a strong central peak near zero, a slight positive lean in its average, and long thin tails on both sides. A model trained on this data with a standard regression loss learned the optimal hedge: when uncertain, predict close to zero, with a small positive bias. Most of the time that bet was right. The model was rewarded.

The problem was what happened at the tails. When a clearly negative event came in (a ransomware attack, a sanction, a boycott) the model had no architectural reason to break out of its hedge. It had learned that the safe answer was "small positive," and it gave that answer regardless of how negative the input language was.
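
To make the mechanics concrete, here's a toy reproduction of the hedge. The data is synthetic and the regressor is a stand-in, not the actual pipeline; the only point is that when the features can't separate the rare firms that actually get hit from everyone else, a squared-error model settles on the slightly positive global mean no matter how negative the event sounds.

```python
# Toy reproduction, not the real model or data: a regressor trained on outcomes
# that cluster near zero with a slight positive mean learns to hedge toward that
# mean whenever its features can't identify the rare firms that really get hit.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 5000

# A crude "how negative does the event sound" score. In this toy setup it carries
# no information about whether this particular firm is exposed, which is roughly
# the situation the original estimator was in for most event-firm pairs.
negativity = rng.uniform(0.0, 1.0, size=n)

impact = rng.normal(0.5, 2.0, size=n)        # most outcomes: near zero, slight positive lean
hit = rng.random(n) < 0.02                   # a thin tail of firms takes a real hit
impact[hit] -= rng.normal(8.0, 2.0, size=hit.sum())

model = Ridge().fit(negativity.reshape(-1, 1), impact)

# Even a maximally negative-sounding event gets roughly the global mean back:
print(model.predict([[1.0]]))                # small positive -- the hedge
print(impact.mean())                         # also small positive
```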

There's a deeper version of this problem, which I started thinking about more once I noticed the bug. The impact estimator was trying to predict two different things at once: the short-term stock reaction (typically measured over a five-day window around the event) and the longer-term fundamental revenue impact (measured over a quarter or a year). These are fundamentally different quantities. A stock can drop ten percent on news of a sanction even if revenue ends up barely affected, because the market is pricing in tail risk. Conversely, revenue can drop more than the stock if the market underreacts. Mixing these as training targets means the model was learning a noisy average of two different dynamics. The regression-to-mean bias was partly a consequence of that.

That insight is what drove the rebuild.

The duct tape

What I shipped first, in the immediate aftermath of finding the bug, was a rule-based sign override. The rule is exactly as crude as it sounds: if the event description contains clearly negative keywords (sanction, boycott, ransomware, impairment, attack, write-down, and so on, about forty in the current list), and the model predicts a positive number, flip the sign.
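
For concreteness, here is a minimal sketch of the rule. The keyword list is truncated and the names are illustrative, not the production code.

```python
# Minimal sketch of the sign-override rule. The keyword list and function name
# are illustrative; the production list has roughly forty terms.
NEGATIVE_KEYWORDS = {
    "sanction", "boycott", "ransomware", "impairment", "attack", "write-down",
}

def override_sign(event_description: str, predicted_impact: float) -> float:
    """If the event text is clearly negative but the prediction is positive, flip it."""
    text = event_description.lower()
    if predicted_impact > 0 and any(k in text for k in NEGATIVE_KEYWORDS):
        return -predicted_impact
    return predicted_impact

# override_sign("Ransomware attack disrupts global shipping", +1.2) -> -1.2
```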

This patch is not principled. It is what you do when you have an embarrassing failure mode and you need it to stop while you build the real fix. A blanket sign-flip on negative keywords will overcorrect in real cases. A US sanction on Russia, applied to a US software company with no Russia exposure, will trigger the rule and flip the prediction to negative even though the company isn't affected. The rule trades a visible failure (the model says ransomware is good for FedEx) for a less visible one (the model occasionally says a US-only company is hurt by a foreign sanction it has no exposure to).

I think this is the right trade for now. Visible failures destroy trust faster than quiet ones, and the system has to be presentable while the underlying architecture is rebuilt. But the rule was never the fix. It was a guardrail.

The rebuild

The real rebuild does not look like a sign-correction rule. It comes from recognizing that the impact estimator is not one prediction problem but two, and that mixing them was a root cause of the bias. The plan splits the model into two pieces, with different methods appropriate to each.

Figure: architecture comparison. Original: a single impact estimator (XGBoost, mixed targets) producing a percentage range, with the sign bias. Plus duct tape: the same model with the keyword sign-override rule (if negative keywords, flip), sign-corrected but overcorrecting on edge cases. Rebuilt: Model 3A (stock reaction, event study) and Model 3B (revenue impact, causal inference), with Conformal Prediction intervals at 93.8% guaranteed coverage, producing two calibrated predictions. Current production state: the rebuilt models run with the rule still active as a final sign guardrail; the rule comes out once Hierarchical Bayesian estimation is added for the small-sample case.

Model 3A predicts the short-term stock reaction. Target: the five-day cumulative abnormal return, the standard finance measure of how much a stock moved relative to what it would have done given normal market conditions. Method: an event study, which has been the accepted approach in published finance research for decades. This is well-trodden ground. Trained on 1,775 historical event-firm pairs. Mean absolute error: 1.59 percentage points. Direction accuracy: 70%.
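
For readers who haven't run event studies, this is roughly what computing the 3A target looks like. The windows, index handling, and helper name are assumptions for illustration, not the pipeline's actual code.

```python
# Illustrative sketch of the event-study target: the five-day cumulative abnormal
# return under a simple market model. Assumes daily return Series indexed by a
# sorted DatetimeIndex of trading days.
import numpy as np
import pandas as pd

def five_day_car(stock_returns: pd.Series, market_returns: pd.Series,
                 event_date: pd.Timestamp,
                 estimation_days: int = 120, gap_days: int = 10) -> float:
    """Cumulative abnormal return over the event day plus the next four trading days."""
    # Estimation window: trading days well before the event, to fit "normal" behavior.
    pre = stock_returns.index < event_date - pd.Timedelta(days=gap_days)
    est_stock = stock_returns[pre].tail(estimation_days)
    est_market = market_returns.loc[est_stock.index]

    # Market model: r_stock = alpha + beta * r_market + noise.
    beta, alpha = np.polyfit(est_market.values, est_stock.values, deg=1)

    # Event window: abnormal return = actual minus what the market model expected.
    window = stock_returns.loc[event_date:].head(5)
    expected = alpha + beta * market_returns.loc[window.index]
    return float((window - expected).sum())
```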

Model 3B predicts the longer-term fundamental revenue impact. Target: the quarter-over-quarter or year-over-year change in revenue attributable to the event. Method: causal inference, with stock reaction (3A's output) as one input feature but never as the training target. The market's reaction tells you something useful about what's coming, but it isn't what you're predicting. The training set for 3B is currently small (eighteen labeled examples after the temporal split) and the prediction intervals are correspondingly wide. Scaling that training set is the next phase.
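
A minimal sketch of that split, with illustrative column names: 3A's predicted stock reaction goes in as a feature, and the label is the fundamental revenue change attributed to the event.

```python
# Illustrative sketch of the 3A -> 3B handoff: the stock reaction is an input
# feature of the revenue model, never its training target. Column names and the
# model choice are assumptions, not the project's actual code.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def train_revenue_model(events: pd.DataFrame) -> GradientBoostingRegressor:
    # Assumed columns: numeric exposure features plus 3A's predicted five-day CAR.
    feature_cols = ["sector_exposure", "revenue_share_affected", "predicted_car_5d"]
    X = events[feature_cols]
    # Label: the QoQ / YoY revenue change attributed to the event -- the stock
    # move never appears here, because mixing the two targets caused the bias.
    y = events["revenue_change_attributable"]
    return GradientBoostingRegressor(random_state=0).fit(X, y)
```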

Both models now use Conformal Prediction for the uncertainty intervals. This is a method that gives you mathematically guaranteed coverage on your prediction ranges, regardless of how well-calibrated the underlying model is. The original impact estimator claimed eighty percent coverage and actually delivered forty-three percent. The rebuilt version delivers ninety-three point eight percent on the temporal validation set.
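
The mechanics of split conformal prediction are simple enough to show in a few lines. This is a generic sketch assuming a held-out calibration set, not the pipeline's exact implementation.

```python
# Generic split conformal intervals: for a new point from the same distribution,
# the interval covers the true value with probability at least 1 - alpha,
# regardless of how well-calibrated the underlying model is.
import numpy as np

def conformal_interval(model, X_calib, y_calib, X_new, alpha: float = 0.1):
    # Conformity scores: absolute residuals on the held-out calibration set.
    scores = np.abs(np.asarray(y_calib) - model.predict(X_calib))
    n = len(scores)
    # Finite-sample quantile level that makes the coverage guarantee hold.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(scores, min(q_level, 1.0), method="higher")
    preds = model.predict(X_new)
    return preds - q, preds + q
```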

What's still ahead, deferred to the next phase, is Hierarchical Bayesian estimation for the small-sample 3B problem. Hierarchical models are the right tool when you have very few labeled examples and need to pool strength across similar events, similar firms, and similar sectors without forcing them all to share the same parameters. Frequentist methods overfit aggressively at this sample size. The other deferred piece is Difference-in-Differences and Synthetic Control for the causal inference layer of 3B. These are the standard methods in applied economics for measuring the causal effect of a shock when you can't run experiments, and they are the methods that would let this work be defended in academic conversations.
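
For the deferred Difference-in-Differences piece, the core comparison is simple even though doing it defensibly (parallel trends, choosing the control group) is not. A minimal sketch, assuming a simple pre/post panel with illustrative column names:

```python
# Illustrative difference-in-differences estimate: the excess change in the
# affected group relative to a comparable unaffected group over the same period.
# Assumed panel columns: firm, period ("pre"/"post"), affected (bool), revenue.
import pandas as pd

def did_estimate(panel: pd.DataFrame) -> float:
    means = panel.groupby(["affected", "period"])["revenue"].mean()
    change_affected = means[(True, "post")] - means[(True, "pre")]
    change_control = means[(False, "post")] - means[(False, "pre")]
    # The causal effect estimate is how much more the affected group moved.
    return float(change_affected - change_control)
```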

None of the methods named here are invented for this project. Every one of them is the standard tool in its sub-field. The work is in combining them into a single deployable pipeline.

The duct tape rule still runs in production as a final guardrail on top of the rebuilt models. It comes out when 3B has enough data to stand on its own with Hierarchical Bayesian intervals. Until then, redundancy is fine.

What this taught me

The most useful thing I've taken from this isn't about ransomware or about sign correction. It's about how to think about model failures.

The model gave a confidently wrong answer. The first instinct was to improve the loss function, balance the training set, calibrate the output. Treat the failure as a tuning problem. But sitting with the diagnosis longer made it clear that the failure wasn't really about loss functions or sample weights. It was about asking one model to predict two different things, short-term market reactions and long-term revenue impacts, that have different statistical structures and different time horizons. The bias was a symptom of an architectural mistake.

I think this generalizes. When a model fails in obvious-to-humans ways, the failure usually isn't a tuning problem. It's a sign that the model is being asked to do the wrong job. The right fix is rarely a better version of the same model. It's a rethink of what the prediction problem actually is.

That's what the rebuild has been doing. The duct tape was just buying time for it.


If you've worked on impact estimators that have to handle both short-horizon market reactions and longer-horizon fundamental effects, I'd genuinely like to hear how you separated them. Especially what didn't work.