Validation methodology: N-1 temporal holdout
The validation uses N-1 temporal holdout — a methodology designed to test whether a scoring system generates useful predictions from only the data available at the pre-hire stage.
How it works:
- 01. Withhold the most recent completed role
  For each candidate in the validation cohort, the most recent completed role (with a known start date, end date, and tenure length) is removed from the career file. The observed tenure of that role becomes the ground truth.
- 02. Score on prior history only
  Stability Engine runs on the remaining career history, that is, what was visible before that last role started. This mirrors the actual pre-hire information state: the system scores only what would have been available at the moment of the offer.
- 03. Predict 12-month retention
  The system generates a Stability Score and a 12-month retention probability for each candidate, based solely on prior career history.
- 04. Compare prediction to ground truth
  The predicted 12-month retention outcome is compared to the actual tenure of the withheld role. A candidate who scored in the higher-risk bands and departed within 12 months is a correct prediction. A candidate who scored in the lower-risk bands and remained past 12 months is also a correct prediction. The two remaining combinations count as incorrect predictions.
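The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Stability Engine implementation: the `Role` type, the function names, the `>=` treatment of the 12-month boundary, and especially `naive_predictor` are hypothetical stand-ins introduced only to show the holdout mechanics.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple


@dataclass
class Role:
    start: date
    end: date  # completed roles only, so the end date is always known

    @property
    def tenure_months(self) -> float:
        # Approximate months using the average month length in days.
        return (self.end - self.start).days / 30.44


def split_holdout(career: List[Role]) -> Tuple[List[Role], Role]:
    """Steps 01-02: withhold the latest role, keep the rest as prior history."""
    if len(career) < 2:
        raise ValueError("need at least one prior role plus the withheld role")
    ordered = sorted(career, key=lambda r: r.start)
    return ordered[:-1], ordered[-1]


def is_correct_prediction(
    career: List[Role],
    predict_retained: Callable[[List[Role]], bool],
    threshold_months: float = 12.0,
) -> bool:
    """Steps 03-04: predict from prior history, compare with the withheld role."""
    prior, withheld = split_holdout(career)
    predicted = predict_retained(prior)  # the predictor sees prior history only
    actual = withheld.tenure_months >= threshold_months  # ground truth (assumed >=)
    return predicted == actual


def naive_predictor(prior: List[Role]) -> bool:
    # Hypothetical stand-in, NOT the Stability Engine model:
    # predict retention when mean prior tenure is at least 12 months.
    return sum(r.tenure_months for r in prior) / len(prior) >= 12.0


career = [
    Role(date(2018, 1, 1), date(2020, 1, 1)),
    Role(date(2020, 2, 1), date(2022, 6, 1)),
    Role(date(2022, 7, 1), date(2023, 1, 1)),  # withheld role, about 6 months
]
# The stand-in predicts retention, but the withheld role ended early,
# so this candidate counts as an incorrect prediction.
print(is_correct_prediction(career, naive_predictor))
```

The key property is that the predictor never receives the withheld role, so nothing about the outcome being predicted can leak into the score.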
Validation results
| Metric | Result | What it measures |
|---|---|---|
| 12-month accuracy | 84.3% | Correct binary classification at the 12-month retention threshold |
| Score accuracy | 72.5% | Stability Scores within ±15 points of the reference label |
| Mean Cox Brier score | 0.169 | Probabilistic calibration quality (0 = perfect, lower is better) |
| Cohort size | n=51 | Total candidates in the holdout validation cohort |
| Methodology | N-1 temporal holdout | Prior career history only; most recent role withheld as ground truth |
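As a rough illustration of how the first two metrics in the table could be computed, the sketch below defines a binary accuracy and a within-tolerance rate. The function names are hypothetical, and the counts in the final lines (43 and 37 out of 51) are inferred from the rounded percentages rather than quoted from the study.

```python
from typing import Sequence


def classification_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Share of candidates whose predicted 12-month outcome matched the actual one."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(predicted)


def within_tolerance_rate(
    scores: Sequence[float], reference: Sequence[float], tolerance: float = 15.0
) -> float:
    """Share of scores that fall within +/- tolerance points of the reference label."""
    if len(scores) != len(reference) or not scores:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(s - r) <= tolerance for s, r in zip(scores, reference)) / len(scores)


# Assumed counts that reproduce the reported percentages at n=51.
print(round(100 * 43 / 51, 1))  # 84.3
print(round(100 * 37 / 51, 1))  # 72.5
```

At this cohort size each candidate moves either percentage by about two points, which is worth keeping in mind when reading the table.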
What the Stability Score measures
The Stability Score is derived from structural career history signals, not from interview performance, personality assessments, or self-reported preferences. The relevant signals include:
- Prior tenure patterns: how long the candidate stayed across completed roles, and the distribution of that tenure
- Transition density: how quickly the candidate has moved between roles and environments
- History alignment: whether the prior career pattern matches the stability demands of the role being assessed
- Environmental fit signals: whether prior operating environments resemble the current one
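To make the first two signals concrete, here is one way such features might be summarized from a list of completed-role tenures. The definitions below are assumptions chosen for illustration; the source does not specify how Stability Engine computes these signals, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev
from typing import Dict, Sequence


def tenure_pattern(tenures_months: Sequence[float]) -> Dict[str, float]:
    """Summarize length and spread of completed-role tenures (illustrative only)."""
    if not tenures_months:
        raise ValueError("at least one completed role is required")
    return {
        "mean_months": mean(tenures_months),
        "median_months": median(tenures_months),
        "spread_months": pstdev(tenures_months),  # population standard deviation
    }


def transition_density(tenures_months: Sequence[float]) -> float:
    """Role changes per year of worked history; gaps between roles are ignored."""
    years = sum(tenures_months) / 12.0
    if years <= 0:
        raise ValueError("total tenure must be positive")
    return (len(tenures_months) - 1) / years


history = [24.0, 12.0, 36.0]  # hypothetical tenures in months
print(tenure_pattern(history))
print(transition_density(history))
```

History alignment and environmental fit depend on attributes of the target role and employer, so they are not sketched here.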
The score is a directional signal, not a verdict. It does not tell a hiring team to hire or not hire a candidate. It provides a structured basis for calibrating onboarding investment, monitoring cadence, and early intervention — not for replacing the human judgment that belongs in any serious hiring process.
Honest limits of the validation
The validation establishes predictive signal, not certainty. Several important limitations:
- The holdout cohort is n=51. This is a meaningful validation data point but not a large-scale epidemiological study. Additional validation is ongoing as outcome data accumulates.
- The score captures career-history pattern risk, not environmental factors, management quality, or post-hire conditions that also affect retention.
- A high Stability Score does not guarantee retention. A lower score does not mean a hire will fail. Scores express probabilities, not certainties about any individual hire.
- The N-1 methodology tests the system retrospectively, against prior career history only. It does not test prediction performance in real-time, concurrent hiring conditions.
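The effect of the cohort size can be made tangible with a confidence interval on the 12-month accuracy. The sketch below uses the standard Wilson score interval and assumes that 84.3% corresponds to 43 correct classifications out of 51; that count is inferred from the rounded percentage, not taken from the study.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Assumed count: 43 of 51 correct, which rounds to the reported 84.3%.
low, high = wilson_interval(43, 51)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 0.72 to 0.92, which is why the result is best read as evidence of predictive signal rather than a precise accuracy figure.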
Frequently asked questions
What is the N-1 temporal holdout methodology?
The most recent completed role in a candidate's career history is withheld as the ground truth. The scoring system runs on prior career history only — what was visible before that last role started. The prediction is then compared to what actually happened in the withheld role.
What does 84.3% accuracy at 12 months mean?
Stability Engine's classification at the 12-month threshold matched the actual outcome for 84.3% of candidates in the validation cohort. Correct cases include both higher-risk candidates who departed within 12 months and lower-risk candidates who remained past 12 months.
What is a Brier score and what does 0.169 indicate?
The Brier score is the mean squared difference between predicted probabilities and observed outcomes, on a 0-to-1 scale. 0 is perfect; higher is worse. For reference, forecasting a probability of 0.5 for every candidate scores 0.25. A score of 0.169 falls below that reference, indicating that the stated probabilities of early departure track observed departure rates reasonably closely.
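For readers who want to see the calculation, here is a minimal sketch of the standard binary Brier score. It is a simplification: the reported figure is a mean Cox Brier score, which presumably comes from a survival model, whereas this version scores a single set of 12-month probabilities. The probabilities and outcomes below are made-up illustration data, not study data.

```python
from typing import Sequence


def brier_score(probabilities: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(probabilities)


# Hypothetical data: predicted probability of departure within 12 months,
# and whether the candidate actually departed (1) or stayed (0).
predicted = [0.9, 0.2, 0.7, 0.4]
departed = [1, 0, 1, 1]

print(brier_score(predicted, departed))            # about 0.125
print(brier_score([0.5] * len(departed), departed))  # 0.25, the uninformative reference
```

Because errors are squared, a confident wrong forecast is penalized much more heavily than a cautious one, which is what makes the score sensitive to calibration.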
Where can I read the full validation study?
The Ros Holdout Validation Study 2026 is available for download at stabilityengine.ai/audit. It covers methodology, cohort composition, result tables, failure mode transparency, and interpretation guidance.