Introducing Delphi: Formation Bio's Clinical Trial Prediction Model
How we built Formation Bio’s clinical trial prediction model

At Formation Bio, we acquire and develop clinical or IND-stage drug assets. We evaluate candidates across multiple factors, and one of the most important is the probability that a drug will meet its primary endpoint in a clinical trial.
Clinical trial outcomes are one of the strongest signals for whether a drug will ultimately reach patients, but trials themselves are failure-prone – only 10% of drugs that enter clinical development ever reach approval.[1] As AI accelerates drug discovery, the number of candidates competing for limited clinical development resources has continued to increase and will only grow further.
The reasons drugs fail in the clinic are diverse. Early trials may uncover safety concerns, mid-stage trials may struggle with dose selection, and late-stage trials may fail to show efficacy. Outcomes depend on modality, indication, trial design, and operations. Beyond the trial itself, an asset’s trajectory is also shaped by external factors like the regulatory and competitive landscape, which shift over time. The consequences of a late-stage failure are severe – the average cost to develop a drug rose to $2.23 billion in 2024, with the industry’s leading pharma companies spending $7.7 billion on trials that were ultimately terminated.
The fundamental challenge is how to allocate capital and resources, amid this complexity and risk, to the assets most likely to succeed. Better forecasting translates directly into better capital allocation: every low-probability asset identified early is an expensive failure avoided, and the time and resources it would have consumed can be redirected toward candidates more likely to reach patients.
Advancing the right drugs through development requires both the right data and models that can reason across target biology, drug characteristics such as efficacy or safety, and trial-level factors like study design or operations. We built Delphi to bring those together.
What We Built
Delphi predicts the probability that a clinical asset will meet its primary endpoint, supported by a transparent, evidence-backed rationale. It scores both external trials already underway and hypothetical scenarios for internal assets – varying indication selection, study design, and operational strategy to find the strongest path forward. The idea is to learn not only from the drugs we acquire and develop, but from all readouts across the industry.
Delphi uses a multi-agent architecture, with each agent acting as a domain expert gathering evidence across target biology, drug profile, and trial design. Agents also draw on clinical precedent to contextualize their findings. An orchestrator synthesizes agent evidence into scored factors along two axes. On safety, Delphi scores clinical safety data (adverse event profiles and tolerability from the asset and its class) and risk mitigation adequacy (whether the trial's monitoring and design control the known liabilities). On efficacy, it scores mechanism-of-action plausibility, clinical efficacy data, and how the asset stacks up against precedent in the same indication and target space. Each factor carries a point estimate and a confidence interval, and the active factors and their weights shift with development stage: earlier-stage assets lean on mechanism and target-level evidence, later-stage assets on clinical data and competitive benchmarks.
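To make the idea of stage-dependent weights concrete, here is a minimal sketch in Python. It is not Delphi's scoring logic, which this post does not specify: the factor names, the weight values, the "early" and "late" stage labels, and the choice of a weighted average (including the crude averaging of interval bounds) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorScore:
    """One scored factor: a point estimate plus a confidence interval, all in [0, 1]."""
    name: str
    estimate: float
    ci_low: float
    ci_high: float


# Hypothetical weights. Early-stage assets lean on mechanism,
# late-stage assets lean on clinical data.
STAGE_WEIGHTS = {
    "early": {"moa_plausibility": 0.5, "clinical_safety": 0.3, "clinical_efficacy": 0.2},
    "late": {"moa_plausibility": 0.1, "clinical_safety": 0.3, "clinical_efficacy": 0.6},
}


def combine(factors, stage):
    """Weighted average of the factors that are active for the given stage.

    Returns (estimate, ci_low, ci_high). Averaging the interval bounds is a
    simplification for illustration, not a calibrated interval.
    """
    weights = STAGE_WEIGHTS[stage]
    active = [f for f in factors if weights.get(f.name, 0.0) > 0.0]
    total = sum(weights[f.name] for f in active)
    if total == 0.0:
        raise ValueError(f"no active factors for stage {stage!r}")

    def weighted(attr):
        return sum(weights[f.name] * getattr(f, attr) for f in active) / total

    return weighted("estimate"), weighted("ci_low"), weighted("ci_high")


factors = [
    FactorScore("moa_plausibility", 0.80, 0.70, 0.90),
    FactorScore("clinical_safety", 0.60, 0.50, 0.70),
    FactorScore("clinical_efficacy", 0.40, 0.20, 0.60),
]
early = combine(factors, "early")
late = combine(factors, "late")
```

With these made-up numbers, the same three factor scores combine to roughly 0.66 under the early-stage weights and 0.50 under the late-stage weights, which shows how shifting weight from mechanism to clinical data changes the headline score without changing the underlying evidence.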
Implementation
Domain-scoped agents gather evidence in parallel for five key areas: target characterization, target validation, asset profiling, clinical precedent, and human genetics. Genetic evidence draws on large-scale databases including UK Biobank, FinnGen, and All of Us. As agents gather evidence, it is stored in append-only tables, giving us a full audit trail of what the model saw and when.
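The pattern described above, parallel domain agents writing to an append-only store, can be sketched with the Python standard library. This is an illustration under stated assumptions, not Delphi's implementation: the `run_agent` stub, the table layout, and the use of SQLite triggers to reject updates and deletes are all hypothetical choices.

```python
import asyncio
import sqlite3
from datetime import datetime, timezone

# The five evidence areas named in the text.
DOMAINS = [
    "target_characterization",
    "target_validation",
    "asset_profiling",
    "clinical_precedent",
    "human_genetics",
]


def open_evidence_store(path=":memory:"):
    """Create an evidence table whose rows can be inserted but never changed."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS evidence (
            id INTEGER PRIMARY KEY,
            asset TEXT NOT NULL,
            domain TEXT NOT NULL,
            finding TEXT NOT NULL,
            recorded_at TEXT NOT NULL
        );
        CREATE TRIGGER IF NOT EXISTS evidence_no_update
        BEFORE UPDATE ON evidence
        BEGIN SELECT RAISE(ABORT, 'evidence is append-only'); END;
        CREATE TRIGGER IF NOT EXISTS evidence_no_delete
        BEFORE DELETE ON evidence
        BEGIN SELECT RAISE(ABORT, 'evidence is append-only'); END;
        """
    )
    return conn


async def run_agent(domain, asset):
    """Hypothetical stand-in for a domain agent; a real one would query databases."""
    await asyncio.sleep(0)
    return f"placeholder finding for {asset} in {domain}"


async def gather_evidence(conn, asset):
    """Run all domain agents concurrently, then append their findings with a timestamp."""
    findings = await asyncio.gather(*(run_agent(d, asset) for d in DOMAINS))
    recorded_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO evidence (asset, domain, finding, recorded_at) VALUES (?, ?, ?, ?)",
            [(asset, d, f, recorded_at) for d, f in zip(DOMAINS, findings)],
        )
    return len(findings)


conn = open_evidence_store()
rows_written = asyncio.run(gather_evidence(conn, "asset-x"))
```

Enforcing immutability in the storage layer, rather than by convention in application code, is one way to make the audit trail trustworthy: an attempted `UPDATE` or `DELETE` on the table fails regardless of which component issues it.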
Scoring uses a modular prompt composition system. Independently versioned prompt modules are assembled at runtime, covering phase-specific logic, modality-specific profiles, and indication context such as pediatric safety considerations or rare disease evidence constraints. This modularity lets domain experts update one dimension of scoring logic without modifying the rest, and makes every prediction traceable to the exact prompt versions that produced it.
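A small Python sketch can illustrate how runtime assembly of versioned modules might look. The registry keys, module texts, and version strings below are invented for illustration; the post does not describe Delphi's actual module format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptModule:
    """An independently versioned piece of scoring logic."""
    name: str
    version: str
    text: str


# Hypothetical registry; in practice modules would live in version control or a database.
REGISTRY = {
    ("phase", "phase2"): PromptModule(
        "phase/phase2", "1.3.0", "Weigh dose selection and early efficacy signals."
    ),
    ("modality", "monoclonal_antibody"): PromptModule(
        "modality/monoclonal_antibody", "2.0.1", "Consider immunogenicity and class safety."
    ),
    ("indication", "rare_disease"): PromptModule(
        "indication/rare_disease", "0.9.0", "Account for small-sample evidence constraints."
    ),
}


def compose_prompt(phase, modality, indication_context):
    """Assemble the prompt and return it with a manifest of exact module versions."""
    keys = [("phase", phase), ("modality", modality), ("indication", indication_context)]
    modules = [REGISTRY[key] for key in keys]  # KeyError if a module is missing
    prompt = "\n\n".join(m.text for m in modules)
    manifest = {m.name: m.version for m in modules}
    return prompt, manifest


prompt, manifest = compose_prompt("phase2", "monoclonal_antibody", "rare_disease")
```

Storing the returned manifest next to each prediction is what would make a score traceable to the prompt versions that produced it, and updating one module (for example, the rare disease context) leaves the others untouched.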
The compiled prompt is sent to an agentic orchestrator, and the response is validated into a structured schema. We provide an overall probability score with confidence bounds. We also surface the entire reasoning chain: safety and efficacy rationales, key risks, strengths, and critical unknowns. By surfacing critical unknowns alongside the score, Delphi points directly to the additional evidence, such as preclinical studies, biomarker work, or design adjustments, that could de-risk an asset before we commit. This is particularly valuable for assets we are evaluating for acquisition, because it tells us not only whether to pursue an asset but also what work remains to de-risk it.
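Validating a model response into a structured schema could look something like the following Python sketch. The field names mirror the outputs listed above, but the exact schema, the JSON encoding, and the validation rules are assumptions; the post does not say which validation library or format Delphi uses.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical output schema: a score, its bounds, and the supporting reasoning."""
    probability: float
    lower: float
    upper: float
    safety_rationale: str
    efficacy_rationale: str
    key_risks: tuple
    strengths: tuple
    critical_unknowns: tuple


TEXT_FIELDS = ("safety_rationale", "efficacy_rationale")
LIST_FIELDS = ("key_risks", "strengths", "critical_unknowns")


def _number(data, key):
    """Return data[key] as a float, rejecting missing, boolean, or non-numeric values."""
    value = data.get(key)
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{key} must be a number")
    return float(value)


def parse_prediction(raw):
    """Parse a JSON response and reject anything that does not fit the schema."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    p, lo, hi = (_number(data, k) for k in ("probability", "lower", "upper"))
    if not 0.0 <= lo <= p <= hi <= 1.0:
        raise ValueError("need 0 <= lower <= probability <= upper <= 1")
    for key in TEXT_FIELDS:
        if not isinstance(data.get(key), str) or not data[key].strip():
            raise ValueError(f"{key} must be a non-empty string")
    for key in LIST_FIELDS:
        items = data.get(key)
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            raise ValueError(f"{key} must be a list of strings")
    return Prediction(
        p, lo, hi,
        data["safety_rationale"], data["efficacy_rationale"],
        tuple(data["key_risks"]), tuple(data["strengths"]), tuple(data["critical_unknowns"]),
    )
```

Rejecting a malformed response outright, instead of storing a partially filled record, keeps downstream consumers from having to guess whether a missing rationale or an out-of-range probability is meaningful.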
We have optimized for temporal integrity. One of the main criticisms of existing models in the field is "illusory generalizability": when a model has already seen a trial's outcome, it can memorize that result and inflate apparent performance.[2] In Delphi, predictions are stored in immutable, time-stamped records before a clinical trial releases its results; once written, they cannot be updated. As additional information comes in, such as new experiments or similar trial readouts, we generate new time-stamped versions of our predictions and add them to our database, which lets us assess how our scores change over time. When we report performance on resolved trials, we are measuring true forecasting accuracy, not memorization.
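The evaluation discipline described here can be sketched in a few lines of Python. This is an assumption-laden illustration, not Delphi's code: the in-memory log, the record fields, and the use of the Brier score (mean squared error between predicted probability and outcome) as the accuracy metric are choices made for the example, since the post does not name a metric.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PredictionRecord:
    """An immutable, time-stamped prediction; frozen so fields cannot be reassigned."""
    trial_id: str
    probability: float
    created_at: datetime


class PredictionLog:
    """Append-only log: new versions are added, existing records are never edited."""

    def __init__(self):
        self._records = []

    def append(self, trial_id, probability, created_at=None):
        record = PredictionRecord(
            trial_id, probability, created_at or datetime.now(timezone.utc)
        )
        self._records.append(record)
        return record

    def latest_before(self, trial_id, cutoff):
        """Most recent prediction for the trial made strictly before the cutoff."""
        eligible = [
            r for r in self._records
            if r.trial_id == trial_id and r.created_at < cutoff
        ]
        return max(eligible, key=lambda r: r.created_at, default=None)


def brier_score(log, readouts):
    """Score only predictions that predate each readout.

    readouts maps trial_id -> (readout_time, met_primary_endpoint).
    """
    errors = []
    for trial_id, (readout_time, met_endpoint) in readouts.items():
        record = log.latest_before(trial_id, readout_time)
        if record is not None:
            errors.append((record.probability - float(met_endpoint)) ** 2)
    if not errors:
        raise ValueError("no predictions predate any readout")
    return sum(errors) / len(errors)


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


log = PredictionLog()
log.append("trial-a", 0.60, utc(2025, 1, 1))
log.append("trial-a", 0.80, utc(2025, 2, 1))  # updated version, still before readout
log.append("trial-a", 0.99, utc(2025, 4, 1))  # written after readout, so it is ignored
log.append("trial-b", 0.30, utc(2025, 1, 1))
readouts = {"trial-a": (utc(2025, 3, 1), True), "trial-b": (utc(2025, 3, 1), False)}
score = brier_score(log, readouts)
```

In this toy example the 0.99 prediction for "trial-a" is excluded because it was written after the readout, so the score reflects only the 0.80 and 0.30 forecasts. That filter is the point: any prediction that could have seen the outcome never counts toward reported accuracy.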
Below: Example interactive asset card output from Delphi, showing its prediction for Abdakibart, a humanized anti-interleukin-1β monoclonal antibody developed by Avalo Therapeutics. As the card shows, Delphi correctly predicted that Abdakibart’s phase 2 LOTUS trial would meet its primary endpoint in hidradenitis suppurativa, a result that Avalo confirmed in early May. The factor score breakdown highlights the mixed historical context, the clean safety profile heading into the trial, and the plausible mechanism of action.
Learnings & Future State
AI agents scale our diligence capacity. In Delphi, we operationalize labor-intensive parts of diligence, like researching target biology and clinical trial precedent, through our agentic system. This lets us evaluate more assets in greater depth, surface perspectives we might otherwise miss, and reach decisions faster with our counterparties.
Data differentiates from consensus. As Delphi evolves, we will increasingly draw on proprietary and causal-biology views for target-disease linkage, plus real-world data that ground predictions in evidence outside any public training set. Where a frontier model reasons from consensus, Delphi reasons from data. The divergence between the two is often the signal worth acting on.
Scoped agent models improve auditability. Because each agent operates within a defined domain, we can trace exactly what evidence informed each part of the score: which databases were queried, what literature was retrieved, and how that evidence translated into safety or efficacy rationale. This is significantly harder to achieve with monolithic general-purpose models where the reasoning behind a given claim is opaque.
Validation reveals where the reasoning matters most. Delphi allows us to connect reasoning traces to outputs. In one case, Delphi correctly assigned above-base-rate probability to a first-in-class bispecific, identifying strong target biology while flagging the unproven synergy of dual-target blockade. The trial succeeded with a clean safety profile, confirming the model's reasoning across both dimensions. In another, the model surfaced evidence supporting a positive outcome but underweighted it against consensus skepticism. This pattern pointed us toward structured contrarian analysis as a key addition: reasoning not just about the most likely outcome, but about what would need to be true for the minority scenario to occur.
We are now building this into Delphi’s output, alongside expanding the model to incorporate additional factors like competitive landscape and improving its ability to dynamically update as new readouts and institutional knowledge come in.
Conclusion
Delphi expands the capacity of our diligence teams today while laying the foundation for AI-driven portfolios over time. By combining domain-scoped agents with strict temporal integrity and modular scoring, the system addresses two of the field's most persistent challenges: the opacity of predictive models and the risk of inflated performance from data leakage. While no model can eliminate the inherent uncertainty of drug development, Delphi gives decision-makers a structured, auditable foundation for evaluating clinical assets – surfacing a probability along with the evidence and reasoning behind it. We continue to refine the model, with the goal of helping the company allocate resources more effectively so that the most promising therapies reach patients faster.
[1] Sun, D., Gao, W., Hu, H., & Zhou, S. (2022). Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 12(7), 3049–3062.
[2] Chekroud, A. M., Hawrilenko, M., Loho, H., Bondar, J., Gueorguieva, R., Hasan, A., Kambeitz, J., Corlett, P. R., Koutsouleris, N., Krumholz, H. M., Krystal, J. H., & Paulus, M. (2024). Illusory generalizability of clinical prediction models. Science, 383(6679), 164–167.










