Scaling the Impact of Genetics Insights
How Formation Bio built a human-in-the-loop pipeline to make genetic insights operational at scale

Most drug development programs fail; roughly ninety percent of drug candidates that enter clinical trials never reach patients. Around half fail for lack of efficacy, and another third fail for safety concerns [1]. These failures often emerge only after years of development and substantial capital investment.
In response, the pharmaceutical industry has increasingly prioritized genetic evidence in target selection, with most companies routinely considering human genetics in early-stage pipeline decisions. The logic is straightforward: some individuals carry genetic variants that alter the function of specific proteins. When these individuals show lower incidence of a disease, it suggests that a drug mimicking the variant's effect might confer similar protection. Published research supports this intuition: drug mechanisms with human genetic support have a two- to three-fold higher probability of success in clinical development compared to those without [2] [3]. In an industry where the base rate of success is around ten percent, increasing those odds represents an enormous advantage.
The Challenge: Scaling and Operationalizing Genetic Insights
At Formation Bio, we integrate AI-powered genetic analysis into every asset evaluation. By doing so, we accelerate due diligence, uncover new indications, and identify drugs with the highest probability of clinical success. Our analysis addresses three key questions for a given asset. First, does human genetics support the proposed indication(s)? Second, are there safety signals, such as an increased risk for adverse events? Finally, are there missed opportunities, such as indications aligned with known biology that the originating company didn’t pursue? This last question reflects Formation’s “Known in New” strategy of applying validated mechanisms to novel areas of high unmet need.
This analysis relies on large-scale biobanks that collectively hold genomic and deidentified health data on hundreds of thousands of individuals. From the start, we built our process around these resources, aiming to generate actionable insights for every asset we consider adding to our pipeline.
Until recently, our process for generating genetic insights was largely manual. A data scientist would download association tables from each biobank, filter for significant associations, and triage results to extract the most relevant signals. They would then use a large language model (LLM) to synthesize findings across datasets and assemble a slide deck for business development (BD) stakeholders. The process worked, but it took hours to days per asset depending on the complexity of its mechanism of action (MoA) and data volume. As Formation's deal flow increased, however, this turnaround time risked making genetic insights a rate-limiting step rather than a competitive advantage.
The Solution: A Human-in-the-Loop Genetics Pipeline
To address this, we built an automated pipeline capable of transforming basic asset details into a structured genetic insights report in minutes, with human-in-the-loop checkpoints built in where scientific judgment matters most. This way, software handles the volume, while human expertise shapes the output. When our BD team identifies a promising drug asset, we can rapidly determine whether human genetics supports its mechanism.
The pipeline consists of seven stages, each handling a specific step from the manual workflow. Some stages are deterministic, executing identically given the same input, while others use LLMs to handle ambiguity or synthesize findings into a coherent narrative. All LLM-generated outputs require human review before the pipeline proceeds, ensuring scientific judgment determines the final deliverable. In drug development, a misattributed target or inverted interaction direction could lead to flawed investment decisions or overlooked safety signals. LLMs can hallucinate citations, misinterpret domain terminology, or generate plausible but incorrect mechanistic claims. These failure modes demand human validation at those checkpoints.
Example Walkthrough: The Genetics Pipeline in Action
To illustrate the pipeline in action, consider “Asset X”, an anti-IL-18 antibody in development for Crohn's disease. Since IL-18 is a potent pro-inflammatory cytokine, it’s reasonable to hypothesize that blocking its activity could be efficacious in a chronic inflammatory condition like Crohn’s, which affects nearly a million Americans.
To test this hypothesis, we investigate individuals carrying genetic variants that reduce IL-18 function, such as loss-of-function mutations, damaging missense variants, or expression-lowering variants. In effect, their genetics mimic what the drug does. We then assess whether they are:
Protected from Crohn's and related diseases
Protected from other diseases
At increased risk for any adverse conditions
Step 1: Extracting the MoA
First, the pipeline receives a string describing the asset’s MoA: "anti-IL-18 antibody." Using LLM-based parsing, we extract four items:
Target gene: IL-18
Modality: Antibody
Modulation direction: Negative (since the drug is an antagonist antibody)
Confidence score: 100 (indicating high conviction in the extraction)
Here, the input is well-structured, so the extraction presents no issues. But not every input arrives so cleanly formatted; some come from pitch decks or incomplete database entries. When the pipeline encounters ambiguity, it flags uncertainty for human review.
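To make the extraction's output shape concrete, here is a minimal rule-based sketch in Python. The production step uses an LLM rather than rules; this toy parser only mimics the schema for clean inputs like "anti-IL-18 antibody", and `extract_moa`, its cue lists, and the confidence heuristic are illustrative assumptions rather than our actual implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class MoAExtraction:
    target_gene: str
    modality: str
    modulation_direction: str  # "positive" or "negative"
    confidence: int            # 0-100 conviction in the extraction

# Hypothetical cue lists; the real step uses an LLM with web search.
NEGATIVE_CUES = ("anti-", "inhibitor", "antagonist", "blocker")
POSITIVE_CUES = ("agonist", "activator")

def extract_moa(moa: str) -> MoAExtraction:
    """Toy extraction that handles well-structured MoA strings only."""
    text = moa.lower()
    if any(cue in text for cue in NEGATIVE_CUES):
        direction = "negative"
    elif any(cue in text for cue in POSITIVE_CUES):
        direction = "positive"
    else:
        direction = "unknown"
    modality = "antibody" if "antibody" in text else "unknown"
    # Pull the target symbol out of an "anti-<TARGET>" prefix, if present
    match = re.search(r"anti-([A-Za-z0-9-]+?)(?:\s|$)", moa)
    target = match.group(1).upper() if match else "UNKNOWN"
    confident = "unknown" not in (direction, modality) and target != "UNKNOWN"
    return MoAExtraction(target, modality, direction, 100 if confident else 50)
```

A messier input ("novel cytokine modulator", say) would leave most fields "unknown" and a low confidence score, which is exactly the case routed to human review.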
Step 2: Validating the MoA (Human-in-the-Loop)
Since Step 1 relies on LLM parsing, we introduce a human checkpoint to verify the extraction. In internal testing, only 43% of MoA extractions were fully correct, and the model struggled particularly with multi-target assets. A scientist reviews the output: target (IL-18), modality (antibody), and modulation direction (negative). They confirm the mapping is correct and that subsequent analysis should focus on variants that mimic the drug’s effect.
If the extraction were incorrect (for instance, confusing IL-18 with IL-18BP, the IL-18 binding protein), the scientist would catch it here. This checkpoint catches foundational errors before they affect downstream steps.
Step 3: Querying genetic databases
With the target gene confirmed, the pipeline focuses on individuals whose genetics mimic the drug’s effect (i.e., those carrying variants predicted to reduce IL-18 signaling). To identify relevant outcomes in these individuals, we scan across thousands of phenotypes for statistical associations with the target gene. The pipeline queries three complementary biobanks:
UK Biobank: ~500,000 participants, predominantly European, with deep phenotyping including imaging, biomarkers, and linked health records across thousands of traits.
FinnGen: ~500,000 Finnish participants linked to national health registries. Finland's genetic homogeneity and founder effects make low-frequency disease variants easier to detect than in more heterogeneous populations.
All of Us: ~300,000 participants, with roughly half from groups historically underrepresented in biomedical research. This diversity captures signals that may be missed in European-majority cohorts.
The initial output contains thousands of associations, most of which are statistical noise. We filter aggressively, requiring both statistical significance and meaningful effect size. What remains are the signals most likely to reflect real biology.
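The filtering step can be sketched as a simple two-condition filter. The thresholds below are illustrative, not the pipeline's actual cutoffs: 5e-8 is the conventional genome-wide significance level, and the effect-size floor is a hypothetical value.

```python
from dataclasses import dataclass

@dataclass
class Association:
    phenotype: str
    beta: float       # effect size on the phenotype
    p_value: float
    biobank: str

# Illustrative cutoffs only; the pipeline's real thresholds are internal.
P_THRESHOLD = 5e-8    # conventional genome-wide significance
MIN_ABS_BETA = 0.5    # hypothetical effect-size floor

def filter_associations(assocs: list[Association]) -> list[Association]:
    """Keep signals that are both statistically significant and large
    enough in effect to plausibly reflect real biology."""
    return [a for a in assocs
            if a.p_value < P_THRESHOLD and abs(a.beta) >= MIN_ABS_BETA]
```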
Step 4: Generating the narrative
The filtering yields associations that still require interpretation, but expert review of each association doesn’t scale. An LLM synthesizes the data into a structured narrative, grouping findings by therapeutic area and distinguishing between signals that suggest new opportunities, warrant caution, or validate the existing strategy.
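One way to make the LLM's job easier is to pre-bucket the filtered signals by therapeutic area before prompting, so synthesis happens within coherent groups. A minimal sketch, assuming a hypothetical `THERAPEUTIC_AREA` lookup (the real mapping tables cover thousands of traits):

```python
from collections import defaultdict

# Hypothetical phenotype-to-area lookup for illustration only.
THERAPEUTIC_AREA = {
    "Renal failure": "Nephrology",
    "Chronic kidney disease": "Nephrology",
    "Arthrosis": "Rheumatology",
    "Chronic hepatitis": "Hepatology",
}

def group_by_area(phenotypes: list[str]) -> dict[str, list[str]]:
    """Bucket phenotypes by therapeutic area; unmapped ones go to 'Other'."""
    groups = defaultdict(list)
    for p in phenotypes:
        groups[THERAPEUTIC_AREA.get(p, "Other")].append(p)
    return dict(groups)
```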
Step 5: Reviewing the narrative (Human-in-the-Loop)
Since Step 4 relies on LLM synthesis, which can hallucinate claims or misgroup findings, we introduce a second human checkpoint. A scientist reviews the output, which includes asset context, extracted associations, and a full narrative with claims and therapeutic area groupings.
The reviewer confirms that:
The groupings are logical (e.g., “Renal failure” and “Chronic kidney disease” belong together)
Overlapping categories are merged rather than duplicated
Claims are grounded in the filtered data rather than hallucinated. If the narrative mischaracterized a signal (for instance, overstating a weak association or miscategorizing a phenotype), the scientist would flag it here.
This checkpoint ensures the final output reflects what the genetic evidence actually supports.
Step 6: Delivering the report
The approved narrative is packaged into two deliverables: a presentation-ready HTML report summarizing the genetic evidence, and an Excel file with the underlying data for verification.
The report for Asset X reveals two themes:
The first is encouraging: individuals carrying variants that mimic the drug’s effect show lower risk of several disease categories, including kidney and renal diseases (renal failure: β = -2.03; chronic kidney disease: β = -1.62) and degenerative joint conditions (arthrosis: β = -1.09 to -1.74). Notably, no direct Crohn's disease signal was observed.
The second is cautionary: the same variants are associated with increased risk for conditions including kidney stones (calculus of kidney: β = 3.54 to 3.70), liver disease (chronic hepatitis: β = 3.83 to 3.85; broader liver disease: β = 2.41 to 3.85, relevant given IBD-liver comorbidity), and certain malignancies. These associations flag potential adverse effects to monitor in clinical development.
Producing this output manually would take hours; the pipeline generates it in minutes. Human input totals roughly ten minutes across two checkpoints, preserving scientific judgment for where it matters most. A BD team member can now generate genetic insights for an entire quarterly pipeline in the time it once took to evaluate a handful of candidates.
Challenges and learnings
Building this pipeline surfaced three engineering challenges.
MoA extraction. The pipeline takes in a string describing the asset’s MoA. Some are precise (e.g., "anti-IL-18 antibody"), while others are vague or incomplete. From this heterogeneous input, the system must extract the target gene symbol, modulation type, and direction of effect. We used Claude Sonnet 4.5 with web search enabled for this extraction. It worked reliably for well-documented assets, but for assets still in preclinical testing or licensed from overseas partners, public information was sparse and performance suffered. In response, we introduced our first human checkpoint.
Phenotype harmonization. The same condition can appear under different names across datasets. For example, "Type 2 diabetes mellitus", "T2DM", "non-insulin-dependent diabetes", and ICD code E11 (and related subcodes) all refer to overlapping patient populations. We built mapping tables and deduplication logic to normalize phenotypes into consistent categories so signals can be compared across biobanks.
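The harmonization logic reduces to a normalization lookup. A minimal sketch for the diabetes example above, where `CANONICAL` stands in for our much larger internal mapping tables:

```python
# Minimal sketch of phenotype normalization; real mapping tables are far larger.
CANONICAL = {
    "type 2 diabetes mellitus": "Type 2 diabetes",
    "t2dm": "Type 2 diabetes",
    "non-insulin-dependent diabetes": "Type 2 diabetes",
    "e11": "Type 2 diabetes",  # ICD-10 code
}

def normalize_phenotype(name: str) -> str:
    """Map a raw phenotype label to a canonical category when known."""
    key = name.strip().lower()
    # Collapse ICD subcodes (e.g., E11.9) to their parent code
    if key.startswith("e11"):
        key = "e11"
    return CANONICAL.get(key, name)
```

With labels normalized this way, a signal reported as "T2DM" in one biobank and as ICD code E11 in another lands in the same category and can be compared directly.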
Direction of effect. We collapse MoA modulation terms (inhibitor, antagonist, agonist, activator) into two categories: positive or negative. This matters because a genetic signal can have different implications depending on the drug’s MoA. An association with a negative effect size suggests an opportunity for an inhibitor or antagonist but a safety concern for an activator. The pipeline combines effect direction from genetic data with modulation direction from the MoA to interpret each signal correctly.
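The sign logic can be made concrete. A minimal sketch, assuming the queried variants reduce target function (so a negative beta means carriers are protected); `interpret_signal` is a hypothetical helper, not our production code:

```python
def interpret_signal(beta: float, modulation: str) -> str:
    """Combine genetic effect direction with drug modulation direction.

    Assumes variants that reduce target function. A protective association
    (negative beta) supports a negative modulator (inhibitor/antagonist)
    but flags a concern for a positive modulator, and vice versa.
    """
    protective = beta < 0
    negative_modulator = modulation == "negative"
    if protective == negative_modulator:
        return "opportunity"     # drug pushes in the protective direction
    return "safety_concern"      # drug pushes toward increased risk
```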
Capturing expert feedback for model training
These human checkpoints serve a second purpose: generating training data. When a scientist reviews target extraction, they see the model's output alongside the inputs it relied on: the original MoA description, supporting context from web search, a confidence score, and a link to the asset’s CRM entry from BD. For narrative review, they see the generated summary next to the underlying data, including specific phenotypes, effect sizes, and source biobanks. They can approve with a click or correct with a few keystrokes.
We capture both the original output and human feedback, creating paired examples of what the model produced versus the expert’s correction. Today, these inform our prompt refinement. Over time, we plan to use them for fine-tuning the extraction model and optimizing narrative generation via reinforcement learning.
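A paired feedback record might be captured as follows. `ReviewRecord` and its fields are illustrative of the idea, not our actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ReviewRecord:
    stage: str            # e.g., "moa_extraction" or "narrative"
    model_output: dict    # what the model produced
    expert_output: dict   # what the reviewer approved or corrected
    approved_as_is: bool  # True if the reviewer made no changes

def to_training_example(rec: ReviewRecord) -> str:
    """Serialize a review as a (model output, expert correction) pair,
    suitable for prompt refinement or future fine-tuning."""
    return json.dumps(asdict(rec))
```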
ARK Integration with Snowflake
Narratives, human feedback, and intermediate data are all written to Snowflake and become queryable through ARK [4], Formation’s conversational interface to internal knowledge. A BD team member can ask, "What are the genetic insights for Asset X?" and receive a synthesized answer drawing on relevant prior analyses.
By storing this data in Snowflake, we make it available for downstream exploration beyond the final report. A user might filter phenotypes by therapeutic area and ask follow-up questions using ARK, or pull genetic insights from a target analyzed months ago for a new evaluation. The pipeline runs once, but the data remains available for whatever questions come up next.
What’s Next: Toward Fully Self-Improving Pipelines
Our pipeline delivers two kinds of value: speed and improved decision-making.
Speed is the most visible gain. What once took hours to days now happens in minutes of compute time plus roughly ten minutes of human review. With this pipeline in place, the BD team triggers runs directly from our internal systems.
The deeper gain is consistency, which improves decision-making. Every asset, no matter how complex, passes through identical filtering logic, narrative structure, and quality checks. Multi-target assets like bispecific and trispecific antibodies are processed automatically, with separate profiles generated for each target. Manual analysis inevitably varied across scientists, with different thresholds, phenotype groupings, and interpretive frameworks. The pipeline eliminates that variability: when a requestor sees a safety signal flagged, they know exactly what statistical criteria produced it. This consistency improves our odds of selecting the drugs with the highest probability of clinical success.
For now, we've built something that works fast enough to keep pace with our asset deal flow, is accurate enough to trust, and is transparent enough to audit. The current pipeline is a foundation, not the final form.
Our long-term vision extends beyond operational efficiency to deeper genetic insight. Our current implementation considers only binary outcomes, such as whether someone was diagnosed with a disease. The biobanks we query also capture continuous measurements, including protein and metabolite levels. Incorporating these features could clarify how variants affect protein function and provide mechanistic insight into how a genetic association translates to disease progression.
We currently rely on each biobank's existing variant consequence annotations to determine which variants are loss-of-function versus functionally neutral. Emerging genetics foundation models may let us interpret variant consequences more precisely.
On the operational side, as human feedback accumulates, we can fine-tune the underlying models. This moves us toward a pipeline where human review becomes unnecessary in most cases, because the models will have learned to replicate expert judgment.
The goal we started with remains: to improve the probability that drugs succeed in the clinic, bringing better medicines to patients faster and more efficiently. In human genetics data, we have one of the clearest predictive signals available, and we’re building the infrastructure to use it at scale.






