Building Durable AI Workflows for Indication Landscaping

How Formation Bio is designing a resilient research engine using durable execution



To evaluate potential acquisitions for our portfolio, Formation Bio built an intelligence platform that compares drug assets in development against current and future standards of care. An in-depth explanation of the motivations behind this work can be found in Part I of our blog series on indication landscaping. In this post, Part II, we cover the technical choices we made in designing, developing, and launching this platform.

Pieces of the architecture

As outlined in Part I, building the platform involves two pieces of work:

  • Data gathering, which involves assembling the raw data, including a list of relevant drugs, drug labels, clinical trials with patient populations, baseline characteristics, and eventual published outcomes.

  • Analytics and data visualization, where the raw data is normalized, standardized, and visualized for broader consumption and quick insights.

Here, we focus specifically on how we built the first piece: the LLM-guided data gathering architecture for our tool.

While research processes vary by indication and individual workflows tend to be highly iterative, these are the general steps our analysts take to gather data during indication landscaping:

When evaluating a marketed drug, analysts typically start with a drug product’s website for healthcare professionals, then move to the FDA product label, which often contains safety and efficacy information drawn from pivotal trials. If the drug label or product website lacks granular outcomes or safety data, analysts proceed to search through literature, then through clinical trial registries, all while cataloging which data appear where and where gaps remain.

Some data sources carry more weight than others – for example, peer-reviewed literature is generally favored over clinical trial registry entries (such as https://clinicaltrials.gov/) because it is more current, contextualized, and vetted. Data gathering must be indication-specific, so that trial outcomes can be interpreted within the correct clinical frame.
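The tiered search described above can be sketched as a simple fallback loop. The source names, field names, and the `lookup` function below are illustrative stand-ins, not our actual retrieval code:

```python
# Sketch of the analyst's tiered search: try each source in order of
# preference, recording where each requested field was found and which
# fields remain unfilled. Source and field names are illustrative.

SOURCE_PRIORITY = [
    "product_website",   # HCP-facing product site
    "fda_label",         # FDA label (safety/efficacy from pivotal trials)
    "literature",        # peer-reviewed publications
    "trial_registry",    # e.g. clinicaltrials.gov entries
]

def gather(fields, lookup):
    """Walk the source hierarchy, cataloging where each field was found.

    `lookup(source, field)` is a hypothetical stand-in for the real
    retrieval step; it returns a value or None.
    """
    found, gaps = {}, set(fields)
    for source in SOURCE_PRIORITY:
        for field in list(gaps):
            value = lookup(source, field)
            if value is not None:
                found[field] = (source, value)
                gaps.discard(field)
        if not gaps:
            break
    return found, gaps
```

The returned `gaps` set is exactly the "where gaps remain" catalog the analysts keep by hand.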

LLM-based deep research has significantly accelerated data gathering for our internal teams, but we observed persistent shortcomings:

  • Even with precise instructions, LLM deep-research tools often retrieve an overly broad set of sources while searching for empirical data, such as loosely related studies, adjacent indications, or secondary analyses that do not directly answer the empirical question. They also overlook subtle but important cues, such as pooled or redacted analyses presented in place of pivotal trial results, or conflate similar-sounding but distinct indications or drugs.

  • For a structured, hierarchical workflow like the one outlined above, deep research tools may ignore the steps and hierarchies specified in the prompt.

  • In the initial stages of research, long-running deep research tools effectively refine goals through a series of clarifying questions, but ultimately fail to involve analysts at key decision points. To iterate on the output, analysts may tweak the prompt and start over, but this erases any potential efficiency gains, since prompt iteration is slow, tedious, and expensive.

  • Even accurate deep research outputs still require substantial manual parsing and organization to connect with the analyst’s broader understanding and contextual knowledge.

Conceptualizing the workflow

To address these gaps, we needed an automated data ingestion workflow that integrated LLMs and humans throughout. There are several mental models for conceptualizing data pipeline workflows, each with distinct tradeoffs:

DAG (directed acyclic graph) ETL workflows model data pipelines as a one-directional sequence of steps, but they do not support workflows that are inherently iterative and cyclical, and they offer no natural way to incorporate human feedback. The graph-based approach extends DAG pipelines by accommodating cycles and iteration, exemplified by frameworks like PydanticAI's Pydantic Graph. Graphs are versatile, able to model and visualize most real-world workflows, but that benefit quickly diminishes as workflows grow more complex.

Alternatively, the workflow-as-code model avoids treating workflows as static nodes and edges altogether. Instead of defining pipelines through configuration files or visual diagrams, developers express workflow logic directly in code. In this approach, complexity is assumed and maintained via code organization. However, this flexibility comes at the cost of visualization provided by DAG and graph models.
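The contrast is easiest to see in a sketch: in workflow-as-code, an iterate-until-approved cycle is just a `while` loop, which a static DAG cannot express. Here `search` and `review` are hypothetical stand-ins for an LLM search step and a human review step:

```python
def run_landscape(indication, search, review):
    """Workflow-as-code sketch: control flow lives in ordinary code.

    `search(indication, feedback)` and `review(results)` are hypothetical
    stand-ins; `review` returns a tuple of (approved, feedback).
    """
    feedback = None
    while True:                       # a cycle a static DAG cannot model
        results = search(indication, feedback)
        approved, feedback = review(results)
        if approved:
            return results
```

The loop, the feedback variable, and the exit condition are all ordinary code; the tradeoff is that nothing here renders as a diagram for free.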

Building the workflow via durable execution

A conventional approach to building the indication landscaping platform would have been to model the workflow as a graph, in which some tasks would fetch data from internal sources, while other tasks would leverage LLMs with web search to fetch off-label drugs. Further downstream, another LLM task would search for trial outcomes across drug labels and publications. At critical points, there would be dedicated tasks to pause and seek human feedback.

A simplified view of the landscaping workflow depicted as a graph. The workflow starts with the user's indication input.

However, the inherent complexity of indication research and drug development raised several concerns about this approach:

  • Can a statically defined graph capture the inherently iterative and looping nature of the workflow?

  • Each indication introduces subtleties that may require a tweak in the workflow. Can we evolve the workflow to accommodate indication-specific nuances without breaking existing executions or creating unmanageable complexity?

  • We want our analysts tightly integrated in the workflow, guiding the LLM’s research at several stages. How can we embed user touchpoints and feedback without introducing excessive state management in our workflow graphs?

  • Given the cost of LLM-driven steps, how do we ensure that failures don’t require restarting the workflow from scratch?

In principle, many of these challenges could be addressed with additional task orchestration and state management layered onto a graph model. However, doing so would push complexity into the graph itself – as execution paths grow, the workflow would progressively become harder to reason about, evolve, and debug. This puts the workflow at risk of becoming a fragile, stateful system where resilience and flexibility are afterthoughts rather than core design principles. In a domain like indication research, where workflows are iterative and analyst judgment frequently alters the path forward, this brittleness becomes especially problematic.

These constraints forced us to think beyond the traditional graph model and pointed us toward durable execution – a workflow-as-code pattern that models workflows as deterministic code execution. Tasks in the workflow reliably produce the same output given the same input, making them predictable and retriable, with automatic handling for a variety of errors and the ability to roll back to any pre-defined state in case of failure.

A durable execution engine also allowed us to express the workflow directly in code, including the loops and parallel stages, in a way that feels closer to writing a script than configuring a static pipeline. Temporal, our engine of choice, gave us a clean way to break the workflow into discrete units of work, called activities, depicted in our workflow graph. As we expand the codebase to support more indications, we can introduce new steps by adding new activities or swapping implementations, without risking breakage in existing or in-flight workflows. Temporal also natively supports pausing workflows for user input and updating state based on feedback using workflow messaging, allowing us to embed human judgment at critical points in the research workflow.

Finally, the central concept of “durability” in durable execution was crucial for ensuring that the landscaping workflow proceeds through errors and user feedback. Any workflow can be translated to code; what durability ensures is that the crucial units of work (tasks, or “activities” in Temporal) can be reset and re-run any number of times, produce the same result, and remain robust to error conditions. Our workflow gathers data from internal databases, for instance, to compile the initial drug list for an indication. Later stages execute code to pull data from online sources via LLM web searches. Any of these steps may fail or time out, but durable execution allows us to recover, either in code or manually, to any prior point in the workflow without starting over.
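The replay behavior that durability provides can be illustrated with a toy model (Temporal's actual event-history replay is far more sophisticated than this): completed activities are persisted to a log, so resuming after a failure replays their recorded results instead of re-executing expensive steps:

```python
class DurableRun:
    """Toy durable-execution runner, for illustration only.

    Results of completed activities are persisted in `history`, so a
    crashed run can be resumed and completed steps (e.g. expensive LLM
    calls) are replayed from the log rather than re-executed.
    """

    def __init__(self, history=None):
        self.history = history if history is not None else {}

    def activity(self, key, fn, *args):
        if key in self.history:          # replay: skip re-execution
            return self.history[key]
        result = fn(*args)               # execute once, then persist
        self.history[key] = result
        return result
```

Resuming a run simply means constructing a new runner over the same persisted log; every activity that already completed returns its recorded result.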

Beyond the technical benefits, durable execution also supports our AI-native approach, in which corporate memory compounds rather than goes to waste. Every task is durably persisted and auditable by default, without the need to build separate audit infrastructure. We can compare workflow states across time and across individuals, accumulating institutional knowledge of how we understand indications, drugs, and endpoints.

With durable execution at the core, our indication landscaper could now be expressed simply as code:

Snippet of the indication landscaping workflow implemented as code

Putting LLMs to work

With the workflow architecture in place, we had a clear framework for how to integrate LLMs in the data pipeline. We identified several use cases for which LLM automation would be most valuable:

  • Parsing unstructured data sources. Extracting data that does not exist in structured internal databases, such as evidence of off-label use described in publications or clinical guidance.

  • Transcribing large volumes of data. Extracting efficacy and safety data from tables, figures, and abstracts that would otherwise require time-consuming manual entry; the work is tedious, yet it still demands scientific expertise.

  • Gathering data from the most up-to-date sources. Identifying recent trial readouts and pipeline updates, where press releases and other web sources often precede formal publications. The addition of web search to LLMs has been particularly valuable here.

  • Harmonizing and normalizing qualitative data. Normalizing and deduplicating semantically similar data points, such as outcome measures that may appear with slight variations across different drug labels and publications. For example, it is quite common to find two sources that report lung function as measured by FVC (Forced Vital Capacity) in slightly different ways (e.g., rate of change in FVC vs. rate of decline in FVC), even though both are comparable metrics. LLMs do a reasonable job semantically normalizing and deduplicating these outcome measures, but human feedback is still needed to recognize subtle but meaningful differences.
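As a sketch of the harmonization step, a cheap rule-based canonicalization pass can catch the easy variants before (or alongside) an LLM dedup step. The alias table and regex below are illustrative examples, not our production mapping:

```python
import re

# Illustrative canonicalization: lowercase, strip directional wording
# like "rate of change/decline in", then map known aliases to one
# canonical endpoint name. The alias table is a made-up example.
ALIASES = {
    "fvc": "FVC (Forced Vital Capacity)",
    "forced vital capacity": "FVC (Forced Vital Capacity)",
}

def canonical_endpoint(name):
    text = name.lower()
    text = re.sub(r"\brate of (change|decline) in\b", "", text).strip()
    return ALIASES.get(text, name)
```

Anything the rules cannot collapse is left untouched for the LLM pass and, ultimately, for human review of the subtle-but-meaningful differences.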

With these use cases defined, we integrated LLM-driven steps directly into the workflow. The durable execution engine allowed us to embed these tasks at appropriate points while preserving tight human oversight. At critical stages, the workflow can pause for analyst input, enabling reviewers to edit structured outputs or rerun LLM steps with targeted feedback, particularly when:

  • Users need to indicate relevant drugs for an indication. Our analysts have the expertise to quickly weed out drugs or mechanisms of action that may not be relevant to the landscaping of an indication.

  • Users need to decide which endpoints are most relevant to gather efficacy data for drugs. This is highly indication-specific and at times subjective, and can drastically alter how one understands the efficacy of drugs in an indication.

Evaluations

Because these outputs ultimately inform business decisions, rigorous evaluation of data quality produced by the workflow was critical – not only to build confidence in LLM-assisted data gathering, but also to guide our design choices around how and where LLMs should be integrated into the landscaping process.

We designed two forms of evaluations – one focused on accuracy (precision and recall) and one focused on traceability (how well the LLM can point us to where it found the data). The second becomes important for issue detection and verification: the more granular an LLM’s citation, the easier it is for users to trace where a piece of data came from and verify its validity.
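The accuracy side reduces to set comparisons between LLM-extracted data points and an analyst-curated gold set. A minimal sketch, assuming each data point can be represented as a hashable tuple:

```python
def precision_recall(extracted, gold):
    """Precision/recall over sets of extracted data points, where each
    point is a hashable tuple such as (drug, endpoint, value)."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```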

As we evaluated different integration patterns, a central tradeoff emerged around LLM autonomy. We had scoped the LLM’s role to two areas: finding and structuring standard of care for a given indication, and finding detailed trial outcomes from internal and external sources, both requiring an LLM agent with web search capability. Within that scope, the level of the agent’s autonomy exists on a spectrum. On one end of the spectrum, the LLM can be given full autonomy to search, read sources, pick relevant sources and return a structured output. On the other end, the search and extraction workflow is tightly orchestrated in code, with the LLM invoked only at specific points in parsing and structuring data. Our goal was to use these evaluations to determine where along this spectrum performance is strongest. The evaluation results warrant a deeper dive, but here are some initial observations:

  • High extraction accuracy: LLMs performed well when extracting metrics from a specific source (e.g., research papers, press releases, FDA labels), even when data was embedded deeply in tables or figures.

  • Errors tracked with study complexity: Mistakes were more common in complex trial designs, such as pooled cohorts or studies with patients transitioning between arms.

  • Prompt and schema sensitivity: Subtle prompt wording and schema field descriptions significantly affected results, especially for similar-sounding but distinct fields (e.g., baseline vs. analyzed patient populations).

  • Autonomy tradeoffs: Initial evaluations favored autonomous web search-enabled LLMs, but multi-step agents with external search tools improved steadily as search quality improved, especially with models that strictly followed URL-fetching instructions.

  • Evaluation as unit testing: Treating evaluation like unit testing proved valuable – isolating specific output data points helped pinpoint what was working and what wasn't.
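The two ends of the autonomy spectrum described above can be sketched as follows; `llm_agent`, `web_search`, and `llm_extract` are hypothetical stand-ins, not real API calls:

```python
def fully_autonomous(indication, llm_agent):
    """One end of the spectrum: the agent searches, reads, selects
    sources, and returns structured output in a single autonomous call."""
    return llm_agent(
        f"Find trial outcomes for {indication}; search the web, "
        "pick relevant sources, and return structured data."
    )

def tightly_orchestrated(indication, web_search, llm_extract):
    """Other end: code owns search and source selection; the LLM is
    invoked only to parse and structure each chosen source."""
    sources = web_search(f"{indication} pivotal trial outcomes")
    return [llm_extract(src) for src in sources[:3]]  # code-side source cap
```

In the orchestrated variant, source selection, ordering, and caps are testable code paths, which is what makes the unit-testing style of evaluation above possible.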

Issue detection and traceability

As observed above, some degree of data inaccuracy and nondeterminism is inevitable. Even with highly optimized prompts and architecture, LLMs are inherently nondeterministic. Web search itself can vary in the ranking and content of sources, and outcome reporting in drug development lacks standardization. As a result, we needed an efficient way for a human to find and correct errors and inconsistencies.

To make this manageable and scalable, we focused on two approaches. First, given the size of the dataset, we prioritized flagging potentially anomalous data points for targeted review, which can be done with varying degrees of sophistication. The low-hanging fruit was statistical anomaly detection. Once structured data is extracted, values that fall far outside the expected range for a given endpoint are surfaced for user inspection. For future iterations, we plan to create a more sophisticated model that can discriminate between each study and drug, and rank extracted data points by likelihood of error – determined by the complexity of the trial design, number of endpoints used, indication rarity, and other factors.
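The statistical pass can be as simple as z-score flagging per endpoint; this is a minimal sketch of the idea, not our actual check:

```python
from statistics import mean, stdev

def flag_anomalies(values, z_threshold=3.0):
    """Flag values whose z-score against the endpoint's distribution
    exceeds the threshold; too-small or zero-variance samples are skipped."""
    if len(values) < 3:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > z_threshold]
```

Anything flagged is surfaced for user inspection rather than silently dropped, keeping the analyst in the loop.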

The second approach focused on precise traceability – clearly showing the user where each data point originated and how the LLM derived the extracted value. A simple solution that greatly helped our users pinpoint data issues was to implement a text fragment link, which directs users to the exact line of text within a cited document. While this approach works, text fragments are too fragile to serve as a durable solution, as the LLM's citation may not appear verbatim in the source and the source itself may change over time.
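Text fragment links use the `#:~:text=` scroll-to-text syntax supported by Chromium-based browsers. Building one is straightforward, which also makes the fragility clear: the snippet must appear verbatim in the page for the browser to highlight it.

```python
from urllib.parse import quote

def text_fragment_link(url, snippet):
    """Build a scroll-to-text fragment URL pointing at a cited snippet.
    Breaks if the snippet is not verbatim in the page or the page changes."""
    return f"{url}#:~:text={quote(snippet)}"
```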

What’s Next

While this platform has greatly reduced the time for our teams to gather data and understand an indication’s treatment landscape, we continue to work toward key enhancements of the platform, particularly around interactivity, evaluation, and traceability.

Interactivity will be central to the next iteration of the product and reflects a consistent theme of the feedback we received from users. Understanding an indication requires not only aggregating hard numbers on efficacy and safety, but filtering out noise as well. Analysts want the ability to quickly run what-if analyses as they explore the data, investigate emerging trends across trials, and dynamically query new endpoints as questions arise. At the same time, they have also expressed that they view the landscaping workflow as a learning exercise, not just a mechanical one. In the next iteration of the tool, we plan to add this interactive layer over the robust dataset generated by the current workflow.

Enabling this interactivity introduces new challenges. While the current system relies on human-in-the-loop workflows to gather raw data, future iterations will focus on fast querying, aggregation, and visualization, allowing our analysts to explore and visualize the raw data through natural language.

Evaluation and traceability remain equally critical. So far, evaluation has not only helped identify which LLM models perform best, but has also prompted us to re-interrogate how we architect an LLM agent. We plan to translate these learnings into a new version of the workflow, with each LLM task (activity) updated to the most performant architecture and prompt. We’re also working to advance beyond locating exact text fragments toward semantically surfacing equivalent sections of a source webpage, making verification even more robust.

As we continue refining this platform, our goal remains the same: to build durable, human-centered AI systems that translate complex evidence into better decisions, and ultimately, more new medicines for patients.

See Part I of our blog series on indication landscaping.

© Formation Bio 2026