Building the Data Foundation for Clinical Study Design

Standardizing a decade of unstructured clinical trial data to power scalable, evidence-driven study design

Nov. 21, 2025 · 8 min. read

Clinical trials evaluate how a treatment affects the human body, including its safety and its ability to deliver the intended therapeutic benefit. Designing trials requires protocols that mirror real-world clinical practice, align with the drug’s mechanism of action, and anticipate regulatory expectations. Strong designs also rely on evidence of what has worked, what has failed, and why, built from decades of trials, each contributing a small but meaningful piece through its own objectives, endpoints, populations, and operational choices.

At Formation Bio, we design trials across a number of therapeutic areas. To do this well, we take an extremely data-driven approach – standardizing and drawing from all the public data available to bootstrap initial clinical strategies and plans. Before entering any new program, we need a clear understanding of every prior study in the indication and what has worked and what has not.

Fortunately, clinical trial data exists in public, easily accessible places; however, it is highly unstructured and requires significant cleaning and standardization. Adding structure to this data comes with many challenges, but it ultimately forms a key foundation for our study design capabilities and much more. Let’s dive into how this works.

Building a Clinical Trial Data Structure

When we first began building this data foundation, we knew it wasn’t enough for the model to simply produce “plausible” output. The results had to be grounded in reality, which meant the system needed to explain how each design was derived using data from real clinical trials. We hypothesized that by taking the public data from clinicaltrials.gov, processing it into a structured format, and using it to answer questions about common patterns across all study designs for a given condition, we could guide the model to generate high-quality trial designs much faster than a full clinical development team could on its own.

From the start, we ran into a problem familiar to anyone who has tried to work with clinical trial data. The information is all there, but it exists in a format that makes large-scale analysis very difficult. Clinicaltrials.gov contains over half a million trials spanning more than twenty years of research, each entered through free text fields by different sponsors and primary investigators (who are legally required to report their studies in the United States). The result is a dataset with almost no control over writing style, structure, or even how the fields are interpreted. This inconsistency is reflected in even the most atomic pieces of data. As of this writing, there are 116 different variations of how “type 2 diabetes” is entered in the database.

Since our goal was to create a feasible clinical trial design built on top of existing real-world study designs, it was crucial that we used clean, structured data. We examined dozens of study designs across different clinical domains to identify the common structural elements, and within those, the parameters, the many-to-many relationships, and the one-to-ones.

With guidance from our Clinical Development team, LLM assistance, and many hours of staring at walls of text, we arrived at a structure that showed strong commonality across the many studies we examined and captured the most important design elements of each one – to the point that it became feasible to write a script to break any given study down into its parts.

At the highest level, our structure included data unique to each study that didn’t require processing, such as the study name and description. Then, the endpoints provided substance: what was being measured, how and when it was measured, and the criteria for success. Downstream of the endpoints were the inclusion and exclusion criteria, which define who is eligible to participate. Lastly, we extracted the arms and doses, which are also key components of a study’s design.

{ "high level data": <one to one study data. name, description, etc.>,
"high level non-unique data": <reused data across trials. condition, masking information such as blind or double blind, specific required inclusion / exclusion criteria such as
an age range and the sex of the participants>,
"low level study data": <endpoint definitions, study specific inclusion / exclusion criteria, trial arms data, and doses data. These are one to many with a study, and may appear in multiple studies>
}

Within each piece of data, we added deeper specificity by breaking it down into its common variables. For example, every endpoint specified what was being measured, how it was measured, and a timeframe. Similarly, eligibility criteria were decomposed into a set of pre-identified categories (demographic information, medical history, etc.), and each category had its own bespoke data structure that the criteria were processed to fit.
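As a rough illustration of what these bespoke structures looked like, here is a minimal sketch in Python. The field names, categories, and types shown are simplified examples rather than our exact schema.

from dataclasses import dataclass
from typing import Literal

# Simplified example categories; the real set was defined with our Clinical Development team.
CriterionCategory = Literal[
    "Demographics",
    "Medical Conditions and History",
    "Medications and Treatments",
    "Lab Values",
]

@dataclass
class Endpoint:
    measure: str           # what is being measured
    method: str            # how it is measured
    timeframe: str         # when it is measured
    success_criteria: str  # what counts as success

@dataclass
class EligibilityCriterion:
    category: CriterionCategory
    attribute: str                                    # e.g. "severity of alopecia areata"
    evaluation: Literal["range", "boolean", "value"]  # how the value is interpreted
    value: list[str]                                  # e.g. ["50", "100"]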

Next, we wrote a script to programmatically parse every trial to conform to our structure. The script fetched all data related to a study’s design and passed it through a thin orchestration layer. This layer routed each data type to a specialized ingestion function, with each ingestion function agnostic of the rest of the design. At a high level, the flow looked like the sketch below.
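This is a minimal, illustrative version of that orchestration layer; the function names and component keys are stand-ins for our actual code.

# Illustrative routing layer: each component type gets its own ingestion
# function, and no function needs to know about the rest of the design.
def ingest_endpoints(raw: dict) -> list[dict]: ...          # stub for illustration
def ingest_eligibility_criteria(raw: str) -> list[dict]: ...
def ingest_arms_and_doses(raw: dict) -> list[dict]: ...

INGESTORS = {
    "outcomes": ingest_endpoints,
    "eligibility": ingest_eligibility_criteria,
    "arms_and_doses": ingest_arms_and_doses,
}

def process_study(raw_study: dict) -> dict:
    """Break a raw clinicaltrials.gov study record into structured components."""
    components = {}
    for component_type, ingest in INGESTORS.items():
        section = raw_study.get(component_type)
        if section is not None:
            components[component_type] = ingest(section)
    return components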

To illustrate the process with an example, take the inclusion criteria of study NCT04011748 as they appear on its clinicaltrials.gov page.

Aside from the reformatting of special characters, that page shows the eligibility criteria exactly as they are stored in the database: a single text field that may be delimited by numbers, bullets, stars, or otherwise.
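If you want to pull that raw field yourself, a minimal sketch against the clinicaltrials.gov v2 API might look like the following; the response-field path reflects our reading of the API and is worth verifying against its documentation.

import requests

# Fetch the raw eligibility text for NCT04011748 from the clinicaltrials.gov v2 API.
# The field path below is an assumption about the response shape; check the API docs.
response = requests.get("https://clinicaltrials.gov/api/v2/studies/NCT04011748")
response.raise_for_status()
study = response.json()

eligibility_text = (
    study.get("protocolSection", {})
    .get("eligibilityModule", {})
    .get("eligibilityCriteria", "")
)
print(eligibility_text)  # one free-text blob, delimited by numbers, bullets, or stars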

After a first pass using an LLM to break the large string down into individual eligibility criteria, we passed each string through two processing steps. The first step was specialized to the data type being ingested (endpoint, eligibility criteria, etc.). The second step was generalized text processing, agnostic of data type. For example, take this individual criterion:

Must have a clinical diagnosis of AA, at least 50% hair loss involving the scalp.

Our script passed this to an LLM with a system prompt that carefully defined the limited set of categories for bucketing each eligibility criterion and, for each category, a category-specific structure to parse the rest of the data into. We output the data as JSON and ended up with our original string translated to this:

{
  "value": ["50", "100"],
  "category": "Medical Conditions and History",
  "attribute": "Severity of Alopecia Areata (AA)",
  "evaluation": "range"
}
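As a sketch of what that categorization call might look like, assuming the OpenAI chat API with JSON output, here is a simplified version; the prompt wording, category list, and model name are illustrative rather than our production setup.

import json
from openai import OpenAI

client = OpenAI()

# Illustrative system prompt; the real prompt defined the full category set and a
# bespoke output structure for each category.
SYSTEM_PROMPT = (
    "You classify a single clinical trial eligibility criterion. Pick exactly one "
    'category from ["Demographics", "Medical Conditions and History", '
    '"Medications and Treatments", "Lab Values"]. Return JSON with keys: category, '
    'attribute, evaluation (one of "range", "boolean", "value"), and value (a list of strings).'
)

def structure_criterion(criterion: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model, for illustration only
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": criterion},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

structure_criterion(
    "Must have a clinical diagnosis of AA, at least 50% hair loss involving the scalp."
)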

This first pass accomplished two main goals:

  1. It broke the plain-English criterion into its core logic, with enough specificity that it could almost be interpreted by a simple domain-specific language.

  2. It removed certain information, such as parentheticals, that would otherwise distance a criterion from closely related material, even if doing so sacrificed some precision.

From there, all components regardless of type were passed through a second processing step with the sole purpose of normalizing the text. The finished, fully processed study component looked like this:

{
  "value": ["50", "100"],
  "category": "Medical Conditions and History",
  "attribute": "severity of alopecia areata",
  "evaluation": "range"
}

This processing pass expanded abbreviations, replaced symbols with text (e.g. “%” to “percent”), and removed capitalization, which drastically increased the similarity between components, at least from a machine’s point of view.
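A rule-based stand-in for that normalization step might look like this; in reality our pipeline leaned on LLMs here too, and the abbreviation and symbol maps below are tiny illustrative samples.

# Tiny illustrative maps; the real normalization was LLM-driven and far broader.
ABBREVIATIONS = {"aa": "alopecia areata", "uc": "ulcerative colitis"}
SYMBOLS = {"%": " percent", "≥": " greater than or equal to "}

def normalize(text: str) -> str:
    """Lowercase, replace symbols with words, and expand known abbreviations."""
    text = text.lower()
    for symbol, replacement in SYMBOLS.items():
        text = text.replace(symbol, replacement)
    words = [ABBREVIATIONS.get(word.strip("().,"), word) for word in text.split()]
    return " ".join(words)

normalize("at least 50% hair loss involving the scalp")
# -> "at least 50 percent hair loss involving the scalp"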

Putting Our Data Pipeline to the Test

We ran the script on every study completed in the last ten years within two clinical areas of interest for Formation Bio: Knee Osteoarthritis and Ulcerative Colitis. Our hope was that the larger-scale generation would produce many completely identical processed study components recurring across different trials. This data would be an excellent basis for developing an algorithm that could generate high-quality study designs built from common components used repeatedly across many studies, lending them high credibility and reliability grounded in real-world precedents.

We took three key learnings away from this exercise:

  1. Even with careful parallelization, passing any significant amount of data through a multi-step processing and re-processing pipeline that relies on LLMs at every stage is slow.

  2. Duplicate components were present after processing, but weren’t as common as we’d hoped.

  3. Endpoints, or “outcomes,” as they are called in the clinicaltrials.gov database, are way too complex to be meaningfully deduplicated by applying a structure to them. They are often multiple paragraphs long, attempting to capture branching logic to the point of being totally inscrutable.

Because endpoints are the most critical and influential component of a study’s design, this was a blocker. After a few iterations on the structural design of endpoints and several passes at the prompt we used to ingest them, it became clear that this was beyond the limit of what LLMs were capable of at the time (November 2024). This felt particularly demoralizing because our script parsed eligibility criteria into identical JSON structures quite well, even when their text differed significantly, and even though the criteria were broken down into more complex logical structures than the endpoints were. However, the inconsistent nature of the endpoint prose was too much for the models to transform into even a relatively simple structure. A primary endpoint measuring the same thing might be phrased in two very different ways across two different studies.

Semantically, two such endpoints say the same thing, but our script was simply not able to take those two components, pass them through our very generalized processing step for endpoints, and have them come out the other side as identical JSON.

Taking an Embeddings Approach

For the next pass, we decided to experiment with vectorizing both the processed and unprocessed forms of the endpoint text, using multiple methods of vectorization and multiple clustering algorithms (and, within that, a range of methods for pre-computing or hard-coding the parameters used by those algorithms). Glossing over a large amount of experimentation and data curation, our results were basically the following:

  1. Embeddings are a much more effective approach to understanding similarity between endpoints than simple structuring into stringified JSON followed by a “double equals” comparison.

  2. The performance of the different embedding methods on the processed vs unprocessed text is as follows, from worst to best:

    1. Unprocessed text; TF-IDF embeddings

    2. Processed text; TF-IDF embeddings

    3. Unprocessed text; OpenAI embeddings

    4. Processed text; OpenAI embeddings

The clear winner was the two-step pass at processing the text with our existing algorithm, embedding it with OpenAI embeddings, and using DBSCAN clustering to ensure the vectorized versions of our ground truth data ended up in the same buckets.
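As a minimal sketch of this winning combination, assuming OpenAI’s embeddings endpoint and scikit-learn’s DBSCAN, something like the following would do the job; the model name, eps, and min_samples values are illustrative rather than our tuned settings.

import numpy as np
from openai import OpenAI
from sklearn.cluster import DBSCAN

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed already-processed endpoint text with OpenAI embeddings."""
    response = client.embeddings.create(
        model="text-embedding-3-small",  # assumed model, for illustration only
        input=texts,
    )
    return np.array([item.embedding for item in response.data])

def cluster_endpoints(processed_texts: list[str]) -> np.ndarray:
    """Group semantically equivalent endpoints; label -1 means no match found."""
    vectors = embed(processed_texts)
    # Cosine distance compares embedding directions; eps and min_samples are
    # placeholders here, not our tuned values.
    return DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit_predict(vectors)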

This research had a happy side effect: we discovered that while our original approach of deduplicating eligibility criteria by processing the text and counting the duplicates worked reasonably well, we could achieve better results by applying these embeddings, just as we had done successfully for endpoints.

So, at the end of all of this, we had a pretty robust method of turning a study into its bare components as JSON, and a reliable way to programmatically assess the similarity of those components across studies!

Putting it All Together

Once we confirmed it was possible to clean the data effectively, it was time to clean all of it. By late December 2024, we had scaled up the script and, after some careful cost calculations, got the green light to ingest the past decade of data from clinicaltrials.gov.

After two weeks of script run time, multiple bugs which caused the job to stall in the middle of the night, and me personally babysitting the process during a Christmas Eve party, the ingestion finished within about ten percent of the center of our projected cost range.

The end result was ~6,500,000 individual study components, complete with categorization, structured logic, and, critically, OpenAI embeddings, which we would use as the foundation for our clinical study design capabilities. It was the first time we had a unified, standardized view of a decade of clinical research, distilled into a form that could actually be used programmatically.

This dataset gave us a new way of reasoning about clinical precedent, of comparing studies at scale, and of understanding how designs evolve across indications. For the first time, we could trace patterns that had previously been buried in unstructured text scattered across thousands of trials.

There’s more to the story. In a future post, we’ll discuss how we took this dataset and turned it into a tool that can generate a full clinical study design from a single prompt, complete with written justifications tied to real-world precedent. Stay tuned.
