Structuring Clinical Trial Data at Scale
How LLM-assisted processing transformed 260,000 unstructured clinical trials into 6.5 million structured components and embeddings

Clinicaltrials.gov is one of the most valuable datasets in life sciences, containing hundreds of thousands of clinical trials spanning decades of research. When structured, this trial data can support everything from competitive intelligence and regulatory strategy to study feasibility and clinical trial design, serving as a foundational pillar for any modern life sciences data platform.
Despite its value, the dataset is notoriously messy and difficult to use. Each study is manually entered into free-text fields with no enforced structure, formatting standards, or standardized terminology. Even the simplest details vary widely. For example, as of this writing, there are 116 different variations of how “Type 2 Diabetes” appears in the database. This lack of standardization makes large-scale analysis slow, error-prone, and often impossible without significant preprocessing.
At Formation Bio, we set out to make this data usable. We built a system to ingest every trial on clinicaltrials.gov and translate its unstructured text into a standardized, machine-usable format. The result is a dataset with over 6.5 million structured “study components”, which refer to any data element within the structure of a study – including endpoints, eligibility criteria, arms, doses, conditions, and more. Each component is normalized and labeled consistently, forming the foundation for the more complex systems that we build on top of this data. Let’s dive into how we did it.
Designing Our Data Structure
With millions of data points spread across unstructured text, our first challenge was deciding how a “structured” clinical trial should look. We needed a framework that could be applied consistently across studies regardless of therapeutic area, writing style, or complexity.
We began by manually reviewing dozens of study designs across a range of clinical domains, trying to understand what they all had in common. Across these studies, we examined the condition treated, study endpoints, inclusion/exclusion criteria, and arms/dosage data, among the other critical elements of a clinical trial.
With guidance from our Clinical Development team, support from LLMs, and some time spent reading walls of text, we drafted an initial structure that showed strong commonality across the studies we reviewed, and captured the most important design elements of each one.
This first pass was relatively straightforward, essentially breaking each section of a study down into a concrete schema, similar to structures found in the AACT (Aggregate Analysis of clinicaltrials.gov) database:
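In rough terms, the top level of that first-pass structure can be sketched with a few illustrative Python dataclasses. The field names below are simplified stand-ins rather than our exact schema:

```python
from dataclasses import dataclass, field


# Illustrative stand-in for the first-pass schema; field names are
# simplified and do not reflect our exact internal structure.
@dataclass
class Arm:
    label: str                    # e.g., "Drug A 10 mg once daily"
    intervention: str             # drug, device, or procedure administered
    dose: str | None = None       # still free text at this stage


@dataclass
class Endpoint:
    description: str              # still free text at this stage
    timeframe: str | None = None
    is_primary: bool = False


@dataclass
class StructuredStudy:
    nct_id: str                   # e.g., "NCT04556734"
    conditions: list[str] = field(default_factory=list)
    arms: list[Arm] = field(default_factory=list)
    endpoints: list[Endpoint] = field(default_factory=list)
    inclusion_criteria: list[str] = field(default_factory=list)   # free text
    exclusion_criteria: list[str] = field(default_factory=list)   # free text
```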
This gave us a solid starting point, but it quickly revealed a key limitation. Many of the most important components, especially inclusion and exclusion criteria and study endpoints, were still just free text. If we wanted a data foundation that could be reused, compared, and analyzed programmatically, we needed to break these sections down much more precisely.
Let’s walk through an example. Take the eligibility criteria of the study NCT04556734, exactly as it appears on its clinicaltrials.gov page:
With the exception of reformatted special characters, what you see here is exactly how the eligibility criteria are stored in the database – as a single, unstructured text field delimited by bullets. To structure these text fields, we took a few steps:
First, we used an LLM to break down the text into individual inclusion/exclusion criteria.
For each criterion, we parsed the text into a specialized JSON format specific to the type of text being parsed (e.g., endpoints and eligibility criteria have their own JSON formats).
Finally, we normalized the JSON using an LLM for general text processing (capitalization, parentheses, etc.). A rough sketch of all three steps follows below.
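The sketch assumes the OpenAI Python SDK with heavily abbreviated prompts; the real prompts and JSON schemas are far more detailed, and real code would also validate the model output at each step.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(prompt: str, text: str) -> str:
    """One LLM call; the prompts used here are abbreviated stand-ins."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content


def structure_eligibility(raw_criteria_text: str) -> list[dict]:
    # Step 1: split the bulleted wall of text into individual criteria.
    criteria = json.loads(ask(
        "Split this eligibility section into a JSON array of individual "
        "inclusion and exclusion criteria.",
        raw_criteria_text,
    ))

    structured = []
    for criterion in criteria:
        # Step 2: parse each criterion into the criteria-specific JSON format
        # (endpoints have their own format and prompt).
        parsed = ask(
            "Parse this criterion into JSON with a category, the attribute "
            "being measured, the target value, and the evaluation method.",
            criterion,
        )
        # Step 3: normalize the text (expand abbreviations, standardize
        # capitalization, remove redundant parentheticals).
        normalized = ask(
            "Normalize the text fields of this JSON: expand abbreviations, "
            "standardize capitalization, and remove redundant parentheticals.",
            parsed,
        )
        structured.append(json.loads(normalized))
    return structured
```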
After running step 1, a single exclusion criterion could read:
This was then passed into an LLM, which bucketed each criterion into a defined set of categories that would make them comparable across studies, such as “Medical Conditions and History,” “Lifestyle and Social Factors,” “Laboratory and Diagnostic Findings,” or “Demographic Information.” Additionally, this process identified key pieces of information, including the attribute the criterion measures, its target value, and its evaluation method. For the above criterion, this looks like:
Once the criterion was distilled into a machine-usable format, words identified as adding unneeded specificity were removed.
A final processing step normalized the text: expanding abbreviations, removing redundant parentheticals, and standardizing capitalization. The finished, fully processed study criteria looked like this:
Finally, we’re left with a finished first pass at a structured clinical trial, with the critical free-text fields (endpoints, criteria, arms/doses, etc.) structured and normalized in a machine-friendly manner.
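To make the end state concrete, each processed criterion ends up as a small, uniform record. The example below is invented for illustration; the field names and values are not actual pipeline output for the study above.

```python
# Invented example of the shape of one fully processed exclusion criterion.
processed_criterion = {
    "type": "exclusion",
    "category": "Laboratory and Diagnostic Findings",
    "attribute": "hemoglobin A1c",
    "target_value": "> 9%",
    "evaluation_method": "laboratory test at screening",
    "normalized_text": "Hemoglobin A1c greater than 9 percent at screening",
}
```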
Putting Our Data Pipeline to the Test
Next, it was time to test our structure at scale. We wrote a script to programmatically parse any trial into our structure. The script would ingest all of the data related to a clinical trial and pass it through the normalization process described above.
As we wrote the script, our hope was that large-scale processing would uncover many identical components across trials, particularly within a single indication, where validated endpoints and population definitions are frequently reused.
These recurrences matter because they reveal which elements of clinical trials consistently occur across studies. They highlight the objectives, criteria, endpoints, and procedures that researchers rely on and that regulators have historically accepted. When we extract these recurring components into a single structured view, we can unlock a wide range of insights, such as which study features correlate with approval, which elements are more common in successful trials, how endpoints vary by phase, and much more. This becomes a powerful foundation for building more advanced analytics and applications across many functions.
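Concretely, once every component has been normalized into JSON, finding these recurrences by exact match reduces to grouping identical serialized components across studies. A minimal sketch, assuming components are plain dicts keyed by NCT ID (the record structure here is illustrative):

```python
import json
from collections import defaultdict


def component_occurrences(studies: dict[str, list[dict]]) -> dict[str, set[str]]:
    """Map each normalized component (in a canonical serialized form)
    to the set of NCT IDs it appears in."""
    occurrences: dict[str, set[str]] = defaultdict(set)
    for nct_id, components in studies.items():
        for component in components:
            key = json.dumps(component, sort_keys=True)  # canonical form
            occurrences[key].add(nct_id)
    return occurrences


def shared_fraction(occurrences: dict[str, set[str]]) -> float:
    """Fraction of distinct components that appear in more than one study."""
    shared = sum(1 for nct_ids in occurrences.values() if len(nct_ids) > 1)
    return shared / max(len(occurrences), 1)
```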
With that in mind, we deployed our script across a subset of studies to see how well the approach held up in practice. The script ran successfully, but it surfaced several important learnings, especially around our goal of finding meaningful overlaps. Our first pass revealed three key takeaways:
Even with careful parallelization, passing any significant amount of data through a multistep processing and reprocessing pipeline that relies on LLMs at every stage is slow.
Overlapping components (e.g., criteria, endpoints) were present after processing, but weren’t as common as we’d hoped. Only about 9% of study components appeared in more than one study.
Endpoints (or “outcomes” in the clinicaltrials.gov database) were far too complex to deduplicate through structure alone. Many spanned multiple paragraphs and attempted to encode branching logic, making them extremely difficult to interpret consistently. Even when two endpoints were semantically identical, the phrasing varied so dramatically that our system struggled to treat them as the same.
The third takeaway, the complexity of endpoints, was a major problem for us. Endpoints are one of the most critical components of a study’s design, and we wanted to be able to better understand when trials were using similar or different endpoints.
However, after a few iterations on the structural design of endpoints and several passes at the prompt we used to ingest them, it became clear that this was beyond the limit of what LLMs were capable of at the time. Our script was able to handle recurring eligibility criteria quite well, even with significant textual differences and more complex logical structures, but the inconsistent nature of endpoint prose proved too challenging for models to transform consistently.
For example, a primary endpoint across two different studies might read:
Semantically, both these endpoints are saying the same thing, but our script was unable to take these two endpoint components, pass them through our generalized processing step for endpoints, and have them come out the other side as identical JSON, which we needed for true standardization and consistency.
Taking an Embeddings Approach
To address this problem, we experimented with vectorizing both the processed JSON and unprocessed plain text forms of the endpoint text. Vectorization converts text into numerical representations (embeddings) that capture semantic meaning, allowing us to compare endpoints that differ significantly in phrasing.
We tested multiple embedding methods and several clustering algorithms, each with different approaches to setting or pre-computing their parameters. Our hypothesis was that embeddings would allow us to group semantically similar endpoints more effectively, and that generating embeddings from a stringified version of our processed and normalized JSON might perform even better.
With the help of our clinical team, we built a ground truth dataset of endpoints that were semantically similar enough that a strong version of our algorithm should identify them as equivalent. We then experimented with embedding both the unprocessed plain text and the processed text, using K-Means clustering to test whether duplicates from our ground truth set would fall into the same clusters. For the embeddings, we compared TF-IDF vectors, a traditional frequency-based representation that captures surface-level word overlap, with OpenAI embeddings, which capture semantic meaning in a more sophisticated way.
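A rough sketch of that comparison, using scikit-learn for TF-IDF and K-Means alongside the OpenAI embeddings API. The model name and cluster count are illustrative, and our actual evaluation was more involved:

```python
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

client = OpenAI()


def openai_embed(texts: list[str]) -> np.ndarray:
    """Semantic embeddings from the OpenAI API (model choice is illustrative)."""
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in response.data])


def tfidf_embed(texts: list[str]) -> np.ndarray:
    """Traditional TF-IDF vectors as the non-LLM baseline."""
    return TfidfVectorizer().fit_transform(texts).toarray()


def same_cluster_rate(texts, duplicate_pairs, embed, n_clusters=50):
    """Fraction of ground-truth duplicate pairs that land in the same
    K-Means cluster; `duplicate_pairs` holds index pairs into `texts`."""
    vectors = embed(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    hits = sum(1 for i, j in duplicate_pairs if labels[i] == labels[j])
    return hits / max(len(duplicate_pairs), 1)


# The comparison, in miniature: raw text vs. processed text, TF-IDF vs.
# OpenAI embeddings, each scored against the ground-truth duplicate pairs.
# e.g., same_cluster_rate(processed_texts, ground_truth_pairs, openai_embed)
```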
After a large amount of experimentation, the clear winner was a two-step approach: processing the text with our existing algorithm, then embedding it using OpenAI embeddings.
This research also had an unexpected benefit. While our original approach to finding recurrences in eligibility criteria (i.e., processing the text and counting exact matches) worked reasonably well, we found that applying embeddings and clustering to those components increased the number of semantic duplicates our algorithm could identify, just as it had for endpoints.
At the end of this, we finally had a robust method for breaking each study into its core components as JSON, and a reliable way to programmatically measure the similarity of those components across studies.
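Downstream, comparing any two components programmatically comes down to a similarity score between the embeddings of their processed text. The clustering above is the grouped version of this idea; a minimal pairwise sketch uses cosine similarity:

```python
import numpy as np


def component_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two component embeddings; higher values
    indicate more semantically similar components."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```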
Putting It All Together
Once we confirmed it was possible to clean the data effectively, it was time to clean all of it. After scaling up the script and conducting some careful cost modeling, we got the green light to ingest the past decade of studies from clinicaltrials.gov.
After two weeks of continuous runtime and several late-night stalls from unexpected bugs, the ingestion finally completed, landing within 10% of the midpoint of our projected cost range.
The end result was more than 260,000 structured studies and 6.5 million individual study components, corresponding to all studies posted to clinicaltrials.gov in the last decade, complete with categorization, structured logic, and OpenAI embeddings. This was our final structure across those 260,000+ studies:
After breaking down the data, including fields that previously had no delimiters in the AACT database and couldn’t be counted reliably, we were finally able to see the full distribution of study components. The numbers aligned with what you might expect from an average clinical trial. There were more secondary endpoints than primary endpoints, more exclusion criteria than inclusion criteria, and a generally high ratio of criteria to total endpoints. For the first time, we had a unified, standardized view of decades of clinical research, distilled into a format that could be used programmatically and serve as a foundation for more advanced analytics.
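As a small illustration, once every component carries a type label, computing that distribution is a one-liner. The table layout and column name below are assumed, not our actual storage format:

```python
import pandas as pd

# Hypothetical flat export of the structured components, one row per component,
# with a `component_type` column such as "primary_endpoint" or "exclusion_criterion".
components = pd.read_parquet("study_components.parquet")
print(components["component_type"].value_counts())
```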
Conclusion and Next Steps
This dataset gave us a new way of reasoning about clinical precedent, comparing studies at scale, and understanding how trial designs evolve across indications. For the first time, we could trace patterns that had previously been buried in unstructured text scattered across hundreds of thousands of records.
More importantly, this foundation has unlocked entirely new capabilities. We can search and compare trials with precision, identify recurring design elements across therapeutic areas, study how endpoints rise and fall in popularity over time, and evaluate which components tend to appear in successful or unsuccessful programs. It also enables downstream analytics and systems that simply weren’t possible before, from feasibility modeling and competitive landscaping, to automated understanding of trial precedents.
This is only the beginning. By turning clinical research into structured, machine-usable data, we’ve created a foundation that we’ll be building on for a long time. In a future post, we’ll provide an in-depth exploration of the systems we’ve built on top of this dataset, including one that can generate a full clinical study design from a single prompt, complete with evidence grounded in real-world precedent.