Structuring Unstructured Clinical Data

Transforming healthcare information through an innovative pipeline approach

The Gap in Healthcare

"Over 80% of digital data in healthcare is available as unstructured data." - National Library of Medicine

In an industry where data-driven decision-making is critical, relying on only 20% of the available information is a significant limitation. The problem is one of quality as well as quantity: only 13% of unstructured EHR data has corresponding entries in structured fields (medRxiv), which leaves a large amount of clinically valuable information untapped.

The implications are profound:

• Incomplete analyses
• Missed opportunities for improved patient care
• Risk of suboptimal decisions

The Attack Plan

With a problem this significant and only 7 days to build a solution, a precise scope was non-negotiable. We focused on a simplified version of the challenge: receiving free-form clinical data in .txt files and transforming it into structured data in FHIR format.

The prevailing approach is to pre-train a Large Language Model (LLM) on extensive clinical and medical data and then fine-tune it for tasks such as data extraction and structuring. However, tackling such a broad problem in one step is neither efficient nor effective for us as engineers.

Our modular strategy:

• Break down the challenge into three smaller, manageable parts
• Focus on key components individually
• Achieve more accurate results
• Provide a scalable and iterative path toward a comprehensive solution

The Pipeline

Our pipeline has three main components: Concept Enrichment, Sectioning, and FHIR Formatter.

1. Concept Enrichment

We extract key concepts from the text and enrich them using UMLS—a framework that integrates medical vocabularies and ontologies, providing structured information about each concept.

2. Sectioning

From the enriched data, we use the LLM to classify information into specific categories, each mapped to a corresponding FHIR resource, preparing it to be easily transformed later.

3. FHIR Formatter

Using the categorized data and predefined mappings, we leverage an LLM with a dedicated FHIR formatter tool to structure the information and generate the final FHIR file.
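As a rough illustration of how the three components chain together, here is a minimal sketch. The function names, the shared state dictionary, and the placeholder stage bodies are all assumptions made for this example; they are not taken from the MediStructure code.

```python
from typing import Callable, Dict, List

# Hypothetical stage signature: each stage takes the shared state and returns
# an extended copy of it. Names are illustrative only.
Stage = Callable[[Dict], Dict]


def enrich_concepts(state: Dict) -> Dict:
    # Placeholder: a real stage would extract concepts and look them up in UMLS.
    return {**state, "concepts": []}


def section(state: Dict) -> Dict:
    # Placeholder: a real stage would ask an LLM to classify the text.
    return {**state, "sections": {}}


def format_fhir(state: Dict) -> Dict:
    # Placeholder: a real stage would emit one or more FHIR resources per section.
    empty_bundle = {"resourceType": "Bundle", "type": "collection", "entry": []}
    return {**state, "bundle": empty_bundle}


PIPELINE: List[Stage] = [enrich_concepts, section, format_fhir]


def run_pipeline(text: str) -> Dict:
    """Run the stages in order, threading the state through each one."""
    state: Dict = {"text": text}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

Keeping each stage behind the same signature is one way to let a single component be swapped or tested in isolation, which fits the modular strategy described earlier.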

Technical Implementation

Concept Enrichment Layer

For the initial extraction, we use a rule-based system. While LLMs excel at generating text, they lack true conceptual understanding and domain expertise.

To enhance the model's contextual awareness, we first extract key concepts from the text and enrich them using UMLS—a framework that integrates medical vocabularies and ontologies. This enrichment provides structured information about each concept, such as whether it represents a symptom, a substance, or another medical entity.

Example: Enrichment for the concept "back pain" yields:

Back pain (medical_term)

• UMLS CUI: C0004604
• Semantic Types: Sign or Symptom
• Synonyms: pain; back, Back Pains, Pain back
• Definition: Painful sensation in the back region.
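A rule-based enrichment step along these lines could be sketched as follows. This is an illustration under stated assumptions: the `EnrichedConcept` class, the `LEXICON` dictionary, and `extract_concepts` are hypothetical names, and the one-entry in-memory lexicon merely stands in for a real UMLS lookup, using the "back pain" values shown above.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EnrichedConcept:
    term: str
    cui: str
    semantic_types: Tuple[str, ...]
    definition: str


# Stand-in for a UMLS lookup (assumption): a tiny in-memory lexicon keyed by
# lowercase phrase. A real system would query UMLS instead.
LEXICON = {
    "back pain": EnrichedConcept(
        term="Back pain",
        cui="C0004604",
        semantic_types=("Sign or Symptom",),
        definition="Painful sensation in the back region.",
    ),
}


def extract_concepts(text: str) -> List[EnrichedConcept]:
    """Rule-based pass: case-insensitive whole-phrase match against the lexicon."""
    found = []
    for phrase, concept in LEXICON.items():
        pattern = rf"\b{re.escape(phrase)}\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(concept)
    return found
```

The word-boundary match keeps the rule deterministic and easy to audit, at the cost of missing misspellings and paraphrases that a synonym list or fuzzy matcher would need to cover.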

Sectioning Layer

From the enriched data, we then use the LLM to classify the information into specific categories, preparing it to be easily transformed into FHIR format later.

Each category maps directly to a FHIR resource, the basic building block of the FHIR standard. For example, data identified under "Family History" is mapped to the FHIR resource "FamilyMemberHistory".

The LLM identifies and categorizes different parts of the clinical text into logical sections such as:

• Patient history
• Symptoms and complaints
• Clinical observations
• Diagnoses and assessments
• Treatment plans
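The category-to-resource mapping can be pictured as a simple lookup table. In the sketch below, only the "Family History" pairing comes from the text; the other two pairings, along with the names `SECTION_TO_RESOURCE` and `resource_for`, are assumptions chosen for illustration.

```python
from typing import Optional

# Section label -> FHIR resource type.
SECTION_TO_RESOURCE = {
    "Family History": "FamilyMemberHistory",    # pairing stated in the text
    "Diagnoses and assessments": "Condition",   # assumed pairing
    "Clinical observations": "Observation",     # assumed pairing
}


def resource_for(section: str) -> Optional[str]:
    """Return the FHIR resource type for a section label, or None if unmapped.

    Matching ignores case and surrounding whitespace, since labels produced
    by an LLM may vary slightly in formatting.
    """
    normalized = section.strip().casefold()
    for name, resource in SECTION_TO_RESOURCE.items():
        if name.casefold() == normalized:
            return resource
    return None
```

Returning None for an unknown label, rather than guessing, lets the caller decide whether to drop the section or flag it for review.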

FHIR Formatter Layer

Using the categorized data and predefined mappings, we leverage an LLM in conjunction with a dedicated FHIR formatter tool to structure the information.

The final FHIR file is then generated and made available for download.

This layer maps the sectioned clinical data to appropriate FHIR resources such as:

• Patient
• Condition
• Observation
• MedicationStatement
• Procedure
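To make the target format concrete, here is a minimal sketch that builds a Condition resource and wraps it in a Bundle using plain dictionaries. The helper names, the patient identifier, and the choice to represent "back pain" as a Condition are assumptions for this example, and the UMLS system URI should be checked against the FHIR terminology documentation. The output is not validated against the FHIR specification here.

```python
import json
from typing import Dict, List


def condition_resource(patient_id: str, display: str, cui: str) -> Dict:
    """Build a minimal FHIR Condition as a plain dictionary (illustrative only)."""
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "text": display,
            "coding": [
                {
                    # Assumed system URI for UMLS; verify before relying on it.
                    "system": "http://www.nlm.nih.gov/research/umls",
                    "code": cui,
                    "display": display,
                }
            ],
        },
    }


def bundle(resources: List[Dict]) -> Dict:
    """Wrap resources in a FHIR Bundle of type 'collection'."""
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{"resource": resource} for resource in resources],
    }


if __name__ == "__main__":
    # "example" is a hypothetical patient identifier.
    condition = condition_resource("example", "Back pain", "C0004604")
    print(json.dumps(bundle([condition]), indent=2))
```

In practice, a dedicated FHIR library or validator would be the safer route for checking required fields and value sets before the file is offered for download.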

The Future of MediStructure

As we continue to evolve our solution, here's what's on the horizon:

1. Enhanced Input Processing

Currently, our pipeline supports a single text file as input. Moving forward, we plan to enhance its capabilities by introducing batch processing for multiple files.

2. Database Integration

Enabling direct integrations with database tools will significantly improve both scalability and usability, making MediStructure a seamless addition to existing healthcare systems.

3. Model Optimization

We're excited to experiment with different LLMs and fine-tuning strategies to optimize performance and deliver increasingly accurate results.

4. Continuous Refinement

Our goal is to continuously refine MediStructure, ensuring it delivers the best possible results for the healthcare industry through ongoing improvements and updates.

Transforming clinical unstructured data into structured data. Created by Gil Arroteia, Miguel Nunes & João Perleques