NB: this series is still a work in progress.

Healthcare AI Development Infrastructure

This post is a part of the healthcare AI infrastructure series. For a general overview and introduction, check out the introduction post.

This post aims to introduce some of the tools available for making healthcare AI models that will eventually end up impacting clinical care. Although AI models can be developed first and attention turned to implementation later, I find it's best to think of these processes as closely linked. Implementation is the step in which most healthcare AI projects die, and this often occurs because development happened without confronting the constraints that implementation will impose. If you seriously understand implementation limitations and design accordingly during development, you'll have a much greater chance of success when you get to implementation.

So even though this post is focused on development, it is written from the perspective of someone who's gone downstream several times. You can also take the time to read through the post on implementation so you understand the whole lifecycle before you start your journey.

Overview

As we briefly discussed in the last post, you need two things to build a model: data and a development environment (a computer). Data often come in the form of a report or extract from a clinical database, like the database of an EMR or another clinical system. These data are then transferred to computing environments that are set up for developing models. These environments tend to be configured with special software and hardware.

Overview of model development. Data are extracted from clinical systems, like the EMR. These data are then transferred to model development environments, where developers can write code to develop and validate AI models.

The above figure depicts the environments and the data flows between them for model development. Generally, it's fairly straightforward: data are extracted from clinical systems and then moved into a model development environment, where most of the development work is done.

Data

Data reports out of the underlying clinical systems can take a variety of forms. Their most basic embodiment is a simple table, where each patient is a row and columns represent different types of information about that patient. Once you have research or QI access, it is pretty straightforward to get extracts of data from the EMR. When working with your local Epic analysts (employed by the hospital), they will probably give you data in the form of an Excel or CSV file. You can also get data from other sources, like collaborating institutions (where you have a shared IRB or BAA) or open-source datasets like those available on PhysioNet.
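As a concrete example, here's a minimal sketch of loading one of these extracts with pandas and running a few sanity checks. The file name and column names are placeholders for whatever your analysts deliver.

```python
import pandas as pd

# Hypothetical extract: one row per patient, columns for different variables.
# The file name and column names are placeholders, not a real schema.
df = pd.read_csv("emr_extract.csv")

# Basic sanity checks before any modeling work.
print(df.shape)                    # rows = patients, columns = variables
print(df["patient_id"].is_unique)  # confirm one row per patient
print(df.isna().mean().sort_values(ascending=False).head())  # missingness by column
```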

Development Environments

Healthcare AI model development has typically taken place on on-premises servers maintained by the health system or by engineering departments capable of attaining HIPAA compliance. Privacy is super important, worthy of its own set of posts, but we won't be able to do it justice here, so make sure to work with your compliance people to do the right thing. In terms of software, it's fairly standard to use a Linux or Windows operating system with a Python development environment. You usually want to allow Python packages to be downloaded, as there's a lot of great open-source software out there for this type of work (e.g., scikit-learn, PyTorch, TensorFlow). You'll want to make sure that you have a fairly capable machine (lots of RAM and CPU cores), and having access to GPUs will make your life easier as well. Maintaining all this infrastructure can be pretty difficult, so there's been growing consideration of cloud-based computing environments.1
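If you're setting up or vetting one of these machines, a quick sanity check along these lines can save headaches later. This assumes PyTorch is part of your stack; swap in your framework of choice.

```python
import os
import sys

import torch  # assuming PyTorch is installed; substitute your framework

print(f"Python: {sys.version.split()[0]}")
print(f"CPU cores: {os.cpu_count()}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```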

Research Infrastructure

Now we can start to talk about the specific infrastructure that you may have to deal with. This infrastructure is often a shared resource that supports multiple types of data-driven research, like health services research, epidemiology, and multi-omics.

Research infrastructure architecture diagram. Several clinical systems, like the laboratory information system (LIS) and the EMR (Chronicles and Clarity), along with many other sources, may get fed into a central research data warehouse (RDW). This is then queried to get reports that can be used to develop models.

If your institution uses Epic, your research IT setup may be similar to what we have at Michigan (depicted above). Our data make several stops before they get to model developers. These stops are known as ETLs (short for extract, transform, load): processes that take data in one format and convert them to another for entry into a database. There are two ETLs, the first of which is pretty much mandatory.
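In code terms, an ETL is nothing exotic. Here's a toy sketch of the pattern; every connection string, table, and column name is made up for illustration.

```python
import pandas as pd
import sqlalchemy

# Toy ETL sketch: all names and connection strings are invented.
source = sqlalchemy.create_engine("mssql+pyodbc://@operational_dsn")
target = sqlalchemy.create_engine("mssql+pyodbc://@warehouse_dsn")

# Extract: pull raw rows out of the operational database.
raw = pd.read_sql("SELECT patient_id, enc_date, lab_value FROM encounters", source)

# Transform: clean and reshape into the format the analytical database expects.
clean = (
    raw.assign(enc_date=pd.to_datetime(raw["enc_date"]))
       .dropna(subset=["lab_value"])
)

# Load: write the transformed rows into the destination database.
clean.to_sql("encounter_facts", target, if_exists="append", index=False)
```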

Chronicles → Clarity

Chronicles is a database meant to support healthcare operations, but it's not optimized for massive queries on large populations of patients. To offload and optimize these kinds of analytical queries, Epic created Clarity, a SQL database (built using Microsoft SQL Server) that is a transformation of the data stored in Chronicles. An ETL runs every day that pulls data out of Chronicles and into Clarity.

Clarity → RDW

Some institutions allow researchers to directly access data from Clarity. That's not the case at Michigan; instead, there is a database specifically designed for researchers, known as the research data warehouse (RDW). RDW is also a SQL database and is built on top of CareEvolution's Orchestrate tooling. This additional layer imposes some additional transformations, but it also allows other types of data, such as information from wearables or insurers, to be merged alongside the EMR data.

Data are queried from RDW and then passed to the model development infrastructure, where engineers can work diligently to produce a model.
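A hedged sketch of that hand-off might look like the following; the connection string, table, and columns are invented for illustration, and the model is a deliberately simple placeholder.

```python
import pandas as pd
import sqlalchemy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Query the RDW; connection string, table, and column names are invented.
engine = sqlalchemy.create_engine("mssql+pyodbc://@rdw_dsn")
cohort = pd.read_sql(
    "SELECT patient_id, age, creatinine, hemoglobin, outcome FROM sepsis_cohort",
    engine,
)

# A standard development loop on the extracted data.
X = cohort[["age", "creatinine", "hemoglobin"]]
y = cohort["outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"Held-out AUROC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}")
```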

A note on ETLs

We have found that ETLs may impact the performance of AI models. There may be subtle differences between the data that come out of an ETL process and the underlying real-time data. This is a type of dataset shift that we termed infrastructure shift, and it means that you can expect slightly worse model performance in these situations. For more information, check out our Mind the Performance Gap paper.
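One practical consequence: if you can get your hands on data generated by the real-time pipeline, it's worth scoring your model against both sources and comparing. A minimal sketch, reusing the trained `model` from the earlier example; file and column names are placeholders.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical: the same cohort extracted by two different pipelines, plus
# the trained `model` from the earlier sketch. Names are placeholders.
retro = pd.read_csv("rdw_extract.csv")       # research ETL (retrospective)
prosp = pd.read_csv("realtime_extract.csv")  # real-time pipeline (prospective)

features = ["age", "creatinine", "hemoglobin"]
retro_auc = roc_auc_score(retro["outcome"], model.predict_proba(retro[features])[:, 1])
prosp_auc = roc_auc_score(prosp["outcome"], model.predict_proba(prosp[features])[:, 1])

# A meaningful gap here is the infrastructure shift described above.
print(f"Retrospective AUROC: {retro_auc:.3f}")
print(f"Prospective AUROC:  {prosp_auc:.3f} (gap: {retro_auc - prosp_auc:+.3f})")
```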

The Interface Between Development and Implementation

As we start to finalize models, we end up at the interface between development and implementation. This interstitial space is tricky because it not only spans a couple of steps of the lifecycle but also spans different types of infrastructure. I use the arbitrary distinction of technical integration as the demarcating line: if the model does not yet receive prospective data (i.e., it is not technically integrated), then it's still in development. Much of the discussion from here on out hinges on how the model developer chooses to implement the model. We will talk extensively about the choices and their implications in a little bit, but first we've got to set up the last bit of development for one of these avenues.

Epic Development Infrastructure

If you choose to implement using Epic's tooling (or any other vendor's), you will have to get your model to work on their infrastructure. This is a wonky space that will likely get better over time. But in order to do technical integration with Epic, you need to test and package your model using a custom development environment that they provide. I won't go into a ton of detail here, as you're best served by going to Epic Galaxy to see the latest and greatest documentation.

As a part of model development, Epic provides a Python environment with a load of standard Python AI/ML libraries (…list). They also provide a couple of custom Python libraries that help you interact with the data interfaces:

  • You can receive tabular data in a JSON structure that is parsed by their libraries. You can then pre-process the data and pass it to your model.
  • Once you have your predictions, you package up the data using another set of Epic's Python calls (see the sketch below).
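I won't reproduce Epic's actual helper calls here (Galaxy has the real documentation), but conceptually the code you package looks something like this sketch. Every name below is hypothetical; in particular, `score` and its signature are not Epic's actual API.

```python
import pandas as pd

def score(records: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical entry point. In practice, Epic's wrappers parse the incoming
    JSON into tabular data and package up whatever you return; none of the
    names here are Epic's real API."""
    # Pre-process: apply the same transformations used during development.
    features = records[["age", "creatinine", "hemoglobin"]].fillna(0)

    # `model` would be loaded from the files bundled into the package.
    predictions = model.predict_proba(features)[:, 1]

    return pd.DataFrame({"patient_id": records["patient_id"], "score": predictions})
```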

Although the development environment is sandboxed, you are not overly constrained in terms of the code you want to include. You can include additional Python and data files in the model package. Additionally, you can call external APIs from within this environment if they are whitelisted. This means that you could include information from other sources or do data processing via another service.

You can take an existing model that was developed on your own infrastructure and, as long as you stick to Epic's approved list of libraries, use Epic's bundling tools to convert it into a package that can be run on their ECCP instance. The way the model receives data is through a Reporting Workbench report, so you'll work with your Epic analyst to set up a report that essentially is the tabular data you want your model to receive: you specify all the columns, and an Epic analyst will help get this done.

Architecture diagram for developing models inside of an EMR vendor's system. The clinical database generates reports that are sent to the model development environment, where developers write code for model development and validation, leading to a model being created. This model is then tested and packaged using the vendor's software. Once tested, the model is ready for implementation.

In this workflow you assess and convert a model that you made with your own infrastructure into a package that can be run on Epic's implementation infrastructure. What's crucial about the workflow depicted above is that there's a data report that comes directly out of Chronicles (not Clarity) that you use as part of this packaging workflow. This report is often a small extract representing current patients in the health system. Thus, despite being small, it is a very good representation of what the data will look like prospectively, as it's generated by the prospective infrastructure. I think it's a really good opportunity to address infrastructure shift, if the model developer uses these data in addition to a larger, retrospectively collected research dataset during development. Maybe I'll do some research in this direction…
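If you want to act on that opportunity, even a crude per-column comparison between the research extract and the Chronicles-generated report can surface surprises before go-live. A sketch, assuming both are already loaded as DataFrames with shared (placeholder) columns:

```python
from scipy.stats import ks_2samp

# Assumes `research_df` (large retrospective RDW extract) and `chronicles_df`
# (the small Chronicles-generated report) are loaded with shared columns.
for col in ["age", "creatinine", "hemoglobin"]:  # placeholder feature names
    stat, p_value = ks_2samp(research_df[col].dropna(), chronicles_df[col].dropna())
    print(f"{col}: KS statistic = {stat:.3f}, p = {p_value:.3g}")
```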

Wrapping Up

In this post we've covered the foundational aspects of healthcare AI model development, from data acquisition to development environment setup.

Looking ahead, our next post will detail the infrastructure needed to support AI implementation. The brief discussion on the intersection of development and implementation prepares us for that discussion and underscores a crucial recurring theme: the need for foresight and integrated planning in these types of projects. By understanding the detailed processes of how data are handled, transformed, and utilized, developers can better anticipate the practical challenges that arise during model implementation. This anticipatory approach not only streamlines the transition from development to use but also enhances the adaptability and effectiveness of the solutions we create.

Cheers,
Erkin
Go ÖN Home

  1. I’ve never done a cost break-down analysis for on-premises vs. cloud for healthcare AI research, but I’d love to see results if anyone has some handy.