Technical Review Of Modern Machine Learning In Healthcare Informatics

Karthik Shiraly
November 15, 2021

The health care industry generates almost one-third of all the data in the world, and the compound annual growth rate of its data is projected to reach about 36% by 2025. To uncover insights from such massive volumes of data, you need machine learning, deep learning, artificial intelligence, data science, and data mining at the forefront of your health care informatics.

Unlike bioinformatics, which focuses on informatics for biomedical fields (like genomics, drug discovery, clinical trials of drugs, and biomedical engineering), health informatics and medical informatics focus on improving decision-making by clinicians and researchers to improve personalized and public health care. Health informatics focuses on the use, storage, and retrieval of:

  • Health data like electronic health records (EHRs) and other medical records
  • Data from health care equipment like magnetic resonance imaging (MRI) and electrocardiogram (ECG) machines

Let’s explore how your organization can benefit from the use of artificial intelligence, deep learning, and machine learning in health care informatics.

Techniques of Machine Learning in Health Care Informatics

An understanding of available machine learning approaches for your health care informatics initiatives can speed up your prototyping and implementation plans. Common machine learning algorithms that use deep neural networks for health care informatics include:

  • Convolutional neural networks (CNNs): CNNs are typically used for medical imaging tasks. However, they also work well as classifiers and detectors for non-imaging data, like time series data, if it can be converted into an image-like representation.
  • Transformer networks: Transformer networks are the current state-of-the-art (as of 2021) for natural language processing (NLP) — where they outperform recurrent neural networks — and are replacing CNNs for imaging tasks. They are great at both spatial and temporal pattern recognition.
  • Recurrent neural networks (RNNs): Recurrent neural networks and their variants, like long short-term memory (LSTM) networks and gated recurrent unit (GRU) networks, are extensively used in problems where the sequence of data matters. However, RNNs are being rapidly replaced by transformers because the latter are easier to train and more capable.
  • Interactive machine learning: Interactive machine learning approaches place human medical experts in the optimization and prediction workflows to improve accuracies and make fewer errors. Health care mistakes can be socially unacceptable and financially expensive for your business. To avoid them, human-in-the-loop approaches to tasks like subspace clustering and data anonymization help reduce an exponential search space with heuristic inputs provided by medical experts. 
  • Reinforcement learning: Reinforcement learning models learn a sequence of decisions by trial and error, guided by a reward signal, and they can be combined with demonstrations when a system should learn to mimic a human expert. Remotely conducted robotic examinations and surgeries are good applications of reinforcement learning. However, it currently isn’t used much in health care informatics.

You’ll get to know both transformer and convolutional neural networks in detail in the case studies that follow.

Detect Heart Diseases From Heart Sounds Using Machine Learning

Heart sound waveform and spectrogram (Source: Li et al.)

The World Health Organization (WHO) says cardiovascular diseases cause a third of all global deaths. To reduce their toll, it recommends detecting them and starting treatments as early as possible. One way to do this is by listening to the beats and murmurs made by the heart using a stethoscope, a process called auscultation. However, only highly experienced doctors can detect diseases this way. In areas with inadequate medical personnel and large populations, it’s an impractical approach.

Automated detection, using machine learning on low-cost devices, can drastically improve early detection and triaging of deadly cardiovascular diseases. It brings economical, real-time clinical decision support, possibly remotely, for large populations. Let’s explore one research study of automated detection that can serve as a template for other health care detection tasks. It proposes a lightweight convolutional deep neural network to detect heart diseases from heart sound waves captured using phonocardiogram (PCG) machines.

Convolutional Neural Network Model for Detecting Heart Disease

You normally use a CNN to describe objects in images through their characteristic visual features. Can you use it on audio data too? Yes, you can use a CNN as an audio feature detector by converting the audio data to a visual representation like an amplitude-vs-time waveform or a frequency-vs-time spectrogram. A frequency-vs-time representation is preferable because it separates out frequencies as fine-grained features that can improve the accuracy of machine learning algorithms.

In heart sound detection, a sound waveform from a PCG is converted to a set of frequency-vs-time spectrograms using the Fourier transform. Unlike healthy hearts, diseased hearts have characteristic frequency components generated by murmurs. You can use a CNN to detect these characteristic frequencies.

CNN model for phonocardiogram detection (Source: Li et al.)

A simple three-layer binary classification CNN like the one shown here is trained to look for characteristic features in a spectrogram’s image and classify the heart as diseased or healthy.

The model uses just three convolutional layers with eight to 16 filters each. Resolution is halved just once, effectively learning features at two zoom levels. The extracted features are passed through a max pooling layer to reduce both the number of parameters and the possibility of overfitting. A fully connected layer then flattens the features to pass them to a softmax classification layer. The small number of layers and small size of images make for a lightweight neural network that can easily run on a cheap smartphone or Raspberry Pi. Its accuracy is not as high as that of a deeper network, but it helps reduce this public health problem in a practical, economical way.
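
To get a feel for how small such a network is, here’s a rough parameter count. It’s only a sketch under assumptions: the 3x3 kernels, the 8-16-16 filter progression, the single-channel input, and the 16-feature input to a two-class output layer are illustrative choices, not values taken from the study.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per filter for a standard 2-D convolution."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output unit for a fully connected layer."""
    return (in_features + 1) * out_features


# Assumed (hypothetical) layout: 1-channel spectrogram -> 8 -> 16 -> 16 filters, 3x3 kernels
layers = [
    conv2d_params(3, 3, 1, 8),
    conv2d_params(3, 3, 8, 16),
    conv2d_params(3, 3, 16, 16),
    dense_params(16, 2),  # assumes pooling leaves one value per filter
]
total_params = sum(layers)
print(layers, total_params)
```

Under these assumptions, the whole model has only a few thousand parameters, which is why a network of this shape fits comfortably on a phone-class device.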

Data Preprocessing

The CNN is trained on the PhysioNet 2016 heart sound recordings database. The data preprocessing involves the following steps:

  • Each recording is broken up into three-second-long overlapping segments to capture complete heart cycles.
  • The segments are passed through a 10-Hz high-pass filter to cut away noise below 10 Hz.
  • Every three-second segment is converted to a spectrogram using a short-time Fourier transform (STFT).
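
The three preprocessing steps can be sketched with NumPy alone. This is a minimal illustration, not the study’s code: the 2,000 Hz sampling rate, the one-second hop between segments, the STFT frame sizes, and the crude FFT-based brick-wall filter (swapped in for whatever high-pass filter design the authors used) are all assumptions.

```python
import numpy as np


def segment(signal, sr, seg_seconds=3.0, hop_seconds=1.0):
    """Split a 1-D recording into overlapping fixed-length segments."""
    seg_len = int(seg_seconds * sr)
    hop = int(hop_seconds * sr)
    starts = range(0, len(signal) - seg_len + 1, hop)
    return np.stack([signal[s:s + seg_len] for s in starts])


def highpass_fft(segments, sr, cutoff_hz=10.0):
    """Zero every frequency bin below the cutoff (a crude brick-wall high-pass)."""
    spectrum = np.fft.rfft(segments, axis=-1)
    freqs = np.fft.rfftfreq(segments.shape[-1], d=1.0 / sr)
    spectrum[..., freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=segments.shape[-1], axis=-1)


def stft_magnitude(segment_1d, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(segment_1d) - frame_len) // hop
    frames = np.stack([segment_1d[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1)).T  # (frequency bins, time frames)


sr = 2000  # assumed sampling rate in Hz
t = np.arange(10 * sr) / sr
# Synthetic stand-in for a PCG recording: 5 Hz "noise" plus a 50 Hz tone
recording = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 50 * t)

segments = highpass_fft(segment(recording, sr), sr)
spectrogram = stft_magnitude(segments[0])
print(segments.shape, spectrogram.shape)
```

Each spectrogram is a small 2-D array, which is what lets an image-oriented CNN consume audio.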

Loss Function and Recall

The heart sound CNN has to overcome two problems worth understanding because they frequently affect other machine learning tasks in health care informatics:

  1. Class imbalance: Health care datasets frequently display class imbalances. If heart sounds are collected from everybody in a clinic, the healthy hearts will outnumber diseased ones. But if sounds are collected only from heart patients, the number of diseased hearts will be greater. Either way, class imbalances can bias the model’s metrics if ignored.
  2. False negatives: If a person has heart issues but the system says they don’t (i.e., a false negative), they may never receive treatment in time. If a person is healthy but the system thinks they’re ailing (i.e., a false positive), other tests will likely correct the mistake. That’s why, in health care settings, it’s better to bias the model for a low false-negative rate even if it results in a high false-positive rate. In other words, high recall is preferable over high precision.
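
The recall-versus-precision trade-off is easy to see on a toy example. The labels, scores, and thresholds below are made up for illustration; lowering the decision threshold catches every diseased case at the cost of more false alarms.

```python
def recall_precision(labels, scores, threshold):
    """labels: 1 = diseased, 0 = healthy; scores: predicted probability of disease."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Hypothetical ground truth and model scores
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.4, 0.55, 0.3, 0.2, 0.45, 0.35]

strict = recall_precision(labels, scores, threshold=0.5)
lenient = recall_precision(labels, scores, threshold=0.3)
print("threshold 0.5:", strict)
print("threshold 0.3:", lenient)
```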

The class imbalance is compensated for by using the focal loss function, a weighted variant of binary cross-entropy loss that also down-weights easy, well-classified examples. The weight is a hyperparameter that’s tuned to minimize loss depending on the level of imbalance in the training set.
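
Here’s a minimal NumPy sketch of binary focal loss in its commonly published form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The default values of alpha and gamma are illustrative assumptions, not the values tuned in the study.

```python
import numpy as np


def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; alpha weights the positive class, gamma damps easy examples."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)              # probability given to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # per-class weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.3, 0.6]
print(focal_loss(y, p), binary_cross_entropy(y, p))
```

With gamma set to 0 and alpha set to 0.5, the function reduces to half of the ordinary binary cross-entropy, which is a handy sanity check when you tune the two hyperparameters.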

You can reduce the false negatives by tuning hyperparameters — like the Fourier transform’s parameters — until the recall is maximized.


The CNN classifies each segment of a recording as healthy or abnormal. A complete recording of the patient’s heart is classified as diseased if the ratio of the number of abnormal segments to the total number of segments is above a chosen threshold.
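
The recording-level decision rule can be sketched in a few lines. The threshold of 0.5 here is a placeholder assumption; the study treats it as a value to be chosen.

```python
def classify_recording(segment_is_abnormal, threshold=0.5):
    """Label a recording from per-segment predictions (True/1 = abnormal segment)."""
    abnormal_ratio = sum(segment_is_abnormal) / len(segment_is_abnormal)
    label = "diseased" if abnormal_ratio > threshold else "healthy"
    return label, abnormal_ratio


# Hypothetical per-segment outputs for two recordings
print(classify_recording([1, 1, 0, 1, 0, 1]))
print(classify_recording([0, 0, 1, 0, 0, 0]))
```

Lowering the threshold makes the recording-level decision more sensitive, which is another lever for favoring recall.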

The research achieved a sensitivity of 87%, specificity of 85%, accuracy of 85%, and mean accuracy (the average of sensitivity and specificity) of 86%.

Clinical Relation Extraction Using Natural Language Processing

An electronic health record (Source: OpenEMR)

Information captured in electronic health records (EHRs) is vital for personalized health care and for detecting trends in public health care. While some information in EHRs is in structured formats, data like patient history may be in free-form text. This unstructured text contains important information like symptoms, diseases, drugs, and the semantic relations between them.

Information extraction from such unstructured text using natural language processing (NLP) is a focus area of machine learning in health care informatics. This approach involves keyword extraction, named entity recognition (NER), and relation extraction (RE) for downstream clinical decision support tasks like:

  • Classifying the data in EHRs
  • Correlating features in medical images with the information in EHRs, enabling automated medical image diagnosis
  • Searching EHRs intelligently using medical search engines
  • Summarizing clinical text
  • Question answering to help clinicians and patients find health information quickly

Let’s understand, in depth, the role NLP plays in health care informatics through a research study that uses state-of-the-art transformer networks for the challenging problem of relation extraction.

A clinical relation is an association between two concepts present in EHR text. For example, in the relation “prednisolone acetate treats postoperative inflammation,” “prednisolone acetate” is a medication, “postoperative inflammation” is a health symptom, and “treats” is the relation.

An NLP system is expected to recognize and extract concepts like medications and symptoms along with the relations between them. What makes clinical relation extraction a challenging problem is that some concepts and relations may be separated by sentences, paragraphs, and even whole sections of text. Processing text to identify such long-range associations mimics how the human brain processes information, and state-of-the-art transformer networks have proven quite good at it.

Transformer Neural Networks for Clinical Relation Extraction

The Transformer architecture (Source: Vaswani et al.)

Transformer networks evolved from the need to overcome the sequential training bottlenecks of recurrent neural networks while retaining their ability to learn sequential and temporal context. Transformer networks don’t suffer the limitations of convolutional or recurrent networks and can learn sequential, temporal, long-range, and global context all in one network architecture. As of 2021, they are the state of the art for both natural language processing and computer vision. Let’s understand some key aspects of transformer networks when used for NLP.

Attention Mechanism

The attention mechanism, a vital concept in transformers, mimics the human brain’s behavior of not treating all words in a sentence as equally important but instead paying attention to only some keywords and phrases.

An attention module assigns an attention score to each word, and it learns how to compute those scores while training on real-world text corpora. The resulting attention weights play a role similar to the receptive fields of convolutional networks.
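
The core computation is the scaled dot-product attention from Vaswani et al. Here’s a minimal NumPy sketch; it assumes the queries, keys, and values are the raw embeddings themselves, whereas a real transformer first passes them through learned projection matrices.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Return the attended values and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens with 8-dimensional embeddings (random stand-ins)
context, weights = scaled_dot_product_attention(x, x, x)
print(weights.round(2))
```

Each row of the weights matrix says how much one token attends to every other token, so a token can draw on distant words as easily as on its neighbors.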

Transformers use multi-head attention consisting of banks of parallel attention modules. Each module learns different aspects of the data; some learn the localized context of the surrounding words, while others learn long-range dependencies (between sentences, paragraphs, and sections). Transformers typically consist of two subnetworks:

  • Encoder: The encoder network takes in input text as word embeddings. It encodes the positions of words as an additional context vector. The embedding and positional vectors are input to the encoder’s multi-head attention layers that output attention vectors for the decoder network.
  • Decoder: The decoder network consists of many decoder units, each with two multi-head attention layers. The first layer, called the masked multi-head attention layer, outputs attention vectors that describe how each term in the output sequence generated so far is related to the terms before it; the mask hides later positions so that a prediction can’t peek at future terms. The second layer, called the encoder-decoder multi-head attention layer, outputs another set of attention vectors that describe how the outputs from the encoder and the masked decoder are related to one another.
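
The masking in the decoder’s first attention layer can be sketched as follows. It’s a single-head illustration under simplifying assumptions: no learned projections, and random vectors standing in for token embeddings.

```python
import numpy as np


def causal_self_attention(x):
    """Self-attention where position i may only attend to positions 0..i."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # hide future positions
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                            # exp(-inf) becomes 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))  # 5 positions, 4-dimensional embeddings
output, attn = causal_self_attention(tokens)
print(attn.round(2))  # upper triangle is all zeros
```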

The final layer is a softmax layer that models the probability distribution for each word in the output vocabulary. By looking for associations between the encoder and decoder vectors, a transformer can encode local, long-range, and global contexts inside the attention vectors.

Let’s look at three popular transformer-based neural networks from the BERT family. They come with pretrained models that can be reused, through fine-tuning, for downstream NLP tasks like classification, named entity recognition (NER), or relation extraction.


BERT

The Bidirectional Encoder Representations from Transformers (BERT) architecture consists of just encoder units that learn a language model from real-world public corpora like websites and news. It outputs pretrained, context-specific vectors for each word. For downstream NLP tasks like NER and relation extraction, suitable decoder networks have to be appended.

BERT undergoes training on large public text corpora using two self-supervised models. They’re self-supervised in the sense that, unlike in supervised learning, there are no manually labeled outputs here; instead, the expected outputs are contained in the same dataset. Its masked language model randomly masks words in the inputs so that BERT learns to predict them, enabling bidirectional context learning. The other training model is next sentence prediction, enabling BERT to encode relationships between two sentences.
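
The masked language model’s input corruption can be sketched in plain Python. The 15% selection rate and the 80/10/10 split between the mask token, a random token, and the unchanged token follow the original BERT recipe; the whitespace tokenizer and the tiny vocabulary are simplifying assumptions.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted inputs, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)                # left unchanged on purpose
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


sentence = "the patient was given prednisolone acetate for postoperative inflammation"
tokens = sentence.split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, seed=1)
print(inputs)
print(labels)
```

Because the labels come from the text itself, no manual annotation is needed, which is what makes the objective self-supervised.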


RoBERTa

The Robustly optimized BERT approach (RoBERTa) improves on BERT’s pretraining methodology. It drops the next sentence prediction stage of the pretraining and dynamically masks different words in each training epoch to improve the model’s generalizability. In addition, RoBERTa is pretrained on more data to improve the language model.


XLNet

XLNet also improves on BERT pretraining, specifically on its static masking of words. XLNet instead opts for a permutation language model in which all words are predicted one at a time in random order, unlike BERT, which predicts only the masked words. This helps XLNet generalize its language model better than BERT.
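
To make the permutation idea concrete, the sketch below samples one random prediction order and lists which words are visible as context at each step. It only illustrates the ordering; it isn’t XLNet’s actual implementation, which realizes the same effect with attention masks.

```python
import random


def permutation_contexts(tokens, seed=None):
    """Sample a prediction order; each step sees only the tokens predicted before it."""
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    rng.shuffle(order)
    steps = []
    for i, pos in enumerate(order):
        visible = sorted(order[:i])  # positions already predicted, in text order
        steps.append((tokens[pos], [tokens[j] for j in visible]))
    return order, steps


tokens = "prednisolone acetate treats postoperative inflammation".split()
order, steps = permutation_contexts(tokens, seed=7)
for target, context in steps:
    print(f"predict {target!r} given {context}")
```

Across many sampled orders, every word ends up being predicted from context on both its left and its right.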

Relation Extraction Using Transformer Models

Transformer network for relation extraction (Source: Yang et al.)

Since all three transformer models are meant for pretrained contextual representation learning, you have to attach and fine-tune suitable decoder networks to handle downstream NLP tasks like relation extraction (RE).

There are many learning techniques for relation extraction. One technique is to model it as a multiclass classification problem: given the concept terms and concept labels (like drug and symptom) as inputs, the RE network should learn to predict their relation. Another technique is to model RE as a binary classification problem that predicts whether a relation exists between two concepts and then follows that prediction with a simple rule-based pipeline to assign the relation type.

Since two concepts and their relation can exist across sentences, a robust model should be tuned for cross-sentence RE too. One approach is to simply ignore sentence distances and predict relations regardless of whether the concepts are in the same sentence or across sentences. An alternate approach is to create different classifiers for each sentence distance and compare their metrics.
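
Either approach starts by enumerating candidate concept pairs along with their sentence distance. Here’s a hedged sketch; the entity dictionaries and their field names are hypothetical, standing in for the output of an upstream NER step.

```python
from itertools import combinations


def candidate_pairs(entities, max_sentence_distance=None):
    """Pair up entities, optionally keeping only pairs within a sentence distance."""
    pairs = []
    for a, b in combinations(entities, 2):
        distance = abs(a["sentence"] - b["sentence"])
        if max_sentence_distance is None or distance <= max_sentence_distance:
            pairs.append((a["text"], b["text"], distance))
    return pairs


# Hypothetical NER output: entity text, label, and index of the containing sentence
entities = [
    {"text": "prednisolone acetate", "label": "Drug", "sentence": 0},
    {"text": "postoperative inflammation", "label": "Problem", "sentence": 2},
    {"text": "eye pain", "label": "Problem", "sentence": 0},
]

all_pairs = candidate_pairs(entities)           # one classifier for every distance
same_sentence = candidate_pairs(entities, 0)    # input for a distance-0 classifier
print(all_pairs)
print(same_sentence)
```

Grouping the pairs by the distance value gives you the per-distance training sets needed for the second approach.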

Another aspect is testing whether a model pretrained on generic text corpora performs better than one pretrained on clinical text corpora.

Yet another aspect in RE, but quite specific to the BERT family, is deciding which embeddings to use as inputs.
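
One common way to expose the two concepts to a BERT-style model is to wrap them in entity marker tokens, so that the embeddings of the classification token and of the markers can feed the relation classifier. The marker names below ([S1], [E1], [S2], [E2]) and the whitespace tokenization are assumptions for illustration.

```python
def add_entity_markers(tokens, span1, span2):
    """Wrap two non-overlapping token spans (start, end-exclusive) in marker tokens."""
    (s1, e1), (s2, e2) = span1, span2
    out = ["[CLS]"]
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("[S1]")
        if i == s2:
            out.append("[S2]")
        out.append(tok)
        if i == e1 - 1:
            out.append("[E1]")
        if i == e2 - 1:
            out.append("[E2]")
    out.append("[SEP]")
    return out


tokens = "prednisolone acetate treats postoperative inflammation".split()
marked = add_entity_markers(tokens, span1=(0, 2), span2=(3, 5))
print(" ".join(marked))
```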


The RE research study found the following:

  • On multiclass prediction, the XLNet-clinical model pretrained on clinical text corpus produced the best F1 score of 89.19%.
  • On binary class prediction of the existence of relation, the RoBERTa-clinical model pretrained on clinical text corpus produced the best F1 score of 89.59%, while XLNet-clinical came third behind RoBERTa-large.
  • On multiclass prediction of cross-sentence relation, the same three models produced the best F1 scores but on different datasets.
  • For selecting the embeddings to use as inputs, they found that supplying everything, including classification token and all entity markers, generally performed the best on all metrics.

Two Key Challenges for Machine Learning in Health Care Informatics

Machine learning in health care informatics faces some persistent challenges from social, business, and regulatory norms. Luckily, reasonable solutions to these challenges may also come from artificial intelligence and machine learning methods in medical information systems.

The Data Availability Problem

A very hard problem to solve is simply the lack of sufficient medical data. For robust training, deep neural models need hundreds to thousands of ground-truth examples labeled by experienced medical specialists. But obtaining so much real-world medical big data involves several barriers:

  • Ill health is an exception: The good news is that most people are healthy or at least don't feel the need to visit clinics regularly for tests and scans. But this means most health datasets tend to be small and imbalanced, often lacking either enough positive data (i.e., data that shows a disorder) or enough negative data (i.e., data from healthy people). This poses challenges for balanced data mining.
  • Medical data is inaccessible: Unlike personal or consumer data, medical datasets come from specialists and special equipment that's only available to a small number of hospitals and laboratories. Most of that data is stored inaccessibly in their private networks.
  • Medical specialists are busy people: Obtaining good ground-truth data isn't quick or cheap even for hospitals. In machine learning workflows, ground-truth images or EHRs have to be marked and labeled by medical specialists whose time is valuable and arguably better spent on reducing immediate patient workload.

To avoid these barriers, a useful approach is data augmentation. Data augmentation helps your neural network models generalize better, as demonstrated by studies. It expands the training set by creating synthetic variants of actual data. Synthetic images are generated using basic image processing techniques like elastic deformations as well as advanced techniques like generative adversarial networks (GANs). Audio and time series are similarly augmented. 
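
For audio and time series, simple augmentations already go a long way. The sketch below applies a random circular time shift, a random gain, and additive Gaussian noise; the parameter ranges are illustrative assumptions and should be checked against what is physiologically plausible for your signal.

```python
import numpy as np


def augment(segment, rng, noise_std=0.01, max_shift=200, gain_range=(0.8, 1.2)):
    """Create one synthetic variant of a 1-D signal segment."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(segment, shift)  # circular shift: samples wrap around
    gain = rng.uniform(*gain_range)
    noise = rng.normal(0.0, noise_std, size=segment.shape)
    return gain * shifted + noise


rng = np.random.default_rng(0)
original = np.sin(np.linspace(0, 20 * np.pi, 6000))  # stand-in for a real segment
variants = [augment(original, rng) for _ in range(4)]
print(len(variants), variants[0].shape)
```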

Another approach is using few-shot and one-shot network models. A one-shot model like the convolutional Siamese neural network attempts to learn a new label from just one example image by determining its distance from images of a known label.
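
At inference time, the one-shot idea reduces to a nearest-neighbor lookup in embedding space. In this sketch the embeddings are hand-written 2-D vectors; in a real Siamese setup they would come from the shared, trained twin network, which isn’t shown here.

```python
import numpy as np


def one_shot_classify(query_embedding, support_embeddings):
    """Pick the label whose single support example is closest to the query."""
    distances = {
        label: float(np.linalg.norm(query_embedding - emb))
        for label, emb in support_embeddings.items()
    }
    return min(distances, key=distances.get), distances


# Hypothetical embeddings: one example per label
support = {
    "healthy": np.array([0.0, 0.0]),
    "murmur": np.array([1.0, 1.0]),
}
label, distances = one_shot_classify(np.array([0.9, 0.8]), support)
print(label, distances)
```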

The Data Privacy Problem

To its credit, health care seems to be one of the few industries where the organizations themselves take data privacy seriously, whether they're companies, hospitals, or labs. Regulations like HIPAA and practices like ethics committee reviews are deeply embedded in their workflows and, by and large, help avoid problems seen in other industries, like data breaches, data brokers, de-anonymization, and misuse of computational intelligence.

Nonetheless, there are opportunities for more fundamental improvements. One of them is privacy-preserving federated machine learning. Traditionally, raw patient data is collected and stored centrally in a hospital's or lab's systems. From that point on, it becomes their data and responsibility. Centralized storage is an implicit assumption in gradient descent and other algorithm implementations of popular ML frameworks like TensorFlow.

Federated ML is a paradigm shift away from this approach. The raw patient data is always kept under the ownership and access control of the patients themselves. Their devices, such as smartphones and laptops, are used to compute and update local models trained only on their data. Just the parameters of these local models, like their network weights, are then transferred to the central server. The central server combines these parameters into a shared model, for example by averaging them, using training algorithms that are adapted to this decentralized setup. The raw patient data never leaves the patients' personal devices.
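
Federated averaging is one widely used way to combine the local models. The sketch below simulates it with a tiny linear model and three synthetic "devices"; the data, learning rate, and number of rounds are all assumptions chosen to keep the example small.

```python
import numpy as np


def local_update(weights, X, y, lr=0.1, epochs=20):
    """Gradient descent on one device's private data (mean squared error)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Server step: average the client weights, weighted by local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (stacked * sizes[:, None]).sum(axis=0) / sizes.sum()


rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (30, 50, 20):  # three devices with different amounts of data
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w))

global_w = np.zeros(2)
for _ in range(10):  # communication rounds
    updates = [local_update(global_w, X, y) for X, y in clients]  # raw data stays local
    global_w = federated_average(updates, [len(y) for _, y in clients])
print(global_w)
```

Only the weight vectors cross the network in this simulation; the server never sees any device's X or y.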

Homomorphic encryption of patient data is an alternative approach that's similar to the current centralized approach but improves its privacy-preserving aspects. It encrypts the patient's raw data in a way that only they can decrypt it, but it still supports computations directly on the encrypted data. The results of those computations are themselves encrypted and, once decrypted by the key holder, match what the same computations would have produced on the raw data, so the server never needs to decrypt the patient's data. You can use projects like OpenMined and libraries like PySyft to implement homomorphically encrypted neural networks.
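
To show the principle, here's a toy version of the Paillier cryptosystem, which supports addition on encrypted integers. It's a teaching sketch only: the primes are far too small to be secure, it handles just additions and multiplication by a plain constant (not full neural network training), and it doesn't use or represent the PySyft API.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy key generation from two small primes (insecure, for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because g = n + 1 is used below
    return (n, n + 1), (lam, mu, n)


def encrypt(public_key, m):
    n, g = public_key
    n_sq = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(private_key, c):
    lam, mu, n = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


def scale_encrypted(public_key, c, k):
    """Raising a ciphertext to a plain integer k multiplies the plaintext by k."""
    n, _ = public_key
    return pow(c, k, n * n)


public_key, private_key = keygen()
c1, c2 = encrypt(public_key, 120), encrypt(public_key, 75)  # e.g., two readings
total = decrypt(private_key, add_encrypted(public_key, c1, c2))
tripled = decrypt(private_key, scale_encrypted(public_key, c1, 3))
print(total, tripled)
```

The server-side functions never touch the private key, which is the property that lets a central system compute on data it cannot read.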

Machine Learning in Health Care Informatics — The Right Tool for the Job

With the industry practically drowning in a glut of health care data every year, you’re sitting on a gold mine of health informatics data that can bring you incomparable business insights and novel business opportunities. Whether you run a hospital, research lab, clinic, or private practice, there is simply no alternative to automating as much of your data mining as possible using the latest machine learning and deep learning approaches.

With deep knowledge of the domain and the technology, we can help you mine your health data effectively and efficiently. Contact us!


References

  • Li, Tao, Yibo Yin, Kainan Ma, Sitao Zhang, and Ming Liu. 2021. "Lightweight End-to-End Neural Network Model for Automatic Heart Sound Classification." Information 12, no. 2: 54. https://doi.org/10.3390/info12020054
  • Xi Yang, Zehao Yu, Yi Guo, Jiang Bian, Yonghui Wu. "Clinical Relation Extraction Using Transformer-based Models". arXiv:2107.08957 [cs.CL]. https://arxiv.org/abs/2107.08957v2. 2021
  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. "Attention Is All You Need". arXiv:1706.03762 [cs.CL]. https://arxiv.org/abs/1706.03762. 2017
  • Liu C, Springer D, Li Q, Moody B, Juan RA, Chorro FJ, Castells F, Roig JM, Silva I, Johnson AE, Syed Z, Schmidt SE, Papadaniil CD, Hadjileontiadis L, Naseri H, Moukadem A, Dieterlen A, Brandt C, Tang H, Samieinasab M, Samieinasab MR, Sameni R, Mark RG, Clifford GD. An open access database for the evaluation of heart sound algorithms. Physiol Meas. 2016 Dec;37(12):2181-2213.
  • Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. E215–e220.
  • OpenEMR