Technical Evaluation of Modern Machine Learning in Medical Imaging

Karthik Shiraly
November 15, 2021

Radiologists and clinicians diagnose conditions from one or more modalities of biomedical images like X-ray radiography, computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), ultrasound, and others.

It's known in healthcare circles that diagnosis based on medical images takes time and complicated cases take longer. You may sometimes need multiple specialists to analyze more than one modality to reach a consensus.

Machine learning in medical imaging can improve decision making and diagnosis time by providing reliable clinical decision support to your busy specialists. Machine learning systems are capable of large-scale analysis and triaging, processing thousands of images in minutes.

Let's look at some approaches, network architectures, and uses of machine learning in medical imaging.

Image Classification for Medical Imaging

Image classification is the task of applying one or more medical labels to an image based on visual characteristics like colors, textures, objects, and shapes.

Notable Network Architectures for Image Classification

ResNet and DenseNet (Image: Liu et al.)

A convolutional neural network (CNN) is the most popular machine learning technique for image classification. A CNN consists of layers of convolutional filters. Each filter behaves like a neuron that lights up when it sees a particular texture or shape or other image feature but remains inactive otherwise.
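
To make the idea of a filter that "lights up" concrete, here is a minimal sketch in plain Python. The 4x4 image and the hand-picked edge filter are illustrative assumptions. In a real CNN the filter weights are learned during training, and a framework handles the convolution.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN layers apply."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical 4x4 image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = relu(convolve2d(image, vertical_edge))
for row in feature_map:
    print(row)
```

The filter responds only in the middle column, where the dark-to-bright edge sits, and stays at zero over the flat regions.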

Training a CNN on images of a particular class, say heart X-rays, adjusts the convolutional weights of these filter neurons so that they activate strongly whenever visual features unique to heart X-rays are shown. When a heart X-ray is shown, a large number of these filters switch on, and their output values indicate that the image is a heart X-ray. This is the basic working of activation and pattern recognition in CNNs.

The outputs from convolutional layers are called feature maps and generating them for an image is called feature extraction. An image's feature maps are important inputs to every biomedical imaging task including classification, object detection, and segmentation.

Plain CNNs that consist of multiple convolutional layers suffer from several problems like vanishing gradients and poor generalizability. To address them, better architectures have been developed.

ResNet is one such popular architecture that uses skip connections between convolutional layers so that inputs are not just from the previous layer but also the layer before that. This helps preserve gradients in very deep networks. ResNet is a good choice when you have big data sets for training.

DenseNet is another popular architecture that connects each convolutional layer in a dense block to every subsequent layer in that block. A set of visual features detected in an earlier layer influences all subsequent layers not just indirectly (which is true for plain CNNs too) but also directly. It performs better by reusing features instead of recalculating them.
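
The two connection patterns can be contrasted with a small sketch. The functions `halve` and `mean_feature` are hypothetical stand-ins for convolutional layers, and features are plain lists of numbers. The point is only that a residual block adds its input to its output, while a dense block concatenates every earlier output.

```python
def residual_block(x, transform):
    """ResNet-style skip connection: output = transform(x) + x."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]


def dense_block(x, new_feature_fn, num_layers):
    """DenseNet-style block: each layer receives the block input plus the
    outputs of all earlier layers in the block, concatenated together."""
    features = list(x)
    input_widths = []
    for _ in range(num_layers):
        input_widths.append(len(features))               # what this layer sees
        features = features + new_feature_fn(features)   # concatenate, not add
    return features, input_widths


def halve(x):
    """Hypothetical stand-in for a layer that keeps the feature count."""
    return [0.5 * v for v in x]


def mean_feature(features):
    """Hypothetical stand-in for a layer that emits one new feature."""
    return [sum(features) / len(features)]


res_out = residual_block([2.0, 4.0], halve)
dense_out, input_widths = dense_block([2.0, 4.0], mean_feature, num_layers=3)

print(res_out)        # input added back onto the transformed output
print(dense_out)      # original features are still present, unchanged
print(input_widths)   # each layer sees one more feature than the last
```

The growing input widths show the feature reuse: later layers read the earlier features directly instead of recomputing them.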

Recently, state-of-the-art vision transformer architectures have been favored over CNNs because they can handle global features and long-range dependencies better.

Let's now explore some uses of image classification in clinical practice.

Image Classification for Intracranial Hemorrhage Classification

Intracranial hemorrhage (Image: Ye et al.)

Intracranial hemorrhage is any kind of bleeding inside the cranial space that protects the brain, often due to accidents or violence. There are five types, and they are frequently detected in emergency wards from non-contrast computed tomography (CT) scans of the head. As CT scanners are found in most emergency wards, detecting hemorrhages automatically using artificial intelligence can reduce triage time.

Network Architectures for 3D CT Scans

The main difficulty in classifying CT scans is that the data is 3D. One solution is to use regular CNNs with 3D convolutions to extract features.

The extracted feature vectors are sent to a fully connected softmax layer for classifying the condition. Multi-class classification identifies one of the five types of hemorrhages. Some architectures opt for five parallel layers, each doing binary classification for one of the five types.
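
The difference between a single softmax layer and five parallel binary layers can be sketched as follows. The class names and logit values are placeholders, not taken from any study. Softmax forces exactly one winner, while independent sigmoid outputs can flag several types at once.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single binary decision."""
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder names for the five hemorrhage types.
CLASSES = ["subtype_1", "subtype_2", "subtype_3", "subtype_4", "subtype_5"]

# Hypothetical raw scores produced by the final layer for one scan.
logits = [2.0, -1.0, 0.5, -3.0, 1.5]

# Option 1: one softmax layer, exactly one type is chosen.
probs = softmax(logits)
multi_class_prediction = CLASSES[probs.index(max(probs))]

# Option 2: five parallel binary outputs, each type is decided on its own.
binary_probs = [sigmoid(z) for z in logits]
multi_label_prediction = [c for c, p in zip(CLASSES, binary_probs) if p >= 0.5]

print(multi_class_prediction)
print(multi_label_prediction)
```

The 0.5 cut-off used for the binary outputs is an assumption. In practice it would be tuned per type on validation data.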

Another approach is slicing the 3D data into 2D images and using regular CNNs. But then the 3D spatial context of the condition is lost when the data is sliced. Can this context be retrieved somehow? That's exactly what a study tried by combining a CNN with a recurrent neural network (RNN). The CNN encodes spatial characteristics of each image. If the slices are supplied in the correct sequence, the RNN can encode visual characteristics spread across the sequence of slices.


Studies using these approaches reported sensitivity and specificity comparable to those of experienced radiologists and sometimes better than those of less experienced radiologists. For intracranial hemorrhage, the best model scored 0.99 on sensitivity, close to the 1.0 of a senior radiologist and better than the average 0.94 of three junior radiologists. Results like these suggest that such systems can support real-world emergency settings where triage time is critical.

Whole Image Classification for Bone Fracture

Bone fracture detection (Image: Ma et al.)

Deep learning's ability to detect bone fractures in large gross anatomical features like hips is impressive. These are whole image classification tasks using convolutional neural networks and since most new ideas in deep vision are first implemented for classification, these tasks benefit whenever a better CNN architecture comes out.

Image classification has already helped find hip fractures. These are usually diagnosed from frontal pelvic X-ray radiographs. However, to avoid a misdiagnosis, patients are advised to get additional scans, which increases costs, delays treatments, and is impractical in remote areas without radiology facilities.

A deep learning system that can accurately detect hip fractures can solve these problems. Since such systems are more attentive to small visual details that people can miss, they will hopefully perform as well as or better than human experts, acting as decision support.

One study used DenseNet CNNs for hip fracture classification. Their training used augmentation operations like small translations, rotations, and shearing to expand the training set. They also preprocessed the images using histogram equalization.
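
Histogram equalization, the preprocessing step mentioned above, can be sketched in plain Python for a small grayscale image. This is a generic textbook version and not the study's code. The 3x3 image is made up, and a real pipeline would more likely call an image processing library.

```python
def equalize_histogram(image, levels=256):
    """Spread the intensities of a non-empty grayscale image (rows of ints in
    [0, levels)) across the full range using the cumulative histogram."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image, nothing to stretch
        return [row[:] for row in image]

    # Lookup table mapping each old intensity to its equalized value.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Hypothetical low-contrast patch: all values squeezed between 100 and 120.
low_contrast = [
    [100, 100, 105],
    [105, 110, 110],
    [115, 115, 120],
]

equalized = equalize_histogram(low_contrast)
for row in equalized:
    print(row)
```

After equalization the darkest pixel maps to 0 and the brightest to 255, so subtle intensity differences become easier for a network to pick up.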

The study reported that:

  • Using small CNN models on small datasets is effective and requires only modest time commitments from busy medical experts.
  • Instead of labeling the entire dataset, incremental labeling is better. After a run, only the false positive and false negative images were reviewed by experts to decide which edge cases to train for next time. This improved accuracy from 95% to 99.9% while hand-labeling only 7.4% of the dataset.

Object Detection for Medical Imaging

Object detection is the task of locating one or more objects, belonging to one or more classes, in an image and calculating their bounding boxes.

Convolutional neural networks are preferred for object detection in the medical field too. An object detector can use any CNN architecture for feature extraction and appends classification and regression layers to predict object classes and bounding box coordinates. More recently, vision transformers have been tried too.

Some popular CNN-based detection architectures are:

  • R-CNN family of architectures with Faster R-CNN being popular
  • YOLO family of architectures with YOLOv4 being the best performer
  • Single-shot detector (SSD) family

YOLO and SSD are single stage detector architectures that classify and locate multiple objects using a single network. In contrast, R-CNN family architectures contain two sub-networks — one comes up with region proposals that possibly contain objects while the other classifies them and predicts their locations.

Faster R-CNN Network Architecture

Faster R-CNN (Image: Ren et al.)

You can choose the proven and mature Faster R-CNN for most of your medical detection tasks. It's fast enough to output results in real-time, giving you a more efficient workflow.

It first extracts features using a backbone CNN that you specify. Then a sub-network called the region proposal network (RPN) examines the extracted features and tells the main network where to look for objects by proposing region rectangles with their confidence scores. The RPN is also a CNN that’s fully convolutional without any dense layers and shares layers with the main R-CNN network.

RPN's region proposals are routed through a region of interest (RoI) pooling layer to reshape them before passing to a fully connected layer that predicts the class and coordinates.
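
The reshaping done by RoI pooling can be illustrated with a simplified sketch. It max-pools an arbitrary rectangle of a feature map into a fixed grid, so that proposals of any size produce an input of the same shape for the fully connected layer. The feature map values, the RoI coordinates, and the 2x2 output size are all assumptions chosen for illustration.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (top, left, bottom, right), end-exclusive,
    into a fixed out_h x out_w grid."""
    top, left, bottom, right = roi
    h, w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        r0 = top + (i * h) // out_h
        r1 = top + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 4x6 feature map with values 1..24.
feature_map = [[6 * r + c + 1 for c in range(6)] for r in range(4)]

# Two proposals of different sizes both come out as 2x2 grids.
whole_map = roi_max_pool(feature_map, (0, 0, 4, 6), 2, 2)
small_box = roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2)
print(whole_map)
print(small_box)
```

When a region does not divide evenly, neighboring bins overlap slightly in this sketch. Production implementations differ in how they handle such rounding.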

Inception-ResNet, which combines ResNet-style residual connections with Inception modules, is a great choice for the backbone CNN. Residual connections allow for very deep networks without running into the vanishing gradient problem. Inception aims for a computationally lighter network than a regular CNN by using 1x1 convolutions to reduce the number of parameters. Their combination gives you a very deep, computationally light network to calculate feature maps.
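
A quick parameter count shows why 1x1 convolutions make a network lighter. The channel counts below (256 reduced to 64) are illustrative assumptions and are not taken from the actual Inception-ResNet configuration. Bias terms are ignored.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


# A 3x3 convolution applied directly to 256 input channels.
direct = conv_params(3, 256, 256)

# The same 3x3 convolution after a 1x1 convolution first squeezes
# the 256 channels down to 64.
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 256)

print(direct)      # 589824
print(bottleneck)  # 163840
print(round(direct / bottleneck, 1))  # 3.6
```

In this example the 1x1 reduction cuts the weight count to under a third while the layer still outputs 256 channels.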

Detecting Fractures and Other Musculoskeletal Injuries

Wrist fracture detection (Image: Gan et al.)

In busy emergency wards, surgeons and radiologists may focus more on trauma injuries and therefore miss fractures. Artificial intelligence systems that can quickly detect possible fractures in X-ray radiographs can be a big help in such high-pressure environments.

One approach is to use an object detection network to detect the fractures directly. However, because such networks generally operate at lower resolutions, they are more suitable for detecting large objects than inconspicuous fractures hiding in a large image.

A better approach is to use detection to first locate musculoskeletal parts of interest — such as wrists — and then pass those small regions to a second fracture classification network. Since this second network examines only small areas and not the full image, its accuracy will be better.
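
The two-stage idea can be sketched as a small pipeline. The names `toy_detector` and `toy_classifier` are hypothetical stubs standing in for the trained detection and classification networks. Only the control flow of detecting, cropping, and classifying is the point here.

```python
def crop(image, box):
    """Cut out box = (x_min, y_min, x_max, y_max), end-exclusive."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def detect_then_classify(image, detector, classifier):
    """Stage 1 proposes regions of interest. Stage 2 classifies each crop."""
    results = []
    for box in detector(image):
        region = crop(image, box)
        results.append((box, classifier(region)))
    return results


def toy_detector(image):
    """Stub: a real detector would return the boxes it finds in the image."""
    return [(1, 1, 3, 3)]


def toy_classifier(region):
    """Stub: a real classifier would be a trained network, not a threshold."""
    count = len(region) * len(region[0])
    mean = sum(sum(row) for row in region) / count
    return "suspicious" if mean > 0.5 else "normal"


# Hypothetical 4x4 image with a bright 2x2 patch in the middle.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

results = detect_then_classify(image, toy_detector, toy_classifier)
print(results)
```

Because the second stage only ever sees the cropped region, it can run at a higher effective resolution than a detector working on the full radiograph.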

One such study used a Faster R-CNN network and a second Inception-v4 classification network to detect distal radius fractures in wrist X-ray radiographs.

Network Architecture

We’ve already gone over the Faster R-CNN architecture. Let’s look at the Inception-v4 classification network used here.

Inception architectures address the vanishing gradient issues of very deep neural networks using inception blocks, which consist of a large number of convolutional filters stacked not vertically but horizontally. So Inception gets the representational power of a very deep network by going wide instead of deep, with fewer network parameters.

Data Preparation

The imaging data is manually labeled by experienced orthopedists using labeling tools like LabelImg.

Synthetic images are generated using augmentation operations like horizontal flipping, random translations, rotations, shearing, and scaling, all within fixed limits. If there’s a chance that the images can come from different X-ray machines, you should use normalization techniques like histogram equalization too.
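
Two of these augmentation operations, flipping and translation within fixed limits, can be sketched in plain Python. The `max_shift` limit, the 50% flip probability, and the tiny image are assumptions. A real pipeline would typically use an augmentation library and would also transform the bounding box labels.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def translate(image, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels, padding exposed areas with fill."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out


def augment(image, rng, max_shift=1):
    """Create one synthetic variant using random operations within limits."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    result = translate(image, dx, dy)
    if rng.random() < 0.5:
        result = horizontal_flip(result)
    return result


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic = [augment(image, rng) for _ in range(4)]
for variant in synthetic:
    print(variant)
```

Keeping the shifts small matters for radiographs, since a large translation could push the anatomy of interest out of the frame.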


If you have large datasets after augmentation (thousands of images), you can train the network from scratch. But if you have just a few dozen or hundreds, then you should use transfer learning methods where a pre-trained Faster R-CNN model is fine-tuned by unfreezing its final layers and retraining them on your X-ray training data. Use best practices like keeping test and validation data subsets apart. The fracture classifying Inception model is trained the same way. 


The study compared this machine learning method’s performance with that of experienced orthopedists and radiologists using metrics like accuracy, sensitivity, specificity, and Youden index. Amazingly, they found that the system outperformed radiologists and performed on par with orthopedists.
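
These metrics all follow from the four confusion matrix counts. The sketch below uses made-up counts, not the study's results, to show how each is computed.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # share of actual fractures that were found
    specificity = tn / (tn + fp)   # share of normal cases correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    youden_index = sensitivity + specificity - 1
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_index": youden_index,
    }


# Hypothetical counts: 100 fracture cases and 100 normal cases.
metrics = diagnostic_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The Youden index ranges from 0 for a test no better than chance to 1 for a perfect test, which makes it a convenient single number for comparing a model against human readers.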

Object Detection for Dental X-Rays

Teeth detection (Image: Chen et al.)

Object detection can automate routine analysis of dental periapical and bitewing X-ray radiographs such as:

  • Detecting teeth
  • Identifying teeth using the ISO-3950 numbering system, a prerequisite for comparing teeth in dental forensics
  • Detecting dental cavities (or caries)
  • Detecting treatments like implants, tooth restorations, or endodontic treatments

You can opt for a pre-trained Faster R-CNN machine learning model fine-tuned for these tasks using transfer learning.

Data Pre-Processing

Dental radiographs are high-resolution images that can be safely downscaled without reducing detection accuracy. However, different X-ray machines produce images with different contrasts, which affects accuracy. To handle that, you should normalize contrasts using image processing techniques like contrast-limited adaptive histogram equalization (CLAHE), which equalizes contrast in local regions while limiting noise amplification.


A Faster R-CNN generalizes better with more data. You should augment training images with additional images using operations like horizontal and vertical flipping, adding random noise, and making random contrast modifications.

As radiograph datasets tend to be small, transfer learning is the best approach to train such a deep architecture. Start with a pre-trained model like the Faster R-CNN Inception ResNet V2 that's trained on the COCO dataset. Unfreeze only its final layers and retrain it on your teeth dataset to fine-tune it for dental features. Transfer learning performs well because textures and shapes have already been learnt by the pre-trained model.


Use standard object detection metrics like mean average precision (mAP) and intersection over union (IoU) to evaluate your fine-tuned model. Similar models have reported mean IoU, precision, and recall of 90% and above.
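
Intersection over union is simple to compute for axis-aligned boxes. The sketch below assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples. The example coordinates are made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 2, 2)
prediction = (1, 1, 3, 3)
print(round(iou(ground_truth, prediction), 3))  # 0.143
```

A predicted tooth box is usually counted as correct only when its IoU with the ground truth box exceeds a chosen threshold. A value of 0.5 is a common choice for that threshold.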

Image Segmentation for Medical Imaging

Image segmentation is a frequently used computer vision task in medical image analysis. It involves isolating regions of medical interest in natural tissues. It's used in every medical field with every modality — breast cancer and lung cancer detection, Alzheimer's disease classification, and nerve detection are just a few examples. 

Since regions have irregular shapes, segmentation has to classify — i.e., assign a class label for — every pixel in the image. For example, an oncology MR image can contain regions of healthy tissue, benign lesions, and malignant tumors.

Let's explore two popular segmentation neural networks — U-Net and FC-DenseNet.

U-Net Segmentation Network

U-Net segmentation network (Image: Liu et al.)

U-Net is a popular deep CNN architecture developed by a medical research team for medical image segmentation. Its name comes from depicting its architecture in the shape of a “U,” consisting of:

  • Feature encoding and downscaling layers in its left leg
  • Decoding and upscaling layers in its right leg
  • Inputs to each upscaling layer from the encoding layer at the same resolution

Since U-Net is a fully convolutional network (FCN) with no dense layers at all, it can accept images of any size. The only purpose of the encoding layers on the left is feature extraction at every resolution to pass to corresponding upscaling layers on the right.

The deconvolution layers on the right iteratively upscale pixel masks by deconvolving features from the previous layer with features from its corresponding encoding layer. The result is a pixel mask that's the same size as the input image.

During training, RGB images are input as rank-4 tensors. The ground truths are segmentation maps for each image where each pixel is labeled with a numeric class index. Since a segment map contains multiple regions, this is a typical multi-class classification at the pixel level and hence uses cross-entropy as the loss function.

However, because you're classifying a large number of pixels, you need to optimize at the aggregate level too so that most pixels match their ground truth labels. For this, the Dice coefficient for set similarity is included in the loss function along with cross-entropy.
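
A combined loss of this kind can be sketched as follows. Here `probs` holds one list of class probabilities per pixel and `labels` holds the ground truth class index for each pixel. The equal weighting of the two terms (`dice_weight=1.0`) and the soft Dice formulation are assumptions. Implementations vary in both.

```python
import math


def cross_entropy(probs, labels, eps=1e-7):
    """Mean per-pixel cross-entropy."""
    total = sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels))
    return -total / len(labels)


def soft_dice(probs, labels, num_classes, eps=1e-7):
    """Soft Dice coefficient averaged over classes (1.0 is a perfect match)."""
    scores = []
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_sum = sum(p[c] for p in probs)
        true_sum = sum(1 for y in labels if y == c)
        scores.append((2 * intersection + eps) / (pred_sum + true_sum + eps))
    return sum(scores) / num_classes


def combined_loss(probs, labels, num_classes, dice_weight=1.0):
    """Cross-entropy plus a Dice term that rewards overlap at the region level."""
    dice_term = 1.0 - soft_dice(probs, labels, num_classes)
    return cross_entropy(probs, labels) + dice_weight * dice_term


# Three pixels, two classes. Labels are ground truth class indices.
labels = [0, 1, 0]
perfect = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
uncertain = [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]]

loss_perfect = combined_loss(perfect, labels, num_classes=2)
loss_uncertain = combined_loss(uncertain, labels, num_classes=2)
print(loss_perfect < loss_uncertain)  # True
```

The cross-entropy term pushes each pixel toward its own label, while the Dice term penalizes poor overlap of whole regions, which helps when a class covers only a few pixels.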

The effectiveness of segmentation models is evaluated using the Jaccard similarity score between ground truth regions and predicted regions.
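
For a single class, the Jaccard score is the overlap between the predicted and ground truth masks divided by their combined area. The sketch below works on binary masks given as nested lists. The masks are made up.

```python
def jaccard(pred_mask, true_mask):
    """Jaccard similarity of two binary masks of the same shape."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree completely.
    return intersection / union if union else 1.0


true_mask = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
pred_mask = [
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
]

print(jaccard(pred_mask, true_mask))  # 0.6
```

For a multi-class segmentation map, the same score can be computed per class and averaged.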

FC-DenseNet Segmentation Network

FC-DenseNet is another segmentation network that uses DenseNet as a feature extractor. The main intuition behind DenseNet is that directly connecting every layer to every other layer makes the network easier to train and lighter with fewer parameters.

Like U-Net, FC-DenseNet is also a fully convolutional U-shaped architecture with a downscaling path and an upscaling path consisting of dense blocks. Each dense block is a set of convolutional layers where each layer is connected to every other layer.

In the downscaling path, each dense block's input and output feature maps are concatenated. Thus there is a linear growth as well as reuse of feature maps as one moves down. However, in the upsampling path, it's not a good idea to expand the feature maps while the spatial resolution is also expanding. If that happens, the final softmax layer has to contend with an intractable number of features.

But you still want to reuse already calculated feature maps. So in the upscaling path, only the last dense block's feature maps are input to the deconvolution layer. Since full feature maps were already calculated in the downsampling path, they are supplied from the corresponding encoding layer to the deconvolution layer through skip connections. This is where it differs from U-Net which uses multiple deconvolution layers and combines all feature maps at every layer.

Cardiomegaly Detection

Cardiomegaly (Image: Que et al.)

Cardiomegaly is an enlarged heart condition that often indicates a more serious cardiovascular disease. Since chest X-ray radiographs are easily available, automated flagging of possible cardiomegaly in chest X-ray radiographs can save triaging time for medical personnel.

One indicator of cardiomegaly is if the cardiothoracic ratio (CTR) — the ratio of heart width to lung width — is above 0.5 instead of being in the normal range of 0.39-0.5 with an average of 0.45.

Cardiomegaly can be detected by segmenting the heart and thoracic cavities and measuring the CTR. One study did this using both U-Net and FC-DenseNet and compared their results. They found that while both performed well, U-Net showed better accuracy and precision while FC-DenseNet showed better recall. In other words, FC-DenseNet produced fewer false negatives, where people who had cardiomegaly were labeled as not having it. As cardiomegaly is an indicator of underlying disease, missing it when it's present is the costlier error. So FC-DenseNet is the safer network from a healthcare point of view.
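
Once the segmentation masks are available, the CTR measurement itself is simple. The sketch below makes a simplifying assumption: it takes each width as the overall horizontal extent of a binary mask. The masks and function names are illustrative and are not taken from the study.

```python
def horizontal_extent(mask):
    """Width in pixels between the leftmost and rightmost foreground columns."""
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return max(cols) - min(cols) + 1 if cols else 0


def cardiothoracic_ratio(heart_mask, thorax_mask):
    """Heart width divided by thoracic width, both taken from binary masks."""
    thorax_width = horizontal_extent(thorax_mask)
    if thorax_width == 0:
        raise ValueError("thorax mask is empty")
    return horizontal_extent(heart_mask) / thorax_width


def flag_cardiomegaly(ctr, threshold=0.5):
    """Flag a scan for review when the CTR exceeds the threshold."""
    return ctr > threshold


# Hypothetical masks: thorax spans 10 columns, heart spans 6 of them.
thorax_mask = [[1] * 10 for _ in range(3)]
heart_mask = [[0, 0, 1, 1, 1, 1, 1, 1, 0, 0] for _ in range(3)]

ctr = cardiothoracic_ratio(heart_mask, thorax_mask)
print(ctr, flag_cardiomegaly(ctr))  # 0.6 True
```

Since a missed case is the costlier error, a deployment might choose a threshold slightly below 0.5 to favor recall, at the price of more false alarms.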

Since chest radiographs are likely to be a small dataset, you should use data augmentation techniques like slight rotations, shearing, shifting, and zooming to expand the training set with synthetic images. Additionally, since these are soft tissues, you can use elastic deformations to further expand the training set and help your network generalize better.

Stroke Analysis With Segmentation

A stroke lesion is a region of the brain where brain cells are dead due to lack of sufficient blood flow and can cause death or permanent disability. Neurologists detect stroke lesions from 3D magnetic resonance images (MRI) of the brain. MR images can be obtained through multiple modalities:

  • Anatomical scans like T1 contrast and T2
  • Diffusion scans like DWI
  • Perfusion scans like CBF, CBV, TTP, and Tmax

Often, lesions show up in one or more of these modalities. You can use a volumetric segmentation architecture like 3D U-Net to automatically detect lesions in such multimodal MRIs. Analyzing the 3D voxels directly avoids the loss of spatial context that occurs when analyzing them as 2D slices.

3D U-Net is just a 3D version of normal U-Net that accepts 3D volumes as inputs and uses 3D convolutions and pooling. For multimodal MRI data, an input set of 3D MRIs is a rank-5 tensor that’s passed through 3D convolutional layers to extract features.

One problem you'll face is a class imbalance in the data because most areas across all images will be healthy tissue while only a small set will be damaged lesions. This can be solved by using a dynamically weighted loss function like focal loss so that the network is less biased towards confident classifications and more biased towards misclassified examples.
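
The effect of focal loss can be seen by comparing it with plain cross-entropy for one easy and one hard example. The sketch below is a per-voxel focal loss in its common form. The value `gamma=2.0` is a frequently used setting and is an assumption here, not the study's configuration.

```python
import math


def focal_loss(prob_true_class, gamma=2.0, alpha=1.0, eps=1e-7):
    """Focal loss for one voxel, given the predicted probability of its
    true class. With gamma=0 this reduces to plain cross-entropy."""
    p = min(max(prob_true_class, eps), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


easy = 0.9   # voxel the network already classifies confidently
hard = 0.1   # voxel the network currently gets wrong

ce_easy, ce_hard = focal_loss(easy, gamma=0.0), focal_loss(hard, gamma=0.0)
fl_easy, fl_hard = focal_loss(easy), focal_loss(hard)

# How much more the hard voxel weighs than the easy one under each loss.
ce_ratio = ce_hard / ce_easy
fl_ratio = fl_hard / fl_easy
print(round(ce_ratio), round(fl_ratio))
```

With cross-entropy the hard voxel counts about 22 times as much as the easy one. With focal loss the factor is in the thousands, so the many easy healthy-tissue voxels no longer dominate training.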

One study used this architecture and loss function on multimodal MRI data and reported Dice similarity as high as 0.84 while scoring high on other metrics like sensitivity and positive predictive value (PPV).

Now's the Right Time to Adopt Machine Learning in Medical Imaging

Machine learning in medical imaging is becoming smarter every day, offering you several opportunities to improve operational efficiency in your healthcare company, hospital, or laboratory. Contact us to learn how you can benefit!


  • Xiaoqing Liu, Kunlun Gao, Bo Liu, Chengwei Pan, Kongming Liang, Lifeng Yan, Jiechao Ma, Fujin He, Shu Zhang, Siyuan Pan, Yizhou Yu, "Advances in Deep Learning-Based Medical Image Analysis," Health Data Science, vol. 2021, Article ID 8786793, 14 pages, 2021. https://doi.org/10.34133/2021/8786793
  • Ye, H., Gao, F., Yin, Y. et al., “Precise diagnosis of intracranial hemorrhage and subtypes using a three-dimensional joint convolutional and recurrent neural network.” Eur Radiol 29, 6191-6201, 2019. https://doi.org/10.1007/s00330-019-06163-2
  • Jakub Olczak, Niklas Fahlberg, Atsuto Maki, Ali Sharif Razavian, Anthony Jilert, André Stark, Olof Sköldenberg & Max Gordon, “Artificial intelligence for analyzing orthopedic trauma radiographs.” Acta Orthopaedica, 88:6, 581-586, 2017. DOI: 10.1080/17453674.2017.1344459
  • Kaifeng Gan, Dingli Xu, Yimu Lin, Yandong Shen, Ting Zhang, Keqi Hu, Ke Zhou, Mingguang Bi, Lingxiao Pan, Wei Wu, Yunpeng Liu, “Artificial intelligence detection of distal radius fractures: a comparison between the convolutional neural network and professional assessments.” Acta Orthopaedica, 90:4, 394-400, 2019. DOI: 10.1080/17453674.2019.1600125
  • Chen, H., Zhang, K., Lyu, P. et al., “A deep learning approach to automatic teeth detection and numbering based on object detection in dental periapical films.” Sci Rep 9, 3840, 2019. https://doi.org/10.1038/s41598-019-40414-y
  • Que Q, Tang Z, Wang R, Zeng Z, Wang J, Chua M, Gee TS, Yang X, Veeravalli B, “CardioXNet: Automated Detection for Cardiomegaly Based on Deep Learning.” Annu Int Conf IEEE Eng Med Biol Soc., 612-615, 2019. DOI: 10.1109/EMBC.2018.8512374. PMID: 30440471.
  • Yangling Ma, Yixin Luo, “Bone fracture detection through the two-stage system of Crack-Sensitive Convolutional Neural Network.” Informatics in Medicine Unlocked, Volume 22, 2021, 100452, ISSN 2352-9148, https://doi.org/10.1016/j.imu.2020.100452.