
Enhancing Text-to-Image Alignment in Chest X-ray Domain: MedXChat Development and Metric Innovations

  • Writer: Kyuta Yasuda
  • Jan 23
  • 5 min read

Updated: Jan 23

This project aimed to improve the integration of AI in medical imaging, specifically in chest X-ray (CXR) diagnostics, by evaluating and enhancing MedXChat, a text-to-image generative model. The work addresses key challenges in evaluating the quality and usability of AI-generated CXRs.


Ultimately, the aim of this project was to provide medical trainees and students with visual examples of conditions for which real X-ray images may not be readily accessible, helping them familiarise themselves with those conditions.

MedXChat System Architecture


Dataset 


Chest X-rays Dataset 


The Chest X-rays (Indiana University) dataset is a widely used resource in medical imaging and artificial intelligence, curated specifically to support research and development (OpenI). It consists of thousands of de-identified chest X-ray images accompanied by their corresponding radiology reports, which document conditions such as lung disease, pneumonia, heart failure, and normal findings.

Two particularly significant features of the dataset are the MeSH terms (Medical Subject Headings) and the Findings column in the reports. The MeSH terms provide standardized medical indexing, offering a consistent reference system for diseases, conditions, and anatomical information. These terms play a vital role in organizing and retrieving medical data, especially when training machine learning models that rely on accurate medical terminology for consistency. This field is used to determine how similar two medical reports are: the Jaccard distance over the MeSH terms reflects how many terms two reports share, and the greater the overlap, the more similar the reports are. The Findings column contains detailed diagnostic observations made by radiologists, such as abnormalities or normal findings identified in the X-ray images, and is the primary text input for the metric evaluation. In cases where the Findings feature is unavailable, the Impressions feature, which provides a high-level summary of the findings, is used as a substitute.

For the purposes of this research, two datasets are derived from the Chest X-rays dataset, each serving a distinct role: one for validating the proposed metric and another for evaluating both the MedXChat model and the metric in a practical, real-world scenario.
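As an illustration of the MeSH-based comparison, the sketch below computes the Jaccard similarity between the term sets of two reports. The semicolon-separated parsing of the MeSH field is an assumption about how the terms are stored in the OpenI export and may need adjusting.

    def mesh_terms(mesh_field):
        """Split a raw MeSH string into a set of lower-cased terms (assumes ';'-separated terms)."""
        return {term.strip().lower() for term in mesh_field.split(";") if term.strip()}

    def jaccard_similarity(mesh_a, mesh_b):
        """|A ∩ B| / |A ∪ B| over two reports' MeSH term sets; 1.0 means identical term sets."""
        a, b = mesh_terms(mesh_a), mesh_terms(mesh_b)
        if not a and not b:
            return 1.0  # treat two empty term lists as identical
        return len(a & b) / len(a | b)

    # Example: one shared term out of three distinct terms -> 0.33
    print(jaccard_similarity("Cardiomegaly; Pleural Effusion", "Cardiomegaly; Granuloma"))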




Evaluation Metric


The evaluation metric was created with both the CLIP and MedCLIP models. The standard CLIP model, which serves as the text encoder in Stable Diffusion, is designed for general-domain data, whereas MedCLIP was fine-tuned to perform better on medical images, especially chest X-rays. The metrics derived from the two models were compared and showed similar performance, with a slight advantage to MedCLIP.


The evaluation metrics were created in the same way from both models:

  1. Convert the images and texts into embedding vectors that the models can process.

  2. Compare the embeddings to compute their difference, based on cosine similarity (see the sketch after this list).
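
A minimal sketch of such a metric is shown below, assuming the Hugging Face transformers CLIP interface (MedCLIP exposes analogous image and text encoders). The checkpoint name and file path are placeholders, and taking the reported "Difference" as 1 minus the cosine similarity is an assumption about the exact convention.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Placeholder checkpoint; swap in MedCLIP's encoders for the medical variant of the metric
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("generated_cxr.png").convert("RGB")   # hypothetical file name
    report = "The lungs are normally inflated and clear."

    # Step 1: convert the image and the text into embedding vectors
    inputs = processor(text=[report], images=image, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # Step 2: compare the embeddings with cosine similarity
    img = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    similarity = (img @ txt.T).item()
    difference = 1.0 - similarity   # assumed convention for the "Difference" values below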


The images below are examples of pairs with different similarity scores.


Difference: 0.7453

Original Filename: 79 IM-0583-3003.dcm.png

Text Report: The cardiomediastinal silhouette and vasculature are within normal limits for size and contour. The lungs are normally inflated and clear. Osseous structures are within normal limits for patient age. 



Worst Scoring Image Pair

Difference: 0.172321588

Original Filename: 98 IM-2467-1001.dcm.png

Text Report: Frontal and lateral views of the chest show an unchanged cardiomediastinal silhouette. Reduced lung volumes with basilar atelectasis. No XXXX focal airspace consolidation or pleural effusion. 


Average Scoring Image Pair

Difference: 0.0245234

Original Filename: 51 IM-2125-1001.dcm.png

Text Report: Heart size is normal and cardiomediastinal silhouette is normal. There are scattered calcified granulomas throughout both lung XXXX. Lungs are clear bilaterally otherwise. No bony or soft tissue abnormalities. 


Best Scoring Image Pair


Displaying a Heatmap on the X-ray Image to Show Which Parts Correspond to the Text


A function was also implemented that displays a heatmap in which parts of the image strongly related to the text input appear in warm colours (red) and the remaining parts in cold colours (blue), helping users better understand the relationship between the generated images and the texts. While the first evaluation metric mainly captures global similarity, this visualisation highlights similarities between parts of the image and specific words in the text input.
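
As a rough illustration, such an overlay can be produced by upsampling a coarse per-patch relevance grid to the image resolution and blending it over the X-ray. The 7x7 grid, the file path, and the random stand-in scores below are assumptions; in practice the scores come from the attention extraction described in the next section.

    import numpy as np
    import matplotlib.pyplot as plt
    from PIL import Image

    xray = Image.open("generated_cxr.png").convert("L").resize((224, 224))  # placeholder path
    patch_scores = np.random.rand(7, 7)  # stand-in for real text-image relevance scores per patch

    # Upsample the coarse patch grid to the image resolution
    heat = Image.fromarray((patch_scores * 255).astype(np.uint8)).resize((224, 224), Image.BILINEAR)

    plt.imshow(xray, cmap="gray")
    plt.imshow(np.array(heat), cmap="jet", alpha=0.4)  # warm = strongly related to the text, cold = weakly related
    plt.axis("off")
    plt.show()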


What Are Attention Scores?


Attention scores are computed in transformer-based models (e.g., MedCLIP, CLIP) to determine how much focus (or weight) the model gives to specific elements in the input. In the context of text-image alignment:

  • Text Tokens: Words or subwords from the report (e.g., "lungs," "clear").

  • Image Patches: Regions of the image divided into smaller, fixed-sized segments (e.g., parts of a chest X-ray).

The attention score indicates the contribution of each token (from the text) to specific image patches (or vice versa). These scores are derived from the attention mechanism within transformer layers.


Extracting Attention Scores


To compute or visualize attention scores for text-image alignment (as in this project):

  1. Access the Attention Weights:

    • Transformer models expose attention weights from their intermediate layers, which represent the computed attention scores for each Query-Key pair.

    • Example in code:

      attention_scores = model(**inputs, output_attentions=True)['attentions']

  2. Text-Image Interaction:

    • For text tokens attending to image patches:

      • Extract the attention scores between text embedding tokens (Query) and image patch embeddings (Key).

    • Example: The model generates a T × P matrix, where T is the number of text tokens and P is the number of image patches.

  3. Averaging Across Layers and Heads:

    • Transformer models have multiple attention heads and layers. Attention scores are often averaged across these to simplify visualization or analysis.

      • Example:

        # The 'attentions' output is a tuple with one tensor per layer, so stack it before averaging across layers and heads
        avg_attention_scores = torch.stack(attention_scores).mean(dim=(0, 2))

  4. Normalize or Visualize:

    • Normalize the scores to highlight the strongest attention connections.

    • Use visualizations like heatmaps to display the alignment between tokens and image patches.
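
Putting these steps together, the sketch below extracts and averages attention weights through the Hugging Face transformers CLIP interface (MedCLIP is broadly similar). Standard CLIP has no explicit text-to-image cross-attention, so as a simplifying assumption this example uses the vision tower's self-attention from the CLS token to the image patches as a proxy relevance map; the checkpoint name and image path are placeholders.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")      # placeholder checkpoint
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("generated_cxr.png").convert("RGB")                 # placeholder path
    report = "The lungs are normally inflated and clear."

    inputs = processor(text=[report], images=image, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    # One (batch, heads, seq, seq) tensor per layer; stack into (layers, batch, heads, seq, seq)
    vision_attn = torch.stack(outputs.vision_model_output.attentions)

    # Average across layers (dim 0) and heads (dim 2), then read the CLS token's attention over the patches
    avg_attn = vision_attn.mean(dim=(0, 2))                 # (batch, seq, seq)
    cls_to_patches = avg_attn[0, 0, 1:]                     # drop CLS-to-CLS, keep CLS -> patch weights
    cls_to_patches = cls_to_patches / cls_to_patches.sum()  # normalise, then reshape (e.g., 7x7 for ViT-B/32)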


Visualization Example

A heatmap is commonly used to display attention scores. Each cell in the heatmap shows the attention weight between a text token and an image patch.


            Patch 1   Patch 2   Patch 3   Patch 4
Lungs       0.30      0.20      0.40      0.10
Clear       0.25      0.25      0.30      0.20
Are         0.10      0.15      0.20      0.55

Heatmap with Attention Scores
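
For reference, the table above could be rendered as a heatmap with a few lines of matplotlib; the values below are the illustrative ones from the table, not real model output.

    import numpy as np
    import matplotlib.pyplot as plt

    tokens = ["Lungs", "Clear", "Are"]
    patches = ["Patch 1", "Patch 2", "Patch 3", "Patch 4"]
    scores = np.array([
        [0.30, 0.20, 0.40, 0.10],
        [0.25, 0.25, 0.30, 0.20],
        [0.10, 0.15, 0.20, 0.55],
    ])

    fig, ax = plt.subplots(figsize=(5, 3))
    im = ax.imshow(scores, cmap="coolwarm")   # warm = strong attention, cold = weak
    ax.set_xticks(range(len(patches)))
    ax.set_xticklabels(patches)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    fig.colorbar(im, ax=ax, label="Attention score")
    plt.tight_layout()
    plt.show()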

Key Features of the GUI


  1. Text-to-Image Generation: Generates synthetic chest X-rays from medical reports.

  2. Interactive GUI: Provides a user-friendly platform to input reports, generate CXRs, and evaluate alignment metrics.

  3. Alignment Metrics: Utilizes MedCLIP's cosine similarity to evaluate how well the generated CXRs align with medical reports.


Final Iteration with Light Mode
Final Iteration with Dark Mode

Conclusion


This project required multiple skills and techniques that are in demand in the machine learning/AI industry, such as text-to-image generation, computer vision, and stable diffusion models. Publishing this tool was not feasible because of its computational cost; however, with sufficient budget and resources, it could be made available online as a website or as standalone software. Due to time constraints, some of the initial objectives were not achieved. Future work could address remaining challenges, such as analysing which types of images and texts the metric performs better on, and adding a function that lets users click on parts of the text input to highlight the corresponding regions in the generated images.

 
 
 
