A Comparison of Feature Extraction Models for Medical Image Captioning

Sebastian Germer*, Hristina Uzunova, Jan Ehrhardt, Nils Feldhus, Philippe Thomas, Heinz Handels

*Corresponding author for this work


Introduction: In recent years, there has been significant progress in the area of image captioning using a combination of convolutional neural networks for feature extraction and recurrent neural networks for language generation [1], [2]. Inspired by this, this work addresses the automatic generation of medical image descriptions. However, findings from general-domain image captioning typically cannot be transferred one-to-one because of the specific characteristics of medical images.

Many of the published works in this domain focus on improving the language generation component while relying on well-known image recognition networks as feature extraction models [3], [4], [5]. In contrast, our work aims to determine which features, and therefore which extraction models, are suitable for the task of medical image captioning.

Methods: For our study, we consider three feature extraction models: DenseNet-121 [6] serves as a baseline; in addition, we develop a classifier with a reduced number of layers as well as an autoencoder architecture. For language generation, we use an architecture similar to the Show-and-Tell approach [7]: an LSTM predicts the next word of a sentence based on the image features provided by the feature extractor and the words generated so far. We examine two publicly available datasets, the Open-I Indiana University chest X-ray dataset (IU-XRAY) [3] and the chest X-ray dataset of the National Institutes of Health (NIH-XRAY) [2]. Based on the textual findings of IU-XRAY, we derive shorter, more streamlined captions for each image in both datasets, encoding the presence or absence of each of 15 disease categories.
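The decoding scheme described above can be illustrated with a minimal sketch. This is not the authors' implementation: the feature extractor and the next-word model below are deterministic stand-ins for the CNN and the LSTM, chosen only to make the greedy generation loop concrete.

```python
# Illustrative sketch of Show-and-Tell-style greedy decoding.
# extract_features() stands in for a CNN (DenseNet-121, the reduced
# classifier, or the autoencoder); next_word() stands in for the LSTM.

def extract_features(image):
    # Stand-in feature extractor: a fixed-size vector derived from
    # the mean pixel intensity (a real model would output CNN features).
    mean = sum(image) / len(image)
    return [mean] * 4

def next_word(features, generated, vocab):
    # Stand-in language decoder: deterministically picks a word from
    # the features and the number of words generated so far.
    idx = (int(sum(features)) + len(generated)) % len(vocab)
    return vocab[idx]

def generate_caption(image, vocab, max_len=10, end_token="<end>"):
    """Greedy decoding: repeatedly ask the decoder for the next word,
    conditioned on the image features and the words emitted so far."""
    features = extract_features(image)
    caption = []
    for _ in range(max_len):
        word = next_word(features, caption, vocab)
        caption.append(word)
        if word == end_token:
            break
    return caption

vocab = ["no", "acute", "findings", "<end>"]
print(generate_caption([0.1, 0.2, 0.3], vocab))
```

In the actual architecture, `next_word` would sample from the LSTM's output distribution over a vocabulary learned from the training captions; the loop structure is the same.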

Results: The sentences generated by the language decoder are evaluated quantitatively for the different datasets and feature extraction architectures using several scoring methods (BLEU [8], METEOR [9], BERTScore [10]). Two major conclusions can be drawn from this evaluation: first, the results are comparable across all of the features used for text generation; second, the text evaluation metrics appear to be correlated. The latter is especially interesting since n-gram-based metrics like BLEU and METEOR are intuitively less suited to this task than an embedding-based metric like BERTScore. One possible reason is that the BERT embedding is not designed for the medical domain.
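To make the contrast between metric families concrete, the core ingredient of BLEU is clipped n-gram precision, sketched below in plain Python (the example sentences are hypothetical, not taken from the datasets):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision, the core ingredient of BLEU: each
    candidate n-gram is counted at most as often as it appears in
    the reference, so repeated words cannot inflate the score."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n])
                         for i in range(len(reference) - n + 1))
    clipped = sum(min(count, ref_ngrams[gram])
                  for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

cand = "no acute cardiopulmonary findings".split()
ref = "no acute cardiopulmonary disease".split()
print(modified_ngram_precision(cand, ref, n=1))  # 3 of 4 unigrams match: 0.75
```

BERTScore, by contrast, compares contextual token embeddings rather than surface n-grams, which is why a correlation between the two families is not obvious a priori.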

Discussion: The quantitative results achieved by the different architectures are comparable to each other. On the one hand, this is interesting because the developed classifier has significantly fewer parameters than the DenseNet, which indicates that simpler architectures yield similar results while requiring fewer computational resources. On the other hand, the use of an autoencoder as a feature extractor for image captioning has hardly been mentioned in the literature so far. Because autoencoders do not explicitly learn according to a given class distribution, they could remedy the problem of unevenly distributed classes, which is especially common in the medical domain. We argue that this promising direction should be investigated in future work.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


Conference: 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF)

Research Areas and Centers

  • Centers: Center for Artificial Intelligence Luebeck (ZKIL)
