TY - JOUR
T1 - Fusing information from multiple 2D depth cameras for 3D human pose estimation in the operating room
AU - Hansen, Lasse
AU - Siebert, Marlin
AU - Diesel, Jasper
AU - Heinrich, Mattias P.
N1 - Funding Information:
We would like to thank the reviewers for their many insightful comments and suggestions helping to improve our paper. We gratefully acknowledge the support of the NVIDIA Corporation with their GPU donations for this research.
Publisher Copyright:
© 2019, CARS.
PY - 2019/11/1
Y1 - 2019/11/1
AB - Purpose: For many years, deep convolutional neural networks have achieved state-of-the-art results on a wide variety of computer vision tasks. 3D human pose estimation is no exception, and results on public benchmarks are impressive. However, specialized domains, such as operating rooms, pose additional challenges. Clinical settings involve severe occlusions, clutter and difficult lighting conditions, and privacy concerns of patients and staff make it necessary to use non-identifiable data. In this work, we aim to bring robust human pose estimation to the clinical domain. Methods: We propose a 2D–3D information fusion framework that makes use of a network of multiple depth cameras and strong pose priors. In a first step, 2D joint probabilities are predicted from single depth images. This information is then fused in a shared voxel space, yielding a rough estimate of the 3D pose. Final joint positions are obtained by regressing into the latent pose space of a pre-trained convolutional autoencoder. Results: We evaluate our approach against several baselines on the challenging MVOR dataset. Best results are obtained when fusing 2D information from multiple views and constraining the predictions with learned pose priors. Conclusions: We present a robust 3D human pose estimation framework based on a multi-depth camera network in the operating room. Using depth images as the only input modality makes our approach especially interesting for clinical applications, as it preserves the anonymity of patients and staff.
UR - http://www.scopus.com/inward/record.url?scp=85070328031&partnerID=8YFLogxK
DO - 10.1007/s11548-019-02044-7
M3 - Journal article
C2 - 31388959
AN - SCOPUS:85070328031
SN - 1861-6410
VL - 14
SP - 1871
EP - 1879
JO - International Journal of Computer Assisted Radiology and Surgery
JF - International Journal of Computer Assisted Radiology and Surgery
IS - 11
ER -