Humans rely on multiple sensory modalities to determine the emotional state of others. In fact, such multisensory perception may be one of the mechanisms explaining the ease and efficiency by which others' emotions are recognized. But how and when exactly do the different modalities interact? One aspect in multisensory perception that has received increasing interest in recent years is the concept of crossmodal prediction. In emotion perception, as in most other settings, visual information precedes the auditory one. Thereby, leading in visual information can facilitate subsequent auditory processing. While this mechanism has often been described in audiovisual speech perception, it has not been addressed so far in audiovisual emotion perception. Based on the current state of the art in (a) crossmodal prediction and (b) multisensory emotion perception research, we propose that it is essential to consider the former in order to fully understand the latter. Focusing on electroencephalographic (EEG) and magnetoencephalographic (MEG) studies, we provide a brief overview of the current research in both fields. In discussing these findings, we suggest that emotional visual information may allow for a more reliable prediction of auditory information compared to non-emotional visual information. In support of this hypothesis, we present a re-analysis of a previous data set that shows an inverse correlation between the N1 response in the EEG and the duration of visual emotional but not non-emotional information. If the assumption that emotional content allows for more reliable predictions can be corroborated in future studies, crossmodal prediction is a crucial factor in our understanding of multisensory emotion perception.