Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds
Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods mainly focus on the semantic correspondence between videos and monaural sounds, and spatial information of sound sources has not been considered.