SELF-SUPERVISED LEARNING FOR AUDIO-VISUAL RELATIONSHIPS OF VIDEOS WITH STEREO SOUNDS