EGO4D AVSD Challenge | Chin-Jou Li

This is one of the most complicated projects I led!

The audio-visual diarization (AVD) task tackles the problem of "who spoke when" in a given video. Inspired by the recent success of self-supervised learning (SSL) in speech processing and audio-visual applications, we wondered whether self-supervised embeddings (SSE) from SSL models could benefit the AVD task as well.

Overview of our model. The blocks we enhanced are shown in blue.

Our approach was based on the baseline system. We replaced three building blocks with more advanced modules: a face detection and tracking pipeline, an AV-HuBERT-based model for active speaker detection, and a pre-trained HuBERT model for generating audio embeddings.
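
To make the link between active speaker detection and "who spoke when" concrete, here is a minimal sketch of how per-frame active-speaker decisions for one tracked face could be turned into speech segments. This is an illustration, not the code from our system: the function name frames_to_segments, the fps default, and the min_frames filter are all assumptions made for the example.

```python
from typing import List, Tuple


def frames_to_segments(
    active: List[bool],
    fps: float = 30.0,      # assumed video frame rate (hypothetical default)
    min_frames: int = 1,    # assumed minimum run length to keep a segment
) -> List[Tuple[float, float]]:
    """Merge consecutive active frames into (start_sec, end_sec) segments."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current active run
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # a new active run begins
        elif not is_active and start is not None:
            # The run ended at frame i (exclusive); keep it if long enough.
            if i - start >= min_frames:
                segments.append((start / fps, i / fps))
            start = None
    # Close a run that lasts until the final frame.
    if start is not None and len(active) - start >= min_frames:
        segments.append((start / fps, len(active) / fps))
    return segments
```

Running this once per tracked face yields one list of segments per person, which together form a diarization output. A larger min_frames would drop very short detections, at the cost of missing brief utterances.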

On the validation data, frame-level prediction accuracy increased from 79% to 84%, and the diarization error rate improved by 3%.
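
For readers unfamiliar with the two metrics, the sketch below shows one simplified way to compute them at the frame level. It is not the challenge's official scoring tool: it assumes the predicted speaker labels have already been mapped to the reference labels and applies no forgiveness collar. The function names and the set-per-frame representation are assumptions made for this example.

```python
from typing import List, Sequence, Set


def frame_accuracy(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Fraction of frames whose predicted label equals the reference label."""
    if len(ref) != len(hyp) or len(ref) == 0:
        raise ValueError("ref and hyp must be non-empty and of equal length")
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)


def frame_der(ref: List[Set[str]], hyp: List[Set[str]]) -> float:
    """Simplified diarization error rate over frames.

    Each element is the set of speakers active in that frame.
    DER = (missed + false alarm + confusion) / total reference speech.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same number of frames")
    missed = false_alarm = confusion = total_ref = 0
    for r, h in zip(ref, hyp):
        n_ref, n_hyp, n_correct = len(r), len(h), len(r & h)
        missed += max(0, n_ref - n_hyp)        # reference speech with no hypothesis
        false_alarm += max(0, n_hyp - n_ref)   # hypothesis speech with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # speech given to the wrong speaker
        total_ref += n_ref
    if total_ref == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total_ref
```

Note that higher is better for accuracy while lower is better for DER, so an improvement in DER means the value went down.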