The algorithm is conceptually simple: first, it detects sound sources in the frame. These fall into two types: specific objects, and places with a characteristic background sound (for example, a cafe).
The original video is split into scenes wherever the histogram changes sharply between consecutive frames, after which the CLIP neural network classifies the objects in each scene. Epidemic Sound, a library of 90,000 sounds, serves as the effects database.
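The scene-splitting step can be sketched with plain NumPy: compute a normalized intensity histogram per frame and declare a cut wherever the distance between consecutive histograms jumps. This is a minimal illustration of the idea, not the tool's actual implementation; the bin count and threshold here are arbitrary assumptions.

```python
import numpy as np

def frame_histogram(frame, bins=32):
    # Normalized intensity histogram of a frame (2D array of 0-255 values).
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_scene_cuts(frames, threshold=0.5):
    # A cut is declared at frame i when the L1 distance between the
    # histograms of frames i-1 and i exceeds the threshold.
    cuts = []
    prev = frame_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = frame_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts

# Two synthetic "scenes": three dark frames followed by three bright ones.
dark = [np.full((8, 8), 10, dtype=np.uint8)] * 3
bright = [np.full((8, 8), 200, dtype=np.uint8)] * 3
print(detect_scene_cuts(dark + bright))  # → [3]
```

In practice a color histogram (per-channel or HSV) is more robust than grayscale, but the comparison logic stays the same.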
Finally, the model "equips" each scene with the five sound effects most likely to match its objects and environment. Only one of them is activated by default, but the user can enable all five.
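A CLIP-style ranking of candidate effects might look like the following: score each effect's tag embedding against the scene embedding by cosine similarity, keep the top five, and enable only the best match. The function and the embeddings are hypothetical placeholders, assumed purely for illustration.

```python
import numpy as np

def top_k_effects(scene_embedding, effect_embeddings, effect_names, k=5):
    # Cosine similarity between the scene embedding and each effect's
    # tag embedding; a higher score means a better fit for the scene.
    scene = scene_embedding / np.linalg.norm(scene_embedding)
    effects = effect_embeddings / np.linalg.norm(
        effect_embeddings, axis=1, keepdims=True)
    scores = effects @ scene
    order = np.argsort(scores)[::-1][:k]
    # Only the best match starts enabled; the rest are offered as options.
    return [(effect_names[i], rank == 0) for rank, i in enumerate(order)]

names = ["dog bark", "cafe ambience", "car engine"]
embeddings = np.eye(3)                    # toy stand-ins for real embeddings
scene = np.array([0.1, 0.9, 0.2])
print(top_k_effects(scene, embeddings, names, k=2))
# → [('cafe ambience', True), ('car engine', False)]
```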
Once the sounds are selected, the algorithm assigns each one a time interval. This adds realism, since not every object stays on screen for the whole video.
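One simple way to derive such intervals, assuming a per-frame visibility flag for each detected object, is to merge runs of consecutive visible frames into (start, end) time spans. This is a sketch of the idea, not the tool's actual logic.

```python
def object_intervals(detections, fps=30.0):
    # detections: per-frame booleans marking whether the object is visible.
    # Consecutive visible frames are merged into (start_sec, end_sec) spans,
    # so the effect plays only while its source is on screen.
    intervals = []
    start = None
    for i, visible in enumerate(detections):
        if visible and start is None:
            start = i
        elif not visible and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    if start is not None:  # object still visible at the end of the clip
        intervals.append((start / fps, len(detections) / fps))
    return intervals

# Object visible in frames 1-2 and 5-7 of an 8-frame clip at 2 fps.
print(object_intervals([0, 1, 1, 0, 0, 1, 1, 1], fps=2.0))
# → [(0.5, 1.5), (2.5, 4.0)]
```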