On September 28th and 29th, the Audio Engineering Society (AES) held its first machine-learning-focused event, “Applications of Machine Learning in Audio.” The virtual event covered topics including automatic mixing, machine learning for audio visualization, and sourcing audio data for machine learning. It also featured a presentation on five current trends in audio source separation given by Fabian-Robert Stöter, an audio-AI researcher at Inria (the French National Institute for Research in Digital Science and Technology), and Stefan Uhlich, who works in Sony’s R&D center in Stuttgart, Germany.
Below are the trends Stöter and Uhlich discussed in their presentation.
1. Moving from Supervised Separation to Universal Separation
The first trend introduced was the movement from supervised separation to universal separation. Uhlich explained, “In supervised separation, we have acquired some prior knowledge about the separation problem that we want to solve. For example, we know a priori the number of sources in the mixture, and typically the number of sources is quite small, either two or four. And we also know what type of sources we want to separate.” He then explained universal separation, saying, “On the other end of the scale, we have truly universal separation, where we have an unknown number of sources. And, even more important, we also deal with unknown source types in the mixture. So we really don’t know a priori what types of sources are there.”
Uhlich said that universal sound separation has yet to be solved, but there has been a lot of progress lately. As an example, he talked about how it is now possible to perform audio source separation for an unknown number of sources. Uhlich described the approach: “The idea is to recursively split off one source from the mixture using a one-and-rest speech separation system [1]. The network has two outputs, s(t) and r(t), where s(t) contains the estimate of one sound source, in this case one speaker, and r(t) contains the remaining speakers that are left. The idea is then to iteratively apply this one-and-rest speech separation system until we have really split the mixture into all the speakers, into all the sources.” Another example Uhlich gave was the ability to perform source separation for many classes [2]. He explained, “First a sound event detector is trained, and this sound event detector is then used to identify anchor segments which can be used for training. Using this approach, we can train a system on 527 AudioSet classes, which is really amazing.”
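To make the recursive one-and-rest idea described above concrete, here is a minimal NumPy sketch of the iterative peeling loop. The `separate_one_and_rest` function is only a placeholder for the trained network of [1], and the residual-energy stopping criterion is a simplification of the stopping mechanisms used in practice.

```python
import numpy as np

def separate_one_and_rest(mixture):
    """Placeholder for a trained one-and-rest separator.

    A real model (such as the recursive system of [1]) would return
    s(t), the estimate of a single source, and r(t), the residual
    containing everything else. Halving the signal is only a stand-in
    so that the loop below is runnable.
    """
    return 0.5 * mixture, 0.5 * mixture

def recursive_separation(mixture, max_sources=8, energy_threshold=1e-3):
    """Iteratively peel off one source at a time from the mixture."""
    sources = []
    residual = mixture
    for _ in range(max_sources):
        s, r = separate_one_and_rest(residual)
        sources.append(s)
        # Stop once the residual carries (almost) no energy, i.e. all
        # sources have been extracted.
        if np.mean(r ** 2) < energy_threshold:
            break
        residual = r
    return sources

estimated_sources = recursive_separation(np.random.randn(16000))
```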
2. Training with Imperfect Data
The second trend Uhlich discussed was the ability to deal with imperfect training data. He explained that “imperfect” can mean several things. In one situation, it can mean the data is weakly labeled. “A way to deal with weakly labeled data is to use sound event classifiers,” said Uhlich. He pointed to work by Mitsubishi [3], saying, “They showed how to train a separator when we only have frame-level labels, or maybe only clip-level labels, available.”
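The rough idea behind such classifier-guided training can be sketched as follows: a pretrained sound event classifier scores each separated output, and the separator is trained so that its k-th output is recognized as the k-th class whenever the clip-level label says that class is present. The `separator` and `classifier` callables, their shapes, and the exact loss are illustrative assumptions; the actual systems of [2] and [3] differ in their details.

```python
import torch
import torch.nn.functional as F

def weak_label_separation_loss(separator, classifier, mixture, clip_labels):
    """Sketch of training a separator from clip-level (weak) labels only.

    Assumptions (hypothetical interfaces, not from the cited papers):
      * separator(mixture) returns estimates of shape (batch, num_classes, time),
        one output channel per sound class.
      * classifier(waveform) returns class presence probabilities of
        shape (batch, num_classes) for a batch of waveforms.
      * clip_labels is a (batch, num_classes) 0/1 tensor of weak labels.
    """
    estimates = separator(mixture)                      # (B, C, T)
    num_classes = estimates.shape[1]
    loss = 0.0
    for c in range(num_classes):
        probs = classifier(estimates[:, c, :])          # (B, C)
        # The c-th estimate should contain class c exactly when the
        # clip-level label says class c is active, and no other class.
        target = torch.zeros_like(probs)
        target[:, c] = clip_labels[:, c].float()
        loss = loss + F.binary_cross_entropy(probs, target)
    return loss / num_classes
```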
In another situation, imperfect can mean the training data consists solely of unknown mixtures of sources. Referring to a Google research paper titled “Unsupervised sound separation using mixtures of mixtures,” Uhlich said the authors “showed with their mixture invariant training that it’s still possible to train a separation network on such data.” He explained that the Google researchers extended permutation invariant training to mixture invariant training (MixIT) [4]. “Now that we only have mixtures available, we don’t have to solve a permutation problem, but we really have to solve a mixing problem. We have to decide for every estimated source whether it goes with the first mixture or with the second mixture, and this is solved by finding a binary mixing matrix A. This mixture invariant training is really very interesting, and by this, it’s possible to train on training data which only contains mixtures of sources.”
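A brute-force sketch of that assignment step might look as follows; two reference mixtures are assumed, and a plain mean-squared error stands in for the SNR-based loss used in [4].

```python
import itertools
import numpy as np

def mixit_loss(estimated_sources, mixture_1, mixture_2):
    """Brute-force version of the mixture invariant training (MixIT) objective.

    estimated_sources: (num_sources, num_samples) array of network outputs
    for the mixture of mixtures mixture_1 + mixture_2. Each estimate is
    assigned to exactly one reference mixture by a binary mixing matrix A,
    and the assignment with the smallest reconstruction error is kept.
    """
    references = np.stack([mixture_1, mixture_2])             # shape (2, T)
    num_sources = estimated_sources.shape[0]
    best = np.inf
    # Every estimated source goes to mixture 1 or 2 -> 2**num_sources options.
    for assignment in itertools.product([0, 1], repeat=num_sources):
        A = np.zeros((2, num_sources))
        for source_idx, mixture_idx in enumerate(assignment):
            A[mixture_idx, source_idx] = 1.0                  # binary mixing matrix
        remixed = A @ estimated_sources                       # remix the estimates
        best = min(best, float(np.mean((remixed - references) ** 2)))
    return best
```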
In a third situation, “imperfect could mean that we want to train speech enhancement systems on noisy speech,” according to Uhlich. He said it would be ideal to have access to large datasets of high-quality, multi-language speech to train such systems. However, collecting this data is challenging and not straightforward. “What people often use are speech datasets that were created for automatic speech recognition, for ASR. And the problem is that this data often still contains noise, maybe even on purpose, because we want our ASR system to be robust to the typical noise that can be there. One example is the Mozilla Common Voice dataset, which is very nice because it’s multilingual, has many speakers, and contains a really large number of samples. But there are typical noises, like microphone switch-on sounds or even sample drops, that we actually want to remove with speech enhancement.”
So if we don’t have access to clean speech recordings, how can we train speech enhancement systems? Uhlich said, “There is some work that tackles this problem with an approach known from image denoising, called the Noise2Noise approach. And there’s another approach where a kind of bootstrapping is used: we have access at the beginning to a smaller clean speech dataset, and this can be used to bootstrap the training on the larger noisy speech training data.” However, despite these methods, the problem remains largely unsolved.
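The Noise2Noise idea can be sketched in a few lines: both the input and the training target are independently corrupted copies of the same utterance, so clean speech never enters the loss. The toy convolutional model, the signal shapes, and the MSE loss below are illustrative assumptions, not the networks used in the work Uhlich alludes to.

```python
import torch
import torch.nn as nn

# Toy stand-in for a real waveform-to-waveform enhancement network.
model = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=9, padding=4),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def noise2noise_step(noisy_input, noisy_target):
    """One gradient step that maps one noisy copy onto another noisy copy."""
    optimizer.zero_grad()
    loss = loss_fn(model(noisy_input), noisy_target)  # both sides contain noise
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical batch: two independently corrupted copies of the same speech.
speech = torch.randn(4, 1, 16000)                     # stand-in for the clean signal
noisy_a = speech + 0.1 * torch.randn_like(speech)
noisy_b = speech + 0.1 * torch.randn_like(speech)
noise2noise_step(noisy_a, noisy_b)
```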
3. Perceptual Loss Functions for Speech Enhancement
The third trend presented was the use of perceptual loss functions for speech enhancement. Uhlich said that “perceptual measures are commonly used for speech enhancement.” Examples he gave were PESQ (Perceptual Evaluation of Speech Quality); the composite measures CSIG (signal distortion), CBAK (background noise distortion), and COVL (overall quality); and STOI (short-time objective intelligibility), which has been shown to correlate well with speech intelligibility. He said these perceptual measures are used as additional loss terms to improve the perceived quality of speech enhancement systems. He added that while the use of perceptual measures is commonplace for speech, it is uncommon for music.
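As a rough illustration of the “additional loss term” idea, the sketch below adds a weighted perceptual-style term to a plain time-domain loss. Because measures such as PESQ and STOI are not differentiable as-is, a simple log-magnitude spectral distance stands in here for the differentiable surrogates (for example, networks trained to predict the score) used in the literature.

```python
import torch
import torch.nn.functional as F

def perceptual_surrogate(estimate, target, n_fft=512):
    """Placeholder for a differentiable perceptual loss term.

    A log-magnitude spectral distance is used only so the sketch runs;
    in practice this would be a differentiable proxy for PESQ, STOI,
    or a composite measure. Inputs are (batch, samples) waveforms.
    """
    window = torch.hann_window(n_fft, device=estimate.device)
    spec_est = torch.stft(estimate, n_fft, window=window, return_complex=True).abs()
    spec_ref = torch.stft(target, n_fft, window=window, return_complex=True).abs()
    return F.l1_loss(torch.log1p(spec_est), torch.log1p(spec_ref))

def enhancement_loss(estimate, target, perceptual_weight=0.5):
    """Time-domain loss plus an additional, weighted perceptual-style term."""
    return F.l1_loss(estimate, target) + perceptual_weight * perceptual_surrogate(estimate, target)
```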
Uhlich also mentioned a method for measuring the perceptual quality of audio source separation called PEASS (Perceptual Evaluation of Audio Source Separation). He said, “Quite recently, the University of Surrey checked whether this PEASS score still correlates well with human perception, and they showed that the APS, the artifact-related perceptual score, shows a good correlation with human perception.”
Additionally, he introduced an approach from MIT and Spotify where, instead of comparing the ground-truth signal with the estimate, they compare a low-bitrate codec version of the ground truth with the separation. The reasoning is that a low-bitrate MP3 codec removes everything that is not important for our perception, so it should be more meaningful to compare the separation against that. Uhlich concluded, “In general, as you can also see from PEASS, there’s really a need for more work on perceptual measures and loss functions for music. Currently, the tools are not really there, or at least they are not used by researchers for doing a perceptual evaluation.”
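The codec-based comparison can be approximated by running the reference through an MP3 round trip before computing the error, as in the sketch below. The ffmpeg command line, the 64 kbit/s bitrate, the temporary file names, and the simple truncation-based length matching are assumptions for illustration, not the exact setup of the MIT/Spotify work.

```python
import subprocess
import numpy as np
import soundfile as sf

def mp3_roundtrip(reference_wav, bitrate="64k"):
    """Encode a reference wav to low-bitrate MP3 and decode it back."""
    subprocess.run(["ffmpeg", "-y", "-i", reference_wav, "-b:a", bitrate, "codec.mp3"], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", "codec.mp3", "roundtrip.wav"], check=True)
    audio, _ = sf.read("roundtrip.wav")
    return audio

def codec_aware_error(reference_wav, estimate):
    """Compare the separation against the codec version of the ground truth."""
    coded_reference = mp3_roundtrip(reference_wav)
    # MP3 encoding adds a small delay and padding; a real evaluation would
    # align the signals, here we simply truncate to the common length.
    n = min(len(coded_reference), len(estimate))
    return float(np.mean((coded_reference[:n] - estimate[:n]) ** 2))
```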
4. Multi-Task Training
The fourth trend addressed was the use of multi-task training for speech enhancement; in other words, combining source separation with other tasks, such as classification. Stöter said, “A long-standing problem and question we have in audio separation and many related tasks is: given we have a perfect separation system, would we then be able to get a better classification system, or a better recognition system for speech, for example?” He pointed to a DCASE (Detection and Classification of Acoustic Scenes and Events) task in this area, saying, “There’s a recent evaluation campaign for DCASE 2020 where, for the first time, they ask for the task of sound event detection and have a separate evaluation for a source separation system and a sound event detection system.”
Uhlich continued by explaining the benefit of multi-task training: “It’s beneficial to also train the mask for the noise component at the same time, and to have a loss function which contains not only the error that we make in the speech prediction but also the error we make in predicting the noise mask.” He added, “This additional task acts as a regularizer, and we can observe that the speech mask generalizes better.” Uhlich referred to research done by Sony, where a system trained to predict the speech and noise masks at the same time performed better than a system that only predicted the speech mask.
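The sketch below shows the multi-task idea in its simplest form: one toy network predicts both a speech mask and a noise mask from the mixture magnitude spectrogram, and the loss penalizes errors on both targets so that the noise branch acts as the regularizer Uhlich describes. The architecture, shapes, and loss weights are illustrative assumptions, not Sony’s actual system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoMaskNet(nn.Module):
    """Toy network that jointly predicts a speech mask and a noise mask."""

    def __init__(self, num_bins=257):
        super().__init__()
        self.shared = nn.Linear(num_bins, 256)
        self.speech_head = nn.Linear(256, num_bins)
        self.noise_head = nn.Linear(256, num_bins)

    def forward(self, mixture_mag):                   # (batch, frames, bins)
        h = torch.relu(self.shared(mixture_mag))
        return torch.sigmoid(self.speech_head(h)), torch.sigmoid(self.noise_head(h))

def multitask_loss(model, mixture_mag, speech_mag, noise_mag, noise_weight=1.0):
    """Speech-mask loss plus the auxiliary noise-mask loss (the regularizer)."""
    speech_mask, noise_mask = model(mixture_mag)
    loss_speech = F.l1_loss(speech_mask * mixture_mag, speech_mag)
    loss_noise = F.l1_loss(noise_mask * mixture_mag, noise_mag)
    return loss_speech + noise_weight * loss_noise
```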
5. From Research to Deployment
The final trend Stöter and Uhlich talked about was the transition from research to deployment. “There are two reasons that we can see this commercial use of audio separation,” according to Uhlich. “The first one is that the quality has really reached a level that we couldn’t think of five years ago. And also, the computational resources that we have available on edge devices like smartphones are sufficient for doing real-time inference.” One example he gave was the integration of real-time speech enhancement into products like Google Meet and NVIDIA RTX Voice. “You can find many demos on YouTube where people are testing how well the speech enhancement system works with vacuum cleaners and all the stuff that you can think of, and it’s really quite fun to watch those videos.” He also highlighted how this technology has become more vital, saying, “Such real-time speech enhancement has gained even more importance now that there are many more video conferences because of COVID-19 and people having to work from home.”
Another example Uhlich provided was karaoke, which is featured in apps like Spotify and Line Music. “From Spotify, there’s the sing-along feature, which allows you to turn down the volume of the vocals so that, basically, you can sing along to the song yourself. And another one is the Line Music app, which realizes its karaoke feature by using source separation from Sony.” Additional examples mentioned were the intelligent wind filter in the Sony Xperia 1 II smartphone, and the Dolby Atmos versions of Lawrence of Arabia and Gandhi, which used Sony’s AI separation to upmix the films from stereo to 5.1, or from 5.1 to Dolby Atmos.
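Functionally, such a karaoke feature boils down to separating the vocals and remixing them at a lower gain, as in this small sketch; `separate_vocals` is a stand-in for whatever vocal/accompaniment separation model an app plugs in.

```python
import numpy as np

def karaoke_mix(mixture, separate_vocals, vocal_gain=0.2):
    """Turn down the vocals of a song using a source separation model.

    separate_vocals is assumed to be a callable returning
    (vocals, accompaniment) estimates for the input waveform.
    """
    vocals, accompaniment = separate_vocals(mixture)
    return accompaniment + vocal_gain * vocals        # vocal_gain=0 removes the vocals

# Toy usage with a dummy "separator" that simply splits the signal in half.
song = np.random.randn(44100)
karaoke = karaoke_mix(song, lambda x: (0.5 * x, 0.5 * x), vocal_gain=0.1)
```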
You can access the full presentation, which also covers the history of audio source separation as well as tools and datasets, via Stöter’s website.
Sources:
[1] Takahashi, Naoya, et al. “Recursive speech separation for unknown number of speakers.” Interspeech 2019.
[2] Kong, Qiuqiang, et al. “Source separation with weakly labelled data: An approach to computational auditory scene analysis.” ICASSP 2020.
[3] Pishdadian, Fatemeh, et al. “Finding strength in weakness: Learning to separate sounds with weak supervision.” TASLP 2020.
[4] Wisdom, Scott, et al. “Unsupervised sound separation using mixtures of mixtures.” arXiv preprint arXiv:2006.12701 (2020).