Voice Separation with an Unknown Number of Multiple Speakers

Published as a conference paper at the International Conference on Machine Learning (ICML) 2020

Eliya Nachmani, Yossi Adi, Lior Wolf

Facebook AI Research, Tel-Aviv University


This post presents "Voice Separation with an Unknown Number of Multiple Speakers", a deep model for multi-speaker voice separation with a single microphone.

We present a new method for separating a mixed audio sequence in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while keeping the speaker in each output channel fixed. A different model is trained for every possible number of speakers, and the model with the largest number of speakers is used to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.
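In practice, the speaker-count selection amounts to running the model trained for the largest number of speakers and counting the output channels that are not silent. Below is a minimal, hypothetical sketch of this step in PyTorch; the function names, tensor shapes, and the -20 dB silence threshold are our own illustrative assumptions, not values taken from the paper.

    import torch

    def estimate_num_speakers(mixture, largest_model, silence_db=-20.0):
        """Hypothetical sketch: run the model trained for the largest
        number of speakers and count output channels that are not
        (near) silent.

        mixture:       tensor of shape [1, T], a single-channel waveform
        largest_model: separation model returning [1, C_max, T]
        silence_db:    energy threshold relative to the loudest channel
                       (an assumed value, not from the paper)
        """
        with torch.no_grad():
            estimates = largest_model(mixture)       # [1, C_max, T]
        energies = estimates.pow(2).mean(dim=-1)     # per-channel power
        rel_db = 10 * torch.log10(energies / energies.max() + 1e-8)
        active = rel_db > silence_db                 # non-silent channels
        return int(active.sum()), estimates[:, active.squeeze(0)]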

Link to the txt files for the 4-5 speaker dataset - Download here

Link to the 2-3 speaker dataset [1] - Download from the MERL website



Architecture

[Architecture figures from the paper]
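The figures are not reproduced here, but the core building block of the network is a gated RNN block: two bidirectional LSTMs process the same input, their outputs are multiplied elementwise, and the product is concatenated with the block input before a linear projection back to the input size. A minimal sketch of such a block in PyTorch follows; the dimensions and class name are illustrative assumptions, not the paper's exact hyperparameters.

    import torch
    import torch.nn as nn

    class GatedBlock(nn.Module):
        """Sketch of one gated (multiply-and-concatenate) RNN block."""

        def __init__(self, dim, hidden):
            super().__init__()
            self.rnn = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
            self.gate = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * hidden + dim, dim)

        def forward(self, x):                 # x: [batch, time, dim]
            h, _ = self.rnn(x)                # [batch, time, 2*hidden]
            g, _ = self.gate(x)               # [batch, time, 2*hidden]
            gated = h * g                     # elementwise gating
            return self.proj(torch.cat([gated, x], dim=-1))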




Loss Terms

[Figure: loss terms]
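At the core of the training objective is the scale-invariant signal-to-noise ratio (SI-SNR) combined with utterance-level permutation-invariant training (uPIT), which keeps each output channel locked to a single speaker; the paper adds further terms on top of this. Below is a hedged sketch of the SI-SNR/uPIT part in PyTorch; the exhaustive permutation search as written is only practical for a small number of speakers.

    import itertools
    import torch

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR between estimate and reference, [batch, T]."""
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        # Project the estimate onto the reference (the "target" component).
        dot = (est * ref).sum(dim=-1, keepdim=True)
        target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
        noise = est - target
        ratio = target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
        return 10 * torch.log10(ratio + eps)          # [batch]

    def upit_si_snr_loss(est, ref):
        """uPIT loss: pick the channel permutation with the best SI-SNR.
        est, ref: [batch, C, T]."""
        C = est.shape[1]
        losses = []
        for perm in itertools.permutations(range(C)):
            snr = torch.stack([si_snr(est[:, i], ref[:, j])
                               for i, j in enumerate(perm)], dim=-1)
            losses.append(-snr.mean(dim=-1))          # [batch]
        losses = torch.stack(losses, dim=-1)          # [batch, C!]
        return losses.min(dim=-1).values.mean()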


Results - WSJ0-2mix

[Table: separation results on WSJ0-2mix]

WHAM! & WHAMR!

[Table: separation results on WHAM! and WHAMR!]

Here are some samples from our model for you to listen to:

  • Mixture input - the original mixed audio
  • Ground Truth - the original separated samples
  • Ours - our proposed method
  • DPRNN [3] - refers to the method "Dual-Path RNN"
  • Conv-TasNet [2] - refers to the method "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation"



WHAM! Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours]




WHAMR! Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours]


WSJ-2mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


WSJ-3mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


WSJ-4mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


WSJ-5mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]

WSJ-2mix Tested with the Model Trained on WSJ-5mix

Results for a model trained on WSJ-5mix and tested on WSJ-2mix. The results suggest that all models manage to separate the two speakers. However, while DPRNN and Conv-TasNet output noise and mixed signals in the remaining output channels, our model produces near silence with significantly fewer artifacts.

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


Mixtures recorded in the wild - 2 speakers

[Audio samples: Mixture input, Ours]


References

[1]. Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., and Hershey, J. R. "Single-Channel Multi-Speaker Separation Using Deep Clustering." Interspeech 2016, pp. 545-549. DOI: 10.21437/Interspeech.2016-1176.

[2]. Luo, Yi, and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019): 1256-1266.

[3]. Luo, Yi, Zhuo Chen, and Takuya Yoshioka. "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation." ICASSP 2020.