Voice Separation with an Unknown Number of Multiple Speakers

Published as a conference paper at the International Conference on Machine Learning (ICML) 2020

Eliya Nachmani, Yossi Adi, Lior Wolf

Facebook AI Research, Tel-Aviv University


This post presents "Voice Separation with an Unknown Number of Multiple Speakers", a deep model for multi-speaker voice separation with a single microphone.

We present a new method for separating a mixed audio sequence in which multiple voices speak simultaneously. The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while keeping the speaker in each output channel fixed. A different model is trained for every possible number of speakers, and the model with the largest number of speakers is used to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.
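In practice, the speaker-count selection amounts to running the model trained for the largest number of speakers and counting the output channels that are not silent. Below is a minimal, hypothetical sketch of this step in PyTorch; the function names, tensor shapes, and the -20 dB silence threshold are our own illustrative assumptions, not values taken from the paper.

    import torch

    def estimate_num_speakers(mixture, largest_model, silence_db=-20.0):
        """Hypothetical sketch: run the model trained for the largest
        number of speakers and count output channels that are not
        (near) silent.

        mixture:       tensor of shape [1, T], a single-channel waveform
        largest_model: separation model returning [1, C_max, T]
        silence_db:    energy threshold relative to the loudest channel
                       (an assumed value, not from the paper)
        """
        with torch.no_grad():
            estimates = largest_model(mixture)       # [1, C_max, T]
        energies = estimates.pow(2).mean(dim=-1)     # per-channel power
        rel_db = 10 * torch.log10(energies / energies.max() + 1e-8)
        active = rel_db > silence_db                 # non-silent channels
        return int(active.sum()), estimates[:, active.squeeze(0)]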

Link to the txt files for the 4-5 speaker dataset - Download here

Link to the 2-3 speaker dataset [1] - Download from the MERL website



Architecture

[Architecture figures from the paper]
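The figures are not reproduced here, but the core building block of the network is a gated RNN block: two bidirectional LSTMs process the same input, their outputs are multiplied elementwise, and the product is concatenated with the block input before a linear projection back to the input size. A minimal sketch of such a block in PyTorch follows; the dimensions and class name are illustrative assumptions, not the paper's exact hyperparameters.

    import torch
    import torch.nn as nn

    class GatedBlock(nn.Module):
        """Sketch of one gated (multiply-and-concatenate) RNN block."""

        def __init__(self, dim, hidden):
            super().__init__()
            self.rnn = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
            self.gate = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * hidden + dim, dim)

        def forward(self, x):                 # x: [batch, time, dim]
            h, _ = self.rnn(x)                # [batch, time, 2*hidden]
            g, _ = self.gate(x)               # [batch, time, 2*hidden]
            gated = h * g                     # elementwise gating
            return self.proj(torch.cat([gated, x], dim=-1))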




Loss Terms

[Figure: loss terms]
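At the core of the training objective is the scale-invariant signal-to-noise ratio (SI-SNR) combined with utterance-level permutation-invariant training (uPIT), which keeps each output channel locked to a single speaker; the paper adds further terms on top of this. Below is a hedged sketch of the SI-SNR/uPIT part in PyTorch; the exhaustive permutation search as written is only practical for a small number of speakers.

    import itertools
    import torch

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant SNR between estimate and reference, [batch, T]."""
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        # Project the estimate onto the reference (the "target" component).
        dot = (est * ref).sum(dim=-1, keepdim=True)
        target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
        noise = est - target
        ratio = target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
        return 10 * torch.log10(ratio + eps)          # [batch]

    def upit_si_snr_loss(est, ref):
        """uPIT loss: pick the channel permutation with the best SI-SNR.
        est, ref: [batch, C, T]."""
        C = est.shape[1]
        losses = []
        for perm in itertools.permutations(range(C)):
            snr = torch.stack([si_snr(est[:, i], ref[:, j])
                               for i, j in enumerate(perm)], dim=-1)
            losses.append(-snr.mean(dim=-1))          # [batch]
        losses = torch.stack(losses, dim=-1)          # [batch, C!]
        return losses.min(dim=-1).values.mean()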


Results - WSJ0-2mix

[Table: separation results on WSJ0-2mix]

WHAM! & WHAMR!

[Table: separation results on WHAM! and WHAMR!]

Here are some samples from our model for you to listen to:

  • Mixture input - the original mixed audio
  • Ground Truth - the original separated samples
  • Ours - our proposed method
  • DPRNN [3] - refers to the method "Dual-Path RNN"
  • Conv-TasNet [2] - refers to the method "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation"



WHAM! Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours]




WHAMR! Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours]


WSJ-2mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


WSJ-3mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


WSJ-4mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


WSJ-5mix Dataset Samples

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]

WSJ-2mix Tested with the Model Trained on WSJ-5mix

Results for a model trained on WSJ-5mix and tested on WSJ-2mix. The results suggest that all models manage to separate the two speakers. However, while DPRNN and Conv-TasNet output noise and mixed signals in the remaining output channels, our model produces near silence with significantly fewer artifacts.

[Audio samples: Mixture input, Ground Truth, Ours, DPRNN, Conv-TasNet]


Mixtures recorded in the wild - 2 speakers

[Audio samples: Mixture input, Ours]


References

[1]. Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., and Hershey, J. R. "Single-Channel Multi-Speaker Separation Using Deep Clustering." Interspeech 2016, pp. 545-549. DOI: 10.21437/Interspeech.2016-1176.

[2]. Luo, Yi, and Nima Mesgarani. "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27.8 (2019): 1256-1266.

[3]. Luo, Yi, Zhuo Chen, and Takuya Yoshioka. "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation." ICASSP 2020.