Published as a conference paper at the International Conference on Machine Learning (ICML) 2020.

Facebook AI Research, Tel-Aviv University

This post presents "Voice Separation with an Unknown Number of Multiple Speakers", a deep model for multi-speaker voice separation with a single microphone.
We present a new method for separating a mixed audio sequence in which multiple voices speak simultaneously. The method employs gated neural networks that are trained to separate the voices at multiple processing steps, while keeping the speaker in each output channel fixed. A different model is trained for every possible number of speakers, and the model with the largest number of speakers is used to select the actual number of speakers in a given sample. Our method greatly outperforms the current state of the art, which, as we show, is not competitive for more than two speakers.
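The selection step can be made concrete with a short sketch. This is a hedged illustration, not the paper's code: the `models` dictionary, the model interface, and the `silence_db` threshold are hypothetical stand-ins. The idea is to run the largest-capacity model once and count the output channels that are not silent, then re-run the model trained for exactly that number of speakers:

```python
import numpy as np

def count_active_channels(outputs, silence_db=-40.0):
    """Count output channels whose energy exceeds a silence threshold.

    outputs: array of shape (num_channels, num_samples) holding the
    separated waveforms. The -40 dB threshold is an assumption; the
    paper's exact criterion may differ.
    """
    # Per-channel RMS energy, expressed in dB.
    rms = np.sqrt(np.mean(outputs ** 2, axis=1) + 1e-12)
    energy_db = 20.0 * np.log10(rms + 1e-12)
    return int(np.sum(energy_db > silence_db))

def separate(mixture, models):
    """Separate a mixture when the speaker count is unknown.

    `models` maps a speaker count to a trained separation model
    (hypothetical interface: model(mixture) -> (C, T) array).
    """
    max_speakers = max(models)
    # Probe with the largest-capacity model; unused channels
    # should come out (nearly) silent.
    probe = models[max_speakers](mixture)
    c = count_active_channels(probe)
    c = min(max(c, min(models)), max_speakers)
    return models[c](mixture)
```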
Link to the txt files for the 4 and 5 speaker datasets - Download here
Link to the 2 and 3 speaker datasets [1] - Download from the MERL website
Here are some samples from our model for you to listen to:
[Audio samples (two sets): Mixture input | Ground Truth | Ours]
[Audio samples (four sets): Mixture input | Ground Truth | Ours | DPRNN | Conv-TasNet]
Results for a model trained on WSJ-5mix and tested on WSJ-2mix. The results suggest that all models managed to separate the two speakers. However, whereas DPRNN and Conv-TasNet output noise and mixed signals in the remaining output channels, our model produces silence with significantly fewer artifacts.
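One simple way to quantify this observation (a hypothetical check, not the paper's evaluation metric) is to measure the energy of the surplus output channels; channels that are genuinely silent should sit far below the active ones:

```python
import numpy as np

def extra_channel_energy_db(outputs, num_true_speakers):
    """Energy (in dB) of the output channels beyond the true speaker count.

    Sorts channels by RMS energy and reports the quietest surplus ones,
    e.g. the 3 extra channels of a 5-speaker model run on a 2-speaker mix.
    Hypothetical helper for inspecting how 'silent' unused channels are.
    """
    rms = np.sqrt(np.mean(outputs ** 2, axis=1) + 1e-12)
    quietest = np.sort(rms)[: outputs.shape[0] - num_true_speakers]
    return 20.0 * np.log10(quietest + 1e-12)
```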
[Audio samples: Mixture input | Ground Truth | Ours | DPRNN | Conv-TasNet]
[Audio samples: Mixture input | Ours]
[1] Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., Hershey, J. R., "Single-Channel Multi-Speaker Separation Using Deep Clustering," Interspeech 2016, pp. 545-549, DOI: 10.21437/Interspeech.2016-1176.
[2] Luo, Y., Mesgarani, N., "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256-1266, 2019.
[3] Luo, Y., Chen, Z., Yoshioka, T., "Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation," ICASSP 2020.