Encoding and communicating navigable soundfields

Research into encoding and communicating navigable soundfields is part of an ARC Discovery Project conducted jointly by UOW (lead institution) and RMIT University. At UOW, Xiguang Zheng and Chen Meng are PhD students working on this project. The project focuses on giving users the ability to choose their “listening point” within a 3D audio scene. Research has included the creation of an analysis-by-synthesis approach for jointly encoding multiple speech sources extracted from a spatial sound scene into a mono or stereo downmix signal that can be efficiently compressed with an existing standard speech coder. Recent research published at ICASSP 2012 [1] has demonstrated the success of this approach in maintaining the quality of individual speech sources when decoded from the compressed mixture. This has application to spatial audio teleconferencing.

Some example files are provided below to demonstrate the performance of this approach when compared to separate encoding of each source using the AMR-WB+ coder. The first set of examples is for single speech sources obtained from a database of anechoic recordings that were used to create artificial mixtures of three overlapping speech sentences. The proposed approach is used to obtain a single-channel mixture signal that is then compressed with the AMR-WB+ coder at 36 kbps (the side information used to denote each source in the mixture can be losslessly compressed at a rate of approximately 2 kbps). Separate encoding of each source is achieved using AMR-WB+ operating at 12 kbps per source, for a total bit rate of 36 kbps.
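The bit budget for the two schemes in this first set of examples can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the approximately 2 kbps of side information is transmitted in addition to the 36 kbps mixture, which the description above does not state explicitly.

```python
def separate_total_kbps(num_sources: int, per_source_kbps: float) -> float:
    """Total rate when every source is coded independently."""
    return num_sources * per_source_kbps


def joint_total_kbps(mixture_kbps: float, side_info_kbps: float) -> float:
    """Total rate for one coded mixture plus side information.

    Assumption: the side information is sent on top of the mixture rate.
    """
    return mixture_kbps + side_info_kbps


# Figures quoted in the text: three sources at 12 kbps each,
# versus one mixture at 36 kbps with roughly 2 kbps of side information.
separate = separate_total_kbps(num_sources=3, per_source_kbps=12.0)
joint = joint_total_kbps(mixture_kbps=36.0, side_info_kbps=2.0)

print(f"Separate encoding: {separate:.0f} kbps")
print(f"Proposed approach: about {joint:.0f} kbps (under the stated assumption)")
```

If the side information is instead counted within the 36 kbps, the two totals are the same and only the split between audio and side information differs.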

Original      Proposed approach    Separate Encoding
Male 1        Male 1_joint         Male 1_sep
Female 1      Female 1_joint       Female 1_sep

The second set of examples is for real recordings of meeting speech obtained from the AMI meeting corpus (http://corpus.amiproject.org/). In this case, each meeting contained four participants, and the proposed approach was again compared with separate encoding of each source using the AMR-WB+ coder. Total bit rates were again set to approximately 36 kbps in all cases.

Original      Proposed approach    Separate Encoding
Male 1        Male 1_joint         Male 1_sep
Female 1      Female 1_joint       Female 1_sep

Relevant Publications

[1]     Xiguang Zheng, Christian Ritz, and Jiangtao Xi, “Encoding navigable speech sources: an analysis by synthesis approach”, Proc. IEEE 2012 International Conference on Acoustics, Speech and Signal Processing (ICASSP'2012), Kyoto, Japan, Mar 25-30, 2012.

[2]     Xiguang Zheng and Christian Ritz, “Compression of Navigable Speech Soundfield Zones”, Proc. 2011 IEEE International Workshop on Multimedia Signal Processing (MMSP2011), Hangzhou, China, October 17-19, 2011.

[3]     Xiguang Zheng and Christian Ritz, “Hybrid FEC and MDC Models for Low-Delay Packet-Loss Recovery”, Proc. 5th International Conference on Signal Processing and Communication Systems, Honolulu, Hawaii, December 12-14, 2011.

[4]     Christian H. Ritz, Muawiyath Shujau, Xiguang Zheng, Bin Cheng, Eva Cheng, and Ian S. Burnett, “Backward Compatible Spatialized Teleconferencing based on Squeezed Recordings”, in Advances in Sound Localization, Pawel Strumillo (Ed.), InTech, 2011, ISBN: 978-953-307-224-1. Available from: http://www.intechopen.com/articles/show/title/backward-compatible-spatialized-teleconferencing-based-on-squeezed-recordings