Winter School

Introduction

In order to promote APSIPA in the local area and extend education in signal and information technology, we propose a winter school as a satellite event of APSIPA ASC 2019. The theme of the winter school will be 'Speech Technologies and AI'. We will invite several well-known scholars in speech processing to give lectures, moving from fundamental theory (in the morning) to the research frontier (in the afternoon).

Venue

Northwest Minzu University, Wenjin Building, Lecture Hall (西北民族大学 文津楼学术报告厅)

Time

Nov. 17, 2019

Program

Morning Sessions (Fundamentals)

08:00 - 08:20   Registration
08:20 - 08:30   Opening
08:30 - 10:00   Lecture 1: Speech Signal Modeling and Processing: Fundamentals and Applications
                Speaker: Christian Ritz, University of Wollongong, Australia
10:00 - 10:30   Coffee break
10:30 - 12:00   Lecture 2: Deep Learning and Its Applications to Speech Processing: Recognition and Generation
                Speaker: Yu Tsao, Academia Sinica, Taiwan, China
12:00 - 13:20   Lunch break

Afternoon Sessions (Research Frontier)

13:20 - 13:30   Message from Sponsors
13:30 - 14:30   Lecture 3: History of Personal Media Terminals: From Walkman to Apple Watch
                Speaker: Akihiko K. Sugiyama, Yahoo, Japan
14:30 - 15:30   Lecture 4: Metric Learning for Speaker Recognition
                Speaker: Xiaolei Zhang, Northwestern Polytechnical University, China
15:30 - 16:00   Coffee break
16:00 - 17:00   Lecture 5: Generative Adversarial Networks (GANs) for Speech Technology
                Speaker: Hemant Patil, DA-IICT, India
17:00 - 18:00   Lecture 6: Speech Factorization: Squeeze Stereo Information from Speech Signals
                Speaker: Dong Wang, Tsinghua University, China

Evening Event

19:00 - 21:00   Speakers' dinner

Speakers

Lecture 1: Speech Signal Modeling and Processing: Fundamentals and Applications

Christian Ritz

Christian Ritz
University of Wollongong, Australia

Slide

Abstract
This lecture will first introduce the fundamentals of speech signal processing, including basic properties of speech, the speech production system, speech perception, and standard models such as linear prediction. It will then briefly introduce the basic components of selected applications, including speech coding, speech recognition and speech enhancement. Following this, an overview will be provided of the main techniques for evaluating the performance of these applications, such as speech quality measures, speech intelligibility measures and speech recognition accuracy. The lecture will conclude with an introduction to microphones and microphone arrays and their use within speech signal processing applications. The lecture is designed as an introduction for researchers who are new to speech signal processing, as well as a review of the key areas of knowledge for more experienced speech signal processing researchers.
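As a concrete illustration of the linear prediction model mentioned in the abstract, the short sketch below estimates prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion. It is a minimal NumPy example written for this page, not material from the lecture; the function name `lpc` and the synthetic test signal are our own.

```python
import numpy as np

def lpc(x, order):
    """Linear-prediction coefficients via the autocorrelation method
    and the Levinson-Durbin recursion.

    Returns (a, err): coefficients a[1..p] of the predictor
    x_hat[t] = a[1]*x[t-1] + ... + a[p]*x[t-p], and the residual energy.
    """
    n = len(x)
    # Autocorrelation for lags 0..order
    r = np.array([x[:n - k] @ x[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)   # a[0] is unused; a[1..p] are the coefficients
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for order i
        k = (r[i] - a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i] = a[1:i] - k * a[i - 1:0:-1]   # update lower-order coefficients
        a[i] = k
        err *= 1.0 - k * k                    # shrink the prediction error
    return a[1:], err
```

For example, on a long autoregressive signal x[t] = 0.5 x[t-1] - 0.3 x[t-2] + noise, `lpc(x, 2)` recovers coefficients close to (0.5, -0.3).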

Biography
Christian graduated with a Bachelor of Electrical Engineering and a Bachelor of Mathematics (both in 1999) and a PhD in Electrical Engineering (in 2003), all from the University of Wollongong (UOW), Australia. His PhD research focused on very low bit rate coding of wideband speech signals. Since 2003, Christian has held a position within the School of Electrical, Computer and Telecommunications Engineering at UOW, where he is currently a Professor. Concurrently, he is also the Associate Dean (International) for UOW’s Faculty of Engineering and Information Sciences, with responsibility for managing the Faculty’s international strategy, including significant transnational programs and partnerships in China, Hong Kong, Dubai, Singapore and Malaysia. Christian is the deputy director of the Centre for Signal and Information Processing (CSIP) and leads the centre's audio, speech and acoustics signal processing research. He is actively involved in several projects, some funded by the Australian government and industry, including microphone array signal processing for directional sound enhancement, acoustic scene classification, loudspeaker-based sound field reproduction and control, and visual object classification using machine learning. He is currently a Distinguished Lecturer (2019 to 2020) of the Asia Pacific Signal and Information Processing Association (APSIPA).



Lecture 2: Deep Learning and Its Applications to Speech Processing: Recognition and Generation

Yu Tsao

Yu Tsao
Academia Sinica, Taiwan, China

Slide

Abstract
Pattern recognition and machine learning have become indispensable tools for managing big data from diverse sources such as image, text, speech, and medical data. Deep learning allows machines to learn high-level concepts from very low-level data. The hierarchical deep structure mimics human cognitive functions such as those involved in vision or hearing perception. In the first part of this talk, we will introduce the fundamentals and advantages of deep learning algorithms. Several well-known deep learning models, both generative and discriminative, will be presented. In the second part of this talk, we will present two representative speech recognition and generation tasks based on deep learning technologies, namely pathological voice recognition and speech enhancement.

Biography
Yu Tsao (M’09) received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher with the National Institute of Information and Communications Technology, Japan, where he was engaged in research and product development in automatic speech recognition for multilingual speech-to-speech translation. He is an Associate Research Fellow with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. His research interests include speech recognition, audio coding, deep neural networks, bio-signals, and acoustic modeling. He received the TAAI 2012 Excellent Paper Award, the APSIPA 2017 Poster Presentation Award, the ROCLING 2017 Best Paper Award, the Academia Sinica Career Development Award in 2017, and the National Innovation Award in 2017, 2018 and 2019. He is currently an APSIPA Distinguished Lecturer, the Vice-Chair of the Speech, Language, and Audio (SLA) Technical Committee of APSIPA, and an Associate Editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing.



Lecture 3: History of Personal Media Terminals: From Walkman to Apple Watch

Akihiko K. Sugiyama

Akihiko K. Sugiyama
Yahoo, Japan

Slide

Abstract
This lecture presents a brief history of personal media terminals, highlighting the development of the Silicon Audio, the world’s first all-solid-state audio player. The background of its development, its concept, and details of early versions are explained. The family of personal media terminals is presented, followed by their impact on subsequent products such as smartphones, tablet PCs, and smart watches, as well as on production.

Biography
Akihiko Sugiyama (a.k.a. Ken Sugiyama), affiliated with Yahoo! JAPAN Research after 38 years at NEC Corporation, has been engaged in a wide variety of research projects in signal processing, such as audio coding and interference/noise control. His team at NEC developed the Silicon Audio, the world's first all-solid-state audio player and a precursor of the iPod, in 1994. He served as the Chair of the Audio and Acoustic Signal Processing Technical Committee of the IEEE Signal Processing Society (SPS) [2011-2012], as an associate editor for several journals such as IEEE Transactions on Signal Processing [1994-1996], as the Secretary and a Member at Large of the Conference Board of SPS [2010-2011], as a member of the Awards Board of SPS [2015-2017], and as the Chair of the Japan Chapter of SPS [2010-2011], and currently serves as a member of the IEEE Fellow Committee. He was a Technical Program Chair for ICASSP 2012. He has contributed to 17 chapters of books and is the inventor of 217 registered patents, with more applications pending, in the field of signal processing in Japan and overseas. He has received 19 awards, such as the 2002 IEICE Best Paper Award, the 2006 and 2018 IEICE Achievement Awards, the 2013 Ichimura Industry Award, and the 2017 APSIPA Industrial Distinguished Leader Award. He has delivered 134 invited talks in 70 cities of 27 countries.



Lecture 4: Metric Learning for Speaker Recognition

Xiaolei Zhang

Xiaolei Zhang
Northwestern Polytechnical University, China

Abstract
Metric learning, which finds a suitable similarity metric between two utterances, has received much attention in recent studies of speaker recognition. This talk will discuss some of our recent progress on metric-learning-based speaker verification and diarization. For speaker verification, we introduce two metric learning objectives (minimization of the equal error rate and maximization of the partial area under the ROC curve) that optimize the evaluation metrics directly. For speaker diarization, we introduce an unsupervised deep-learning similarity backend (the multi-layer bootstrap network) that learns a uniformly distributed feature space.
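The equal error rate (EER) mentioned in the abstract is the operating point at which the false-accept and false-reject rates coincide. The sketch below is a minimal NumPy illustration written for this page, not code from the talk; the function names `cosine_score` and `equal_error_rate` are our own, and the threshold sweep is the simplest possible estimator.

```python
import numpy as np

def cosine_score(u, v):
    """Cosine similarity between two speaker embeddings."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def equal_error_rate(target_scores, nontarget_scores):
    """EER: sweep thresholds over all observed scores and return the
    error rate where false-accept and false-reject rates are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        far = np.mean(nontarget_scores >= th)  # false-accept rate
        frr = np.mean(target_scores < th)      # false-reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
    return eer
```

With perfectly separated scores, e.g. target scores [0.9, 0.8, 0.7] against non-target scores [0.2, 0.1], the EER is 0; fully overlapping score sets push it toward 0.5.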

Biography
Xiao-Lei Zhang is currently a full professor with Northwestern Polytechnical University, Xi’an, China. He received the Ph.D. degree in information and communication engineering from Tsinghua University, Beijing, China, and did his postdoctoral research with the Department of Computer Science and Engineering, The Ohio State University, Columbus, OH. His research interests include audio and speech signal processing, machine learning, statistical signal processing, and artificial intelligence. He has published over 40 articles in Neural Networks, IEEE TPAMI, IEEE TASLP, IEEE TCYB, IEEE TSMC, ICASSP, etc. He has co-edited a textbook on statistics. He was elected an APSIPA Distinguished Lecturer. He was selected for the youth program of the National Distinguished Experts of China and the Hundred Talents Plan of Shaanxi Province. He was awarded the First-Class Beijing Science and Technology Award and the best paper award of Ubi-Media 2019. He is or has been an editor of several international journals, including Neural Networks, the EURASIP Journal on Audio, Speech, and Music Processing, IEEE Access, etc. He is a member of APSIPA, the IEEE SPS and ISCA.



Lecture 5: Generative Adversarial Networks (GANs) for Speech Technology

Hemant Patil

Hemant Patil
DA-IICT, India

Slide

Abstract
Adversarial training with Generative Adversarial Networks (GANs) is one of the most interesting and technologically challenging ideas in the field of machine learning. A GAN is a recent framework for estimating generative models via an adversarial training mechanism in which we simultaneously train two models: a generator G that captures the (true) data distribution, and a discriminator D that estimates the probability that a sample came from the training data rather than from G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game (such as the thief-and-police game!). GANs are widely used in various applications: they were first used in image processing and computer vision, and more recently in speech. Examples include image (sample) generation, single-image super-resolution, text-to-image synthesis, and several speech technology applications (mostly after 2017), such as voice conversion, Non-Audible Murmur (NAM)-to-whisper conversion, whisper-to-normal conversion, voice imitation, speech enhancement, speech synthesis, and a very recent application to speaker recognition. The objective of this talk is to first understand the fundamentals of GANs w.r.t. motivation, applications, and various GAN architectures, along with future research directions. Finally, the talk will bring out several open research problems (the relationship with variational autoencoders and their asymptotic consistency, the convergence of GANs) that need immediate attention to fully realize the potential of GANs in several technological applications.
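The minimax two-player game sketched in the abstract is usually written, following Goodfellow et al.'s original formulation, as a single value function that D maximizes and G minimizes:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D\bigl(G(z)\bigr)\right)\right]
```

Here $p_{\mathrm{data}}$ is the true data distribution and $p_z$ the prior on the generator's input noise: D's payoff grows when it labels real samples as real and generated samples as fake, while training G pushes D toward mistakes.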

Biography
Hemant A. Patil received the Ph.D. degree from the Indian Institute of Technology (IIT), Kharagpur, India, in July 2006. Since 2007, he has been a faculty member at DA-IICT Gandhinagar, India, where he developed the Speech Research Lab, recognized as one of the ISCA speech labs. Dr. Patil is a member of ISCA, IEEE, the IEEE Signal Processing Society, the IEEE Circuits and Systems Society, EURASIP and APSIPA, and an affiliate member of the IEEE SLTC. He is a regular reviewer for ICASSP and INTERSPEECH, and for journals including Speech Communication (Elsevier), Computer Speech and Language (Elsevier), the International Journal of Speech Technology (Springer), and Circuits, Systems, and Signal Processing (Springer). He has published around 240 research publications in national and international conferences, journals and book chapters. His research interests include speech and speaker recognition, analysis of spoofing attacks, TTS, and infant cry analysis. Dr. Patil has taken a lead role in organizing several ISCA-supported events, such as summer/winter schools, CEP workshops, and progress review meetings for two MeitY consortia projects, all at DA-IICT Gandhinagar. He also offered joint tutorials at APSIPA ASC 2017, APSIPA ASC 2018 and INTERSPEECH 2018. He was selected as an APSIPA Distinguished Lecturer (DL) for 2018-2019 and has delivered 20 APSIPA DL talks in three countries, namely India, China and Canada. Recently, he was selected as an ISCA Distinguished Lecturer (DL) for 2020-2021.



Lecture 6: Speech Factorization: Squeeze Stereo Information from Speech Signals

Dong Wang

Dong Wang
Tsinghua University, China

Slide

Abstract
Speech signals carry complex information whose components are entangled in an unknown way. By factorizing speech signals into (hopefully) independent factors, speech information processing tasks can be dramatically simplified. In this lecture, we will discuss various ways to factorize speech signals and show promising results with these techniques.

Biography
Prof. Dong Wang is an associate professor at Tsinghua University and the deputy director of the Center for Speech and Language Technologies (CSLT) at Tsinghua University. He obtained his Bachelor's and Master's degrees at Tsinghua University, and his PhD degree at the University of Edinburgh in 2010. Prof. Wang has worked at Oracle China, IBM China, EURECOM (France) and Nuance (US). He has worked on speech processing since 1998 and has published more than 140 academic papers. He is the chair of the APSIPA SLA track and serves as an APSIPA Distinguished Lecturer for 2018-2019.



Photo

WeChat Group

Sponsors

  • Northwest Minzu University
  • SpeechOcean
  • CEAFos

Support by

  • APSIPA ASC 2019

Organizers

Prof. Woon Seng Gan (NTU)

Prof. Dong Wang (Tsinghua University)

Prof. Axu Hu (Northwest Minzu University)