|
Description |
Speaker identification is one of the key technologies for person identification in humanoid robots. In particular, when face information is not available, speaker identification is the only way to identify a person, so improving speaker identification performance is an important issue for person identification tasks.<br /> There are four major issues in speaker identification for humanoid robots in practice. First, humanoid robots should identify the speaker in real time with high identification rates. Kernel methods such as the support vector machine (SVM) and kernel logistic regression (KLR) are currently popular for speaker identification tasks, and kernel-based systems outperform the conventional Gaussian mixture model (GMM) based system. However, kernel-based speaker identification systems are usually computationally intensive, which is not preferable for real-time implementation. To deal with this computational issue, we propose a method of approximating the sequence kernel, which is shown in Chapter 3 to be computationally very efficient. More specifically, we formulate the problem of approximating the sequence kernel as the problem of obtaining a <i>pre-image</i> in a reproducing kernel Hilbert space. The effectiveness of the proposed approximation is demonstrated in text-independent speaker identification experiments with 10 male speakers: our approach provides a significant reduction in computation time while keeping performance degradation moderate. Based on the proposed method, we develop a real-time kernel-based speaker identification system using the Virtual Studio Technology (VST).<br /> Second, speech features vary over time due to session-dependent variation, changes in the recording environment, and physical conditions/emotions. 
However, conventional kernel-based systems implicitly ignore these facts and simply assume that the input probability distributions of the training and test datasets are the same at all times. To alleviate the influence of session-dependent variation, it is popular to use several sessions of speaker utterance samples or to use <i>cepstral mean normalization</i> (CMN). However, gathering several sessions of speaker utterance data and assigning speaker IDs to the collected data is expensive in both time and cost, and is therefore not realistic in practice. Moreover, it is not possible to perfectly remove session-dependent variation by CMN alone. Thus, in Chapter 4, we propose a novel semi-supervised speaker identification method that can alleviate the influence of non-stationarity such as session-dependent variation, changes in the recording environment, and physical conditions/emotions. We assume that the voice quality variants follow the <i>covariate shift</i> model, where only the voice feature distribution changes between the training and test phases. Our method consists of weighted versions of kernel logistic regression and cross-validation and is theoretically shown to have the capability of alleviating the influence of covariate shift, where the weight (a.k.a. importance) is estimated from the training and test distributions using the Kullback-Leibler Importance Estimation Procedure (KLIEP). We experimentally show through text-independent and text-dependent speaker identification simulations that the proposed method is promising in dealing with variations in voice quality.<br /> Third, humanoid robots are desired to automatically detect an unknown speaker and add the unknown speaker to the dictionary. 
Thus, the speaker detection task can be formulated as an outlier detection problem (i.e., the outliers can be the unknown speakers). Since the outlier detection problem can be solved by comparing the log likelihoods of the unknown speaker and the known speakers, the estimation accuracy of the log likelihoods is important for improving speaker detection performance. Thus, in Chapter 5, we propose a new importance (a.k.a. likelihood) estimation method using Gaussian mixture models (GMMs) and mixtures of probabilistic principal component analyzers (PPCAs), where the proposed approach estimates the importance without going through density estimation. An advantage of the proposed methods is that covariance matrices or projection matrices can also be learned through an expectation-maximization procedure, so the proposed methods are expected to work well when the true importance function has high correlation. Through outlier detection experiments, we show the validity of the proposed approaches.<br /> Fourth, humanoid robots move throughout the world, and the surrounding environment, source positions, and source mixtures are constantly changing. In addition, speech overlaps occur frequently during conversation. Thus, source separation techniques are useful for improving speaker identification performance. To deal with these problems, in Chapter 6, we consider the problem of two-source signal separation with a two-microphone array, where a point source such as a speech signal is placed in front of the two-microphone array, while no information is available about the other, <i>interference</i> signal. We propose a simple and computationally efficient method for estimating the geometry and source type (point or diffuse) of the interference signal, which allows us to adaptively choose a suitable unmixing matrix initialization scheme. 
Our proposed method, <i>noise adaptive optimization of matrix initialization</i> (NAOMI), is shown to be effective through source separation and speaker identification simulations. |