Kernel Methods and Frequency Domain Independent Component Analysis for Robust Speaker Identification
https://ir.soken.ac.jp/records/1675
Name / File | License | Action |
---|---|---|
Abstract and examination summary (355.6 kB) | | |
Full text (1.5 MB) | | |
Item type | Thesis or Dissertation(1) |
---|---|
Publication date | 2011-01-18 |
Title | |
Title | Kernel Methods and Frequency Domain Independent Component Analysis for Robust Speaker Identification |
Language | en |
Language | |
Language | eng |
Resource type | |
Resource type identifier | http://purl.org/coar/resource_type/c_46ec |
Resource type | thesis |
Author name | 山田, 誠 |
Reading (furigana) | ヤマダ, マコト |
Author | YAMADA, Makoto |
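Degree-granting institution | |
Degree-granting institution name | 総合研究大学院大学 (The Graduate University for Advanced Studies, SOKENDAI) |
Degree name | |
Degree name | Doctor of Philosophy (Statistical Science) |
Degree certificate number | |
Description type | Other |
Description | 総研大甲第1333号 (SOKENDAI, Kō No. 1333) |
School | |
Value | School of Multidisciplinary Sciences |
Department | |
Value | 15 Department of Statistical Science |
Date of degree conferral | |
Date of degree conferral | 2010-03-24 |
Academic year of degree conferral | |
Value | 2009 |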
Abstract | |
Description type | Other |
Description | Speaker identification is one of the key technologies for person identification in humanoid robots. In particular, when face information is not available, speaker identification is the only way to identify a person, so improving speaker identification performance is an important issue for person identification tasks.
There are four major issues in speaker identification for humanoid robots in practice. First, humanoid robots should identify the speaker in real time with high identification rates. Kernel methods such as the support vector machine (SVM) and kernel logistic regression (KLR) are now popular for speaker identification tasks, and kernel-based systems outperform the conventional Gaussian mixture model (GMM) based system. However, kernel-based speaker identification systems are usually computationally intensive, which is undesirable for real-time implementation. To deal with this computational issue, in Chapter 3 we propose a method of approximating the sequence kernel that is shown to be computationally very efficient. More specifically, we formulate the problem of approximating the sequence kernel as the problem of obtaining a pre-image in a reproducing kernel Hilbert space. The effectiveness of the proposed approximation is demonstrated in text-independent speaker identification experiments with 10 male speakers: our approach provides a significant reduction in computation time while keeping performance degradation moderate. Based on the proposed method, we develop a real-time kernel-based speaker identification system using Virtual Studio Technology (VST).
Second, speech features vary over time due to session-dependent variation, changes in the recording environment, and physical conditions/emotions. However, conventional kernel-based systems implicitly ignore these facts and simply assume that the input probability distributions of the training and test datasets are the same at all times. To alleviate the influence of session-dependent variation, it is common to use several sessions of speaker utterance samples or to use cepstral mean normalization (CMN). However, gathering several sessions of speaker utterance data and assigning speaker IDs to the collected data are expensive in both time and cost, and therefore not realistic in practice. Moreover, it is not possible to remove session-dependent variation perfectly by CMN alone. Thus, in Chapter 4, we propose a novel semi-supervised speaker identification method that can alleviate the influence of non-stationarity such as session-dependent variation, changes in the recording environment, and physical conditions/emotions. We assume that the voice quality variants follow the covariate shift model, where only the voice feature distribution changes between the training and test phases. Our method consists of weighted versions of kernel logistic regression and cross-validation and is theoretically shown to be capable of alleviating the influence of covariate shift, where the weight (a.k.a. importance) is estimated from the training and test distributions using the Kullback-Leibler Importance Estimation Procedure (KLIEP). We show experimentally, through text-independent and text-dependent speaker identification simulations, that the proposed method is promising in dealing with variations in voice quality.
Third, humanoid robots are expected to automatically detect an unknown speaker and add the unknown speaker to the dictionary. Thus, the speaker detection task can be formulated as an outlier detection problem (i.e., the outliers are the unknown speakers). Since the outlier detection problem can be solved through the comparison between the log likelihoods of the unknown speaker and the speakers, the estimation accuracy of the log likelihoods is an important issue for improving speaker detection performance. Thus, in Chapter 5, we propose a new importance (a.k.a. likelihood) estimation method using Gaussian mixture models (GMMs) and mixtures of probabilistic principal component analyzers (PPCAs), where the proposed approach estimates the importance without going through density estimation. An advantage of the proposed methods is that covariance matrices or projection matrices can also be learned through an expectation-maximization procedure, so the proposed methods are expected to work well when the true importance function has high correlation. Through outlier detection experiments, we show the validity of the proposed approaches.
Fourth, humanoid robots move throughout the world, and the surrounding environment, source positions, and source mixtures are constantly changing. In addition, speech overlaps occur frequently during conversation. Thus, source separation techniques are useful for improving speaker identification performance. To deal with these problems, in Chapter 6, we consider the problem of two-source signal separation with a two-microphone array, where a point source such as a speech signal is placed in front of the two-microphone array, while no information is available about the other, interference signal. We propose a simple and computationally efficient method for estimating the geometry and source type (point or diffuse) of the interference signal, which allows us to adaptively choose a suitable unmixing matrix initialization scheme. Our proposed method, noise adaptive optimization of matrix initialization (NAOMI), is shown to be effective through source separation and speaker identification simulations. |
Holdings | |
Value | Yes |
Format | |
Description type | Other |
Description | application/pdf |
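
Two ideas named in the abstract can be illustrated in a few lines: cepstral mean normalization (CMN) and importance weighting under covariate shift. The Python sketch below is not taken from the thesis. The function names and toy numbers are hypothetical, and the importance weight is computed from two known one-dimensional Gaussian densities purely for illustration, whereas KLIEP, as the abstract describes it, estimates the weights without going through density estimation.

```python
import math


def cepstral_mean_normalization(frames):
    """Subtract the per-coefficient mean taken over all frames of one utterance.

    `frames` is a list of frames, each a list of cepstral coefficients.
    """
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weight(x, train_mean, test_mean, std):
    """Importance w(x) = p_test(x) / p_train(x).

    Toy assumption: both input densities are known Gaussians with a shared
    standard deviation. In practice the densities are unknown.
    """
    return gaussian_pdf(x, test_mean, std) / gaussian_pdf(x, train_mean, std)


# Toy utterance: 3 frames, 2 cepstral coefficients each (made-up values).
utterance = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
# The same utterance with a constant offset added to every coefficient,
# a crude stand-in for a session-dependent channel effect.
shifted = [[c + 10.0 for c in frame] for frame in utterance]

normalized = cepstral_mean_normalization(utterance)
normalized_shifted = cepstral_mean_normalization(shifted)

# Importance weights for a training density centered at 0 and a test density
# centered at 1: points that are more typical of the test data get weight > 1.
w_mid = importance_weight(0.5, train_mean=0.0, test_mean=1.0, std=1.0)
w_near_test = importance_weight(1.0, train_mean=0.0, test_mean=1.0, std=1.0)
w_near_train = importance_weight(0.0, train_mean=0.0, test_mean=1.0, std=1.0)
```

CMN cancels a constant offset exactly, so `normalized` and `normalized_shifted` coincide here; variation that is not a constant offset would survive, which is consistent with the abstract's remark that CMN alone cannot remove session-dependent variation perfectly. In an importance-weighted learner such as the weighted kernel logistic regression mentioned in the abstract, each training sample's loss term would be multiplied by a weight of this kind.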