Kernel Methods and Frequency Domain Independent Component Analysis for Robust Speaker Identification
https://ir.soken.ac.jp/records/1675
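Name / File | License | Action |
---|---|---|
要旨・審査要旨 (abstract and examination summary): 甲1333_要旨.pdf (355.6 kB), https://ir.soken.ac.jp/record/1675/files/甲1333_要旨.pdf | | |
本文 (full text): 甲1333_本文.pdf (1.5 MB), https://ir.soken.ac.jp/record/1675/files/甲1333_本文.pdf | | |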
Item type | Thesis or Dissertation (1) | |||||
---|---|---|---|---|---|---|
Publication date | 2011-01-18 | |||||
Title | ||||||
Title | Kernel Methods and Frequency Domain Independent Component Analysis for Robust Speaker Identification | |||||
Title | ||||||
Language | en | |||||
Title | Kernel Methods and Frequency Domain Independent Component Analysis for Robust Speaker Identification | |||||
Language | ||||||
Language | eng | |||||
Resource type | ||||||
Resource type identifier | http://purl.org/coar/resource_type/c_46ec | |||||
Resource type | thesis | |||||
Author name | 山田, 誠 | |||||
Furigana (name reading) | ヤマダ, マコト | |||||
Author | YAMADA, Makoto | |||||
Degree-granting institution | ||||||
Degree-granting institution name | 総合研究大学院大学 (The Graduate University for Advanced Studies, SOKENDAI) | |||||
Degree name | ||||||
Degree name | 博士(統計科学) (Doctorate, Statistical Science) | |||||
Degree number | ||||||
Description type | Other | |||||
Description | 総研大甲第1333号 | |||||
Graduate school | ||||||
Value | 複合科学研究科 (School of Multidisciplinary Sciences) | |||||
Department | ||||||
Value | 15 統計科学専攻 (Department of Statistical Science) | |||||
Date of degree conferral | ||||||
Date of degree conferral | 2010-03-24 | |||||
Academic year of degree conferral | ||||||
2009 | ||||||
Abstract | ||||||
Description type | Other | |||||
Description | Speaker identification is one of the key technologies for person identification in humanoid robots. In particular, when face information is not available, speaker identification is the only way to identify a person; improving speaker identification performance is therefore an important issue for person identification tasks.<br />There are four major practical issues in speaker identification for humanoid robots. First, humanoid robots should identify the speaker in real time with high identification rates. Kernel methods such as the support vector machine (SVM) and kernel logistic regression (KLR) are currently popular for speaker identification tasks, and kernel-based systems outperform the conventional Gaussian mixture model (GMM) based system. However, kernel-based speaker identification systems are usually computationally intensive, which is undesirable for real-time implementation. To deal with this computational issue, in Chapter 3 we propose a method of approximating the sequence kernel that is shown to be computationally very efficient. More specifically, we formulate the problem of approximating the sequence kernel as the problem of obtaining a <i>pre-image</i> in a reproducing kernel Hilbert space. The effectiveness of the proposed approximation is demonstrated in text-independent speaker identification experiments with 10 male speakers: our approach provides a significant reduction in computation time while keeping performance degradation moderate. Based on the proposed method, we develop a real-time kernel-based speaker identification system using Virtual Studio Technology (VST).<br />Second, speech features vary over time due to session-dependent variation, changes in the recording environment, and physical conditions or emotions. However, conventional kernel-based systems implicitly ignore these facts and simply assume that the input probability distributions of the training and test datasets are always the same. To alleviate the influence of session-dependent variation, it is common to use several sessions of speaker utterance samples or to use <i>cepstral mean normalization</i> (CMN). However, gathering several sessions of speaker utterance data and assigning speaker IDs to the collected data are expensive in both time and cost, and therefore not realistic in practice. Moreover, it is not possible to perfectly remove session-dependent variation by CMN alone. Thus, in Chapter 4, we propose a novel semi-supervised speaker identification method that can alleviate the influence of non-stationarity such as session-dependent variation, changes in the recording environment, and physical conditions or emotions. We assume that the variation in voice quality follows the <i>covariate shift</i> model, where only the voice feature distribution changes between the training and test phases. Our method consists of weighted versions of kernel logistic regression and cross-validation and is theoretically shown to be capable of alleviating the influence of covariate shift, where the weight (a.k.a. importance) is estimated from the training and test distributions using the Kullback-Leibler Importance Estimation Procedure (KLIEP). We show experimentally through text-independent and text-dependent speaker identification simulations that the proposed method is promising in dealing with variations in voice quality.<br />Third, humanoid robots should automatically detect an unknown speaker and add that speaker to the dictionary. The speaker detection task can thus be formulated as an outlier detection problem (i.e., the outliers are the unknown speakers). Since the outlier detection problem can be solved by comparing the log-likelihoods of the unknown speaker and the known speakers, the estimation accuracy of the log-likelihoods is important for improving speaker detection performance. Thus, in Chapter 5, we propose a new importance (a.k.a. likelihood) estimation method using Gaussian mixture models (GMMs) and mixtures of probabilistic principal component analyzers (PPCAs), where the proposed approach estimates the importance without going through density estimation. An advantage of the proposed methods is that covariance matrices or projection matrices can also be learned through an expectation-maximization procedure, so the proposed methods are expected to work well when the true importance function has high correlation. Through outlier detection experiments, we show the validity of the proposed approaches.<br />Fourth, humanoid robots move through the world, and the surrounding environment, source positions, and source mixtures change constantly. In addition, speech overlaps occur frequently during conversation. Source separation techniques are therefore useful for improving speaker identification performance. To deal with these problems, in Chapter 6, we consider the problem of two-source signal separation with a two-microphone array, where a point source such as a speech signal is placed in front of the array, while no information is available about the other, <i>interference</i> signal. We propose a simple and computationally efficient method for estimating the geometry and source type (point or diffuse) of the interference signal, which allows us to adaptively choose a suitable unmixing matrix initialization scheme. Our proposed method, <i>noise adaptive optimization of matrix initialization</i> (NAOMI), is shown to be effective through source separation and speaker identification simulations. | |||||
Holdings | ||||||
Value | Yes | |||||
Format | ||||||
Description type | Other | |||||
Description | application/pdf |
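
The abstract's Chapter 4 summary says the importance weights are estimated with KLIEP and then used in weighted kernel logistic regression. As a rough illustration of the weight-estimation step only, here is a minimal KLIEP-style sketch in Python. The Gaussian kernel, the fixed width `sigma`, the step size, the toy 1-D data, and the helper names `gaussian_kernel` and `kliep_weights` are all assumptions made for illustration; none of them come from the thesis, whose actual settings are not given in this record.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel matrix between rows of x and rows of centers."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kliep_weights(x_train, x_test, sigma=1.0, n_centers=100,
                  step=0.01, n_iter=500, seed=0):
    """Estimate w(x) = p_test(x) / p_train(x) at the training points.

    Model: w(x) = sum_l alpha_l * K(x, c_l), with centers c_l drawn from
    the test samples. We maximize the mean log-weight on the test samples
    subject to alpha >= 0 and mean training weight equal to 1.
    (Hypothetical helper; hyperparameters are illustrative defaults.)
    """
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(x_test))
    centers = x_test[rng.choice(len(x_test), n_centers, replace=False)]

    A = gaussian_kernel(x_test, centers, sigma)                 # test design matrix
    b = gaussian_kernel(x_train, centers, sigma).mean(axis=0)   # mean training basis

    alpha = np.ones(n_centers)
    alpha /= b @ alpha
    for _ in range(n_iter):
        # Gradient ascent on the mean log-weight over test samples.
        alpha += step * (A.T @ (1.0 / (A @ alpha))) / len(x_test)
        # Project back onto the constraints: normalization, then non-negativity.
        alpha += (1.0 - b @ alpha) * b / (b @ b)
        alpha = np.maximum(alpha, 0.0)
        alpha /= b @ alpha

    return gaussian_kernel(x_train, centers, sigma) @ alpha

# Toy covariate shift: training inputs around 0, test inputs around 1.
rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=(300, 1))
x_test = rng.normal(1.0, 1.0, size=(200, 1))
w = kliep_weights(x_train, x_test)
```

In a weighted classifier, each training sample's loss term would be multiplied by its entry in `w`, so that training points resembling the test inputs count more. In practice the kernel width would be chosen by a model-selection procedure rather than fixed as it is here.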