{"_buckets": {"deposit": "55e3db86-b6a0-49a5-bea5-385c13ee4b41"}, "_deposit": {"created_by": 1, "id": "855", "owners": [1], "pid": {"revision_id": 0, "type": "depid", "value": "855"}, "status": "published"}, "_oai": {"id": "oai:ir.soken.ac.jp:00000855", "sets": ["19"]}, "author_link": ["0", "0", "0"], "item_1_biblio_info_21": {"attribute_name": "書誌情報(ソート用)", "attribute_value_mlt": [{"bibliographicIssueDates": {"bibliographicIssueDate": "2006-09-29", "bibliographicIssueDateType": "Issued"}, "bibliographic_titles": [{}]}]}, "item_1_creator_2": {"attribute_name": "著者名", "attribute_type": "creator", "attribute_value_mlt": [{"creatorNames": [{"creatorName": "WANG, Yuxin"}], "nameIdentifiers": [{"nameIdentifier": "0", "nameIdentifierScheme": "WEKO"}]}]}, "item_1_creator_3": {"attribute_name": "フリガナ", "attribute_type": "creator", "attribute_value_mlt": [{"creatorNames": [{"creatorName": "ウァン, ユジン"}], "nameIdentifiers": [{"nameIdentifier": "0", "nameIdentifierScheme": "WEKO"}]}]}, "item_1_date_granted_11": {"attribute_name": "学位授与年月日", "attribute_value_mlt": [{"subitem_dategranted": "2006-09-29"}]}, "item_1_degree_grantor_5": {"attribute_name": "学位授与機関", "attribute_value_mlt": [{"subitem_degreegrantor": [{"subitem_degreegrantor_name": "総合研究大学院大学"}]}]}, "item_1_degree_name_6": {"attribute_name": "学位名", "attribute_value_mlt": [{"subitem_degreename": "博士(情報学)"}]}, "item_1_description_1": {"attribute_name": "ID", "attribute_value_mlt": [{"subitem_description": "2006520", "subitem_description_type": "Other"}]}, "item_1_description_12": {"attribute_name": "要旨", "attribute_value_mlt": [{"subitem_description": " This disseration is devoted to investigate the method for building a high-quality\u003cbr /\u003ehomepage collection from the web effciently by considering the page group struc-\u003cbr /\u003etures. We mainly investigate in researchers\u0027 homepages and homepages of other\u003cbr /\u003ecategories partly.\u003cbr /\u003e A web page collection with a guaranteed high quality (i.e., high recall and high\u003cbr /\u003eprecision) is required for implementing high quality web-based information services.\u003cbr /\u003eBuilding such a collection demands a large amount of human work, however, be-\u003cbr /\u003ecause of the diversity, vastness and sparseness of web pages. Even though many\u003cbr /\u003eresearchers have investigated methods for searching and classifying web pages, etc.,\u003cbr /\u003emost of the methods are best-effort types and pay no attention to quality assurance.\u003cbr /\u003eWe are therefore investigating a method for building a homepage collection eff-\u003cbr /\u003eciently while assuring a given high quality, with the expectation that the investigated\u003cbr /\u003emethod can be applicable to the collection of various categories of homepages.\u003cbr /\u003e This dissertation consists of seven chapters. Chapter 1 gives the introduction,\u003cbr /\u003eand Chapter 2 presents the related work. Chapter 3 describes the objectives, the\u003cbr /\u003eoverall performance goal of the investigated system, and the scheme of the system.\u003cbr /\u003eChapters 4 and 5 discuss the two parts of our two-step-processing method in detail\u003cbr /\u003erespectively. 
Chapter 6 discusses the method for reducing the processing cost of the\u003cbr /\u003esystem, and Chapter 7 concludes the dissertation by summarizing it and discussing\u003cbr /\u003efuture work.\u003cbr /\u003e Chapter 3, taking into account the enormous size of the real web, introduces a\u003cbr /\u003etwo-step-processing method comprising rough filtering and accurate classifica-\u003cbr /\u003etion. The former is for narrowing down the amount of candidate pages effciently\u003cbr /\u003ewith the required high recall and the latter is for accurately classifying the candidate\u003cbr /\u003epages into three classes-assured positive, assured negative, and uncertain-while\u003cbr /\u003e We present in detail the con?guration, the experiments, and the evaluation of\u003cbr /\u003ethe rough filtering in Chapter 4. The rough filtering is a method for gathering\u003cbr /\u003eresearchers\u0027 homepages (or entry pages) by applying our original, simple, and effec-\u003cbr /\u003etive local page group models exploiting the mutual relations between the structure\u003cbr /\u003eand the content of a logical page group. It aims at narrowing down the candidates\u003cbr /\u003ewith a very high recall. First, property-based keyword lists that correspond to\u003cbr /\u003eresearchers\u0027 common properties are created and are grouped either as organization-\u003cbr /\u003erelated or non-organization-related. Next, four page group models (PGMs)\u003cbr /\u003etaking into consideration the structure in an individual logical page group are intro-\u003cbr /\u003educed. PGM_Od models the out-linked pages in the same and lower directories,\u003cbr /\u003ePGM Ou models the out-linked pages in the upper directories, PGM_I models\u003cbr /\u003ethe in-linked pages in the same and the upper directories, and PGM_U models the\u003cbr /\u003esite top and the directory entry pages in the same and the upper directories.\u003cbr /\u003e Based on the PGMs, the keywords are propagated to a potential entry page from\u003cbr /\u003eits surrounding pages to compose a virtual entry page. Finally, the virtual entry\u003cbr /\u003epages that scored at least a threshold value are selected. Since the application of\u003cbr /\u003ePGMs generally causes a lot of noises, we introduced four modified PGMs with\u003cbr /\u003etwo original techniques: the keywords are propagated based on PGM_Od only when\u003cbr /\u003ethe number of out-linked pages in the same and lower directories is less than a\u003cbr /\u003ethreshold value, and only the organization-related keywords are propagated based\u003cbr /\u003eon other PGMs. The four modified PGMs are used in combination in order to utilize\u003cbr /\u003eas many informative keywords as possible from the surrounding pages.\u003cbr /\u003e The effectiveness of the method is shown by comparing it with that of a single-\u003cbr /\u003epage-based method through experiments using a 100GB web data set and a manually\u003cbr /\u003ecreated sample data set. The experiment results show that the output pages from\u003cbr /\u003ethe rough filtering are less than 23% of the pages in the 100GB data set when the\u003cbr /\u003efour modified PGMs are used in combination under a condition that the recall is\u003cbr /\u003emore than 98%. 
Another experiment using a 1.36TB web data set with the same\u003cbr /\u003erough filtering configuration shows that the output pages are less than 15% of the\u003cbr /\u003epages in the corpus.\u003cbr /\u003e In Chapter 5 we present in detail the configuration, the experiments, and the\u003cbr /\u003eevaluation of the accurate classification method. Using two types of component\u003cbr /\u003eclassifiers (a recall-assured classifier and a precision-assured classifier) in\u003cbr /\u003ecombination, we construct a three-way classifier that inputs the candidate pages\u003cbr /\u003eoutput by the rough filtering and classifies them to three classes: assured posi-\u003cbr /\u003etive, assured negative, and uncertain. The assured positive output assures the\u003cbr /\u003eprecision and the assured positive and uncertain output together assure the recall,\u003cbr /\u003eso only the uncertain output needs to be manually assessed in order to assure the\u003cbr /\u003equality of the web data collection.\u003cbr /\u003e We first devise a feature set for building the high-performance component clas-\u003cbr /\u003esifiers using Support Vector Machine (SVM). We use textual features obtained\u003cbr /\u003efrom each page and its surrounding pages. After the surrounding pages are grouped\u003cbr /\u003eaccording to connection types (in-link, out-link, and directory entry) and relative\u003cbr /\u003eURL hierarchy (same, upper, or lower in the directory hierarchy), an independent\u003cbr /\u003efeature subset is generated from each group. Feature subsets are further concate-\u003cbr /\u003enated conceptually to compose the feature set of a classifier. We use two types of\u003cbr /\u003etextual features (plain-text-based and tagged-text-based). The classifier using\u003cbr /\u003eonly the plain-text-based features in each page alone is used as the baseline. Various\u003cbr /\u003efeature sets are tested in the experiment using manually prepared sample data, and\u003cbr /\u003ethe classifiers are tuned by two methods, one offset-based and the other c-j-option-\u003cbr /\u003ebased. The results show that the performance obtained by using c-j-option-based\u003cbr /\u003etuning method is statistically signi?cant at 95% confidence level. The F-measures\u003cbr /\u003eof the baseline and the top two performed classifiers are 83.26%, 88.65%, and 88.58%\u003cbr /\u003eand show that the proposed method is evidently effective.\u003cbr /\u003e To know the performances of the classiffers with the abovementioned feature sets\u003cbr /\u003ein more general cases, we experimented with our method on the Web-\u003eKB data\u003cbr /\u003eset, a test collection commonly used for the web page classi?cation task. It contains\u003cbr /\u003eseven categories and four of them-course, faculty, project, and student-are used\u003cbr /\u003efor comparing the performance. The experiment results show that our method out-\u003cbr /\u003eperformed all seven of the previous methods in terms of macro-averaged F-measure.\u003cbr /\u003eWe can therefore conclude that our method performs fairly well and is applicable not\u003cbr /\u003eonly to researchers\u0027 homepages in Japanese but also to other categories of homepages\u003cbr /\u003ein other languages.\u003cbr /\u003e By tuning the well-performing classifiers independently, we then build a recall-\u003cbr /\u003eassured classifier and a precision-assured classifier and compose a three-way classi-\u003cbr /\u003efier by using them in combination. 
We estimated the numbers of the pages to be\u003cbr /\u003emanually assessed for the required precision/recall at 99.5%/98%, 99%/95%, and\u003cbr /\u003e98%/90%, using the output pages from a 100GB data set through the rough filter-\u003cbr /\u003eing. The results show that the manual assessment cost can be reduced, compared\u003cbr /\u003eto the baseline, down to 77.6%, 57.3%, and 51.8%, respectively. We analyzed clas-\u003cbr /\u003esification result examples, and the results show the effectiveness of the classifiers.\u003cbr /\u003e In Chapter 6 the cascaded structure of the recall-assured classifiers, used in\u003cbr /\u003ecombination with the rough filtering, is proposed for reducing the computer pro-\u003cbr /\u003ecessing cost. Estimation on the numbers of pages requiring feature extraction in the\u003cbr /\u003eaccurate classification shows that the computer processing cost can be reduced\u003cbr /\u003edown to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.\u003cbr /\u003e In Chapter 7 we summarize our contributions. One of our unique contributions\u003cbr /\u003eis that we pointed out the importance of assuring the quality of web page collection\u003cbr /\u003eand proposed a framework for doing so. Another is that we introduced an idea of\u003cbr /\u003elocal page group models (PGMs) and demonstrated its effective uses for filtering\u003cbr /\u003eand classifying web pages.\u003cbr /\u003e We first presented a realistic framework for building a high-quality web page\u003cbr /\u003ecollection with a two-step process, composing the rough filtering followed by the\u003cbr /\u003eaccurate classification, in order to reduce the processing cost. In the rough filtering\u003cbr /\u003ewe contributed two original key techniques used in the modified PGMs to reduce the\u003cbr /\u003eirrelevant keywords to be propagated. One is to introduce a threshold on the number\u003cbr /\u003eof out-linked pages in the same and lower directories, and the other is to introduce\u003cbr /\u003ekeyword list types and propagate only the organization-related keyword lists from\u003cbr /\u003ethe upper directories. 
In the accurate classification we contributed not only a original\u003cbr /\u003emethod for exploiting features from the surrounding pages and concatenating the\u003cbr /\u003efeatures independently to improve web page classification performance but also a\u003cbr /\u003eway to use a recall-assured classifier and a precision-assured classifier in combination\u003cbr /\u003eas a three-way classifier in order to reduce the amount of pages requiring manual\u003cbr /\u003eassessment under the given quality constraints.\u003cbr /\u003e We also discuss the future work: finding a more systematic way for modifying\u003cbr /\u003ethe property set and property-based keywords for the rough ?ltering, investigating\u003cbr /\u003eways to estimate the likelihood of the component pages and incorporate them for\u003cbr /\u003ethe accurate classification, and further utilizing the information from the homepage\u003cbr /\u003ecollection for practical applications.", "subitem_description_type": "Other"}]}, "item_1_description_18": {"attribute_name": "フォーマット", "attribute_value_mlt": [{"subitem_description": "application/pdf", "subitem_description_type": "Other"}]}, "item_1_description_7": {"attribute_name": "学位記番号", "attribute_value_mlt": [{"subitem_description": "総研大甲第1000号", "subitem_description_type": "Other"}]}, "item_1_select_14": {"attribute_name": "所蔵", "attribute_value_mlt": [{"subitem_select_item": "有"}]}, "item_1_select_8": {"attribute_name": "研究科", "attribute_value_mlt": [{"subitem_select_item": "複合科学研究科"}]}, "item_1_select_9": {"attribute_name": "専攻", "attribute_value_mlt": [{"subitem_select_item": "17 情報学専攻"}]}, "item_1_text_10": {"attribute_name": "学位授与年度", "attribute_value_mlt": [{"subitem_text_value": "2006"}]}, "item_creator": {"attribute_name": "著者", "attribute_type": "creator", "attribute_value_mlt": [{"creatorNames": [{"creatorName": "WANG, Yuxin", "creatorNameLang": "en"}], "nameIdentifiers": [{"nameIdentifier": "0", "nameIdentifierScheme": "WEKO"}]}]}, "item_files": {"attribute_name": "ファイル情報", "attribute_type": "file", "attribute_value_mlt": [{"accessrole": "open_date", "date": [{"dateType": "Available", "dateValue": "2016-02-17"}], "displaytype": "simple", "download_preview_message": "", "file_order": 0, "filename": "甲1000_要旨.pdf", "filesize": [{"value": "394.0 kB"}], "format": "application/pdf", "future_date_message": "", "is_thumbnail": false, "licensetype": "license_11", "mimetype": "application/pdf", "size": 394000.0, "url": {"label": "要旨・審査要旨", "url": "https://ir.soken.ac.jp/record/855/files/甲1000_要旨.pdf"}, "version_id": "9fb578af-2f50-4356-bc3f-af1341d304f7"}, {"accessrole": "open_date", "date": [{"dateType": "Available", "dateValue": "2016-02-17"}], "displaytype": "simple", "download_preview_message": "", "file_order": 1, "filename": "甲1000_本文.pdf", "filesize": [{"value": "2.3 MB"}], "format": "application/pdf", "future_date_message": "", "is_thumbnail": false, "licensetype": "license_11", "mimetype": "application/pdf", "size": 2300000.0, "url": {"label": "本文", "url": "https://ir.soken.ac.jp/record/855/files/甲1000_本文.pdf"}, "version_id": "5df2d64c-2e80-447b-97be-9081382713a1"}]}, "item_language": {"attribute_name": "言語", "attribute_value_mlt": [{"subitem_language": "eng"}]}, "item_resource_type": {"attribute_name": "資源タイプ", "attribute_value_mlt": [{"resourcetype": "thesis", "resourceuri": "http://purl.org/coar/resource_type/c_46ec"}]}, "item_title": "Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures", "item_titles": 
{"attribute_name": "タイトル", "attribute_value_mlt": [{"subitem_title": "Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures"}, {"subitem_title": "Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures", "subitem_title_language": "en"}]}, "item_type_id": "1", "owner": "1", "path": ["19"], "permalink_uri": "https://ir.soken.ac.jp/records/855", "pubdate": {"attribute_name": "公開日", "attribute_value": "2010-02-22"}, "publish_date": "2010-02-22", "publish_status": "0", "recid": "855", "relation": {}, "relation_version_is_last": true, "title": ["Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures"], "weko_shared_id": -1}
Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures
https://ir.soken.ac.jp/records/855
| Name / File | License | Action |
|---|---|---|
| 要旨・審査要旨 (abstract and examination summary): 甲1000_要旨.pdf (394.0 kB) | license_11 | https://ir.soken.ac.jp/record/855/files/甲1000_要旨.pdf |
| 本文 (full text): 甲1000_本文.pdf (2.3 MB) | license_11 | https://ir.soken.ac.jp/record/855/files/甲1000_本文.pdf |
| Field | Value |
|---|---|
| Item type | 学位論文 / Thesis or Dissertation (1) |
| Release date | 2010-02-22 |
| Title | Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures |
| Title (language: en) | Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures |
| Language | eng |
| Resource type identifier | http://purl.org/coar/resource_type/c_46ec |
| Resource type | thesis |
| Author | WANG, Yuxin |
| Reading (furigana) | ウァン, ユジン |
| Author (en) | WANG, Yuxin |
| Degree-granting institution | 総合研究大学院大学 (The Graduate University for Advanced Studies, SOKENDAI) |
| Degree name | 博士(情報学) (Doctor of Philosophy, Informatics) |
| Degree number (description type: Other) | 総研大甲第1000号 |
| Graduate school | 複合科学研究科 (School of Multidisciplinary Sciences) |
| Department | 17 情報学専攻 (Department of Informatics) |
| Date of degree conferral | 2006-09-29 |
| Academic year of degree conferral | 2006 |

Abstract (description type: Other):
This dissertation is devoted to investigating a method for building a high-quality homepage collection from the web efficiently by considering page group structures. We mainly investigate researchers' homepages, and partly homepages of other categories.

A web page collection with a guaranteed high quality (i.e., high recall and high precision) is required for implementing high-quality web-based information services. Building such a collection demands a large amount of human work, however, because of the diversity, vastness, and sparseness of web pages. Even though many researchers have investigated methods for searching and classifying web pages, most of those methods are best-effort types and pay no attention to quality assurance. We are therefore investigating a method for building a homepage collection efficiently while assuring a given high quality, with the expectation that the method will be applicable to collections of various categories of homepages.

This dissertation consists of seven chapters. Chapter 1 gives the introduction, and Chapter 2 presents the related work. Chapter 3 describes the objectives, the overall performance goal of the investigated system, and the scheme of the system. Chapters 4 and 5 discuss the two parts of our two-step-processing method in detail, respectively. Chapter 6 discusses the method for reducing the processing cost of the system, and Chapter 7 concludes the dissertation by summarizing it and discussing future work.

Chapter 3, taking into account the enormous size of the real web, introduces a two-step-processing method comprising rough filtering and accurate classification. The former narrows down the candidate pages efficiently with the required high recall, and the latter accurately classifies the candidate pages into three classes (assured positive, assured negative, and uncertain) while assuring the given quality.

We present in detail the configuration, the experiments, and the evaluation of the rough filtering in Chapter 4. The rough filtering is a method for gathering researchers' homepages (or entry pages) by applying our original, simple, and effective local page group models, exploiting the mutual relations between the structure and the content of a logical page group. It aims at narrowing down the candidates with a very high recall. First, property-based keyword lists that correspond to researchers' common properties are created and grouped as either organization-related or non-organization-related. Next, four page group models (PGMs) that take into consideration the structure of an individual logical page group are introduced: PGM_Od models the out-linked pages in the same and lower directories, PGM_Ou models the out-linked pages in the upper directories, PGM_I models the in-linked pages in the same and upper directories, and PGM_U models the site top and the directory entry pages in the same and upper directories.

Based on the PGMs, keywords are propagated to a potential entry page from its surrounding pages to compose a virtual entry page. Finally, the virtual entry pages that score at least a threshold value are selected.
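The propagation-and-threshold step can be pictured with a minimal Python sketch. The `Page` data model, the group layout, and the weighted-sum scoring are illustrative assumptions, not the thesis implementation; the modified PGMs described next would further restrict what each group may propagate.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str
    keywords: set = field(default_factory=set)  # property-based keywords found on the page

# The four PGM names; each maps to the surrounding pages that model covers.
PGM_NAMES = ("PGM_Od", "PGM_Ou", "PGM_I", "PGM_U")

def virtual_entry_page(candidate: Page, pgm_groups: dict) -> set:
    """Merge the candidate's own keywords with those propagated from its
    surrounding pages under each PGM, composing a 'virtual entry page'."""
    merged = set(candidate.keywords)
    for name in PGM_NAMES:
        for page in pgm_groups.get(name, []):
            merged |= page.keywords
    return merged

def passes_rough_filtering(candidate: Page, pgm_groups: dict,
                           keyword_weights: dict, threshold: float) -> bool:
    # Assumed scoring function: sum of per-keyword weights on the virtual
    # entry page; candidates scoring at least the threshold are kept.
    score = sum(keyword_weights.get(kw, 0.0)
                for kw in virtual_entry_page(candidate, pgm_groups))
    return score >= threshold
```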
Since the application of the PGMs generally introduces a lot of noise, we introduced four modified PGMs with two original techniques: keywords are propagated based on PGM_Od only when the number of out-linked pages in the same and lower directories is less than a threshold value, and only the organization-related keywords are propagated based on the other PGMs. The four modified PGMs are used in combination in order to utilize as many informative keywords as possible from the surrounding pages.

The effectiveness of the method is shown by comparing it with that of a single-page-based method through experiments using a 100GB web data set and a manually created sample data set. The results show that the output pages from the rough filtering are less than 23% of the pages in the 100GB data set when the four modified PGMs are used in combination under the condition that the recall is more than 98%. Another experiment using a 1.36TB web data set with the same rough filtering configuration shows that the output pages are less than 15% of the pages in the corpus.

In Chapter 5 we present in detail the configuration, the experiments, and the evaluation of the accurate classification method. Using two types of component classifiers (a recall-assured classifier and a precision-assured classifier) in combination, we construct a three-way classifier that takes the candidate pages output by the rough filtering and classifies them into three classes: assured positive, assured negative, and uncertain. The assured positive output assures the precision, and the assured positive and uncertain outputs together assure the recall, so only the uncertain output needs to be manually assessed in order to assure the quality of the web data collection.

We first devise a feature set for building the high-performance component classifiers using a Support Vector Machine (SVM). We use textual features obtained from each page and its surrounding pages. After the surrounding pages are grouped according to connection type (in-link, out-link, and directory entry) and relative URL hierarchy (same, upper, or lower in the directory hierarchy), an independent feature subset is generated from each group. The feature subsets are then conceptually concatenated to compose the feature set of a classifier. We use two types of textual features (plain-text-based and tagged-text-based). The classifier using only the plain-text-based features of each page alone is used as the baseline. Various feature sets are tested in experiments using manually prepared sample data, and the classifiers are tuned by two methods, one offset-based and the other c-j-option-based. The results show that the improvement obtained with the c-j-option-based tuning method is statistically significant at the 95% confidence level. The F-measures of the baseline and the top two performing classifiers are 83.26%, 88.65%, and 88.58%, showing that the proposed method is evidently effective.

To assess the performance of the classifiers with the abovementioned feature sets in more general cases, we experimented with our method on the WebKB data set, a test collection commonly used for the web page classification task. It contains seven categories, and four of them (course, faculty, project, and student) are used for comparing the performance.
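A short sketch of the per-group feature subsets may help here. Everything concrete in it (the group names, the TF-IDF features, scikit-learn) is an assumption standing in for the thesis's own feature extraction; only the idea of one independent subset per group, concatenated into a single feature set, comes from the text above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# One group per (connection type, relative URL hierarchy) combination,
# plus the candidate page itself; the names are illustrative.
GROUPS = ["self", "in_same", "in_upper",
          "out_same", "out_lower", "out_upper", "dir_entry"]

def fit_group_vectorizers(train_samples):
    """train_samples: list of dicts mapping a group name to the concatenated
    text of the pages in that group. Each group gets its own, independently
    fitted vectorizer, i.e. its own feature subset."""
    return {g: TfidfVectorizer().fit(s.get(g, "") for s in train_samples)
            for g in GROUPS}

def feature_matrix(samples, vectorizers):
    # Concatenate the per-group subsets to compose the classifier's feature set.
    return hstack([vectorizers[g].transform([s.get(g, "") for s in samples])
                   for g in GROUPS])
```

An SVM (e.g., sklearn.svm.LinearSVC) could then be trained on the concatenated matrix; the recall-assured and precision-assured component classifiers would differ only in how their decision thresholds are tuned.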
The experiment results show that our method outperformed all seven of the previous methods in terms of macro-averaged F-measure. We can therefore conclude that our method performs fairly well and is applicable not only to researchers' homepages in Japanese but also to other categories of homepages in other languages.

By tuning the well-performing classifiers independently, we then build a recall-assured classifier and a precision-assured classifier and compose a three-way classifier by using them in combination. We estimated the numbers of pages to be manually assessed for required precision/recall levels of 99.5%/98%, 99%/95%, and 98%/90%, using the output pages from a 100GB data set through the rough filtering. The results show that the manual assessment cost can be reduced, compared to the baseline, down to 77.6%, 57.3%, and 51.8%, respectively. We analyzed classification result examples, and the results show the effectiveness of the classifiers.

In Chapter 6 a cascaded structure of recall-assured classifiers, used in combination with the rough filtering, is proposed for reducing the computer processing cost. Estimates of the numbers of pages requiring feature extraction in the accurate classification show that the computer processing cost can be reduced down to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.

In Chapter 7 we summarize our contributions. One of our unique contributions is that we pointed out the importance of assuring the quality of a web page collection and proposed a framework for doing so. Another is that we introduced the idea of local page group models (PGMs) and demonstrated its effective use for filtering and classifying web pages.

We first presented a realistic framework for building a high-quality web page collection with a two-step process, composed of the rough filtering followed by the accurate classification, in order to reduce the processing cost. In the rough filtering we contributed two original key techniques used in the modified PGMs to reduce the irrelevant keywords to be propagated: one is to introduce a threshold on the number of out-linked pages in the same and lower directories, and the other is to introduce keyword list types and propagate only the organization-related keyword lists from the upper directories. In the accurate classification we contributed not only an original method for exploiting features from the surrounding pages and concatenating the features independently to improve web page classification performance but also a way to use a recall-assured classifier and a precision-assured classifier in combination as a three-way classifier in order to reduce the number of pages requiring manual assessment under the given quality constraints.

We also discuss future work: finding a more systematic way of modifying the property set and the property-based keywords for the rough filtering, investigating ways to estimate the likelihood of the component pages and incorporate it into the accurate classification, and further utilizing the information from the homepage collection for practical applications.
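To make the combination logic concrete, here is a compact sketch of the three-way decision and of the Chapter-6 cascade idea. The callable signatures are assumptions, and the component classifiers are taken as already tuned to the stated recall/precision levels:

```python
from typing import Callable, Iterable, List, Tuple

def three_way_classify(features,
                       recall_assured: Callable[[object], bool],
                       precision_assured: Callable[[object], bool]) -> str:
    """Combine a recall-assured and a precision-assured classifier."""
    if not recall_assured(features):
        return "assured negative"   # safe to discard: the recall is assured
    if precision_assured(features):
        return "assured positive"   # safe to accept: the precision is assured
    return "uncertain"              # only these pages need manual assessment

def cascaded_filter(pages: Iterable,
                    stages: List[Tuple[Callable, Callable]]) -> list:
    """stages: (extract_features, recall_assured_classifier) pairs ordered
    from cheap to expensive features. Pages rejected by an early recall-assured
    stage never reach the costlier feature extraction of later stages, which
    is where the processing-cost reduction comes from."""
    survivors = list(pages)
    for extract, recall_assured in stages:
        survivors = [p for p in survivors if recall_assured(extract(p))]
    return survivors
```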
| Field | Value |
|---|---|
| Holdings | 有 (held) |
| Format (description type: Other) | application/pdf |