{"created":"2023-06-20T13:20:48.047897+00:00","id":855,"links":{},"metadata":{"_buckets":{"deposit":"55e3db86-b6a0-49a5-bea5-385c13ee4b41"},"_deposit":{"created_by":1,"id":"855","owners":[1],"pid":{"revision_id":0,"type":"depid","value":"855"},"status":"published"},"_oai":{"id":"oai:ir.soken.ac.jp:00000855","sets":["2:429:19"]},"author_link":["0","0","0"],"item_1_creator_2":{"attribute_name":"著者名","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"WANG, Yuxin"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_1_creator_3":{"attribute_name":"フリガナ","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"ウァン, ユジン"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_1_date_granted_11":{"attribute_name":"学位授与年月日","attribute_value_mlt":[{"subitem_dategranted":"2006-09-29"}]},"item_1_degree_grantor_5":{"attribute_name":"学位授与機関","attribute_value_mlt":[{"subitem_degreegrantor":[{"subitem_degreegrantor_name":"総合研究大学院大学"}]}]},"item_1_degree_name_6":{"attribute_name":"学位名","attribute_value_mlt":[{"subitem_degreename":"博士（情報学）"}]},"item_1_description_12":{"attribute_name":"要旨","attribute_value_mlt":[{"subitem_description":"　This dissertation investigates a method for efficiently building a high-quality<br />homepage collection from the web by considering the page group struc-<br />tures. We mainly investigate researchers' homepages and, in part, homepages of other<br />categories.<br />　A web page collection with a guaranteed high quality (i.e., high recall and high<br />precision) is required for implementing high-quality web-based information services.<br />Building such a collection demands a large amount of human work, however, be-<br />cause of the diversity, vastness and sparseness of web pages. 
Even though many<br />researchers have investigated methods for searching and classifying web pages, etc.,<br />most of the methods are best-effort types and pay no attention to quality assurance.<br />We are therefore investigating a method for building a homepage collection effi-<br />ciently while assuring a given high quality, with the expectation that the investigated<br />method can be applicable to the collection of various categories of homepages.<br />　This dissertation consists of seven chapters. Chapter 1 gives the introduction,<br />and Chapter 2 presents the related work. Chapter 3 describes the objectives, the<br />overall performance goal of the investigated system, and the scheme of the system.<br />Chapters 4 and 5 discuss the two parts of our two-step-processing method in detail<br />respectively. Chapter 6 discusses the method for reducing the processing cost of the<br />system, and Chapter 7 concludes the dissertation by summarizing it and discussing<br />future work.<br />　Chapter 3, taking into account the enormous size of the real web, introduces a<br />two-step-processing method comprising rough filtering and accurate classifica-<br />tion. The former is for narrowing down the number of candidate pages efficiently<br />with the required high recall and the latter is for accurately classifying the candidate<br />pages into three classes-assured positive, assured negative, and uncertain-while<br />　We present in detail the configuration, the experiments, and the evaluation of<br />the rough filtering in Chapter 4. The rough filtering is a method for gathering<br />researchers' homepages (or entry pages) by applying our original, simple, and effec-<br />tive local page group models exploiting the mutual relations between the structure<br />and the content of a logical page group. It aims at narrowing down the candidates<br />with a very high recall. 
First, property-based keyword lists that correspond to<br />researchers' common properties are created and are grouped either as organization-<br />related or non-organization-related. Next, four page group models (PGMs)<br />taking into consideration the structure in an individual logical page group are intro-<br />duced. PGM_Od models the out-linked pages in the same and lower directories,<br />PGM_Ou models the out-linked pages in the upper directories, PGM_I models<br />the in-linked pages in the same and the upper directories, and PGM_U models the<br />site top and the directory entry pages in the same and the upper directories.<br />　Based on the PGMs, the keywords are propagated to a potential entry page from<br />its surrounding pages to compose a virtual entry page. Finally, the virtual entry<br />pages that scored at least a threshold value are selected. Since the application of<br />PGMs generally introduces a lot of noise, we introduced four modified PGMs with<br />two original techniques: the keywords are propagated based on PGM_Od only when<br />the number of out-linked pages in the same and lower directories is less than a<br />threshold value, and only the organization-related keywords are propagated based<br />on other PGMs. The four modified PGMs are used in combination in order to utilize<br />as many informative keywords as possible from the surrounding pages.<br />　The effectiveness of the method is shown by comparing it with that of a single-<br />page-based method through experiments using a 100GB web data set and a manually<br />created sample data set. The experimental results show that the output pages from<br />the rough filtering are less than 23% of the pages in the 100GB data set when the<br />four modified PGMs are used in combination under a condition that the recall is<br />more than 98%. 
Another experiment using a 1.36TB web data set with the same<br />rough filtering configuration shows that the output pages are less than 15% of the<br />pages in the corpus.<br />　In Chapter 5 we present in detail the configuration, the experiments, and the<br />evaluation of the accurate classification method. Using two types of component<br />classifiers (a recall-assured classifier and a precision-assured classifier) in<br />combination, we construct a three-way classifier that inputs the candidate pages<br />output by the rough filtering and classifies them into three classes: assured posi-<br />tive, assured negative, and uncertain. The assured positive output assures the<br />precision and the assured positive and uncertain output together assure the recall,<br />so only the uncertain output needs to be manually assessed in order to assure the<br />quality of the web data collection.<br />　We first devise a feature set for building the high-performance component clas-<br />sifiers using Support Vector Machine (SVM). We use textual features obtained<br />from each page and its surrounding pages. After the surrounding pages are grouped<br />according to connection types (in-link, out-link, and directory entry) and relative<br />URL hierarchy (same, upper, or lower in the directory hierarchy), an independent<br />feature subset is generated from each group. Feature subsets are further concate-<br />nated conceptually to compose the feature set of a classifier. We use two types of<br />textual features (plain-text-based and tagged-text-based). The classifier using<br />only the plain-text-based features in each page alone is used as the baseline. Various<br />feature sets are tested in the experiment using manually prepared sample data, and<br />the classifiers are tuned by two methods, one offset-based and the other c-j-option-<br />based. 
The results show that the performance obtained by using the c-j-option-based<br />tuning method is statistically significant at 95% confidence level. The F-measures<br />of the baseline and the top two performing classifiers are 83.26%, 88.65%, and 88.58%<br />and show that the proposed method is evidently effective.<br />　To know the performances of the classifiers with the abovementioned feature sets<br />in more general cases, we experimented with our method on the Web->KB data<br />set, a test collection commonly used for the web page classification task. It contains<br />seven categories and four of them-course, faculty, project, and student-are used<br />for comparing the performance. The experimental results show that our method out-<br />performed all seven of the previous methods in terms of macro-averaged F-measure.<br />We can therefore conclude that our method performs fairly well and is applicable not<br />only to researchers' homepages in Japanese but also to other categories of homepages<br />in other languages.<br />　By tuning the well-performing classifiers independently, we then build a recall-<br />assured classifier and a precision-assured classifier and compose a three-way classi-<br />fier by using them in combination. We estimated the numbers of the pages to be<br />manually assessed for the required precision/recall at 99.5%/98%, 99%/95%, and<br />98%/90%, using the output pages from a 100GB data set through the rough filter-<br />ing. The results show that the manual assessment cost can be reduced, compared<br />to the baseline, down to 77.6%, 57.3%, and 51.8%, respectively. We analyzed clas-<br />sification result examples, and the results show the effectiveness of the classifiers.<br />　In Chapter 6 the cascaded structure of the recall-assured classifiers, used in<br />combination with the rough filtering, is proposed for reducing the computer pro-<br />cessing cost. 
Estimation of the numbers of pages requiring feature extraction in the<br />accurate classification shows that the computer processing cost can be reduced<br />down to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.<br />  In Chapter 7 we summarize our contributions. One of our unique contributions<br />is that we pointed out the importance of assuring the quality of web page collection<br />and proposed a framework for doing so. Another is that we introduced an idea of<br />local page group models (PGMs) and demonstrated its effective uses for filtering<br />and classifying web pages.<br />　We first presented a realistic framework for building a high-quality web page<br />collection with a two-step process, comprising the rough filtering followed by the<br />accurate classification, in order to reduce the processing cost. In the rough filtering<br />we contributed two original key techniques used in the modified PGMs to reduce the<br />irrelevant keywords to be propagated. One is to introduce a threshold on the number<br />of out-linked pages in the same and lower directories, and the other is to introduce<br />keyword list types and propagate only the organization-related keyword lists from<br />the upper directories. 
In the accurate classification we contributed not only an original<br />method for exploiting features from the surrounding pages and concatenating the<br />features independently to improve web page classification performance but also a<br />way to use a recall-assured classifier and a precision-assured classifier in combination<br />as a three-way classifier in order to reduce the number of pages requiring manual<br />assessment under the given quality constraints.<br />　We also discuss the future work: finding a more systematic way for modifying<br />the property set and property-based keywords for the rough filtering, investigating<br />ways to estimate the likelihood of the component pages and incorporate them for<br />the accurate classification, and further utilizing the information from the homepage<br />collection for practical applications.","subitem_description_type":"Other"}]},"item_1_description_18":{"attribute_name":"フォーマット","attribute_value_mlt":[{"subitem_description":"application/pdf","subitem_description_type":"Other"}]},"item_1_description_7":{"attribute_name":"学位記番号","attribute_value_mlt":[{"subitem_description":"総研大甲第1000号","subitem_description_type":"Other"}]},"item_1_select_14":{"attribute_name":"所蔵","attribute_value_mlt":[{"subitem_select_item":"有"}]},"item_1_select_8":{"attribute_name":"研究科","attribute_value_mlt":[{"subitem_select_item":"複合科学研究科"}]},"item_1_select_9":{"attribute_name":"専攻","attribute_value_mlt":[{"subitem_select_item":"17 情報学専攻"}]},"item_1_text_10":{"attribute_name":"学位授与年度","attribute_value_mlt":[{"subitem_text_value":"2006"}]},"item_creator":{"attribute_name":"著者","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"WANG, 
Yuxin","creatorNameLang":"en"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_files":{"attribute_name":"ファイル情報","attribute_type":"file","attribute_value_mlt":[{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2016-02-17"}],"displaytype":"simple","filename":"甲1000_要旨.pdf","filesize":[{"value":"394.0 kB"}],"format":"application/pdf","licensetype":"license_11","mimetype":"application/pdf","url":{"label":"要旨・審査要旨","url":"https://ir.soken.ac.jp/record/855/files/甲1000_要旨.pdf"},"version_id":"9fb578af-2f50-4356-bc3f-af1341d304f7"},{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2016-02-17"}],"displaytype":"simple","filename":"甲1000_本文.pdf","filesize":[{"value":"2.3 MB"}],"format":"application/pdf","licensetype":"license_11","mimetype":"application/pdf","url":{"label":"本文","url":"https://ir.soken.ac.jp/record/855/files/甲1000_本文.pdf"},"version_id":"5df2d64c-2e80-447b-97be-9081382713a1"}]},"item_language":{"attribute_name":"言語","attribute_value_mlt":[{"subitem_language":"eng"}]},"item_resource_type":{"attribute_name":"資源タイプ","attribute_value_mlt":[{"resourcetype":"thesis","resourceuri":"http://purl.org/coar/resource_type/c_46ec"}]},"item_title":"Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures","item_titles":{"attribute_name":"タイトル","attribute_value_mlt":[{"subitem_title":"Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures"},{"subitem_title":"Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures","subitem_title_language":"en"}]},"item_type_id":"1","owner":"1","path":["19"],"pubdate":{"attribute_name":"公開日","attribute_value":"2010-02-22"},"publish_date":"2010-02-22","publish_status":"0","recid":"855","relation_version_is_last":true,"title":["Study on Building a High-Quality Homepage Collection from the Web Considering Page Group 
Structures"],"weko_creator_id":"1","weko_shared_id":-1},"updated":"2023-06-20T16:10:32.152555+00:00"}