{"created":"2023-06-20T13:20:48.047897+00:00","id":855,"links":{},"metadata":{"_buckets":{"deposit":"55e3db86-b6a0-49a5-bea5-385c13ee4b41"},"_deposit":{"created_by":1,"id":"855","owners":[1],"pid":{"revision_id":0,"type":"depid","value":"855"},"status":"published"},"_oai":{"id":"oai:ir.soken.ac.jp:00000855","sets":["2:429:19"]},"author_link":["0","0","0"],"item_1_creator_2":{"attribute_name":"著者名","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"WANG, Yuxin"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_1_creator_3":{"attribute_name":"フリガナ","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"ウァン, ユジン"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_1_date_granted_11":{"attribute_name":"学位授与年月日","attribute_value_mlt":[{"subitem_dategranted":"2006-09-29"}]},"item_1_degree_grantor_5":{"attribute_name":"学位授与機関","attribute_value_mlt":[{"subitem_degreegrantor":[{"subitem_degreegrantor_name":"総合研究大学院大学"}]}]},"item_1_degree_name_6":{"attribute_name":"学位名","attribute_value_mlt":[{"subitem_degreename":"博士(情報学)"}]},"item_1_description_12":{"attribute_name":"要旨","attribute_value_mlt":[{"subitem_description":" This disseration is devoted to investigate the method for building a high-quality
homepage collection from the web effciently by considering the page group struc-
tures. We mainly investigate in researchers' homepages and homepages of other
categories partly.
A web page collection with a guaranteed high quality (i.e., high recall and high
precision) is required for implementing high-quality web-based information services.
Building such a collection demands a large amount of human work, however, because
of the diversity, vastness, and sparseness of web pages. Although many researchers
have investigated methods for searching and classifying web pages, most of these
methods are best-effort approaches that pay no attention to quality assurance. We
therefore investigate a method for building a homepage collection efficiently while
assuring a given high quality, with the expectation that the method will be
applicable to collections of various categories of homepages.
This dissertation consists of seven chapters. Chapter 1 gives the introduction,
and Chapter 2 presents the related work. Chapter 3 describes the objectives, the
overall performance goal of the investigated system, and the scheme of the system.
Chapters 4 and 5 respectively discuss in detail the two parts of our two-step-
processing method. Chapter 6 discusses the method for reducing the processing cost of the
system, and Chapter 7 concludes the dissertation by summarizing it and discussing
future work.
Chapter 3, taking into account the enormous size of the real web, introduces a
two-step-processing method comprising rough filtering and accurate classification.
The former is for efficiently narrowing down the candidate pages with the required
high recall, and the latter is for accurately classifying the candidate pages into
three classes (assured positive, assured negative, and uncertain) while assuring
the required quality.
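As a minimal sketch of this two-step pipeline (the function and parameter names
below are hypothetical, not taken from the dissertation):

```python
from typing import Iterable, List, Tuple

# Hypothetical orchestration of the two-step process described above: a cheap,
# recall-oriented filter narrows the corpus, and a more expensive three-way
# classifier partitions the survivors.

def build_collection(pages: Iterable[str],
                     rough_filter,         # high-recall candidate selector
                     three_way_classifier  # returns "positive"/"negative"/"uncertain"
                     ) -> Tuple[List[str], List[str]]:
    """Return (assured_positives, pages_needing_manual_assessment)."""
    candidates = [p for p in pages if rough_filter(p)]   # step 1: rough filtering
    positives, uncertain = [], []
    for page in candidates:                              # step 2: accurate classification
        label = three_way_classifier(page)
        if label == "positive":
            positives.append(page)
        elif label == "uncertain":
            uncertain.append(page)   # only these need manual assessment
    return positives, uncertain
```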
We present in detail the configuration, the experiments, and the evaluation of
the rough filtering in Chapter 4. The rough filtering is a method for gathering
researchers' homepages (or entry pages) by applying our original, simple, and effec-
tive local page group models exploiting the mutual relations between the structure
and the content of a logical page group. It aims at narrowing down the candidates
with a very high recall. First, property-based keyword lists that correspond to
researchers' common properties are created and are grouped either as organization-
related or non-organization-related. Next, four page group models (PGMs)
taking into consideration the structure in an individual logical page group are intro-
duced. PGM_Od models the out-linked pages in the same and lower directories,
PGM_Ou models the out-linked pages in the upper directories, PGM_I models
the in-linked pages in the same and the upper directories, and PGM_U models the
site top and the directory entry pages in the same and the upper directories.
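To make the four PGMs concrete, here is a small sketch (with hypothetical helper
names; the dissertation defines the models conceptually, not as code) that decides
which PGM a surrounding page falls under, based on link direction and relative
directory position:

```python
from urllib.parse import urlparse
import posixpath

def _dir(url: str) -> str:
    """Directory part of a URL path, e.g. '/lab/member/' for '/lab/member/a.html'."""
    path = urlparse(url).path
    return path if path.endswith("/") else posixpath.dirname(path) + "/"

def pgm_type(candidate_url: str, other_url: str, direction: str):
    """Classify a surrounding page into PGM_Od, PGM_Ou, or PGM_I.

    direction: "out" if the candidate links to the page, "in" if the page
    links to the candidate. Returns None when no model applies.
    """
    c, o = _dir(candidate_url), _dir(other_url)
    same_or_lower = o.startswith(c)        # same directory or below it
    upper = c.startswith(o) and o != c     # strictly above the candidate
    if direction == "out" and same_or_lower:
        return "PGM_Od"
    if direction == "out" and upper:
        return "PGM_Ou"
    if direction == "in" and (o == c or upper):
        return "PGM_I"
    return None

def pgm_u_pages(candidate_url: str):
    """PGM_U: the site top and the entry page of each same/upper directory."""
    base = urlparse(candidate_url)
    pages, acc = [f"{base.scheme}://{base.netloc}/"], ""
    for part in _dir(candidate_url).strip("/").split("/"):
        if part:
            acc += "/" + part
            pages.append(f"{base.scheme}://{base.netloc}{acc}/")
    return pages
```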
Based on the PGMs, the keywords are propagated to a potential entry page from
its surrounding pages to compose a virtual entry page. Finally, the virtual entry
pages that score at least a threshold value are selected. Since a straightforward
application of the PGMs generally introduces a lot of noise, we introduce four
modified PGMs with two original techniques: the keywords are propagated based on
PGM_Od only when
the number of out-linked pages in the same and lower directories is less than a
threshold value, and only the organization-related keywords are propagated based
on other PGMs. The four modified PGMs are used in combination in order to utilize
as many informative keywords as possible from the surrounding pages.
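The following sketch illustrates how keyword propagation with the two techniques
might look (all names are hypothetical; the threshold value 20 is an arbitrary
placeholder, and org_keywords stands in for the organization-related
property-based keyword lists described above):

```python
def virtual_entry_keywords(page_keywords, surrounding, org_keywords,
                           max_od_outlinks=20):
    """Compose the keyword set of a 'virtual entry page' for one candidate.

    page_keywords:   property-based keywords found on the candidate itself
    surrounding:     list of (keywords, pgm) pairs, one per surrounding page,
                     with pgm in {"PGM_Od", "PGM_Ou", "PGM_I", "PGM_U"}
    org_keywords:    the organization-related keywords, flattened to a set
    max_od_outlinks: threshold on out-links in the same/lower directories
    """
    collected = set(page_keywords)
    od_count = sum(1 for _, pgm in surrounding if pgm == "PGM_Od")
    for kws, pgm in surrounding:
        if pgm == "PGM_Od":
            # Technique 1: propagate from out-linked pages in the same and
            # lower directories only when there are few of them.
            if od_count < max_od_outlinks:
                collected |= set(kws)
        else:
            # Technique 2: from the other PGMs, propagate only the
            # organization-related keywords.
            collected |= set(kws) & org_keywords
    return collected

def passes_rough_filtering(keywords, score_fn, threshold):
    """Select the candidate if its virtual entry page scores high enough."""
    return score_fn(keywords) >= threshold
```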
The effectiveness of the method is shown by comparing it with that of a single-
page-based method through experiments using a 100GB web data set and a manually
created sample data set. The experimental results show that the output pages from
the rough filtering are less than 23% of the pages in the 100GB data set when the
four modified PGMs are used in combination, under the condition that the recall is
more than 98%. Another experiment using a 1.36TB web data set with the same
rough filtering configuration shows that the output pages are less than 15% of the
pages in the corpus.
In Chapter 5 we present in detail the configuration, the experiments, and the
evaluation of the accurate classification method. Using two types of component
classifiers (a recall-assured classifier and a precision-assured classifier) in
combination, we construct a three-way classifier that takes as input the candidate
pages output by the rough filtering and classifies them into three classes: assured
positive, assured negative, and uncertain. The assured positive output assures the
precision, and the assured positive and uncertain outputs together assure the recall,
so only the uncertain output needs to be manually assessed in order to assure the
quality of the web data collection.
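A minimal sketch of this combination (hypothetical names; the two component
classifiers are assumed to be already tuned as described in the text):

```python
def three_way_classify(page, recall_assured, precision_assured):
    """Combine two binary classifiers into a three-way decision.

    recall_assured(page)    -> bool: rarely misses a true homepage (assures recall)
    precision_assured(page) -> bool: rarely accepts a non-homepage (assures precision)
    """
    if not recall_assured(page):
        return "assured negative"   # recall-assured says no: safe to discard
    if precision_assured(page):
        return "assured positive"   # both agree: safe to accept
    return "uncertain"              # disagreement: needs manual assessment
```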
We first devise a feature set for building the high-performance component clas-
sifiers using a Support Vector Machine (SVM). We use textual features obtained
from each page and its surrounding pages. After the surrounding pages are grouped
according to connection types (in-link, out-link, and directory entry) and relative
URL hierarchy (same, upper, or lower in the directory hierarchy), an independent
feature subset is generated from each group. Feature subsets are further concate-
nated conceptually to compose the feature set of a classifier. We use two types of
textual features (plain-text-based and tagged-text-based). The classifier using
only the plain-text-based features in each page alone is used as the baseline. Various
feature sets are tested in the experiment using manually prepared sample data, and
the classifiers are tuned by two methods, one offset-based and the other c-j-option-
based. The results show that the performance improvement obtained by the c-j-option-
based tuning method is statistically significant at the 95% confidence level. The
F-measures of the baseline and the two best-performing classifiers are 83.26%,
88.65%, and 88.58%, respectively, showing that the proposed method is evidently
effective.
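As an illustration of the grouped feature construction described above (a sketch
assuming simple bag-of-words subsets; the group keys follow the connection types
and hierarchy relations named in the text, while the helper itself is hypothetical):

```python
from collections import Counter

def grouped_features(page_text, surrounding):
    """Build an independent bag-of-words subset per group, then concatenate.

    surrounding: list of (text, connection_type, hierarchy) triples, where
    connection_type is "in-link", "out-link", or "directory-entry" and
    hierarchy is "same", "upper", or "lower".
    Prefixing each term with its group key keeps the subsets independent:
    "professor" on an upper-directory in-linked page becomes a different
    feature from "professor" on the page itself.
    """
    features = Counter()
    for term in page_text.lower().split():
        features["self:" + term] += 1
    for text, conn, hier in surrounding:
        prefix = f"{conn}:{hier}:"
        for term in text.lower().split():
            features[prefix + term] += 1
    return features
```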
To assess the performance of the classifiers with the abovementioned feature sets
in more general cases, we experimented with our method on the WebKB data set, a
test collection commonly used for the web page classification task. It contains
seven categories, four of which (course, faculty, project, and student) are used
for comparing the performance. The experimental results show that our method out-
performed all seven of the previous methods in terms of macro-averaged F-measure.
We can therefore conclude that our method performs fairly well and is applicable not
only to researchers' homepages in Japanese but also to other categories of homepages
in other languages.
By tuning the well-performing classifiers independently, we then build a recall-
assured classifier and a precision-assured classifier and compose a three-way classi-
fier by using them in combination. We estimated the numbers of the pages to be
manually assessed for the required precision/recall at 99.5%/98%, 99%/95%, and
98%/90%, using the output pages from a 100GB data set through the rough filter-
ing. The results show that the manual assessment cost can be reduced, compared
to the baseline, down to 77.6%, 57.3%, and 51.8%, respectively. We also analyzed
example classification results, which confirm the effectiveness of the classifiers.
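One plausible reading of the offset-based tuning used to derive the recall-assured
component is sketched below (hypothetical code: an SVM's decision threshold is
shifted on held-out data until the required recall is met):

```python
def tune_offset_for_recall(scores, labels, target_recall):
    """Find a decision-score threshold that achieves the target recall.

    scores: SVM decision values on a held-out sample; labels: true booleans.
    Lowering the acceptance threshold trades precision for recall, so we scan
    candidate thresholds from high to low and stop once recall is sufficient.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    total = len(positives)
    for i, threshold in enumerate(positives, start=1):
        if i / total >= target_recall:
            return threshold    # accept pages with score >= threshold
    return min(scores)          # degenerate case: accept everything

# A precision-assured classifier would be tuned analogously, scanning for the
# threshold whose accepted set keeps precision above its own target.
```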
In Chapter 6 a cascaded structure of recall-assured classifiers, used in
combination with the rough filtering, is proposed for reducing the computer
processing cost. Estimates of the number of pages requiring feature extraction in
the accurate classification show that the computer processing cost can be reduced
down to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.
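A sketch of such a cascade (hypothetical; each stage is a recall-assured
classifier over progressively more expensive features, so pages rejected early
never incur the later feature-extraction cost):

```python
def cascade_classify(page, stages):
    """Run recall-assured stages in order of increasing feature cost.

    stages: list of (extract_features, recall_assured_classifier) pairs,
    cheapest features first. A page rejected at any stage is an assured
    negative, so expensive features are extracted only for survivors.
    """
    for extract_features, classifier in stages:
        if not classifier(extract_features(page)):
            return False    # assured negative: stop before costlier stages
    return True             # survived every recall-assured stage
```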
In Chapter 7 we summarize our contributions. One of our unique contributions
is that we pointed out the importance of assuring the quality of web page collections
and proposed a framework for doing so. Another is that we introduced the idea of
local page group models (PGMs) and demonstrated its effective uses for filtering
and classifying web pages.
We first presented a realistic framework for building a high-quality web page
collection with a two-step process, comprising the rough filtering followed by the
accurate classification, in order to reduce the processing cost. In the rough filtering
we contributed two original key techniques used in the modified PGMs to reduce the
irrelevant keywords to be propagated. One is to introduce a threshold on the number
of out-linked pages in the same and lower directories, and the other is to introduce
keyword list types and propagate only the organization-related keyword lists from
the upper directories. In the accurate classification we contributed not only an original
method for exploiting features from the surrounding pages and concatenating the
features independently to improve web page classification performance but also a
way to use a recall-assured classifier and a precision-assured classifier in combination
as a three-way classifier in order to reduce the number of pages requiring manual
assessment under the given quality constraints.
We also discuss the future work: finding a more systematic way for modifying
the property set and property-based keywords for the rough filtering, investigating
ways to estimate the likelihood of the component pages and incorporate them for
the accurate classification, and further utilizing the information from the homepage
collection for practical applications.