Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures

https://ir.soken.ac.jp/records/855
538dfea3-de1c-4eb7-b9d5-471e28faff95
Files
  • 甲1000_要旨.pdf: Abstract and examination summary (394.0 kB)
  • 甲1000_本文.pdf: Full text (2.3 MB)
Item type: Thesis or Dissertation
Publication date: 2010-02-22
Title: Study on Building a High-Quality Homepage Collection from the Web Considering Page Group Structures
Language: eng
Resource type identifier: http://purl.org/coar/resource_type/c_46ec
Resource type: thesis
Author: WANG, Yuxin
Author name reading (furigana): ウァン, ユジン
Degree-granting institution: 総合研究大学院大学 (The Graduate University for Advanced Studies, SOKENDAI)
Degree name: 博士(情報学) (Doctorate in Informatics)
Degree number: 総研大甲第1000号
Graduate school: 複合科学研究科 (School of Multidisciplinary Sciences)
Department: 17 情報学専攻 (Department of Informatics)
Degree conferral date: 2006-09-29
Degree conferral academic year: 2006
Abstract
This dissertation is devoted to investigating a method for building a high-quality homepage collection from the web efficiently by considering page group structures. We mainly investigate researchers' homepages, and partly homepages of other categories.

A web page collection with a guaranteed high quality (i.e., high recall and high precision) is required for implementing high-quality web-based information services. Building such a collection demands a large amount of human work, however, because of the diversity, vastness, and sparseness of web pages. Even though many researchers have investigated methods for searching and classifying web pages, most of the methods are best-effort types and pay no attention to quality assurance. We therefore investigate a method for building a homepage collection efficiently while assuring a given high quality, with the expectation that the method can be applied to collections of various categories of homepages.

This dissertation consists of seven chapters. Chapter 1 gives the introduction, and Chapter 2 presents the related work. Chapter 3 describes the objectives, the overall performance goal of the investigated system, and the scheme of the system. Chapters 4 and 5 discuss the two parts of our two-step-processing method in detail, respectively. Chapter 6 discusses the method for reducing the processing cost of the system, and Chapter 7 concludes the dissertation by summarizing it and discussing future work.

Chapter 3, taking into account the enormous size of the real web, introduces a two-step-processing method comprising rough filtering and accurate classification. The former narrows down the candidate pages efficiently with the required high recall, and the latter accurately classifies the candidate pages into three classes (assured positive, assured negative, and uncertain) while assuring the required precision and recall.

We present in detail the configuration, the experiments, and the evaluation of the rough filtering in Chapter 4. The rough filtering is a method for gathering researchers' homepages (or entry pages) by applying our original, simple, and effective local page group models, which exploit the mutual relations between the structure and the content of a logical page group. It aims at narrowing down the candidates with a very high recall. First, property-based keyword lists that correspond to researchers' common properties are created and grouped as either organization-related or non-organization-related. Next, four page group models (PGMs) that take the structure of an individual logical page group into consideration are introduced: PGM_Od models the out-linked pages in the same and lower directories, PGM_Ou models the out-linked pages in the upper directories, PGM_I models the in-linked pages in the same and upper directories, and PGM_U models the site top and the directory entry pages in the same and upper directories.

Based on the PGMs, keywords are propagated to a potential entry page from its surrounding pages to compose a virtual entry page, and the virtual entry pages that score at least a threshold value are selected. Since the application of the PGMs generally introduces a lot of noise, we introduced four modified PGMs with two original techniques: keywords are propagated based on PGM_Od only when the number of out-linked pages in the same and lower directories is less than a threshold value, and only the organization-related keywords are propagated based on the other PGMs. The four modified PGMs are used in combination in order to utilize as many informative keywords as possible from the surrounding pages.
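To make the keyword propagation concrete, the following is a minimal Python sketch of the modified-PGM step described above. It is illustrative only: the page representation, the example keyword lists, and the threshold values are assumptions, not the configuration used in the dissertation.

    # Illustrative sketch of the modified-PGM keyword propagation (not the
    # dissertation's actual implementation). Pages are assumed to be dicts
    # holding a precomputed set of matched property-based keywords.
    ORG_KEYWORDS = {"university", "department", "laboratory"}            # assumed examples
    NON_ORG_KEYWORDS = {"publications", "research interests", "biography"}
    ALL_KEYWORDS = ORG_KEYWORDS | NON_ORG_KEYWORDS

    def compose_virtual_entry_page(candidate, pgm_groups, max_od_pages=10):
        """Propagate keywords from surrounding pages onto a candidate entry page.

        pgm_groups maps 'PGM_Od', 'PGM_Ou', 'PGM_I', 'PGM_U' to lists of
        surrounding pages grouped according to the four page group models.
        """
        keywords = set(candidate["keywords"])
        # Modified PGM_Od: propagate from out-linked pages in the same/lower
        # directories only when they are few enough, to limit noise.
        od_pages = pgm_groups.get("PGM_Od", [])
        if len(od_pages) < max_od_pages:
            for page in od_pages:
                keywords |= page["keywords"]
        # Other modified PGMs: propagate organization-related keywords only.
        for model in ("PGM_Ou", "PGM_I", "PGM_U"):
            for page in pgm_groups.get(model, []):
                keywords |= page["keywords"] & ORG_KEYWORDS
        return keywords

    def rough_filter(candidates, score_threshold=3):
        """Keep candidates whose virtual entry page matches enough property keywords."""
        selected = []
        for candidate, pgm_groups in candidates:
            virtual_keywords = compose_virtual_entry_page(candidate, pgm_groups)
            if len(virtual_keywords & ALL_KEYWORDS) >= score_threshold:
                selected.append(candidate)
        return selected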
The effectiveness of the method is shown by comparing it with a single-page-based method through experiments using a 100GB web data set and a manually created sample data set. The results show that the output pages from the rough filtering are less than 23% of the pages in the 100GB data set when the four modified PGMs are used in combination, under the condition that the recall is more than 98%. Another experiment using a 1.36TB web data set with the same rough filtering configuration shows that the output pages are less than 15% of the pages in the corpus.

In Chapter 5 we present in detail the configuration, the experiments, and the evaluation of the accurate classification method. Using two types of component classifiers (a recall-assured classifier and a precision-assured classifier) in combination, we construct a three-way classifier that takes the candidate pages output by the rough filtering and classifies them into three classes: assured positive, assured negative, and uncertain. The assured positive output assures the precision, and the assured positive and uncertain outputs together assure the recall, so only the uncertain output needs to be manually assessed in order to assure the quality of the web data collection.

We first devise a feature set for building the high-performance component classifiers using Support Vector Machines (SVMs). We use textual features obtained from each page and its surrounding pages. After the surrounding pages are grouped according to connection type (in-link, out-link, and directory entry) and relative URL hierarchy (same, upper, or lower in the directory hierarchy), an independent feature subset is generated from each group. The feature subsets are then concatenated conceptually to compose the feature set of a classifier. We use two types of textual features (plain-text-based and tagged-text-based). The classifier using only the plain-text-based features of each page alone is used as the baseline. Various feature sets are tested in experiments using manually prepared sample data, and the classifiers are tuned by two methods, one offset-based and the other c-j-option-based. The results show that the improvement obtained by the c-j-option-based tuning method is statistically significant at the 95% confidence level. The F-measures of the baseline and the top two performing classifiers are 83.26%, 88.65%, and 88.58%, showing that the proposed method is evidently effective.

To assess the performance of the classifiers with the abovementioned feature sets in more general cases, we experimented with our method on the Web->KB data set, a test collection commonly used for the web page classification task. It contains seven categories, four of which (course, faculty, project, and student) are used for comparing the performance. The experimental results show that our method outperformed all seven of the previous methods in terms of macro-averaged F-measure. We can therefore conclude that our method performs fairly well and is applicable not only to researchers' homepages in Japanese but also to other categories of homepages in other languages.
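As a rough illustration of the three-way classification described above, the sketch below trains an SVM with scikit-learn and splits its decision scores with two thresholds. This is a simplifying assumption: in the dissertation a recall-assured and a precision-assured classifier are tuned independently, whereas here a single classifier with placeholder thresholds stands in for both.

    # Illustrative three-way classification (a simplification of the method:
    # one linear SVM with two decision thresholds stands in for the separately
    # tuned recall-assured and precision-assured component classifiers).
    from sklearn.svm import LinearSVC

    def train_component_classifier(feature_matrix, labels):
        # In the dissertation, feature subsets from the page and from its
        # surrounding pages (grouped by connection type and URL hierarchy)
        # are concatenated; here feature_matrix is any numeric matrix.
        classifier = LinearSVC(C=1.0)
        classifier.fit(feature_matrix, labels)
        return classifier

    def three_way_classify(classifier, feature_matrix,
                           recall_threshold=-0.5, precision_threshold=0.5):
        """Label each page as assured positive, assured negative, or uncertain.

        Scores above precision_threshold are accepted outright, scores below
        recall_threshold are rejected, and the rest are left for manual
        assessment, which is how the precision and recall targets are assured.
        """
        results = []
        for score in classifier.decision_function(feature_matrix):
            if score >= precision_threshold:
                results.append("assured positive")
            elif score < recall_threshold:
                results.append("assured negative")
            else:
                results.append("uncertain")
        return results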
By tuning the well-performing classifiers independently, we then build a recall-assured classifier and a precision-assured classifier and compose a three-way classifier by using them in combination. We estimated the numbers of pages that must be manually assessed for required precision/recall levels of 99.5%/98%, 99%/95%, and 98%/90%, using the output pages from the 100GB data set through the rough filtering. The results show that the manual assessment cost can be reduced, compared to the baseline, down to 77.6%, 57.3%, and 51.8%, respectively. We analyzed classification result examples, and the results show the effectiveness of the classifiers.

In Chapter 6, a cascaded structure of recall-assured classifiers, used in combination with the rough filtering, is proposed for reducing the computer processing cost. Estimation of the numbers of pages requiring feature extraction in the accurate classification shows that the computer processing cost can be reduced down to 27.5% for the 100GB data set and 18.3% for the 1.36TB data set.

In Chapter 7 we summarize our contributions. One of our unique contributions is that we pointed out the importance of assuring the quality of a web page collection and proposed a framework for doing so. Another is that we introduced the idea of local page group models (PGMs) and demonstrated its effective use for filtering and classifying web pages.

We first presented a realistic framework for building a high-quality web page collection with a two-step process, comprising the rough filtering followed by the accurate classification, in order to reduce the processing cost. In the rough filtering we contributed two original key techniques used in the modified PGMs to reduce the irrelevant keywords to be propagated: one is to introduce a threshold on the number of out-linked pages in the same and lower directories, and the other is to introduce keyword list types and to propagate only the organization-related keyword lists from the upper directories. In the accurate classification we contributed not only an original method for exploiting features from the surrounding pages and concatenating them independently to improve web page classification performance, but also a way to use a recall-assured classifier and a precision-assured classifier in combination as a three-way classifier in order to reduce the number of pages requiring manual assessment under the given quality constraints.

We also discuss future work: finding a more systematic way to modify the property set and the property-based keywords for the rough filtering, investigating ways to estimate the likelihood of the component pages and to incorporate it into the accurate classification, and further utilizing the information from the homepage collection for practical applications.
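The processing-cost reduction of Chapter 6 rests on running cheap recall-assured stages before expensive feature extraction. A minimal sketch of such a cascade, with hypothetical stage functions rather than the dissertation's actual components, might look like this:

    # Illustrative cascade of recall-assured stages (hypothetical stage functions).
    # Stages are ordered from cheapest to most expensive, and each stage is tuned
    # to keep (nearly) all true positives, so costly feature extraction is only
    # applied to the pages that survive the earlier, cheaper stages.
    def cascade_filter(pages, stages):
        """stages: list of (extract_features, is_possibly_positive) pairs."""
        surviving = pages
        for extract_features, is_possibly_positive in stages:
            surviving = [page for page in surviving
                         if is_possibly_positive(extract_features(page))]
        return surviving  # passed on to the final precision-assured decision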
Holdings: Yes
Format: application/pdf