{"created":"2023-06-20T13:20:56.757155+00:00","id":1013,"links":{},"metadata":{"_buckets":{"deposit":"9c56a12c-9bb3-40b1-a1bd-89865c32cad7"},"_deposit":{"created_by":1,"id":"1013","owners":[1],"pid":{"revision_id":0,"type":"depid","value":"1013"},"status":"published"},"_oai":{"id":"oai:ir.soken.ac.jp:00001013","sets":["2:430:20"]},"author_link":["0","0","0"],"item_1_creator_2":{"attribute_name":"著者名","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"Kryukov, Kirill"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_1_creator_3":{"attribute_name":"フリガナ","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"クリュコフ, キリル"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_1_date_granted_11":{"attribute_name":"学位授与年月日","attribute_value_mlt":[{"subitem_dategranted":"2005-03-24"}]},"item_1_degree_grantor_5":{"attribute_name":"学位授与機関","attribute_value_mlt":[{"subitem_degreegrantor":[{"subitem_degreegrantor_name":"総合研究大学院大学"}]}]},"item_1_degree_name_6":{"attribute_name":"学位名","attribute_value_mlt":[{"subitem_degreename":"博士(理学)"}]},"item_1_description_12":{"attribute_name":"要旨","attribute_value_mlt":[{"subitem_description":"My PhD study belongs to the field of computational biology and focuses on the development of new methods for molecular biology data analysis. My PhD thesis includes three chapters, each focusing on computational methods for a different stage of biological study.
In the first chapter, titled 'MISHIMA: a new method of multiple sequence alignment', I explore the possibility of applying advanced computational techniques to the problem of multiple molecular sequence alignment. Sequence alignment is one of the central tasks in molecular biology: DNA or protein sequences must be aligned before any comparison can be made between them. Although an alignment of two sequences already reveals valuable information about their relationship, some studies require multiple sequences to be aligned together. Such studies include phylogenetic analysis, identification of conserved genome elements and protein secondary structure prediction.
Common methods of multiple sequence alignment are usually based on pairwise sequence comparison: all pairs of sequences are compared separately, and the multiple alignment is then constructed through a progressive alignment procedure. This approach works well for relatively short sequences, but it takes too long when aligning genomic sequences or when the number of sequences is large. The continuously increasing number of available genomic sequences from various organisms calls for more efficient techniques for aligning such large datasets.
The new method of multiple sequence alignment that I developed during the last year, MISHIMA (a Method for Identifying Sequence History In terms of Multiple Alignment), is an attempt to reduce the computational cost of aligning multiple genomic sequences. This is achieved through a heuristic approach that quickly extracts potential homology information from the sequences. The sequences are then aligned using a divide-and-conquer approach: regions of homology shared by multiple sequences are used as points for splitting the sequences into parts, which are aligned independently of each other by a conventional alignment method. The partial alignments are then assembled to construct the final multiple alignment.
The homology extraction step is the key part of this method. It is based on the observation that the chance of a sequence motif (a short sequence fragment) representing a homology signal is related to the frequency of that motif in the sequence dataset. Sequence motifs that are rare or, conversely, very abundant in the dataset are unlikely to occur in a region of homology. On the other hand, motifs that occur exactly once in each of the input sequences have a good chance of belonging to a conserved element, thus revealing probable homology shared by multiple sequences.
The heuristic method of homology extraction used in MISHIMA relies on counting the occurrences of every sequence motif up to K nucleotides long in the sequence dataset. The number of possible motifs of length K is very large (it grows as 4^K), so an important problem was how to organize the information about motif frequencies. In MISHIMA I use a dictionary structure to store the motif frequency data efficiently, which allows information about motifs up to 12 nucleotides long to be stored in about 0.5 GB of RAM.
The MISHIMA alignment method was tested on several datasets and compared with alternative methods. One of the datasets consisted of 10 complete mitochondrial genome sequences of mammalian species. MISHIMA successfully constructed the alignment for this dataset in about two minutes. ClustalW (the most widely used multiple alignment software today) takes several hours to align the same data. Another test dataset was a set of 4 complete genomes of different strains of Streptococcus pyogenes, each about 2 MB long. MISHIMA aligned this dataset in about 6 hours on a Pentium 4 notebook machine with 1 GB of RAM. This test shows that the method can make large-scale genomic multiple alignment experiments feasible for users of ordinary desktop or portable computers.
The second chapter of my work, 'SMAP: Alignment with Reference Sequence', describes a technique for assisting a sequencing experiment. In a common whole-genome shotgun sequencing project, a chromosome of the target species is divided into fragments, such as BACs (bacterial artificial chromosomes), with lengths from several to about one hundred KB. These fragments are then sequenced, resulting in a number of sequence reads, usually less than 1 KB in length. The reads are assembled to form contigs, the basic units of the resulting sequence. The location of each contig in the genome is not known at this stage.
The analysis of a set of contigs may be easier when the genome of a closely related species has already been determined. While sequencing the genome of species A, the genome of a closely related species B can be used as a reference to supervise and assist the sequencing process. If A and B are close to each other, most of the newly sequenced contigs will be found to be homologous to some part of the reference. This homology suggests their probable location in the genome of A, which can be used to estimate the progress of the sequencing process. This information can also assist the sequencing process itself, especially at the late stage of finishing the sequence. Comparison with the reference sequence gives an estimate of the size and location of gaps, the still unknown regions of the target genome. The reference sequence can also help to assemble the contigs. In some cases the information about contig homology to the reference sequence is enough to correctly assemble the continuous sequence of the newly sequenced genome.
To implement this idea I developed SMAP, a software package for assisting a sequencing process with the help of the genomic DNA sequence of a closely related species. Its name comes from the original idea: Sequence MAPping. The BLAST local homology search tool is used to detect homology between the original sequence fragments or contigs and the reference. SMAP then analyzes the results of the BLAST search and performs the mapping and assembly of the set of contigs. SMAP has already been applied in the sequencing of chimpanzee chromosome 22, with human chromosome 21 sequences used as the reference.
The third part of my study, 'Netview: Constructing and visually exploring phylogenetic networks', describes a new method for phylogenetic analysis. The phylogenetic relationship of a group of gene sequences is commonly represented as a tree. However, a non-tree phylogenetic structure may be more appropriate in some cases. Such cases may result from recombination or horizontal gene transfer events. A non-tree structure may also appear because of ambiguity in the sequence data. In this study I propose a method for exploring such non-tree structures, based on contradictions between the aligned sequence data and a phylogenetic tree topology constructed using the neighbor-joining method.
The Netview method of network construction is based on comparing multiple sequence alignment data with a phylogenetic tree built from that alignment. Every alignment position can be characterized by its relation to the tree: it can either support the tree topology or contradict it. Alignment sites that support the tree topology do not require further analysis, but alignment positions that contradict the tree represent data that may need additional explanation. Such sites show a conflict between the sequence data and the tree, so a more complex topology, such as a network, may be needed to explain the data. The Netview method counts the different patterns of conflicting data and constructs a network by introducing an additional dimension to the tree.
I developed a program, Netview, that implements this method. Netview provides a graphical interface that lets the user select a particular pattern of incompatibility between the aligned sequence data and the phylogenetic tree. The network is then reconstructed for the selected pattern. The sequence data, which is shown for each case, also plays an important role in interpreting the observed network structure. Netview also has a 3-dimensional network viewing tool that is useful for navigating and exploring a phylogenetic structure: the user can change the size and projection angle to examine the network carefully.","subitem_description_type":"Other"}]},"item_1_description_7":{"attribute_name":"学位記番号","attribute_value_mlt":[{"subitem_description":"総研大甲第871号","subitem_description_type":"Other"}]},"item_1_select_14":{"attribute_name":"所蔵","attribute_value_mlt":[{"subitem_select_item":"有"}]},"item_1_select_8":{"attribute_name":"研究科","attribute_value_mlt":[{"subitem_select_item":"生命科学研究科"}]},"item_1_select_9":{"attribute_name":"専攻","attribute_value_mlt":[{"subitem_select_item":"18 遺伝学専攻"}]},"item_1_text_10":{"attribute_name":"学位授与年度","attribute_value_mlt":[{"subitem_text_value":"2004"}]},"item_creator":{"attribute_name":"著者","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"KRYUKOV, Kirill ","creatorNameLang":"en"}],"nameIdentifiers":[{"nameIdentifier":"0","nameIdentifierScheme":"WEKO"}]}]},"item_files":{"attribute_name":"ファイル情報","attribute_type":"file","attribute_value_mlt":[{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2016-02-17"}],"displaytype":"simple","filename":"甲871_要旨.pdf","filesize":[{"value":"419.6 kB"}],"format":"application/pdf","licensetype":"license_11","mimetype":"application/pdf","url":{"label":"要旨・審査要旨","url":"https://ir.soken.ac.jp/record/1013/files/甲871_要旨.pdf"},"version_id":"f5461f45-4d77-4328-a3c0-da390283834f"}]},"item_language":{"attribute_name":"言語","attribute_value_mlt":[{"subitem_language":"eng"}]},"item_resource_type":{"attribute_name":"資源タイプ","attribute_value_mlt":[{"resourcetype":"thesis","resourceuri":"http://purl.org/coar/resource_type/c_46ec"}]},"item_title":"Development of new methods for evolutionary data analysis","item_titles":{"attribute_name":"タイトル","attribute_value_mlt":[{"subitem_title":"Development of new methods for evolutionary data analysis"},{"subitem_title":"Development of new methods for evolutionary data analysis","subitem_title_language":"en"}]},"item_type_id":"1","owner":"1","path":["20"],"pubdate":{"attribute_name":"公開日","attribute_value":"2010-02-22"},"publish_date":"2010-02-22","publish_status":"0","recid":"1013","relation_version_is_last":true,"title":["Development of new methods for evolutionary data analysis"],"weko_creator_id":"1","weko_shared_id":-1},"updated":"2023-06-20T16:09:22.825846+00:00"}