@misc{oai:ir.soken.ac.jp:00001003, author = {阿部, 貴志 and アベ, タカシ and ABE, Takashi}, month = {2016-02-17, 2016-02-17}, note = {With the increasing amount of available genomic sequences, novel tools are needed for comprehensive analysis of species-specific sequence characteristics for a wide variety of genomes. The Self-Organizing Map (SOM), which was developed by Kohonen to study memory and recall/association mechanisms, can identify and associate similar types of information and localize such information in close vicinity on a two-dimensional map. SOM has proven to be a powerful unsupervised algorithm and has been applied in various fields of science and technology (e.g., complex industrial processes, document and image databases, and financial applications), but it has rarely been applied to the analysis of genome sequences. In this thesis study, on the basis of batch-learning SOM (BL-SOM), I modified the conventional SOM for genome informatics to make the learning process and the resulting map independent of the order of data input. The initial weight vectors of Kohonen’s conventional SOM are usually set to random values, but the vectors in my method were initialized by principal component analysis (PCA) so that separate calculations yield the same result. I further modified BL-SOM to execute parallel processing on supercomputers and PC clusters and thus could analyze a vast amount of available genomic sequences. In this thesis study, I used the modified SOM to analyze short oligonucleotide frequencies (di- to pentanucleotide frequencies) in a wide variety of prokaryotic and eukaryotic genomes.
  When only fragments of genomic sequences (e.g., 10-kb sequences) from mixed genomes of multiple organisms are available, it would appear to be impossible to identify how many and what types of genomes are present in the collected sequences. However, I found that the modified SOM could classify the sequence fragments according to species without any information other than oligonucleotide frequencies. I constructed SOMs of di-, tri-, and tetranucleotide frequencies in 1- and 10-kb sequences from prokaryotic and eukaryotic genomes for which complete sequences are available. In most 10-kb sequences, SOM recognized species-specific characteristics of oligonucleotide frequencies (key combinations of oligonucleotide frequencies), permitting species-specific classification of sequences without any information regarding species.
  A majority of environmental microorganisms, especially those living in extreme environments, are difficult to culture in the laboratory. Because conventional experimental approaches have been unsuccessful, these genomes have remained uncharacterized, and it is possible that they contain a wide range of novel genes of scientific and/or industrial interest. Metagenomics, the genomic analysis of uncultured microorganisms, has been proposed as a way to study microorganism diversity in a wide variety of environments and to identify novel and industrially useful genes. In the metagenome analysis of uncultured microorganisms, genomic DNA is extracted directly from an environmental specimen that contains multiple organisms, and the genomic fragments are then cloned and sequenced. With a simple collection of fragmentary sequences, it appears to be impossible to predict which species are present in an environmental sample and in what ratios, to which lineages the species belong, and how novel the genomes are. To establish SOM as a methodology suitable for this purpose, I constructed SOMs of tetranucleotide frequencies in 1- and 5-kb sequences from approximately 80 bacterial genomes for which complete sequences are available. Sequences were clustered primarily according to species and to 11 major bacterial groups without any information regarding the species. With this SOM method, all sequences in DNA databases that were from unidentified or uncultured bacteria and longer than 1 kb were classified into the 11 major bacterial groups. The result indicated that the method is also useful for surveying pathogenic microorganisms that cause novel or poorly understood infectious diseases.
  Next, I analyzed tetra- and pentanucleotide frequencies in the human genome and found that the frequencies and distributions of oligonucleotide sequences involved in transcriptional regulation often deviated significantly from random occurrence. I could categorize occurrence patterns and frequencies of known signal sequences in the human genome. When known signal sequences from various species with sufficient experimental data are characterized and categorized systematically with SOMs, it should be possible to develop an in silico method to predict signal sequences, which is expected to be most useful for identifying signal sequences in genomes for which only sequence data are available. Because the number of such poorly characterized genomes is growing, development of such an in silico method has become increasingly important. I have developed SOM as a methodology well suited to this purpose.
  In addition to protein-coding sequences (CDSs), the flanking regions upstream of transcription start sites and the 5’ and 3’ untranslated regions (UTRs) have attracted attention because of their crucial roles in transcriptional and post-transcriptional regulation of gene expression. By combining analyses of cDNA and genomic sequences of human and mouse, I developed SOM to characterize six functional regions (5’ and 3’ UTRs, CDSs, introns, 5’ flanking regions, and ncRNAs) in these genomes and to identify hidden sequence characteristics in the functional regions. Because the clustering power of SOM is very high, I propose that SOM can provide fundamental guidelines for understanding the molecular processes and mechanisms that have established the sequence characteristics of individual genomes and genomic regions during evolution., application/pdf, 総研大乙第127号}, title = {Development of a novel genome informatics strategy on the basis of Self-Organizing Map (SOM)}, year = {} }