Novel tools are needed for comprehensive interspecies comparison of the massive amounts of genomic sequence now available. The Self-Organizing Map (SOM), an unsupervised neural network algorithm, is an effective tool for clustering and visualizing high-dimensional complex data on a single map. Building on the batch-learning SOM, we modified the conventional SOM for genome informatics so that the learning process and the resulting map are independent of the order of data input. We generated SOMs for tri- and tetranucleotide frequencies in 10- and 100-kb sequence fragments from 38 eukaryotes for which almost complete genome sequences are available. The SOMs recognized species-specific characteristics (key combinations of oligonucleotide frequencies) in the genomic sequences, permitting species-specific classification of the sequences without any information regarding the species. We also generated a SOM for tetranucleotide frequencies in 1-kb sequence fragments from the human genome and found that sequences from four functional categories (5' and 3' UTRs, CDSs and introns) were classified primarily according to category. Because its classification and visualization power is very high, the SOM is an efficient and powerful tool for extracting a wide range of genome information.
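The two computational steps named above, converting sequence fragments into oligonucleotide frequency vectors and training a map whose result does not depend on input order, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`fragment`, `kmer_frequencies`, `batch_som`), the min-to-max linear initialisation, the Gaussian neighbourhood, and the linearly shrinking radius are all assumptions made for illustration; the text states only that the method builds on the batch-learning SOM.

```python
import math
from itertools import product


def fragment(seq, size):
    """Split seq into non-overlapping fragments of exactly `size` bases."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def kmer_frequencies(seq, k=4):
    """Relative frequencies of all 4**k k-mers, in lexicographic ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other codes are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def best_matching_unit(weights, v):
    """Index of the node whose weight vector is closest to v (squared distance)."""
    return min(range(len(weights)),
               key=lambda j: sum((w - x) ** 2 for w, x in zip(weights[j], v)))


def batch_som(data, rows, cols, epochs=20, sigma0=None):
    """Batch-learning SOM sketch: weights change only after all data are seen."""
    dim = len(data[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    n = len(nodes)
    # Assumed initialisation: interpolate between per-dimension min and max,
    # which depends on the data set but not on the order of its elements.
    lo = [min(v[d] for v in data) for d in range(dim)]
    hi = [max(v[d] for v in data) for d in range(dim)]
    weights = [[lo[d] + (hi[d] - lo[d]) * i / max(n - 1, 1) for d in range(dim)]
               for i in range(n)]
    sigma0 = sigma0 or max(rows, cols) / 2
    for epoch in range(epochs):
        sigma = max(sigma0 * (1 - epoch / epochs), 0.5)  # shrinking radius
        bmus = [best_matching_unit(weights, v) for v in data]
        new_weights = []
        for j, (rj, cj) in enumerate(nodes):
            num, den = [0.0] * dim, 0.0
            for v, b in zip(data, bmus):
                rb, cb = nodes[b]
                dist2 = (rb - rj) ** 2 + (cb - cj) ** 2
                h = math.exp(-dist2 / (2 * sigma ** 2))  # Gaussian neighbourhood
                den += h
                for d in range(dim):
                    num[d] += h * v[d]
            # Keep the old weight if no datum influences this node.
            new_weights.append([x / den for x in num] if den > 0 else weights[j])
        weights = new_weights
    return weights
```

In use, each 10-kb fragment returned by `fragment(genome, 10_000)` would be passed through `kmer_frequencies(..., k=4)` to give a 256-dimensional vector, and the list of vectors would be handed to `batch_som`. Because each epoch sums over the whole data set before any weight is changed, shuffling the input alters the result only at the level of floating-point rounding.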
A SOM constructed with oligonucleotide frequencies in 10-kb sequences from the human genome identified oligonucleotides whose frequencies were characteristically biased from the random occurrence level, and 10-kb sequences rich in these biased oligonucleotides were self-organized on the map. Because these oligonucleotides often corresponded to functional signal sequences (e.g. binding sites for transcription factors) or their constituent elements, we categorized the occurrence patterns and frequencies of such pentanucleotides in the human genome that are thought to regulate transcription. SOM analysis depends only on oligonucleotide frequencies and is thus applicable even to sequenced genomes with little additional experimental data. Identifying transcription start sites (TSSs) required experimental data, whereas identifying the start sites of protein-coding sequences did not require such data in most cases. Once known signal sequences of various species with sufficient experimental data are characterized systematically, an in silico method of signal sequence prediction can be developed for a wide range of species. Recently, we have developed a novel bioinformatics tool for phylogenetic classification of genomic sequence fragments derived from uncultured microorganism mixtures in environmental and clinical samples.
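One simple way to make "biased from the random occurrence level" concrete is an observed-to-expected ratio for each oligonucleotide. The sketch below is an illustration under stated assumptions, not the procedure used in this work: the expected count is taken from the mononucleotide composition of the same sequence (a zero-order null model), and the function name `biased_kmers` and the twofold `threshold` are hypothetical choices.

```python
import math
from collections import Counter


def biased_kmers(seq, k=5, threshold=2.0):
    """Return {k-mer: observed/expected} for k-mers over- or under-represented
    by at least `threshold`-fold relative to a mononucleotide null model."""
    seq = seq.upper()
    valid = set("ACGT")
    # Base composition of the sequence, ignoring ambiguity codes.
    base_counts = Counter(b for b in seq if b in valid)
    n_bases = sum(base_counts.values())
    base_freq = {b: base_counts[b] / n_bases for b in "ACGT"}
    # Observed counts over all windows made up only of A, C, G, T.
    kmer_counts = Counter(
        seq[i:i + k] for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    n_windows = sum(kmer_counts.values())
    result = {}
    for word, observed in kmer_counts.items():
        # Assumed null model: bases occur independently at their overall frequency.
        expected = n_windows * math.prod(base_freq[b] for b in word)
        ratio = observed / expected
        if ratio >= threshold or ratio <= 1 / threshold:
            result[word] = ratio
    return result
```

Only k-mers that occur at least once are reported, so completely absent words (ratio zero) would need separate handling. A higher-order null model, for example one conditioned on dinucleotide frequencies, may be more appropriate for genomic sequence and would change which pentanucleotides are flagged.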