{"created":"2023-06-20T13:20:53.946170+00:00","id":965,"links":{},"metadata":{"_buckets":{"deposit":"ed62b8cf-bc48-4bf4-a0b8-0e818a38d1f3"},"_deposit":{"created_by":1,"id":"965","owners":[1],"pid":{"revision_id":0,"type":"depid","value":"965"},"status":"published"},"_oai":{"id":"oai:ir.soken.ac.jp:00000965","sets":["2:430:20"]},"author_link":["9790","9791","9789"],"item_1_creator_2":{"attribute_name":"著者名","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"中村, 保一"}],"nameIdentifiers":[{"nameIdentifier":"9789","nameIdentifierScheme":"WEKO"}]}]},"item_1_creator_3":{"attribute_name":"フリガナ","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"ナカムラ, ヤスカズ"}],"nameIdentifiers":[{"nameIdentifier":"9790","nameIdentifierScheme":"WEKO"}]}]},"item_1_date_granted_11":{"attribute_name":"学位授与年月日","attribute_value_mlt":[{"subitem_dategranted":"2001-03-23"}]},"item_1_degree_grantor_5":{"attribute_name":"学位授与機関","attribute_value_mlt":[{"subitem_degreegrantor":[{"subitem_degreegrantor_name":"総合研究大学院大学"}]}]},"item_1_degree_name_6":{"attribute_name":"学位名","attribute_value_mlt":[{"subitem_degreename":"博士(理学)"}]},"item_1_description_12":{"attribute_name":"要旨","attribute_value_mlt":[{"subitem_description":"Rapid, automated sequencing technologies, together with related advances in computational analysis and informatics, have transformed the nature of biological research. The huge amounts of sequence data challenge the scientific community to understand and use this new information effectively. In this thesis, I describe the construction of data analysis and presentation systems for sequence information; these systems will assist in the identification of genes and in the implementation of high-throughput genome sequencing.

Chapter 1
CUTG (Codon Usage Tabulated from GenBank) is a comprehensive database of codon usage. To generate an electronic data set of codon usage for each gene and of codon choice trends in each genome, Ikemura et al. have compiled codon usage in the protein-coding genes contained within the international DNA sequence databases. The data files are available on FTP sites at Kazusa DNA Research Institute, National Institute of Genetics and European Bioinformatics Institute.
The compilation is synchronized with major releases of GenBank. The latest data source available during the preparation of this thesis was NCBI-GenBank Flat File Release 120.0. Codon frequencies for each of the 382,241 complete protein-coding sequences (CDSs) were compiled from the taxonomic divisions of the DNA sequence database. The sum of the codons used by each of 11,388 organisms has also been calculated. A list of the codon usage of genes and the sum of the codons used by each organism can be viewed at http://www.kazusa.or.jp/codon/. A new WWW interface has been developed to provide data in a format compatible with that of the CodonFrequency output in the GCG Wisconsin Package™. Also, for each species, there is a query box to search for information in the comments for each gene. The user can choose CDSs by keyword and then generate codon usage tables from the selected genes. This tool provides researchers with the ability to examine intra-species variations in codon usage.
As an application of codon usage-based analysis for microbial genome sequencing efforts, I used only the sequences of the ribosomal protein genes as standards for calculation when I performed a modified codon adaptation index (CAI) analysis. This is in contrast to the traditional method of analysis, which relies on prior knowledge of the sequences of the most highly expressed genes.
To begin, I tabulated the patterns of codon-anticodon recognition in the following microorganisms whose genomes have been sequenced completely: Haemophilus influenzae Rd, Methanococcus jannaschii, and Synechocystis sp. strain PCC6803. For Escherichia coli, Mycoplasma genitalium, Mycoplasma pneumoniae, and Saccharomyces cerevisiae, the previously adopted codon-anticodon combination was used.
I then used a modified CAI (Sharp and Li, 1987) as a measure of synonymous codon bias. The original CAI value for each gene was measured using as a basis the codon preferences of the genes for highly expressed proteins, such as ribosomal proteins and elongation factors. To generalize this method to organisms for which only sequence information exists, I modified the extraction procedure to take into account only the sequences of the ribosomal protein-coding genes, and the codon usage biases of the ribosomal protein genes of each of the seven microbial genomes were recalculated.
With these values, CAIrp, a CAI that depends on the codon biases of the ribosomal protein genes, was calculated for all of the protein-coding genes of each genome. Of the seven genomes examined, a clear correlation between the CAIrp score and the level of protein-coding gene expression was observed for all but M. genitalium. In the other six genomes, elongation factors, chaperonins, and ribosomal proteins had high CAI scores. In contrast, genes for transposases and genes of prophage origin, which are expressed at lower levels, had the lowest CAI scores. This result indicates that codon usage analysis based on ribosomal protein gene sequences may be useful for predicting the expression levels of unknown genes. This method would be particularly useful for microbes whose entire genome is being sequenced, since the DNA sequences of most, if not all, genes would be available.
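The CAIrp calculation described above can be illustrated with a minimal sketch. It assumes the standard Sharp and Li formulation (each codon is weighted by its count relative to the most used synonymous codon in the reference set, and the score is the geometric mean of those weights), with ribosomal protein genes as the reference set. The two-family codon table, the toy sequences, and all function names are hypothetical, and codons never observed in the reference set are simply skipped, which is a simplification rather than the procedure used in the thesis.

```python
import math
from collections import Counter

# Toy synonymous-codon families; a real table covers every amino acid.
FAMILIES = {
    'Lys': ['AAA', 'AAG'],
    'Glu': ['GAA', 'GAG'],
}

def codons(seq):
    # Split an in-frame coding sequence into complete codons.
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def relative_adaptiveness(reference_genes):
    # w(codon) = count / count of the most used synonymous codon,
    # counted over the reference genes (here: ribosomal protein genes).
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    weights = {}
    for family in FAMILIES.values():
        top = max(counts[c] for c in family)
        for c in family:
            if top > 0 and counts[c] > 0:
                weights[c] = counts[c] / top
    return weights

def cai(gene, weights):
    # Geometric mean of w over the codons that have a weight.
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float('nan')

# Hypothetical reference set standing in for ribosomal protein genes.
rp_weights = relative_adaptiveness(['AAAAAAAAGGAA'])
score_preferred = cai('AAAGAA', rp_weights)  # uses only the preferred codons
score_rare = cai('AAGAAG', rp_weights)       # uses only the less used Lys codon
```

In this toy case a gene built from the codons preferred by the reference set scores higher than one built from the rarer synonym, which is the property used to rank genes by expected expression level.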

Chapter 2
A WWW database system that provides information for deduced protein-coding genes was constructed for the cyanobacteria sequencing project.
Cyanobacteria are prokaryotic microorganisms that carry a complete set of genes for oxygenic photosynthesis. In 1996, Kaneko et al. reported the complete 3.57 megabase (Mb) sequence of the genome of Synechocystis sp. strain PCC6803, which contains 3,168 potential protein-coding genes.
CyanoBase (http://www.kazusa.or.jp/cyano/) is an online resource for accessing genomic data for the cyanobacterium. The core portion of CyanoBase contains annotations for each of the 3,168 protein-coding genes deduced from the entire nucleotide sequence of the Synechocystis sp. strain PCC6803 genome. The annotation for each protein-coding gene is accessible through three menus on the main page of this database: map image, gene classification lists, and keyword and similarity search engines. The aim of this database is to provide detailed information on potential protein-coding genes through a user-friendly interface that includes clickable genome maps and a hypertext classification list.
The database also contains repository facilities that store and offer experimental information and the proposed function of each gene. Of the 3,168 deduced genes on the Synechocystis genome, 1,722 are annotated as functionally unassigned; these include 1,270 putative genes, 418 genes similar to hypothetical ones, and 34 genes similar to expressed sequence tags (ESTs) of other genomes. To analyze the functions of these genes, systematic disruption of each gene and characterization of the resulting mutants is thought to be a promising strategy.
CyanoMutants (http://www.kazusa.or.jp/cyano/mutants/) is a cumulative database that allows users to store and access mutant information through the WWW. Each entry in CyanoMutants contains three sections: identification of the mutated gene, information about the phenotype, and the person to whom correspondence should be addressed. Each entry is linked to the corresponding annotation in CyanoBase, and the corresponding page in CyanoBase in turn links to the page in CyanoMutants that provides mutant information.
This linked information will prevent unnecessary overlap in experiments and promote communication among scientists to elucidate the functions of putative genes in cyanobacteria.
As of December 2000, CyanoMutants contained 431 mutant entries, 134 of which have phenotype descriptions. The number of genes registered is expected to increase continuously, since a large number of gene disruption experiments have been carried out since the release of the genomic sequence of Synechocystis sp. strain PCC6803.

Chapter 3
A protocol to automate the execution of similarity searches and gene prediction programs was developed for the Arabidopsis thaliana genome sequencing project. High-throughput annotation of 27 Mb of A. thaliana genomic sequence has been carried out with the assistance of the system.
The 125 Mb genome of A. thaliana is organized into five chromosomes and contains an estimated 25,500 genes. To understand the entire genetic system of this plant, an international project to sequence the A. thaliana genome was initiated in 1996 and is currently in its completion phase. Our research group is participating in sequencing the entire bottom arm and portions of the top arm of chromosome 5, and also the top arm of chromosome 3. During the process of annotating the genomic sequences of clones on these chromosomes, I have constructed a computer-aided system for high-throughput gene identification.
In this system, nucleotide sequences are translated in six frames using the universal codon table, and each frame is subjected to a similarity search against the non-redundant protein database, nr, using the BLAST program. Each local alignment to a known protein sequence with an E-value < 0.001 is extracted and stored. Potential exons for protein-coding genes are predicted with the computer programs Grail and GENSCAN. For localization of exon-intron boundaries, donor/acceptor sites for splicing are predicted by NetGene2 and SplicePredictor. To identify transcribed regions and structural RNA genes, the BLAST program is used to compare nucleotide sequences with the EST and RNA gene data sets. For assignment of transfer RNA genes and transfer RNA structures, tRNAscan-SE is used.
All outputs are then parsed and stored in the General Feature Format (GFF). When required, the results are parsed and loaded into a WWW-based information display system called Arabidopsis Genome Displayer. This display system shows the positional relation of genome features along a genomic sequence. At the same time, an annotation-composing interface allows manual editing of the gene model, showing tentative nucleotide and protein sequences and the exon-intron organization. The annotator performs similarity searches as needed on the working model during the gene-modeling process. After careful editing, the most reasonable model of a genomic region is saved in the in-house database as a deduced gene.
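The filtering and storage steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the hit records, their keys, the clone and protein names, and the use of 'similarity' as the feature type are hypothetical, and the nine tab-separated columns follow the general GFF layout (seqname, source, feature, start, end, score, strand, frame, attributes) rather than the exact files produced by the system described here.

```python
E_VALUE_CUTOFF = 0.001  # threshold stated in the text
TAB = chr(9)            # GFF fields are tab-separated

def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    # Keep only alignments whose E-value is below the cutoff.
    return [h for h in hits if h['evalue'] < cutoff]

def to_gff_line(hit):
    # Nine GFF columns: seqname, source, feature, start, end,
    # score, strand, frame, attributes.
    fields = [
        hit['seqname'], hit['source'], 'similarity',
        str(hit['start']), str(hit['end']), str(hit['evalue']),
        hit['strand'], '.', 'Target ' + hit['subject'],
    ]
    return TAB.join(fields)

# Hypothetical parsed search results for one clone.
hits = [
    {'seqname': 'clone_A', 'source': 'BLAST', 'start': 1200, 'end': 1650,
     'evalue': 1e-20, 'strand': '+', 'subject': 'protein_X'},
    {'seqname': 'clone_A', 'source': 'BLAST', 'start': 5000, 'end': 5090,
     'evalue': 0.5, 'strand': '-', 'subject': 'protein_Y'},
]
gff_lines = [to_gff_line(h) for h in significant_hits(hits)]
```

Storing every tool's output in one positional format is what lets a display system draw predictions and similarity hits along the same genomic coordinates.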
In conclusion, 6,124 potential protein-coding genes were assigned to the 27 Mb regions of Arabidopsis chromosomes 3 and 5 covered by 461 clones and gap-closing units. The average density of genes was estimated to be 1 gene per 4.4 kb. One hundred twenty-seven RNA genes were deduced by similarity searches and computer predictions. Of the 6,124 deduced protein-coding genes, 2,808 carried EST sequences, indicating that 46% of the total genes in A. thaliana may be represented in the current EST databases.","subitem_description_type":"Other"}]},"item_1_description_18":{"attribute_name":"フォーマット","attribute_value_mlt":[{"subitem_description":"application/pdf","subitem_description_type":"Other"}]},"item_1_description_7":{"attribute_name":"学位記番号","attribute_value_mlt":[{"subitem_description":"総研大乙第87号","subitem_description_type":"Other"}]},"item_1_select_14":{"attribute_name":"所蔵","attribute_value_mlt":[{"subitem_select_item":"有"}]},"item_1_select_8":{"attribute_name":"研究科","attribute_value_mlt":[{"subitem_select_item":"生命科学研究科"}]},"item_1_select_9":{"attribute_name":"専攻","attribute_value_mlt":[{"subitem_select_item":"18 遺伝学専攻"}]},"item_1_text_10":{"attribute_name":"学位授与年度","attribute_value_mlt":[{"subitem_text_value":"2000"}]},"item_creator":{"attribute_name":"著者","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"NAKAMURA, Yasukazu","creatorNameLang":"en"}],"nameIdentifiers":[{"nameIdentifier":"9791","nameIdentifierScheme":"WEKO"}]}]},"item_files":{"attribute_name":"ファイル情報","attribute_type":"file","attribute_value_mlt":[{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2016-02-17"}],"displaytype":"simple","filename":"乙87_要旨.pdf","filesize":[{"value":"428.7 kB"}],"format":"application/pdf","licensetype":"license_11","mimetype":"application/pdf","url":{"label":"要旨・審査要旨 / Abstract, Screening 
Result","url":"https://ir.soken.ac.jp/record/965/files/乙87_要旨.pdf"},"version_id":"917c7847-cfe5-476f-ac60-89a26e9b3c75"},{"accessrole":"open_date","date":[{"dateType":"Available","dateValue":"2016-02-17"}],"displaytype":"simple","filename":"乙87_本文.pdf","filesize":[{"value":"5.4 MB"}],"format":"application/pdf","licensetype":"license_11","mimetype":"application/pdf","url":{"label":"本文","url":"https://ir.soken.ac.jp/record/965/files/乙87_本文.pdf"},"version_id":"a18a5707-28c3-420a-a0b6-8eafa57d21fc"}]},"item_language":{"attribute_name":"言語","attribute_value_mlt":[{"subitem_language":"eng"}]},"item_resource_type":{"attribute_name":"資源タイプ","attribute_value_mlt":[{"resourcetype":"thesis","resourceuri":"http://purl.org/coar/resource_type/c_46ec"}]},"item_title":"Data analysis and presentation of large-scale nucleotide sequence information","item_titles":{"attribute_name":"タイトル","attribute_value_mlt":[{"subitem_title":"Data analysis and presentation of large-scale nucleotide sequence information"},{"subitem_title":"Data analysis and presentation of large-scale nucleotide sequence information","subitem_title_language":"en"}]},"item_type_id":"1","owner":"1","path":["20"],"pubdate":{"attribute_name":"公開日","attribute_value":"2010-02-22"},"publish_date":"2010-02-22","publish_status":"0","recid":"965","relation_version_is_last":true,"title":["Data analysis and presentation of large-scale nucleotide sequence information"],"weko_creator_id":"1","weko_shared_id":1},"updated":"2023-06-20T14:43:48.556601+00:00"}