{"id":1048,"date":"2023-03-21T11:31:17","date_gmt":"2023-03-21T04:31:17","guid":{"rendered":"https:\/\/conf.icgbio.ru\/bgrs98\/?page_id=1048"},"modified":"2023-04-12T13:49:31","modified_gmt":"2023-04-12T06:49:31","slug":"098_infogene-a-database-of-known-gene-structures-and-predicted-genes-and-proteins-in-sequences-of-genome-sequencing-projects","status":"publish","type":"page","link":"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/abstract-list\/098_infogene-a-database-of-known-gene-structures-and-predicted-genes-and-proteins-in-sequences-of-genome-sequencing-projects\/","title":{"rendered":"INFOGENE: A DATABASE OF KNOWN GENE STRUCTURES AND PREDICTED GENES AND PROTEINS IN SEQUENCES OF GENOME SEQUENCING PROJECTS"},"content":{"rendered":"<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/authors-index\/#solovyev\">SOLOVYEV V.V.<\/a><sup>+<\/sup>,\u00a0<a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/authors-index\/#salamov\">SALAMOV A.A.<\/a><\/p>\n<p>The Sanger Centre, Hinxton, Cambridge, CB10 1SA, United Kingdom<\/p>\n<p>+Corresponding author<\/p>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/keywords-index\/\">Keywords<\/a>: database, gene structures, predicted genes, large scale genome sequencing<\/p>\n<p>Large-scale genome sequencing projects currently produce hundreds of megabases each year. The major sequencing centers are scaling up their throughput over the next few years. Shifting efforts toward sequencing gene-rich rather than random regions might provide the sequence of most human genes during the next 3 years. Moreover, the initiative to create a \u2018rough draft\u2019 of the human genome by 2001 may allow other scientists to proceed more rapidly with discovering disease genes. However, the sequence itself does not always provide knowledge of the gene coding regions, which usually cover only a small fraction of genomic DNA. 
Also, we cannot expect their rapid identification in the near future by purely experimental approaches, given such an enormous volume of sequence data. The value of sequence information for the biomedical community will strongly depend on the availability of candidate genes computationally predicted in these sequences.<\/p>\n<p>The aim of this work was to create an information resource of known and predicted gene structures in major model organisms such as human, mouse, Drosophila and Arabidopsis. The general scheme of the<i><b>\u00a0INFOGENE<\/b><\/i>\u00a0database is presented in Fig. 1.<\/p>\n<p><a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis98_98.gif\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" class=\"alignnone wp-image-1049 size-full\" src=\"https:\/\/conf.icgbio.ru\/bgrs98\/wp-content\/uploads\/sites\/111\/2023\/03\/Thesis98_98.gif\" alt=\"\" width=\"639\" height=\"315\" \/><\/a><\/p>\n<p align=\"CENTER\">Figure 1. General scheme of the INFOGENE components.<\/p>\n<p>&nbsp;<\/p>\n<p><i><b>INFOGENE<\/b><\/i>\u00a0is implemented under the Sequence Retrieval System (SRS) developed at the European Bioinformatics Institute (Etzold et al., 1996). This system makes it possible to connect the database with existing data resources (such as TRRD, TRANSFAC, SWISS-PROT, GenBank, etc.) 
and to make complex queries over several databases through a WWW server. In SRS, any retrieval command, logical operation on sets obtained by previous queries, link between sets from different databanks, or combination of these can be easily expressed in the SRS query language.<\/p>\n<p><strong>Database of known gene structures<\/strong><\/p>\n<p>The primary reasons for generating a database of known gene structures are:<\/p>\n<ul>\n<li>To collect known gene structures with their main features presented in a form convenient for retrieving entries that have particular features<\/li>\n<li>To easily create subsets of genes or exons with a given set of features<\/li>\n<li>To check the availability of genes with particular features<\/li>\n<li>To provide links to other databases that give regulatory site locations or other information for a particular gene (for example, polymorphisms or mutations underlying inherited disease)<\/li>\n<li>To make it possible to link similar genes of different model organisms<\/li>\n<\/ul>\n<p>The problem of reliable gene prediction in human genomic DNA is still open. The best multiple gene prediction programs, such as Genscan (Burge and Karlin, 1997) and Fgenes (Solovyev, 1998), were tested mostly on short sequences containing one gene. A recent test of these programs on 660 human genes shows that they can correctly predict about 80% of internal exons and only about 60% of 5\u2019-exons (Solovyev, 1998). The prediction of multiple genes should be even less accurate. 
Therefore, to develop further gene prediction programs it is important to have as much information as possible about known genes and their functional signals, which will provide the learning and testing datasets.<\/p>\n<p>We have developed a GenBank parser,\u00a0<i>GeneParse<\/i>, which produces a flat file describing genes and gene features, including terms corresponding to exon types, regulatory elements, processes and characteristics of genes in a given GenBank sequence. To add this information to SRS, we created several files with the logical structure of the\u00a0<i><b>INFOGENE<\/b><\/i>\u00a0database components and files with the syntax of their entries. Using these files, the information about gene structure was written to SRS with indexing of specific words in entries.<\/p>\n<p>We can use the SRS query language and search\/retrieval software to quickly extract sets of sequences with particular biological features, for example, genes where transcription start and stop sites are known, or entries with multiple genes. The query language will allow effective use of the database in investigating significant characteristics of genes and their regulatory elements and will assist in developing methods for their recognition. Currently it might take months to collect such information from the literature.<\/p>\n<p>One example of an\u00a0<i><b>INFOGENE<\/b><\/i>\u00a0entry, corresponding to the MMTNFAB locus of GenBank, is presented in Fig. 2. We can see that this locus includes 2 neighboring genes, whose exons and coding regions are described, as well as the locations of the transcription start, TATA box and transcription stop. 
In the LFT (Locus Features) field, the sequence is described by special keywords: mang (locus includes many genes), nasp (no alternative splicing), nmts (no multiple starts of transcription), natp (no alternative promoters), yftr (yes, full transcript), npse (no pseudogenes).<\/p>\n<p>For example, using the ytss keyword we can easily find that the start of transcription is provided for 251 genes with completely sequenced coding regions.<\/p>\n<pre>LID    MMTNFAB      GenBank MOUSE_G\r\nDAT 19980713\r\nLCO     7208 bp    DNA             ROD       11-MAY-1993\r\nLDE  Mouse complete TNF locus (TNF=tumor necrosis factor).\r\nORG  Mus musculus\r\nLKE   B1 repetitive sequence; lymphotoxin; tumor necrosis factor.\r\nLFT  mang nasp ytss nmts natp yftr npse\r\nLGN    2  2  2     7208       0\r\nGID GMM000399 direct\r\nGPR  TNF-beta\r\nIND\r\nPSD  SWISS-PROT: P09225\r\nGFT  nasp natp ytss nmts yftr nex5 mexo fcds rsiz nsto\r\nGEC     4   3   0   0     916     609     202     609\r\nTSS     1193     3207  1  be\r\nTAT     1174  yTAT  c tata a  dba\r\nEXO     1193     1345  f  c    gt\r\nEXO     1709     1813  i  ag   gt\r\nEXO     1897     1996  i  ag   gt\r\nEXO     2221     3207  l  ag   c\r\nCDS     1718     1813  f atg  gt I\r\nCDS     1897     1996  i  ag  gt V\r\nCDS     2221     2633  l  ag tag\r\nPOA     3186 c aataaa c com\r\nSEQ   MMTNFAB     GMM000399\r\nGRE     0 nrep dba\r\nDWG GMM000400\r\nGUS     1669 ctccgctacacacacacactctctctctctctctcagcaggttctccaca\r\nGDS     2633 gattctaaagaaacccaagaattggattccaggcctccatcctgaccgtt\r\nGID GMM000400 direct\r\nGPR  tnf-alpha\r\nIND\r\nPSD  SWISS-PROT: P06804\r\nGFT  nasp natp ytss nmts yftr nnex mexo fcds rsiz nsto\r\nGEC     4   4   0   0    1691     708     235     708\r\nTSS     4371     6968  1  be\r\nTAT     4331  yTAT  c tata a  dba\r\nEXO     4371     4712  f  c    gt\r\nEXO     5225     5279  i  ag   gt\r\nEXO     5457     5504  i  ag   gt\r\nEXO     5799     6972  l  ag   &gt;\r\nCDS     4527     4712  f atg  gt 
I\r\nCDS     5225     5279  i  ag  gt V\r\nCDS     5457     5504  i  ag  gt I\r\nCDS     5799     6217  l  ag tga\r\nPOA     6967 a aataaa g dba\r\nSEQ   MMTNFAB     GMM000400\r\nGRE     0 nrep dba\r\nUPG GMM000399\r\nGUS     4478 ctttcactcactggcccaaggcgccacatctccctccagaaaagacacca\r\nGDS     6217 aagggaatgggtgttcatccattctctacccagcccccactctgacccct<\/pre>\n<p>Figure 2. An example of an INFOGENE entry, corresponding to the MMTNFAB GenBank locus.<\/p>\n<p><b>Database of predicted genes<\/b><\/p>\n<p>The primary reason for generating a database of predicted gene structures is:<\/p>\n<ul>\n<li>To provide positional cloners, gene hunters and others with gene candidates found in finished and unfinished genomic sequences.<\/li>\n<\/ul>\n<p>We use two programs, Genscan (Burge and Karlin, 1997) and Fgenes (Solovyev, 1998), to predict genes, because exons predicted by both programs correspond to real ones much more often. A BLAST search (Altschul et al., 1977) is used to check whether any of the predicted exons are similar to known EST and protein sequences. An example of the description of predicted genes is presented in Fig. 3. We can see that Genscan predicted 5 coding exons (3 correct), and Fgenes predicted 5 exons (4 correct and 1 partially correct). All exons predicted by both programs are correct.<\/p>\n<p>&nbsp;<\/p>\n<pre><small>LID   HSCPH70    GenBank    TEST\r\nDAT   Mon Jul 13 11:44:57 BST 1998\r\nLCO   6711 bp    0\r\nORG   Homo sapiens    22\r\nLKE   repeats genes protein EST\r\nLFT   oneg   ntss   mexn\r\nLGG   1  1   1  1\r\nLGF   1  1   1  1\r\nGID   GHS000099\r\nGFT   genescan direct  mexn   ntss   fcds\r\nGEC   5  5767bp  201aa\r\nCDS  584     632  atg   gt\r\nCDS  1625    1783   ag   gt\r\nHOP     pir|F14571|SSC5D12 SSC5D12 NID: g972046 - pig.                
6e-07\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      3e-53\r\nCDS  4318    4406   ag   gt both\r\nHOP     pir|P05092|CYPH_HUMAN PEPTIDYL-PROLYL CIS-TRANS ISOMERASE A   1e-12\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      3e-53\r\nCDS  4628    4800   ag   gt both\r\nHOP     pir|P10111|CYPH_RAT PEPTIDYL-PROLYL CIS-TRANS ISOMERASE A     2e-29\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      2e-94\r\nCDS  6215    6350   ag  tga both\r\nHOP     pir|P05092|CYPH_HUMAN PEPTIDYL-PROLYL CIS-TRANS ISOMERASE A   2e-19\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      2e-72\r\nPOA       6538 a aataaa a\r\nSEQ   HSCPH70   GHS000099\r\nGUS     584 ggcgtctctctaagatgcccaggctggtggccggtgtcgaactcctaaga\r\nGDS    6350 gtttgacttgtgttttatcttaaccaccagatcattccttctgtagctca\r\nGFT   fgenes  direct  mexn   ytss   fcds\r\nGEC   5  5111bp  305aa\r\nTSS     1615\r\nTAT     1585   yTATA\r\nCDS  1240    1728   atg   gt\r\nHOP     pir|F14571|SSC5D12 SSC5D12 NID: g972046 - pig                 6e-04\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      5e-58\r\nCDS  4173    4203    ag   gt\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      1e-10\r\nCDS  4318    4406    ag   gt both\r\nHOP     pir|P05092|CYPH_HUMAN PEPTIDYL-PROLYL CIS-TRANS ISOMERASE A   1e-12\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      3e-53\r\nCDS  4628    4800    ag   gt both\r\nHOP     pir|P10111|CYPH_RAT PEPTIDYL-PROLYL CIS-TRANS ISOMERASE A     2e-29\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      2e-94\r\nCDS  6215    6350    ag  tga both\r\nHOP     pir|P05092|CYPH_HUMAN PEPTIDYL-PROLYL CIS-TRANS ISOMERASE A   2e-19\r\nHOE     gnl|UG|Hs#S552867 Human cyclophilin gene for cyclophilin      2e-72\r\nSEQ   HSCPH70   GHS000099\r\nGUS    1240 aacggtcggaaggggcgtctctctaagatgctggctaattaccaggtaac\r\nGDS    6350 
gtttgacttgtgttttatcttaaccaccagatcattccttctgtagctca\r\nLRE     5028     5314    AluSx    SINE\/Alu<\/small><\/pre>\n<p>Figure 3. Example of a predicted gene description for the HSCPH70 sequence.<\/p>\n<p>The INFOGENE database is available through the WWW server of the Computational Genomics Group at\u00a0http:\/\/genomic.sanger.ac.uk\/<\/p>\n","protected":false},"excerpt":{"rendered":"<p>SOLOVYEV V.V.+,\u00a0SALAMOV A.A. The Sanger Centre, Hinxton, Cambridge, CB10 1SA, United Kingdom +Corresponding author Keywords: database, gene structures, predicted genes, large scale genome sequencing Large scale genome sequencing projects currently produce hundreds of megabases each year. The major sequencing centers &hellip; <a href=\"https:\/\/conf.icgbio.ru\/bgrs98\/abstracts\/abstract-list\/098_infogene-a-database-of-known-gene-structures-and-predicted-genes-and-proteins-in-sequences-of-genome-sequencing-projects\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":13,"featured_media":0,"parent":97,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/1048"}],"collection":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/comments?post=1048"}],"version-history":[{"count":3,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/1048\/revisions"}],"predecessor-version":[{"id":1437,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/1048\/revisions\/1437"}],"up":[{"embeddable":true,"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp-json\/wp\/v2\/pages\/97"}],"wp:attachment":[{"href":"https:\/\/conf.icgbio.ru\/bgrs98\/wp
-json\/wp\/v2\/media?parent=1048"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}