Lecture course outline » Young scientists school BGRS-2004

Prof. Thomas Casavant, Professor and Director, The UI Center for Bioinformatics and Computational Biology,
Parallel Processing Laboratory,
Departments of Electrical and Computer and Biomedical Engineering, Genetics, Ophthalmology, and The Holden Comprehensive Cancer Center
University of Iowa
USA
“Grid Computing Approaches to Finding Distant Orthologs and Horizontal Gene Transfer Events”.
Abstract:
This talk describes and evaluates a coarse-grained parallel computational approach to identifying rare evolutionary events often referred to as “horizontal gene transfers”. Unlike classical genetic evolution, in which variations in genes accumulate gradually within and among species, horizontal transfer events result in a set of potentially important genes that “jump” directly from the genetic material of one species to another. Such genes, known as xenologs, appear as anomalies when phylogenetic trees are compared for normal and xenologous genes from the same sets of species. However, such comparisons have not previously been possible because of a lack of data and computational capacity. With the availability of large numbers of computer clusters, as well as genomic sequence from more than 2,000 species containing as many as 35,000 genes each, and trillions of sequence nucleotides in all, it is now possible to examine “clusters” of genes using phylogenetic tree “similarity” as a distance metric. The full version of this problem requires years of CPU time, yet makes only modest inter-process communication (IPC) and memory demands; thus, it is an ideal candidate for a grid computing approach. This paper describes such a solution and preliminary benchmarking results that show a reduction in total execution time from approximately two years to less than two weeks. I will also report on several trade-off issues in various partitionings of the problem across WAN nodes and LAN/WAN networks of tightly coupled computing clusters.
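The coarse-grained decomposition described in this abstract can be illustrated with a minimal sketch. It assumes trees are written as nested tuples of leaf names and uses the rooted Robinson-Foulds distance (the number of clades present in one tree but not the other) as a stand-in for the tree "similarity" metric, which the abstract does not specify. All function names are hypothetical.

```python
from itertools import combinations

def leaf_set(tree):
    """Return the set of leaf labels under a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clades(tree):
    """Collect the leaf sets of all internal nodes (the clades of the tree)."""
    if isinstance(tree, str):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found

def tree_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: size of the clade-set symmetric difference."""
    return len(clades(tree_a) ^ clades(tree_b))

def work_units(n_trees, n_workers):
    """Split all unordered tree pairs into independent chunks, one per worker."""
    pairs = list(combinations(range(n_trees), 2))
    return [pairs[w::n_workers] for w in range(n_workers)]

def run_unit(trees, unit):
    """Compute distances for one chunk; no communication with other chunks is needed."""
    return {(i, j): tree_distance(trees[i], trees[j]) for i, j in unit}

# Toy input: two gene trees that agree and one with an anomalous topology.
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]

# Here the units run in a loop; on a grid each unit would go to a separate node.
distances = {}
for unit in work_units(len(trees), n_workers=2):
    distances.update(run_unit(trees, unit))
```

Because each work unit only reads the trees and writes its own results, the units can be scheduled on separate nodes with no communication until the final merge, which is the property that makes the problem suitable for a grid.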
Prof. Vassily Lyubetsky, Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
“Reconstruction Of Evolutionary Events At Molecular Level And Inference Of Species Phylogeny”

Abstract:
Mathematical methods and models for comparative analysis of large sets of protein phylogenies are described. The processes modeled are gene duplication, loss, gain and horizontal transfer. Initially, a species tree is constructed as a consensus of the corresponding gene trees using a probability distribution on the source data. Algorithms are then applied to identify the vertices that account for topological disparities between gene and species trees, making it possible to infer the underlying evolutionary events. The analysis is illustrated with case studies of a prokaryotic protein family and of a set of protein phylogenies derived from families in the COGs database (NCBI). The potential of the described methods to infer phylogeny and gene evolution events is discussed.

The methods and algorithms described here address two tasks: reconstructing prokaryotic species trees and analyzing hypotheses about gene evolution. The main emphasis is placed on original algorithms and their performance, although, due to space limits, only general descriptions are provided along with the necessary references. Events in gene evolution are usually viewed as gene divergence during species differentiation, gene duplication, gene gain, gene loss and horizontal gene transfer (HGT). The molecular data are protein sequences grouped according to their amino acid and functional similarity into clusters of orthologous groups of proteins (COGs) (Tatusov et al., 2001).
The general approach to reconstructing gene evolution events has long been defined (Goodman et al. 1979, Eulenstein et al. 1998). A protein gene family is selected, usually from among the COGs, followed by construction of a multiple sequence alignment and reconstruction of the gene tree G (also referred to as a protein tree or COG tree). Topological similarity and disparity between the gene trees of a set {Gi} are then analyzed in order to reconstruct the species tree and to infer gene evolution events, respectively. To obtain the species tree S, the topological differences are reconciled. Alternatively, when inferring gene evolution events, considerable topological differences between a particular gene tree G (often belonging to the family {Gi}) and the species tree S are the basis of the analysis.
Mathematical models of gene evolution are formulated to accommodate the observed differences, and optimization of the model parameters is used as a tool to reconstruct the evolutionary history of a microbial gene family. The evolutionary model is defined as a procedure for comparing gene and species trees, while its parameters are defined as sets of tree vertices with assigned evolutionary events. An optimized model has parameters corresponding to the extremes of the relevant evolutionary characteristics.
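The idea of assigning evolutionary events to tree vertices by comparing a gene tree G with a species tree S can be illustrated with a minimal sketch. It uses the widely used least-common-ancestor mapping and infers duplications only; it is not the original algorithm referred to in the abstract, and it does not handle losses, gains or horizontal transfer. Trees are assumed to be nested tuples whose leaves are species names, and all function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of species names at the leaves of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def clades(tree):
    """Return the leaf sets of every node of the tree, leaves included."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    result = [leaf_set(tree)]
    for child in tree:
        result.extend(clades(child))
    return result

def lca(species_clades, species):
    """Smallest species-tree clade containing all the given species."""
    return min((c for c in species_clades if species <= c), key=len)

def count_duplications(gene_tree, species_tree):
    """Count gene-tree vertices that map to the same species-tree vertex as a child."""
    species_clades = clades(species_tree)

    def visit(node):
        mapped = lca(species_clades, leaf_set(node))
        if isinstance(node, str):
            return mapped, 0
        duplications = 0
        child_maps = []
        for child in node:
            child_map, child_dups = visit(child)
            child_maps.append(child_map)
            duplications += child_dups
        if mapped in child_maps:
            duplications += 1  # this vertex is explained by a duplication
        return mapped, duplications

    return visit(gene_tree)[1]

S = (("A", "B"), "C")
G_congruent = (("A", "B"), "C")
G_conflicting = (("A", "C"), ("B", "C"))
```

With these toy inputs, `count_duplications(G_congruent, S)` finds no duplication, while `count_duplications(G_conflicting, S)` assigns one duplication to the root of the gene tree, which is the kind of vertex-level event assignment the model parameters describe.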
Prof. Vassily Lyubetsky, Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
“A mathematical model for regulation of gene expression by formation of alternative RNA structures”.

A computer model of the regulation of gene expression in bacteria mediated by the dynamic formation of RNA secondary structures is described in [Lyubetsky, Molecular Biology, 2005] and [Lyubetsky, Information processes, 2005].
The present version of the program implements Monte Carlo modeling of the regulation process, from the binding of RNA polymerase to the DNA chain, through the binding of the ribosome to the Shine-Dalgarno box of the leader peptide, until termination of transcription (or anti-termination). In the present version all microstates and macrostates of the secondary structure in the window between the positions of the ribosome and the polymerase are constructed anew at each step of the process. This decreases the efficiency of the algorithm and makes modeling of long nucleotide sequences difficult. We plan to develop a new version of the algorithm in which the set of states is recomputed from the previous set of states, taking into account only the change in the positions of the ribosome and the polymerase.
Testing. The developed version of the program was successfully tested, in particular for S. venezuelae ISP5230, S. avermitilis MA-4680 and S. coelicolor A3(2). The modeling results correspond well with the biological data, which exist for the first of these cases [C. Lin, A. Paradkar, L. Vining, Microbiology, 1998, 144, p. 1971-1980].
Algorithm. The RNA folding process is represented as a Markov process with states corresponding to RNA secondary structures and transition probabilities corresponding to transformations of a secondary structure caused by the formation or disintegration of a helix. A macrostate is defined as a set of secondary structures having the same topological (bracket) structure of bound helices. The transition probabilities (kinetic constants) for transitions between secondary structures (microstates) belonging to the same macrostate can be chosen arbitrarily, with the only condition that the equilibrium Gibbs distribution on this macrostate (as on the set of microstates) is invariant. The transition probabilities for transitions corresponding to the formation or disintegration of a helix are chosen in a special way such that the principle of detailed balance is satisfied and the probability of disintegration of a helix depends only on the helix binding energy. Conversely, the probability of helix formation depends only on the free energy of the loops corresponding to the given set of helices. It should be noted that for several (not all) organisms the binding energy of a helix also contains a term depending on the total length of its loop, but in typical situations the binding energy is calculated just from experimental data on stacking. The free energy of a loop depends on its length, taking into account the Flory entropy term and the elastic energy term. A topological correction term, which differentiates the end loops and the side loops, is also included following experimental data. Then the averaged transition rates for transitions between different macrostates (topologically different secondary structures) are calculated. These rates define the part of the model that does not contain the processes of transcription or translation. The process of transcription is described by the transition probability (kinetic constant) for the change of the RNA-polymerase position on the DNA chain.
The nominal value of this constant equals 40 s⁻¹, but in reality it strongly depends on the secondary structure of the RNA chain. There is experimental evidence for the influence of helices on the transcription rate. We propose a resonance-type formula for the interaction between an RNA hairpin and the RNA-polymerase molecule. This formula roughly corresponds to the experimental data. The process of transcription can be terminated if the RNA polymerase is located at a T-rich segment of the DNA chain. We consider several physical mechanisms of termination. All of them give the same dependence of the termination rate constant on the transcription rate. Again we use experimental data to evaluate the parameter value in the formula for that dependence. The last process incorporated in the model is translation. The role of this process in termination is the following: the ribosome influences the secondary structure of the RNA chain by destroying some helices. As a result, the presence (or absence) of certain helices determines the transcription rate and hence the termination rate. The kinetic constant of translation is 15 s⁻¹ for any non-regulatory codon, but for regulatory codons the translation rate depends on the concentration of the corresponding amino acid or, more exactly, on the concentration of charged tRNA. Assuming a Michaelis-Menten type formula for all these dependences, we obtain a formula of the same type for the overall dependence of the translation rate on the amino acid concentration in the medium. The corresponding Michaelis-Menten parameter has no direct physical meaning, because it reflects a series of different processes: amino acid diffusion, amino acid binding to tRNA, tRNA binding to the ribosome and so on.
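The statement that chained Michaelis-Menten dependences yield an overall dependence of the same type can be checked for two steps. The symbols below (c for the amino acid concentration in the medium, y for an intermediate quantity such as the charged tRNA concentration, v for the translation rate, and the constants a, b, K_1, K_2) are introduced here for illustration only and do not come from the original model.

```latex
\begin{align*}
y &= \frac{a\,c}{K_1 + c}, \qquad v = \frac{b\,y}{K_2 + y} \\
v &= \frac{a b\,c}{K_2 (K_1 + c) + a\,c}
   = \frac{a b\,c}{K_1 K_2 + (K_2 + a)\,c}
   = \frac{V c}{K + c}, \\
V &= \frac{a b}{K_2 + a}, \qquad K = \frac{K_1 K_2}{K_2 + a}.
\end{align*}
```

The effective constant K mixes the parameters of both steps, which is consistent with the remark that the overall Michaelis-Menten parameter has no direct physical meaning.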
The result of the modeling procedure shows the dependence of the probability of termination of transcription on the concentration of the amino acid in the medium. These results for several cases are in good (at least qualitative) correspondence with experiments.
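The conditions placed on the helix kinetics in the Algorithm paragraph (detailed balance, a disintegration rate that depends only on the helix binding energy, and a formation rate that depends only on the loop free energy) can be illustrated with a minimal sketch. The exponential rate form, the attempt rate `K0` and the energy values are assumptions chosen for illustration; the abstract does not give the actual formulas or parameters.

```python
import math

RT = 0.616   # kcal/mol, roughly 37 degrees C (assumed temperature)
K0 = 1.0e6   # hypothetical attempt rate, 1/s

def formation_rate(loop_free_energy):
    """Helix formation rate; depends only on the free energy of the closed loop."""
    return K0 * math.exp(-loop_free_energy / RT)

def disintegration_rate(helix_binding_energy):
    """Helix disintegration rate; depends only on the (negative) binding energy."""
    return K0 * math.exp(helix_binding_energy / RT)

def gibbs_weight(free_energy):
    """Unnormalized equilibrium weight of a state with the given free energy."""
    return math.exp(-free_energy / RT)

# Hypothetical energies (kcal/mol) for one helix and the loop it closes.
loop_g = 4.0
helix_e = -6.5

open_state_g = 0.0                 # structure without the helix (reference)
closed_state_g = loop_g + helix_e  # structure with the helix formed

# Detailed balance: equilibrium probability flux is equal in both directions.
flux_forward = gibbs_weight(open_state_g) * formation_rate(loop_g)
flux_backward = gibbs_weight(closed_state_g) * disintegration_rate(helix_e)
```

With this choice the ratio of formation to disintegration rates equals the Gibbs factor of the free-energy difference between the two structures, so `flux_forward` and `flux_backward` agree up to floating-point rounding, whatever values are chosen for the two energies.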
Prof. Maria Samsonova, St. Petersburg State Polytechnical University, St. Petersburg, Russia
“Methods for the integration of distributed heterogeneous bioinformatics tools and data resources”

Abstract:
It is well known that bioinformatics has to cope with large amounts of information in all knowledge domains. There are hundreds of resources and applications available to today's biologist via “command line” applications, databases, flat files, web forms or graphical user interfaces. These may be either local to the user or provided by remote sites. Moreover, these resources are updated frequently and have different semantics.
Recently, technologies have begun to appear that make it possible to move from an interactive to an automated approach in biological information management by providing a distributed environment that supports the in silico experimental process in bioinformatics. At the core of these technologies is the construction of workflows. There is currently considerable development in workflow tools; however, this is still a broad area with many competing proposals and no accepted standards.
In my lecture I am going to present the technology that we have developed to understand the dynamical regulatory mechanisms controlling the expression of segmentation genes in the fruit fly Drosophila (Jaeger et al., 2004, Nature, 430). This technology was used to construct a Laboratory Information Management System (LIMS) known as PIPE. PIPE is easily extendable to new data processing and analysis methods, flexible in the specification and modification of these methods, scalable, and supports distributed processing and analysis of data and images.
Prof. Dmitry Scherbakov, Limnological Institute SB RAS, Irkutsk, Russia
“Modeling of molecular evolution processes in different speciation scenarios”

Individual-oriented modeling may be an efficient tool for studying the possible mechanisms of the evolutionary process. A serious problem of this approach remains the difficulty of checking its results experimentally. To facilitate the interaction of experimental and theoretical studies of the microevolutionary process, we propose to include in the models objects that model accumulating neutral mutations or objects that are similar to proteins. As a result, it becomes possible to examine the question of whether sets of homologous segments of nucleic acids obtained in experimental studies of populations can help to build a phylogenetic criterion for testing evolutionary hypotheses. We illustrate this approach with two examples: a simulation of the co-evolution of hosts and vertically transmitted intracellular parasites that cause feminization of males, and a simulation of coordinated changes in protein sequences.
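The proposal to attach neutrally evolving objects to the individuals of a model can be illustrated with a minimal sketch. It assumes a constant-size, non-overlapping-generations population with random parent choice and binary marker sites rather than real nucleotides; the function names and parameter values are hypothetical and are not taken from the lecture.

```python
import random

def simulate(pop_size, seq_len, generations, mu, seed=0):
    """Evolve a population whose individuals carry a neutral marker sequence."""
    rng = random.Random(seed)
    population = [[0] * seq_len for _ in range(pop_size)]
    for _ in range(generations):
        next_generation = []
        for _ in range(pop_size):
            parent = rng.choice(population)   # markers do not affect fitness
            child = list(parent)
            for site in range(seq_len):
                if rng.random() < mu:
                    child[site] ^= 1          # neutral mutation flips the site
            next_generation.append(child)
        population = next_generation
    return population

def hamming(seq_a, seq_b):
    """Number of sites at which two marker sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))

# Sample a few individuals, as an experimental study of a population would.
sample = simulate(pop_size=50, seq_len=40, generations=100, mu=0.001, seed=1)
pairwise = [hamming(sample[i], sample[j])
            for i in range(5) for j in range(i + 1, 5)]
```

The pairwise differences in `pairwise` play the role of the sequence data obtained experimentally: a tree built from them can be compared with the genealogy implied by the simulated scenario.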
Prof. Alexis Ivanov, Institute of Biomedical Chemistry RAMS, Moscow, Russia
“3D modeling and molecular dynamics simulation of peripheral membrane proteins: case study with cytochrome b5 in explicit lipid bilayer”

Alexis S. Ivanov, Yulia Yu. Smolinskaya, Alexander V. Veselovsky, Alexander I. Archakov
Institute of Biomedical Chemistry RAMS, Moscow, Russia

Abstract:
Molecular dynamics (MD) simulations of full-length cytochrome b5 (b5) in a membrane environment were carried out to investigate the structure and probable membrane topology of b5. All computations were executed on a Linux cluster (32 CPUs) running the Gromacs 3.2 suite of programs. MD simulations were performed in a complex system consisting of an explicit dipalmitoylphosphatidylcholine (DPPC) bilayer (338 lipid molecules) and two water phases (about 15000 molecules each). Preliminary equilibration of this membrane system was done through 3.5 ns of MD at constant temperature and pressure. Several structural parameters of the membrane model (bilayer thickness, surface area and volume per lipid, ordering of the DPPC chains, etc.) reproduced the available experimental data quite well.
The resulting model system was validated as a membrane environment for modeling and MD simulation of membrane proteins. A reference protein with known structure and peripheral membrane topology (monoamine oxidase A, MAO) was used. Comparison of the crystal and MD equilibrium structures of MAO confirms that the created membrane system is suitable for simulating the structure and membrane topology of peripheral membrane proteins. To study the probable membrane topology of full-length b5, we performed several series of MD runs with simulation times of up to 3.0 ns. Two hypothetical structures of b5, with a transmembrane or a loop membrane anchor, were analyzed. Special attention was paid to the interaction of the membrane-bound part of the protein with the lipid bilayer. The results of the simulations demonstrate that b5 with either type of anchor can be stable in a complex membrane environment, which provides an explanation for the known contradictory experimental data.
Dr. Dmitry Afonnikov, Institute of Cytology and Genetics, Novosibirsk, Russia
“Analysis of co-ordinated substitutions in protein sequences”

Recent results suggest that during evolution certain substitutions at protein sites may occur in a coordinated manner due to interactions between amino acid residues. Information about these coordinated substitutions may be helpful in the analysis of protein structure-function relationships. Here, I will consider coordinated amino acid substitutions as a model of protein evolution. Experimental evidence for the existence of dependent substitutions will be reevaluated. Currently available methods for the detection and evaluation of coordinated substitutions will be reviewed. Possible practical applications of information about coordinated substitutions to the analysis of protein evolution and function will be described.
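One simple way to detect candidate coordinated substitutions is to measure the statistical dependence between pairs of alignment columns. The sketch below uses mutual information for this purpose; it is only one of several measures in use, it ignores the phylogenetic relatedness of the sequences, and it is not necessarily among the methods reviewed in the lecture. The toy alignment and function names are hypothetical.

```python
import math
from collections import Counter

def column(alignment, index):
    """Extract one column (one protein site) from a list of aligned sequences."""
    return [sequence[index] for sequence in alignment]

def mutual_information(col_a, col_b):
    """Mutual information, in bits, between the residues at two sites."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi

# Toy alignment: sites 0 and 1 change together, site 2 varies independently.
alignment = ["AKE", "AKD", "GRE", "GRD"]
coupled = mutual_information(column(alignment, 0), column(alignment, 1))
independent = mutual_information(column(alignment, 0), column(alignment, 2))
```

In this toy alignment the residue at site 0 fully determines the residue at site 1, so `coupled` is one bit, while `independent` is zero because site 2 carries no information about site 0.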
Dr. Luciano Milanesi, National Research Council – Institute of Biomedical Technology, Italy
“Distributed Applications, Web Services, Tools and GRID Infrastructures for Bioinformatics”

Because of the increasing number of nucleotide and protein sequences produced by high-throughput techniques, all of which have to be analyzed by bioinformatics tools, it will be necessary to increase the currently available computing resources. To meet these new challenges successfully, it will therefore be necessary to develop dedicated supercomputers, parallel computers based on clustering technologies, and high-performance distributed platforms such as the GRID.
Next-generation Grid infrastructures aim to implement a distributed computing model in which easy access to large, geographically distributed computing and data management resources is provided to large multi- and inter-disciplinary Virtual Organizations (VO) made up of both research and user entities.
Indeed, computational and data Grids are “de facto” considered the way to realise the concept of virtual places where scientists and researchers work together to solve complex problems in bioinformatics, despite their geographic and organizational boundaries.
In these respects, Grid computing heralds another technological and societal revolution in high-performance distributed computing, much as the World Wide Web has, over the last ten years, changed the meaning and the availability of global information. The aim is to operate this widely distributed computing environment as a uniform service that looks after resource management, exploitation and security independently of individual technology choices.
A general overview will be given of GRID technologies and of the use of computer clusters to run distributed bioinformatics applications for data mining, gene discovery, and DNA and protein sequence similarity searching.