(Updated version, 02.07.2006)
July 12, 2006
Registration
(I) Introduction to high-performance computing. Part I.
Parallel computations on supercomputers find an ever-increasing number of applications in solving topical problems in modern science and technology. Their use is driven by the emergence of a new class of very large problems. The course of lectures will cover the history of supercomputing, a review of the most powerful supercomputers, the classes of problems requiring parallel computation, and the main trends in technology development. Architectures of supercomputers with shared and distributed memory will be described, as well as the distinctions between supercomputers and parallel clusters. Specific features of data storage during parallel computations will be considered, along with parallel programming technologies.
Lecture 1. Architecture of high performance computers. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
To start with, we will discuss the architectural principles of high performance computers and, in particular, of compute clusters. We will take a closer look at processors, interconnect technology, storage, and especially the memory architecture. The latter defines the classes of shared and distributed memory computers. The lecture will also present some data from the current TOP500 list of the most powerful computers in the world. Finally, an overview of operating system aspects will be presented.
Lecture 2. Parallel programming principles. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
We will now learn how parallel programs are characterized and how in principle we design and implement such programs. A good knowledge of compiler and hardware details is often necessary in order to get optimal performance of the program. The parallelization paradigm of data partitioning and message passing will be introduced. Two measures will be presented to evaluate the performance of the parallel program.
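The abstract does not name the two performance measures. Speedup and parallel efficiency are the ones most commonly used for this purpose, so the following minimal Python sketch assumes those two. The function names and the run times are illustrative assumptions, not data from the lecture.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(p) = T(1) / T(p): how many times faster the parallel run is."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, processes: int) -> float:
    """Parallel efficiency E(p) = S(p) / p: fraction of ideal linear speedup."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return speedup(t_serial, t_parallel) / processes


if __name__ == "__main__":
    t1 = 120.0                              # hypothetical serial run time, seconds
    timings = {2: 64.0, 4: 36.0, 8: 24.0}   # hypothetical parallel run times
    for p, tp in timings.items():
        print(p, round(speedup(t1, tp), 2), round(efficiency(t1, tp, p), 2))
```

An efficiency well below 1 usually indicates that communication or a serial part of the program dominates as more processes are added.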
Seminar 1. Phylogenetic analysis of a protein family. Dr. Daniil G. Naumoff, State Institute for Genetics and Selection of Industrial Microorganisms, Moscow, Russia.
I am going to present a complete procedure for protein family analysis, from database searching to visualization of the phylogenetic tree. Special attention will be paid to solving problems of multi-domain protein structure and to clarifying the phylogenetic status of ‘atypical’ members of a protein family. Results of the phylogenetic analysis of several glycosidase families will be shown as examples. The methods and programs suggested can be applied to any protein family, but they work more effectively with globular soluble proteins. Protein family analysis can be started using any protein sequence as a query. It does not matter whether the protein has been studied enzymatically or corresponds to a biochemically uncharacterized ORF. A preliminary version of the lecture (in Russian) has been published online in the Zbio journal (http://zbio.net/bio/001/003.html).
July 13, 2006
(II) Introduction to high-performance computing. Part II.
Lecture 3. Message passing with MPI. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
The first step into parallel programming will be taken with the Message Passing Interface (MPI). We will write a small program that distributes data to different compute nodes, performs some calculations, and finally collects the results. A few basic library calls for message passing will be introduced; these are already sufficient to write a first parallel program. Problematic issues such as debugging and performance analysis will be covered.
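The distribute, compute, collect pattern described above can be sketched without any MPI library. The following single-process Python sketch only mimics the data flow that MPI collectives such as MPI_Scatter and MPI_Gather provide; it is not the program written in the lecture. The function names and the workload (counting G and C bases in DNA strings) are hypothetical and chosen only for illustration.

```python
def scatter(items, n_ranks):
    """Split items into n_ranks nearly equal contiguous chunks, one per rank."""
    base, extra = divmod(len(items), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks


def work(chunk):
    """Per-rank computation: count G and C bases in each sequence of the chunk."""
    return [sum(1 for base in seq if base in "GC") for seq in chunk]


def gather(partial_results):
    """Concatenate the per-rank results in rank order."""
    return [value for part in partial_results for value in part]


def run(sequences, n_ranks):
    chunks = scatter(sequences, n_ranks)
    # In a real MPI program each rank would process its chunk concurrently;
    # here the ranks are simulated one after another in a single process.
    partial = [work(chunk) for chunk in chunks]
    return gather(partial)


if __name__ == "__main__":
    print(run(["GATTACA", "GGCC", "ATAT", "CGCGA", "T"], n_ranks=3))
```

Because the chunks are contiguous and gathered in rank order, the result has the same order as the input, which is also what a scatter followed by a gather gives in MPI.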
Lecture 4. Advanced issues with message passing. Prof. Thomas Ludwig, Ruprecht-Karls-Universität Heidelberg, Institut für Informatik, Heidelberg, Germany
MPI offers a huge number of library calls, most of which simply combine several basic calls and thus perform complicated activities in a single call. We will look at collective calls and sophisticated communication patterns. As bioinformatics is particularly data intensive, a first introduction to parallel input/output via MPI will be given. We will close with an outlook on advanced features of the MPI-2 standard and what they are used for.
(III) Application of high-performance computations in the problems related to construction and analysis of phylogeny.
Construction of phylogenetic relationships is among the most important biological problems. Molecular data (DNA, RNA, and protein sequences) are of great importance here. Active genome sequencing and the resulting large number of sequences have made the construction of phylogenies involving many sequences (1000 and more) a topical problem. This part of the lecture course will detail algorithms for the construction of phylogenetic trees and their comparison. Special attention will be paid to the specifics of implementing these algorithms on parallel computer architectures. Problems of large phylogenies (from 100 to 4000 sequences) will be considered as examples.
Lecture 5. Computation of large phylogenetic trees: algorithmic and technical solutions. Dr. Alexandros Stamatakis, Swiss Federal Institute of Technology, Lausanne, Switzerland
The computation of ever larger and more accurate phylogenetic trees, with the ultimate goal of computing the “tree of life”, is one of the grand challenges in high performance computing (HPC) bioinformatics. Statistical methods of phylogenetic analysis such as maximum likelihood and Bayesian inference have proved to be the most accurate models for evolutionary tree reconstruction.
Unfortunately, the size of trees which can be computed in reasonable time is limited by the severe computational cost induced by these methods. There exist two orthogonal research directions to overcome this challenging computational burden: Firstly, the development of novel, faster, and more accurate heuristic algorithms. Secondly, the application of high performance computing techniques, the deployment of supercomputers, and Grid-computing to provide the required computational power, mainly in terms of CPU hours.
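One reason heuristics are needed at all is that the number of candidate tree topologies grows super-exponentially with the number of sequences. This is a standard combinatorial result and is not taken from the abstract: there are (2n-5)!! distinct unrooted binary trees on n labeled taxa (n >= 3). The short Python sketch below, with a hypothetical function name, shows how quickly exhaustive search becomes infeasible.

```python
def unrooted_tree_count(n_taxa: int) -> int:
    """Number of distinct unrooted binary trees on n labeled taxa: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, unrooted_tree_count(n))
```

Already for 10 taxa there are over two million topologies, so scoring every tree under maximum likelihood is out of the question for hundreds or thousands of sequences.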
The field has witnessed significant algorithmic advances over the last 2-3 years, which allow for inference of large phylogenetic trees containing 500-1000 sequences on a single PC processor within a couple of hours using maximum likelihood. On the other hand, the main problem that high performance computing implementations of maximum likelihood analyses face is that technical development lags behind algorithmic development, i.e., the programs being parallelized no longer represent the state-of-the-art algorithms.
Within this context, the talk initially aims to provide a brief overview of the computational challenges that large-scale phylogenetic inference faces, concerning both algorithmic and supercomputing aspects. The benefits of simultaneous algorithmic and technical development are outlined using the example of the program RAxML (Randomized Axelerated Maximum Likelihood). The sequential version of RAxML has been used to compute the largest maximum likelihood tree to date (comprising 25,000 organisms) on a single CPU.
In addition, recent algorithmic developments including novel genetic search algorithms and search techniques will be discussed. Finally, an overview of possible future HPC implementations of those novel algorithms is provided, including Grid-based solutions, implementations for hybrid supercomputer architectures, and exploitation of vector-like peripheral processors such as Graphics Processing Units (GPUs).
Seminar 2. The Parallelization of Bioinformatics Problems: A Tutorial. Yury Vyatkin, Institute of Cytology and Genetics, Novosibirsk, Russia.
In this tutorial we will follow the entire path from a serial program to its fully parallel version, to learn how to make full use of the features of modern high performance computing systems. The tutorial should be useful to anyone who knows a little C and wants to learn how to solve bioinformatics problems with modern tools. We will cover the following topics:
(1) What is High Performance Computing?
– Modern computers and supercomputers. Their types and features.
– What is parallelization and how to use it?
– Models of programming on supercomputers.
(2) Problems that could be solved on HPC systems.
– Is my problem worth parallelization and how to determine that?
– Using a profiler tool.
– Sample Plato program.
(3) Parallelization with Message Passing Interface.
– The places in programs most often chosen for parallelization.
– How to find a place in a program to parallelize?
– How to parallelize.
– The most frequently used MPI operations.
– Let’s insert some code into the Plato program.
(4) Further practice with Plato.
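The outline does not say which criterion the tutorial uses to decide whether a problem is worth parallelizing. One common rule of thumb is Amdahl's law: if a profiler shows that a fraction f of the serial run time lies in code that can be parallelized, the speedup on p processes is at most 1 / ((1 - f) + f / p). The Python sketch below assumes this criterion. The function names and the fractions are illustrative and are not measurements of the Plato program.

```python
def amdahl_speedup(parallel_fraction: float, processes: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the run time scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processes < 1:
        raise ValueError("processes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processes)


def amdahl_limit(parallel_fraction: float) -> float:
    """Speedup limit for an unlimited number of processes: 1 / (1 - f)."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # Hypothetical profiler result: 90% of the time is spent in one hot loop.
    f = 0.9
    for p in (2, 8, 64):
        print(p, round(amdahl_speedup(f, p), 2))
    print("limit", round(amdahl_limit(f), 2))
```

If the profiler shows that only a small share of the run time is in parallelizable code, the limit stays close to 1 and the effort of adding MPI calls is unlikely to pay off.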
July 14, 2006
(IV) Computational modeling of biological macromolecules.
Modeling the structure and function of biological macromolecules is among the most resource-intensive problems in bioinformatics. Therefore, high-throughput computations are used intensively to solve such problems. This part of the lecture course will briefly cover the application of computer algorithms and programs to the analysis of the structure and function of genetic macromolecules.
Lecture 6. Inhibitors of protein-protein interactions as lead compounds for new drugs generation. Prof. Alexis Ivanov, V.N. Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow, Russia
Protein-protein interactions represent a new and extremely attractive class of molecular targets for creating an essentially new generation of drugs. The reason is that the contact areas of protein molecules in complexes are highly conserved with respect to mutational changes and, hence, the probability of mutational drug resistance is low for drugs targeting these areas.
The authors' laboratory works on computer-aided design and experimental testing of inhibitors of protein-protein interactions. The computational methods include 3D molecular modeling, molecular mechanics, molecular dynamics simulation, molecular docking, analysis of intermolecular interactions, virtual alanine screening, molecular database mining, de novo design, etc. The basic experimental approach is in vitro analysis of intermolecular interactions using the Biacore-3000 optical biosensor, which exploits the effect of surface plasmon resonance. Particular examples of approaches and results will be presented, based on the study of the bacterial L-asparaginase tetramer and of inhibitors of HIV-1 protease dimerization.
Lecture 7. Transcription and translation regulations of amino acid metabolism genes in Actinobacteria and intron-containing genes in chloroplasts of algae and plants. Prof. Vassily Lyubetsky, Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
Formation of alternative structures in mRNA in response to external stimuli, either direct or mediated by proteins or other RNAs, is a major mechanism of regulation of gene expression in bacteria. This mechanism has been studied in detail using experimental and computational approaches in proteobacteria and Firmicutes, but not in other groups of bacteria. Comparative analysis of amino acid biosynthesis operons in Actinobacteria resulted in identification of conserved regions upstream of several operons. Classical attenuators were predicted upstream of trp operons in Corynebacterium spp. and Streptomyces spp., and trpS and leuS genes in some Streptomyces spp. Candidate leader peptides with terminators were observed upstream of ilvB genes in Corynebacterium spp., Mycobacterium spp. and Streptomyces spp. Candidate leader peptides without obvious terminators were found upstream of cys operons in Mycobacterium spp. and several other species. A conserved pseudoknot (named LEU element) was identified upstream of leuA operons in most Actinobacteria. Finally, T-boxes likely involved in the regulation of translation initiation were observed upstream of ileS genes from several Actinobacteria.
The metabolism of tryptophan, cysteine and leucine in Actinobacteria seems to be regulated on the RNA level. In some cases the mechanism is classical attenuation, but in many cases some components of attenuators are missing. The most interesting case seems to be the leuA operon preceded by the LEU element that may fold into a conserved pseudoknot or an alternative structure. A LEU element has been observed in a transposase gene from Bifidobacterium longum, but it is not conserved in genes encoding closely related transposases despite a very high level of protein similarity. One possibility is that the regulatory region of the leuA has been co-opted from some element involved in transposition. Analysis of phylogenetic patterns allowed for identification of ML1624 of M. leprae and its orthologs as the candidate regulatory proteins that may bind to the LEU element. T-boxes upstream of the ileS genes are unusual, as their regulatory mechanism seems to be inhibition of translation initiation via a hairpin sequestering the Shine-Dalgarno box.
A short description of our original algorithms for finding conserved protein-RNA binding sites will be provided. One of these algorithms is applied to the analysis of chloroplast genes. Candidate protein-RNA binding sites were detected upstream of the atpF, petB, clpP, psaA, psbA and psbB genes in many chloroplasts of algae and plants. We surmise that some of these sites are involved in suppressing translation until splicing is completed.
The lecture includes results from two original publications and describes several novel algorithms in bioinformatics.
(V) High performance computing in systems biology: analysis of complex biological processes and data.
Analysis of gene and metabolic networks is one of the newest fields of bioinformatics, formed over the last 10 years. Researchers in this field need to handle a tremendous volume of molecular genetic data (genes, proteins, metabolites) and at the same time take into account the various interactions between these objects. This increases the number of parameters describing the behavior of gene networks and requires large computational resources for data processing and modeling. Storing these data and providing quick access to them is also an important problem.
Lecture 8. Gene expression patterns: methods for visualization, processing, and quantification. Dr. Konstantin Kolzov, St. Petersburg State Polytechnic University, St. Petersburg, Russia
High-quality, high-resolution images of gene expression patterns have become available to developmental biology thanks to confocal scanning microscopy. Extracting quantitative information from them is important for gaining insight into the underlying regulation, constructing mathematical models, and planning new experiments. We introduce a new image processing software package, ProStak, integrated into a distributed computing environment. ProStak includes all operations needed to extract quantitative information from 2D and 3D biological images. The chain of processing steps can be constructed visually in a graphical user interface, which provides a convenient environment for digital image processing for all groups of scientists: beginners, non-programmers, and experts for whom the speed of obtaining results is critical. All processing methods can also be accessed through the command line interface, as well as through shared and static libraries. This combination of features distinguishes ProStak from other image processing packages such as the commercial systems Matlab and VisiQuest and the freely available SIVIL, SCIRun, and TiViPe.
Seminar 3. Transcription and translation regulations of amino acid metabolism genes in Actinobacteria and intron-containing genes in chloroplasts of algae and plants (accompanying Lecture 7). Dr. Alexander Seliverstov, Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
July 15, 2006
Session of presentations by young scientists.
Lecture 9. Distributed applications, web services, tools and GRID infrastructures for bioinformatics. Dr. Luciano Milanesi, National Research Council – Institute of Biomedical Technology, Italy
Due to the increasing number of nucleotide and protein sequences produced by high-throughput techniques, which have to be analyzed by bioinformatics tools, it will be necessary to increase the current computing resources. Therefore, to face these new challenges successfully, it will be necessary to develop dedicated supercomputers, parallel computers based on cluster technologies, and high performance distributed platforms such as GRID.
The next generation of GRID infrastructures aims to implement a distributed computing model in which easy access to large, geographically distributed computing and data management resources is provided to large multi/inter-disciplinary Virtual Organizations (VO) made up of both research and user entities.
Indeed, computational and data Grids are “de facto” considered the way to realize the concept of virtual places where scientists and researchers work together to solve complex problems in bioinformatics, regardless of geographic and organizational boundaries.
In this respect, Grid computing heralds another technological and societal revolution in high performance distributed computing, as the World Wide Web has done over the last ten years for the meaning and availability of global information. The aim is to operate this widely distributed computing environment as a uniform service, which looks after resource management, exploitation, and security independently of individual technology choices.
A general overview will be given of GRID technologies and computer clusters used to run distributed bioinformatics applications for data mining, gene discovery, and DNA and protein sequence similarity searching.
Seminar 4. The models of adaptive dynamics as tools for studying of neutral molecular evolution. Dr. Yury Bukin, Limnological Institute SB RAS, Irkutsk, Russia.
Closing Ceremony