VOROBIEV D.G., PONOMARENKO J.V., PODKOLODNAYA O.A.
Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, 10 Lavrentieva ave., Novosibirsk, 630090 Russia
Keywords: database, transcription factor, binding sites, sequences, alignment
Most molecular processes in the cell that involve the storage, transmission, and realization of hereditary information are controlled by relatively short regions of the DNA molecule, the so-called functional sites. A large volume of data on their location, structure, and function has so far been accumulated in databanks such as EMBL and TRANSFAC. However, these databases contain data on individual sites only, and no databases of site samples are available. We therefore believe that developing a specialized database that supplies data for recognition methods is a timely objective.
We have developed the database SAMPLES, which compiles sets of binding sites for 42 transcription factors using the information contained in the TRANSFAC, TRRD, and EMBL databases. Homologous sequences were included if they were extracted from different EMBL entries. The names of the sites represented in SAMPLES, their synonyms, the number of sequences in each set with and without 100% homologies, and the characteristics of the corresponding footprints are listed in the Table. Locations of the sites were taken from TRANSFAC and TRRD; that is, only sites whose location was determined experimentally were taken into account. The sites were aligned relative to the center of the footprints or TRANSFAC sequence elements so that the total length of each sequence was 120 bp.
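The centering step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build SAMPLES: the function name `centered_window`, the 0-based end-exclusive footprint coordinates, and the padding with `N` when the window runs past the end of the source sequence are all assumptions made for the example.

```python
def centered_window(sequence: str, fp_start: int, fp_end: int, width: int = 120) -> str:
    """Return a `width`-bp window of `sequence` centered on a footprint.

    fp_start and fp_end are 0-based, end-exclusive footprint coordinates
    (an assumed convention). Positions that fall outside the sequence are
    padded with 'N' so that the result always has length `width`.
    """
    center = (fp_start + fp_end) // 2
    left = center - width // 2
    right = left + width
    pad_left = "N" * max(0, -left)
    pad_right = "N" * max(0, right - len(sequence))
    core = sequence[max(0, left):min(len(sequence), right)]
    return pad_left + core + pad_right


# Hypothetical example: a 7-bp footprint embedded in a longer sequence.
example = "A" * 100 + "TGACTCA" + "C" * 100
window = centered_window(example, 100, 107)
```

With this convention every sequence in a set has the same length and the footprint center falls at the same column, which is what makes position-wise comparison across the set possible.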
Table. The list of DNA functional sites compiled in the database SAMPLES.
No. | Factor name | Synonymous factor names | Sequences in set, with 100% homologies | Sequences in set, without 100% homologies | Average size of footprint or TRANSFAC sequence element, bp
--- | --- | --- | --- | --- | ---
1 | AP-1 | PEA1; (Jun)2; AP1; Fos/Jun; yAP1; PAR1; PDR4; SNQ3 | 74 | 69 | 14.9
2 | c-Fos | p55(c-fos) | 21 | 19 | 13.9
3 | c-Jun | JunA; p39; p39 c-jun | 32 | 28 | 14.6
4 | NF-E2 | NFE2; NF-E2 p45; nuclear factor erythroid 2 p45 | 12 | 12 | 10.7
5 | CRE-BP1 | XBP4; ATF-2; HB16; TREB-7 | 26 | 22 | 12.7
6 | ATF | yATF; ATF-1; ATF-2; ATF-3; ATF-4; ATF-5; ATF-6; ATF-7; ATF-8; ATF-a; ATF-adelta; 75 kDa, 77 kDa; TREB-36; ATF/CREB; ATF-3deltaZIP | 28 | 25 | 12.9
7 | CREB | ATF-47; CREB-341; CREB-B; CREBalpha; CREB-2 | 46 | 37 | 13.6
8 | C/EBP | slbo; slow border cells; DmC/EBP; AcC/EBP; C/EBPalpha; CBP; EBP20; BPc | 126 | 108 | 19.1
9 | NF-IL6 | C/EBPbeta; LAP1; NF-M; AGP/EBP; ANF-2; CRP2; H-APF-2; IL-6DBP; LAP | 23 | 21 | 19.9
10 | MyoD | XMyoD; Myf-3; MEF1; MyoD1; CMD1 | 17 | 16 | 18.5
11 | E2F | EIIF; E2F+E4; E2F-BF; E2F-I; E2F-1; E2F-2; E2F-3; E2F-4; E2F-5 | 12 | 9 | 13.2
12 | USF | UEF; MLTF; SpF1; gamma-factor; yMRF; MRF; pf51; USF43; USF1; eUSF | 28 | 25 | 14.1
13 | NF-1 | NF-I; CTF; TGGCA-binding protein; RPF-A (hamster); NF1/L; NF-1A1.1 (chick); NF-I/L; NF-1/Red1; RPF-B; NFI-B1; NF-1B1 (chick) | 101 | 101 | 17.2
14 | RF-X | EF-C; RFX; RFX1; enhancer factor C | 12 | 12 | 17.0
15 | CP1 | NF-Y; CBF; alpha-CP1 | 51 | 33 | 17.2
16 | ER | estrogen receptor | 32 | 25 | 17.5
17 | GR | GCR; glucocorticoid receptor; GR-DBD; GR alpha; GR beta | 64 | 54 | 12.6
18 | PR | progesterone receptor; PR B; progesterone receptor form B; PR A | 23 | 20 | 11.7
19 | RAR | retinoic acid receptor; RAR-alpha1; RAR-alpha1; RAR-alpha2; RAR-beta; RAR-beta1; RAR-beta2; RAR-beta3; RAR-beta4; RAR-gamma; RAR-gamma1; RAR-gamma2 | 16 | 16 | 24.6
20 | RXR | RXR-alpha; retinoid X receptor alpha; RXR-beta; H2RIIBP; RXR-gamma; RXR-beta2 | 21 | 21 | 20.9
21 | T3R | thyroid hormone receptor; c-ErbA; T3R-alpha; c-ErbAalpha; T3R-beta; betac-ErbA; c-ErbA-beta; T3R-beta1; T3R-alpha1; T3R-alpha2; c-ErbA-T; ear-7-2; T3R-beta2 | 22 | 21 | 20.5
22 | COUP | COUP-TF; ear3; COUP-alpha | 18 | 17 | 20.6
23 | GATA-1 | NF-E1; NF-E1a; Eryf1; EF1; EFgamma; Eryf-1; GF-1 | 102 | 76 | 16.1
24 | Sp1 | | 197 | 176 | 13.1
25 | YY1 | NF-E1; UCR; UCRBP; CF1; CSBP1; delta; F-ACT1 | 27 | 27 | 13.6
26 | GAGA | Trl; Trithorax-like | 7 | 7 | 16.0
27 | GAL4 | | 16 | 16 | 14.7
28 | EN | Engrailed; En-1; Engrailed 1; Gg-en.1; Hu-en.1 (human); Mo-en.1 (mouse); En-2; Engrailed 2; Gg-en.2; Hu-en.2 (human); Mo-en.2 (mouse) | 12 | 12 | 11.4
29 | HNF-1 | HNF-1A; APF; HNF1; HNF-1alpha; LF-B1 | 42 | 38 | 21.0
30 | TTF-1 | T/EBP; thyroid transcription factor 1; thyroid nuclear factor 1 | 7 | 7 | 18.0
31 | OCT | Oct-1; Oct-1A; Oct-2.1; Oct-2; Oct-4; Oct-5; Oct-8; Oct-9; Oct-2B; oct-B2; oct-B3; Oct-2C; Oct-1B; Oct-1C; Oct-2.1; Oct-2.3; Oct-2.4; Oct-2.6; Oct-2.7; Oct-2.8 | 101 | 73 | 14.9
32 | HNF-3 | HNF3; HNF-3A; HNF-3alpha; HNF-5 | 15 | 10 | 24.6
33 | HSF | HSTF; HTF; HSF1; mHSF1; heat shock transcription factor 1 | 7 | 7 | 27.7
34 | c-Myb | | 19 | 19 | 18.9
35 | Ets | c-Ets-1; p54; Ets1; c-Ets-2; Ets2; Ets-2; c-Ets-2 58-64; p58-64c-Ets-2; v-Ets; v-Myb/v-Ets | 16 | 15 | 20.9
36 | IRF-1 | ISGF2 | 11 | 7 | 16.1
37 | NF-kappaB | p50/p65 | 39 | 36 | 13.9
38 | MEF-2 | MEF-2A; SL-2 | 12 | 12 | 17.3
39 | SRF | SRFx; p67; p67SRF; CArG-binding factor; CBF (3) | 29 | 29 | 15.7
40 | E2 | | 20 | 20 | 15.2
41 | TCF-1 | T cell factor 1; TCF-1alfa; TCF-1A; TCF-1B; TCF-1C | 6 | 6 | 14.7
42 | GATA-1 (from TRRD) | | 81 | 81 | 16.5
Fig. 1. Example of a SAMPLES card.
Fig. 2. Example of a card from the ALIGNED database.
The database SAMPLES has an EMBL-like format, is installed under SRS5, and is available via the Internet at URL: http://sgi.sscc.ru/srs5/. An example of a SAMPLES entry is shown in Fig. 1. The card includes the following fields: FI (short name of the set); NM (full name of the set, supplemented with an explanation of its biological meaning); DA and LU (dates of creation and last update); ST (formalized description of the sites of a given type, including the names of the site and the corresponding transcription factor); WW (hyperlink to the corresponding X-ray structure image); WA, WP, and WF (hyperlinks to the databases Matrix, Consensus, Features, and Profiles, which describe significant contextual, conformational, and physico-chemical properties of the sites). The description of each site sequence includes an identifier (ID); an accession number (AC); phylogenetic classification information (OS and OC); references to source databases (DR); and the boundary positions of the site in the sequence, its orientation, and the method, experimental or computer-assisted, used to identify the site (FT). The abbreviations EXP. (experimental) and GBS (Gibbs sampler) specify the method used.
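To show how a card with these two-letter field codes might be read by a program, here is a minimal parsing sketch. It assumes the usual EMBL flat-file conventions (a two-letter code, whitespace, then the value, with `//` closing the entry) and further assumes that each ID line opens a new site record; neither assumption is confirmed by the description above. The card text in `EXAMPLE_CARD` is invented for illustration and does not reproduce a real SAMPLES entry.

```python
from collections import defaultdict

# Invented placeholder card; all values are hypothetical.
EXAMPLE_CARD = """\
FI   AP1
NM   AP-1 binding sites
ID   SITE0001
AC   X00000
FT   55..68 EXP.
ID   SITE0002
AC   X00001
FT   50..62 GBS
//
"""


def parse_card(text: str) -> dict:
    """Split an EMBL-like card into set-level fields and per-site records."""
    header = defaultdict(list)
    sites = []
    current = header
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the entry terminator
        code, _, value = line.partition(" ")
        if code == "ID":
            # Assumed: every ID line starts the description of a new site.
            current = defaultdict(list)
            sites.append(current)
        current[code].append(value.strip())
    return {"header": dict(header), "sites": [dict(s) for s in sites]}


card = parse_card(EXAMPLE_CARD)
```

Values are kept as lists because a field code can repeat within one record in EMBL-style files.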
To reveal the contextual features of the sites, the sequences from SAMPLES were aligned by a modified version of the Gibbs sampler method of local multiple alignment (Lawrence et al., 1993). To increase the accuracy of the alignment, only the footprint sequences extended by three additional nucleotides on either side were used. The results of this alignment are compiled in the database ALIGNED (URL: http://sgi.sscc.ru/srs5/). The format of the database ALIGNED is similar to that of SAMPLES (Fig. 2).
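For orientation, the sketch below shows the basic site sampler in the spirit of Lawrence et al. (1993), together with a helper that extends a footprint by three nucleotides on each side. It is not the modified method used for ALIGNED, whose modifications are not described here. The function names, the pseudocount value, the fixed iteration count, and the restriction to at least two sequences containing only A, C, G, and T are assumptions of this example.

```python
import random


def footprint_with_flanks(sequence, fp_start, fp_end, flank=3):
    """Return the footprint (0-based, end-exclusive) plus `flank` nt per side."""
    return sequence[max(0, fp_start - flank):fp_end + flank]


def gibbs_site_sampler(seqs, w, iters=200, seed=0, pseudo=1.0):
    """Basic site sampler: one motif occurrence of width w per sequence.

    Returns a list of 0-based start offsets, one per sequence.
    """
    rng = random.Random(seed)
    alphabet = "ACGT"
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        for i, seq in enumerate(seqs):
            # Motif and background counts from all sequences except sequence i.
            counts = [{a: pseudo for a in alphabet} for _ in range(w)]
            background = {a: pseudo for a in alphabet}
            for j, other in enumerate(seqs):
                if j == i:
                    continue
                for k, ch in enumerate(other):
                    if starts[j] <= k < starts[j] + w:
                        counts[k - starts[j]][ch] += 1
                    else:
                        background[ch] += 1
            bg_total = sum(background.values())
            # Weight each candidate start by its motif/background likelihood ratio.
            weights = []
            for s in range(len(seq) - w + 1):
                ratio = 1.0
                for k in range(w):
                    q = counts[k][seq[s + k]] / sum(counts[k].values())
                    p = background[seq[s + k]] / bg_total
                    ratio *= q / p
                weights.append(ratio)
            # Sample a new start for sequence i in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the sampler is stochastic, different seeds can give different alignments; in practice several runs are usually compared.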
The databases SAMPLES and ALIGNED will be developed further by increasing both the number of functional sites covered and the number of sequences in each set of the sites already described. In addition, we plan to supplement SAMPLES with a tool for automatic extraction of the necessary data from EMBL, TRRD, and other relevant databases via the query language MGL (at the moment, the SAMPLES database is filled manually). To resolve problems with homologous sequences, we plan to include in the database interface an on-line tool for eliminating homologs. We also plan to develop tools that allow Internet users to submit their own data on-line, align them automatically, and generate knowledge (programs for site recognition).
We thank Grigoriy Kolesov for kindly providing the alignment program.
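The simplest form of homolog elimination, removing sequences that are 100% identical within a set, can be sketched as below. This is only an assumed reading of what the planned on-line tool would do; the function name `drop_identical` and the choice to compare sequences case-insensitively and keep the first occurrence are illustrative.

```python
def drop_identical(records):
    """Keep the first record for each distinct sequence.

    `records` is an iterable of (identifier, sequence) pairs; sequences are
    compared case-insensitively and must match over their full length.
    """
    seen = set()
    unique = []
    for ident, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((ident, seq))
    return unique


# Hypothetical identifiers and sequences.
records = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
filtered = drop_identical(records)
```

Comparing `len(records)` with `len(filtered)` gives the two counts reported per set in the Table, with and without 100% homologies.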
References
- Lawrence C. E., Altschul S. F., Boguski M. S., Liu J. S., Neuwald A. F., Wootton J. C. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, V. 262, p. 208-214.