SAMPLES AND ALIGNED: DATABASES FOR FUNCTIONAL SITE SEQUENCES

VOROBIEV D.G., PONOMARENKO J.V., PODKOLODNAYA O.A.

Institute of Cytology and Genetics (Siberian Branch of the Russian Academy of Sciences), 10 Lavrentieva ave., Novosibirsk, 630090 Russia

Keywords: database, transcription factor, binding sites, sequences, alignment

 

Most molecular processes in the cell connected with the storage, transmission, and realization of hereditary information are controlled by relatively short regions of the DNA molecule, the so-called functional sites. A large volume of data on their location, structure, and function has so far been accumulated in databanks such as EMBL and TRANSFAC. However, these databases contain data only on individual sites, whereas no databases of site samples are available. We therefore believe that developing a specialized database aimed at providing data for recognition methods is a timely objective.

We have developed the database SAMPLES, which compiles 42 sets of transcription factor binding sites using the information contained in the TRANSFAC, TRRD, and EMBL databases. Homologous sequences were included if they were extracted from different EMBL entries. The names of the sites represented in SAMPLES, their synonyms, the number of sequences in each set with and without 100% homologies, and the characteristics of the corresponding footprints are listed in the Table. Site locations were taken from TRANSFAC and TRRD; that is, only sites whose location was determined experimentally were taken into account. The sites were aligned relative to the center of the footprint or TRANSFAC sequence element so that the total length of each sequence was 120 bp.
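The centering step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build SAMPLES: the function name `centered_window`, the 0-based end-exclusive coordinates, and the padding with `N` when the source sequence is too short are all assumptions made for the example.

```python
def centered_window(sequence, fp_start, fp_end, length=120, pad="N"):
    """Return a `length`-bp window of `sequence` centered on a footprint.

    fp_start and fp_end are 0-based, end-exclusive footprint coordinates
    (an assumption of this sketch). Positions that fall outside the source
    sequence are filled with `pad` (also an assumption).
    """
    if not 0 <= fp_start < fp_end <= len(sequence):
        raise ValueError("footprint must lie inside the sequence")
    center = (fp_start + fp_end) // 2      # midpoint of the footprint
    win_start = center - length // 2       # may be negative near the 5' end
    win_end = win_start + length           # may run past the 3' end
    left_pad = pad * max(0, -win_start)
    right_pad = pad * max(0, win_end - len(sequence))
    core = sequence[max(0, win_start):min(len(sequence), win_end)]
    return left_pad + core + right_pad
```

With the default length, the footprint midpoint always lands at index 60 of the returned window, so windows from different entries can be compared position by position.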

 

Table. The list of DNA functional sites compiled in the database SAMPLES.

| No. | Factor name | Synonymous factor names | Sequences in set, with 100% homologies | Sequences in set, without 100% homologies | Average size of footprint or TRANSFAC sequence element, bp |
|---|---|---|---|---|---|
| 1 | AP-1 | PEA1; (Jun)2; AP1; Fos/Jun; yAP1; PAR1; PDR4; SNQ3 | 74 | 69 | 14.9 |
| 2 | c-Fos | p55(c-fos) | 21 | 19 | 13.9 |
| 3 | c-Jun | JunA; p39; p39 c-jun | 32 | 28 | 14.6 |
| 4 | NF-E2 | NFE2; NF-E2 p45; nuclear factor erythroid 2 p45 | 12 | 12 | 10.7 |
| 5 | CRE-BP1 | XBP4; ATF-2; HB16; TREB-7 | 26 | 22 | 12.7 |
| 6 | ATF | yATF; ATF-1; ATF-2; ATF-3; ATF-4; ATF-5; ATF-6; ATF-7; ATF-8; ATF-a; ATF-adelta; 75 kDa, 77 kDa; TREB-36; ATF/CREB; ATF-3deltaZIP | 28 | 25 | 12.9 |
| 7 | CREB | ATF-47; CREB-341; CREB-B; CREBalpha; CREB-2 | 46 | 37 | 13.6 |
| 8 | C/EBP | slbo; slow border cells; DmC/EBP; AcC/EBP; C/EBPalpha; CBP; EBP20; BPc | 126 | 108 | 19.1 |
| 9 | NF-IL6 | C/EBPbeta; LAP1; NF-M; AGP/EBP; ANF-2; CRP2; H-APF-2; IL-6DBP; LAP | 23 | 21 | 19.9 |
| 10 | MyoD | XMyoD; Myf-3; MEF1; MyoD1; CMD1 | 17 | 16 | 18.5 |
| 11 | E2F | EIIF; E2F+E4; E2F-BF; E2F-I; E2F-1; E2F-2; E2F-3; E2F-4; E2F-5 | 12 | 9 | 13.2 |
| 12 | USF | UEF; MLTF; SpF1; gamma-factor; yMRF; MRF; pf51; USF43; USF1; eUSF | 28 | 25 | 14.1 |
| 13 | NF-1 | NF-I; CTF; TGGCA-binding protein; RPF-A (hamster); NF1/L; NF-1A1.1 (chick); NF-I/L; NF-1/Red1; RPF-B; NFI-B1; NF-1B1 (chick) | 101 | 101 | 17.2 |
| 14 | RF-X | EF-C; RFX; RFX1; enhancer factor C | 12 | 12 | 17.0 |
| 15 | CP1 | NF-Y; CBF; alpha-CP1 | 51 | 33 | 17.2 |
| 16 | ER | estrogen receptor | 32 | 25 | 17.5 |
| 17 | GR | GCR; glucocorticoid receptor; GR-DBD; GR alpha; GR beta | 64 | 54 | 12.6 |
| 18 | PR | progesterone receptor; PR B; progesterone receptor form B; PR A | 23 | 20 | 11.7 |
| 19 | RAR | retinoic acid receptor; RAR-alpha1; RAR-alpha1; RAR-alpha2; RAR-beta; RAR-beta1; RAR-beta2; RAR-beta3; RAR-beta4; RAR-gamma; RAR-gamma1; RAR-gamma2 | 16 | 16 | 24.6 |
| 20 | RXR | RXR-alpha; retinoid X receptor alpha; RXR-beta; H2RIIBP; RXR-gamma; RXR-beta2 | 21 | 21 | 20.9 |
| 21 | T3R | thyroid hormone receptor; c-ErbA; T3R-alpha; c-ErbAalpha; T3R-beta; betac-ErbA; c-ErbA-beta; T3R-beta1; T3R-alpha1; T3R-alpha2; c-ErbA-T; ear-7-2; T3R-beta2 | 22 | 21 | 20.5 |
| 22 | COUP | COUP-TF; ear3; COUP-alpha | 18 | 17 | 20.6 |
| 23 | GATA-1 | NF-E1; NF-E1a; Eryf1; EF1; EFgamma; Eryf-1; GF-1 | 102 | 76 | 16.1 |
| 24 | Sp1 | | 197 | 176 | 13.1 |
| 25 | YY1 | NF-E1; UCR; UCRBP; CF1; CSBP1; d; F-ACT1 | 27 | 27 | 13.6 |
| 26 | GAGA | Trl; Trithorax-like | 7 | 7 | 16.0 |
| 27 | GAL4 | | 16 | 16 | 14.7 |
| 28 | EN | Engrailed; En-1; Engrailed 1; Gg-en.1; Hu-en.1 (human); Mo-en.1 (mouse); En-2; Engrailed 2; Gg-en.2; Hu-en.2 (human); Mo-en.2 (mouse) | 12 | 12 | 11.4 |
| 29 | HNF-1 | HNF-1A; APF; HNF1; HNF-1alpha; LF-B1 | 42 | 38 | 21.0 |
| 30 | TTF-1 | T/EBP; thyroid transcription factor 1; thyroid nuclear factor 1 | 7 | 7 | 18.0 |
| 31 | OCT | Oct-1; Oct-1A; Oct-2.1; Oct-2; Oct-4; Oct-5; Oct-8; Oct-9; Oct-2B; oct-B2; oct-B3; Oct-2C; Oct-1B; Oct-1C; Oct-2.1; Oct-2.3; Oct-2.4; Oct-2.6; Oct-2.7; Oct-2.8 | 101 | 73 | 14.9 |
| 32 | HNF-3 | HNF3; HNF-3A; HNF-3alpha; HNF-5 | 15 | 10 | 24.6 |
| 33 | HSF | HSTF; HTF; HSF1; mHSF1; heat shock transcription factor 1 | 7 | 7 | 27.7 |
| 34 | c-Myb | | 19 | 19 | 18.9 |
| 35 | Ets | c-Ets-1; p54; Ets1; c-Ets-2; Ets2; Ets-2; c-Ets-2 58-64; p58-64c-Ets-2; v-Ets; v-Myb/v-Ets | 16 | 15 | 20.9 |
| 36 | IRF-1 | ISGF2 | 11 | 7 | 16.1 |
| 37 | NF-kappaB | p50/p65 | 39 | 36 | 13.9 |
| 38 | MEF-2 | MEF-2A; SL-2 | 12 | 12 | 17.3 |
| 39 | SRF | SRFx; p67; p67SRF; CArG-binding factor; CBF (3) | 29 | 29 | 15.7 |
| 40 | E2 | | 20 | 20 | 15.2 |
| 41 | TCF-1 | T cell factor 1; TCF-1alpha; TCF-1A; TCF-1B; TCF-1C | 6 | 6 | 14.7 |
| 42 | GATA-1 (from TRRD) | | 81 | 81 | 16.5 |

 

Fig. 1. Example of a SAMPLES card.

 

Fig. 2. Example of a card from the ALIGNED database.

 

The database SAMPLES has an EMBL-like format, is maintained under the SRS5 system, and is available via the Internet at URL: http://sgi.sscc.ru/srs5/. An example of a SAMPLES entry is shown in Fig. 1. The card includes the following fields: FI (short name of the set); NM (full name of the set supplemented with an explanation of its biological meaning); DA and LU (dates of creation and last update); ST (formalized description of the sites of a given type, including the names of the site and the corresponding transcription factor); WW (hyperlink to the corresponding X-ray structure image); WA, WP, and WF (hyperlinks to the databases Matrix, Consensus, Features, and Profiles, which describe significant contextual, conformational, and physico-chemical properties of the sites). The description of each site sequence includes an identifier (ID); accession number (AC); phylogenetic classification information (OS and OC); references to source databases (DR); and the boundary positions of the site in the sequence, its orientation, and the method, experimental or computer-assisted, used to identify the site (FT). The abbreviations EXP. (experimental) and GBS (Gibbs sampler) specify the method used.
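As a rough illustration of how a card with two-letter field codes might be read programmatically, the sketch below parses EMBL-style flat-file text. It assumes only the general EMBL conventions (a two-letter line code followed by data, and `//` as entry terminator); the exact SAMPLES layout is shown in Fig. 1 and is not reproduced here, so the sample values in the usage note are hypothetical. Collecting all lines of one code into a list is a simplification that does not keep the per-site grouping of ID, AC, and FT lines.

```python
def parse_embl_like(text):
    """Split EMBL-like flat-file text into entries.

    Each entry becomes a dict mapping a two-letter line code to the list
    of data strings found on lines carrying that code, in file order.
    """
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("//"):          # entry terminator
            if current:
                entries.append(current)
            current = {}
            continue
        if len(line) < 2 or line[:2].isspace():
            continue                       # skip blank or uncoded lines
        code, data = line[:2], line[2:].strip()
        current.setdefault(code, []).append(data)
    if current:                            # tolerate a missing final "//"
        entries.append(current)
    return entries
```

For a hypothetical card containing the lines `FI   AP1`, `ID   SITE0001`, `ID   SITE0002`, and `//`, the function returns one entry whose `"ID"` key holds both identifiers.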

To reveal the contextual features of the sites, the sequences from SAMPLES were aligned by a modified version of the Gibbs sampler method of local multiple alignment (Lawrence et al., 1993). To increase the accuracy of the alignment, only the footprint sequences with three additional nucleotides on either side were used. The results of this alignment are compiled in the database ALIGNED (URL: http://sgi.sscc.ru/srs5/). The format of ALIGNED is similar to that of SAMPLES (Fig. 2).
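The two steps in this paragraph can be illustrated with a short sketch. The first function cuts out a footprint with three flanking nucleotides on each side. The second is a basic Gibbs site sampler in the spirit of Lawrence et al. (1993), reduced to its simplest form: one site per sequence, fixed motif width, and uniform pseudocounts. It is not the modified program used to build ALIGNED, whose details are not given here. The function names, parameters, and coordinate convention are assumptions, and the input is expected to contain only the letters A, C, G, and T.

```python
import random

def flanked_footprint(sequence, fp_start, fp_end, flank=3):
    """Footprint (0-based, end-exclusive) plus `flank` nucleotides per side."""
    return sequence[max(0, fp_start - flank):min(len(sequence), fp_end + flank)]

def gibbs_site_sampler(seqs, width, iterations=200, seed=0):
    """Sample one motif start position per sequence (basic site sampler)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    if any(len(s) < width for s in seqs):
        raise ValueError("every sequence must be at least `width` long")
    # Background base frequencies over all sequences, with a +1 pseudocount.
    total = sum(len(s) for s in seqs)
    background = {b: (sum(s.count(b) for s in seqs) + 1) / (total + 4)
                  for b in alphabet}
    # Start from random site positions.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Position-specific counts from all current sites except sequence i.
            counts = [{b: 1.0 for b in alphabet} for _ in range(width)]
            for j, other in enumerate(seqs):
                if j != i:
                    for k, base in enumerate(other[starts[j]:starts[j] + width]):
                        counts[k][base] += 1
            norm = len(seqs) - 1 + 4       # sites used plus four pseudocounts
            # Weight each candidate start by the motif/background ratio.
            weights = []
            for pos in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    base = seq[pos + k]
                    w *= (counts[k][base] / norm) / background[base]
                weights.append(w)
            # Draw the new start for sequence i in proportion to the weights.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Because the sampler is stochastic, different seeds can yield different alignments; in practice several runs are usually compared before one alignment is accepted.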

The databases SAMPLES and ALIGNED will be developed further by increasing both the number of functional sites covered and the number of sequences in each set of the sites already described. In addition, we plan to supplement SAMPLES with a tool for automatic extraction of the necessary data from EMBL, TRRD, and other relevant databases via the query language MGL (at the moment, the SAMPLES database is filled manually). To resolve problems with homologous sequences, we plan to include an on-line tool for eliminating homologs in the database interface. We also plan to develop tools that allow Internet users to enter their own data on-line, align them automatically, and generate knowledge (programs for site recognition).
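One possible shape for the planned homolog-elimination tool is sketched below. It assumes that the sequences are equal-length windows (such as the 120-bp sequences in SAMPLES) and that homology is measured as the fraction of identical positions. Both the function names and the identity measure are assumptions, since the criterion the planned tool will use is not specified. With the default threshold of 1.0, only 100% identical sequences are removed.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def drop_homologs(seqs, threshold=1.0):
    """Keep a sequence only if its identity to every kept one is < threshold.

    Sequences are compared case-insensitively; the first occurrence wins.
    """
    kept = []
    for s in seqs:
        s = s.upper()
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept
```

For example, `drop_homologs(["ACGT", "acgt", "ACGA"])` keeps `"ACGT"` and `"ACGA"`, while lowering the threshold to 0.75 also removes `"ACGA"`.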

We thank Grigoriy Kolesov for kindly providing the alignment program.

References

  1. Lawrence C.E., Altschul S.F., Boguski M.S., Liu J.S., Neuwald A.F., Wootton J.C. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 1993, V. 262, p. 208-214.