GenomeTraFaC Help

Contents

Overview

Searching the database

Choosing different views

References

Troubleshooting/FAQs

Overview

GenomeTraFaC is a database of conserved regulatory elements obtained by systematically analyzing the orthologous genes of the highly similar mouse and human genomes. It focuses on the high-quality mRNA entries for mouse and human genes in the Reference Sequence (RefSeq) database of the NCBI.

Conserved potential cis-regulatory regions were identified with a computational pipeline built on an advanced version of our earlier TraFaC server. Having putative regulatory information for most well-annotated genes can also greatly facilitate analyses of groups of co-expressed or functionally related genes for shared, ortholog-conserved transcriptional machinery.

Using the TraFaC (Jegga et al., 2002), PipMaker (Schwartz et al., 2000) and MatInspector (Quandt et al., 1995) suite of programs, we have aligned and analyzed more than 12,000 human and mouse orthologous gene pairs that have a validated RefSeq ID in the NCBI Reference Sequence database (Pruitt et al., 2003). The genomic sequences, each with 40 kb of flanking sequence (upstream of the 5' end and downstream of the 3' end), were downloaded from the UCSC Genome Browser (Human May 2004 and March 2006 assemblies, and Mouse Aug 2005 and February 2006 assemblies). Sequences were aligned with the BlastZ algorithm of PipMaker, and transcription factor binding sites were identified with MatInspector, which uses a library of position weight matrices (PWMs) for the binding sites. The TraFaC server was then used to identify the common cis-elements within the evolutionarily conserved regions of the human-mouse sequence alignment.
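To make the 40 kb flanking step concrete, here is a minimal Python sketch of how a gene's coordinates could be extended on both sides before download. The function name and the 1-based, inclusive coordinate convention are assumptions made for illustration; this is not the pipeline code used to build GenomeTraFaC.

```python
FLANK_BP = 40_000  # 40 kb of flanking sequence on each side, as described above


def flanked_region(gene_start, gene_end, flank=FLANK_BP):
    """Return (start, end) of a gene extended by `flank` bp on each side.

    Assumption: coordinates are 1-based and inclusive. The start is
    clamped at 1 so the region never runs off the beginning of the
    chromosome. (A real pipeline would also clamp the end at the
    chromosome length, which is not modeled here.)
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    return max(1, gene_start - flank), gene_end + flank


# Hypothetical gene coordinates, for illustration only.
print(flanked_region(100_000, 120_000))  # (60000, 160000)
print(flanked_region(25_000, 30_000))    # (1, 70000)
```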

The GenomeTraFaC database holds cis-regulatory analysis results for more than 12,000 RefSeq-annotated human and mouse gene pairs. We are updating the database as new RefSeq orthologous gene pairs become available. If you are interested in a particular gene that has human and mouse RefSeq annotations (specifically, an "NM" accession number), please e-mail us the accession numbers, gene symbols, or sequences, and we will let you know when the results are available.

Searching the database

From the homepage of GenomeTraFaC, you have three options for searching the database:

Cis-element clusters within BlastZ alignments

To search for cis-element clusters within BlastZ aligned pairs, go to the GenomeTraFaC homepage and click Cis-element clusters within BlastZ Alignments:

From this screen, you can perform a basic search, or search by disease, gene ontology, pathway, gene family or custom group.

Basic search

1. From the drop-down menu at the top of the screen, select one of the following options, and enter related search terms in the text box:

2. Click Search:

A table of search results appears in the lower half of the screen.

3. Select the check box for each sequence you want to view:

4. Click Submit:

The BlastZ alignments page opens:

The first two columns show the sequence information for the human and mouse sequences.

The third column (Timestamp) indicates the date of entry.

From the last column (Action), you have two viewing options:

For more information about these and other views, see Choosing different views.

Search by disease, pathway, ontology, phenotype, gene family, or custom group

1. Go to the lower section of the search screen, and select the type of query you wish to perform.

2. Define your query in one of the following ways:

3. Verify that the correct query type is selected.

4. To process your query, click Search.

5. Follow steps 3 and 4 under Basic Search to select sequences for visualization.

Cis-element clusters between any gene pair

To search for cis-element clusters in a gene segment pair of your choice, complete the following steps:

1. On the GenomeTraFaC homepage, click Cis-element clusters shared between any gene pair:

2. Search for your first sequence by using the basic search options in the top half of the screen or the options for searching by disease, pathway, ontology, phenotype, gene family, or custom group in the bottom half of the screen.

A table of search results appears in the lower half of the screen.

3. Go to the table, and select your first sequence:

4. Click Submit:

5. Select your second sequence, repeating step 2 if necessary, and click Submit.

6. Use the TraFaC query page to modify any parameters you wish, as indicated below:

  1. Use the Sequence Filter to modify or change the sequence coordinates.
  2. Use the Matrix Filter to select which matrices to view and which to block; currently, up to four different matrices are supported.
  3. Use these options to adjust the image size and quality, to combine similar matrices and show them as a family, and to highlight regions of cis-element clusters within a selected base pair window.
    Note:
    If you want to set one or more individual matrix filters, be sure to clear the Combine same-family matrices check box.

7. When you are finished modifying parameters, click Submit.

An image depicting the shared transcription factor binding sites appears. We call this a TraFaC image. To learn more about this image and others, see Choosing different views.

Conserved cis-element scanner

To select one or more transcription factor binding sites and search all database genes for clusters containing the selected site(s), complete the following steps:

1. On the GenomeTraFaC homepage, click Conserved Cis-element Scanner:

2. Go to the top of the screen, and customize the search region if you wish. By default, the system searches the 10 kb region upstream of the first exon of all genes.

Example
Typing 15000 for (a), selecting Downstream for (b), and selecting Last Exon for (c) instructs the system to search the region 15000 base pairs downstream of the last exon of all genes.
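The three settings (region size, direction, and anchoring exon) can be read as a simple coordinate calculation. The Python sketch below is only an illustration under stated assumptions: the function and parameter names are hypothetical, coordinates are 1-based and inclusive, and the gene is assumed to lie on the forward strand. It does not describe how the scanner is implemented.

```python
def search_region(first_exon, last_exon, size=10_000,
                  direction="upstream", anchor="first"):
    """Return (start, end) of the region to scan for one gene.

    first_exon and last_exon are (start, end) tuples. Assumptions:
    1-based inclusive coordinates and a forward-strand gene, so
    "upstream" means smaller coordinates.
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError("direction must be 'upstream' or 'downstream'")
    if anchor not in ("first", "last"):
        raise ValueError("anchor must be 'first' or 'last'")

    exon_start, exon_end = first_exon if anchor == "first" else last_exon
    if direction == "upstream":
        # The region ends just before the anchoring exon begins.
        return max(1, exon_start - size), exon_start - 1
    # The region begins just after the anchoring exon ends.
    return exon_end + 1, exon_end + size


# Hypothetical exon coordinates, for illustration only.
first, last = (50_000, 50_200), (80_000, 80_500)
print(search_region(first, last))  # default: 10 kb upstream of the first exon
print(search_region(first, last, size=15_000,
                    direction="downstream", anchor="last"))
```

With these example exons, the default call yields (40000, 49999), and the settings from the example above (15000, Downstream, Last Exon) yield (80501, 95500).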

3. Go to the list of binding sites in the bottom section of the screen, and select one or more sites to include in your search:

4. Click Search:

The search results page appears:

  1. These columns identify the human and mouse sequences in which the selected binding sites were found.
  2. These columns indicate the location of each cis-cluster in the human and mouse sequences.
  3. This column lists all sites in the cis-cluster, including the sites you selected in step 3.
  4. Clicking this link takes you to a TraFaC image, or graph of the shared binding sites in the two sequences.
  5. Clicking this link takes you to a regulogram, or a cis-element hit density graph in the context of sequence similarity.
  6. By going to the first column of the table, selecting one or more cis-clusters and clicking Show Binding Site Positions, you can view the start and end positions of all binding sites in each cluster.
  7. By clicking Download, you can download the entire contents of the screen in Microsoft Excel format.
  8. By clicking Modify query, you can return to the search screen and modify your original search range and/or binding site selections; by clicking New Search, you can start over with a blank search screen.

Choosing different views

As you search the database, you can view the data in several different ways, depending on which search option you chose. Click the following views to learn more about them:

Local alignment view

The Local Alignment page consists of the following table, which summarizes the alignment results for the orthologous (mostly human and mouse) genomic sequences. It is based on the BlastZ sequence alignment uploaded to the GenomeTraFaC database.

  1. Clicking Show Hits at the top of the page displays the same table with additional columns that show the number of shared cis-elements, counted either by individual matrix or by matrix family. By "hits," we mean the transcription factor (TF) binding sites. The putative TF binding sites were obtained with a locally installed MatInspector, which uses the TRANSFAC database (a database of eukaryotic transcription factors, their genomic binding sites and DNA-binding profiles). The results are uploaded to the GenomeTraFaC database for a comparative graphical display on the transcription factor binding sites homology page.
  2. By going to the last column and clicking View, you will go to a Concise alignment view page. This page is the same as the local alignment view page but also has the shared TF binding sites displayed as number of "hits," family-wise or individual matrix-wise.

Concise alignment view

The Concise Alignment page consists of the following table, which is essentially the same as the Local alignment view page but has additional columns for the display of the shared TF binding sites ("hits"), family-wise or individual matrix-wise.

The numbers in the Hits columns indicate the number of shared TF binding sites in a window of 200 base pairs between the two sequences that are compared. By clicking these numbers, you will go to the TraFaC image (shared transcription factor binding sites image), a graphical display of the shared TF binding sites between the two sequences that are compared.
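As a rough illustration of what a per-window hit count means, the Python sketch below bins the start positions of shared binding sites into consecutive 200 bp windows. How GenomeTraFaC actually places its windows is not stated here, so the non-overlapping windows, the 1-based positions, and the function name are all assumptions.

```python
from collections import Counter


def hits_per_window(shared_site_starts, window=200):
    """Count shared binding sites in consecutive, non-overlapping windows.

    shared_site_starts: 1-based start positions (on one of the two
    sequences) of binding sites that are shared with the other sequence.
    Returns a dict mapping each window's 1-based start to its hit count.
    """
    counts = Counter()
    for pos in shared_site_starts:
        # Positions 1..200 fall in the window starting at 1,
        # positions 201..400 in the window starting at 201, and so on.
        window_start = ((pos - 1) // window) * window + 1
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical site positions, for illustration only.
print(hits_per_window([5, 150, 199, 201, 640]))  # {1: 3, 201: 1, 601: 1}
```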

TraFaC image

The shared transcription factor binding sites image, or TraFaC Image, indicates the TF binding sites occurring in both the sequences. Here is an example, along with numbered annotations explaining its parts:

  1. The two gray vertical bars are the two genes that are compared. The numbers represent the nucleotide positions with respect to the sequences used.
  2. The TF binding sites occurring in both the genes are highlighted as colored bars drawn across the two genes. Click the image to zoom in on a site of interest. The TraFaC image can be viewed based upon the individual matrices of the TF binding sites or the matrix families.
  3. The names of the TF matrices. Click a name to learn more about that matrix. Note that these links work only if you have an account with Genomatix (http://www.genomatix.de).
  4. A table describing the putative sites displayed in the image. For each site, the start and end positions are listed along with the sequence string.
  5. Click Show Only Parallel Sites to display "ordered hits." Ordered hits limit the shared cis-elements to those that are positionally conserved (that is, spaced in nearly the same way in both orthologous genes). This feature helps to filter out cluttered or complex regions. Cis-clusters whose constituent cis-elements occur in parallel in the orthologous genes frequently tend to be involved in regulatory function.
  6. Click Show Query Parameters to view or modify the query parameters.
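One way to picture "parallel" or ordered hits is as the largest set of shared sites that appear in the same order in both sequences. The Python sketch below finds such a set with a simple longest-increasing-chain search. This is a stand-in technique chosen for illustration; the criterion that GenomeTraFaC actually applies (including how it treats spacing) is not specified here, and the function name and input format are hypothetical.

```python
def parallel_sites(pairs):
    """Return the longest subset of shared sites that is collinear.

    pairs: list of (human_pos, mouse_pos) tuples, one per shared site.
    A subset is collinear when both positions strictly increase from
    one site to the next, so the connecting lines never cross.
    Uses a simple O(n^2) dynamic program.
    """
    ordered = sorted(pairs)
    n = len(ordered)
    if n == 0:
        return []

    best_len = [1] * n   # length of the best chain ending at site i
    prev = [-1] * n      # previous site in that chain, or -1
    for i in range(n):
        for j in range(i):
            if (ordered[j][0] < ordered[i][0]
                    and ordered[j][1] < ordered[i][1]
                    and best_len[j] + 1 > best_len[i]):
                best_len[i] = best_len[j] + 1
                prev[i] = j

    # Walk back from the end of the longest chain.
    i = max(range(n), key=best_len.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical positions: the site at (300, 900) crosses the others.
sites = [(100, 120), (300, 900), (450, 480), (700, 760)]
print(parallel_sites(sites))  # [(100, 120), (450, 480), (700, 760)]
```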

Regulogram

The Regulogram, or Cis-element Hit Density Image, depicts a moving-window average of the number of shared cis-elements occurring in phylogenetically conserved regions. Here is an example, along with numbered annotations explaining its parts:

  1. The grey horizontal bars are the nucleotide sequences of the two orthologous genes compared. The numbers represent the coordinates of the sequences used. The red bars are the exons.
  2. The green blocks plotted parallel to the genomic sequences are the repeat regions identified by RepeatMasker.
  3. The different-colored polygons stretching from one sequence to the other indicate the sequence similarity regions between the two genes.
  4. The Hits scale on the lower-left side refers to the number of shared cis-elements between the two sequences occurring in a sequence-conserved region.
  5. The TF BS Freq in the upper half of the left side refers to the frequencies of the binding sites in both the sequences separately.
  6. Percent Identical refers to the percent similarity between the two sequences.
  7. To switch between hits based on individual transcription factor binding site matrices and hits grouped by matrix family, select or clear Combine unordered same-family matrices, and click Refresh.
  8. To modify the default size of 850 X 412, type a different value for the width, and click Refresh.
  9. To zoom in for more clarity, select the radio button next to the Zoom drop-down menu, select a different value from the drop-down menu (by default, 10x magnification is selected), and click the image window.
  10. To look at the actual TF binding sites (constituent elements of hits), select the radio button next to the drop-down menu to the top-left of the regulogram, select a value other than the default window size of 200 bp if you wish, and click any point on the hits graph. The TraFaC image for this point is displayed.
  11. A newly added feature is the option to plot "ordered hits." Ordered hits limit the shared cis-elements to those that are positionally conserved (that is, spaced in nearly the same way in both orthologous genes). This feature helps to filter out cluttered or complex regions. Cis-clusters whose constituent cis-elements occur in parallel in the orthologous genes frequently tend to be involved in regulatory function.
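The moving-window average that the regulogram plots can be sketched in a few lines of Python. The input here is a list of hit counts per position (or per bin), and the window length is measured in list elements; both choices, and the function name, are assumptions for illustration rather than a description of how the image is generated.

```python
def moving_average(hit_counts, window=3):
    """Return the mean of each full window of `window` consecutive values.

    The result has len(hit_counts) - window + 1 entries; windows that
    would extend past either end of the list are not reported.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(hit_counts[i:i + window]) / window
        for i in range(len(hit_counts) - window + 1)
    ]


# Hypothetical hit counts along a conserved region.
smoothed = moving_average([0, 2, 4, 6, 2, 0], window=3)
print(smoothed[:3])  # [2.0, 4.0, 4.0]
```

Averaging over a window smooths isolated single hits and makes dense clusters of shared cis-elements stand out as peaks.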

References

Jegga, A.G., Sherwood, S.P., Carman, J.W., Pinski, A.T., Phillips, J.L., Pestian, J.P. and Aronow, B.J. (2002) Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res. 12, 1408-1417.

Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2003) NCBI Reference Sequence project: update and current status. Nucleic Acids Res. 31, 34-37.

Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995) MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23, 4878-4884.

Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. (2000) PipMaker-A web server for aligning two genomic DNA sequences. Genome Res. 10, 577-586.

Troubleshooting/FAQs

What kinds of queries and retrievals are possible in GenomeTraFaC?
GenomeTraFaC can be searched with an approved gene symbol (HGNC for human and MGI for mouse). The database currently has precomputed regulatory region analysis results for more than 12,000 Reference Sequence genes that have an "NM" mRNA entry for both human and mouse. Multiple entries (gene symbols or accession numbers) should be separated by commas.
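If you assemble long query lists in a script before pasting them into the search box, a small helper like the Python sketch below can produce a clean comma-separated string. The helper names are hypothetical and the symbols and accession number are only example inputs; the sketch does not reflect how GenomeTraFaC parses queries internally.

```python
def parse_query(text):
    """Split a comma-separated query into unique, trimmed entries.

    Empty entries are dropped and the original order is preserved.
    Case is left untouched, because human and mouse symbols differ
    in capitalization.
    """
    seen = set()
    entries = []
    for raw in text.split(","):
        entry = raw.strip()
        if entry and entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


def format_query(entries):
    """Join entries into the comma-separated form used for searching."""
    return ",".join(entries)


cleaned = parse_query("TP53, BRCA1 ,NM_000546,, TP53")
print(cleaned)                # ['TP53', 'BRCA1', 'NM_000546']
print(format_query(cleaned))  # TP53,BRCA1,NM_000546
```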

I have entered a RefSeq accession number but GenomeTraFaC returned no results.
Because we used a single mRNA accession number to download the genomic sequence for each gene, a search with a different accession number for the same gene (for example, that of an alternate transcript) may return nothing. Try using the approved gene symbol instead.

I have entered gene symbol p53 against sequence name but no entries were found.
The HGNC/MGI approved symbol for p53 in human and mouse is TP53. As mentioned earlier, in the current version, the query supports approved symbols only. However, in the next version, you should be able to search with the aliases too. To get an approved symbol, use the NCBI's LocusLink database.

The start site and promoter region of the human gene map to the upstream or intronic region of the mouse gene.
This can result either from incorrect exon annotations, especially of the first exon in one of the species, or from the presence of an alternate transcript. It shouldn't pose a problem unless the first intron is larger than 40 kb, because every gene in GenomeTraFaC includes 40 kb flanking regions on both the 5' and 3' sides.

What is the basis for the 40 kb flanking regions?
Earlier we used 10 kb flanking regions, but regulatory regions are known to occur further upstream. We considered 40 kb a reasonable amount of flanking sequence in which to search for potential conserved cis-clusters. Note, however, that regulatory regions are known to occur as far as 100 kb upstream of the start site in some cases.

In the TraFaC image, I don't find the binding sites that have been experimentally validated or have literature references.
There are two principal reasons. First, the binding site may not be ortholog-conserved: TraFaC shows only the cis-elements that are conserved between two orthologous genes and that occur in a sequence-conserved region. In such cases, try searching for binding sites in the individual sequences separately by clicking the TF BS graphs in the top frame of the regulogram image. Second, the binding site may not be in the TRANSFAC library.

How can I copy the images (Regulogram and/or TraFaC image)?
If you are using a PC, right-click the image and use the Save As option to save it as a JPEG image. If you are using a Mac, you can drag the image to your desktop to save it. You can copy TFBS tables and exon tables and paste them into an Excel worksheet.
