****** Generic Genome Browser: A Tutorial ******

***** Author: Lincoln Stein, 11 March 2008 *****

This is an extensive tutorial that takes you through the main features and gotchas of GBrowse. It assumes that you have successfully set up Perl, GD, BioPerl and the other GBrowse dependencies. During most of the tutorial, we will be using the "in-memory" GBrowse database (no relational database required!). Later we will show how to set up a genome-sized database using the berkeleydb and MySQL adaptors. For a tutorial that uses the older Bio::DB::GFF adaptor, see Using_GBrowse_with_Bio::DB::GFF.

***** Table of Contents *****

  1. The_Basics
       1. The_Data_File
       2. Defining_Tracks
       3. Adding_Descriptions_to_a_Feature
       4. Adjusting_GBrowse_Name_Searches
       5. Linking
       6. Adding_Popup_Balloons_to_Tracks
            1. Customizing_Balloons
  2. Displaying_Common_Types_of_Features
       1. Multi-segmented_features
       2. Protein-Coding_Genes
            1. Simpler_Genes
       3. Reading_Frames
       4. Grouped_Features
       5. Quantitative_Data_(basic)
       6. Quantitative_Data_(advanced)
       7. DNA_and_3-frame_translations
       8. ESTs_and_Other_Alignments
            1. Adding_DNA_to_Alignments
       9. Trace_Data
  3. GBrowse_Enhancements
       1. Adding_a_"Region"_Panel
       2. Putting_Features_into_the_Overview_&_Regionview
       3. Semantic_Zooming
       4. Grouping_Tracks
       5. Grouping_Tracks_into_a_Table
       6. Using_Plugins
  4. Adding_Features_from_External_Sources
       1. Uploading_an_Annotation_File
       2. Sharing_an_Annotation_File
       3. Using_GBrowse_as_a_DAS_Server_or_Client
            1. Combining_Databases_with_DAS
            2. Exporting_DAS_Tracks_to_Ensembl_and_other_Genome_Browsers
            3. Running_GBrowse_off_DAS_Entirely
  5. Using_Other_Backends
       1. The_Berkeleydb_Backend
            1. The_bp_seqfeature_load.pl_script
       2. The_MySQL_Backend
       3. Other_Backends
  6. Conclusion

***** 1. The Basics *****

We will be working with simulated Volvox genome annotation data.
The database will be named "volvox" and GBrowse will be invoked with this URL:

    http://localhost/cgi-bin/gbrowse/volvox

These directories contain data files used during the tutorial:

    data_files
        DNA and features files to load into the local database.
    conf_files
        GBrowse configuration files for you to take and modify.

To introduce you to the system we will be using a file-based database, which allows GBrowse to run directly off text files. To prepare this database for use, find the GBrowse databases directory that was created in your Apache web server directory at the time of installation. It should be located at /var/www/html/gbrowse/databases, but check to make sure. Similarly, check that you can find the gbrowse.conf configuration directory. It should be located at /etc/httpd/conf/gbrowse.conf and contain the configuration file "yeast_chr1.conf".

Now you will change the ownership of the database and configuration directories so that you can write to them without root privileges. This is only an issue on Unix systems; Windows users can safely ignore this step.

    % su
    Password: *********
    # chown my_user_name /var/www/html/gbrowse/databases
    # chown my_user_name /etc/httpd/conf/gbrowse.conf
    # exit
    %

(Be sure to replace "my_user_name" with your login name!)

Now look around inside the databases directory. There should be a single subdirectory named "yeast_chr1". This subdirectory is where the example yeast chromosome 1 data set is stored. You will create an empty volvox subdirectory and make it world writable. On Unix systems:

    % cd /var/www/html/gbrowse/databases
    % mkdir volvox
    % chmod go+rwx volvox

NOTE: The "%" sign in these examples is the command-line prompt. On Windows systems, the command-line prompt is something like C:\Program Files\Apache Group\Apache2\htdocs\gbrowse\databases>. Unix systems are more variable, but the prompt usually ends with a "%" or a "#".
In all the examples in this tutorial, what you type is rendered in boldface, while prompts and command-line results are shown in medium typeface.

On Windows systems, use the file manager ("Explorer") to create a new folder named "volvox". If you are using Windows NT, 2000 or XP, right-click on the new folder and grant write privileges to all.

You'll now put the first of several data files into the volvox database directory. In the data_files subdirectory of this tutorial you will find the file volvox1.gff3. Copy this into the volvox database directory. On Unix systems:

    % cd /var/www/html/gbrowse
    % cp tutorial/data_files/volvox1.gff3 databases/volvox

On Windows systems, use Explorer to copy the file into the volvox database directory.

Now we will need a GBrowse config file to tell GBrowse how to render this data set. In the subdirectory conf_files, you will find a sample configuration file named volvox.conf. Copy this into your GBrowse configuration directory (/etc/httpd/conf/gbrowse.conf).

You should now be able to view the data set. Point your web browser at http://localhost/cgi-bin/gbrowse/volvox and type "ctgA" in the search box. The result is shown in Figure 1.

[figures/basics1.gif]
Figure 1: volvox1.gff3 data with volvox.conf config file.

**** If You are Having Problems... ****

If for some reason you get a blank page or an "Internal server error", there are a couple of things to check. First, open the file volvox.conf with a text editor ("Notepad" on Windows systems; emacs, pico or vi on Unix systems) and confirm that the path to the volvox database directory in this section is correct:

    db_adaptor = Bio::DB::SeqFeature::Store
    db_args    = -adaptor memory
                 -dir '/var/www/html/gbrowse/databases/volvox'

If there is a space anywhere in your version of the "/var/www/html" path, you must put single quotes around the path as shown in the example above.
Next, check that the volvox1.gff3 file exists inside the volvox database directory and that it is readable by all users on your system. Similarly, check that the volvox.conf configuration file is in the same directory as yeast_chr1.conf, and that it is readable by all users on your system.

Microsoft Windows has an unpleasant tendency to add a ".txt" extension to files without warning. If something seems to be wrong with the config or GFF file and you can't figure out what, check that the file extension hasn't been modified. To avoid this, I suggest that you select "All File Types" from the popup menu in the File Save dialog. You might also want to configure your Folder display to show known file extensions.

If you're still having no luck, check the bottom of the Apache server error log for error messages. This file is located in various places depending on how Apache is installed. Look for the file error_log, typically located in /usr/local/apache/logs, C:\Program Files\Apache Group\Apache2\logs, /var/log/www, or /var/log/httpd. The error message will usually point you in the right direction.

If this doesn't fix the problem, please stop the tutorial and send an e-mail to GBrowse support at gmod-gbrowse@lists.sourceforge.net. Someone will be happy to assist you.

**** 1.1 The Data File ****

Let's now look in detail at the data file we loaded. If you open the volvox1.gff3 file in a text editor, you will see that it contains a series of 15 genome "features" that look like this:

    ctgA  example  contig  1      50000  .  .  .  Name=ctgA
    ctgA  example  remark  1659   1984   .  +  .  Name=f07;Note=This is an example
    ctgA  example  remark  3014   6130   .  +  .  Name=f06;Note=This is another example
    ctgA  example  remark  4715   5968   .  -  .  Name=f05;Note=Ok! Ok! I get the message.
    ctgA  example  remark  13280  16394  .  +  .  Name=f08
    ...

Each feature has a "source" of "example", a type of "remark", and occupies a short range (roughly 1.5k) on a contig named "ctgA".
In addition to the features themselves, there is an entry for the contig itself (type "contig"). This entry is needed to tell GBrowse what the length of ctgA is.

The load file uses a standard known as GFF3_(General_Feature_Format_version_3). Each line of the file corresponds to a feature on the genome, and the nine columns are separated by tabs. The 9 columns are as follows:

  1. reference sequence
     This is the name of the feature that will be used to establish the coordinate system for the annotation. This is usually the name of a chromosome, a clone, or a contig. In our example, the reference sequence is "ctgA". A single GFF file can refer to multiple reference sequences.

  2. source
     The source of the annotation. This field describes how the feature was derived. In the example, the source is "example" for want of a better description. Many people use the source as a way of distinguishing between similar features that were derived by different methods, for example, gene calls derived from different prediction software. You can leave this column blank by replacing the source with a single dot (".").

  3. type
     This column describes the feature type. Although you can choose anything you like to describe the feature type, you are strongly encouraged to use well-recognized sequence ontology (SO) terms such as "gene", "repeat_region", "exon", and "CDS". You can find a list of the recognized SO terms at the_Sequence_Ontology_Project_web_site. For lack of a better name, the features in the volvox example are of type "remark".

  4. start position
     The position that the feature starts at, relative to the reference sequence. The first base of the reference sequence is position 1.

  5. end position
     The end of the feature, again relative to the reference sequence. End is always greater than or equal to start.

  6. score
     For features that have a numeric score, such as sequence similarities, this field holds the score.
     Score units are arbitrary, but most people use the expectation value for similarity features. You can leave it blank by replacing the column with a dot.

  7. strand
     For features that are strand-specific, this field is the strand on which the annotation resides. It is "+" for the forward strand, "-" for the reverse strand, or "." for annotations that are not stranded. If you are unsure of whether a feature is stranded, it won't hurt to use a "+" here.

  8. phase
     For CDS features that encode proteins, this field describes where the next codon starts. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature in order to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the next codon begins at the third base of this region. This information is used by the "cds" glyph to show how the reading frame changes across splice sites. For all other feature types, use a dot here.

  9. attributes
     A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. These tags have predefined meanings:

         ID
             Gives the feature a unique identifier. Useful when grouping features together (such as all the exons in a transcript).
         Name
             Display name for the feature. This is the name to be displayed to the user.
         Alias
             A secondary name for the feature. It is suggested that this tag be used whenever a secondary identifier for the feature is needed, such as locus names and accession numbers.
         Note
             A descriptive note to be attached to the feature.
             This will be displayed as the feature's description.

     Alias and Note fields can have multiple values separated by commas. For example:

         Alias=M19211,gna-12,GAMMA-GLOBULIN

     Other good stuff can go into the attributes field, as we shall see later.

It is very important to have a full-length entry (such as the one for ctgA) for each reference sequence mentioned in the first column of the GFF3 file. However, the reference sequence can have any source and type you choose. Commonly used types are "clone", "chromosome" and "contig".

**** 1.2. Defining Tracks ****

Now we'll look at the configuration file in more detail. Using a text editor, open the volvox.conf file from its location in the gbrowse.conf configuration directory. (If you mess up, you can always copy a fresh version from volvox.conf in the tutorial directory.) Ignore all the stuff in the top 90% of the file, and focus on the last little bit, which starts with the line "### TRACK CONFIGURATION ###":

    [ExampleFeatures]
    feature  = remark
    glyph    = generic
    stranded = 1
    bgcolor  = blue
    height   = 10
    key      = Example Features

This is a "stanza" that describes one of the tracks displayed by GBrowse. The track has an internal name of "ExampleFeatures", which you can use in the URL to turn the track on. The internal name is enclosed by square brackets. Following the track name is a series of options that configure the track.

The "feature" option indicates what feature type(s) to display inside the track. It's currently set to display the "remark" feature type. The "glyph" option specifies the shape of the rendered feature. The default is "generic", which is a simple filled box, but there are dozens of glyphs to choose from. The "stranded" option tells the generic glyph to try to display the strandedness of the feature; this is what creates the little arrow at the end of the box. "bgcolor" and "height" control the background color and height of the glyph respectively, and "key" assigns the track a human-readable label.
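
To make the shape of a track stanza concrete, here is a minimal sketch in Python that reads the [ExampleFeatures] stanza into a dictionary of options. It treats the stanza as INI-style text and uses Python's standard configparser module. This is only an approximation for illustration: GBrowse reads its configuration with its own Perl code, and the helper name read_tracks is made up for this example.

```python
import configparser

# The track stanza from volvox.conf, copied from the tutorial text.
STANZA = """\
[ExampleFeatures]
feature  = remark
glyph    = generic
stranded = 1
bgcolor  = blue
height   = 10
key      = Example Features
"""


def read_tracks(text):
    """Return {track name: {option: value}} for INI-style stanza text.

    Hypothetical helper: an approximation of the stanza layout, not
    GBrowse's own configuration parser.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


tracks = read_tracks(STANZA)
for name, options in tracks.items():
    print("track:", name)
    for option, value in options.items():
        print("   ", option, "=", value)
```

The bracketed line becomes the track's internal name, and every "option = value" line below it becomes one entry in that track's dictionary, which mirrors the way the stanza is described above.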
Let's experiment with changing the track definition. First, let's change the color of the glyph. With your text editor, change the bgcolor option from "blue" to "orange", save the file, and reload the page. The features should change immediately, as shown in Figure 2.

[figures/basic_conf1.gif]
Figure 2: A Feature of a Different Color

Note: Many of the screenshots in this tutorial are from earlier versions of GBrowse and may not look exactly the same as the current version.

Please experiment with other changes! Try changing the height to 5, the key to "Skinny features" and the stranded option to 0 (which means "false"). Just by changing a few options, you can create a very distinctive track.

Now let's try changing the glyph. One of the standard glyphs was designed to show PCR primer pairs and is called "primers". Change "glyph = generic" to "glyph = primers" and reload the page. Depending on other changes that you might have made earlier, the result will look something like Figure 3.

[figures/basic_conf2.gif]
Figure 3: Using the primers Glyph

We'll see other examples of glyphs later on. To get a list of the most popular glyphs and the options that are available for them, see the file CONFIGURE_HOWTO.txt, located in the docs/ subdirectory of the GBrowse distribution. Or for the gory and bleeding-edge details, run the command:

    % perldoc Bio::Graphics::Panel

This produces copious documentation on the Perl interface to all the glyphs, including some amazingly obscure ones, from which you should be able to deduce the GBrowse equivalents.

**** 1.3. Adding Descriptions to a Feature ****

By default, GBrowse will display the name of the feature above its glyph, provided that there is sufficient space to do this. Optionally, you can also attach some descriptive text to the feature. This text will be displayed below the feature, and can also be searched.

You can place descriptions, notes and other comments into the ninth column of the GFF load file.
The example file volvox2.gff3 shows how this is done. An excerpt from the top of the file looks like this:

    ctgA  example  polypeptide_domain  11911  15561  .  +  .  Name=m11;Note=kinase
    ctgA  example  polypeptide_domain  13801  14007  .  -  .  Name=m05;Note=helix loop helix
    ctgA  example  polypeptide_domain  14731  17239  .  -  .  Name=m14;Note=kinase
    ctgA  example  polypeptide_domain  15396  16159  .  +  .  Name=m03;Note=zinc finger

This defines several new features of type "polypeptide_domain". The ninth column, in addition to giving each of the motifs a name, adds a "Note" attribute to each feature. As described earlier, the attributes are tag=value pairs separated by semicolons. The attribute named Note is automatically displayed and made searchable.

To see this work, add volvox2.gff3 to the volvox database. You can do this just by copying the file into /var/www/html/gbrowse/databases/volvox so that the directory contains both the original volvox1.gff3 and the new volvox2.gff3 files.

To display this newly loaded data set, open up volvox.conf and add the following new stanza to the config file:

    [Motifs]
    feature      = polypeptide_domain
    glyph        = span
    height       = 5
    description  = 1
    key          = Example motifs

This defines a new track whose internal name is "Motifs". The corresponding feature type is "polypeptide_domain" and it uses the "span" glyph, a graphic that displays a horizontal line capped by vertical endpoints. The height is set to five pixels, and the human-readable key is set to "Example motifs". A new option, "description", is a flag that tells GBrowse to display the Note attribute, if any. Any non-zero value means true.

After updating the configuration file, you will need to reload the browser page and turn on the "Example motifs" checkbox below the main image. The result is shown in Figure 4.

[figures/descriptions1.gif]
Figure 4: Showing the Note attribute

A copy of this config file is also available for you to use in volvox2.conf.
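
If you are curious how the nine-column layout and the tag=value attributes fit together, the following Python sketch splits the four excerpt lines above into named columns, decodes the ninth column, and then filters on the Note attribute. It is a simplified illustration under stated assumptions, not the loader GBrowse uses: the function names, the column labels, and the substring matching on notes are choices made for this example only.

```python
from urllib.parse import unquote

# The four excerpt lines from volvox2.gff3, with real tab separators.
GFF3_LINES = [
    "ctgA\texample\tpolypeptide_domain\t11911\t15561\t.\t+\t.\tName=m11;Note=kinase",
    "ctgA\texample\tpolypeptide_domain\t13801\t14007\t.\t-\t.\tName=m05;Note=helix loop helix",
    "ctgA\texample\tpolypeptide_domain\t14731\t17239\t.\t-\t.\tName=m14;Note=kinase",
    "ctgA\texample\tpolypeptide_domain\t15396\t16159\t.\t+\t.\tName=m03;Note=zinc finger",
]

# Column labels follow the nine-column description in section 1.1.
COLUMNS = ("reference", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_attributes(text):
    """Turn 'Name=m11;Note=kinase' into {'Name': ['m11'], 'Note': ['kinase']}."""
    attributes = {}
    for pair in text.split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # Values may be comma-separated lists; URL escapes are decoded.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


def parse_line(line):
    """Split one GFF3 line into a dictionary keyed by column label."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    feature["attributes"] = parse_attributes(feature["attributes"])
    return feature


def search_notes(features, keyword):
    """Return features with a Note containing keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [f for f in features
            if any(keyword in note.lower()
                   for note in f["attributes"].get("Note", []))]


features = [parse_line(line) for line in GFF3_LINES]
for hit in search_notes(features, "kinase"):
    print(hit["attributes"]["Name"][0], hit["start"], hit["end"])
```

Run against the excerpt, the sketch reports the two motifs whose Note is "kinase" (m11 and m14), which is the same pair you should expect a keyword search on the excerpt to turn up.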
To show that GBrowse will search the notes for keyword matches, try typing in "kinase". You will be presented with a list of all the motifs whose Note attribute contains the word "kinase".

**** 1.4. Adjusting GBrowse Name Searches ****

GBrowse has a very flexible search feature. You can type in the name of a reference sequence, such as "ctgA", and it will display the entire thing, or you can type in a range in the format "ctgA:start..stop". Try "ctgA:5000..8000" to see this at work.

In addition, GBrowse can search for features by name. Anything that has a Name or Alias attribute in the GFF3 file can be searched for by name. For example, try searching for "f10" or even "f1*".

The only drawback to this is that you may have name collisions. For example, some research communities distinguish genes from their products using differences in capitalization, such as hga and HGA. However, GBrowse's searches are case insensitive. To avoid name collisions, you can give each type of feature a distinctive naming prefix, for example "Gene:hga" and "Protein:HGB".

To illustrate how this works, have a look at volvox2b.gff3:

    ctgA  example  remark                             1000  2000  .  .  .  Name=Remark
    ctgA  example  protein_coding_primary_transcript  1100  2000  .  +  .  Name=Gene:hga
    ctgA  example  polypeptide                        1200  1900  .  +  .  Name=Protein:HGA
    ctgA  example  protein_coding_primary_transcript  1600  3000  .  -  .  Name=Gene:hgb
    ctgA  example  polypeptide                        1800  2900  .  -  .  Name=Protein:HGB

Copy this file into the databases/volvox folder. Note that as you add new files to the database folder, you may need to disable caching to see the new features show up immediately. To do this, simply scroll down to the "Display Settings" portion of the GBrowse display and unselect "Cache tracks".
Now add the following configuration stanza to volvox.conf to create a track that displays both protein_coding_primary_transcript and polypeptide features:

    [NameTest]
    feature  = protein_coding_primary_transcript polypeptide
    glyph    = generic
    stranded = 1
    bgcolor  = green
    height   = 10
    key      = Name test track

This stanza creates a new track named "Name test track" and displays features of type "protein_coding_primary_transcript" and "polypeptide" using green generic glyphs that are 10 pixels high.

When you look at the data file, you'll see that there are three things potentially named "HGA": a remark which uses the unqualified name, a gene which uses the qualified name "Gene:hga", and a polypeptide region which uses the qualified name "Protein:HGA". There is also a protein_coding_primary_transcript named "Gene:hgb" and a protein named "Protein:HGB". (Note: in this track we are using slightly awkward sequence ontology terms, like "protein_coding_primary_transcript", rather than more natural terms like "gene", to keep these example features from appearing in the real "gene" track that we create later on in this tutorial.)

To see how GBrowse searches for names, type "HGA" (either upper or lowercase) in the search textbox and press "Search". Because the search term matches the remark whose unqualified name is HGA, GBrowse will bring up the region between 1000..2000 and highlight the HGA remark.

Now search for "Protein:HGA". Because you searched with the qualified name, GBrowse will find and highlight the protein feature.

Now try to search for "HGB". This search fails because HGB only exists in qualified form in the database. You can still, however, search for "Gene:HGB" or "Protein:Hgb" (capitalization doesn't matter).

This may or may not be the behavior that you desire.
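
The matching behavior just described can be pictured with a small toy model in Python. It is not GBrowse's actual search code; it simply assumes that a query must match a whole feature name, that case is ignored, and that "*" acts as a wildcard. The list NAMES holds the four qualified names from volvox2b.gff3, and the function name search is invented for this sketch.

```python
from fnmatch import fnmatchcase

# The four qualified names from volvox2b.gff3.
NAMES = ["Gene:hga", "Protein:HGA", "Gene:hgb", "Protein:HGB"]


def search(names, query):
    """Toy name search: whole-name, case-insensitive, '*' is a wildcard.

    A simplified model for illustration only, not GBrowse's search code.
    """
    pattern = query.lower()
    return [name for name in names if fnmatchcase(name.lower(), pattern)]


print(search(NAMES, "HGB"))          # no feature is named plain "HGB"
print(search(NAMES, "Gene:HGB"))     # qualified name, case ignored
print(search(NAMES, "Protein:Hgb"))  # capitalization doesn't matter
print(search(NAMES, "gene:hg*"))     # wildcard matches both genes
```

In this model the unqualified query "HGB" comes back empty because no feature carries exactly that name, while the qualified queries succeed regardless of capitalization.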
If you would like GBrowse to search through qualified names when the user types the unqualified version, you can configure this easily by adding the following line to volvox.conf under the [General] section:

    automatic classes = Gene Protein

This option directs GBrowse to search for the unqualified name first, followed by names prefixed with "Gene:" and then names prefixed with "Protein:". Whichever is found first will be displayed. Now searching for "HGB" will find "Gene:hgb". Swapping the order of Gene and Protein on this line will cause "Protein:HGB" to be found instead.

Another way to approach this is to make liberal use of the Alias attribute. For example:

    ctgA  example  remark                             1000  2000  .  .  .  Name=Remark:HGA;Alias=hga
    ctgA  example  protein_coding_primary_transcript  1100  2000  .  +  .  Name=Gene:hga;Alias=hga
    ctgA  example  polypeptide                        1200  1900  .  +  .  Name=Protein:HGA;Alias=hga
    ctgA  example  protein_coding_primary_transcript  1600  3000  .  -  .  Name=Gene:hgb;Alias=hgb
    ctgA  example  polypeptide                        1800  2900  .  -  .  Name=Protein:HGB;Alias=hgb

This assigns the alias "hga" to each of the three HGA features, and the alias "hgb" to each of the two HGB features. This keeps the identities of these features distinct, so that you can find a particular one by typing in the fully qualified name ("Gene:hga"), but find all candidates when you type in the unqualified name. For instance, when you search with "hga", GBrowse will now offer you three matches:

[figures/aliases.gif]
Figure 5: Searching for aliases

**** 1.5. Linking ****

The next topic we'll cover in this tutorial is configuring GBrowse's outgoing links. When the user clicks on a glyph in the details image, the browser follows a URL to another page. The URL to follow is generated from the link option. The default link option is located in the [TRACK DEFAULTS] section of the config file; you can specify track-specific links by placing a link option in one or more of the individual track stanzas.
The volvox.conf track defaults look like this:

    [TRACK DEFAULTS]
    glyph         = generic
    height        = 10
    bgcolor       = lightgrey
    fgcolor       = black
    font2color    = blue
    label density = 25
    bump density  = 100
    # where to link to when user clicks in detailed view
    link          = AUTO

In this case, we've been using a special link URL of "AUTO". This generates an automatic link to a helper script named "gbrowse_details". If you click on some of the features in the current volvox page you'll get an idea of what this script displays. Try clicking on a motif, a spliced transcript, the EDEN gene, and an EST. When you click on the spliced transcript, notice that the content of the "Gene" attribute is displayed. By adding attributes like this one, you can build up a very modest web-browsable database of facts about your features.

We're going to override the default link rule for the motif track. There's nothing sensible to link to, so we'll link to Google using first the motif's name, and then the motif's description. Go to the [Motifs] stanza in the volvox.conf config file and modify it so that it looks like this:

    [Motifs]
    feature      = polypeptide_domain
    glyph        = span
    height       = 5
    description  = 1
    link         = http://www.google.com/search?q=$name
    key          = Example motifs

The only change we've made is to add a "link" option to the stanza, where the value is a Google search URL. "$name" is a Perl variable. GBrowse will fill in this variable with the name of the motif. Reload the page and click on a motif to see that this works as advertised ("m01", "m02" and the other example motifs are similar to the names of galactic clusters, so be prepared for some astronomy hits).

It would be more sensible to link to the description of the motif, for example "helix loop helix". Fortunately we can do that too. Just change the link option to:

    link = http://www.google.com/search?q=$description

There are a large number of possible variables that you can use inside link rules.
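
As a rough picture of what variable substitution in a link rule amounts to, here is a short Python sketch that fills $name and $description into the two Google link rules shown above. It is an illustration only: the function expand_link is hypothetical, GBrowse performs its own substitution in Perl, and URL-escaping the values with quote_plus is an assumption of this sketch rather than documented GBrowse behavior.

```python
from string import Template
from urllib.parse import quote_plus


def expand_link(rule, feature):
    """Fill $variables in a link rule from a dictionary of feature values.

    Hypothetical helper. Escaping the values for use in a URL is an
    assumption made for this sketch.
    """
    escaped = {key: quote_plus(value) for key, value in feature.items()}
    # safe_substitute leaves any unknown $variable untouched.
    return Template(rule).safe_substitute(escaped)


# Motif m05 and its Note, taken from the volvox2.gff3 excerpt.
motif = {"name": "m05", "description": "helix loop helix"}

print(expand_link("http://www.google.com/search?q=$name", motif))
print(expand_link("http://www.google.com/search?q=$description", motif))
```

The same link rule therefore yields a different URL for every feature in the track, which is why a single "link" line in the stanza is enough.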
See the CONFIGURE_HOWTO document in the GBrowse distribution for the full list of link variables. You can also construct links using Perl callbacks, as described in the section on displaying_ESTs. This gives you the ability to generate any arbitrary URL. If you want nothing to happen when the user clicks on a feature, just set link to empty ("link = ").

The last thing we'll do is to change the behavior of the [Motifs] track so that:

  1. a new window pops up with the Google search, rather than replacing the contents of the current window
  2. when the user mouses over a motif, a hints box appears saying that clicking there will initiate a Google search

These changes are easy:

    [Motifs]
    feature      = polypeptide_domain
    glyph        = span
    height       = 5
    description  = 1
    link         = http://www.google.com/search?q=$description
    link_target  = _blank
    title        = Search Google for $description.
    key          = Example motifs

There's now a link_target option. This contains the name of a browser window in which to load the content when the user clicks on the feature. If there's no window of that name, the browser will create a new window and give it the desired name. Choose an ordinary name like "Google" if you want the Google content to be loaded into the same window each time, or choose "_blank", as we've done here, to pop up a fresh window each time the user clicks.

The title option contains a bit of text that will be displayed whenever the user hovers the mouse over the feature for a second or two. The same variable substitution rules apply, so when the user mouses over feature "m06", a hints window will pop up that says "Search Google for SUSHI repeat." Give it a try!

**** 1.6. Adding Popup Balloons to Tracks ****

GBrowse can display popup balloons when the user hovers over or clicks on a feature. The balloons can display arbitrary HTML, either provided in the config file, or fetched remotely via a URL.
You can use this feature to create multiple-choice menus when the user clicks on the feature, to pop up images on mouse hovers, or even to create little embedded query forms. See http://mckay.cshl.edu/balloons.html for examples.

In the config file for the database you wish to modify, set "balloon tips" to a true value:

    [GENERAL]
    ...
    balloon tips = 1

Then add "balloon hover" and/or "balloon click" options to the track stanzas that you wish to add balloons to. You can also place these options in [TRACK DEFAULTS] to create a default balloon. "balloon hover" specifies HTML or a URL that will be displayed when the user hovers over a feature. "balloon click" specifies HTML or a URL that will appear when the user clicks on a feature. The HTML can contain images, formatted text, and even controls. Examples:

    balloon hover =
            Gene $name
    balloon click =
            Gene $name
            Search Google
            Search NCBI

For example, to add a balloon to the motifs track of the Volvox browser, add "balloon tips = 1" near the top of the volvox.conf file, and then add balloon hover and balloon click options like this:

    [Motifs]
    feature      = polypeptide_domain
    glyph        = span
    height       = 5
    description  = 1
    category     = Proteins
    balloon hover =
            Gene $name
    balloon click =
            Gene $name
            Search Google
            Search NCBI
    key          = Example motifs

Alternatively, you can populate the balloon using data from an HTML page or dynamic CGI script running on the same server as GBrowse. This uses AJAX; it can often speed up page loading by reducing the amount of text that must be downloaded by the client. To dynamically load the balloon contents from the server, use a balloon hover or balloon click option like this:

    balloon click = /cgi-bin/get_gene_data?gene=$name

In this case, when the user clicks on the feature, GBrowse creates a balloon whose content is the HTML returned by the CGI script "get_gene_data". GBrowse knows that this is a URL rather than the contents of the balloon by looking for the leading slash. However, to reduce ambiguity, we recommend that you prefix the URL with "url:" like so:

    balloon click = url:/cgi-bin/get_gene_data?gene=$name

This also allows you to refer to relative URLs:

    balloon click = url:../../get_gene_data?gene=$name

It is also possible to fill the balloon with content from a remote source. Simply specify a full URL beginning with "http:", "https:" or "ftp:":

    balloon hover = http://www.wormbase.org/db/get?name=$name;class=gene

Note that the balloon library uses an internal