|-----------------------------------------|
| Specification of the GSuite file format |
|-----------------------------------------|
GSuite version: 0.9
Document version: 0.2
Date: May 27, 2015
Authors: Sveinung Gundersen, Boris Simovski, Abdulrahman Azab, Diana Domanska,
Eivind Hovig, Geir Kjetil Sandve
----------------
Contents
----------------
* Introduction and background
- Overview
- Suites of tracks
- Location of tracks
- Preprocessing of tracks into a binary format (BTrack)
- Track types
* Example GSuite files
* Syntax of the GSuite format
- Introduction to the line types
i. Empty lines
ii. Comment lines
1. Header lines
2. Column specification line
3. Track lines
* References
* Change log
-----------------------------------
Introduction and background
-----------------------------------
--------
Overview
--------
GSuite is a simple tabular text format for use in specifying a suite (i.e. a
set or collection) of genomic annotation tracks (simply called tracks in this
document). A GSuite file does not contain any genomic data as such, but
provides metadata necessary to locate the track contents, info on whether the
track has been preprocessed in a manner suitable for analysis (see the BTrack
file format), some basic information on how to analyze the data (see the track
type concept), as well as the reference genome build that the track
coordinates are based upon. In addition, the user may add as many custom
metadata columns as needed.
----------------
Suites of tracks
----------------
Central to the concept of track suites is the idea that tracks which take part
in a GSuite file should be somewhat related in contents and format. Although
the GSuite format allows heterogeneous tracks to be banded together in a
single file, such files will typically not be useful for analysis purposes, as
one would almost always need to restrict the contents and/or format as
required by the analysis tools. For instance, a tool that finds the
intersection of base pairs covered by all tracks in a suite would require all
tracks to be of type "points" or "segments", not "function", as tracks of that
type cover all base pairs (see section Track Types below for more info). For
this reason, the GSuite file format specifies a set of four header variables
that could (and should) be stated in the beginning of the file. These header
variables function as a summary over the tracks in the file, providing a
specific value if all the tracks are in accordance with each other. If the
tracks vary in a particular aspect, the corresponding header variable is set
to the reserved keyword "multiple". This typically indicates that the
collection of tracks is not yet focused enough to be usable as a suite of
tracks for analysis purposes.
------------------
Location of tracks
------------------
In order to analyze multiple tracks, one obviously needs to acquire such
tracks. Some tracks one might have acquired directly from sequencing endeavors
(e.g. ChIP-seq peaks), but often one needs to fetch such tracks from public
repositories and databases such as those provided by the ENCODE [1] and
Roadmap Epigenomics [2] projects. GSuite supports the specification of suites
of tracks before the actual track files have been retrieved from a server. In
such cases, the location of the tracks is termed "remote" and the GSuite file
would typically contain an HTTP or FTP address to the remote location. Tracks
that have been retrieved and are stored at the same place as the GSuite file
are termed "local".
-----------------------------------------------------
Preprocessing of tracks into a binary format (BTrack)
-----------------------------------------------------
As part of the implementation of an analysis tool, one would typically need a
track to be translated, or preprocessed, into a binary format before analysis
takes place, as this greatly improves analysis speed. Often this is done
behind-the-scenes inside the analysis tool. In the GSuite format, however, the
concept of preprocessed binary versions of a track has been included
explicitly as part of the format. The reason for this is that preprocessing
typically takes some time per track, and when one works with multiple tracks
(often hundreds) this step will thus consume a significant amount of time.
Carrying out the preprocessing step as a one-time process, instead of every
time one runs an analysis tool, will thus save much time for the user.
Analysis tools therefore typically require the tracks in a GSuite to be
preprocessed in advance.
Preprocessing of a GSuite file results in the tracks being stored in the
BTrack format. BTrack is a binary format for genomic tracks that allows for
fast retrieval and efficient analyses by the storage of data columns as
numeric arrays. An analogue to the BTrack format in the domain of sequence
alignment is the BAM format, which is a binary version of the textual SAM
format. BTrack is thus the binary version of the previously published GTrack
format [3].
BTrack is the new name for the previously unnamed internal track storage
format used in the Genomic HyperBrowser [3,4]. The BTrack format has seen
several major updates as part of the HyperBrowser code base, and will now soon
be released as a separate binary format that allows multiple tracks to be
stored in a single binary file (currently unpublished). The GSuite format is
intimately linked to the BTrack format, as a BTrack file would be able to
store a GSuite file together with the actual track contents.
-----------
Track types
-----------
The concept of track types has been examined in detail in a previous
publication [3]. Briefly, a track type is a characterization of the
geometrical/mathematical properties of a track. A track is typically
envisioned as data somehow located along the DNA sequence of a particular
reference genome. The simplest track type is "points", which refers to single
base pairs scattered along the genome, e.g. SNPs. "Segments" are the more
common ones, which represent regions of the DNA, e.g. genes. With the
addition of values and/or cross-genomic links, a total of 15 track types were
delineated in [3]. The main usage scenario of track types is to limit which
tracks it makes sense to use as input to a particular analysis tool. For
example, an analysis of the base pair overlap of two tracks would typically
require the tracks to be of type "segments". When it comes to the analysis of
multiple tracks, one would typically require the tracks to be analyzed to be
of the same track type. The GSuite format thus supports "track type" as one of
the main header variables (see below). The following is a list of all 15
supported track types, as delineated in [3]:
Points (P)
Valued Points (VP)
Segments (S)
Valued Segments (VS)
Genome Partition (GP)
Step Function (SF)
Function (F)
Linked Points (LP)
Linked Valued Points (LVP)
Linked Segments (LS)
Linked Valued Segments (LVS)
Linked Genome Partition (LGP)
Linked Step Function (LSF)
Linked Function (LF)
Linked Base Pairs (LBP)
----------------------------
Example GSuite files
----------------------------
Before going into the details of the GSuite format, one should be able to get a
quick overview of the format by looking at these example files:
# Example 1: List of URLs
http://www.server.com/path/to/file.bed
http://www.server.com/path/to/file2.bed
http://www.server2.com/path/to/other_file.bed
ftp://www.server3.com/path/to/new_file.wig
# Example 2: List of URLs with header lines
##location: remote
##file format: primary
##track type: segments
##genome: hg38
http://www.server.com/path/to/file.bed
http://www.server.com/path/to/file2.bed
http://www.server2.com/path/to/other_file.bed
ftp://www.server3.com/path/to/new_file.gff
# Example 3: List of URLs with header lines, comments, extra columns, and
# GSuite-specific URI (Uniform Resource Identifier) schemes
##location: multiple
##file format: multiple
##track type: segments
##genome: hg38
###uri title p-values
http://www.server.com/path/to/file.bed track_1 0.002
http://www.server.com/path/to/file2.bed track_2 0.1
http://www.server2.com/path/to/other_file.bed track_3 1.0
ftp://www.server3.com/path/to/new_file.gff track_4 0.8
galaxy:/abcd1234abcd;bed track_5 0.012
hb:/my/track/name track_6 .
-----------------------------------
Syntax of the GSuite format
-----------------------------------
------------------------------
Introduction to the line types
------------------------------
GSuite is a tabular text file format. All GSuite filenames should end with
".gsuite". The GSuite format consists of 5 different line types, distinguished
by the leading characters and numbered here by order of appearance in the
file:
i. Empty lines
ii. Comment lines
1. Header lines
2. Column specification line
3. Track lines
Note: The Arabic numeral preceding each line type defines the order in which
the lines must be present, i.e. the column specification line must follow the
header lines. Roman numerals indicate comments and empty lines, which may be
present anywhere.
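As an illustration of how the leading characters distinguish the line types,
the following is a minimal Python sketch, not a reference implementation. The
function name "classify_line" and the returned labels are hypothetical. It
assumes that a line starting with more than three hash characters is still
treated as a column specification line, which this document does not state.

```python
def classify_line(line):
    """Return the GSuite line type of a single line of text."""
    # Empty lines contain only whitespace (space, tab, newline, return).
    if line.strip() == "":
        return "empty"
    # Test the longest prefix first: "###" also starts with "##" and "#".
    if line.startswith("###"):
        return "column specification"
    if line.startswith("##"):
        return "header"
    if line.startswith("#"):
        return "comment"
    # Anything else is taken to be a track line.
    return "track"


if __name__ == "__main__":
    for text in ["##genome: hg38", "###uri\ttitle", "# this is a comment",
                 "", "http://www.server.com/path/to/file.bed"]:
        print(repr(text), "->", classify_line(text))
```

The sketch only classifies lines; checking that header lines precede the
column specification line would be a separate step.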
--------------
i. Empty lines
--------------
- Leading characters: none
- Syntax:
only whitespace characters (space, tab, newline, return)
- Usage: optional
- Description:
Empty lines are allowed anywhere in the GSuite file. These will be ignored
by parsers.
-----------------
ii. Comment lines
-----------------
- Leading characters: # (a single hash character)
- Example:
# this is a comment
- Usage: optional
- Description:
Comments are allowed anywhere and will be ignored by parsers. Note that a
comment line following a track line is considered to be a comment for that
track and can for instance be used by tools that create GSuite files to
present track-specific error messages to the user.
---------------
1. Header lines
---------------
- Leading characters: ##
- Syntax:
##variable:[ ]*value
where
variable = Header variable name
[ ]* = Optional space characters
value = Header variable value
- Example:
##location: local
##file format: preprocessed
##track type: segments
##genome: hg38
- Usage:
optional in an input GSuite file, but auto-generated when a GSuite is
created as output from a tool
- Description:
A header variable contains information that relates to the whole of the
GSuite file, and is thus a summary over all the tracks in the file. The
header variable names are limited to a set of reserved keywords, each
with a restricted set of values. The header variables are related to
reserved columns of the track lines (see the section "Column specification
line" below).
- Parser notes:
If a header variable is missing, it will be auto-generated from the track
lines. If a header variable is present, but with a value that is
inconsistent with the track lines, the parser will return an error. Note
that all header variable lines except for the "genome" variable allow a
mix of lower- and uppercase characters.
The following logic for the values "unknown" and "multiple" will hold for
all header variables:
Unknown: if at least one track has "unknown" as its value, the value of the
GSuite header variable will also be "unknown", regardless of the values
for the other tracks.
Multiple: if at least one track has a different value than the others, the
value of the GSuite header variable will be "multiple" (unless the value
for one of the tracks is "unknown", in which case that keyword takes
precedence).
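The header syntax and the "unknown"/"multiple" logic above can be sketched in
Python as follows. This is a sketch under stated assumptions, not the
behaviour of any particular parser: the names "parse_header_line" and
"summarize" are hypothetical, variable names and all values except the genome
are lower-cased as one reading of the case rule, and an empty suite is
summarized as "unknown". The additional rule for track types described under
"Track type" below is not covered here.

```python
import re

# "##variable:[ ]*value", following the syntax given for header lines.
HEADER_RE = re.compile(r"^##([^:]+):[ ]*(.*)$")


def parse_header_line(line):
    """Split a header line into a (variable, value) pair."""
    match = HEADER_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a header line: %r" % line)
    variable = match.group(1).lower()
    value = match.group(2)
    # Assumption: only the value of the "genome" variable is case sensitive.
    if variable != "genome":
        value = value.lower()
    return variable, value


def summarize(track_values):
    """Derive a header value from the per-track values of one column."""
    values = set(track_values)
    # "unknown" takes precedence over everything else.
    if "unknown" in values:
        return "unknown"
    # Differing values are summarized as "multiple".
    if len(values) > 1:
        return "multiple"
    # Assumption: a suite without tracks is summarized as "unknown".
    return values.pop() if values else "unknown"
```

A parser could compare the result of summarize() with the value found in the
header line and report an error if the two are inconsistent.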
Reserved header variable names
------------------------------
- Location:
Specifies whether the data contents of all tracks in the GSuite are found at
remote locations on the Internet, or if they have been downloaded locally to
the service parsing the GSuite file (see section "Location of tracks"
above). Note that the service parsing the GSuite may itself be located on
e.g. a web server, but the tracks of the GSuite are still considered local
if they are on the same server as the service.
The location header is a summary of the different types of URI schemes
present in the "uri" column in the track lines (see the section "Column
specification line" below). All supported types of URIs are thus defined as
either remote or local.
Allowed values: unknown, remote, local, multiple
- File format:
Specifies whether all tracks have been preprocessed into the binary format
BTrack, which is a prerequisite for most analysis tools. The "file format"
header variable is a summary of the contents of the "file_format" column in
the track lines (see the section "Column specification line" below).
Allowed values: unknown, primary, preprocessed, multiple
- Track type:
Specifies the track type common for all the tracks in the GSuite file, if
any. See the section "Track types" above for more information. The "track
type" header variable is a summary of the contents of the "track_type"
column in the track lines (see the section "Column specification line"
below).
Note that if the track types of the tracks are different, but based upon the
same basic type, the common track type of the GSuite file is set to the
simplest track type that can be used to describe all tracks, if any. E.g. if
two tracks have the types "valued segments" and "linked segments",
respectively, the track type of the GSuite file is "segments". If there is
no such simple track type, the keyword "multiple" is used.
Allowed values: unknown, points, valued points, segments, valued segments,
genome partition, step function, function, linked points, linked valued
points, linked segments, linked valued segments, linked genome partition,
linked step function, linked function, linked base pairs, multiple
- Genome:
Specifies the reference genome for all the tracks in the GSuite file. The
"genome" header variable is a summary of the contents of the "genome" column
in the track lines (see the section "Column specification line" below). The
actual keyword for the genome build is dependent on the implementation of
the analysis tools that will make use of the information. The GSuite format
accepts any string as the genome.
Allowed: unknown, multiple, any other string specifying a reference genome
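The rule for finding a common track type (described under "Track type" above)
could be approximated as in the Python sketch below. It rests on an
assumption that is not stated in this document: that "linked" and "valued"
are the only modifiers, and that two track types share a basic type only if
the remaining words are identical. Other relationships between track types
(e.g. between "step function" and "genome partition") are not modelled, so
such combinations yield "multiple" here. The function names are hypothetical.

```python
def split_track_type(track_type):
    """Split a track type name into (linked, valued, basic type)."""
    words = track_type.lower().split()
    linked = words[:1] == ["linked"]
    if linked:
        words = words[1:]
    valued = words[:1] == ["valued"]
    if valued:
        words = words[1:]
    return linked, valued, " ".join(words)


def common_track_type(track_types):
    """Return the simplest type describing all tracks, else "multiple"."""
    types = set(t.lower() for t in track_types)
    if not types or "unknown" in types:
        return "unknown"
    parts = [split_track_type(t) for t in types]
    basic_types = set(basic for _, _, basic in parts)
    # Assumption: different basic types cannot be reconciled.
    if len(basic_types) != 1:
        return "multiple"
    # Keep a modifier only if every track carries it.
    words = []
    if all(linked for linked, _, _ in parts):
        words.append("linked")
    if all(valued for _, valued, _ in parts):
        words.append("valued")
    words.append(basic_types.pop())
    return " ".join(words)
```

With the example from the text, common_track_type(["valued segments",
"linked segments"]) evaluates to "segments".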
----------------------------
2. Column specification line
----------------------------
- Leading characters: ###
- Syntax:
###col1 col2 col3...
where
col1, col2, col3 = Column names
" " = tab character
- Example:
###uri title file_format track_type genome description p-value
(with tabs instead of spaces)
- Default value:
###uri
- Usage:
Optional, but if not defined the column specification line retains the
default value. This means that a list of URIs is a valid GSuite file.
- Description:
The column specification line is a tab-separated list of column names. The
GSuite specification defines a set of five reserved column names:
uri, title, file_format, track_type, genome
In addition, any number of custom column names can be specified.
- Parser notes:
Column names are treated as case insensitive. All column names must also
be unique. The columns can be ordered in any way, but it is recommended for
readability to use "uri" and "title" as the first two columns, if defined.
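A column specification line could be read as in the following Python sketch.
The function name "parse_column_spec" is hypothetical, and lower-casing the
names is just one way to implement the case-insensitive comparison; the
sketch makes no claim about how existing parsers handle this.

```python
RESERVED_COLUMNS = ("uri", "title", "file_format", "track_type", "genome")
DEFAULT_COLUMNS = ["uri"]


def parse_column_spec(line=None):
    """Return the list of column names, or the default if no line is given."""
    if line is None:
        # Without a column specification line, only "uri" is defined.
        return list(DEFAULT_COLUMNS)
    if not line.startswith("###"):
        raise ValueError("not a column specification line: %r" % line)
    # Column names are tab-separated and compared case-insensitively.
    names = [name.lower() for name in line[3:].rstrip("\r\n").split("\t")]
    if len(set(names)) != len(names):
        raise ValueError("column names must be unique: %r" % names)
    return names


if __name__ == "__main__":
    print(parse_column_spec("###uri\ttitle\tp-value"))
    print(parse_column_spec())
```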
Reserved column variable names
------------------------------
- URI:
A unique identifier following the Uniform Resource Identifier format [5].
GSuite supports the following standard URI schemes for data residing at a
remote location:
ftp, http, https, rsync
Examples:
ftp://ftp.server.com/path/to/file.bed
http://www.server.com:8080/index?filename=track.wig
rsync://server.com/path/to/file
For local files, the standard "file" URI scheme is also supported, e.g.:
file:///path/to/file.bed
Note that the "file" scheme does not support files residing anywhere other
than "localhost". The host part of the URI is thus unneeded, hence the
triple '/' characters.
Two additional GSuite-specific URI schemes are supported:
"galaxy" and "hb"
The "Galaxy" scheme uniquely identifies a Galaxy dataset, but currently only
works for the local installation of the Galaxy analysis framework that is
set up with GSuite support, i.e. one cannot (yet) provide a URI to a remote
Galaxy installation [6]. The syntax is as follows:
galaxy:/dataset_key[/directory/structure/to/file]
Multiple files can be stored within one Galaxy history element using the
directory structure syntax.
The "HB" scheme identifies a track stored in the BTrack format within the
local installation of GSuite HyperBrowser. The syntax is as follows:
hb:/track/name/hierarchy
Note that for all the URI schemes except the "HB" one, GSuite supports the
additional specification of file suffix after a semicolon, as in this example:
ftp://ftp.server.com/path/to/file;bed
This is useful if the file path itself does not contain the suffix, and hence
does not contain any information on the actual file format of the track.
- Parser notes:
Note that services available from e.g. the web should disable the "file"
scheme, as this is inherently insecure.
- Title:
The title of the track, as specified by the user. Each track title must be
unique within a specific GSuite, so that one may use the title as a key to
uniquely reference specific tracks in a GSuite.
Allowed values: *any*
- File_format:
Specifies whether the track has been preprocessed into the binary format
BTrack or not, as described in the section "Header lines" above.
If the GSuite parser understands the file suffix to be an un-preprocessed
format, file format is automatically set to "primary". Similarly, tracks in
the BTrack format (including those with "HB" as URI) automatically get
"preprocessed" as "file_format".
Allowed values: unknown, primary, preprocessed
Default value: unknown
- Track_type:
Specifies the track type of the track, as described in the section "Header
lines" above. If the track is preprocessed into a BTrack file, the value of
the "track_type" is automatically collected from the BTrack file(s)
themselves.
Allowed values: unknown, points, valued points, segments, valued segments,
genome partition, step function, function, linked points, linked valued
points, linked segments, linked valued segments, linked genome partition,
linked step function, linked function, linked base pairs
Default value: unknown
- Genome:
Specifies the reference genome build used as basis of the track, as
described in the section "Header lines" above.
Allowed: unknown, any other string specifying a reference genome
Default value: unknown
- Custom columns:
Any number of custom columns can be added. Any string can be used as value
for each track, so there are few or no rules on the content defined
within the GSuite format. Missing values for custom columns are denoted with the
period character: '.'
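The conventions for the "uri" column described above (supported schemes, the
optional file suffix after a semicolon, and the remote/local distinction) can
be sketched in Python as follows. The function name "describe_uri" and the
returned dictionary keys are hypothetical. Two assumptions are made: the
"galaxy" and "hb" schemes count as local, and the last semicolon in the URI
always introduces the file suffix.

```python
REMOTE_SCHEMES = {"ftp", "http", "https", "rsync"}
# Assumption: "galaxy" and "hb" refer to the local installation only.
LOCAL_SCHEMES = {"file", "galaxy", "hb"}


def describe_uri(uri):
    """Return scheme, path, explicit suffix and location for a GSuite URI."""
    scheme, separator, rest = uri.partition(":")
    scheme = scheme.lower()
    if not separator or scheme not in REMOTE_SCHEMES | LOCAL_SCHEMES:
        raise ValueError("unsupported URI: %r" % uri)
    suffix = None
    # All schemes except "hb" allow a file suffix after a semicolon.
    if scheme != "hb" and ";" in rest:
        rest, _, suffix = rest.rpartition(";")
    location = "remote" if scheme in REMOTE_SCHEMES else "local"
    return {"scheme": scheme, "path": rest, "suffix": suffix,
            "location": location}


if __name__ == "__main__":
    for uri in ["ftp://ftp.server.com/path/to/file;bed",
                "galaxy:/abcd1234abcd;bed",
                "hb:/my/track/name"]:
        print(describe_uri(uri))
```

A service exposed on the web could simply leave "file" out of LOCAL_SCHEMES
to disable that scheme, in line with the parser notes for the "uri" column.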
Optional columns
----------------
If the value in the "file_format" column is the same for all tracks in a
GSuite, the column can be removed, leaving only the value of the "file
format" header variable to speak for all individual tracks. The same logic
holds also for the columns "track_type" and "genome".
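One way a parser might resolve the value of an omitted column is sketched
below in Python. The function name "effective_value" and the dictionary
layout are hypothetical. The sketch assumes that the header variable name is
the column name with the underscore replaced by a space, and that "unknown"
is used when neither the column nor a usable header value is available.

```python
def effective_value(track, column, headers):
    """Return the value of a reserved column for a single track.

    track   -- dict mapping column names to values for one track line
    headers -- dict mapping header variable names to values
    """
    # A value given explicitly on the track line is used as-is.
    if column in track:
        return track[column]
    # Otherwise fall back to the header, e.g. "file_format" -> "file format".
    value = headers.get(column.replace("_", " "), "unknown")
    # Assumption: "multiple" cannot describe a single track.
    return "unknown" if value == "multiple" else value


if __name__ == "__main__":
    headers = {"track type": "segments", "genome": "hg38"}
    track = {"uri": "hb:/my/track/name", "title": "track_6"}
    print(effective_value(track, "track_type", headers))
    print(effective_value(track, "file_format", headers))
```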
--------------
3. Track lines
--------------
- Leading characters: none
- Syntax:
val1 val2 val3...
where
val1, val2, val3 = column values
" " = tab character
- Example:
###uri title p-value
http://www.server.com/path/to/file.bed My cool track 0.00013
(with tabs instead of spaces)
- Usage:
Track lines are optional. If no track lines are specified, the GSuite file
represents an empty collection of tracks.
- Description:
Each track is specified as a tab-separated list of metadata values, as
defined by the column specification line. See the section "Column
specification line" for a more detailed discussion on the allowed values.
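Reading a track line against a list of column names could look like the
following Python sketch. The function name "parse_track_line" is
hypothetical. The sketch assumes that a track line must contain exactly one
value per column, which this document does not state explicitly, and it maps
the period character in custom columns to None.

```python
RESERVED_COLUMNS = ("uri", "title", "file_format", "track_type", "genome")


def parse_track_line(line, columns=("uri",)):
    """Return a dict mapping column names to the values of one track line."""
    values = line.rstrip("\r\n").split("\t")
    # Assumption: the number of values must match the number of columns.
    if len(values) != len(columns):
        raise ValueError("expected %d values, found %d"
                         % (len(columns), len(values)))
    track = {}
    for name, value in zip(columns, values):
        # A single period denotes a missing value in a custom column.
        if value == "." and name not in RESERVED_COLUMNS:
            value = None
        track[name] = value
    return track


if __name__ == "__main__":
    columns = ["uri", "title", "p-value"]
    line = "http://www.server.com/path/to/file.bed\tMy cool track\t0.00013"
    print(parse_track_line(line, columns))
```

Values are kept as strings; converting custom columns such as p-values to
numbers is left to the tool that uses them.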
------------------
References
------------------
[1] ENCODE Project Consortium. "An integrated encyclopedia of DNA elements in
the human genome." Nature 489.7414 (2012): 57-74.
[2] Kundaje, Anshul, et al. "Integrative analysis of 111 reference human
epigenomes." Nature 518.7539 (2015): 317-330.
[3] Gundersen, Sveinung, et al. "Identifying elemental genomic track types and
representing them uniformly." BMC Bioinformatics 12.1 (2011): 1.
[4] Sandve, Geir K., et al. "The Genomic HyperBrowser: inferential genomics at
the sequence level." Genome biology 11.12 (2010): 1-12.
[5] Uniform Resource Identifier (URI): Generic Syntax
(https://tools.ietf.org/html/rfc3986)
[6] Goecks, Jeremy, Anton Nekrutenko, and James Taylor. "Galaxy: a comprehensive
approach for supporting accessible, reproducible, and transparent
computational research in the life sciences." Genome Biol 11.8 (2010): R86.
------------------
Change log
------------------
v0.1 - 2015.07.06:
* Initial version of the GSuite specification document.
v0.2 - 2016.07.06:
* Fixed typos and cleaned up text several places. Ready for initial
submission of the GSuite HyperBrowser manuscript.