|-----------------------------------------| | Specification of the GSuite file format | |-----------------------------------------| GSuite version: 0.9 Document version: 0.2 Date: May 27, 2015 Authors: Sveinung Gundersen, Boris Simovski, Abdulrahman Azab, Diana Domanska, Eivind Hovig, Geir Kjetil Sandve ---------------- Contents ---------------- * Introduction and background - Overview - Suites of tracks - Location of tracks - Preprocessing of tracks into a binary format (BTrack) - Track types * Example GSuite files * Syntax of the GSuite format - Introduction to the line types i. Empty lines ii. Comment lines 1. Header lines 2. Column specification line 3. Track lines * References * Change log Introduction and background ----------------------------------- -------- Overview -------- GSuite is a simple tabular text format for use in specifying a suite (ie. a set or collection) of genomic annotation tracks (simply called tracks in this document). A GSuite file does not contain any genomic data as such, but provides metadata necessary to locate the track contents, info on whether the track has been preprocessed in a manner suitable for analysis (see the BTrack file format), some basic information on how to analyze the data (see the track type concept), as well as the reference genome build that the track coordinates are based upon. In addition to this, the user may add as many custom metadata columns he/she needs. ---------------- Suites of tracks ---------------- Central to the concept of track suites is the idea that tracks which take part in a GSuite file should be somewhat related in contents and format. Although the GSuite format allows heterogeneous tracks to be banded together in a single file, such files will typically not be useful for analysis purposes, as one would almost always need to restrict the contents and/or format as required by the analysis tools. For instance, a tool that finds the intersection of base pairs covered by all tracks in a suite would require all tracks to be of type "points" or "segments", not "function", as tracks of that type cover all base pairs (see section Track Types below for more info). For this reason, the GSuite file format specifies a set of four header variables that could (and should) be stated in the beginning of the file. These header variables function as a summary over the tracks in the file, providing a specific value if all the tracks are in accordance with each other. If the different tracks varies on this particular aspect, the header variable is set to the reserved keyword "multiple". This typically indicates that the collection of tracks is not yet focused enough to be usable as a suite of tracks for analysis purposes. ------------------ Location of tracks ------------------ In order to analyze multiple tracks, one obviously needs to acquire such tracks. Some tracks one might have acquired directly from sequencing endeavors (e.g. ChIP-seq peaks), but often one needs to fetch such tracks from public repositories and databases such as those provided by the ENCODE [1] and Roadmap Epigenomics [2] projects. GSuite supports the specification of suites of tracks before the actual track files has been retrieved from a server. In such cases, the location of the tracks are termed "remote" and the GSuite file would typically contain an HTTP or FTP address to the remote location. Tracks that have been retrieved and is stored at the same place as the GSuite file are termed as "local". ----------------------------------------------------- Preprocessing of tracks into a binary format (BTrack) ----------------------------------------------------- As part of the implementation of an analysis tool, one would typically need a track to be translated, or preprocessed, into a binary format before analyses takes place, as this greatly improves analysis speed. Often this is done behind-the-scenes inside the analysis tool. In the GSuite format, however, the concept of preprocessed binary versions of a track has been included explicitly as part of the format. The reason for this is that preprocessing typically takes some time per track, and when one works with multiple tracks (often hundreds) this step will thus consume a significant amount of time. Carrying out the preprocessing step as a one-time process, instead of every time one runs an analysis tool, will thus save much time for the user. Analysis tools therefore typically require the tracks in a GSuite to be preprocessed in advance. Preprocessing of a GSuite file results in the tracks being stored in the BTrack format. BTrack is a binary format for genomic tracks that allows for fast retrieval and efficient analyses by the storage of data columns as numeric arrays. An analogue to the BTrack format in the domain of sequence alignment is the BAM format, which is a binary version of the textual SAM format. BTrack is thus the binary version of the previously published GTrack format [3]. BTrack is the new name for the previously unnamed internal track storage format used in the Genomic HyperBrowser [3,4]. The BTrack format has seen several major updates as part of the HyperBrowser code base, and will now soon be released as a separate binary format that allows multiple tracks to be stored in a single binary file (currently unpublished). The GSuite format is intimately linked to the BTrack format, as a BTrack file would be able to store both a GSuite file together with the actual track contents. ----------- Track types ----------- The concept of track types has been examined in detail in a previous publication [2]. Briefly, a track type is a characterization of a the geometrical/mathematical properties of a track. A track is typically envisioned as data somehow located along the DNA sequence of a particular reference genome. The simplest track type is "points", which refer to single base pairs scattered along the genome, e.g. SNPs. "Segments" are the more common ones, which represents regions of the DNA, e.g. genes. With the addition of values and/or cross-genomic links, a total of 15 track types was delineated in [3]. The main usage scenario of track types is to limit which tracks it makes sense to use as input to a particular analysis tool. For example, an analysis of the base pair overlap of two tracks would typically require the tracks to be of type "segments". When it comes to the analysis of multiple tracks, one would typically require the tracks to be analyzed to be of the same track type. The GSuite format thus supports "track type" as one of the main header variables (see below). The following is a list of all the supported 15 track types, as delineated in [3]: Points (P) Valued Points (VP) Segments (S) Valued Segments (VS) Genome Partition (GP) Step Function (SF) Function (F) Linked Points (LP) Linked Valued Points (LVP) Linked Segments (LS) Linked Valued Segments (LVS) Linked Genome Partition (LGP) Linked Step Function (LSF) Linked Function (LF) Linked Base Pairs (LBP) ---------------------------- Example GSuite files ---------------------------- Before going into the details of the GSuite format, one should be able to get a quick overview of the format by looking at these example files: # Example 1: List of URLs http://www.server.com/path/to/file.bed http://www.server.com/path/to/file2.bed http://www.server2.com/path/to/other_file.bed ftp://www.server3.com/path/to/new_file.wig # Example 2: List of URLs with header lines ##location: remote ##file format: primary ##track type: segments ##genome: hg38 http://www.server.com/path/to/file.bed http://www.server.com/path/to/file2.bed http://www.server2.com/path/to/other_file.bed ftp://www.server3.com/path/to/new_file.gff # Example 3: List of URLs with header lines, comments, extra columns, and # GSuite-specific URI (Uniform Resource Identifier) schemes ##location: multiple ##file format: multiple ##track type: segments ##genome: hg38 ###uri title p-values http://www.server.com/path/to/file.bed track_1 0.002 http://www.server.com/path/to/file2.bed track_2 0.1 http://www.server2.com/path/to/other_file.bed track_3 1.0 ftp://www.server3.com/path/to/new_file.gff track_4 0.8 galaxy:/abcd1234abcd;bed track_5 0.012 hb:/my/track/name track_6 . ----------------------------------- Syntax of the GSuite format ----------------------------------- ------------------------------ Introduction to the line types ------------------------------ GSuite is a tabular text file format. All GSuite filenames should end with ".gsuite". The GSuite format consists of 5 different line types, distinguished by the leading characters and numbered here by order of appearance in the file: i. Empty lines ii. Comment lines 1. Header lines 2. Column specification line 3. Track lines Note: The arabic number preceding each line type defines the order in which the lines must be present. I.e. column specification must follow the header lines. Roman numbers indicate comments and emtpy lines, which may be present anywhere. -------------- i. Empty lines -------------- - Leading characters: none - Syntax: only whitespace characters (space, tab, newline, return) - Usage: optional - Description: Empty lines are allowed anywhere in the GSuite file. These will be ignored by the parsers ----------------- ii. Comment lines ----------------- - Leading characters: # (a single hash character) - Example: # this is a comment - Usage: optional - Description: Comments are allowed anywhere and will be ignored by parsers. Note that a comment line following a track line is considered to be a comment for that track and can for instance be used by tools that creates GSuite files to present track-specific error messages to the user. --------------- 1. Header lines --------------- - Leading characters: ## - Syntax: ##variable:[ ]*value where variable = Header variable name [ ]* = Optional space characters value = Header variable value - Example: ##location: local ##file format: preprocessed ##track type: segments ##genome: hg38 - Usage: optional in an input GSuite file, but auto-generated when a GSuite is created as output from a tool - Description: A header variable contains information that relates to the whole of the GSuite file, and is thus a summary over all the tracks in the file. The header variables names are limited to a set of reserved keywords, each with a restricted set of values. The header variables are related to reserved columns of the track lines (see the section "Column specification line" below). - Parser notes: If a header variable is missing, it will be auto-generated from the track lines. If a header variable is present, but with a value that is inconsistent with the track lines, the parser will return an error. Note that all header variable lines except for the "genome" variable allow a mix of lower- and uppercase characters. The following logic for the values "unknown" and "multiple" will hold for all header variables: Unknown: if at least one track has "unknown" as it value, the value of the GSuite header variable will also be "unknown", regardless of the values for the other tracks. Multiple: if at least one track has a different value than the others, the value of the GSuite header variable will be "multiple" (unless the value for one of the tracks is "unknown", in which case that keyword takes precedence). Reserved header variable names ------------------------------ - Location: Specifies whether the data contents of all tracks in the GSuite are found at remote locations on the Internet, or if they have been downloaded locally to the service parsing the GSuite file (see section "Location of tracks" above). Note that the service parsing the GSuite may itself be located on e.g. a web server, but the tracks of the GSuite is still considered as local if they are on the same server as the service. The location header is a summary of the different types of URI schemes present in the "uri" column in the track lines (see the section "Column specification line" below). All supported types of URIs are thus defined as either remote or local. Allowed values: unknown, remote, local, multiple - File format: Specifies whether all tracks have been preprocessed into the binary format BTrack, which is a prerequisite for most analysis tools. The "file format" header variable is a summary of the contents of the "file_format" column in the track lines (see the section "Column specification line" below). Allowed values: unknown, primary, preprocessed, multiple - Track type: Specifies the track type common for all the tracks in the GSuite file, if any. See the section "Track types" above for more information. The "track type" header variable is a summary of the contents of the "track_type" column in the track lines (see the section "Column specification line" below). Note that if the track types of the tracks are different, but based upon the same basic type, the common track type of the GSuite file is set to the simplest track type that can used to describe all tracks, if any. E.g. if two tracks have the types "valued segments" and "linked segments", respectively, the track type of the GSuite file is "segments". If there is no such simple track type, the keyword "multiple" is used. Allowed values: unknown, points, valued points, segments, valued segments, genome partition, step function, function, linked points, linked valued points, linked segments, linked valued segments, linked genome partition, linked step function, linked function, linked base pairs, multiple - Genome: Specifies the reference genome for all the tracks in the GSuite file. The "genome" header variable is a summary of the contents of the "genome" column in the track lines (see the section "Column specification line" below. The actual keyword for the genome build is dependent on the implementation of the analysis tools that will make use of the information. The GSuite format accepts any string as the genome. Allowed: unknown, multiple, any other string specifying a reference genome ---------------------------- 2. Column specification line ---------------------------- - Leading characters: ### - Syntax: ###col1 col2 col3... where col1, col2, col3 = Column names " " = tab character - Example: ###uri title file_format track_type genome description p-value (with tabs instead of spaces) - Default value: ###uri - Usage: Optional, but if not defined the column specification line retains the default value. This means that a list of URI's is a valid GSuite file. - Description: The column specification line is a tab-separated list of column names. The GSuite specification defines a set of five reserved column names: uri, title, file_format, track_type, genome In addition, any number of custom column names can be specified. - Parser notes: Column names are treated as case insensitive. All column names must also be unique. The columns can be ordered in any way, but it is recommended for readability to use "uri" and "title" as the first two rows, if defined. Reserved column variable names ------------------------------ - URI: A unique identifier following the Universal Resource Identifier format [5]. GSuite supports the following standard URI schemes for data residing at a remote location: ftp, http, https, rsync Examples: ftp://ftp.server.com/path/to/file.bed http://www.server.com:8080/index?filename=track.wig rsync://server.com/path/to/file For local files, the standard "file" URI scheme is also supported, e.g.: file:///path/to/file/bed Note that the "file" scheme does not support files residing other places than "localhost". The host part of the URI is thus uneeded, hence the triple '/' characters. Two more specifically specified URIs schemes are supported by GSuite: "galaxy" and "hb" The "Galaxy" scheme uniquely identifies a Galaxy dataset, but currently only works for the local installation of the Galaxy analysis framework that is set up with GSuite support, i.e. one cannot (yet) provide an URI to a remote Galaxy installation [6]. The syntax is as follows: galaxy:/dataset_key[/directory/structure/to/file] Multiple files can be stored within one Galaxy history element using the directory structure syntax. The "HB" scheme identifies a track stored as the BTrack format within the local installation of GSuite HyperBrowser. The syntax is as follows: hb:/track/name/hierarchy Note that for all the URI schemes except the "HB" one, GSuite supports the additional specification of file suffix after a semicolon, as in this example: ftp://ftp.server.com/path/to/file;bed This usable if the file path itself does not contain the suffix, and hence does not contain any information on the actual file format of the track. - Parser notes: Note that services available from e.g. the web should disable the "file" scheme, as this is inherently insecure. - Title: The title of the track, as specified by the user. Each track title must be unique within a specific GSuite, so that one may use the title as a key to uniquely reference specific tracks in a GSuite. Allowed values: *any* - File_format: Specifies whether the track has been preprocessed into the binary format BTrack or not, as described in the section "Header lines" above. If the GSuite parser understands the file suffix to be an un-preprocessed format, file format is automatically set to "primary". Similarly, tracks in the BTrack format (including those with "HB" as URI) automatically gets "preprocessed" as "file_format". Allowed values: unknown, primary, preprocessed Default value: unknown - Track_type: Specifies the track type of the track, as described in the section "Header lines" above. If the track is preprocessed into a BTrack file, the value of the "track_type" is automatically collected from the BTrack file(s) themselves. Allowed values: unknown, points, valued points, segments, valued segments, genome partition, step function, function, linked points, linked valued points, linked segments, linked valued segments, linked genome partition, linked step function, linked function, linked base pairs Default value: unknown - Genome: Specifies the reference genome build used as basis of the track, as described in the section "Header lines" above. Allowed: unknown, any other string specifying a reference genome Default value: unknown - Custom columns Any number of custom columns can be added. Any string can be used as value for each track, so there are little or no rules on the content defined within the GSuite format. Missing values for custom columns are denoted with the period character: '.' Optional columns ---------------- If the value in the "file_format" column is the same for all tracks in a GSuite, the column can be removed, leaving only the value of the "file format" header variable to speak for all individual tracks. The same logic holds also for the columns "track_type" and "genome". -------------- 3. Track lines -------------- - Leading characters: none - Syntax: val1 val2 val3... where val1, val2, val3 = column values " " = tab character - Example: ###uri title p-value http://www.server.com/path/to/file.bed My cool track 0.00013 (with tabs instead of spaces) - Usage Track lines are optional. If no track lines are specified, the GSuite file represents an empty collection of tracks. - Description Each track is specified as a tab-separated list of metadata values, as defined by the column specification line. See the section "Column specification line" for a more detailed discussion on the allowed values. ------------------ References ------------------ [1] ENCODE Project Consortium. "An integrated encyclopedia of DNA elements in the human genome." Nature 489.7414 (2012): 57-74. [2] Kundaje, Anshul, et al. "Integrative analysis of 111 reference human epigenomes." Nature 518.7539 (2015): 317-330. [3] Gundersen, Sveinung, et al. "Identifying elemental genomic track types and representing them uniformly." BMC Bioinformatics 12.1 (2011): 1. [4] Sandve, Geir K., et al. "The Genomic HyperBrowser: inferential genomics at the sequence level." Genome biology 11.12 (2010): 1-12. [5] Uniform Resource Identifier (URI): Generic Syntax (https://tools.ietf.org/html/rfc3986) [6] Goecks, Jeremy, Anton Nekrutenko, and James Taylor. "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences." Genome Biol 11.8 (2010): R86. ------------------ Change log ------------------ v0.1 - 2015.07.06: * Initial version of the GSuite specification document. v0.2 - 2016.07.06: * Fixed typos and cleaned up text several places. Ready for initial submission og the GSuite HyperBrowser manuscript.