|-----------------------------------------|
| Specification of the GTrack file format |
|-----------------------------------------|

GTrack version: 1.0
Document version: 1.0.5
Date: 07 Apr 2012
Authors: Sveinung Gundersen, Matus Kalas, Osman Abul, Arnoldo Frigessi,
         Eivind Hovig, Geir Kjetil Sandve


----------------
    Contents
----------------

* Reading the specification
* What is GTrack?
* Example GTrack files
* Basic specification
    i.  Comments
    1.  Header lines
    2.  Column specification line
    3a. Bounding region specification line
    3b. Data lines
    -   BED compatibility
    -   Compression
    -   Detailed specification of character usage
* Extended specification
    -   Redefining column names
    -   WIG compatibility
    -   FASTA compatibility
    -   Defining GTrack subtypes
* References
* Change log



---------------------------------
    Reading the specification
---------------------------------

This document contains the complete specification of the GTrack format. As the
document contains many details, we here present some reading recommendations:

- Skip the "Developer notes" sections if you are not planning to develop parsers
  of the GTrack format.

- The "Restrictions" section after each main type of GTrack line contains
  detailed descriptions that can be skipped by most readers.

- The section "Detailed specification of character usage" contains very detailed
  information and can be skipped by most readers.

- The sections under "Extended specification" describe extensions that do not
  add any new types of information to a GTrack file, only alternative ways of
  expressing the same information, in addition to functionality for defining
  GTrack subtypes. These sections are thus not required for basic use.

An HTML version of this specification is available at [3]. In the HTML
version, the sections described above are hidden by default.


-----------------------
    What is GTrack?
-----------------------

GTrack is short for both "Genomic Track" and "Generic Track". GTrack is a
general-purpose, tabular file format for representing data in the form of
genomic tracks, that is, as elements associated with positions along a
reference (genome) sequence, or a set of sequences.

GTrack emphasizes preciseness, flexibility, and simple parsing. This is achieved
by allowing flexible column specification and declaring syntactic properties at
the beginning of the file (allowing parsers to cleanly restrict support to a
subset of the GTrack specification).

A main contribution of the format is the unified and optimized formalization of
sequence-level genomic data into one of fifteen track types, as developed in
[1]:

Points (P)
Valued Points (VP)
Segments (S)
Valued Segments (VS)
Genome Partition (GP)
Step Function (SF)
Function (F)
Linked Points (LP)
Linked Valued Points (LVP)
Linked Segments (LS)
Linked Valued Segments (LVS)
Linked Genome Partition (LGP)
Linked Step Function (LSF)
Linked Function (LF)
Linked Base Pairs (LBP)

These fifteen track types encompass most of the existing file formats, while
providing support for, among other things, genomic data of a three-dimensional
nature. The primary goals of the GTrack format are to support all track types
systematically, simplify parsing and manipulation, allow custom extensions, and
provide efficient storage.


----------------------------
    Example GTrack files
----------------------------

Before delving into the details, it is recommended that you examine these
examples of simple GTrack files. You may return to them while reading the rest
of the specification, if needed. The first example is the simplest version of
GTrack, without any specification lines. It shows a data set of a couple of
genomic segments, and the track type is simply Segments (S).


#
# GTrack example file 1
#
# A GTrack file without headers is handled as three-column BED [2]
#
chr1  121  201
chr2  486 1240


The second example contains all GTrack specification lines (header line, column
specification line and bounding region specification line) and shows a dataset
of genomic segments with additional associated information in extra columns. One
of these is selected as the main "value" of the segments, which are then of type
Valued Segments (VS). The example also shows how to add custom columns.


#
# GTrack example file 2
#
# Note: tech is a custom column and not part of the GTrack specification
#
##Track type: valued segments
###seqid  tech       start  end   value  strand
####genome=hg19
chr1      ChIP-seq   1047   1165  0.625  -
chr2      ChIP-chip  2002   2450  .      +
chr2      ChIP-chip  3033   3246  0.355  +


The third example is more advanced, showing a Step Function dataset, that is, a
dataset where every base pair in the domain has an associated value, but where
this value is constant, or approximated, over larger regions (250-500 bps). The
domain is, in this case, composed of two bounding regions. In addition, some of
the regions are linked by edges to other regions in the genome. This example
file is thus of type Linked Step Function (LSF).


#
# GTrack example file 3
#
##Track type: linked step function
##Edge weights: true
##Undirected edges: true
###id  end   value  edges

####seqid=chr1; start=1000; end=2250
1      1250  10     4=0.4
2      1500  7      .
3      2000  2      .
4      2250  6      1=0.4;6=0.3

####seqid=chr1; start=3000; end=4000
5      3250  7      .
6      3500  4      4=0.3
7      4000  6      .


(Note that, for readability, spaces are used instead of tab characters in these
example files. They will therefore not work "out of the box". All example files
are available as working GTrack files from [3].)


---------------------------
    Basic specification
---------------------------

GTrack is a tabular text file format. All GTrack filenames should end with
".gtrack". The GTrack format consists of 5 different line types, distinguished
by the leading characters and numbered here by order of appearance in the file:

  i. Comments
  1. Header lines
  2. Column specification line
  3a. Bounding region specification line
  3b. Data lines

Note: The number preceding each line type defines the order in which the lines
must be present, i.e. column specification must follow the header lines, but
comments may be present anywhere. Note that a bounding region specification line
must be followed by a data line, but that a file may have multiple bounding
region specifications with data lines in between.
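
As an illustration of how the leading characters distinguish the line types, the
following Python function is a minimal sketch of a line classifier. The function
name and the returned labels are hypothetical and not part of the GTrack
specification. The sketch assumes that a line starting with more than four "#"
characters is treated as a bounding region specification line, and that lines
containing only whitespace are blank lines; neither case is defined above.

```python
def classify_line(line):
    """Return a label for the GTrack line type of one line of text."""
    text = line.rstrip("\r\n")
    if text.strip() == "":
        return "blank"
    # Test the longest prefix first, since "####" also starts with "#".
    if text.startswith("####"):
        return "bounding region"
    if text.startswith("###"):
        return "column specification"
    if text.startswith("##"):
        return "header"
    if text.startswith("#"):
        return "comment"
    # Data lines have no leading marker.
    return "data"
```

Checking the prefixes from longest to shortest is what keeps a header line from
being mistaken for a comment.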

A GTrack validator is available at [3].

  -----------
  i. Comments
  -----------

  - Leading characters: #
  - Example

      #This is a comment!

  - Usage: Optional

  Comments are ignored by parsers and may be present anywhere in the file.


  ---------------
  1. Header lines
  ---------------

  - Leading characters: ##

  - Format

      ##VARIABLE:[ ]*VALUE

      where
        VARIABLE = Header variable name
        [ ]* = Optional space characters
        VALUE = Header variable value

  - Example

      ##gtrack version: 1.0
      ##track type: valued points
      ##value type: category
      ##1-indexed:  False
      ##end inclusive:True

  - Usage

      Optional, but any header variables not declared retain their default
      values.

  - Restrictions


  Header lines provide structural information readable by both humans and
  automatic parsers. The GTrack format defines a reserved set of header
  variables, each with a default value. If a header variable is not declared in
  the header lines, the default value is used. We encourage the use of header
  lines even when they contain default values as this adds to the clarity of the
  file and helps reduce parsing errors. The order of the header lines is
  unimportant.
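
  The following Python sketch shows one way to read header lines in the format
  ##VARIABLE:[ ]*VALUE and to fall back on default values. The names
  DEFAULT_HEADERS, parse_header_line and read_headers are hypothetical. The
  defaults are copied from the "Reserved header variables" listing. Two
  assumptions are made that this document does not state outright: variable
  names are matched case-insensitively (the example above writes "gtrack
  version" for "GTrack version"), and boolean values are accepted in any
  capitalization (the example writes both "False" and "True").

```python
import re

# Defaults from the "Reserved header variables" listing, keyed in lower case.
DEFAULT_HEADERS = {
    "gtrack version": "1.0",
    "track type": "segments",
    "value type": "number",
    "value dimension": "scalar",
    "undirected edges": False,
    "edge weights": False,
    "edge weight type": "number",
    "edge weight dimension": "scalar",
    "uninterrupted data lines": False,
    "sorted elements": False,
    "no overlapping elements": False,
    "circular elements": False,
    "1-indexed": False,
    "end inclusive": False,
}

# "##", a variable name without ":" or a leading "#", ":", optional spaces, value.
HEADER_PATTERN = re.compile(r"^##([^#:][^:]*):[ ]*(.*)$")


def parse_header_line(line):
    """Split one header line into (lower-cased variable name, value)."""
    match = HEADER_PATTERN.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a header line: %r" % line)
    name = match.group(1).strip().lower()
    value = match.group(2).strip()
    # Convert to bool only for variables whose default is a boolean.
    if isinstance(DEFAULT_HEADERS.get(name), bool):
        if value.lower() not in ("true", "false"):
            raise ValueError("expected true or false for %r" % name)
        return name, value.lower() == "true"
    return name, value


def read_headers(lines):
    """Return all header variables, using defaults for undeclared ones."""
    headers = dict(DEFAULT_HEADERS)
    for line in lines:
        name, value = parse_header_line(line)
        headers[name] = value
    return headers
```

  Since the order of the header lines is unimportant, a dictionary is a natural
  container. Unknown variable names are kept as plain strings rather than
  rejected, which is a design choice of this sketch.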

  Developer notes


  Reserved header variables
  -------------------------

  - GTrack version

      The version of the GTrack specification used for the file.

      Default value: 1.0

  - Track type*
      one of:
        points
        valued points
        segments
        valued segments
        genome partition
        step function
        function
        linked points
        linked valued points
        linked segments
        linked valued segments
        linked genome partition
        linked step function
        linked function
        linked base pairs

      Defines the track type of a GTrack file. Each track type defines a set of
      core columns to be used. See the section "Column specification line" for
      more details.

      Default value: segments

  - Value type

      one of:
        number
        binary
        character
        category

      Only used if the "value" column is defined. Defines the kind of content
      accepted in the value column. See the section "Column specification line"
      for more details.

      Default value: number

  - Value dimension

      one of:
        scalar
        pair
        vector
        list

      Only used if the "value" column is defined. Defines the dimension of the
      content accepted in the value column. See the section "Column
      specification line" for more details.

      Default value: scalar

  - Undirected edges*

      Only used if the "edges" column is defined. True if all edges specified in
      the GTrack file are undirected, else false. Note that undirected edges
      between two track elements must still be specified in both data lines,
      using the same weights.

      Default value: false

  - Edge weights*

      Only used if the "edges" column is defined. True if weights are specified
      for edges, else false. If true, every edge must have a weight
      specification; if false, no edge may specify a weight.

      Default value: false

  - Edge weight type

      one of:
        number
        binary
        character
        category

      Only used if the "edges" column is defined and the "Edge weights" header
      variable is set to true. Defines the kind of content accepted as edge
      weights. See the section "Column specification line" for more details.

      Default value: number

  - Edge weight dimension

      one of:
        scalar
        pair
        vector
        list

      Only used if the "edges" column is defined and the "Edge weights" header
      variable is set to true. Defines the dimension of the content accepted as
      edge weights. See the section "Column specification line" for more
      details.

      Default value: scalar

  - Uninterrupted data lines*

      True if it is guaranteed that the data lines are not interrupted by
      bounding region specification lines (which happens when more than one
      bounding region is specified), comments or blank lines, else false. This
      is used to help simple parsers.

      Default value: false

  - Sorted elements*

      True if it is guaranteed that all bounding regions and track elements come
      in sorted order. Bounding regions must be sorted first, and the track
      elements in each bounding region block second. Regions are sorted by the
      following fields, in ascending order (using only the ones that are
      defined): genome, seqid, start, end.

      Default value: false

  - No overlapping elements*

      Only used for tracks of type Points and Segments, and the variations of
      these, i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or
      Valued Segments (VS/LS/LVS). True if it is guaranteed that no two track
      elements overlap, else false.

      Default value: false

  - Circular elements*

      True if any track element or bounding region crosses the coordinate
      borders of a circular sequence, i.e. if the "end" value is smaller than
      the "start" value, else false.

      Default value: false

  - 1-indexed

      True if the coordinates start at 1, false if the coordinates start at 0.

      Default value: false

  - End inclusive

      True if the end coordinates should be included in intervals, else false.
      For example, if "End inclusive" is true, position 10 is included in the
      interval [0,10]; if false, the interval ends at position 9.

      Default value: false

      Developer notes

  (Note that the section "Extended specification" includes more reserved header
  variables.)

  *   Some header lines include redundant information compared to the rest of
      the file. These are marked with * in the listing above. The redundant
      header lines are still explicitly defined for several reasons. First, in
      order for a human reader to easily find out which features are used in a
      file. Second, as a way for simple parsers that only use a subset of the
      specification to check whether they can parse a particular file. Third, it
      enables automatic validation of whether a file contains the information in
      the way the author intended. These header lines can be automatically
      extracted from the rest of a GTrack file by the "Expand GTrack headers"
      tool, available at [3].
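
  To make the combined effect of the "1-indexed" and "End inclusive" header
  variables concrete, the Python sketch below converts a start/end pair to the
  GTrack default convention (0-indexed, end exclusive). The function name is
  hypothetical. The sketch assumes integer coordinates and does not handle
  circular elements, where "end" is smaller than "start".

```python
def to_zero_based_half_open(start, end, one_indexed=False, end_inclusive=False):
    """Convert (start, end) to 0-indexed coordinates with an exclusive end."""
    # Shift both coordinates down by one if the file counts from 1.
    offset = 1 if one_indexed else 0
    zero_start = start - offset
    # An inclusive end names the last covered position, so the exclusive
    # end lies one position further along.
    zero_end = end - offset + (1 if end_inclusive else 0)
    return zero_start, zero_end
```

  With both variables at their default value of false, the coordinates are
  returned unchanged. With "End inclusive" set to true, the interval [0,10]
  becomes (0, 11), which covers positions 0 through 10.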


  ----------------------------
  2. Column specification line
  ----------------------------

  - Leading characters: ###

  - Format

      ###COL1  COL2  COL3...

      where
        COL1, COL2, COL3 = Column names
        "  " = tab character

  - Example

      ###genome  seqid  start  end  strand  geneId  score  id  edges
      (with tabs instead of spaces)

  - Default value

      ###seqid  start  end
      (with tabs instead of spaces)

  - Usage

      Optional, but if not defined, retains the default value.

  - Restrictions


  The column specification line is a tab-separated list of column names.

  The GTrack specification defines a set of eight reserved column names. Four of
  these are associated with the four core informational properties: gaps,
  lengths, values and interconnections. The specific set of core columns present
  defines the track type (see [1] for more details). The GTrack format also
  defines four other reserved columns that, although they do not define track
  type, have reserved meanings. The associations between the reserved columns
  and track types are shown in the following table:


                    Column name:  genome seqid start end value strand id edges
                 Type of column:     N     N     C    C    C      N    N   C
 Track type:
  Points                   (P)       ?     !     X    .    .      ?    ?   .
  Segments                 (S)       ?     !     X    X    .      ?    ?   .
  Genome Partition         (GP)      ?     !     .    X    .      ?    ?   .

  Valued Points            (VP)      ?     !     X    .    X      ?    ?   .
  Valued Segments          (VS)      ?     !     X    X    X      ?    ?   .
  Step Function            (SF)      ?     !     .    X    X      ?    ?   .
  Function                 (F)       ?     !     .    .    X      ?    ?   .

  Linked Points            (LP)      ?     !     X    .    .      ?    X   X
  Linked Segments          (LS)      ?     !     X    X    .      ?    X   X
  Linked Genome Partition  (LGP)     ?     !     .    X    .      ?    X   X

  Linked Valued Points     (LVP)     ?     !     X    .    X      ?    X   X
  Linked Valued Segments   (LVS)     ?     !     X    X    X      ?    X   X
  Linked Step Function     (LSF)     ?     !     .    X    X      ?    X   X
  Linked Function          (LF)      ?     !     .    .    X      ?    X   X

  Linked Base Pairs        (LBP)     ?     !     .    .    .      ?    X   X

  C - Core reserved column (defines track type)
  N - Non-core reserved column (reserved, but does not define track type)
  X - Column is mandatory
  ? - Column is optional
  . - Column is not allowed
  ! - Property must be present, either as a column or in a bounding region
      specification (see below)

  Table 1: Overview of the eight reserved columns in the GTrack format and their
           associations to track type.
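
  Table 1 can be read as a lookup from the set of core columns present to the
  track type. The Python sketch below encodes that lookup. The names TRACK_TYPES
  and infer_track_type are hypothetical. The sketch assumes that the reserved
  columns appear under their standard lower-case names; it does not take into
  account the "Value column" and "Edges column" header variables of the extended
  specification.

```python
CORE_COLUMNS = ("start", "end", "value", "edges")

# One entry per row of Table 1, keyed by the set of core columns present.
TRACK_TYPES = {
    frozenset(["start"]): "points",
    frozenset(["start", "end"]): "segments",
    frozenset(["end"]): "genome partition",
    frozenset(["start", "value"]): "valued points",
    frozenset(["start", "end", "value"]): "valued segments",
    frozenset(["end", "value"]): "step function",
    frozenset(["value"]): "function",
    frozenset(["start", "edges"]): "linked points",
    frozenset(["start", "end", "edges"]): "linked segments",
    frozenset(["end", "edges"]): "linked genome partition",
    frozenset(["start", "value", "edges"]): "linked valued points",
    frozenset(["start", "end", "value", "edges"]): "linked valued segments",
    frozenset(["end", "value", "edges"]): "linked step function",
    frozenset(["value", "edges"]): "linked function",
    frozenset(["edges"]): "linked base pairs",
}


def infer_track_type(columns):
    """Return the track type implied by a list of column names."""
    core = frozenset(name for name in columns if name in CORE_COLUMNS)
    if core not in TRACK_TYPES:
        raise ValueError("no core columns present, so no track type applies")
    # Table 1 marks "id" as mandatory for every linked track type.
    if "edges" in core and "id" not in columns:
        raise ValueError("the 'edges' column requires an 'id' column")
    return TRACK_TYPES[core]
```

  Four core columns give sixteen combinations; the empty combination is the only
  one without a track type, which is why the table has fifteen rows. Custom
  columns such as "tech" in example file 2 are ignored by the lookup.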


  Reserved columns
  ----------------

  - genome

      The genome assembly of the track element (e.g. hg19, mm9). The GTrack
      format has no explicit requirements on the syntax or semantics of the
      genome specification; the interpretation is up to the particular
      parsers/tools. Elements from different genomes are allowed in the same
      GTrack file.

      Specifying the genome of a track element is optional. The genome may be
      specified either as a separate column in the data lines, or in a preceding
      bounding region specification line (see below), or both. If genome is
      specified both in a bounding region specification and as a column, the
      values must be equal.

  - seqid

      A sequence identifier, i.e. an identifier of the underlying sequence of
      the particular track element. Usually defined as chromosome (e.g. chr3,
      chrY, chr2_random) or scaffold (e.g. scaffold10671), as defined in the
      genome assembly. As with the "genome" column, the GTrack format has no
      explicit requirements on the syntax or semantics of the "seqid" column;
      the interpretation is up to the particular parsers/tools. Some parsers may
      for instance allow chromosome arms (e.g. chr1p) as seqid.

      All track elements in a GTrack file must have a seqid, either as a
      separate column in the data lines, or in a preceding bounding region
      specification line (see below), or both. If seqid is specified both in a
      bounding region specification and as a column, the values must be equal.

  - start

      The start position of the track element, using the indexing system defined
      in the header (0- or 1-based).

      Developer notes

  - end

      The end position of the track element, using the indexing system (0- or
      1-based) and "End inclusive" property as defined in the header.

      Developer notes

  - strand

      The strand of the track element. "+" for positive, "-" for negative
      strand, and "." when strand information is missing or irrelevant.

  - value

      The value or score of the track element. The character "." denotes that
      the track element has a missing value. The basic type of the contents
      is determined by the "Value type" header variable as follows:

        number

          One floating point number, e.g. -1.23, 12 or 3.1e-4. English decimal
          notation is used, including scientific e notation, with the period
          character representing the decimal separator, but with no spacing.
          Note that integer numbers are a subset of floating point numbers, and
          should use "number" as the value type.

        binary

          One binary value. If this value is used to denote case and control,
          the following notation must be used: 1 for case, 0 for control.

        character

          One ASCII character, e.g. A, T, C. See the section "Detailed
          specification of character usage" for restrictions.

        category

          A string defining a category. The set of all category values over all
          track elements form a category set, e.g: {gene, exon, promoter}. See
          the section "Detailed specification of character usage" for
          restrictions.

      In addition, the "Value dimension" header variable may define that the
      value contains more than one instance of the basic value type, as follows:

        list

          A list of values, following the basic type defined in the "Value type"
          header variable. Lists of numbers and categories are delimited by
          comma, e.g. 1.23,2.34,3.45,4,5 or exon,gene,CDS,gene. Lists of binary
          values and characters use no delimiter, e.g. 1011011010 or ATGCTCGACG.
          Lists that combine different basic types are not allowed. The length
          of lists may vary between track elements.

          The missing element character, ".", is allowed as list values, and a
          single missing element character denotes a zero-length list.

        vector

          A vector of values, similarly defined as a list, with the only
          difference that vectors must have the same length throughout the
          GTrack file.

          The missing element character, ".", is allowed as vector values, but a
          single missing element character for the entire vector is not allowed.
          A vector with 3 missing numbers should thus be denoted ".,.,.".

        pair

          A pair of values, similarly defined as a vector, with the only
          limitation that the length is exactly 2.

        scalar

          A single value, following the basic type defined in the "Value type"
          header variable, e.g. 1.23, 0, g or exon, respectively.

      Developer notes

  - id

      A unique string identifying each track element (data line). Can be in any
      format, e.g. 1, aab or uc002ico.1. See the section "Detailed specification
      of character usage" for restrictions.

  - edges

      A semicolon-separated list of ids, representing edges from the track
      element in the current line to the track elements which the ids identify.
      A "." character denotes that the track element has no edges. An edge is by
      default directed.

      If the header variable "Edge weights" is set to true, each edge must have
      a weight value directly following, after an equals sign. The format of the
      weight value follows the "Edge weight type" and the "Edge weight
      dimension" header variables in the same way as the "value" format follows
      the "Value type" and "Value dimension" header variables (see above). Note
      that no space characters are allowed after the semicolon.

      Example:

      ###seqid  start  end  id   edges
      chr1      0      100  aaa  aab=1.2;aac=.
      chr1      200    350  aab  aaa=1.1
      chr1      450    500  aac  .

      Here, the aaa node is connected to the aab node with two directed edges,
      with the edge from aaa to aab having higher weight than the one in the
      other direction. Note that undirected edges must still be specified in
      both directions, using the same weights. This adds redundancy, but
      simplifies parsing. If all edges in a GTrack file are undirected, the
      header variable "Undirected edges" should be set to true.
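
  The rules for the "value" and "edges" columns can be sketched in Python as
  shown below. The function names parse_basic, parse_value and parse_edges are
  hypothetical. The sketch assumes that binary values are written as 0 or 1 and
  that ids contain neither "=" nor ";" (the section "Detailed specification of
  character usage" governs which characters are actually allowed). It does not
  check that vectors keep the same length throughout a file, since that requires
  looking at more than one data line.

```python
def parse_basic(token, value_type="number"):
    """Parse one basic value; the missing value character gives None."""
    if token == ".":
        return None
    if value_type == "number":
        return float(token)
    if value_type == "binary":
        if token not in ("0", "1"):
            raise ValueError("binary values are assumed to be 0 or 1")
        return int(token)
    if value_type in ("character", "category"):
        return token
    raise ValueError("unknown value type: %r" % value_type)


def parse_value(text, value_type="number", dimension="scalar"):
    """Parse the contents of a value cell according to type and dimension."""
    if dimension == "scalar":
        return parse_basic(text, value_type)
    # A single missing value character denotes a zero-length list.
    if dimension == "list" and text == ".":
        return []
    if value_type in ("number", "category"):
        tokens = text.split(",")   # comma-delimited
    else:
        tokens = list(text)        # binary and character: no delimiter
    if dimension == "pair" and len(tokens) != 2:
        raise ValueError("a pair must contain exactly 2 values")
    return [parse_basic(token, value_type) for token in tokens]


def parse_edges(text, weighted=False, weight_type="number",
                weight_dimension="scalar"):
    """Parse an edges cell into a dict mapping target id to weight."""
    if text == ".":
        return {}
    edges = {}
    for item in text.split(";"):
        if weighted:
            target, sep, weight = item.partition("=")
            if not sep:
                raise ValueError("missing weight for edge %r" % item)
            edges[target] = parse_value(weight, weight_type, weight_dimension)
        else:
            edges[item] = None
    return edges
```

  Applied to the first data line of the example above, parse_edges with
  weighted=True maps aab to the weight 1.2 and aac to a missing weight.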


  --------------------------------------
  3a. Bounding region specification line
  --------------------------------------

  - Leading characters: ####

  - Format

      Type A) ####genome=VAL1

        or

      Type B) ####[genome=VAL1;[ ]*]seqid=VAL2[;[ ]*start=VAL3][;[ ]*end=VAL4]

      where
        [x] = "x" is optional
        [ ]* means optional space characters
        genome, seqid, start, end = reserved attribute names
        VAL1, VAL2, VAL3, VAL4 = attribute values

  - Example

      ####genome=hg18; seqid=chr1;start=100;  end=10000

  - Usage

      Type B is mandatory for GTrack files of one of the following track types:
        Genome Partition (GP)
        Step Function (SF)
        Function (F)
        Linked Genome Partition (LGP)
        Linked Step Function (LSF)
        Linked Function (LF)
        Linked Base Pairs (LBP)

      For all other track types, bounding region specification lines are
      optional.

  - Restrictions


  A bounding region specifies a genomic interval encompassing the data lines
  that follow. A bounding region should be thought of as constituting the domain
  of the following track elements, i.e. the region where we have information
  about the properties modeled by the track elements. The set of all bounding
  regions of a track then constitutes the domain of the track.

  Note that, in the case of Points and Segments (and the variations of these,
  i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or Valued Segments
  (VS/LS/LVS), see Table 1), lack of elements is also considered information. A
  bounding region is then, in this case, a region where we know that the lack of
  data means something. Areas of the genome that have not been investigated
  (such as centromeres) should be left outside the bounding regions. For track
  types other than Points and Segments (and their variations), the track
  elements do by definition fill the entire domain. For example, a Function has,
  by definition, a value for all base pairs in the domain. A bounding region is
  then just the smallest region encompassing the track elements that follow. For
  more details, see [1].

  The bounding region specification comes in two flavors:

  A)
      The bounding region specifies the genome assembly for the following track
      elements, using the same format as for the "genome" column (see the
      "Column specification line" section). The domain of the track is then the
      set of sequences constituting the genome, e.g. all chromosomes of the
      genome. If a track contains several genomes, the domain of the track is
      the collected set of sequences constituting all the specified genomes.

  B)
      The bounding region specifies a single sequence, or part of this sequence,
      as the domain of the following track elements. The format is a set of
      attribute pairs separated by semicolon and optional space characters. For
      each attribute pair, the attribute name and the value are separated by the
      equals sign. The attributes may appear in any order. The allowed
      attributes are the following:

      - genome

          The genome assembly of the bounding region (e.g. hg19, mm9). The
          format of the genome attribute is the same as for the "genome" column
          (see the section "Column specification line"). The "genome" attribute
          is optional.

      - seqid

          A sequence id, i.e. the id of the underlying sequence of the bounding
          region. The format of the seqid attribute is the same as for the
          "seqid" column (see the section "Column specification line"). The
          "seqid" attribute is mandatory for a bounding region specification
          line of type B.

          Note that if type B bounding region specifications are not defined,
          the "seqid" column must be included in the column specification line.

      - start

          The start position of the bounding region, using the indexing system
          defined in the header (0- or 1-based). The "start" attribute is
          optional.

          Developer notes

      - end

          The end position of the bounding region, using the indexing system (0-
          or 1-based) and "End inclusive" property as defined in the header. The
          "end" attribute is optional.

          Developer notes
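
  As a sketch of how the two formats given above can be parsed, the Python
  function below splits a bounding region specification line into its
  attributes. The function name and the returned dictionary layout are
  hypothetical. The sketch assumes that "start" and "end" are integers and that
  attribute values contain neither ";" nor "=".

```python
ALLOWED_ATTRIBUTES = ("genome", "seqid", "start", "end")


def parse_bounding_region(line):
    """Parse a '####' line into a dict of the attributes it declares."""
    text = line.rstrip("\r\n")
    if not text.startswith("####"):
        raise ValueError("not a bounding region specification line")
    region = {}
    for pair in text[4:].split(";"):
        # Space characters are optional after each semicolon.
        name, sep, value = pair.lstrip(" ").partition("=")
        if not sep or name not in ALLOWED_ATTRIBUTES:
            raise ValueError("invalid attribute pair: %r" % pair)
        if name in region:
            raise ValueError("attribute given twice: %r" % name)
        region[name] = int(value) if name in ("start", "end") else value
    # Type A consists of "genome" alone; every other form is type B,
    # for which "seqid" is mandatory.
    if "seqid" not in region and set(region) != {"genome"}:
        raise ValueError("type B requires the 'seqid' attribute")
    return region
```

  Because the attributes may appear in any order, the pairs are collected into a
  dictionary first and the type A/type B distinction is checked afterwards.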


  --------------
  3b. Data lines
  --------------

  - Leading characters: (none)

  - Format

      VAL1  VAL2  VAL3...

      where
        VAL1, VAL2, VAL3 = column values
        "  " = tab character

  - Example

      chr21  304  997  -  FOOGENE  423  1  .
      (with tabs instead of spaces)

  - Usage

      Data lines are optional.

  - Restrictions


  Each data line is a tab-separated list of values, as defined by the column
  specification line. If there is a missing value in either of the "value" and
  "edges" columns, the period character, ".", may be used. See the section
  "Column specification line" for more details.
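
  The relation between the column specification line and the data lines can be
  sketched in Python as follows. The function names are hypothetical. The sketch
  assumes integer "start" and "end" values and leaves all other cells, including
  the missing value character, as strings for later interpretation.

```python
# Used when a file has no column specification line.
DEFAULT_COLUMNS = ("seqid", "start", "end")


def parse_column_spec(line):
    """Return the list of column names from a '###' line."""
    text = line.rstrip("\r\n")
    if not text.startswith("###") or text.startswith("####"):
        raise ValueError("not a column specification line")
    return text[3:].split("\t")


def parse_data_line(line, columns=DEFAULT_COLUMNS):
    """Return a dict mapping each column name to the cell on this line."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(columns):
        raise ValueError("expected %d columns, found %d"
                         % (len(columns), len(values)))
    record = dict(zip(columns, values))
    for name in ("start", "end"):
        if name in record:
            record[name] = int(record[name])
    return record
```

  With the default columns, a line of a simple three-column BED file is parsed
  directly. Custom columns, such as "tech" in example file 2, end up in the
  record under their own name.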


  -----------------
  BED compatibility
  -----------------

  Note that a simple BED file only using the three columns chr, start and end is
  directly compatible with the GTrack format. This is because the default track
  type of a GTrack file is Segments (S), which defines the same three core
  columns as a simple BED file (see Table 1). It is thus sufficient to change
  the file ending of such a file from ".bed" to ".gtrack" before running it
  through a GTrack parser. If a UCSC custom track definition line or other
  headers are present, they must be commented out. More complex BED files must
  be converted.
  Converters to common file formats are available at [3].


  -----------
  Compression
  -----------

  As genomic tracks may contain large amounts of data, we require that fully
  compliant GTrack parsers support the expansion of tabular files compressed
  with the gzip compression algorithm [4]. Such GTrack files should have the
  suffix ".gtrack.gz".
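
  A parser can support both plain and compressed files by choosing how to open a
  file from its suffix. The Python sketch below uses the gzip module of the
  standard library. The function name open_gtrack is hypothetical, and the
  choice of UTF-8 as text encoding is an assumption of this sketch, not
  something stated in this section.

```python
import gzip


def open_gtrack(path):
    """Open a .gtrack or .gtrack.gz file for reading as text."""
    if str(path).endswith(".gtrack.gz"):
        # Mode "rt" makes gzip decompress and decode to text on the fly.
        return gzip.open(path, mode="rt", encoding="utf-8")
    return open(path, mode="r", encoding="utf-8")
```

  Both branches return a file object that yields lines of text, so the rest of a
  parser does not need to know whether the input was compressed.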


  -----------------------------------------
  Detailed specification of character usage
  -----------------------------------------

------------------------------
    Extended specification
------------------------------

  The extended part of the GTrack specification consists of the following header
  variables:

    - Value column
    - Edges column
    - Fixed length
    - Fixed gap size
    - Fixed-size data lines
    - Data line size
    - GTrack subtype
    - Subtype version
    - Subtype URL
    - Subtype adherence

  These header variables are redundant compared to the basic GTrack
  specification, that is, they do not allow any extra types of information to be
  represented. They do, however, allow existing information to be represented in
  more practical ways, in addition to supporting standardized ways of extending
  the GTrack format by defining GTrack subtypes.


  -----------------------
  Redefining column names
  -----------------------

  -----------------
  WIG compatibility
  -----------------

  -------------------
  FASTA compatibility
  -------------------

  ------------------------
  Defining GTrack subtypes
  ------------------------

------------------
    References
------------------

[1] Gundersen S, Kalas M, Abul O, Frigessi A, Hovig E, Sandve GK: Identifying
    elemental genomic track types and representing them uniformly. BMC
    Bioinformatics 2011, 12:494.
[2] http://genome.ucsc.edu/FAQ/FAQformat.html
[3] http://www.gtrack.no
[4] http://www.gzip.org
[5] http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml
[6] http://genome.ucsc.edu/goldenPath/help/wiggle.html


------------------
    Change log
------------------

v1.0.6 - 2015.05.16:

    * Starting or ending a value with whitespace is now allowed, but advised
      against

v1.0.5 - 2012.07.04:

    * Clarified the start position of the first element in a bounding region
      when using "fixed gap size".

v1.0.4 - 2012.03.22:

    * Small clarification of the use of the missing value character in vectors
      and lists.

v1.0.3 - 2012.01.16:

    * Clarified the use of the missing value character in vectors and lists.

v1.0.2 - 2012.01.09:

    * Fixed typo in the explanation of the "value column" and "edges column"
      header variables.

v1.0.1 - 2011.12.30:

    * Rephrased the explanation of the "End inclusive" header variable.
    * Updated citation.
    * Fixed some quotation marks and capitalization issues.

v1.0 - 2011.12.23:

    * First public version, included as "Additional file 1" in [1].