|-----------------------------------------|
| Specification of the GTrack file format |
|-----------------------------------------|

Version: 1.0b2
Date: 02 Sep 2011
Authors: Sveinung Gundersen, Matus Kalas, Osman Abul, Arnoldo Frigessi,
         Eivind Hovig, Geir Kjetil Sandve


----------------
    Contents
----------------

* Reading the specification
* What is GTrack?
* Example GTrack files
* Basic specification
    x.  Comments
    1.  Header lines
    2.  Column specification line
    3a. Bounding region specification line
    3b. Data lines
    -   BED compatibility
    -   Detailed specification of character usage
* GTrack subtypes
    -   Example #1: FASTA
    -   GTrack subtype header lines
    -   Example #2: Short reads
* References


---------------------------------
    Reading the specification
---------------------------------

This document contains the complete specification of the GTrack format. As the
document contains many details, we here present some reading recommendations:

- Skip the "Developer notes" sections if you are not planning to develop parsers
  of the GTrack format.

- The "Restrictions" section after each main type of GTrack line contains
  detailed descriptions that can be skipped in the first read-through.

- The section "Detailed specification of character usage" contains very detailed
  information and is not generally required reading for basic use.

- All information about GTrack subtyping, i.e. extra header lines for subtyping
  purposes, is collected in a separate section at the end of the specification.
  This gives a better overview of the basic functionality in the main sections,
  which the reader should understand before delving into the subtyping
  functionality.


-----------------------
    What is GTrack?
-----------------------

GTrack is shorthand for Genome Track. The GTrack file format is a general
purpose file format for genome annotations. The main purpose of the format is
the unified and optimised formalization of sequence level genome data into one
of fifteen main track formats, as developed in [1]:

Points (P)
Valued Points (VP)
Segments (S)
Valued Segments (VS)
Genome Partition (GP)
Step Function (SF)
Function (F)
Linked Points (LP)
Linked Valued Points (LVP)
Linked Segments (LS)
Linked Valued Segments (LVS)
Linked Genome Partition (LGP)
Linked Step Function (LSF)
Linked Function (LF)
Linked Base Pairs (LBP)

These fifteen track types encompass most of the existing track types, while
providing support for, among other things, genomic data of a three-dimensional
nature. The primary goals of the GTrack format are to support all track types
systematically, simplify parsing and manipulation, allow custom extensions, and
provide efficient storage. 


----------------------------
    Example GTrack files
----------------------------

Before delving into the details, it is recommended that you examine these
examples of simple GTrack files. You may return to them while reading the rest
of the specification, if needed. The first example is the simplest version of
GTrack, without any specification lines. It shows a data set of a couple of
genomic segments, and the track type is simply Segments (S).


#
# GTrack example file 1
#
# A GTrack file without headers is handled as a three-column BED file [2]
#
chr1  121  201
chr2  486 1240


The second example contains all GTrack specification lines (header line, column
specification line and bounding region specification line) and shows a dataset
of genomic segments with additional associated information in extra columns. One
of these is selected as the main "value" of the segments, which are then of type
Valued Segments (VS). The example also shows how to add custom columns.


#
# GTrack example file 2
#
# Note: tech is a custom column and not part of the GTrack specification
#
##Track type: valued segments
###seqid  tech       start  end   value  strand
####genome=hg19
chr1      ChIP-seq   1047   1165  0.625  -
chr2      ChIP-chip  2002   2450  .      +
chr2      ChIP-chip  3033   3246  0.355  +


The third example is more advanced, showing a Step Function dataset, that is a
dataset where every base pair in the domain has an associated value, but where
this value is constant, or approximated, over larger regions (250-500 bps). The
domain is, in this case, composed of two bounding regions. In addition, some of
the regions are linked by edges to other regions in the genome. This example
file is thus of type Linked Step Function (LSF).


#
# GTrack example file 3
#
##Track type: linked step function
##Undirected edges: true
###id  end   value  edges

####seqid=chr1; start=1000; end=2250
1      1250  10     4=0.4
2      1500  7      .
3      2000  2      .
4      2250  6      1=0.4;6=0.3

####seqid=chr1; start=3000; end=4000
5      3250  7      .
6      3500  4      4=0.3
7      4000  6      .


(Note that, for readability, spaces are used instead of tab characters in
these example files. They will therefore not work "out of the box".)


---------------------------
    Basic Specification
---------------------------

GTrack is a tabular text file format. All files in the GTrack format should end
their names with ".gtrack". The GTrack format consists of 5 different line
types, distinguished by the leading characters:

    x. Comments
    1. Header lines
    2. Column specification line
    3a. Bounding region specification line
    3b. Data lines

Note: The number preceding each line type defines the order in which the lines
must be present, i.e. column specification must follow the header lines, but
comments may be present anywhere. Note that a bounding region specification line
must be followed by a data line, but that a file may have multiple bounding
region specifications with data lines in between.
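
As the line types are distinguished only by their leading characters, a parser
can classify each line by counting leading "#" characters. The following is a
minimal sketch in Python, not part of the specification; the function name
classify_line and the returned labels are hypothetical, and treating five or
more "#" characters as an error is an assumption.

```python
LINE_TYPES = {
    0: "data",
    1: "comment",
    2: "header",
    3: "column specification",
    4: "bounding region specification",
}


def classify_line(line):
    """Classify one line of a GTrack file by its leading '#' characters."""
    text = line.rstrip("\r\n")
    if text.strip() == "":
        return "blank"  # blank lines are ignored by parsers
    hashes = len(text) - len(text.lstrip("#"))
    if hashes not in LINE_TYPES:
        # Assumption: five or more '#' characters are treated as an error.
        raise ValueError("unexpected number of leading '#': %d" % hashes)
    return LINE_TYPES[hashes]
```

A parser could use the returned label to dispatch each line to the appropriate
handler and to check that the line types appear in the required order.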


-----------
x. Comments
-----------

- Leading characters: #
- Example

    #This is a comment!
    
- Usage: Optional

Comments are ignored by parsers and may be present anywhere in the file. 


---------------
1. Header lines
---------------

- Leading characters: ##

- Format

    ##VARIABLE:[ ]VALUE

    where
        VARIABLE = Header variable name
        [ ] = Optional space character
        VALUE = Header variable value

- Example

    ##gtrack version: 1.0
    ##track type: valued points
    ##value type: category
    ##O-indexed: False
    ##end-inclusive: True
    
- Usage

    Optional, but any header variables not declared take their default values.

- Restrictions

    All variable names and reserved variable values are treated as case
    insensitive and do not support character escaping. Custom values, i.e.
    header values defined in GTrack subtypes, do, however, support escaping. For
    more details, see the section "Detailed specification of character usage".
    
    Values are restricted to the ones allowed by the header variable (see
    below).

Header lines provide structural information readable by both humans and
automatic parsers. The GTrack format defines a reserved set of header variables,
each with a default value. If a header variable is not declared in the header
lines, the default value is used. We encourage the use of header lines even when
they contain default values as this adds to the clarity of the file and helps
reduce parsing errors. The order of the header lines is unimportant.
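
The header line format and the fallback to default values can be sketched as
follows. This is an illustrative Python sketch only: the names
parse_header_line, read_headers and DEFAULT_HEADERS are hypothetical, only a
few of the reserved header variables are listed, and lower-casing every value
is a simplification that is not appropriate for custom (subtype) values.

```python
# Assumption: only a small subset of the reserved header variables is listed.
DEFAULT_HEADERS = {
    "gtrack version": "1.0",
    "track type": "segments",
    "value type": "number",
    "end-inclusive": "false",
}


def parse_header_line(line):
    """Split '##VARIABLE:[ ]VALUE' into (lower-cased name, raw value)."""
    text = line.rstrip("\r\n")
    if not text.startswith("##") or text.startswith("###"):
        raise ValueError("not a header line: %r" % text)
    name, colon, value = text[2:].partition(":")
    if not colon:
        raise ValueError("missing ':' in header line: %r" % text)
    if value.startswith(" "):
        value = value[1:]  # a single optional space may follow the colon
    return name.lower(), value


def read_headers(lines):
    """Start from the defaults and override them with the declared headers."""
    headers = dict(DEFAULT_HEADERS)
    for line in lines:
        name, value = parse_header_line(line)
        headers[name] = value.lower()  # simplification, see lead-in
    return headers
```

Since variable names are case insensitive, "##Track type: ..." and
"##track type: ..." end up under the same key.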

Developer notes
---------------
As not all parsers/tools will have the need to support the full
GTrack specification, developers are welcome to support only subsets. We do,
however, encourage all GTrack parsers to always check the GTrack header lines
and give feedback to the user if a particular feature is unsupported by the
parser/tool.
---------------


Reserved header variables
-------------------------

- GTrack version

    The version of the GTrack specification used for the file.
    
    Default value: 1.0

- Track type*
    one of:
        points
        valued points
        segments
        valued segments
        genome partition
        step function
        function
        linked points
        linked valued points
        linked segments
        linked valued segments
        linked genome partition
        linked step function
        linked function
        linked base pairs
        
    Defines the track type of a GTrack file. Each track type defines a set of
    core columns to be used. See the Column specification section for more
    details.
    
    Default value: segments

- Value type*

    one of:
        number
        category
        case-control
        number vector
    
    Only used if the "value" column is defined. Defines the kind of content
    accepted in the value column. See the Column specification section for more
    details.
    
    Default value: number

- Vector length*

    Only used if the "value" column is defined and number vector is defined as
    the value type. Defines the maximal length of the number vector of the
    "value" column. Must be 2 or longer.
   
    Default value: 2

- Edge weight type*

    one of:
        number
        category
        case-control
        number vector
    
    Only used if the "edges" column is defined. Defines the kind of content
    accepted as edge weights. See the Column specification section for more
    details.
    
    Default value: number

- Edge weight vector length*

    Only used if the "edges" column is defined and number vector is defined as
    the edge weight value type. Defines the maximal length of the number vector
    of the edge weights. Must be 2 or longer.
   
    Default value: 2

- Multiple bounding regions*

    True if the file defines more than one bounding region, else False. This
    alerts parsers that multiple bounding regions may appear among the data
    lines.
    
    Default value: false

- Overlapping elements*

    True if any two track elements overlap, else false. Only tracks of type
    Points and Segments, and the variations of these, i.e. Linked and/or Valued
    Points (VP/LP/LVP) and Linked and/or Valued Segments (VS/LS/LVS), are allowed
    to overlap.

    Default: true
    
- Circular elements*

    True if any track element crosses the coordinate borders of a circular
    sequence, i.e. if the "end" value is smaller than the "start" value.
    
    Default: false
    
- Undirected edges*

    True if all edges specified in the GTrack file are undirected, else False.
    Note that undirected edges between two track elements must still be
    specified in both data lines, using the same weights.
    
    Default: false

- Fixed-size data lines

    True if each data line has a fixed size in terms of number of
    characters. This is only allowed for track type Function (F), and only if
    the only column specified is "value". Newline and carriage return characters
    are ignored when parsing, and the data lines are separated by the number of
    characters specified in the header variable "Data line size" (below).
    
    This header is used to support FASTA-like sequences, and may also be used to
    create function tracks of data such as GC-content, in a condensed manner.
    
    See section "GTrack subtypes" for an example.
    
    Developer notes
    ---------------
    Note that parsers still need to be able to recognize bounding region
    specification lines.
    ---------------
    
    Default: false
    
- Data line size

    The size of each data line in terms of number of characters. Is only used if
    the header variable "Fixed-size data lines" (above) is true.
    
    Default: 1

- O-indexed

    True if the coordinates start at 0, false if the coordinates start at 1.
    
    Default value: true

- End-inclusive

    True if the chromosome coordinate specified in the end column is included in
    the interval, else false.
    
    Default value: false

(Note that the section "GTrack subtypes" includes some more reserved header
variables.)

*   Some header lines include redundant information with regard to the rest of
    the file. These are marked with * in the listing above. The redundant header
    lines are still explicitly defined for several reasons. First, in order for
    a human reader to easily find out which features are used in a file. Second,
    as a way for simple parsers that only use a subset of the specification to
    check whether they can parse a particular file. Third, it enables automatic
    validation of whether a file contains the information in the way the author
    intended. These header lines can be automatically extracted from the rest of
    a GTrack file by the GTrack Header Expander tool, available at [3].

    Developer notes
    ---------------
    Following the guidelines of defensive programming, we recommend that parsers
    check that these header lines correspond to the contents in the data lines
    and give the users feedback if there are inconsistencies.
    ---------------


----------------------------
2. Column specification line
----------------------------

- Leading characters: ###

- Format

    ###COL1  COL2  COL3...

    where
        COL1, COL2, COL3 = Column names
        "  " = tab character

- Example

    ###genome  seqid  start  end  strand  geneId  score  id  edges
    (with tabs instead of spaces)

- Default value
    
    ###seqid  start  end
    (with tabs instead of spaces)

- Usage

    Optional, but if not defined, retains the default value.

- Restrictions

    Column names are treated as case insensitive and do not support character
    escaping. For more details, see the section "Detailed specification of
    character usage".


A tab-separated list of column names.

The GTrack specification defines a set of eight reserved column names. Four of
these are associated with the four core informational properties: position,
length, value and edges. The specific set of core columns present defines the
track type (see [1] for more details). The GTrack format also defines 4 reserved
columns that, although they do not define track type, have reserved meanings.
The associations between the reserved columns and track types are shown in the
following table:


                    Column name:  genome seqid start end value strand id edges 
                 Type of column:     N     N     C    C    C      N    N   C   
 Track type:
  Points                   (P)       ?     !     X    .    .      ?    ?   .    
  Segments                 (S)       ?     !     X    X    .      ?    ?   .    
  Genome Partition         (GP)      ?     !     .    X    .      ?    ?   .   

  Valued Points            (VP)      ?     !     X    .    X      ?    ?   .   
  Valued Segments          (VS)      ?     !     X    X    X      ?    ?   .   
  Step Function            (SF)      ?     !     .    X    X      ?    ?   .   
  Function                 (F)       ?     !     .    .    X      ?    ?   .   

  Linked Points            (LP)      ?     !     X    .    .      ?    X   X   
  Linked Segments          (LS)      ?     !     X    X    .      ?    X   X   
  Linked Genome Partition  (LGP)     ?     !     .    X    .      ?    X   X   

  Linked Valued Points     (LVP)     ?     !     X    .    X      ?    X   X   
  Linked Valued Segments   (LVS)     ?     !     X    X    X      ?    X   X   
  Linked Step Function     (LSF)     ?     !     .    X    X      ?    X   X   
  Linked Function          (LF)      ?     !     .    .    X      ?    X   X   

  Linked Base Pairs        (LBP)     ?     !     .    .    .      ?    X   X   

  C - Core reserved columns (defines track type)
  N - Non-core reserved columns (reserved, but do not define track type)
  X - Column mandatory
  ? - Column optional
  . - Column not allowed
  ! - Property must be present, either as a column or in a bounding region
      specification (see below)
    
Table 1: Overview of the eight reserved columns in the GTrack format and their
         associations to track type.

(Note that the GTrack Header Expander tool, available at [3], may be used to
fill out a default column specification line based on track type. The default
column specification line is then the mandatory columns defined in Table 1, in
the same order. In that case, the GTrack file needs to include the "Track type"
header variable.)
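
Table 1 implies that the track type can be derived from which of the four core
columns are present. The following Python sketch illustrates this; the function
name infer_track_type and the lookup table BASE_TYPES are hypothetical, and the
sketch does not check the seqid requirement marked with "!" in the table.

```python
CORE_COLUMNS = ("start", "end", "value", "edges")

# (start present, end present, value present) -> unlinked track type
BASE_TYPES = {
    (True, False, False): "points",
    (True, True, False): "segments",
    (False, True, False): "genome partition",
    (True, False, True): "valued points",
    (True, True, True): "valued segments",
    (False, True, True): "step function",
    (False, False, True): "function",
}


def infer_track_type(columns):
    """Derive the track type from a list of column names (see Table 1)."""
    present = {name.lower() for name in columns}
    start, end, value, edges = (c in present for c in CORE_COLUMNS)
    if edges and "id" not in present:
        raise ValueError("linked track types require the 'id' column")
    base = BASE_TYPES.get((start, end, value))
    if base is None:
        # No start, end or value column: only Linked Base Pairs is possible.
        if edges:
            return "linked base pairs"
        raise ValueError("no core columns present")
    return "linked " + base if edges else base
```

For instance, the column specification line of example file 3 (id, end, value,
edges) maps to "linked step function", matching its "Track type" header.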


Reserved columns
----------------

- genome

    The genome assembly of the track element (e.g. hg19, mm9). The GTrack format
    has no explicit requirements on the syntax or semantics of the genome
    specification; the interpretation is up to the particular parsers/tools.
    Elements from different genomes are allowed in the same GTrack file.
   
    Specifying the genome of a track element is optional. The genome may be
    specified either as a separate column in the data lines, or in a preceding
    bounding region specification line (see below), or both. If genome is
    specified both in a bounding region specification and as a column, the
    values must be equal.

- seqid
    
    A sequence identifier, i.e. an identifier of the underlying sequence of the
    particular track element. Usually defined as chromosome (e.g. chr3, chrY,
    chr2_random) or scaffold (e.g. scaffold10671), as defined in the genome
    assembly. As for the "genome" column, the GTrack format has no explicit
    requirements on the syntax or semantics of the "seqid" column; the
    interpretation is up to the particular parsers/tools. Some parsers may for
    instance allow chromosome arms (e.g. chr1p) as seqid.
    
    All track elements in a GTrack file must have a seqid, either as a separate
    column in the data lines, or in a preceding bounding region specification
    line (see below), or both. If seqid is specified both in a bounding region
    specification and as a column, the values must be equal.

- start

    The start position of the track element, using the indexing system defined
    in the header (0- or 1-based).
    
    Developer notes
    ---------------
    The start column is not defined for some track types (as described in Table
    1). In order to still work on the start position of an element, it has to
    be inferred from other information in the following manner, according to
    track type:
    
    Genome Partition (GP), Step Function (SF), Linked Genome Partition (LGP) and
    Linked Step Function (LSF):
        
        The start position of each track element can be seen as the position
        immediately following the end of the track element of the previous line.
        The exact value of the start position depends on the "End-inclusive"
        header variable: if the coordinates are end-exclusive, the start
        position of a track element is exactly the same as the end position
        of the previous line; otherwise, the start position is the previous
        end position + 1. For the first line in a set of data
        lines, the start position should be set to the start position of the
        preceding bounding region (see below).
        
    Function (F), Linked Function (LF) and Linked Base Pairs (LBP):
    
        Each line defines a successive location along the genome. The start of
        the first line in a set of data lines is then the start position of the
        preceding bounding region. The start value is then increased by 1 for
        each line.
    ---------------
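
The two inference rules for the start position can be sketched in Python as
follows. This is an illustration under the rules stated in the developer notes
above, not a reference implementation; the function names and parameters are
hypothetical.

```python
def infer_starts_partition(ends, region_start, end_inclusive=False):
    """GP, SF, LGP and LSF: each element starts where the previous one ended.

    'ends' holds the "end" values of the data lines following one bounding
    region specification, in file order.
    """
    starts = []
    previous_end = None
    for end in ends:
        if previous_end is None:
            # First data line: use the start of the bounding region.
            starts.append(region_start)
        elif end_inclusive:
            starts.append(previous_end + 1)
        else:
            starts.append(previous_end)
        previous_end = end
    return starts


def infer_starts_function(n_data_lines, region_start):
    """F, LF and LBP: one successive position per data line."""
    return [region_start + i for i in range(n_data_lines)]
```

Applied to the first bounding region of example file 3 (start=1000 and end
values 1250, 1500, 2000, 2250, end-exclusive), the inferred start positions are
1000, 1250, 1500 and 2000.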

- end

    The end position of the track element, using the indexing system (0- or
    1-based) and end-inclusiveness as defined in the header. 

    Developer notes
    ---------------
    The end column is not defined for some track types (Points (P), Valued
    Points (VP), Function (F), Linked Points (LP), Linked Valued Points (LVP),
    Linked Function (LF) and Linked Base Pairs (LBP), as described in Table 1).
    In order to still work on the end position of an element, it has to be
    inferred from the start position. In these cases, the end position depends
    on the "End-inclusive" header variable. If true, the end position is the
    same as the start position; if false, the end position is the start
    position + 1.
    ---------------

- strand

    The strand of the track element. "+" for positive and "-" for negative
    strand.

- value

    The value or score of the track element. The character "." denotes that the
    track element has a missing value. The format of the contents follows the
    "Value type" header variable as follows:

        number
        
            One floating point number, e.g. -1.23, 12 or 3.1e-4. Note that
            integer numbers are a subset of floating point numbers, and
            should use "number" as the value type.
        
        category
        
            A string defining a category. The set of all category values over
            all track elements form a category set, e.g: gene, exon, promoter.
            
        case-control
        
            One binary value, 1 for case, 0 for control. The missing value
            character, ".", is not allowed in this case.
            
        number vector
        
            A vector of floating point numbers, separated by comma, e.g.
            1.23,2.34,3.45,4,5. The length of any vector must not be longer than
            the value of the "Vector length" header variable.
        
    Developer notes
    ---------------    
    For all floating point values, the period character, ".", should be parsed
    as the "not a number" value. For "case-control", period is not allowed, and
    for "category", the "." character is just a category on the same level as
    other categories. For "number vector", the "." character is parsed as a
    vector of "not a number" values, with length equal to the value of the
    "Vector length" header variable (2 by default).
    
    If a number vector is shorter than the "Vector length" header variable, it
    should be padded with "not a number" values.
    
    Note also that, for floating point numbers, the English decimal notation is
    used, with the period character representing the decimal separator, but with
    no spacing.
    ---------------
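
The rules for the "value" column can be sketched in Python as follows. The
function name parse_value and its parameters are hypothetical. Note two
assumptions: Python's float() accepts more spellings than the specification
describes (e.g. "inf"), and character escaping is not handled.

```python
import math


def parse_value(text, value_type="number", vector_length=2):
    """Convert one "value" cell according to the "Value type" header."""
    if value_type == "number":
        # "." denotes a missing value, parsed as "not a number".
        return math.nan if text == "." else float(text)
    if value_type == "category":
        # "." is just another category here.
        return text
    if value_type == "case-control":
        if text not in ("0", "1"):
            raise ValueError("case-control must be 0 or 1, got %r" % text)
        return int(text)
    if value_type == "number vector":
        numbers = [] if text == "." else [float(t) for t in text.split(",")]
        if len(numbers) > vector_length:
            raise ValueError("vector longer than %d" % vector_length)
        # Pad short (or missing) vectors with "not a number" values.
        return numbers + [math.nan] * (vector_length - len(numbers))
    raise ValueError("unknown value type: %r" % value_type)
```

The same function could be reused for edge weights, as these follow the "Edge
weight type" header variable in the same way.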

- id
    
    A unique string identifying each track element (data line). Can be in any
    format, e.g. 1, aab or uc002ico.1.

- edges

    A semicolon-separated list of ids, representing edges from the track
    element in the current line to the track elements which the ids identify. A
    "." character denotes that the track element has no edges. An edge is by
    default directed. Each edge can have a weight value directly following after
    an equals sign. The format of the weight value follows the "Edge weight
    type" header variable in the same way as the "value" format follows the
    "Value type" header variable (see above). Note that no space characters are
    allowed after the semicolon.

    Example:
    
    ###seqid  start  end  id   edges
    chr1      0      100  aaa  aab=1.2;aac
    chr1      200    350  aab  aaa=1.1
    chr1      450    500  aac  .

    Here, the aaa node is connected to the aab node with two directed edges,
    with the edge from aaa to aab having higher weight than the one in the
    other direction. Note that undirected edges must still be specified in both
    directions, using the same weights. This adds redundancy, but simplifies
    parsing.
    
    Developer notes
    ---------------
    If a weight value is not specified for an edge, the edge weight is by
    default handled as a "." character, following the rules outlined for the
    "value" column. The only exception is in the case of weights of type
    "number", where the default weight should be the number 1.
    ---------------
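
A minimal Python sketch of parsing one "edges" cell is given below. The
function name parse_edges is hypothetical. As a simplification, weights of
types other than "number" are returned as raw text, and character escaping is
not handled.

```python
def parse_edges(text, weight_type="number"):
    """Return a dict mapping target id to weight for one "edges" cell."""
    if text == ".":
        return {}  # the track element has no edges
    edges = {}
    for item in text.split(";"):
        target, equals, weight = item.partition("=")
        if weight_type == "number":
            # A missing weight of type "number" defaults to 1.
            edges[target] = float(weight) if equals else 1.0
        else:
            # Simplification: keep other weight types as text; a missing
            # weight is handled as the "." character.
            edges[target] = weight if equals else "."
    return edges
```

For the first data line of the example above, parse_edges("aab=1.2;aac") gives
a weight of 1.2 for the edge to aab and the default weight 1 for the edge to
aac.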


--------------------------------------
3a. Bounding region specification line
--------------------------------------

- Leading characters: ####

- Format

    A) ####genome=VAL1
    
        or
    
    B) ####[genome=VAL1;[ ]]seqid=VAL2[;[ ]start=VAL3][;[ ]end=VAL4]

    where
        [x] = 'x' is optional, e.g. [ ] means optional space character
        genome, seqid, start, end = reserved attribute names
        VAL1, VAL2, VAL3, VAL4 = attribute values

- Example

    ####genome=hg18; seqid=chr1; start=100; end=10000

- Usage

    Type B is mandatory for GTrack files of one of the following track types:
        Genome Partition (GP)
        Step Function (SF)
        Function (F)
        Linked Genome Partition (LGP)
        Linked Step Function (LSF)
        Linked Function (LF)
        Linked Base Pairs (LBP)
        
    For all other track types, bounding region specification lines are optional.

    Note that if more than one bounding region is defined, the "Multiple
    bounding regions" header variable must be set to True.

- Restrictions

    Attribute names are treated as case insensitive and do not support character
    escaping. Genome and seqid values do, however, support escaping. For more
    details, see the section "Detailed specification of character usage".
    
    A bounding region specification remains in effect for a set of data lines
    until the next bounding region specification. Note that only one bounding
    region specification is allowed at a time.
    
    Bounding region specifications are not allowed to overlap.

    For track types Genome Partition (GP), Step Function (SF), Linked Genome
    Partition (LGP) and Linked Step Function (LSF), the "end" attribute must be
    equal to the end position of the last track element immediately following
    the bounding region specification line.
    
    Example:
    
    ##track type: genome partition
    ###end
    ####seqid=chr1; start=100; end=200
    125
    133
    200
    
    For track types Function (F), Linked Function (LF) and Linked Base Pairs
    (LBP), the "end" attribute must be exactly equal to the "start" attribute
    plus the number of data lines immediately following the bounding region
    specification line. If the header line "End-inclusive" is true, the end
    position should be 1 less.
    
    Example:
    
    ##track type: function
    ###value
    ####seqid=chr1; start=100; end=103
    1.2
    -0.1
    0.8


A bounding region specifies a genomic interval encompassing the data lines that
follow. A bounding region should be thought of as constituting the domain of the
following track elements, i.e. the region where we have information about the
properties modelled by the track elements. The set of all bounding regions of a
track then constitutes the domain of the track.

Note that, in the case of Points and Segments (and the variations of these, i.e.
Linked and/or Valued Points (VP/LP/LVP) and Linked and/or Valued Segments
(VS/LS/LVS), see Table 1), lack of elements is also considered information. A
bounding region is then, in this case, a region where we know that the lack of
data means something. Areas of the genome that have not been investigated (such
as centromeres) should be left outside the bounding regions. For track types
other than Points and Segments (and their variations), the track elements do by
definition fill the entire domain, as the positions of the track elements in
these cases are not informational. For example, a Function has, by definition, a
value for all base pairs in the domain. A bounding region is then just the
smallest region encompassing the track elements that follow. For more details,
see [1].

The bounding region specification comes in two flavours:

A)
    The bounding region specifies the genome assembly for the following track
    elements, using the same format as for the "genome" column (see the "Column
    specification line" section). The domain of the track is then the set of
    sequences constituting the genome, e.g. all chromosomes of the genome. If a
    track contains several genomes, the domain of the track is the collected set
    of sequences constituting all the specified genomes.
    
B)
    The bounding region specifies a single sequence, or part of this sequence,
    as the domain of the following track elements. The format is a set of
    attribute pairs separated by semicolon and an optional space character. For
    each attribute pair, the attribute name and the value are separated by the
    equals sign. The attributes may appear in any order. The allowed attributes
    are the following:
    
    - genome
    
        The genome assembly of the bounding region (e.g. hg19, mm9). The format
        of the genome attribute is the same as for the "genome" column (see the
        "Column specification line" section). The "genome" attribute is
        optional.
        
    - seqid
    
        A sequence id, i.e. the id of the underlying sequence of the bounding
        region. The format of the seqid attribute is the same as for "seqid"
        column (see the "Column specification line" section). The "seqid"
        attribute is mandatory for a bounding region specification line of type
        B.
        
        Note that if a type B bounding region specification is not defined, the
        "seqid" column must be included in the column specification line.
        
    - start
    
        The start position of the bounding region, using the indexing system
        defined in the header (0- or 1-based). The "start" attribute is
        optional.
        
        Developer notes
        ---------------
        If the "start" attribute is not specified, the start position of the
        bounding region is 0 (or 1, if the header variable "O-indexed" is
        false).
        ---------------
        
    - end
    
        The end position of the bounding region, using the indexing system (0- or
        1-based) and end-inclusiveness as defined in the header. The "end"
        attribute is optional.
        
        Developer notes
        ---------------        
        If the "end" attribute is not specified, the end position of the
        bounding region is the same as the end position of the sequence
        referenced by the 'seqid' attribute, e.g. the length of the current
        chromosome. If the parser does not have information about the length of
        the sequence in question, the user should be informed, or, in the case
        that the bounding region is unimportant for the parser, the bounding
        region specification should be ignored.
        
        Note that the restrictions regarding the "end" attribute for certain
        track types (see section "Restrictions" above) must still hold, even if
        the "end" attribute is not explicitly specified.
        ---------------
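
A bounding region specification line of either type can be parsed along the
following lines. This Python sketch is illustrative only: the function name
parse_bounding_region is hypothetical, genome and seqid values are returned
without unescaping, and the defaults for missing "start" and "end" attributes
are not filled in.

```python
ALLOWED_ATTRIBUTES = ("genome", "seqid", "start", "end")


def parse_bounding_region(line):
    """Parse a '####...' line into a dict of attributes."""
    text = line.rstrip("\r\n")
    if not text.startswith("####") or text.startswith("#####"):
        raise ValueError("not a bounding region line: %r" % text)
    attrs = {}
    for pair in text[4:].split(";"):
        if pair.startswith(" "):
            pair = pair[1:]  # one optional space may follow the semicolon
        name, equals, value = pair.partition("=")
        name = name.lower()  # attribute names are case insensitive
        if not equals or name not in ALLOWED_ATTRIBUTES or name in attrs:
            raise ValueError("bad attribute: %r" % pair)
        attrs[name] = int(value) if name in ("start", "end") else value
    # Type A has only "genome"; type B requires "seqid".
    if "seqid" not in attrs and set(attrs) != {"genome"}:
        raise ValueError("seqid is mandatory for type B: %r" % text)
    return attrs
```

Because the attributes end up in a dict, the order in which they appear on the
line does not matter.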


--------------
3b. Data lines
--------------

- Leading characters: (none)

- Format

    VAL1  VAL2  VAL3...

    where
        VAL1, VAL2, VAL3 = column values
        "  " = tab character

- Example

    chr21  304  997  -  FOOGENE  423  1  .
    (with tabs instead of spaces)
        
- Usage

    A GTrack file must contain at least one data line.

- Restrictions

    Column values support character escaping, as specified in the section
    "Detailed specification of character usage".
    
    The number of columns of each data line must be equal to the number of
    columns in the column specification line.


A tab-separated list of values, as defined by the column specification line.
If there is a missing value in either of the "value" and "edges" columns, the
period character, ".", may be used. See the section "Column specification
line" for more details.
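
Splitting a data line and checking the column count can be sketched in Python
as follows. The function name parse_data_line is hypothetical, and the values
are returned as raw, still-escaped strings.

```python
def parse_data_line(line, columns):
    """Map the tab-separated values of a data line to the column names."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(
            "expected %d columns, got %d" % (len(columns), len(values))
        )
    # Column names are case insensitive, so they are lower-cased here.
    return dict(zip((name.lower() for name in columns), values))
```

With the columns of example file 2, the first data line yields a dict in which
"value" maps to the string "0.625" and the custom column "tech" maps to
"ChIP-seq". Type conversion of the individual values is left to later steps.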


-----------------
BED compatibility
-----------------

Note that a simple BED file without a header line and only using the three
columns chr, start and end is directly compatible with the GTrack format. This
is because the default track type of a GTrack file is Segments (S), which
defines the same three core columns as a simple BED file (see Table 1). One
need only change the file extension of such a file from ".bed" to ".gtrack" and
run it through a GTrack parser. If a BED header line is present, this must be
commented out. More complex BED files must be converted. Converters to common
file formats are available at [3].
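
As a rough Python sketch, commenting out BED header lines could look like the
following. It assumes that BED header lines are those whose first word is
"browser" or "track" (the UCSC convention, see [2]); the function name is made
up for this example.

```python
BED_HEADER_KEYWORDS = ("browser", "track")


def comment_out_bed_headers(bed_lines):
    """Return the lines of a BED file with header lines turned into comments."""
    result = []
    for line in bed_lines:
        words = line.split(None, 1)
        if words and words[0] in BED_HEADER_KEYWORDS:
            # A single leading "#" marks a GTrack comment line.
            result.append("# " + line)
        else:
            result.append(line)
    return result


bed = ["track name=example\n", "chr1\t10\t20\n", "chr1\t35\t50\n"]
gtrack = comment_out_bed_headers(bed)
```

The three-column data lines are passed through unchanged, which is the point of
the compatibility note above.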


-----------------------------------------
Detailed specification of character usage
-----------------------------------------

- The GTrack format supports escaping of special characters using URL escaping
  conventions (%XX hex codes). All ASCII characters are supported, except the
  following, which must be escaped everywhere:
    
    Most control characters (except TAB, LF, CR): %00-%08, %0B-%0C, %0E-%1F, %7F
    Extended ASCII characters: %80 through %FF
        
  Also, the following characters have reserved meaning, and must be escaped when
  used with other meanings in places where they may interfere with the parsing:
    
    tab (TAB): %09
    newline (LF): %0A
    carriage return (CR): %0D
    space: %20
    # (hash): %23
    % (percent): %25
    , (comma): %2C
    ; (semicolon): %3B
    = (equals): %3D
    . (period): %2E
  
  Note that spaces need not be escaped in data lines, as these are separated by
  tabs.
    
- Reserved words in a GTrack file receive special treatment. Reserved words are
  all header variable names, reserved header variable values (except custom
  header variable values), column names (including custom columns) and
  bounding region attribute names. Reserved words should be treated as
  case-insensitive and do not support URL escaping.

- A line must end with the newline character (LF), optionally preceded by a
  carriage return (CR).

- Blank lines should be ignored by parsers.

- Comments, header lines, column specification lines and bounding region
  specification lines are characterized by the leading number of #-characters.
  Note that, except for comments, once the file reaches a certain "level" of
  #-characters, this count never goes down. Thus, header lines, column
  specification and bounding region specifications are always found in that
  order.
  
- Note that delimiter characters differ for the various lines/columns. See the
  specification above for details. Also note that examples in this file use
  spaces instead of tabs for readability. These examples should not be directly
  copied into GTrack files.
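
The escaping rules above can be sketched in Python as follows. The function
names and the "reserved" parameter are assumptions made for this example. By
default the sketch escapes every character on the reserved list, which is more
conservative than required, since reserved characters only need escaping where
they may interfere with the parsing.

```python
import re

# Characters with reserved meaning, taken from the list above.
RESERVED = "\t\n\r #%,;=."
# Spaces need not be escaped in data lines.
DATA_LINE_RESERVED = RESERVED.replace(" ", "")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def gtrack_escape(text, reserved=RESERVED):
    """Replace characters that must be escaped with %XX hex codes."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0xFF:
            raise ValueError("character outside the %%00-%%FF range: %r" % ch)
        # Control characters (except TAB, LF, CR), %7F and %80-%FF must be
        # escaped everywhere.
        always = (code < 0x20 and ch not in "\t\n\r") or code >= 0x7F
        if always or ch in reserved:
            out.append("%%%02X" % code)
        else:
            out.append(ch)
    return "".join(out)


def gtrack_unescape(text):
    """Replace every %XX hex code with the character it stands for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


escaped = gtrack_escape("a=b;c")
restored = gtrack_unescape(escaped)
```

Unescaping is done in a single pass, so that an escaped percent sign followed
by two hex digits is not decoded twice.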


-----------------------
    GTrack subtypes
-----------------------

The GTrack format includes support for creating GTrack subtypes, that is, file
formats that adhere to only a subset of the GTrack specification. This allows
creation of more specialized, simpler parsers, while at the same time ensuring
that subtype GTrack files still work with full GTrack parsers. GTrack subtypes
may also be used to standardize special GTrack configurations, removing the need
for the individual GTrack files to include all the required meta information. We
encourage independent specification of subtypes catering to specialized needs.

A GTrack subtype defines default values for header variables and/or the column
specification line. A subtype may also add new header variables or define how
parsers should interpret the values of any non-reserved columns. GTrack subtypes
must still conform to the GTrack specification. Interpretation of new columns or
header lines does, of course, require specialized parsers.


Example #1: FASTA
-----------------

As an example of the use of subtypes, we show how GTrack can be used in a
manner similar to conventional FASTA files [4]. Example file 4A is the subtype
specification file:


#
# GTrack example file 4A
#
# Specification of FASTA subtype for GTrack.
# Available at http://www.gtrack.org/fasta.gtrack
#
##GTrack version: 1.0
##GTrack subtype: FASTA
##Subtype version: 1.0
##Subtype adherence: strict
##Track type: function
##Value type: category
##Fixed-size data lines: true
##Data line size: 1
###value


When using the subtype, an "online" parser will download the subtype
specification file (above) and fill in the GTrack header with new default
values. The GTrack header may then be as simple as a single line with the URL of
the subtype specification, as in example file 4B:


#
# GTrack example file 4B
#
# This file makes use of the FASTA subtype specification shown as GTrack example
# file 4A.
#
##Subtype URL: http://www.gtrack.org/fasta.gtrack
####seqid=seq0001
TAGACATTACCGCTAGGATGATGCGATCGATCGATCCCTCTGGATTAGGAGATCTCTAGATCGATGATATCCTCNNNNNN
NNNNNATTGCTCTAGCTCTAGCTCTAGCT
####seqid=seq0002
GATTACATATCGCGATCGACTCGCCACTATAACTTCGAGTCTGACGATGATGGGGGGG


GTrack subtype header lines
---------------------------

Subtype functionality is applied with the following header variables:

- GTrack subtype

    The name of the subtype of the GTrack format specification used for the
    file, if any. 
    
    Developer notes
    ---------------
    Custom parsers that only support certain subtypes should check this header
    and give feedback to users if the subtype is not correct.
    ---------------
    
    Default value: ""

- Subtype URL

    URL to a GTrack file used as a specification/model for the GTrack subtype,
    if any. The subtype GTrack specification file is a normal GTrack file, but
    without bounding region specification lines or data lines. The header lines
    and the column specification line of a GTrack subtype model file are used as
    default values for other GTrack files that adhere to the subtype. Any other
    specifications/restrictions should be included as comments.
    
    Developer notes
    ---------------
    If a GTrack file contains a Subtype URL header line, the subtype
    specification file should be downloaded by the parser. Incomplete URLs
    without a specified scheme (e.g. www.gtrack.org) should be treated as
    HTTP-addresses (e.g. http://www.gtrack.org). After this, the header lines of
    the GTrack files should be parsed again, and any inconsistencies with the
    subtype headers should be treated according to the "subtype adherence"
    header variable (see below). If the header variables "GTrack subtype" or
    "Subtype version" (see below) in a GTrack file do not correspond to the same
    header variables in the subtype specification file, the user should be
    informed. It is then up to the parser to decide whether or not to continue
    parsing.
    
    If subtype specification downloading is not supported by the parser and a
    subtype URL is provided in the GTrack file, the user should be informed that
    he/she may use the GTrack Header Expander tool available at [3] in order to
    merge the subtype headers with the GTrack file for use in "offline" parsers.
    ---------------
    
    Default value: ""

- Subtype version

    The version of the subtype specification used.
    
    Default value: 1.0

- Subtype adherence

    Subtype adherence may be specified in the subtype GTrack specification file
    and will then regulate the way a GTrack file may override the subtype
    specifications. The subtype adherence may also be specified in a GTrack
    file, and will in this case function as a signal to parsers. In this way,
    different parsers may allow different levels of adherence for GTrack files
    of the same subtype.
    
    The following values are allowed:
    
        strict
        
            Default values of header variables, as defined by the subtype, may
            not be overridden by the contents of a file. GTrack defaults may be
            overridden.
            
            This option may be used to force users of a subtype to follow the
            specification exactly.
            
        medium
            
            As strict, but allows redefinition of the column specification line
            in two aspects:
            
            1.  The "value" and "edges" columns may be redefined, i.e. any
                non-core column names may be renamed to "value" or "edges", and
                vice-versa. If the subtype specification includes "value" or
                "edges" columns, they must still be present in the redefined
                column specification line. Correspondingly, the header lines
                "track type", "value type", "vector length", "edges weight
                type", "edges weight vector length" and "undirected edges" may
                also be redefined by the GTrack file.
                
            2.  Any number of extra columns, including reserved columns, may be
                added to the end of the column specification line.
            
            This option may be used to allow users of a subtype to add their own
            content, including redefining the "value" and "edges" columns, while
            maintaining the exact interpretation of the first columns as defined
            by the subtype.
               
        low
        
            As strict, but allows redefinition of the column specification line
            in a more relaxed manner: all columns specified in the subtype
            specification must be included, but can be put in any order, and
            extra columns may be added. Note that in this case, unlike in
            "medium", redefinition of the "value" or "edges" columns is not
            allowed, but a "value" or an "edges" column may be added, if not
            present.
            
            This option may be used to allow users of a subtype to adopt their
            own column ordering, while at the same time maintaining that a
            minimum of columns must be present, identifiable by column name. 
                 
        free
        
            Everything is allowed, as long as the GTrack specification is
            followed.
            
            This option leads to the subtype specification being used for no
            more than an alternative definition of default values of the GTrack
            header lines and column specification line.
 
    Developer notes
    ---------------
    Note that if subtype adherence is specified in the subtype specification as
    anything other than "free", a GTrack file using the subtype specification
    may not redefine this value.
    ---------------

    Default value: free
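
The handling of subtype header lines described above may be sketched in Python
as follows. The function names are hypothetical, header values are compared
verbatim as a simplification, and the sketch only covers header variables: it
does not check the column specification line, and it does not separately notify
the user about mismatching "GTrack subtype" or "Subtype version" values.

```python
# Header variables a file may redefine under "medium" adherence.
MEDIUM_REDEFINABLE = {
    "track type", "value type", "vector length", "edges weight type",
    "edges weight vector length", "undirected edges",
}


def normalize_subtype_url(url):
    """Treat URLs without a scheme as HTTP addresses."""
    return url if "://" in url else "http://" + url


def merge_headers(subtype_headers, file_headers):
    """Use subtype headers as defaults and apply the headers of the file.

    Raises ValueError if the file overrides a subtype value in a way that the
    subtype adherence does not allow.
    """
    # Header variable names are reserved words, hence case-insensitive.
    subtype = {name.lower(): value for name, value in subtype_headers.items()}
    adherence = subtype.get("subtype adherence", "free").lower()
    merged = dict(subtype)
    for name, value in file_headers.items():
        key = name.lower()
        if key in subtype and subtype[key] != value:
            allowed = adherence == "free" or (
                adherence == "medium" and key in MEDIUM_REDEFINABLE
            )
            if not allowed:
                raise ValueError(
                    "header '%s' may not be overridden under '%s' adherence"
                    % (name, adherence)
                )
        merged[key] = value
    return merged


fasta_subtype = {
    "GTrack subtype": "FASTA",
    "Subtype adherence": "strict",
    "Track type": "function",
}
url = normalize_subtype_url("www.gtrack.org/fasta.gtrack")
headers = merge_headers(fasta_subtype, {"Subtype URL": url})
```

Since "subtype adherence" is itself not among the header variables that may be
redefined, a file cannot change it unless the subtype specifies "free" (or
nothing), in line with the developer note above.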


Example #2: Short reads
-----------------------

As an extra example of the subtype functionality, we here propose a format for
storing short reads (e.g. from ChIP-seq experiments). Again, example file 5A is
the GTrack subtype specification file, and example file 5B is a GTrack file
making use of the subtype:


#
# GTrack example file 5A
#
# Specification of Short reads example subtype.
# Available at http://www.gtrack.org/shortreadsexample.gtrack
#
##GTrack version: 1.0
##GTrack subtype: Short reads example
##Subtype version: 0.9
##Subtype adherence: medium
##Track type: segments
###seqid  start  end  strand  read        quality


---


#
# GTrack example file 5B
#
# GTrack file making use of the Short reads example subtype.
#
##Subtype URL: http://www.gtrack.org/shortreadsexample.gtrack
###seqid  start  end  strand  read        quality  new
chr1      101    111  +       AGTAGATAGC  0.8      0
chr1      203    244  -       0:C;15:G    0.7      1


In this case, the "Short reads example" subtype defines an extra column, named
"read". A read is then either the exact read (using nucleotide symbols with the
exact same length as the track element) or a semicolon-separated list of
colon-separated mismatches, where a mismatch is represented by a relative
position and a nucleotide symbol. The reference is here the genome assembly
specified in the description lines. The relative positions should follow the
indexing defined by the "0-indexed" header variable. Other columns are allowed.

Note that example file 5B includes an extra column, as allowed by the "medium"
subtype adherence setting.
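
A specialized parser could interpret the "read" column along the following
lines. This Python sketch uses a hypothetical function name and assumes that a
value containing a colon is a mismatch list, while any other value is an exact
read.

```python
def parse_read(read):
    """Return an exact read as a string, or mismatches as a list of tuples."""
    if ":" not in read:
        return read  # exact read, e.g. "AGTAGATAGC"
    mismatches = []
    for item in read.split(";"):
        # Each mismatch is "<relative position>:<nucleotide symbol>".
        position, nucleotide = item.split(":")
        mismatches.append((int(position), nucleotide))
    return mismatches


exact = parse_read("AGTAGATAGC")
mismatches = parse_read("0:C;15:G")
```

The relative positions are returned as given in the file. Whether they count
from 0 or 1 depends on the "0-indexed" header variable.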


------------------
    References
------------------

[1] To be added
[2] http://genome.ucsc.edu/FAQ/FAQformat.html
[3] http://hyperbrowser.uio.no
[4] http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml