|-----------------------------------------|
| Specification of the GTrack file format |
|-----------------------------------------|

Version: 1.0b2
Date: 02 Sep 2011
Authors: Sveinung Gundersen, Matus Kalas, Osman Abul, Arnoldo Frigessi,
         Eivind Hovig, Geir Kjetil Sandve

----------------
    Contents
----------------

* Reading the specification
* What is GTrack?
* Example GTrack files
* Basic specification
    x. Comments
    1. Header lines
    2. Column specification line
    3a. Bounding region specification line
    3b. Data lines
    - BED compatibility
    - Detailed specification of character usage
* GTrack subtypes
    - Example #1: FASTA
    - GTrack subtype header lines
    - Example #2: Short reads
* References

---------------------------------
    Reading the specification
---------------------------------

This document contains the complete specification of the GTrack format. As the
document contains many details, we here present some reading recommendations:

- Skip the "Developer notes" sections if you are not planning to develop
  parsers of the GTrack format.

- The "Restrictions" section after each main type of GTrack line contains
  detailed descriptions that can be skipped in the first read-through.

- The section "Detailed specification of character usage" contains very
  detailed information and is not generally required reading for basic use.

- All information about GTrack subtyping, i.e. extra header lines for
  subtyping purposes, is collected in a separate section at the end of the
  specification. This is done in order to give a better overview of the basic
  functionality in the main sections, which the reader should understand
  before delving into the subtyping functionality.

-----------------------
    What is GTrack?
-----------------------

GTrack is shorthand for Genome Track. The GTrack file format is a general
purpose file format for genome annotations.
The main purpose of the format is the unified and optimised formalization of
sequence level genome data into one of fifteen main track types, as developed
in [1]:

    Points (P)
    Valued Points (VP)
    Segments (S)
    Valued Segments (VS)
    Genome Partition (GP)
    Step Function (SF)
    Function (F)
    Linked Points (LP)
    Linked Valued Points (LVP)
    Linked Segments (LS)
    Linked Valued Segments (LVS)
    Linked Genome Partition (LGP)
    Linked Step Function (LSF)
    Linked Function (LF)
    Linked Base Pairs (LBP)

These fifteen track types encompass most of the existing track types, while
providing support for, among other things, genomic data of a three-dimensional
nature. The primary goals of the GTrack format are to support all track types
systematically, simplify parsing and manipulation, allow custom extensions,
and provide efficient storage.

---------------------
 Example GTrack files
---------------------

Before delving into the details, it is recommended that you examine these
examples of simple GTrack files. You may return to them while reading the rest
of the specification, if needed.

The first example is the simplest version of GTrack, without any specification
lines. It shows a data set of a couple of genomic segments, and the track type
is simply Segments (S).

    #
    # GTrack example file 1
    #
    # A GTrack file without headers is handled as a three-column BED file [2]
    #
    chr1    121     201
    chr2    486     1240

The second example contains all GTrack specification lines (header line,
column specification line and bounding region specification line) and shows a
dataset of genomic segments with additional associated information in extra
columns. One of these is selected as the main "value" of the segments, which
are then of type Valued Segments (VS). The example also shows how to add
custom columns.

    #
    # GTrack example file 2
    #
    # Note: tech is a custom column and not part of the GTrack specification
    #
    ##Track type: valued segments
    ###seqid  tech       start  end   value  strand
    ####genome=hg19
    chr1      ChIP-seq   1047   1165  0.625  -
    chr2      ChIP-chip  2002   2450  .      +
    chr2      ChIP-chip  3033   3246  0.355  +

The third example is more advanced, showing a Step Function dataset, that is,
a dataset where every base pair in the domain has an associated value, but
where this value is constant, or approximated, over larger regions (250-500
bps). The domain is, in this case, composed of two bounding regions. In
addition, some of the regions are linked by edges to other regions in the
genome. This example file is thus of type Linked Step Function (LSF).

    #
    # GTrack example file 3
    #
    ##Track type: linked step function
    ##Undirected edges: true
    ###id  end   value  edges
    ####seqid=chr1; start=1000; end=2250
    1      1250  10     4=0.4
    2      1500  7      .
    3      2000  2      .
    4      2250  6      1=0.4;6=0.3
    ####seqid=chr1; start=3000; end=4000
    5      3250  7      .
    6      3500  4      4=0.3
    7      4000  6      .

(Note that, for readability, spaces are used instead of tab characters in
these example files. They will therefore not work "out of the box".)

---------------------------
    Basic Specification
---------------------------

GTrack is a tabular text file format. All files in the GTrack format should
end their names with ".gtrack".

The GTrack format consists of 5 different line types, distinguished by the
leading characters:

    x. Comments
    1. Header lines
    2. Column specification line
    3a. Bounding region specification line
    3b. Data lines

Note: The number preceding each line type defines the order in which the lines
must be present, i.e. the column specification must follow the header lines,
but comments may be present anywhere. Note that a bounding region
specification line must be followed by a data line, but that a file may have
multiple bounding region specifications with data lines in between.

-----------
x. Comments
-----------

- Leading characters: #

- Example
    #This is a comment!

- Usage: Optional

Comments are ignored by parsers and may be present anywhere in the file.

---------------
1. Header lines
---------------

- Leading characters: ##

- Format
    ##VARIABLE:[ ]VALUE

    where
        VARIABLE = Header variable name
        [ ]      = Optional space character
        VALUE    = Header variable value

- Example
    ##gtrack version: 1.0
    ##track type: valued points
    ##value type: category
    ##O-indexed: False
    ##end-inclusive: True

- Usage
    Optional, but any header variables not declared retain their default
    values.

- Restrictions
    All variable names and reserved variable values are treated as case
    insensitive and do not support character escaping. Custom values, i.e.
    header values defined in GTrack subtypes, do, however, support escaping.
    For more details, see the section "Detailed specification of character
    usage".

    Values are restricted to the ones allowed by the header variable (see
    below).

Header lines provide structural information readable by both humans and
automatic parsers. The GTrack format defines a reserved set of header
variables, each with a default value. If a header variable is not declared in
the header lines, the default value is used. We encourage the use of header
lines even when they contain default values, as this adds to the clarity of
the file and helps reduce parsing errors. The order of the header lines is
unimportant.

Developer notes
---------------
As not all parsers/tools will need to support the full GTrack specification,
developers are welcome to support only subsets. We do, however, encourage all
GTrack parsers to always check the GTrack header lines and give feedback to
the user if a particular feature is unsupported by the parser/tool.
---------------

Reserved header variables
-------------------------

- GTrack version

    The version of the GTrack specification used for the file.

    Default value: 1.0

- Track type*

    one of:
        points
        valued points
        segments
        valued segments
        genome partition
        step function
        function
        linked points
        linked valued points
        linked segments
        linked valued segments
        linked genome partition
        linked step function
        linked function
        linked base pairs

    Defines the track type of a GTrack file. Each track type defines a set of
    core columns to be used. See the Column specification section for more
    details.

    Default value: segments

- Value type*

    one of:
        number
        category
        case-control
        number vector

    Only used if the "value" column is defined. Defines the kind of content
    accepted in the value column. See the Column specification section for
    more details.

    Default value: number

- Vector length*

    Only used if the "value" column is defined and number vector is defined as
    the value type. Defines the maximal length of the number vector of the
    "value" column. Must be 2 or longer.

    Default value: 2

- Edge weight type*

    one of:
        number
        category
        case-control
        number vector

    Only used if the "edges" column is defined. Defines the kind of content
    accepted as edge weights. See the Column specification section for more
    details.

    Default value: number

- Edge weight vector length*

    Only used if the "edges" column is defined and number vector is defined as
    the edge weight type. Defines the maximal length of the number vector of
    the edge weights. Must be 2 or longer.

    Default value: 2

- Multiple bounding regions*

    True if the file defines more than one bounding region, else false. This
    alerts parsers that multiple bounding regions may appear among the data
    lines.

    Default value: false

- Overlapping elements*

    True if any two track elements overlap, else false. Only tracks of type
    Points and Segments, and the variations of these, i.e. Linked and/or
    Valued Points (VP/LP/LVP) and Linked and/or Valued Segments (VS/LS/LVS),
    are allowed to overlap.

    Default value: true

- Circular elements*

    True if any track element crosses the coordinate borders of a circular
    sequence, i.e.
    that the "end" value is smaller than the "start" value.

    Default value: false

- Undirected edges*

    True if all edges specified in the GTrack file are undirected, else false.
    Note that undirected edges between two track elements must still be
    specified in both data lines, using the same weights.

    Default value: false

- Fixed-size data lines

    True if each data line has a fixed size in terms of number of characters.
    This is only allowed for track type Function (F), and only if the only
    column specified is "value". Newline and carriage return characters are
    ignored when parsing, and the data lines are separated by the number of
    characters specified in the header variable "Data line size" (below).

    This header is used to support FASTA-like sequences, and may also be used
    to create function tracks of data such as GC-content, in a condensed
    manner. See the section "GTrack subtypes" for an example.

    Developer notes
    ---------------
    Note that parsers still need to be able to recognize bounding region
    specification lines.
    ---------------

    Default value: false

- Data line size

    The size of each data line in terms of number of characters. Only used if
    the header variable "Fixed-size data lines" (above) is true.

    Default value: 1

- O-indexed

    True if the coordinates start at 0, false if the coordinates start at 1.

    Default value: true

- End-inclusive

    True if the chromosome coordinate specified in the end column is included
    in the interval, else false.

    Default value: false

(Note that the section "GTrack subtypes" includes some more reserved header
variables.)

* Some header lines include information that is redundant with regard to the
  rest of the file. These are marked with * in the listing above. The
  redundant header lines are still explicitly defined for several reasons.
  First, in order for a human reader to easily find out which features are
  used in a file. Second, as a way for simple parsers that only use a subset
  of the specification to check whether they can parse a particular file.
  Third, it enables automatic validation of whether a file contains the
  information in the way the author intended. These header lines can be
  automatically extracted from the rest of a GTrack file by the GTrack Header
  Expander tool, available at [3].

  Developer notes
  ---------------
  Following the guidelines of defensive programming, we recommend that parsers
  check that these header lines correspond to the contents in the data lines
  and give the users feedback if there are inconsistencies.
  ---------------

----------------------------
2. Column specification line
----------------------------

- Leading characters: ###

- Format
    ###COL1 COL2 COL3...

    where
        COL1, COL2, COL3 = Column names
        " "              = tab character

- Example
    ###genome seqid start end strand geneId score id edges
    (with tabs instead of spaces)

- Default value
    ###seqid start end
    (with tabs instead of spaces)

- Usage
    Optional, but if not defined, retains the default value.

- Restrictions
    Column names are treated as case insensitive and do not support character
    escaping. For more details, see the section "Detailed specification of
    character usage".

A tab-separated list of column names. The GTrack specification defines a set
of eight reserved column names. Four of these are associated with the four
core informational properties: position, length, value and edges. The specific
set of core columns present defines the track type (see [1] for more details).
The GTrack format also defines four reserved columns that, although they do
not define track type, have reserved meanings. The associations between the
reserved columns and track types are shown in the following table:

Column name:                    genome  seqid  start  end  value  strand  id  edges
Type of column:                 N       N      C      C    C      N       N   C

Track type:
Points (P)                      ?       !      X      .    .      ?       ?   .
Segments (S)                    ?       !      X      X    .      ?       ?   .
Genome Partition (GP)           ?       !      .      X    .      ?       ?   .
Valued Points (VP)              ?       !      X      .    X      ?       ?   .
Valued Segments (VS)            ?       !      X      X    X      ?       ?   .
Step Function (SF)              ?       !      .      X    X      ?       ?   .
Function (F)                    ?       !      .      .    X      ?       ?   .
Linked Points (LP)              ?       !      X      .    .      ?       X   X
Linked Segments (LS)            ?       !      X      X    .      ?       X   X
Linked Genome Partition (LGP)   ?       !      .      X    .      ?       X   X
Linked Valued Points (LVP)      ?       !      X      .    X      ?       X   X
Linked Valued Segments (LVS)    ?       !      X      X    X      ?       X   X
Linked Step Function (LSF)      ?       !      .      X    X      ?       X   X
Linked Function (LF)            ?       !      .      .    X      ?       X   X
Linked Base Pairs (LBP)         ?       !      .      .    .      ?       X   X

    C - Core reserved columns (defines track type)
    N - Non-core reserved columns (reserved, but do not define track type)

    X - Column mandatory
    ? - Column optional
    . - Column not allowed
    ! - Property must be present, either as a column or in a bounding region
        specification (see below)

Table 1: Overview of the eight reserved columns in the GTrack format and their
associations to track type.

(Note that the GTrack Header Expander tool, available at [3], may be used to
fill out a default column specification line based on track type. The default
column specification line is then the mandatory columns defined in Table 1, in
the same order. In that case, the GTrack file needs to include the "Track
type" header variable.)

Reserved columns
----------------

- genome

    The genome assembly of the track element (e.g. hg19, mm9). The GTrack
    format has no explicit requirements on the syntax or semantics of the
    genome specification; the interpretation is up to the particular
    parsers/tools. Elements from different genomes are allowed in the same
    GTrack file.

    Specifying the genome of a track element is optional. The genome may be
    specified either as a separate column in the data lines, or in a preceding
    bounding region specification line (see below), or both. If genome is
    specified both in a bounding region specification and as a column, the
    values must be equal.

- seqid

    A sequence identifier, i.e. an identifier of the underlying sequence of
    the particular track element. Usually defined as chromosome (e.g. chr3,
    chrY, chr2_random) or scaffold (e.g. scaffold10671), as defined in the
    genome assembly.
    As for the "genome" column, the GTrack format has no explicit
    requirements on the syntax or semantics of the "seqid" column; the
    interpretation is up to the particular parsers/tools. Some parsers may
    for instance allow chromosome arms (e.g. chr1p) as seqid.

    All track elements in a GTrack file must have a seqid, either as a
    separate column in the data lines, or in a preceding bounding region
    specification line (see below), or both. If seqid is specified both in a
    bounding region specification and as a column, the values must be equal.

- start

    The start position of the track element, using the indexing system defined
    in the header (0- or 1-based).

    Developer notes
    ---------------
    The start column is not defined for some track types (as described in
    Table 1). In order to still work on the start position of an element, it
    has to be inferred from other information in the following manner,
    according to track type:

    Genome Partition (GP), Step Function (SF), Linked Genome Partition (LGP)
    and Linked Step Function (LSF):

        The start position of each track element can be seen as the position
        immediately following the end of the track element of the previous
        line. The exact value of the start position depends on the
        "End-inclusive" header variable: if the coordinates are end-exclusive,
        the start position of a track element should be exactly the same as
        the end position of the previous line; if not, the start position
        should be set to the previous end position + 1. For the first line in
        a set of data lines, the start position should be set to the start
        position of the preceding bounding region (see below).

    Function (F), Linked Function (LF) and Linked Base Pairs (LBP):

        Each line defines a successive location along the genome. The start of
        the first line in a set of data lines is then the start position of
        the preceding bounding region. The start value is then increased by 1
        for each line.
    ---------------

- end

    The end position of the track element, using the indexing system (0- or
    1-based) and end-inclusiveness as defined in the header.

    Developer notes
    ---------------
    The end column is not defined for some track types (Points (P), Valued
    Points (VP), Function (F), Linked Points (LP), Linked Valued Points (LVP),
    Linked Function (LF) and Linked Base Pairs (LBP), as described in
    Table 1). In order to still work on the end position of an element, it has
    to be inferred from the start position. In these cases, the end position
    depends on the "End-inclusive" header variable. If False, the end position
    is the same as the start position; if True, the end position is the start
    position + 1.
    ---------------

- strand

    The strand of the track element. "+" for positive and "-" for negative
    strand.

- value

    The value or score of the track element. The character "." denotes that
    the track element has a missing value. The format of the contents follows
    the "Value type" header variable as follows:

    number
        One floating point number, e.g. -1.23, 12 or 3.1e-4. Note that integer
        numbers are a subset of floating point numbers, and should use
        "number" as the value type.

    category
        A string defining a category. The set of all category values over all
        track elements forms a category set, e.g. gene, exon, promoter.

    case-control
        One binary value, 1 for case, 0 for control. The missing value
        character, ".", is not allowed in this case.

    number vector
        A vector of floating point numbers, separated by commas, e.g.
        1.23,2.34,3.45,4,5. The length of any vector must not be longer than
        the value of the "Vector length" header variable.

    Developer notes
    ---------------
    For all floating point values, the period character, ".", should be parsed
    as the "not a number" value. For "case-control", period is not allowed,
    and for "category", the "." character is just a category on the same level
    as other categories. For "number vector" the "."
    character is parsed as a vector of "not a number" values, with vector
    length equal to the value of the "Vector length" header variable (2 by
    default). If a number vector is shorter than the "Vector length" header
    variable, it should be padded with "not a number" values.

    Note also that, for floating point numbers, the English decimal notation
    is used, with the period character representing the decimal separator,
    but with no spacing.
    ---------------

- id

    A unique string identifying each track element (data line). Can be in any
    format, e.g. 1, aab or uc002ico.1.

- edges

    A semicolon-separated list of ids, representing edges from the track
    element in the current line to the track elements which the ids identify.
    A "." character denotes that the track element has no edges. An edge is by
    default directed. Each edge can have a weight value directly following an
    equals sign. The format of the weight value follows the "Edge weight type"
    header variable in the same way as the "value" format follows the "Value
    type" header variable (see above). Note that no space characters are
    allowed after the semicolon.

    Example:

        ###seqid  start  end  id   edges
        chr1      0      100  aaa  aab=1.2;aac
        chr1      200    350  aab  aaa=1.1
        chr1      450    500  aac  .

    Here, the aaa node is connected to the aab node with two directed edges,
    with the edge from aaa to aab having higher weight than the one in the
    other direction.

    Note that undirected edges must still be specified in both directions,
    using the same weights. This adds redundancy, but simplifies parsing.

    Developer notes
    ---------------
    If a weight value is not specified for an edge, the edge weight is by
    default handled as a "." character, following the rules outlined for the
    "value" column. The only exception is in the case of weights of type
    "number", where the default weight should be the number 1.
    ---------------

--------------------------------------
3a. Bounding region specification line
--------------------------------------

- Leading characters: ####

- Format
    A) ####genome=VAL1

    or

    B) ####[genome=VAL1;[ ]]seqid=VAL2[;[ ]start=VAL3][;[ ]end=VAL4]

    where
        [x]                        = 'x' is optional, e.g. [ ] means optional
                                     space character
        genome, seqid, start, end  = reserved attribute names
        VAL1, VAL2, VAL3, VAL4     = attribute values

- Example
    ####genome=hg18; seqid=chr1; start=100; end=10000

- Usage
    Type B is mandatory for GTrack files of one of the following track types:

        Genome Partition (GP)
        Step Function (SF)
        Function (F)
        Linked Genome Partition (LGP)
        Linked Step Function (LSF)
        Linked Function (LF)
        Linked Base Pairs (LBP)

    For all other track types, bounding region specification lines are
    optional.

    Note that if more than one bounding region is defined, the "Multiple
    bounding regions" header variable must be set to True.

- Restrictions
    Attribute names are treated as case insensitive and do not support
    character escaping. Genome and seqid values do, however, support escaping.
    For more details, see the section "Detailed specification of character
    usage".

    A bounding region specification remains in effect for a set of data lines
    until the next bounding region specification. Note that only one bounding
    region specification is allowed at a time.

    Bounding region specifications are not allowed to overlap.

    For track types Genome Partition (GP), Step Function (SF), Linked Genome
    Partition (LGP) and Linked Step Function (LSF), the "end" attribute must
    be equal to the end position of the last track element immediately
    following the bounding region specification line. Example:

        ##track type: genome partition
        ###end
        ####seqid=chr1; start=100; end=200
        125
        133
        200

    For track types Function (F), Linked Function (LF) and Linked Base Pairs
    (LBP), the "end" attribute must be exactly equal to the "start" attribute
    plus the number of data lines immediately following the bounding region
    specification line.
    If the header line "End-inclusive" is true, the end position should be 1
    less. Example:

        ##track type: function
        ###value
        ####seqid=chr1; start=100; end=103
        1.2
        -0.1
        0.8

A bounding region specifies a genomic interval encompassing the data lines
that follow. A bounding region should be thought of as constituting the domain
of the following track elements, i.e. the region where we have information
about the properties modelled by the track elements. The set of all bounding
regions of a track then constitutes the domain of the track.

Note that, in the case of Points and Segments (and the variations of these,
i.e. Linked and/or Valued Points (VP/LP/LVP) and Linked and/or Valued Segments
(VS/LS/LVS), see Table 1), lack of elements is also considered information. A
bounding region is then, in this case, a region where we know that the lack of
data means something. Areas of the genome that have not been investigated
(such as centromeres) should be left outside the bounding regions.

For track types other than Points and Segments (and their variations), the
track elements do by definition fill the entire domain, as the positions of
the track elements in these cases are not informational. For example, a
Function has, by definition, a value for all base pairs in the domain. A
bounding region is then just the smallest region encompassing the track
elements that follow. For more details, see [1].

The bounding region specification comes in two flavours:

A) The bounding region specifies the genome assembly for the following track
   elements, using the same format as for the "genome" column (see the "Column
   specification line" section). The domain of the track is then the set of
   sequences constituting the genome, e.g. all chromosomes of the genome. If a
   track contains several genomes, the domain of the track is the collected
   set of sequences constituting all the specified genomes.

B) The bounding region specifies a single sequence, or part of this sequence,
   as the domain of the following track elements. The format is a set of
   attribute pairs separated by a semicolon and an optional space character.
   For each attribute pair, the attribute name and the value are separated by
   the equals sign. The attributes may appear in any order. The allowed
   attributes are the following:

   - genome

       The genome assembly of the bounding region (e.g. hg19, mm9). The format
       of the genome attribute is the same as for the "genome" column (see the
       "Column specification line" section). The "genome" attribute is
       optional.

   - seqid

       A sequence id, i.e. the id of the underlying sequence of the bounding
       region. The format of the seqid attribute is the same as for the
       "seqid" column (see the "Column specification line" section).

       The "seqid" attribute is mandatory for a bounding region specification
       line of type B. Note that if a type B bounding region specification is
       not defined, the "seqid" column must be included in the column
       specification line.

   - start

       The start position of the bounding region, using the indexing system
       defined in the header (0- or 1-based). The "start" attribute is
       optional.

       Developer notes
       ---------------
       If the "start" attribute is not specified, the start position of the
       bounding region is 0 (or 1, if the header variable "O-indexed" is
       false).
       ---------------

   - end

       The end position of the bounding region, using the indexing system (0-
       or 1-based) and end-inclusiveness as defined in the header. The "end"
       attribute is optional.

       Developer notes
       ---------------
       If the "end" attribute is not specified, the end position of the
       bounding region is the same as the end position of the sequence
       referenced by the "seqid" attribute, e.g. the length of the current
       chromosome.
       If the parser does not have information about the length of the
       sequence in question, the user should be informed, or, in the case
       that the bounding region is unimportant for the parser, the bounding
       region specification should be ignored. Note that the restrictions
       regarding the "end" attribute for certain track types (see the
       "Restrictions" section above) must still hold, even if the "end"
       attribute is not explicitly specified.
       ---------------

--------------
3b. Data lines
--------------

- Leading characters: (none)

- Format
    VAL1 VAL2 VAL3...

    where
        VAL1, VAL2, VAL3 = column values
        " "              = tab character

- Example
    chr21 304 997 - FOOGENE 423 1 .
    (with tabs instead of spaces)

- Usage
    A GTrack file must contain at least one data line.

- Restrictions
    Column values support character escaping, as specified in the section
    "Detailed specification of character usage".

    The number of columns of each data line must be equal to the number of
    columns in the column specification line.

A tab-separated list of values, as defined by the column specification line.
If there is a missing value in either of the "value" and "edges" columns, the
period character, ".", may be used. See the section "Column specification
line" for more details.

-----------------
BED compatibility
-----------------

Note that a simple BED file without a header line and only using the three
columns chr, start and end is directly compatible with the GTrack format. This
is because the default track type of a GTrack file is Segments (S), which
defines the same three core columns as a simple BED file (see Table 1). One
may thus simply rename the file ending of such a file from ".bed" to ".gtrack"
and run it through a GTrack parser. If a BED header line is present, it must
be commented out.

More complex BED files must be converted. Converters to common file formats
are available at [3].
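
As an illustration of the BED compatibility described above, the following is
a minimal sketch of a reader for GTrack files that contain no specification
lines, so that all defaults apply (track type Segments, columns seqid, start
and end). The function name parse_headerless_gtrack is hypothetical and not
part of the specification, and rejecting specification lines instead of
interpreting them is a simplification made for this sketch only.

```python
def parse_headerless_gtrack(text):
    """Parse GTrack text without specification lines (three-column BED style).

    Returns a list of (seqid, start, end) tuples, using the default column
    specification line: seqid, start, end.
    """
    elements = []
    for line_no, raw in enumerate(text.split("\n"), start=1):
        line = raw.rstrip("\r")  # LF may be preceded by CR
        if not line.strip():
            continue  # blank lines are ignored
        if line.startswith("##"):
            # Header, column and bounding region lines are out of scope here.
            raise ValueError(
                f"line {line_no}: specification lines are not handled by this sketch"
            )
        if line.startswith("#"):
            continue  # comment line
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated columns, got {len(columns)}"
            )
        seqid, start, end = columns
        elements.append((seqid, int(start), int(end)))
    if not elements:
        raise ValueError("a GTrack file must contain at least one data line")
    return elements
```

Applied to example file 1 (with tabs as separators), the sketch yields one
tuple per segment. A complete parser would additionally read the header lines
and apply the "O-indexed" and "End-inclusive" settings.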

-----------------------------------------
Detailed specification of character usage
-----------------------------------------

- The GTrack format supports escaping of special characters using URL escaping
  conventions (%XX hex codes). All ASCII characters are supported, except the
  following, which must be escaped everywhere:

    Most control characters (except TAB, LF, CR):
        %00-%08, %0B-%0C, %0E-%1F, %7F
    Extended ASCII characters:
        %80 through %FF

  Also, the following characters have reserved meaning, and must be escaped
  when used with other meanings in places where they may interfere with the
  parsing:

    tab (TAB):             %09
    newline (LF):          %0A
    carriage return (CR):  %0D
    space:                 %20
    # (hash):              %23
    % (percent):           %25
    , (comma):             %2C
    ; (semicolon):         %3B
    = (equals):            %3D
    . (period):            %2E

  Note that spaces need not be escaped in data lines, as the columns of these
  are separated by tabs.

- Reserved words in a GTrack file receive special treatment. Reserved words
  are all header variable names, reserved header variable values (except
  custom header variable values), column names (including custom columns) and
  bounding region attribute names. Reserved words should be treated as case
  insensitive and do not support URL escaping.

- A line must end with the newline character (LF), optionally preceded by a
  carriage return (CR).

- Blank lines should be ignored by parsers.

- Comments, header lines, column specification lines and bounding region
  specification lines are characterized by the number of leading
  #-characters. Note that, except for comments, once the file reaches a
  certain "level" of #-characters, this count never goes down. Thus, header
  lines, column specification and bounding region specifications are always
  found in that order.

- Note that delimiter characters differ for the various lines/columns. See the
  specification above for details. Also note that examples in this file use
  spaces instead of tabs for readability. These examples should not be
  directly copied into GTrack files.
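
The escaping rules above can be sketched in a few lines of Python. This is an
illustration only: the names RESERVED, gtrack_escape and gtrack_unescape are
hypothetical, and two choices are assumptions of the sketch rather than
requirements of the specification: escaped bytes are decoded as Latin-1 (one
character per %XX code), and characters above %FF are rejected.

```python
from urllib.parse import unquote

# Characters with reserved meaning in GTrack: TAB, LF, CR, space, # % , ; = .
RESERVED = "\t\n\r #%,;=."


def gtrack_escape(value, reserved=RESERVED):
    """Escape a custom value using %XX hex codes.

    Control characters, DEL and extended ASCII are always escaped;
    characters listed in `reserved` are escaped as well.
    """
    out = []
    for ch in value:
        code = ord(ch)
        if code > 0xFF:
            # Assumption of this sketch: only single-byte characters are handled.
            raise ValueError(f"character {ch!r} is outside the %00-%FF range")
        if ch in reserved or code < 0x20 or code >= 0x7F:
            out.append("%%%02X" % code)
        else:
            out.append(ch)
    return "".join(out)


def gtrack_unescape(value):
    """Decode %XX hex codes; each code maps to one Latin-1 character."""
    return unquote(value, encoding="latin-1")
```

Since spaces need not be escaped in data lines, a caller could pass a reduced
reserved set for column values, e.g. gtrack_escape(text, reserved="\t\n\r#%").
Reserved words (header variable names, column names, attribute names) should
not be passed through these functions, as they do not support escaping.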

-----------------------
    GTrack subtypes
-----------------------

The GTrack format includes support for creating GTrack subtypes, that is, file
formats that adhere to only a subset of the GTrack specification. This allows
creation of more specialized, simpler parsers, while at the same time ensuring
that subtype GTrack files still work with full GTrack parsers. GTrack subtypes
may also be used to standardize special GTrack configurations, removing the
need for the individual GTrack files to include all the required meta
information. We encourage independent specification of subtypes catering to
specialized needs.

A GTrack subtype defines default values for header variables and/or the column
specification line. A subtype may also add new header variables or define how
parsers should interpret the values of any non-reserved columns. GTrack
subtypes must still conform to the GTrack specification. Interpretation of new
columns or header lines does of course require specialized parsers.

Example #1: FASTA
-----------------

As an example of the use of subtypes, we show how GTrack can be used in a
similar manner as conventional FASTA files [4]. Example file 4A is the subtype
specification file:

    #
    # GTrack example file 4A
    #
    # Specification of FASTA subtype for GTrack.
    # Available at http://www.gtrack.org/fasta.gtrack
    #
    ##GTrack version: 1.0
    ##GTrack subtype: FASTA
    ##Subtype version: 1.0
    ##Subtype adherence: strict
    ##Track type: function
    ##Value type: category
    ##Fixed-size data lines: true
    ##Data line size: 1
    ###value

When using the subtype, an "online" parser will download the subtype
specification file (above), and fill out the GTrack header with new default
values. The GTrack header may then be as simple as to include the URL of the
subtype specification, as in example file 4B:

    #
    # GTrack example file 4B
    #
    # This file makes use of the FASTA subtype specification shown as GTrack example
    # file 4A.
#
##Subtype URL: http://www.gtrack.org/fasta.gtrack
####seqid=seq0001
TAGACATTACCGCTAGGATGATGCGATCGATCGATCCCTCTGGATTAGGAGATCTCTAGATCGATGATATCCTCNNNNNN
NNNNNATTGCTCTAGCTCTAGCTCTAGCT
####seqid=seq0002
GATTACATATCGCGATCGACTCGCCACTATAACTTCGAGTCTGACGATGATGGGGGGG

GTrack subtype header lines
---------------------------

Subtype functionality is applied with the following header variables:

- GTrack subtype

    The name of the subtype of the GTrack format specification used for the
    file, if any.

    Developer notes
    ---------------
    Custom parsers that only support certain subtypes should check this header
    and give feedback to users if the subtype is not correct.
    ---------------

    Default value: ""

- Subtype URL

    URL to a GTrack file used as a specification/model for the GTrack subtype,
    if any. The subtype GTrack specification file is a normal GTrack file, but
    without bounding region specification lines or data lines. The header lines
    and the column specification line of a GTrack subtype model file are used
    as default values for other GTrack files that adhere to the subtype. Any
    other specifications/restrictions should be included as comments.

    Developer notes
    ---------------
    If a GTrack file contains a Subtype URL header line, the subtype
    specification file should be downloaded by the parser. Incomplete URLs
    without a specified scheme (e.g. www.gtrack.org) should be treated as
    HTTP addresses (e.g. http://www.gtrack.org). After this, the header lines
    of the GTrack file should be parsed again, and any inconsistencies with
    the subtype headers should be treated according to the "Subtype adherence"
    header variable (see below).

    If the header variables "GTrack subtype" or "Subtype version" (see below)
    in a GTrack file do not correspond to the same header variables in the
    subtype specification file, the user should be informed. It is then up to
    the parser to decide whether or not to continue parsing.
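As an illustration of the two checks described in these notes, here is a
minimal Python sketch. The function names normalize_subtype_url and
subtype_mismatches are hypothetical. The sketch assumes that headers have
already been parsed into dictionaries with lower-cased variable names, and that
values are compared as exact strings; neither assumption comes from the
specification.

```python
def normalize_subtype_url(url):
    """Treat a URL without a scheme as an HTTP address (hypothetical helper)."""
    url = url.strip()
    if "://" not in url:
        return "http://" + url
    return url


def subtype_mismatches(file_headers, subtype_headers,
                       keys=("gtrack subtype", "subtype version")):
    """List header variables whose values differ between the two files.

    Assumption: both arguments are dicts keyed by lower-cased header
    variable names. Only variables present in both dicts are compared.
    """
    mismatches = []
    for key in keys:
        if key in file_headers and key in subtype_headers:
            if file_headers[key] != subtype_headers[key]:
                mismatches.append(
                    (key, file_headers[key], subtype_headers[key])
                )
    return mismatches
```

A parser could report each returned tuple to the user and then decide whether
to continue parsing.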
    If subtype specification downloading is not supported by the parser and a
    Subtype URL is provided in the GTrack file, the user should be informed
    that the GTrack Header Expander tool available at [3] may be used to merge
    the subtype headers with the GTrack file for use in "offline" parsers.
    ---------------

    Default value: ""

- Subtype version

    The version of the subtype specification used.

    Default value: 1.0

- Subtype adherence

    Subtype adherence may be specified in the subtype GTrack specification file
    and will then regulate the way a GTrack file may override the subtype
    specifications. The subtype adherence may also be specified in a GTrack
    file, and will in this case function as a signal to parsers. In this way,
    different parsers may allow different levels of adherence for GTrack files
    of the same subtype.

    The following values are allowed:

      strict
        Default values of header variables, as defined by the subtype, may not
        be overridden by the contents of a file. GTrack defaults may be
        overridden. This option may be used to force users of a subtype to
        follow the specification exactly.

      medium
        As strict, but allows redefinition of the column specification line in
        two aspects:

        1. The "value" and "edges" columns may be redefined, i.e. any non-core
           column names may be renamed to "value" or "edges", and vice versa.
           If the subtype specification includes "value" or "edges" columns,
           they must still be present in the redefined column specification
           line. Correspondingly, the header lines "track type", "value type",
           "vector length", "edges weight type", "edges weight vector length"
           and "undirected edges" may also be redefined by the GTrack file.

        2. Any number of extra columns, including reserved columns, may be
           added to the end of the column specification line.
        This option may be used to allow users of a subtype to add their own
        content, including redefining the "value" and "edges" columns, while
        maintaining the exact interpretation of the first columns as defined
        by the subtype.

      low
        As strict, but allows redefinition of the column specification line in
        a more relaxed manner: all columns specified in the subtype
        specification must be included, but can be put in any order, and extra
        columns may be added. Note that in this case, unlike in "medium",
        redefinition of the "value" or "edges" columns is not allowed, but a
        "value" or an "edges" column may be added, if not present. This option
        may be used to allow users of a subtype to adopt their own column
        ordering, while still requiring that a minimum set of columns is
        present, identifiable by column name.

      free
        Everything is allowed, as long as the GTrack specification is followed.
        With this option, the subtype specification serves only as an
        alternative definition of default values for the GTrack header lines
        and column specification line.

    Developer notes
    ---------------
    Note that if subtype adherence is specified in the subtype specification
    as anything other than "free", a GTrack file using the subtype
    specification may not redefine this value.
    ---------------

    Default value: free

Example #2: Short reads
-----------------------

As an extra example of the subtype functionality, we here propose a format for
storing short reads (e.g. from ChIP-seq experiments). Again, example file 5A is
the GTrack subtype specification file, and example file 5B is a GTrack file
making use of the subtype:

#
# GTrack example file 5A
#
# Specification of Short reads example subtype.
# Available at http://www.gtrack.org/shortreadsexample.gtrack
#
##GTrack version: 1.0
##GTrack subtype: Short reads example
##Subtype version: 0.9
##Subtype adherence: medium
##Track type: segments
###seqid  start  end  strand  read  quality

---

#
# GTrack example file 5B
#
# GTrack file making use of the Short reads example subtype.
#
##Subtype URL: http://www.gtrack.org/shortreadsexample.gtrack
###seqid  start  end  strand  read        quality  new
chr1      101    111  +       AGTAGATAGC  0.8      0
chr1      203    244  -       0:C;15:G    0.7      1

In this case, the "Short reads example" subtype defines an extra column, named
"read". A read is then either the exact read (using nucleotide symbols, with
the exact same length as the track element) or a semicolon-separated list of
colon-separated mismatches, where a mismatch is represented by a relative
position and a nucleotide symbol. The reference is here the genome assembly
specified in the description lines. The relative positions should follow the
indexing defined by the "0-indexed" header variable. Other columns are allowed.

Note that example file 5B includes an extra column, as allowed by the "medium"
subtype adherence setting.

------------------
    References
------------------

[1] To be added
[2] http://genome.ucsc.edu/FAQ/FAQformat.html
[3] http://hyperbrowser.uio.no
[4] http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml