coloc-stats
To get started:
- Upload datasets of genomic locations (BED files) or import sample data into your Galaxy history.
- Go to the coloc-stats tool to analyze your datasets with multiple methods.
- View your results by clicking the eye icon on the history element to the right.
- View an example analysis of colocalization of transcription factor binding, an example analysis of GWAS SNPs versus cell-specific open chromatin data, or get a quick feel for the tool with this one-minute example analysis.
- Read the Documentation, Manual and FAQ for detailed help on diverse topics.
- Look at the #News section under the Documentation tab to learn about upcoming features.
- Watch the screencast below to see how to get started.
Documentation
# What is co-localization analysis?
Co-localization analysis
High-throughput sequencing methods assay various features of the genome, including regulatory elements, transcription factor binding sites (TFBS), transcripts and so on. Functionally related genomic features often co-occur within a genomic sequence (e.g., co-occurrence of TFBS). One important way of determining whether genomic features are functionally related is to search for their significant co-localization (based on overlap or proximity). The methodology that determines a significant co-localization of genomic features is often referred to as co-localization analysis, or alternatively as co-occurrence analysis or region set enrichment analysis.
Basic overview of a typical co-localization analysis
Genomic features are mapped on the reference genome using chromosomal coordinates (e.g., chr1:100-1000), which are simply the start position and end position of a sequence of nucleotides on a chromosome. Such a representation of a nucleotide sequence is referred to as a genomic region or genomic interval. One of the main outputs of genomic assays is a list of genomic regions, and such a set of genomic regions is referred to as a genomic track or region set. Arithmetic set operations are performed between genomic tracks to determine the amount of overlap (overlapping sequence nucleotides) or spatial proximity (e.g., geometric distance between two genomic regions). This is followed by statistical testing to determine whether the observed overlap or spatial proximity is likely due to chance.
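The set operations described above can be illustrated with a minimal Python sketch. It is not the implementation used by coloc-stats or by any of the integrated tools. The toy tracks, the function names, and the assumption of half-open (start, end) intervals that are sorted and non-overlapping within each track on a single chromosome are all made up for illustration.

```python
# Toy tracks on one chromosome: half-open (start, end) intervals,
# sorted and non-overlapping within each track (assumptions of this sketch).
query = [(100, 200), (500, 650), (900, 1000)]
reference = [(150, 300), (600, 700), (2000, 2100)]


def overlap_bp(a, b):
    """Total number of base pairs shared by two sorted, non-overlapping tracks."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            total += end - start
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def overlap_count(a, b):
    """Number of intervals in a that overlap at least one interval in b."""
    return sum(any(s < e2 and s2 < e for s2, e2 in b) for s, e in a)


def nearest_distances(a, b):
    """Gap in bp from each interval in a to the closest interval in b (0 if they overlap)."""
    return [min(max(s2 - e, s - e2, 0) for s2, e2 in b) for s, e in a]


print(overlap_bp(query, reference))         # 100
print(overlap_count(query, reference))      # 2
print(nearest_distances(query, reference))  # [0, 0, 200]
```

These three numbers are observed values of possible test statistics. Whether they are larger or smaller than expected by chance is what the subsequent statistical test has to decide.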
# What is coloc-stats and what can it do?
Coloc-stats allows a user to:
- Make the most appropriate analysis choices given their research question.
- Find a relevant method (amongst several) that supports their analysis choices.
- Assess the robustness of conclusions by running and inspecting results across several methods.
- Extend methods that only support a pair of tracks to consider track collections.
# Input, File formats & Genome assemblies
Required files
File formats
Invalid file formats
File size limitations
Genome assemblies
# Output
Explaining the results pages
# News
New features planned for implementation (Feb-Mar 2018)
- Although not a feature as such, we plan to add many screencast tutorials to help users with distinct use cases.
- Handling confounding feature tracks
- Handling local heterogeneities using a third genomic track
- Heatmaps for many-against-many analyses
- Counting overlap with customizable flank size
- Allowing the user to choose the coordinate to be used (start, midpoint or end) when computing distances
- Allowing the user to choose among different distance metrics (currently only a default distance metric is being used)
- Allowing the user to select among different types of correlation metrics
# Legend of tool parameters & Output pages
Parameter | Explanation |
---|---|
Running mode: | Basic mode with partly tool-specific defaults performs co-localization analysis with six different tools using partly their default settings. Only a few main selections need to be made by the user. Advanced mode allows the selection of co-localization analysis tools/methods based on detailed choices of parameters and methodological assumptions. After selecting one or many specific parameter values, a list of compatible tools will be shown. |
Reference genome: | Support for hg19, hg38, mm9 and mm10 is integrated in the system. Analysis can be performed on other reference genomes by uploading a custom chromosome lengths file. |
Custom reference genome: | Upload the chromosome lengths file of a reference genome of your choice, if the default reference genomes are not suitable for your analysis. The file should be tab separated with two fields; the first field should contain the chromosome name (e.g. chr22) and the second field should contain the length of the chromosome (e.g., 123456). |
Type of co-localization analysis: | Here you can choose to perform the co-localization analysis either between a pair of genomic tracks or between a single genomic track and a large collection of genomic tracks. For more information about co-localization analysis, refer to the following citations: [1] https://doi.org/10.1186/gb-2010-11-12-r121 [2] https://doi.org/10.1101/157735 [3] https://doi.org/10.1093/bioinformatics/btv612 |
Upload files: | Upload files through the upload button in the top left corner under the tools menu. The file will appear in the Galaxy history in the right menu panel. For now, only BED files are supported. It is strongly advised to adhere to the BED file format specifications. If your file is in a format other than BED, you can use the tool on the left-hand menu to convert between file formats. Please see "Format and convert tracks" under the tools menu. |
Query track: | The genomic regions obtained through an experiment of interest are usually used as the query track. For example, a query track file could contain the genomic regions of SNPs that reached a certain significance level in a GWAS. |
Reference track: | A reference track usually contains genomic regions that you hypothesise to be associated with your query track. For example, a reference track could contain the genomic regions of gene promoters. |
Collections of reference tracks: | As the name suggests, this means multiple reference tracks of interest. For example, the collection of reference tracks could contain the genomic regions of regulatory elements in multiple tissues (multiple tracks). The large collection of genomic tracks can be a core database (see the next row) or a custom collection of reference tracks. To build a custom reference track collection, one can upload several BED files through the upload functionality (drag and drop works) or upload a tar file containing several BED files, and then build a collection of genomic tracks (which we refer to as a GSuite). |
Core databases: | For the convenience of users, we provide collections of reference tracks that are of wide interest (largely curated by individual tools). We refer to these collections of reference tracks as core databases. Instead of selecting user-specific reference track collections, the user can choose one of the core databases for analysis. Examples of such core databases include data from the ENCODE project and the Roadmap Epigenomics project, such as transcription factor binding sites, DNase I hypersensitive sites and so on. |
Background regions: | The genomic regions in a genomic track file are typically the result of some form of genomic assay analysed on a high-throughput sequencing or genotyping platform, where some predefined regions of the genome are assayed (e.g., all the SNPs, transcripts, exonic regions and so on). The genomic regions found by such assays are thus restricted to the regions queried on the technology platform. Such background regions represent the universe of regions that could possibly have ended up in the genomic tracks being queried for co-localization. As an example, when testing the co-localization of a SNP set with other annotations, the background set could be all the SNPs covered by the technology platform, which are all assumed to have equal probability of being included in the SNP set of interest. The statistical test (null model) should ideally restrict the analysis space to the regions queried on the technology platform. Some tools (but not all) make it possible to restrict the analysis to a background set of regions, either by excluding regions supplied by the user or by performing the analysis only against an explicit set of background regions supplied by the user. As only a few tools accept background regions, supplying that file is not always required. Only BED files are supported for now. |
Runtime mode: | Depending on the analysis choices and the run times of the corresponding compatible methods, the run time can vary from minutes to hours. Choosing a quick mode reduces the run time of the compatible methods, for instance by reducing the number of permutations. |
Test statistic: | A test statistic in co-localization analysis provides a quantitative summary of the relationship between a pair of genomic tracks. Typically, it can measure the overlap or distance or correlation between two genomic tracks. |
Overlap: | Overlap measure can use counts (total number of overlapping intervals between two genomic tracks) or size (total number of overlapping base pairs between two files). |
Proximity (distance): | Proximity (distance) can be computed to either start or midpoint or the closest coordinate. It can either be an absolute distance or average log distance. |
Correlation metric: | A correlation metric can be obtained genome-wide (overall relationship), at a fine-scale, or a local correlation (region-level). |
Clumping tendency: | Genome architecture is known to have a quite complex dependency structure, where genomic features are not uniformly distributed but are known to co-occur in clumps. This clumping tendency of genomic elements should ideally be handled by the null model of the statistical test. As an example of the known clumping tendencies, refer to the following reference: https://www.ncbi.nlm.nih.gov/pubmed/18691400 |
What is coloc-stats
Tools integrated in coloc-stats
About us
-
- I know some co-localization analysis tools. What is the purpose of a new tool?
- The existing co-localization analysis tools vary in the statistical methodology and analysis approaches, thus potentially providing different conclusions for the same research question. As the findings of co-localization analysis are often the basis for further follow-up experimental investigations, assessing the robustness of the findings is warranted. However, there exists no unified interface to perform co-localization analysis across a multitude of analytical methods and method-specific options (e.g., test statistics, resolution, null models, and whether or not they can handle local genomic heterogeneities and co-localizing confounding features). To address this gap, we propose coloc-stats, a single interface for multiple previously published co-localization analysis methods.
-
- What can I achieve more with coloc-stats, than just running other co-localization analysis tools?
Coloc-stats allows a user to:
- Make the most appropriate analysis choices given their research question.
- Find a relevant method (amongst several) that supports their analysis choices.
- Assess the robustness of conclusions by running and inspecting results across several methods.
- Extend methods that only support a pair of tracks to consider track collections.
-
- When and why should I use coloc-stats?
Coloc-stats is useful when:
- The user is not aware of all the analysis choices that would affect the conclusions of co-localization analysis
- The user is not familiar with all the existing methods and tools that support their analysis choices and assumptions
- The user is uncertain about the association between a pair of tracks and wants to assess the robustness of findings
- The user wants to run existing pairwise co-localization analysis tools in one-against-many mode
- The user wants to run several co-localization analysis tools at once
-
- I do not know how to get started, please help me
- To perform co-localization analysis, you need a query track and a reference track of interest in BED format. Once you have the BED files, upload them using the Upload button in the top left corner of the web page. If you do not know how to upload files into a Galaxy interface, follow a quick tutorial. Once you upload the files, they will appear in your history (the panel on the right margin). To know more about histories in Galaxy, see a tutorial here. Then go to the coloc-stats tool and choose the relevant parameter settings to execute. Once you execute the tool, the analysis will be performed by one or several tools in parallel, depending on the chosen parameter settings. This may take from a few minutes to several hours. A results element will be created in your history, which you can also monitor while the analysis is in progress by clicking on the “eye” icon on the history element. Once all the computations are completed, the history element will turn green. If the analysis fails for some reason, it will turn red.
-
- Is it okay to try out different methods and parameters until I get a significant result?
- The goal of running multiple colocalization analysis tools is NOT to find a single method or parameter choice that gives a significant p-value to report in a manuscript. Rather, it is to be doubtful of any significant finding reported by only a single method/parameter combination. It is NOT advisable to hunt for a desirable p-value that can be used as the basis for claiming a significant finding (a bad practice known as p-value hacking).
-
- Is it okay to use and report the results of only one method?
- The existing colocalization analysis tools vary to a large extent in the statistical methodology and analysis approaches, thus potentially providing different conclusions for the same research question. Therefore, it is highly recommended to assess the robustness of the findings across different tools and parameter choices. It is also a good practice to report the results of all tools. If the results of only one method are to be reported, that should be based on explicit reasoning in terms of which analytical assumptions are reasonable, not based on which p-values are desirable in themselves (a bad practice known as p-value hacking).
-
- What should I do if different methods give disagreeing results?
- Different methods may not necessarily agree on what may be considered a significant finding. They may provide potentially different conclusions for the same research question. This is not surprising, since the different tools use different statistical methodologies and analysis approaches. Before reporting a finding, it is thus our opinion that researchers should check whether the finding also holds when applying alternative methodologies. In statistical testing there are no simple context-free rules for how to conclude when alternative methodologies give disagreeing results, since some underlying assumptions may be more appropriate for the analytical scenario than others. In the article on null models for genomic data, we motivated the general advice of being conservative and concluding based on the weakest significance level achieved. If one is to deviate from this and rather trust the results of other particular methods, one should have (and provide to readers) explicit reasons for considering the underlying assumptions of these to match the analytical scenario.
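The conservative advice above (conclude based on the weakest significance level achieved) can be sketched in a few lines of Python. The method names and p-values below are hypothetical placeholders, not output from coloc-stats. Methods that report no p-value are represented as `None` and skipped, which is an assumption of this sketch.

```python
import math


def most_conservative(p_values):
    """Return (method, p) for the largest p-value, ignoring methods without one."""
    usable = {
        method: p
        for method, p in p_values.items()
        if p is not None and not math.isnan(p)
    }
    if not usable:
        raise ValueError("no method returned a p-value")
    method = max(usable, key=usable.get)
    return method, usable[method]


# Hypothetical results from three methods run on the same pair of tracks.
results = {"method_A": 0.001, "method_B": 0.04, "method_C": None}

method, p = most_conservative(results)
print(f"Conclude at p = {p} (weakest evidence, from {method})")
```

Taking the largest p-value is only the default. As noted above, trusting a particular method instead calls for explicit reasoning about why its assumptions match the analytical scenario.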
-
- What is the Galaxy project?
- If you are not familiar with the Galaxy project, you can read a < 10-sentence abstract and visit the Galaxy project homepage to know more about it.
-
- I am new to the Galaxy interface; give me a quick overview on how to navigate.
- You can follow a quick and brief tutorial on using a Galaxy interface (takes ~ 2 min).
-
- Is coloc-stats available as a command line tool?
- As of now, coloc-stats is available only through the web server. A command-line tool might be a future possibility.
-
- Why is it taking ages to run a coloc-stats analysis?
- Depending on the number of genomic tracks being analysed and on the number of tools and tool-specific configuration variants chosen, the analysis might take from a few minutes to several hours. Some tools in particular are relatively slow by nature. To quickly confirm that this is not a server issue, you can run another analysis with fewer genomic tracks and tools to check whether the slowness persists. Depending on the traffic to the server, the webpage might sometimes appear slow and the computations might be queued. This should be temporary, and the webpage should return to its normal responsiveness within a few minutes to hours. If that is not the case, you can report it to us to get support.
-
- Is there a way of programmatically accessing coloc-stats page?
- We do not yet have a programming interface, but this will be a possibility in the near future.
-
- The coloc-stats webpage is quite slow. Is there something I can do to fix it?
- Depending on the traffic to the server, the webpage might sometimes appear slow. This should be temporary, and the webpage should return to its normal responsiveness within a few minutes to hours. If that is not the case, you can report it to us to get support.
-
- Something is causing an error, I do not know what it is. What do I do now?
- We have observed that some co-localization analysis tools sometimes throw errors for various reasons. However, the coloc-stats tool itself should still run all the other methods (those that pass) and show their results. If this is not the case (i.e., if the coloc-stats tool itself throws an error), the history element will turn red. When the coloc-stats tool itself throws an error, you can share the history with us to get support.
-
- My input file is causing an error!
- Be aware that coloc-stats currently accepts only BED files. If you have your genomic regions in other file formats, you have to convert them to BED format. BED is a standard file format to store genomic regions, where the first three fields are required to be chr, start and end. Please refer to the BED file format specifications described on the UCSC browser page.
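A quick local check of the three required BED fields can catch many formatting problems before upload. The following Python function is a minimal sketch and not the validation performed by coloc-stats. The function name is hypothetical, and the check assumes tab-separated fields, which is stricter than what the UCSC description permits.

```python
def check_bed_lines(lines):
    """Return a list of (line_number, message) problems in the first three BED fields."""
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines, comments and UCSC header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append((n, "fewer than three tab-separated fields"))
            continue
        chrom, start, end = fields[:3]
        if not chrom:
            problems.append((n, "empty chromosome name"))
        elif not (start.isdigit() and end.isdigit()):
            problems.append((n, "start and end must be non-negative integers"))
        elif int(start) > int(end):
            problems.append((n, "start must not exceed end"))
    return problems


example = ["chr1\t100\t200\n", "chr1\t300\t250\n", "chr2 10 20\n"]
for line_number, message in check_bed_lines(example):
    print(f"line {line_number}: {message}")
```

To check a file on disk, the open file object can be passed directly, for example `check_bed_lines(open("my_regions.bed"))`, where the file name is a placeholder.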
-
- I am using a BED file as input, still it shows an error!
- Adhering to the BED format specifications is always recommended. Deviations from the file format specifications are a common cause of errors when running many bioinformatics tools. See more about the BED format specifications in the previous question. If you are sure that you adhered to the file format specifications but still encounter an error, please report it to us and we would be very happy to help.
-
- I am not sure if I completely understood the output page. Please explain.
- While the computations of various tools are ongoing, the results page shows statements indicating “in progress”. Once the computations are completed, the results page is populated with the results from one or several tools, depending upon the parameters and configurations chosen by the user. The results page shows the parameters chosen for the analysis. When you run a pairwise co-localization analysis (between two genomic tracks), a plot is displayed showing the negative logarithmic p-values obtained through different methods/configurations chosen. A simplistic indication is also provided to the user on how to interpret the findings to allow the user to make an informed decision. A table showing p-values, enrichment statistics and full results (as given out by the specific tool) of each tool/configuration is also provided. If any chosen tool fails or reports an error, the error messages are shown in a separate table.
-
- The links to the full results pages of individual tools are broken. What do I do now?
- This is not the expected behaviour of the coloc-stats tool. Please take a minute to report this to us and we would be very happy to help.
-
- The p-values are set to NA for some methods. What does this mean?
- Some tools by their nature (e.g., IntervalStats) do not currently return a p-value for the overall association between tracks. All the other tools should return a p-value unless they throw an error for the input genomic tracks. If a p-value is missing for one of these tools, refer to the error messages table to see whether an error message was thrown for the tracks of interest. If you observe a trend of consistent error messages, something has probably gone wrong. If that is the case, please take a minute to report this to us and we would be very happy to help.
-
- Why are the p-values of different methods different, when they are analysing the same research question?
- Many of the co-localization analysis methods use some form of statistical test to determine the significance of co-localization. All the statistical tests assume some form of null model, which can vary from being too simplistic to being too cautious. How the null model is defined (i.e., what properties of the real data it preserves and how the remaining properties are distributed) affects the conclusions of a statistical test. Thus, it is not surprising to see different p-values when using different methods. However, to be on the safe side, one should be cautious when interpreting a less robust finding.
-
- Some tables are empty/contain empty fields. What does this mean?
- If no errors are thrown, the table containing error messages will be empty. This is expected behaviour. Other than that, the tables should not be empty or contain empty fields. If they do, please take a minute to report this to us and we would be very happy to help.
-
- How can I share my analysis history with anyone?
- See here on how to share your analysis history with anyone.
-
- I was not able to reproduce the analysis that I did earlier.
- All the analyses you perform are stored in the history, provided that you create a user account and store your analysis histories. To know more about histories in Galaxy, see a brief tutorial here.
-
- I was not able to replicate the findings. The p-values seem to change every time I run the analysis.
- This is expected behaviour in statistical tests that involve permutations and random sampling. Since setting a seed is not supported across all the tools integrated into coloc-stats, we currently do not provide functionality to set a seed. On the other hand, if the p-values do not remain significant at a given threshold across runs, that finding should always be interpreted with caution.
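The run-to-run variation can be reproduced with a toy Monte Carlo test. The Python sketch below is illustrative only. The uniform null model, the chromosome length, the reference regions, the observed count and the function names are all assumptions made for this example and do not correspond to any tool integrated into coloc-stats.

```python
import random

CHROM_LEN = 10_000                          # hypothetical chromosome length
REFERENCE = [(1000, 2000), (5000, 5500)]    # hypothetical reference regions
N_QUERY = 20                                # number of query positions
OBSERVED_HITS = 7                           # hypothetical observed overlap count


def hits_under_null(rng):
    """Count how many uniformly placed query positions fall inside reference regions."""
    positions = (rng.randrange(CHROM_LEN) for _ in range(N_QUERY))
    return sum(any(s <= p < e for s, e in REFERENCE) for p in positions)


def monte_carlo_p(observed, null_draw, n_permutations, seed=None):
    """One-sided Monte Carlo p-value with the usual +1 correction."""
    rng = random.Random(seed)
    at_least = sum(null_draw(rng) >= observed for _ in range(n_permutations))
    return (at_least + 1) / (n_permutations + 1)


# Different seeds stand in for independent runs of the same analysis.
for seed in (1, 2, 3):
    print(seed, monte_carlo_p(OBSERVED_HITS, hits_under_null, 1000, seed))
```

The printed p-values will typically differ slightly between seeds, while a fixed seed reproduces the same value. More permutations reduce the spread at the cost of run time.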
-
- Which genome assemblies are supported by coloc-stats?
- Currently hg19, hg38, mm9 and mm10 are supported out of the box in the coloc-stats tool. However, any genome assembly of any species can be used to perform the analysis. For that, you only need a text file containing the chromosome lengths.
-
- How can I use coloc-stats for a different species or genome assembly?
- To use coloc-stats for a genome assembly or species that is not installed in the coloc-stats tool, all you need is a text file containing the chromosome lengths of that genome assembly. Select custom reference genome in the “reference genome” section of the tool, which will prompt you to upload a chromosome lengths file. Upload that file and continue with the remaining analysis choices as usual.
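A chromosome lengths file is a plain tab-separated text file with two fields per line: the chromosome name and its length. The Python sketch below writes and re-reads such a file's contents. The chromosome names and lengths are made-up placeholders, and the function names are hypothetical.

```python
def format_chrom_lengths(lengths):
    """Render {chromosome: length} as one 'name<TAB>length' line per chromosome."""
    return "".join(f"{name}\t{length}\n" for name, length in lengths.items())


def parse_chrom_lengths(text):
    """Parse chromosome lengths text, raising ValueError on malformed lines."""
    lengths = {}
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError(f"line {n}: expected 'chromosome<TAB>length'")
        lengths[fields[0]] = int(fields[1])
    return lengths


# Placeholder values; replace with the real lengths of your assembly.
toy_genome = {"chrI": 1000, "chrII": 2500}

text = format_chrom_lengths(toy_genome)
print(text, end="")
```

Parsing the text back with `parse_chrom_lengths(text)` is a simple way to confirm that the file has exactly two tab-separated fields per line before uploading it.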