Data and results for MCFDR manuscript

This is an overview of the input data and analyses performed in connection with a manuscript on MCFDR, an algorithm that makes Monte Carlo (MC) based significance computation practical in multiple testing settings. As an example, MCFDR allowed us to ask whether H3K4me2 histone modifications are preferentially located towards the upstream or the downstream end of 3466 genes, and to answer each question by MC-based hypothesis testing in less than an hour. In comparison, standard MC needed more than a day in both cases. Sequential MC reduced the running time to less than an hour for the downstream case (with few significant regions), but for the upstream case (with many significant regions) the running time was still more than 9 hours.

This document contains brief descriptions of, as well as references to, both the simulation results and the results on the above-mentioned biological case involving the positioning of H3K4me2 histone modifications in relation to genes.

Simulation results

As discussed in the article, the properties of the MCFDR algorithm can be studied by anyone using our simple web-tool for simulating MCFDR under different parameter settings.

Our main simulation run, as discussed in the article, was on 5000 tests with p-values simulated from a Beta(alpha, beta) distribution with alpha = 0.25 and beta = 25:

Results from main simulation
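
As a rough illustration of this setup, the Python sketch below draws 5000 underlying p-values from Beta(0.25, 25). It is not the code behind the web tool; the function name and the seed are assumptions made for the example.

```python
import random

def simulate_true_p_values(n_tests=5000, alpha=0.25, beta=25.0, seed=0):
    """Draw one underlying p-value per test from Beta(alpha, beta).

    The name, defaults and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [rng.betavariate(alpha, beta) for _ in range(n_tests)]

p_values = simulate_true_p_values()
mean_p = sum(p_values) / len(p_values)
print(len(p_values), mean_p)
```

The theoretical mean of this distribution is alpha / (alpha + beta), roughly 0.01, and because alpha is below 1 the density is heavily concentrated near zero.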

Bias for individually applied stopping criteria

In addition to the main run, we also performed two (otherwise identical) runs to study the effect of applying the FDR stopping criterion individually on each test, or on all tests simultaneously (the latter option is used in all main results of the article). As can be seen from the results, there was a small but noticeable bias when applying the FDR stopping criterion individually on each test:

Study of simultaneous criterion

Study of individual criterion

To study this bias in further detail, we performed multiple identical runs (except the random seed) to get distributions of how many tests are rejected under different schemes. The runs are collected here:

History 'Bias of individual criterion'

The resulting distributions are summarized in the figure below:

As can be seen from the plot, the individual application of the stopping criterion gives a clearly biased distribution compared to standard MC, while the other three schemes (standard MC, sequential MC and the main MCFDR scheme) give essentially similar distributions.

Influence of varying parameter h

In sequential MC, sampling stops after observing a given number h of samples that are more extreme than the observation. The value of h influences the number of samples, as well as the precision of the estimated p-values. As MCFDR incorporates the same stopping criterion as sequential MC, the influence of h also applies to MCFDR. To investigate the concrete influence of h, we performed simulation runs where h took the values 2, 5, 10, 20 and 50. The runs are collected here:

History 'Influence of h'
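
The stopping rule described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the runs. Each MC sample is modelled as a Bernoulli draw that is at least as extreme as the observation with probability true_p, the p-value estimates follow the usual sequential MC convention (h / n when stopping early, otherwise (exceedances + 1) / (max_samples + 1)), and all names as well as the chosen true_p, max_samples and repeats are assumptions.

```python
import random

def sequential_mc(true_p, h, max_samples, rng):
    """Run sequential MC for one test.

    Returns (estimated p-value, number of samples drawn).
    """
    exceedances = 0
    for n in range(1, max_samples + 1):
        # Simulation assumption: a sample is extreme with probability true_p.
        if rng.random() < true_p:
            exceedances += 1
            if exceedances == h:
                # Stop early once h extreme samples have been seen.
                return h / n, n
    # Sample budget exhausted before reaching h extreme samples.
    return (exceedances + 1) / (max_samples + 1), max_samples

def mean_samples(true_p, h, max_samples, repeats, seed=0):
    """Average number of samples used over repeated runs."""
    rng = random.Random(seed)
    total = sum(sequential_mc(true_p, h, max_samples, rng)[1]
                for _ in range(repeats))
    return total / repeats

for h in (2, 5, 10, 20, 50):
    print(h, mean_samples(true_p=0.05, h=h, max_samples=1000, repeats=200))
```

In this sketch a larger h means more samples per test before stopping, which is the trade-off against precision that the runs explore.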

The results were summarized in two sets of figures, showing the number of samples and the precision of estimated p-values, respectively. The first figure shows the number of samples (for varying proportions of true nulls) for each considered value of h (corresponding to Figure 1 of the MCFDR article):

 

This second figure shows how precisely p-values are estimated for each of the same values of h (corresponding to Figure 3a of the MCFDR article):

Biological example

We investigated the relative positioning of H3K4me2 histone modifications within genes. As discussed in the article, this analysis can be reproduced by anyone in The Genomic HyperBrowser, using H3K4me2 modifications as the analysis track (no second track), Ensembl genes as bins (e.g. those linked to below), and asking about nonuniform location.

In our analysis, to avoid many null hypotheses being accepted simply because of insufficient data, we focused on 3466 (Ensembl) genes that include at least 10 modifications:

Dataset 'All unique ensembl genes with overlaps having >9 H3K4me2 inside'
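
A filter of this kind can be sketched in Python as below. This is only an illustration of the idea, not the HyperBrowser procedure: it assumes modifications are represented as single positions on one chromosome, that gene intervals are half-open (start included, end excluded), and the function name and toy coordinates are made up.

```python
from bisect import bisect_left

def genes_with_enough_points(genes, points, min_count=10):
    """Keep genes that contain at least min_count points.

    genes:  list of (name, start, end), half-open, same chromosome.
    points: iterable of integer positions (hypothetical modification sites).
    """
    pts = sorted(points)
    kept = []
    for name, start, end in genes:
        # Number of points p with start <= p < end.
        count = bisect_left(pts, end) - bisect_left(pts, start)
        if count >= min_count:
            kept.append(name)
    return kept

# Toy data: 20 points inside geneA, a single point inside geneB.
toy_genes = [("geneA", 0, 100), ("geneB", 100, 200)]
toy_points = list(range(0, 100, 5)) + [150]
print(genes_with_enough_points(toy_genes, toy_points))
```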

We first asked whether H3K4me2 modifications occur preferentially at the upstream end of genes, as expected based on the literature. At 10% FDR, we conclude that this is the case for 2713 of the 3466 (Ensembl) genes considered:

Genes with H3K4me2 preferentially upstream
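
To make "significant at 10% FDR" concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of p-values. Benjamini-Hochberg is used here only as a familiar stand-in; the FDR procedure actually used by MCFDR is described in the article and may differ, and the function name and example p-values are made up.

```python
def benjamini_hochberg_reject(p_values, fdr=0.10):
    """Return a list of booleans, True where the test is rejected.

    Stand-in illustration of FDR control, not the MCFDR procedure itself.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value is below its step-up threshold.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected

example = [0.001, 0.2, 0.03, 0.5]
print(benjamini_hochberg_reject(example))  # [True, False, True, False]
```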

Asking the opposite question, whether H3K4me2 modifications occur preferentially at the downstream end of genes, we only get a few significant genes at 10% FDR (note that due to the stochasticity of Monte Carlo, the exact number of tests deemed significant may vary somewhat from run to run):

Genes with H3K4me2 preferentially downstream (4 tests deemed significant)

Alternative run, where 3 tests were deemed significant

The four gene regions for which H3K4me2 modifications occurred preferentially at the downstream end are at the following coordinates:

Dataset 'Four gene regions with H3K4me2 preferentially located downstream'

These regions can also be imported as custom tracks into the UCSC Genome Browser; alternatively, all H3K4me2 modifications within these four gene regions can be imported.


Standard Monte Carlo runs

Running standard MC with a maximum of 50 000 samples on the biological example gave essentially the same results as MCFDR, though with much higher running time for both cases:

Standard MC run for genes with H3K4me2 preferentially downstream

Standard MC run for genes with H3K4me2 preferentially upstream

Sequential Monte Carlo runs

Running sequential MC with a maximum of 50 000 samples on the biological example gave essentially the same results as MCFDR, though with much higher running time for the upstream case:

Sequential MC run for genes with H3K4me2 preferentially downstream

Sequential MC run for genes with H3K4me2 preferentially upstream

Running sequential MC with a maximum of 10 000 samples instead (to save running time) works fine for upstream positioning, but for downstream positioning it misses all significant regions (too few samples to allow significance after multiple testing correction):

Sequential MC with maximum 10 000 samples for genes with H3K4me2 preferentially downstream

Running sequential MC with an unbounded number of samples did not provide results after several weeks of running for the upstream case, and also used more than one million samples for certain tests in the downstream case:

Sequential MC with unbounded number of samples for genes with H3K4me2 preferentially downstream

Analytic evaluation of significance

As discussed in the article, assuming that points representing H3K4me2 modifications are randomly Poisson distributed is highly unrealistic. This assumption gives unrealistically low variance in the null distribution of the test statistic, so that too many null hypotheses end up rejected, as shown in the following runs:

Results for analytic significance test

Runs on all Ensembl genes

In addition to the main results above, we also tried runs without any filtering of gene regions based on data availability (the number of H3K4me2 modifications inside each gene):

Results for upstream preference, using all unique Ensembl genes

Results for downstream preference, using all unique Ensembl genes

Technical note on this document

The results above can either be inspected directly, or imported into your own Galaxy history (in the latter case, click the plus sign at the upper right corner, select 'start using the dataset', and finally click on the eye symbol shown in the history element in the right-hand panel).

Because of slight difficulties with HTML frames in links to tools, both the simulation tool and The Genomic HyperBrowser will show in a single frame (that is, without the Galaxy menu and history frames) during and after analyses when following the links to these two tools above.