The GWAS catalog now contains over 300,000 genetic associations, but for the majority of these the underlying causal gene, the gene mediating the phenotypic impact of the genetic variation, is unknown. While the genes close to the genetic variation often represent plausible candidate genes, a precise definition of “close” has been elusive.
Traditionally, genetic variants influencing the abundance of proteins or transcripts have been described as “cis-acting” or “trans-acting”, with the understanding that “cis-acting” variants exert their influence directly on the cognate gene while “trans-acting” variants influence some other gene, which then, as a downstream consequence, influences the transcript or protein being measured. In the absence of a simple method for determining the exact mechanism of any particular variant, researchers have typically relied on distance cutoffs to separate cis and trans variants, with the cutoff ranging from 250,000 bp up to 1 million bp. One prior attempt to empirically define the divide between cis and trans effects found that the percentage of cis-eQTLs that could be supported by allele-specific expression fell with increasing variant-TSS distance and could not be distinguished from that expected by error at a distance of between 0.85 and 1.3 million base pairs.
In this work we relied on a very well powered pQTL study, which allowed us to identify two populations of variant-gene distances: one population in which the frequency of associations depends on the distance between the variant and the gene, and a second population in which the distances are dictated by the mathematics of picking two points at random. The first population follows a Weibull distribution and is substantially contained within the interval from 0 to 1 Mb. For the second population, because most chromosomes are over 100 Mb long, two randomly selected intrachromosomal points are almost always (99%) more than 1 Mb apart. Thus, these two populations are well separated and can be interpreted as the mathematical representations of the biological processes of cis- and trans-QTLs.
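The separation between the two populations can be checked with a short calculation. The sketch below is illustrative only and assumes that both points are drawn independently and uniformly along a single chromosome, which real variant and TSS positions only approximate. Under that assumption, the probability that two points on a chromosome of length L are more than d apart is (1 - d/L)^2. The function names and the chromosome lengths used are our own choices for illustration, not part of the analysis.

```python
import random


def prob_farther_than(d_bp, chrom_len_bp):
    """P(|X - Y| > d) for X, Y independent and uniform on [0, chrom_len_bp]."""
    if d_bp >= chrom_len_bp:
        return 0.0
    return (1.0 - d_bp / chrom_len_bp) ** 2


def simulate(d_bp, chrom_len_bp, n=200_000, seed=0):
    """Monte Carlo estimate of the same probability (sanity check only)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.uniform(0, chrom_len_bp)
        y = rng.uniform(0, chrom_len_bp)
        if abs(x - y) > d_bp:
            hits += 1
    return hits / n


if __name__ == "__main__":
    cutoff = 1e6  # 1 Mb
    # Illustrative chromosome lengths in bp (assumed round numbers).
    for length in (50e6, 100e6, 150e6, 250e6):
        exact = prob_farther_than(cutoff, length)
        approx = simulate(cutoff, length)
        print(f"L = {length / 1e6:5.0f} Mb  exact = {exact:.4f}  simulated = {approx:.4f}")
```

Under these assumptions the probability is about 0.98 for a 100 Mb chromosome and rises toward 0.99 and above as the chromosome gets longer, so randomly paired points rarely fall inside the 1 Mb window.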
Previous analyses of molecular QTLs have similarly noted a rapid drop-off of observed associations with increasing variant-TSS distance. For example, Joehanes et al. used a multi-exponential decay with a median variant-TSS distance of 27 kb to model a large set of eQTLs measured in whole blood samples from over 5000 participants in the Framingham Heart Study. There is, however, no theoretical model to rationalize an exponential decay for this distribution.
As noted by Lieberman-Aiden et al., the distribution of promoter-enhancer Hi-C distances can be modeled using a power law with an exponent of approximately -1 [14, 15]. Plotted against variant-TSS distance, a power law with p(distance) proportional to 1/distance looks similar to an exponential decay, with p(distance) proportional to e^(-distance). However, a simple 1/distance power-law distance dependence would not generate the curve obtained in Fig. 1: because the bins increase in width with increasing distance at the same rate that p(distance) decreases, a power law would place an equal number of observations in each bin.
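The binning argument can be made concrete with a small sketch. It assumes logarithmically spaced distance bins (bin width proportional to distance), which is our reading of the binning described here, and compares the share of observations per bin for p(distance) proportional to 1/distance against an exponential decay. The bin range, bin count and exponential scale are arbitrary illustrative values, not quantities taken from the analysis.

```python
import math


def log_spaced_edges(lo, hi, n_bins):
    """Bin edges whose widths grow in proportion to distance."""
    ratio = (hi / lo) ** (1.0 / n_bins)
    return [lo * ratio ** i for i in range(n_bins + 1)]


def power_law_bin_mass(edges):
    """Share of p(x) ~ 1/x in each bin; the integral over [a, b] is ln(b / a)."""
    total = math.log(edges[-1] / edges[0])
    return [math.log(b / a) / total for a, b in zip(edges, edges[1:])]


def exponential_bin_mass(edges, scale):
    """Share of p(x) ~ exp(-x / scale) in each bin, normalized over the binned range."""
    def antiderivative(x):
        return -math.exp(-x / scale)

    total = antiderivative(edges[-1]) - antiderivative(edges[0])
    return [
        (antiderivative(b) - antiderivative(a)) / total
        for a, b in zip(edges, edges[1:])
    ]


if __name__ == "__main__":
    # Illustrative range: 1 kb to 1 Mb in 6 log-spaced bins.
    edges = log_spaced_edges(1e3, 1e6, 6)
    flat = power_law_bin_mass(edges)
    # Hypothetical decay scale of 50 kb, chosen only for the comparison.
    peaked = exponential_bin_mass(edges, scale=5e4)
    for (a, b), f, p in zip(zip(edges, edges[1:]), flat, peaked):
        print(f"{a:>10.0f} to {b:>10.0f} bp  power law = {f:.3f}  exponential = {p:.3f}")
```

With log-spaced bins the 1/distance power law gives the same share in every bin, whereas the exponential concentrates its mass in a few bins, so a peaked histogram on such an axis cannot come from a pure 1/distance dependence.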
The Weibull distribution used here was first described as a family of curves that has been applied to describe the distribution of particle sizes following fragmentation or fractionation. Brown and Wohletz provide a mechanistic derivation for the Weibull distribution which follows from repeated fragmentation of a larger structure, with each step resulting in a fractal fragmentation pattern (thus following a power law). Smaller fragments escape further fragmentation, resulting in a rapid drop-off of larger particle sizes.
A purely speculative conjecture is that the pattern we observe in these data results from a similar superposition of multiple processes. The Activity-by-Contact model of enhancer-promoter regulation suggests that the activity of a particular enhancer-promoter pair is increased by the strength (activity) of the enhancer and decreased by the distance between the enhancer and promoter [18, 19]. Since any given promoter can be influenced by multiple enhancers, the strongest genetic associations are more likely to come from closer enhancers. The dense packing of the chromosome provides the equivalent of the single fractionation event, imposing a fractal distance geometry on the genome. The fact that there are far more enhancers than promoters in the genome provides the equivalent of multiple fractionation events, potentially explaining the fit to the Weibull distribution for molecular QTLs in the range of 0 to 1 million base pairs (Fig. 3).
Trans-eQTLs and trans-pQTLs are generally understood to act on a gene proximal to the variant, which then influences the molecular trait of interest. The cis fraction in our combined model is then likely to reflect the extent to which we have correctly selected the set of true causal genes for a given study. Further, the model suggests that in general about 99.9% of GWAS variants should be explainable through a gene with a TSS within 1 megabase of the lead variant. Thus, if a large fraction of the variant-TSS distances fall into the long-range, distance-independent regime, our model suggests it is worth taking another look at the set of proposed or potential causal genes.
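One way to picture how a two-component model assigns a given variant-TSS distance to the cis or trans regime is the sketch below. It is a minimal illustration under stated assumptions, not the fitted model: the cis component is a Weibull density, the trans component is the density of the distance between two uniformly random points on one chromosome, and every parameter value (shape, scale, chromosome length, mixture weight) is hypothetical and chosen only to make the example run.

```python
import math

# Hypothetical parameters for illustration only (not fitted values).
SHAPE = 0.5            # Weibull shape k
SCALE_BP = 50_000.0    # Weibull scale, in base pairs
CHROM_LEN_BP = 150e6   # assumed chromosome length, in base pairs
CIS_WEIGHT = 0.5       # assumed mixture weight of the cis component


def weibull_pdf(d, shape=SHAPE, scale=SCALE_BP):
    """Weibull density at distance d > 0."""
    z = d / scale
    return (shape / scale) * z ** (shape - 1.0) * math.exp(-(z ** shape))


def random_pair_pdf(d, chrom_len=CHROM_LEN_BP):
    """Density of |X - Y| for X, Y independent and uniform on [0, chrom_len]."""
    if d < 0 or d > chrom_len:
        return 0.0
    return 2.0 * (chrom_len - d) / chrom_len ** 2


def cis_posterior(d, cis_weight=CIS_WEIGHT):
    """Probability that a distance d > 0 belongs to the cis component."""
    cis = cis_weight * weibull_pdf(d)
    trans = (1.0 - cis_weight) * random_pair_pdf(d)
    return cis / (cis + trans)


if __name__ == "__main__":
    for d in (1e4, 1e5, 1e6, 1e7, 5e7):
        print(f"distance = {d:>12,.0f} bp  P(cis | distance) = {cis_posterior(d):.3g}")
```

Under these assumed parameters, short distances are assigned almost entirely to the cis component and distances of tens of megabases almost entirely to the trans component. Where the crossover falls depends on the fitted parameters, which this sketch does not attempt to reproduce.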
Assuming that most GWAS variants impact biology through their influence on molecular traits such as transcript, protein or metabolite abundance, we expect that the cis and trans models and distributions observed here will apply to other, more complex or polygenic traits. It should be noted, however, that the exact mechanism linking the GWAS variant to the causal gene is not addressed in this model. It has been observed that a large fraction of pQTLs and metabolite QTLs are linked to missense variants, and that may skew the exact distributions somewhat when looking at other phenotypes or disease traits.
An important consideration in the dissection of individual loci is the observation that paralogous genes often exist near one another, meaning many genes with similar functions may exist in cis to the lead SNP. We avoided this scenario in this analysis by, for example, only including metabolite QTLs for which HMDB listed a single biochemically related gene on the entire chromosome (see Methods). In practice, researchers should look carefully at whether there are multiple plausible causal genes, such as paralogs, within the 944 kb distance cutoff recommended here.
An additional caveat is that this study focused on only the single strongest association per molecular trait (per chromosome), and this will tend to bias the set of variants as well. This simplification was applied here because, while it is straightforward to define the primary signal per locus, there are still multiple approaches to defining secondary or independent signals. As molecular QTL studies continue to grow in size and power, it will be important to revisit this analysis with respect to secondary signals.