LD and its impact on cross-population correlations of allele frequencies

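Linkage disequilibrium (LD) is the non-random association between alleles at different loci within a population. It is quantified by the coefficient of linkage disequilibrium:

D_{AB} = p_{AB} - p_A p_B

where A and B are two alleles at two different loci, p_A and p_B are their frequencies, and p_{AB} is the frequency of the haplotype carrying both.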
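As a small illustration, the sketch below computes the LD coefficient under the standard definition D = p_AB - p_A * p_B, together with the correlation form r = D / sqrt(p_A(1-p_A) p_B(1-p_B)). The function name and the toy frequencies are assumptions made for the example, not values from the analysis.

```python
def ld_coefficients(p_ab, p_a, p_b):
    """Return (D, r) for alleles A and B at two loci.

    p_ab: frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of alleles A and B
    """
    d = p_ab - p_a * p_b
    denom = (p_a * (1 - p_a) * p_b * (1 - p_b)) ** 0.5
    r = d / denom if denom > 0 else float("nan")
    return d, r


# Perfect linkage: A and B always occur together (toy values).
d_linked, r_linked = ld_coefficients(p_ab=0.5, p_a=0.5, p_b=0.5)

# Independence: the haplotype frequency equals the product of allele frequencies.
d_indep, r_indep = ld_coefficients(p_ab=0.25, p_a=0.5, p_b=0.5)
```

With these toy inputs the perfectly linked pair gives r = 1 and the independent pair gives D = 0.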

However, there is another kind of correlation between alleles, and that is the correlation of allele frequencies between populations.

The cross-population correlation between two unlinked alleles is expected to be r = 0, whereas linkage disequilibrium increases the cross-population correlation. Two alleles that are perfectly linked should have a cross-population correlation of 1, equal to their within-population LD. However, there is a phenomenon known as “linkage breakdown”. As far as I know, there are no publications trying to quantify linkage breakdown in human populations.

Linkage breakdown reflects the extent to which the correlation between true and predicted values decays, approximately linearly, with decreasing genetic relatedness between the training and the target populations, due to different linkage disequilibrium patterns (Marigorta & Navarro, 2013). That is, if an association between gene X and phenotype Y is found in one population (the training population), its replicability in other populations will depend on their genetic distance from the training population. This is because SNPs found by GWAS are usually not the causal variants themselves but “tag” (proxy) SNPs, in LD with the real causal variants. If LD breaks down, the frequency distributions are affected as well. Hence, tag SNPs will not necessarily have the same allele frequencies as the causal SNPs in all populations.

To estimate the level of LD breakdown in a way that would also bear on the validity of my method, which is based on factor analysis of allele frequencies, I computed the cross-population correlation between the frequencies of SNPs in LD. This was then compared to the same correlation for random SNPs (with LD<0.5).

LD was calculated using the R package “rsnps”, with the CEU panel.

The frequencies of SNPs in LD (N=93) with a GWAS hit (rs301800) reported by Okbay et al. (2016) were downloaded from 1000 Genomes. The correlation between the frequency of each SNP’s minor allele and that of rs301800 was computed across populations. The average correlation was r=0.815.

Conversely, the average correlation between a SNP from the set of random SNPs and all the other SNPs was, as expected, not significantly different from zero (r=0.053).
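The comparison described above can be sketched roughly as follows: each SNP is a vector of allele frequencies (one value per population), and the quantity of interest is the mean Pearson correlation between a focal SNP and a set of other SNPs. The frequencies below are made-up toy values for five hypothetical populations, and the sketch assumes the alleles have already been polarized consistently (e.g. minor allele in the reference panel).

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_correlation_with_focal(focal, others):
    """Average cross-population correlation between a focal SNP and other SNPs."""
    return mean(pearson(focal, freqs) for freqs in others)


# Toy allele frequencies across five hypothetical populations.
focal = [0.10, 0.20, 0.35, 0.50, 0.60]
linked = [
    [0.12, 0.22, 0.33, 0.52, 0.58],
    [0.15, 0.18, 0.40, 0.45, 0.65],
]
avg_r = mean_correlation_with_focal(focal, linked)
```

Running the same function on a set of random SNPs gives the baseline against which the value for SNPs in LD is judged.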

This simulation is neither exhaustive nor conclusive, but it suggests that LD decay is unlikely to be a big problem, because the frequencies of SNPs in LD remain strongly correlated across the 26 populations. Further analysis limited to populations from individual continents would show whether LD breaks down in some continents more than in others. For example, do SNPs in LD among Europeans show more linkage breakdown among East Asians or Africans? One could look at the correlation between allele frequencies in East Asian and African sub-populations separately. If the correlation is stronger among East Asians, this would suggest that LD patterns among Africans are more different.




Marigorta, U.M., Navarro, A. (2013). High Trans-ethnic Replicability of GWAS Results Implies Common Causal Variants. PLOS Genetics, 9, e1003566. http://dx.doi.org/10.1371/journal.pgen.1003566




Height, IQ, polygenes: selection signal or noise?

Okbay et al. (2016) reported 162 independent SNPs that reached genome-wide significance (P < 5×10^-8) in the pooled-sex EduYears meta-analysis of the discovery and replication samples (N = 405,072). Of these, 161 SNPs were found in 1000 Genomes. These were divided into 32 subsets of 5 SNPs and factor analyzed. The correlations of the p value with the factor loadings and with corr x pop IQ (the correlation with population IQ) were r = -0.273 and r = -0.008, respectively. Moreover, the two vectors (factor loadings and corr x pop IQ) were intercorrelated (r = 0.223), implying that the internal coherence of the factors is correlated with their predictive validity.
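To make the per-subset step concrete, here is a minimal sketch of extracting loadings for one subset of SNPs. The text does not say which extraction method was used, so this example swaps in the first principal component of the SNP correlation matrix (found by power iteration) as a stand-in for a one-factor solution; the function names and toy frequencies are assumptions.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a frequency vector (assumes it is not constant)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def first_component_loadings(freqs_by_snp, iters=200):
    """Loadings of each SNP on the first principal component.

    freqs_by_snp: one list of allele frequencies per SNP, each across populations.
    """
    z = [standardize(snp) for snp in freqs_by_snp]
    k, n = len(z), len(z[0])
    # Correlation matrix between SNP frequency vectors.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    v = [1.0] * k
    for _ in range(iters):  # power iteration towards the leading eigenvector
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    # Loading = eigenvector entry scaled by sqrt(eigenvalue).
    return [x * eigval ** 0.5 for x in v]


# Toy subset: three SNPs across five hypothetical populations.
subset = [
    [0.10, 0.20, 0.30, 0.40, 0.50],
    [0.15, 0.25, 0.28, 0.45, 0.50],
    [0.30, 0.10, 0.40, 0.20, 0.35],
]
loadings = first_component_loadings(subset)
mean_loading = mean(loadings)
```

Repeating this for every subset yields one average loading per subset, which can then be correlated with the subsets' p values.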

The scatterplot is shown in figure 1.


The 4 most significant SNP sets (N=20 SNPs) were used to compute a polygenic score, and their 4 factor scores were averaged. These sets were chosen because they had the highest loadings, the highest correlation to population IQ and the lowest p values (average loading and correlation of 0.383 and 0.83, respectively, compared to 0.22 and 0.11 for the entire dataset), hence suggesting more signal in the data.
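The two summary scores can be sketched as follows. The text does not define the polygenic score explicitly, so this example assumes it is the unweighted mean frequency of the trait-increasing alleles in each population; the factor-score summary is simply the mean of the per-set factor scores. All names and numbers are illustrative.

```python
def polygenic_score(freqs_by_snp):
    """Mean frequency of trait-increasing alleles, per population (assumed definition)."""
    n_snps = len(freqs_by_snp)
    n_pops = len(freqs_by_snp[0])
    return [sum(snp[p] for snp in freqs_by_snp) / n_snps for p in range(n_pops)]


def mean_factor_score(factor_scores_by_set):
    """Average several sets' factor scores, per population."""
    n_sets = len(factor_scores_by_set)
    n_pops = len(factor_scores_by_set[0])
    return [sum(fs[p] for fs in factor_scores_by_set) / n_sets for p in range(n_pops)]


# Toy input: two SNPs and two factor-score vectors for two hypothetical populations.
ps = polygenic_score([[0.2, 0.4], [0.4, 0.6]])
fs_mean = mean_factor_score([[1.0, -1.0], [0.5, -0.5]])
```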

The largest height GWAS to date (Wood et al., 2014) identified 697 SNPs that reached statistical significance for their association with human height. Factor analysis was carried out on 69 sets of 10 SNPs.

The 10 most significant SNPs for height were chosen because they had a higher average factor loading (0.419) than the entire set (0.166), in fact the third highest among the 69 sets of 10 SNPs. Polygenic and factor scores are reported in Table 1. The factor scores are also reported in Tables 2 and 3, in descending order.

Table 1. Polygenic scores (PS) and factor scores (F) for the top significant SNPs from the educational attainment (IQ) and height GWAS.

Population PS_IQ IQ_Top_4_Fs_Mean Height_PS F_Height
Afr.Car.Barbados 0.339 -1.124 0.636 1.342
US Blacks 0.358 -0.904 0.612 0.662
Bengali Bangladesh 0.368 -0.051 0.503 -0.349
Chinese Dai 0.43 0.736 0.417 -1.381
Utah Whites 0.412 0.838 0.569 0.483
Chinese, Beijing 0.471 1.175 0.419 -1.456
Chinese, South 0.45 1.058 0.418 -1.504
Colombian 0.374 0.201 0.515 -0.103
Esan, Nigeria 0.345 -1.307 0.653 1.629
Finland 0.43 0.76 0.417 0.524
British, GB 0.421 0.832 0.551 0.299
Gujarati Indian, Tx 0.386 -0.059 0.524 -0.333
Gambian 0.342 -1.196 0.61 1.33
Iberian, Spain 0.419 0.728 0.552 0.245
Indian Telugu, UK 0.372 -0.127 0.521 -0.475
Japan 0.459 1.235 0.419 -1.568
Vietnam 0.435 0.845 0.417 -1.321
Luhya, Kenya 0.338 -1.306 0.618 1.263
Mende, Sierra Leone 0.332 -1.475 0.624 1.278
Mexican in L.A. 0.36 0.143 0.502 -0.561
Peruvian, Lima 0.304 -0.28 0.496 -0.803
Punjabi, Pakistan 0.39 0.091 0.519 -0.402
Puerto Rican 0.374 -0.012 0.525 0.254
Sri Lankan, UK 0.373 0.025 0.5 -0.576
Toscani, Italy 0.415 0.511 0.562 0.238
Yoruba, Nigeria 0.343 -1.338 0.638 1.285

Table 2. IQ factor scores sorted in descending order.

Population IQ_Top_4_factors_Mean
Japan 1.235
Chinese, Beijing 1.175
Chinese, South 1.058
Vietnam 0.845
Utah Whites 0.838
British, GB 0.832
Finland 0.76
Chinese Dai 0.736
Iberian, Spain 0.728
Toscani, Italy 0.511
Colombian 0.201
Mexican in L.A. 0.143
Punjabi, Pakistan 0.091
Sri Lankan, UK 0.025
Puerto Rican -0.012
Bengali Bangladesh -0.051
Gujarati Indian, Tx -0.059
Indian Telugu, UK -0.127
Peruvian, Lima -0.28
US Blacks -0.904
Afr.Car.Barbados -1.124
Gambian -1.196
Luhya, Kenya -1.306
Esan, Nigeria -1.307
Yoruba, Nigeria -1.338
Mende, Sierra Leone -1.475


Table 3. Height factor scores in descending order

Population Factor_Height_10SNPs
Esan, Nigeria 1.629
Afr.Car.Barbados 1.342
Gambian 1.33
Yoruba, Nigeria 1.285
Mende, Sierra Leone 1.278
Luhya, Kenya 1.263
US Blacks 0.662
Finland 0.524
Utah Whites 0.483
British, GB 0.299
Puerto Rican 0.254
Iberian, Spain 0.245
Toscani, Italy 0.238
Colombian -0.103
Gujarati Indian, Tx -0.333
Bengali Bangladesh -0.349
Punjabi, Pakistan -0.402
Indian Telugu, UK -0.475
Mexican in L.A. -0.561
Sri Lankan, UK -0.576
Peruvian, Lima -0.803
Vietnam -1.321
Chinese Dai -1.381
Chinese, Beijing -1.456
Chinese, South -1.504
Japan -1.568


There is a strong negative correlation between height and intelligence factor scores (r=-0.778).
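This correlation can be checked directly from the factor scores in Table 1. The sketch below pairs each population's IQ factor score with its height factor score (values transcribed from the table, in table order) and computes the Pearson correlation; the helper function is written out for self-containment.

```python
from statistics import mean

# (IQ_Top_4_Fs_Mean, F_Height) for the 26 populations, in the order of Table 1.
scores = [
    (-1.124, 1.342), (-0.904, 0.662), (-0.051, -0.349), (0.736, -1.381),
    (0.838, 0.483), (1.175, -1.456), (1.058, -1.504), (0.201, -0.103),
    (-1.307, 1.629), (0.76, 0.524), (0.832, 0.299), (-0.059, -0.333),
    (-1.196, 1.33), (0.728, 0.245), (-0.127, -0.475), (1.235, -1.568),
    (0.845, -1.321), (-1.306, 1.263), (-1.475, 1.278), (0.143, -0.561),
    (-0.28, -0.803), (0.091, -0.402), (-0.012, 0.254), (0.025, -0.576),
    (0.511, 0.238), (-1.338, 1.285),
]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


iq = [s[0] for s in scores]
height = [s[1] for s in scores]
r_iq_height = pearson(iq, height)
print(round(r_iq_height, 3))
```

If the table values are transcribed correctly, the result should land close to the correlation reported above.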

Population IQ estimates (Piffer, 2015) correlated with the average factor score at r=0.923 and with the polygenic score at r=0.867. The very high correlation for the factor score lies above the 99% C.I. obtained from a simulation using 200 iterations on random SNPs.
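A simulation of this kind could look roughly like the sketch below: draw random SNP sets repeatedly, compute a score for each set, correlate it with the population-level target, and take the central 99% of the resulting correlations as the null interval. This is only an outline under stated assumptions: it uses the mean allele frequency as the score instead of a factor score, and the SNP pool and target values are randomly generated placeholders, not real data.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def null_interval(snp_pool, target, set_size, iterations=200, level=0.99, seed=0):
    """Empirical interval of correlations between random-SNP scores and a target."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(iterations):
        sample = rng.sample(snp_pool, set_size)
        # Score per population: mean frequency over the sampled SNPs (simplification).
        score = [mean(snp[p] for snp in sample) for p in range(len(target))]
        correlations.append(pearson(score, target))
    correlations.sort()
    k = int((1 - level) / 2 * iterations)  # number of values trimmed from each tail
    return correlations[k], correlations[len(correlations) - 1 - k]


# Placeholder data: 500 random "SNPs" across 26 populations and a random target.
data_rng = random.Random(1)
pool = [[data_rng.random() for _ in range(26)] for _ in range(500)]
target = [data_rng.gauss(100, 10) for _ in range(26)]
low, high = null_interval(pool, target, set_size=20)
```

An observed correlation falling above `high` would then be read as exceeding what random SNP sets produce at that level.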

East Asians top the IQ rankings but are at the bottom of the height rankings. The opposite is true of African populations. Europeans have mid-high scores for both IQ and height, whereas South Asians and Hispanics/Latinos have mid to low scores on both traits.

The higher internal (i.e. factor loadings) and external (i.e. corr x IQ) coherence of factors extracted from more significant SNPs, together with the different patterns observed for height and IQ, suggests that these SNPs represent a signal of polygenic selection and not merely phylogenetic autocorrelation. Another important finding is that the signal is restricted to the most significant hits of each GWAS.

The individual scores depend on the choice of SNPs and on the computational method (e.g. polygenic vs. factor scores), but the overall pattern is robust: it is fairly consistent across GWAS samples and publications.




Okbay, A., Beauchamp, J.P., Fontana, M.A., Lee, J., Pers, T.H., et al. (2016). Genome-wide association study identifies 74 loci associated with educational attainment. Nature, doi:10.1038/nature17671

Piffer, D. (2015). A review of intelligence GWAS hits: Their relationship to country IQ and the issue of spatial autocorrelation. Intelligence, 53, 43-50.

Wood, A.R., Esko, T., Yang, J., et al. (2014). Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics, 46(11), 1173-1186.