How representative are 1000 Genomes samples?

1000 Genomes made an effort to collect representative samples of several (as of today, 26) ethnic groups. A typical condition is that 3 or 4 of the participant’s grandparents share the same ancestral origin or come from the same geographical region as the participant. However, in some instances the emphasis on genetic/ancestral “purity” was achieved by focusing on rural areas, which may or may not be representative of the entire population, particularly for some traits. Several studies have reported that city dwellers score higher than rural residents on IQ tests, and city dwellers also tend to be better educated. This would introduce bias when 1000 Genomes samples are used to compare populations on frequencies of alleles related to educational attainment or intelligence.

Unfortunately, detailed sample information for 1000 Genomes is (to the best of my knowledge) not reported on the 1000 Genomes website; I found it through the Wikipedia link to the Coriell Institute for Medical Research. I was unaware that this body existed, but day after day I realize that population genetics information is scattered all over the web and not as well organized as I used to believe.

The website in question reports basic information such as whether the samples come from unrelated (e.g. mother/father) or related (e.g. father/child) participants and technical info on the DNA samples.

Unfortunately, there is not much information regarding the individual participants or even the precise geographic origin (e.g. city or county). Furthermore, the samples are not described with the same level of detail. At the bottom of the page, I report the geographic information of each population.

A few samples can perhaps be considered representative of the population (PUR, YRI and FIN). The most detailed info is provided by the Iberian group, which made an effort to sample from all over the provinces of the country.

The sample from Japan comes mainly from Tokyo, although the grandparents came from all over the country. However, selective migration to capital cities can introduce bias.

The Vietnamese sample comprises individuals from Vietnam’s biggest city, hence is not representative of the population as a whole.

The Tuscan sample from Italy was collected in a single small town, hence is biased towards rural people of a specific town. Similarly, the British sample is rural, although scattered around a wider area.

We can say that the least representative sample, particularly regarding intelligence and education, is CHB, which comprises individuals from Beijing Normal University. CDX comprises individuals from the Xishuangbanna Health School of Xishuangbanna. I could not find much information regarding this school, but I suspect it is not a university.

Using the best set of SNPs for educational attainment (9 SNPs I selected because they were in linkage with others across 3 GWAS of educational attainment), which provided the best predictive power for the spatial and temporal comparison of populations, one can see a small advantage of CHB over the other Chinese samples (CHS and CDX): 1.511 vs 1.382 and 1.017. However, this advantage disappears with the set of intelligence SNPs and with the intelligence/EA replicated SNPs found by Sniekers et al. (2017). That said, Sniekers et al. (2017) used a dubious measure of fluid intelligence, and their sample is much smaller than the EA sample, so their findings carry less weight.

One could argue that the advantage of CHB over the other Chinese samples is evidence for the validity of a polygenic score. When the next GWAS of EA is published, we will be able to test this prediction.

Sample info:

ASW (African Ancestry in SW USA): The samples were collected from individuals who identified themselves primarily as African-American. All parents in the trios and duos, and all the unrelated individuals identified themselves as having four African-American grandparents who were born in the same general area of the Southwest USA.

ACB (African Caribbean in Barbados):  Adult parent-child trios who identified themselves as having at least three out of four grandparents who self-identify as African Caribbean and who were born in Barbados.

BEB (Bengali in Bangladesh): The samples are from a mix of parent-adult child trios and unrelated individuals who identified themselves as Bengali. All individuals identified themselves as having four Bengali grandparents.

GBR (British from England and Scotland): These cell lines and DNA samples were prepared from blood samples collected in Cornwall and Kent (England) and Orkney and Argyll & Bute (Scotland). All of the samples are from unrelated individuals who identified themselves as having all four of their grandparents born in the same rural area; each rural area was generally defined as being less than 40 miles apart from the next rural area. These samples can be considered representative of the areas in the UK from which they were collected.

CDX (Chinese Dai in Xishuangbanna): These cell lines and DNA samples were prepared from blood samples collected from individuals living in the community of Xishuangbanna Health School of Xishuangbanna, Yunnan, China. All of the samples are from unrelated individuals who identified themselves as having four Dai Chinese grandparents.

CLM (Colombian in Medellín, Colombia): These cell lines and DNA samples were prepared from blood samples collected in the Medellín, Colombia, metropolitan area. All of the samples are from mother-father-adult child trios. All parents in the trios identified themselves as having all four grandparents born in Colombia.

ESN (Esan in Nigeria): The samples are from a mix of parent-adult child trios and unrelated individuals who identified themselves as Esan. All individuals identified themselves as having four Esan grandparents.

FIN (Finnish in Finland): These cell lines and DNA samples were prepared from blood samples collected from unrelated individuals from Finland. All individuals identified themselves as having at least three out of four grandparents who were born in Finland, and 98% of individuals participating have all four grandparents born in Finland. The participants include some individuals with grandparents born in Finnish Karelia, a part of Finland until 1947, who also represent the Finnish population.

GWD (Gambian in Western Division – Mandinka): These cell lines and DNA samples were prepared from blood samples collected in the Western District of The Gambia. All of the samples are from parent-adult child trios who identified themselves as Mandinka. All parents in the trios identified themselves as having Mandinka parents of at least two generations.

GIH (Gujarati Indians in Houston, Texas, USA): These cell lines and DNA samples were prepared from blood samples collected in the Houston, Texas metropolitan area. All of the samples are from unrelated individuals who identified themselves as Gujarati and reported having at least three out of four Gujarati grandparents. “Gujarati” is a general term used to describe people who trace their ancestry to the region of Gujarat, located in the northwestern part of the Indian subcontinent, and who speak the Gujarati language. However, no attempt was made to clarify the meaning that donors attributed to their self-reported Gujarati identity.

CHB (Han Chinese Beijing): These cell lines and DNA samples were prepared from blood samples collected from individuals living in the residential community at Beijing Normal University. All of the samples are from unrelated individuals who identified themselves as having at least three out of four Han Chinese grandparents.

CHS (Han Chinese South): These cell lines and DNA samples were prepared from blood samples collected from southern Han Chinese individuals living in the Hu Nan and Fu Jian Provinces of South China. All of the samples are from mother-father-adult child trio families who identified themselves as having at least three out of four Han Chinese grandparents.

IBS (Iberian populations in Spain): These cell lines and DNA samples were prepared from blood samples collected throughout the Spanish territory. In order to assure representativeness of all geographical areas, samples were collected from individuals who identified themselves as having been born in the area and having all four grandparents (two generations) born in the same area. The total number of geographical areas was 50, corresponding to the 50 administrative provinces (geographical areas surrounding a medium-large city) which constitute Spain, including the area in the Iberian Peninsula as well as the islands. All samples consist of mother-father-adult child trios. At least two trios were collected from each province, smaller entities than the 17 different autonomous regions of Spain. Thus, this set of samples can be viewed as generally representative of the population of Spain, with a broad geographic spread. The overall group contains some individuals from the Basque Country and from the Canary Islands, sometimes regarded as differentiated genetically.

ITU (Indian Telugu in the UK): These cell lines and DNA samples were prepared from blood samples collected in the United Kingdom. The samples are primarily from unrelated individuals but include a small number of trios. All individuals identified themselves and their parents as Telugu.

JPT (Japanese in Tokyo, Japan): These cell lines and DNA samples were prepared from blood samples collected in the Tokyo metropolitan area. All of the samples are from unrelated individuals. Because it is considered culturally insensitive in Japan to inquire specifically about a person’s ancestral origins, prospective donors were simply told that the general aim was to include samples from people whose grandparents were all from Japan. The samples were collected from people who came from (or whose ancestors presumably came from) many different parts of Japan. Thus, this set of samples can be viewed as generally representative of the majority population in Japan.

KHV (Kinh in Ho Chi Minh City, Vietnam): These cell lines and DNA samples were prepared from blood samples collected from individuals living in Ho Chi Minh City, Vietnam. All of the samples are from unrelated individuals who identified themselves as having four Kinh Vietnamese grandparents.

LWK (Luhya in Webuye, Kenya): These cell lines and DNA samples were prepared from blood samples collected in Webuye Division, of Bungoma district in western Kenya. All of the samples are from unrelated individuals who identified themselves as having four Luhya grandparents.

MSL (Mende in Sierra Leone): These cell lines and DNA samples were prepared from blood samples collected in Sierra Leone. The samples are from a mix of parent-adult child trios and unrelated individuals who identified themselves as Mende. All individuals identified themselves as having four Mende grandparents.

MXL (Mexican Ancestry in LA, USA): These cell lines and DNA samples were prepared from blood samples collected in Los Angeles, California. All of the samples are from parent-adult child trios. All parents in the trios identified themselves as having at least three out of four grandparents who were born in Mexico. Note that the individuals whose samples are included in this set are different from those who provided samples for the “Mexican-American” panels included in the NIGMS Human Genetic Cell Repository, even though both sets of samples were collected in Los Angeles.

PEL (Peruvian in Lima, Peru): These cell lines and DNA samples were prepared from blood samples collected in the Lima-Callao, Peru, metropolitan area. All of the samples are from mother-father-adult child trios. All parents in the trios identified themselves as having four grandparents who were born in Peru.

PUR (Puerto Rican in Puerto Rico): These cell lines and DNA samples were prepared from blood samples collected throughout Puerto Rico. All of the samples are from mother-father-adult child trios. It was required that at least six of the eight great-grandparents of the child in the trio were Puerto Ricans. Because half of all Puerto Ricans live in different localities in the United States and there is constant migration back and forth between the U.S. and Puerto Rico, for purposes of this sample collection, trios were regarded as Puerto Rican based exclusively on the place of birth of the child’s great-grandparents. Because none of the Puerto Rico municipalities were excluded from the sampling, and because Puerto Rico is culturally homogeneous, these samples can be considered to be generally representative of all Puerto Ricans.

PJL (Punjabi in Lahore, Pakistan): These cell lines and DNA samples were prepared from blood samples collected in Lahore, Pakistan. The samples are from a mix of parent-adult child trios and unrelated individuals who identified themselves and their parents as Punjabi.

STU (Sri Lankan Tamil in the UK): These cell lines and DNA samples were prepared from blood samples collected in the United Kingdom. The samples are primarily from unrelated individuals but include a small number of trios. All individuals identified themselves and their parents as Sri Lankan Tamil.

TSI (Toscani in Italia): These cell lines and DNA samples were prepared from blood samples collected in a small town near Florence in the Tuscany region of Italy. All of the samples are from unrelated individuals who identified themselves as having at least three out of four Tuscan grandparents.

YRI (Yoruba in Ibadan, Nigeria): These cell lines and DNA samples were prepared from blood samples collected in a particular community in Ibadan, Nigeria. All of the samples are from parent-adult child trios. All parents in the trios identified themselves as having four Yoruba grandparents.

Piffer’s results replicated (again) by latest GWAS (N=147,194)

The results are new, but the game is getting old. However, given the replication crisis in the social sciences (which I had the misfortune of experiencing first-hand at my PhD lab), any replication should be welcomed with open arms.

In a recent paper, I published my estimates of genotypic intelligence/EA (educational attainment) or, more appropriately, of the coefficient of polygenic selection, since cognitive ability depends not only on common variants (those that add up to create a polygenic score) but also on rare variants (those missed by GWAS arrays) and de novo mutations (those that arise uniquely in each individual).

My latest estimates were published in June, one month before the Hill et al. paper came out on bioRxiv. My 9-SNP heavyweight was published in January 2017 (although the publication was delayed by a particularly tough, passive-aggressive, weak and slow Frontiers editor). This genetic heavyweight also predicted the evolution of intelligence within Europe since the Bronze Age.

In short, Hill et al. used wealth (household income) and educational attainment to power the search for intelligence genes. Put simply, these traits are genetically correlated, hence pooling them together should increase the power to detect signal. They call this approach MTAG (multi-trait analysis of genome-wide association studies). Actually it was invented by Turley et al., but it does not really matter because every week a new tool to power GWAS is invented, each one as fancy as the other, and all the names sound like GATTACA. The method is not very original, but it is common-sensical and relies heavily on brute force (something very common in this field). However, the authors were generous in providing the full list of SNPs in the bioRxiv preprint, something that is not to be taken for granted.

It might seem strange that household income was thrown in together with educational attainment and intelligence. The authors defend their position by citing the finding that “household Income shows a genetic correlation of rg = 0.82 with education and rg = 0.65 with the GWAS meta-analysis of Sniekers et al.”.

To many of us, even educational attainment seemed a rather poor proxy for cognitive abilities, and we are right to question whether adding an even less perfect proxy will increase power or just muddy the waters.

However, the authors validated their polygenic scores on childhood IQ and verbal-numerical reasoning, finding strong correlations: childhood IQ, rg = 0.84, SE = 0.06; years of education, rg = 0.90, SE = 0.0005; and verbal-numerical reasoning, rg = 0.85, SE = 0.0.

As usual, I computed the frequencies of the alleles with a positive beta in Hill et al. This is a large set (N=107) of loci that independently reached GWAS significance. Then I computed a polygenic score (PS or PGS) for each population as a weighted mean of these frequencies, using the beta coefficient of each SNP as the weight.
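The weighted mean described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `polygenic_score`, the SNP labels, the allele frequencies and the betas are all made up for the example and are not taken from Hill et al. or from 1000 Genomes.

```python
def polygenic_score(freqs, betas):
    """Beta-weighted mean of positive-allele frequencies for one population.

    freqs: dict mapping SNP id -> frequency of the allele with positive beta
    betas: dict mapping SNP id -> GWAS beta (assumed positive here)
    """
    if set(freqs) != set(betas):
        raise ValueError("freqs and betas must cover the same SNPs")
    total_weight = sum(betas.values())
    return sum(betas[snp] * freqs[snp] for snp in freqs) / total_weight


# Hypothetical values for illustration only (not real GWAS output).
betas = {"snp_a": 0.02, "snp_b": 0.03, "snp_c": 0.05}
freqs_pop1 = {"snp_a": 0.40, "snp_b": 0.50, "snp_c": 0.60}
freqs_pop2 = {"snp_a": 0.30, "snp_b": 0.45, "snp_c": 0.50}

pgs_pop1 = polygenic_score(freqs_pop1, betas)
pgs_pop2 = polygenic_score(freqs_pop2, betas)
print(f"pop1: {pgs_pop1:.3f}  pop2: {pgs_pop2:.3f}")
```

SNPs with larger betas pull the score more strongly towards their own frequency, which is the whole point of weighting rather than taking a plain average.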

What was the outcome? As can be seen in Table 1, the new scores are roughly similar to my previous estimates, giving top scores to East Asians, followed by Finns and then other Europeans. Latin Americans and Africans again get the lowest scores.

The correlations with my previous estimates are moderately high (0.65 for the Sniekers et al. intelligence factor, 0.79 for the 9 replicated EA SNPs and 0.8 for the Sniekers et al. intelligence/EA replicated SNPs). The correlation with population IQ is 0.64, which is not very high, because the South Asians (Pakistani, Indians) appear to have large positive residuals. There is also a very odd result: the Mende from Sierra Leone get a much higher score than all the other African populations, which did not happen with the scores obtained from the other GWAS. Maybe there is a typo or an error in the frequency file, or some genuine statistical anomaly.

It’s possible that this discrepancy is due to chance, or due to some genetic variants involved in wealth but not in educational attainment/intelligence.

A much larger GWAS will come out later this year or next year from the James Lee group, so I will update you then.

Table 1. Factor scores of “successful” alleles (intelligence, EA and household income alleles).

| Population | G factor (Sniekers et al.) | EA factor (Piffer, 2017; from Okbay et al., Davies et al. and Rietveld et al.) | Int-EA factor (Sniekers et al.) | PGS (Hill et al., 2017) |
| --- | --- | --- | --- | --- |
| Afr.Car.Barbados | -1.276 | -1.351 | -1.063 | -0.926 |
| US Blacks | -0.961 | -1.177 | -0.997 | -0.884 |
| Bengali Bangladesh | -0.075 | -0.209 | -0.66 | 0.249 |
| Chinese Dai | 1.35 | 1.017 | 1.251 | 1.197 |
| Utah Whites | 0.844 | 0.471 | 0.754 | -0.025 |
| Chinese, Beijing | 1.109 | 1.511 | 1.374 | 1.717 |
| Chinese, South | 1.208 | 1.382 | 1.635 | 1.727 |
| Colombian | 0.357 | 0.01 | -0.113 | -0.727 |
| Esan, Nigeria | -1.66 | -1.453 | -1.255 | -1.014 |
| Finland | 0.771 | 0.702 | 0.581 | 0.574 |
| British, GB | 0.797 | 0.745 | 0.782 | -0.341 |
| Gujarati Indian, Tx | -0.049 | 0.271 | -0.001 | 0.857 |
| Gambian | -1.358 | -1.397 | -1.186 | -0.846 |
| Iberian, Spain | 0.631 | 0.35 | 0.476 | -0.574 |
| Indian Telugu, UK | -0.074 | 0.049 | -0.212 | 0.249 |
| Japan | 0.878 | 1.342 | 1.321 | 1.768 |
| Vietnam | 1.267 | 1.346 | 1.925 | 1.888 |
| Luhya, Kenya | -1.599 | -1.488 | -1.255 | -1.017 |
| Mende, Sierra Leone | -1.444 | -1.403 | -1.165 | -0.367 |
| Mexican in L.A. | 0.215 | 0.056 | -0.259 | -0.895 |
| Peruvian, Lima | -0.06 | 0.05 | -0.762 | -1.021 |
| Punjabi, Pakistan | 0.066 | 0.24 | 0.035 | 0.336 |
| Puerto Rican | 0.375 | -0.004 | -0.208 | -1.154 |
| Sri Lankan, UK | -0.391 | 0.134 | -0.432 | 0.401 |
| Toscani, Italy | 0.764 | 0.248 | 0.677 | -0.371 |
| Yoruba, Nigeria | -1.684 | -1.443 | -1.243 | -0.803 |
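For readers who want to check how strongly two columns of the table agree, here is a minimal Python sketch of a Pearson correlation between the EA factor and the Hill et al. PGS. The two lists are copied from the rounded table values in the row order shown above; the helper `pearson_r` is written for this illustration, and because the inputs are rounded the result may differ slightly from the correlations quoted in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# EA factor column (Piffer, 2017), same row order as the table.
ea_factor = [
    -1.351, -1.177, -0.209, 1.017, 0.471, 1.511, 1.382, 0.01, -1.453,
    0.702, 0.745, 0.271, -1.397, 0.35, 0.049, 1.342, 1.346, -1.488,
    -1.403, 0.056, 0.05, 0.24, -0.004, 0.134, 0.248, -1.443,
]
# PGS column (Hill et al., 2017), same row order as the table.
hill_pgs = [
    -0.926, -0.884, 0.249, 1.197, -0.025, 1.717, 1.727, -0.727, -1.014,
    0.574, -0.341, 0.857, -0.846, -0.574, 0.249, 1.768, 1.888, -1.017,
    -0.367, -0.895, -1.021, 0.336, -1.154, 0.401, -0.371, -0.803,
]

r = pearson_r(ea_factor, hill_pgs)
print(f"Pearson r (EA factor vs Hill et al. PGS) = {r:.2f}")
```

The same function can be applied to any other pair of columns by swapping in the corresponding lists.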

No evidence for positive selection for human height

Have humans gotten taller? Yes, there is evidence that contemporary people are much taller than their ancestors. This phenomenon is known as the secular trend in height and has been particularly marked in the 20th century in Western countries, possibly as a result of improved health care and access to food (https://en.wikipedia.org/wiki/Human_height). Such a fast increase in height is usually taken to show the importance of the environment in physical growth, because genetic evolution operates on a much longer timescale and cannot take place in a few decades.

However, there is evidence for reduced mating and reproductive success of shorter males, together with a preference for average-height and tall men (Stulp et al., 2014), indicating that sexual selection is at work. This would lead us to expect (sexual) selective pressure for taller stature, and hence an increase in the frequencies of height-increasing alleles in contemporary human populations.

In a recently published paper, my colleagues and I (Woodley et al., 2017) found a higher frequency of IQ/educational attainment-increasing alleles in contemporary European individuals than in a sample of Bronze Age people from Europe and Western Asia, with odds ratios (for proportion of alleles in ancient vs modern) ranging from 0.8 to 0.9.

Wood et al. (2014) discovered 697 SNPs that were significantly associated with human height. I decided to look up the counts of these SNPs in modern and ancient populations using the same sample of Bronze Age people that was employed for the IQ/educational attainment study. A 2 x 2 contingency table shows the counts of positive and negative alleles for ancient and contemporary genomes.

Table 1. 2 x 2 contingency table with Positive and Negative GWAS Effect Allele Counts for Ancient and Modern Genomes.

| | Positive allele count | Negative allele count |
| --- | --- | --- |
| Ancient Genomes | 19283 | 19277 |
| Modern Genomes | 324781 | 332137 |

It can be seen that positive and negative alleles are almost equally distributed in both contemporary and ancient genomes. An odds ratio was computed, yielding an essentially null effect (OR = 1.022). Fisher’s exact test yielded significance, but this is due to the huge number of allele counts, as over 600 SNPs were employed. The magnitude of the effect is very small (and actually favors ancient populations).
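As a rough check on the size of this effect, the odds ratio can be recomputed directly from the four counts in the contingency table. The Python sketch below is an assumption-laden illustration rather than the analysis actually run: it uses the simple sample odds ratio with a Wald 95% confidence interval on the log scale instead of Fisher’s exact test, and the function name `odds_ratio_with_ci` is made up for the example. With these counts the odds ratio comes out at about 1.02.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio (a/b)/(c/d) with a Wald confidence interval.

    a, b: positive / negative allele counts in the first group
    c, d: positive / negative allele counts in the second group
    z:    normal quantile for the interval (1.96 gives roughly 95%)
    """
    odds_ratio = (a / b) / (c / d)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    ci_low = math.exp(log_or - z * se_log_or)
    ci_high = math.exp(log_or + z * se_log_or)
    return odds_ratio, ci_low, ci_high


# Counts from the 2 x 2 table (ancient vs modern genomes).
ancient_pos, ancient_neg = 19283, 19277
modern_pos, modern_neg = 324781, 332137

odds_ratio, ci_low, ci_high = odds_ratio_with_ci(
    ancient_pos, ancient_neg, modern_pos, modern_neg
)
print(f"OR = {odds_ratio:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

An interval that sits just above 1 is what one would expect from a tiny effect estimated on hundreds of thousands of allele counts: statistically detectable, but negligible in magnitude.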

This null finding is paradoxical and hard to interpret in light of the evidence for lower mating and reproductive success of shorter males in contemporary populations. It is possible that human stature did not affect reproductive success in traditional societies, where female choice was very limited and marriages were arranged by families. Hence the higher attractiveness of taller males (or lower attractiveness of shorter men) might not have translated into different fitness levels.

Indirectly, this finding also strengthens the effect that my colleagues and I found for the educational attainment/IQ alleles because it shows that the method we employed does not have a systematic bias towards modern populations for alleles that have positive GWAS beta. In other words, this finding rules out the possibility that our results were due to an artifact.

All we are left with is a very puzzling finding. One possible explanation is balancing selection, where average-height men enjoy higher reproductive success than short or very tall men, as suggested by Stulp et al. (2014). Another balancing force could be male preference for shorter females, counterbalancing the female preference for taller males. Finally, an advantage in times of resource scarcity for smaller bodies requiring less food might also have played a role in producing balancing selection. I am sure endless other interpretations are possible; you are welcome to offer yours.

Update: A paper was published in Nature Genetics last week (Capellini et al., 2017) showing selection on alleles reducing height among Eurasians around the GDF5 gene. Hence, whatever sexual selection pressure there was for greater height might have been counterbalanced by other selective pressures.

References:

Capellini, T.D. et al. (2017). Ancient selection for derived alleles at a GDF5 enhancer influencing human growth and osteoarthritis risk. Nature Genetics. doi:10.1038/ng.3911

Stulp, Mills, Pollet, Barrett (2014). Non-linear associations between stature and mate choice characteristics for American men and their spouses. Am J Hum Biol, 26(4):530-7. doi:10.1002/ajhb.22559

Woodley of Menie, M.A., Younuskunju, S., Balan, B., Piffer, D. (2017). Holocene Selection for Variants Associated With General Cognitive Ability: Comparing Ancient and Modern Genomes. Twin Research and Human Genetics, 20(4). doi:10.1017/thg.2017.37