Davide Piffer – 03/08/2015
Q-Q plots are commonly used to detect deviations from the normal distribution. This can be done visually or, more formally, by calculating the correlation between the theoretical and the empirical quantiles.
Another widely used test of normality is the Shapiro-Wilk test. This produces a coefficient W with a value of 1 corresponding to perfect normality (no deviation from the theoretical distribution) and lower values representing deviations from normality.
My goal was to determine the degree of agreement between the estimates produced by these two methods. To do this, I computed the correlation between the theoretical quantiles (x axis) and the empirical quantiles (y axis) of the Q-Q plot and carried out the Shapiro-Wilk test on several continuous variables. I then correlated the W values with the Q-Q plot correlation coefficients.
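The procedure can be sketched on simulated data as follows. This is a minimal illustration, not the original analysis: the helper name normality_measures and the four simulated variables are assumptions made for the example, and qqnorm() is called with plot.it = FALSE so that no plot is drawn.

```r
# Hypothetical helper (not part of the original analysis): returns the
# Q-Q plot correlation and the Shapiro-Wilk W for one numeric vector.
normality_measures <- function(x) {
  x <- x[!is.na(x)]
  qq <- qqnorm(x, plot.it = FALSE)  # theoretical (x) and empirical (y) quantiles
  c(qq_cor = cor(qq$x, qq$y),
    W = unname(shapiro.test(x)$statistic))
}

# Simulated stand-ins for the real variables (assumption: illustration only)
set.seed(1)
sim <- data.frame(
  normal  = rnorm(50),
  skewed  = rexp(50),
  heavy   = rt(50, df = 3),
  uniform = runif(50)
)

# One row per variable, one column per method
measures <- t(sapply(sim, normality_measures))
print(measures)

# Agreement between the two methods across variables
print(cor(measures[, "qq_cor"], measures[, "W"]))
```

With real data, the data frame sim would be replaced by the columns of interest after removing missing values.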
Methods
Variables were taken from two files (NineHitsBetaFst_B.csv and Factors.csv) in the data set I used for the population genetics study of intelligence. The vectors represent allele frequencies or factors derived from allele frequencies via factor analysis (Piffer, 2015).
Data files containing the vectors can be downloaded from: https://osf.io/jt73x/
Results of the analysis are reported in this spreadsheet: https://docs.google.com/spreadsheets/d/1fg2evimqFlx2PqxopcfiJy99i4d6NgZw2HnsBeIzgUc/edit?usp=sharing
R was used to carry out the analysis.
R Code is in the appendix.
Results
The correlation between the Q-Q plot xy correlation and Shapiro-Wilk W was r=0.993 (N=19; p<0.001).
Figure 1. Relationship between Q-Q plot xy correlation and Shapiro-Wilk W.
The relationship between the two variables can be approximately described by this formula:
1-W ≈ 2(1-Corr Q-Q plot),
e.g. for 9SNPsGIDist, Q-Q corr=0.952 and Shapiro-Wilk W=0.905, so that 1-W=0.095, roughly twice 1-Corr Q-Q=0.048. This can be seen from Table 1.
Table 1. Departures from 1 for the two methods (1-x) and their ratio.
1-Corr Q-Q | 1-W | (1-W)/(1-Corr Q-Q) |
0.0322736 | 0.0661 | 2.048113628 |
0.0355605 | 0.07253 | 2.039622615 |
0.0231317 | 0.04782 | 2.067292936 |
0.0230315 | 0.04779 | 2.074984261 |
0.0471659 | 0.09458 | 2.005262276 |
0.0264781 | 0.05437 | 2.05339507 |
0.0881257 | 0.16912 | 1.919076955 |
0.0270037 | 0.05495 | 2.034906328 |
0.0243319 | 0.04994 | 2.052449665 |
0.0221553 | 0.04577 | 2.065871372 |
0.0267268 | 0.05474 | 2.048131464 |
0.0651654 | 0.12811 | 1.965920565 |
0.0228832 | 0.04754 | 2.077506642 |
0.0308334 | 0.06289 | 2.039671266 |
0.0267176 | 0.05426 | 2.030871036 |
0.0276686 | 0.05681 | 2.053230015 |
0.0328761 | 0.07879 | 2.396573803 |
0.0728126 | 0.15363 | 2.109937016 |
0.0384661 | 0.08824 | 2.293967935 |
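The ratio column and the agreement between the two methods can be rechecked directly from the values in Table 1. The sketch below only retypes the first two columns of the table; the vector names are chosen for the example. Because a correlation is unchanged when both variables are subtracted from 1, correlating the two departure columns is equivalent to correlating the Q-Q correlation with W.

```r
# Values copied from the first two columns of Table 1
one_minus_qq <- c(0.0322736, 0.0355605, 0.0231317, 0.0230315, 0.0471659,
                  0.0264781, 0.0881257, 0.0270037, 0.0243319, 0.0221553,
                  0.0267268, 0.0651654, 0.0228832, 0.0308334, 0.0267176,
                  0.0276686, 0.0328761, 0.0728126, 0.0384661)
one_minus_w  <- c(0.0661, 0.07253, 0.04782, 0.04779, 0.09458,
                  0.05437, 0.16912, 0.05495, 0.04994, 0.04577,
                  0.05474, 0.12811, 0.04754, 0.06289, 0.05426,
                  0.05681, 0.07879, 0.15363, 0.08824)

# Third column of Table 1: (1-W)/(1-Corr Q-Q)
ratio <- one_minus_w / one_minus_qq
print(summary(ratio))

# Correlation between the two methods, with a significance test
agreement <- cor.test(1 - one_minus_qq, 1 - one_minus_w)
print(agreement)
```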
There is a slight tendency for the ratio to fall as departures from normality get bigger: when the departure from 1 is large, 1-W is slightly less than twice as big as 1-Corr Q-Q, whereas it is slightly more than twice as big when departures from normality are small. The pattern is not uniform, however: the two highest ratios (about 2.40 and 2.29) occur at moderate departures.
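One possible reading of the factor of 2 is offered here as an assumption, not as a result of this analysis. The Shapiro-Wilk W is closely related to a squared correlation between the ordered data and the expected normal order statistics (the related Shapiro-Francia statistic is defined in this way). If W is treated as approximately equal to the square of the Q-Q plot correlation r, then:

```latex
1 - W \approx 1 - r^{2} = (1 - r)(1 + r) \approx 2\,(1 - r) \qquad \text{for } r \text{ close to } 1
```

For 9SNPsGIDist, r=0.952 gives r^2 of about 0.906, close to the observed W=0.905. Under this approximation the ratio equals 1+r, which shrinks as r falls. That matches the direction of the trend, but it cannot account for the ratios above 2, so it should be read as a rough guide only.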
Discussion
There is very strong agreement between two commonly used methods for assessing the normality of data. An advantage of the Shapiro-Wilk test is that it provides a test of the null hypothesis that the population is normally distributed. However, p values have many issues. In particular, they depend on sample size: with a very large sample, even tiny deviations from normality will lead to rejection of the null hypothesis (Kirkegaard, 2014).
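The dependence of the p value on sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the mildly skewed population (a gamma distribution with shape 20), the seed and the two sample sizes are arbitrary choices for the example, and 5000 is used as the larger size because shapiro.test() in R accepts at most 5000 values.

```r
set.seed(42)

# Mildly skewed population (assumption: gamma with shape 20, illustration only)
draw <- function(n) rgamma(n, shape = 20)

small <- shapiro.test(draw(50))
large <- shapiro.test(draw(5000))  # shapiro.test() accepts at most 5000 values

# Same population, different sample sizes: compare W and p side by side
comparison <- data.frame(
  n = c(50, 5000),
  W = unname(c(small$statistic, large$statistic)),
  p = c(small$p.value, large$p.value)
)
print(comparison)
```

The quantity to compare is the p value across the two rows: the population is the same in both cases, so any difference in the test decision reflects sample size, not a change in the shape of the distribution.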
References:
Kirkegaard, E. (2014). W values from the Shapiro-Wilk test visualized with different datasets. http://emilkirkegaard.dk/en/?p=4452
Piffer, D. (2015). A review of intelligence GWAS hits: their relationship to country IQ and the issue of spatial autocorrelation. Figshare, http://dx.doi.org/10.6084/m9.figshare.1393160
Appendix
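#Dataset NineHitsBetaFst_B
NineHitsBetaFst_B=read.csv("NineHitsBetaFst_B.csv")#assumes the downloaded file is in the working directory; skip if the data frame is already loaded
newdata3=na.omit(NineHitsBetaFst_B)#drops rows with missing values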
qqChr21Fst=qqnorm(newdata3$Chr21.Fst)#creates Q-Q plot and assigns it a name
cor(qqChr21Fst$x,qqChr21Fst$y)#computes correlation between x and y axes of Q-Q plot
shapiro.test(newdata3$Chr21.Fst) # Shapiro-Wilk test
qqChr1Fst=qqnorm(newdata3$Chr1.Fst)
cor(qqChr1Fst$x,qqChr1Fst$y)
shapiro.test(newdata3$Chr1.Fst)
qqIQdist=qqnorm(newdata3$IQ.distances)
cor(qqIQdist$x,qqIQdist$y)
shapiro.test(newdata3$IQ.distances)
qqX4.SNP=qqnorm(newdata3$X4.SNPs.GI.distances)
cor(qqX4.SNP$x,qqX4.SNP$y)
shapiro.test(newdata3$X4.SNPs.GI.distances)
qqX9.SNP=qqnorm(newdata3$X9.SNPs.GI.distances)
cor(qqX9.SNP$x,qqX9.SNP$y)
shapiro.test(newdata3$X9.SNPs.GI.distances)
qqset1=qqnorm(newdata3$Set1)
cor(qqset1$x,qqset1$y)
shapiro.test(newdata3$Set1)
qqset2=qqnorm(newdata3$Set2)
cor(qqset2$x,qqset2$y)
shapiro.test(newdata3$Set2)
qqset3=qqnorm(newdata3$Set3)
cor(qqset3$x,qqset3$y)
shapiro.test(newdata3$Set3)
qqset4=qqnorm(newdata3$Set4)
cor(qqset4$x,qqset4$y)
shapiro.test(newdata3$Set4)
qqset5=qqnorm(newdata3$Set5)
cor(qqset5$x,qqset5$y)
shapiro.test(newdata3$Set5)
qqset6=qqnorm(newdata3$Set6)
cor(qqset6$x,qqset6$y)
shapiro.test(newdata3$Set6)
qqset7=qqnorm(newdata3$Set7)
cor(qqset7$x,qqset7$y)
shapiro.test(newdata3$Set7)
qqset8=qqnorm(newdata3$Set8)
cor(qqset8$x,qqset8$y)
shapiro.test(newdata3$Set8)
qqset9=qqnorm(newdata3$Set9)
cor(qqset9$x,qqset9$y)
shapiro.test(newdata3$Set9)
qqset10=qqnorm(newdata3$Set10)
cor(qqset10$x,qqset10$y)
shapiro.test(newdata3$Set10)
qqsetpolscore=qqnorm(newdata3$Polygenic.Score)
cor(qqsetpolscore$x,qqsetpolscore$y)
shapiro.test(newdata3$Polygenic.Score)
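#Dataset Factors
Factors=read.csv("Factors.csv")#assumes the downloaded file is in the working directory; skip if the data frame is already loaded
newdata4=na.omit(Factors)#drops rows with missing values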
qqsetX4=qqnorm(newdata4$X4SNPs.g.factor)
cor(qqsetX4$x,qqsetX4$y)
shapiro.test(newdata4$X4SNPs.g.factor)
qqsetX9=qqnorm(newdata4$X9.SNPs.factor)
cor(qqsetX9$x,qqsetX9$y)
shapiro.test(newdata4$X9.SNPs.factor)
qqsetgpol=qqnorm(newdata4$G.Polygenic.Score)
cor(qqsetgpol$x,qqsetgpol$y)
shapiro.test(newdata4$G.Polygenic.Score)
#Scatterplot (Q-Q cor vs Shapiro-Wilk W)
library(car)
newdatascatterplot=na.omit(qqplots..BetaFst)#load .csv file with results (download from Google Docs link)
scatterplot(newdatascatterplot$SHAPIRO.WILKS.W~newdatascatterplot$CORR.Q.Q.PLOT,main="Q-Q Plot xy cor vs Shapiro-Wilk W (r=0.99)", xlab="Q-Q Plot xy cor",ylab="Shapiro-Wilk W",smooth=FALSE) #creates regression scatterplot with Q-Q plot correlation (x axis) and Shapiro-Wilk W (y axis); in car versions before 3.0 use smoother=FALSE instead of smooth=FALSE
cor(newdatascatterplot$SHAPIRO.WILKS.W,newdatascatterplot$CORR.Q.Q.PLOT) #computes correlation between the two methods