Overview

Manhattan Plot

A Manhattan plot showing the results of a GWAS (from Wikipedia).

A genome-wide association study (GWAS) examines genetic variants across a population to test whether any variant in the genome is correlated with an observable trait. In the common case-control design, the sample is divided into two groups: one that has the trait of interest and one that does not. "Associated" variants are those whose allele frequencies differ between the two groups by a statistically significant margin, as judged, for example, by a chi-squared test. A GWAS cannot determine whether a variant is causal for the trait or disease; it can only establish a correlation between the variant and the trait[1].
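As a rough illustration of the test described above, the following Python sketch runs a Pearson chi-squared test on a 2x2 table of allele counts for a single variant. The function name `allelic_chi_squared` and the counts are made up for this example, and the sketch assumes a biallelic variant with no continuity correction; real studies typically rely on dedicated tools and stricter significance thresholds.

```python
import math


def allelic_chi_squared(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-squared test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are the two groups (with / without the trait); columns are the
    alternate and reference allele counts. Returns (chi2, p_value).
    """
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one allele")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if allele frequency were the same in both groups.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 60 of 100 alleles are "alt" in the trait group,
# 40 of 100 in the group without the trait.
chi2, p_value = allelic_chi_squared(60, 40, 40, 60)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value here only says the allele frequencies differ between the groups; as noted above, it says nothing about whether the variant causes the trait.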

DIY Genome-Wide Association Study

Most genome-wide association studies focus exclusively on single-nucleotide polymorphisms (SNPs) and use SNP genotyping assays. A GWAS can also be performed starting from raw sequence data, such as FASTQ files[2]. The GATK best-practices pipeline includes a variant calling step that identifies SNPs. The resulting SNP data can then be loaded into a statistical analysis package such as R[3] and combined with the record of which individual belongs to which group, so that correlations between SNPs and the occurrence of the trait can be tested.
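The last step, combining called SNPs with group membership, might look something like the Python sketch below. It assumes the genotypes have already been extracted from the variant calls into a plain dictionary mapping a SNP identifier to one alternate-allele count (0, 1 or 2, or None when missing) per individual. The names `allele_counts`, `chi_squared_2x2` and `scan`, and the toy data, are hypothetical and not part of GATK or R.

```python
import math


def allele_counts(genotypes, is_case):
    """Tally (case_alt, case_ref, control_alt, control_ref) for one SNP.

    genotypes: alternate-allele count per individual (0, 1 or 2; None = missing).
    is_case:   True if the individual has the trait, in the same order.
    """
    counts = {True: [0, 0], False: [0, 0]}
    for genotype, case in zip(genotypes, is_case):
        if genotype is None:
            continue  # skip individuals without a call at this SNP
        counts[bool(case)][0] += genotype      # alternate alleles
        counts[bool(case)][1] += 2 - genotype  # reference alleles
    return (*counts[True], *counts[False])


def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # no variation in a row or column: nothing to test
    chi2 = n * (a * d - b * c) ** 2 / denom
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def scan(snp_table, is_case):
    """Test every SNP and return (snp_id, chi2, p_value) sorted by p-value."""
    results = []
    for snp_id, genotypes in snp_table.items():
        chi2, p_value = chi_squared_2x2(*allele_counts(genotypes, is_case))
        results.append((snp_id, chi2, p_value))
    return sorted(results, key=lambda r: r[2])


# Toy data, made up for illustration: 4 individuals with the trait, 4 without.
is_case = [True, True, True, True, False, False, False, False]
snp_table = {
    "snp_a": [2, 2, 1, 2, 0, 0, 1, 0],  # alt allele enriched in the trait group
    "snp_b": [1, 0, 1, 0, 1, 0, 1, 0],  # same frequency in both groups
}
ranked = scan(snp_table, is_case)
for snp_id, chi2, p_value in ranked:
    print(snp_id, round(chi2, 2), round(p_value, 4))
```

With only a handful of individuals the chi-squared approximation is unreliable, so the toy numbers show the mechanics only. Because a real scan tests a very large number of SNPs, the p-values would also need a correction for multiple testing before any SNP is called associated.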

One of the difficulties of a GWAS is processing the large amount of data involved. Two ways to reduce it are to select SNPs by hand based on the literature, or to use a machine learning algorithm for an initial feature selection before running more computationally intensive analyses. One such algorithm, VLSReliefF[4], was developed by Maggie Eppstein, a UVM professor. The program applies a data mining algorithm (ReliefF) to a set of SNPs to narrow it down to good candidate SNPs, ideally finding candidates that a human would not be able to pick out.
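To give a feel for how Relief-style feature selection scores SNPs, here is a minimal Python sketch of basic Relief, a simpler relative of ReliefF that uses a single nearest "hit" (same group) and nearest "miss" (other group) per individual. It is not the VLSReliefF program from the cited paper, and the function names and toy genotypes are made up for illustration.

```python
def relief_weights(samples, labels):
    """Basic Relief feature weights for genotype data coded 0, 1 or 2.

    samples: list of individuals, each a list of genotype codes (one per SNP).
    labels:  group label per individual (e.g. 1 = has trait, 0 = does not).
    A SNP gains weight when it differs from the nearest individual of the
    other group and loses weight when it differs from the nearest individual
    of the same group.
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def diff(a, b, f):
        # Genotype difference at SNP f, scaled to the range 0..1.
        return abs(a[f] - b[f]) / 2.0

    def distance(a, b):
        return sum(diff(a, b, f) for f in range(n_features))

    for i, (x, y) in enumerate(zip(samples, labels)):
        hits = [s for j, (s, l) in enumerate(zip(samples, labels))
                if j != i and l == y]
        misses = [s for s, l in zip(samples, labels) if l != y]
        if not hits or not misses:
            continue  # need at least one neighbour of each kind
        nearest_hit = min(hits, key=lambda s: distance(x, s))
        nearest_miss = min(misses, key=lambda s: distance(x, s))
        for f in range(n_features):
            weights[f] += (diff(x, nearest_miss, f)
                           - diff(x, nearest_hit, f)) / len(samples)
    return weights


def top_features(weights, k):
    """Indices of the k highest-weighted SNPs, best first."""
    return sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)[:k]


# Toy data, made up: SNP 0 separates the groups, SNP 1 is unrelated noise.
samples = [[2, 0], [2, 1], [2, 2], [0, 0], [0, 1], [0, 2]]
labels = [1, 1, 1, 0, 0, 0]
weights = relief_weights(samples, labels)
print(weights, top_features(weights, 1))
```

The highest-weighted SNPs would then be passed on to the more expensive analysis. This sketch compares every pair of individuals across all SNPs, so it does not by itself address the scale of a genome-wide data set.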

References

  1. Genome-Wide Association Study (n.d.). In Wikipedia. Retrieved August 28, from
  2. FASTQ Format (n.d.). In Wikipedia. Retrieved August 28, from
  3. R (Programming Language) (n.d.). In Wikipedia. Retrieved August 28, from
  4. M. J. Eppstein and P. Haake (2008). "Very Large Scale ReliefF for Genome-Wide Association Analysis". IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB '08), pp. 112-119, 15-17 Sept. 2008.