My probably not-very-good R 23andme data analysis

My mom got the whole family 23andme kits for Christmas this year, and I’ve been looking forward to getting the results mostly so I could play with the raw data in R. It finally came in, so I went back to a blog post I’d read about analyzing 23andme data using a Bioconductor package called GWASCat. It’s a really good blog post, but as it happens, the package has been updated since it was written, so the code, sadly, didn’t work. Since I of course have VAST knowledge of bioinformatics (by which I mean I’ve hung around a lot of bioinformaticians and talked to them and kind of know a few things but not really) and am super awesome at R (by which I mean I’m like moderately okay at it), I decided to try my hand and coming up with my own analysis, based on the original blog post.

Let me be incredibly clear – I have only a vague notion of what I’m doing here, so you should not take any of what I say to be, you know, like, necessarily fact. In fact, I would love for a real bioinformatician to read this and point out my errors so I can fix them. That said, here is what I did, and what you can do too!

To walk you through it first in plain English – what you get in your raw data from 23andme is a list of RSIDs, which are accession numbers for SNPs, or single nucleotide polymorphisms. At a given position in your genetic sequence, for example, you may have an A, which means you’ll have brown hair, as opposed to a G, which means you’ll have blonde hair. Of course, it’s a lot more complicated than that, but the basic idea is that you can link traits to SNPs.

So the task that needs to be done here is two-fold. First, I need to get a list of SNPs with their strongest risk allele – in other words, what SNP location am I looking for, and which nucleotide is the one that’s associated with higher risk. Then, I need to match this up with my own list of SNPs and find the ones where both of my nucleotides are the risk allele. Here’s how I did it!

First I’m going to load my data and the packages we’ll need.

library(gwascat)
library(splitr)
d <- read.table(file = "file_name.txt",
 sep="\t", header=T,
 colClasses=c("character", "character", "numeric", "character"),
 col.names=c("rsid", "chrom", "position", "genotype"))

Next I need to pull in the data about the SNPs from gwascat. I can use this to match up the RSIDs in my data with the ones in their data. I’m also going to drop some other columns I’m not interested in at the moment.

data("gwdf_2014_09_08")
trait_list <- merge(gwdf_2014_09_08, d, by.x = "SNPs", by.y = "rsid")
trait_list <- trait_list[, c(1, 9, 10, 22, 27, 37)]

Now I want to find out where I have the risk allele. This is where this analysis gets potentially stupid. The risk allele is stored in the gwascat dataset with its RSID plus the allele, such as rs1423096-G. My understanding is that if you have two Gs at that position (remembering that you get one copy from each of your parents), then you’re at higher risk. So I want to create a new column in my merged dataset that has the risk allele printed twice, so that I can just compare it to the column with my data, and only have it show me the ones where I have two copies of the risk allele (since I don’t want to dig through all 10,000+ genes to find the ones of interest).

trait_list$badsnp <- paste(str_sub(trait_list$`Strongest SNP-Risk Allele`, -1), str_sub(trait_list$`Strongest SNP-Risk Allele`, -1), sep = "")

Okay, almost there! Now let’s remove all the stuff that’s not interesting and just keep the ones where I have two copies of the risk allele. I could also have it remove ones that don’t match my ancestry (European) or gender, but I’m not going to bother with it, since keeping the ones where I have a match gives me a reasonable amount to scroll through.

my_risks <- trait_list[trait_list$genotype == trait_list$badsnp,]

And there you have it! I don’t know how meaningful this analysis really is, but according to this, some traits that I have are higher educational attainment (true), persistent temperament (sure, I suppose so), migraines (true), and I care about the environment (true). Also I could be at higher risk of having a psychotic episode if I’m on methamphetamine, so I’ll probably skip trying that (which was actually my plan anyway). Anyway, it’s kind of entertaining to look at, and I’m finding SNPedia is useful for learning more.

So, now, bring on the bioinformaticians telling me how incorrect all of this is. I will eagerly make corrections to anything I am wrong about!

Leave a Reply