My probably not-very-good R 23andme data analysis

My mom got the whole family 23andme kits for Christmas this year, and I’ve been looking forward to getting the results mostly so I could play with the raw data in R. They finally came in, so I went back to a blog post I’d read about analyzing 23andme data using a Bioconductor package called gwascat. It’s a really good blog post, but as it happens, the package has been updated since it was written, so the code, sadly, didn’t work. Since I of course have VAST knowledge of bioinformatics (by which I mean I’ve hung around a lot of bioinformaticians and talked to them and kind of know a few things but not really) and am super awesome at R (by which I mean I’m like moderately okay at it), I decided to try my hand at coming up with my own analysis, based on the original blog post.

Let me be incredibly clear – I have only a vague notion of what I’m doing here, so you should not take any of what I say to be, you know, like, necessarily fact. In fact, I would love for a real bioinformatician to read this and point out my errors so I can fix them. That said, here is what I did, and what you can do too!

To walk you through it first in plain English – what you get in your raw data from 23andme is a list of RSIDs, which are accession numbers for SNPs, or single nucleotide polymorphisms. At a given position in your genetic sequence, for example, you may have an A, which means you’ll have brown hair, as opposed to a G, which means you’ll have blonde hair. Of course, it’s a lot more complicated than that, but the basic idea is that you can link traits to SNPs.
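For reference, the raw data file you download from 23andme starts with a block of “#” comment lines and then has one tab-separated row per SNP. It looks roughly like this (these particular rows are made up for illustration):

# rsid	chromosome	position	genotype
rs4477212	1	82154	AA
rs3094315	1	752566	AG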

So the task that needs to be done here is two-fold. First, I need to get a list of SNPs with their strongest risk allele – in other words, what SNP location am I looking for, and which nucleotide is the one that’s associated with higher risk. Then, I need to match this up with my own list of SNPs and find the ones where both of my nucleotides are the risk allele. Here’s how I did it!

First I’m going to load my data and the packages we’ll need.

# gwascat has the GWAS catalog data; stringr provides str_sub(), used below
library(gwascat)
library(stringr)

# 23andme raw files open with "#" comment lines (read.table skips these by
# default) and have no header row, so header needs to be FALSE
d <- read.table(file = "file_name.txt",
                sep = "\t", header = FALSE,
                colClasses = c("character", "character", "numeric", "character"),
                col.names = c("rsid", "chrom", "position", "genotype"))

Next I need to pull in the data about the SNPs from gwascat. I can use this to match up the RSIDs in my data with the ones in their data. I’m also going to drop some other columns I’m not interested in at the moment.

data("gwdf_2014_09_08")
trait_list <- merge(gwdf_2014_09_08, d, by.x = "SNPs", by.y = "rsid")
trait_list <- trait_list[, c(1, 9, 10, 22, 27, 37)]
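Since those numbers are column positions rather than names, it’s worth a quick sanity check that they’re grabbing what you think they are (this check is my own addition, not part of the original analysis):

# print the names of the columns being kept, to confirm the positions are right
names(gwdf_2014_09_08)[c(1, 9, 10, 22, 27, 37)]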

Now I want to find out where I have the risk allele. This is where this analysis gets potentially stupid. The risk allele is stored in the gwascat dataset with its RSID plus the allele, such as rs1423096-G. My understanding is that if you have two Gs at that position (remembering that you get one copy from each of your parents), then you’re at higher risk. So I want to create a new column in my merged dataset that has the risk allele printed twice, so I can compare it directly to the column with my own genotype and keep only the rows where I have two copies of the risk allele (since I don’t want to dig through all 10,000+ SNPs to find the ones of interest).

# str_sub(x, -1) grabs the last character, i.e. the risk allele itself;
# pasting it to itself gives the homozygous genotype, e.g. "rs1423096-G" -> "GG"
risk <- str_sub(trait_list$`Strongest SNP-Risk Allele`, -1)
trait_list$badsnp <- paste(risk, risk, sep = "")

Okay, almost there! Now let’s remove all the stuff that’s not interesting and just keep the ones where I have two copies of the risk allele. I could also have it remove ones that don’t match my ancestry (European) or gender, but I’m not going to bother with it, since keeping the ones where I have a match gives me a reasonable amount to scroll through.

# keep only the rows where my genotype is exactly the doubled risk allele
my_risks <- trait_list[trait_list$genotype == trait_list$badsnp, ]
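One caveat: this only catches rows where I’m homozygous for the risk allele. If you also wanted to flag the spots where you carry just one copy, a rough sketch (my own addition, not part of the original analysis) using stringr’s str_detect might look like this:

# TRUE wherever the genotype contains the risk letter at least once,
# which picks up heterozygous carriers as well as homozygous ones
one_copy <- str_detect(trait_list$genotype,
                       fixed(str_sub(trait_list$badsnp, 1, 1)))
my_carrier_risks <- trait_list[one_copy, ]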

And there you have it! I don’t know how meaningful this analysis really is, but according to this, some traits that I have are higher educational attainment (true), persistent temperament (sure, I suppose so), migraines (true), and I care about the environment (true). Also I could be at higher risk of having a psychotic episode if I’m on methamphetamine, so I’ll probably skip trying that (which was actually my plan anyway). Anyway, it’s kind of entertaining to look at, and I’m finding SNPedia is useful for learning more.

So, now, bring on the bioinformaticians telling me how incorrect all of this is. I will eagerly make corrections to anything I am wrong about!

Living my best academic life: 2018 resolutions for getting that PhD done

I didn’t feel very optimistic going into 2017. I had recently lost my father and grandfather in the same week, and I was feeling anxious and depressed about what seemed like a pretty disastrous outcome to the 2016 elections. I don’t think I made any resolutions that year because I was so disheartened by the whole situation that I figured, who cares? My focus in 2017 was basically, do what it takes to get through it, eat some good food and drink some good wine because possibly the world will end pretty soon, etc.

But I feel different going into 2018, more motivated and invigorated. Yeah, 2017 was pretty shitty in some ways, but there were also some good things about it, actually some really great things! I know it’s very silly, but it also feels like there’s something to wiping the slate clean and starting over. At this point, I’ve worked out 100% of the days in 2018! I’ve eaten healthy, put my shoes away, and done all those other things I aim to do EVERY SINGLE DAY OF THIS YEAR.

More importantly for my motivation, there’s a chance that this year could be when I finish my PhD, if I can manage to do my dissertation work in three semesters (i.e. 12 months). Maybe this is a ridiculous goal, but I’m kind of a ridiculous person, and it sure would be nice to finish. To that end, my goal for this year is to live my best academic life. What does that mean?

  • read (something academic, that is) every day. My former advisor, who I still keep in touch with on Twitter, very usefully recommended #365papers – i.e. read a scholarly paper every day of the year. I probably need to read around that much for my dissertation anyway, and I do also have a huge backlog of interesting articles I’ve filed away to read “one day.” So far I’m one down, 364 to go! (But again, I’ve read an academic paper 100% of the days this year)
  • write every day. It doesn’t have to be a lot. A blog post (this counts for today!), a bit of a paper, part of my dissertation, something for work, even an academic-related tweet. I know that doing a dissertation will involve way more writing than I’d been doing for the other parts of my PhD work, so I want to get into the habit now.
  • keep working on open science. I’m finally getting to the point with my coding skills where I don’t feel horrendously embarrassed for other people to see my code, but I still often think, eh, who’s going to want to see this? That’s totally the wrong idea, especially for someone whose scholarly research focuses on data sharing and reuse! I’m going to try to make a lot more commits to GitHub, even if it’s just silly stuff that I’m working on for my own entertainment, because who knows? Someone else might find it useful.

So there you go! I’ll be tweeting out the papers I read on my Twitter account (@lisafederer) using #365papers, putting stuff up on my GitHub account, and I’ll probably (hopefully?) be writing more here, so watch this space!