At one point during my PhD my advisor joked that my dissertation could at least be titled, “RAD-seq in pipefish: a cautionary tale”. Luckily, that didn’t end up being the case*, but my recently published paper, Substantial differences in bias between single-digest and double-digest RAD-seq libraries: a case study1, comes pretty close.
This paper summarizes some major differences in the genomic data derived from two different methods of sampling the variation that exists in the genome. Both methods are types of Restriction Site Associated DNA-sequencing (RAD-seq), and they differ primarily in the way they cut up the genome (one is called single-digest and the other double-digest, based on the number of restriction enzymes used to chop up the genome). People have identified various sources of bias that result from the different ways of fragmenting the genome and have used those to debate the benefits of single-digest versus double-digest2,3,4. My paper shows how out-of-whack the results of a typical analysis can become when data derived from the two different methods are analyzed together.
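To make the single-digest versus double-digest distinction a bit more concrete, here is a minimal Python sketch of how each method samples a genome. Everything in it is an assumption for illustration: the toy genome, the enzymes (PstI, which recognizes CTGCAG, and MboI, which recognizes GATC), the read length, and the size-selection window are not taken from the paper, and the code treats the start of each recognition site as the cut position rather than modeling the real cut offset.

```python
import re

def cut_positions(seq, site):
    """Start index of every occurrence of a recognition site (overlaps allowed)."""
    return [m.start() for m in re.finditer(f"(?={site})", seq)]

def single_digest_loci(seq, site, read_len):
    """Single-digest: every cut site yields one locus on each side of the cut."""
    loci = []
    for pos in cut_positions(seq, site):
        loci.append((max(0, pos - read_len), pos))         # upstream flank
        loci.append((pos, min(len(seq), pos + read_len)))  # downstream flank
    return loci

def double_digest_loci(seq, site_a, site_b, min_len, max_len):
    """Double-digest: keep fragments bounded by one cut from each enzyme
    whose length falls inside the size-selection window."""
    cuts = sorted([(p, "A") for p in cut_positions(seq, site_a)] +
                  [(p, "B") for p in cut_positions(seq, site_b)])
    loci = []
    for (p1, e1), (p2, e2) in zip(cuts, cuts[1:]):
        if e1 != e2 and min_len <= p2 - p1 <= max_len:
            loci.append((p1, p2))
    return loci

# Hypothetical toy genome with three PstI sites and two MboI sites.
PSTI, MBOI = "CTGCAG", "GATC"
genome = ("A" * 20 + PSTI + "T" * 30 + MBOI + "A" * 100 +
          PSTI + "T" * 10 + PSTI + "A" * 40 + MBOI + "T" * 20)

sd_loci = single_digest_loci(genome, PSTI, read_len=25)
dd_loci = double_digest_loci(genome, PSTI, MBOI, min_len=30, max_len=60)

print(len(sd_loci), "single-digest loci:", sd_loci)
print(len(dd_loci), "double-digest loci:", dd_loci)
```

In this toy example the single-digest approach recovers loci on both sides of all three PstI sites, while the double-digest approach keeps only the PstI-to-MboI fragments that fit the size window, so the two methods end up sampling overlapping but different parts of the same genome.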
The origin story of this dataset is why I was reassured** that I could at least publish something about a “cautionary tale”. As a new graduate student, back in 2011, I wanted to find a link between animal behaviors and the genome. One way to do this was to compare the frequency of different genetic variants in successfully mating and non-successfully mating females in a natural population of pipefish (a species in which sexual selection acts strongly on females). I described this approach in more detail in a previous post. So I collected fish from a population near Corpus Christi, TX, and set out to do the original RAD-seq method, the single-digest approach5. After about a year of troubleshooting every step of the method, from DNA extraction to the final amplification step, I finally had a library with DNA from 60 barcoded individuals ready to sequence (a library is one test tube that contains the pooled DNA from a bunch of different individuals, and is what eventually gets sequenced). I sequenced it and the data that came back seemed to be pretty decent quality. I breathed a sigh of relief – it worked! – and went to prep the next library.
This is where I ran into problems. The single-digest step required me to use a piece of equipment (a sonicator) in another lab, and when I prepped the next library, the sonication step returned different results than it had given when I prepared the first library! Uh oh. Because I wasn’t actually the one running the sonicator, I struggled to troubleshoot why I was getting different results. So I decided to switch to the double-digest protocol6, where I would have total control over every step, using similar enzymes to recover at least some of the same genomic regions. Unfortunately, I then spent another year troubleshooting that method.*** Finally I got the double-digest method to work (yay!) and eventually I processed my samples and sent them off for sequencing (a total of 4 double-digest libraries).
Fast forward to 2015: I finally had my DNA sequencing data, and because of the overlap between the single-digest and the double-digest markers, I analyzed the two sets of data together. When I set about comparing individuals within the population for selection components analysis, I got an incredibly puzzling result:
My original comparison of males and females from a single population, using the merged single-digest and double-digest RAD-seq datasets. The colored points were deemed “outliers” based on their extreme values. Notice how there are basically two bands of points in the male-female comparison. These differences went away when only the double-digest dataset was analyzed.
See how the points form two separate bands? That’s because the single-digest and double-digest methods were biased in such different ways that they produced datasets with incredibly different allele frequencies!**** To continue with my selection components analysis, I focused on the double-digest dataset7 because I needed to finish my dissertation. Focusing only on the double-digest dataset, those two bands disappeared:
Selection components analysis using only the double-digest RAD-seq dataset. Published in Flanagan, S. P. and Jones, A. G. (2017), Genome-wide selection components analysis in a fish with male pregnancy. Evolution, 71: 1096–1105. doi:10.1111/evo.13173
However, when I started my postdoc, I returned to the datasets and tried to figure out the major sources of the differences between the two datasets.
By analyzing various aspects of the datasets, re-analyzing them in a variety of different ways, and modeling the different sources of bias with an in silico digestion of the genome (basically, taking the genome sequence and using the computer to mock up what the results should look like), I was able to identify a few major sources of bias: polymorphic restriction sites (the sites where the enzymes cut the genome can be variable, too, leading to skewed results), PCR duplicates (extra copies of particular sequences due to random chance in one of the molecular biology steps), the ‘actual’ frequency of the variant, and skewed sample sizes (60 individuals sampled with single-digest and 384 with double-digest). To ameliorate the problems, a few steps can be taken: (1) analyze the datasets separately and then find overlapping loci, rather than doing the entire analysis together; (2) focus on loci with similar coverage levels in different datasets; (3) be aware of the different sources of bias and check to see if they’re impacting your dataset.
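Steps (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the locus names, coverage values, and allele frequencies are made up, the `comparable_loci` helper is hypothetical, and the two-fold coverage cutoff is an arbitrary choice rather than the threshold used in the paper.

```python
def comparable_loci(sd, dd, max_fold=2.0):
    """Find loci present in both separately analyzed datasets and keep those
    whose mean coverage differs by no more than `max_fold` between datasets.

    Each dataset maps a locus name to (mean coverage, allele frequency).
    Returns a dict of locus name -> absolute allele frequency difference.
    """
    kept = {}
    for locus in sorted(set(sd) & set(dd)):   # step 1: overlapping loci only
        cov_sd, freq_sd = sd[locus]
        cov_dd, freq_dd = dd[locus]
        if min(cov_sd, cov_dd) <= 0:
            continue                          # no usable coverage in one dataset
        fold = max(cov_sd, cov_dd) / min(cov_sd, cov_dd)
        if fold <= max_fold:                  # step 2: similar coverage
            kept[locus] = abs(freq_sd - freq_dd)
    return kept

# Hypothetical per-locus summaries from two separate analyses.
single_digest = {"locus1": (20.0, 0.50), "locus2": (8.0, 0.90), "locus3": (15.0, 0.30)}
double_digest = {"locus1": (25.0, 0.55), "locus2": (40.0, 0.60), "locus4": (30.0, 0.10)}

kept = comparable_loci(single_digest, double_digest)
for locus, diff in kept.items():
    print(f"{locus}: allele frequency difference = {diff:.2f}")
```

Here only locus1 survives: locus3 and locus4 each appear in just one dataset, and locus2 has a five-fold coverage difference, which is exactly the kind of locus where a large allele frequency difference is more likely to reflect bias than biology.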
So, from unexpected (and very frustrating) bumps-in-the-road, I was able to compare two different commonly-used methods. Of course, this was not an ideal dataset for a comparison (better would have been to have the same individuals sequenced using both methods), but I was still able to provide some guidelines and insight into the issues facing researchers trying to make sense of multiple sources of data.
*For those who care, my dissertation title was “Elucidating the genomic signatures of selection using theoretical and empirical approaches”
**I wasn’t very reassured.
***One of the key breakthroughs was buying a Qubit, which is a much more accurate way of quantifying DNA than a Nanodrop. Another breakthrough was starting with many more pooled samples, even for troubleshooting – more DNA in meant more DNA out, which helped tremendously. For those who care.
****Also, I wasn’t stringent enough about pruning out low-quality points, and I analyzed the datasets together at every step of the analysis. In the published paper, those bands don’t show up, but the differences in allele frequencies between the two datasets are really extreme.
1Flanagan, S.P., and Jones, A.G. 2017. Substantial differences in bias between single-digest and double-digest RAD-seq libraries: A case study. Molecular Ecology Resources. 00:1–17. https://doi-org.proxy.lib.utk.edu:2050/10.1111/1755-0998.12734
2Andrews, Kimberly R., et al. 2016. Harnessing the power of RADseq for ecological and evolutionary genomics. Nature Reviews Genetics 17: 81-92. https://www.nature.com/articles/nrg.2015.28
3Andrews, Kimberly R., and Gordon Luikart. 2014. Recent novel approaches for population genomics data analysis. Molecular Ecology. 23: 1661-1667. http://onlinelibrary.wiley.com/doi/10.1111/mec.12686/full
4Puritz, Jonathan B., et al. 2014. Demystifying the RAD fad. Molecular Ecology 23: 5937-5942. http://onlinelibrary.wiley.com/doi/10.1111/mec.12965/full
5Baird, Nathan A., et al. 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One. 3: e3376. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0003376
6Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S., & Hoekstra, H. E. 2012. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One. 7: e37135. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0037135
7Flanagan, S.P., and Jones AG. 2017. Genome‐wide selection components analysis in a fish with male pregnancy. Evolution 71: 1096-1105. http://onlinelibrary.wiley.com/doi/10.1111/evo.13173/full