Admit it: at some point in our childhoods, we’ve all wondered if we had a long-lost twin or a celebrity relative. Perhaps we’ve gone so far as to take a genetic test through a company such as 23andMe and learned more about our ancestry and distant relatives. Though we often pursue genetic testing for personal reasons, finding out which individuals share DNA could have major implications for our understanding of the health and structure of human populations at large.
This process of examining genetic connections has been streamlined by the introduction of iLASH (IBD by LocAlity-Sensitive Hashing), an algorithm that efficiently identifies relevant genetic connections among large sets of people. Those connections in turn inform important advances in population genetics and personalized medicine, among other fields.
What is iLASH?
iLASH came to be when Ruhollah Shemirani, a Ph.D. student at USC, teamed up with Jose-Luis Ambite, Research Team Leader at USC’s Information Sciences Institute (ISI) and Associate Research Professor of Computer Science at USC Viterbi, to study how genetic connections among individuals can shed light on genetic causes of diseases and the genetic structure of populations. iLASH can also be used for other purposes, such as finding distant relatives through services like 23andMe. The project was made possible by collaboration with experts from the University of Colorado, the Icahn School of Medicine at Mount Sinai, and the University of North Carolina at Chapel Hill.
Establishing genetic connections can make the world feel smaller than you think. Ambite’s own experience, like that of many others who have used similar services, serves as an interesting example.
“According to 23andMe, I share 0.07% of my genome with Greg Ver Steeg, another researcher at ISI,” mentioned Ambite. “Greg is American (from Dutch descent, many generations ago). I’m from Spain. Nonetheless, we share a bit of DNA from a common ancestor.”
Essentially, iLASH is a method for IBD, or Identity-By-Descent, estimation. “IBD estimation is the process of finding out where, and how much, each pair of individuals in a genetic dataset share their DNA due to shared ancestry,” explained Shemirani.
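To make "where, and how much" concrete, here is a minimal sketch of a naive pairwise comparison. It is not how iLASH works: it assumes each person is represented by a string of 0/1 alleles (a hypothetical toy encoding), treats any long run of identical alleles as a shared segment, and ignores genotyping errors, phasing, and genetic distance, all of which real IBD methods must handle. The function name `shared_segments` and the `min_len` threshold are illustrative only.

```python
def shared_segments(hap_a, hap_b, min_len=5):
    """Return (start, end) index ranges, end exclusive, where the two
    allele strings are identical for at least min_len positions."""
    segments = []
    run_start = None
    for i, (x, y) in enumerate(zip(hap_a, hap_b)):
        if x == y:
            if run_start is None:
                run_start = i          # a new matching run begins here
        else:
            if run_start is not None and i - run_start >= min_len:
                segments.append((run_start, i))
            run_start = None           # the run is broken by a mismatch
    n = min(len(hap_a), len(hap_b))
    if run_start is not None and n - run_start >= min_len:
        segments.append((run_start, n))  # run that reaches the end
    return segments


def shared_fraction(hap_a, hap_b, min_len=5):
    """Fraction of compared positions covered by shared segments."""
    n = min(len(hap_a), len(hap_b))
    covered = sum(end - start for start, end in shared_segments(hap_a, hap_b, min_len))
    return covered / n if n else 0.0


person_1 = "0101100111"
person_2 = "0101101000"
print(shared_segments(person_1, person_2))   # [(0, 6)]
print(shared_fraction(person_1, person_2))   # 0.6
```

Running a scan like this on every pair of people grows quadratically with the size of the dataset, which is why large biobanks call for a faster approach.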
IBD estimation is the first step of IBD Mapping, a novel process to identify the genetic basis of disease. This process is broken down into three steps, each to be published in its own paper. The first step, published as a featured article in the Editor’s Highlights in Nature Communications on June 10th, involves estimating genetic segments that are shared between pairs of individuals using iLASH. In the second step, this pairing information is used to create groups of “distant families” using network clustering methods. The third paper will focus on statistical methods to show whether these “distant families” reveal elevated rates of diseases or other traits.
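The second step, grouping pairs into “distant families,” can be pictured with a small sketch. The article does not say which network clustering method the researchers use, so this example swaps in the simplest possible one: connected components over a graph whose edges are pairs sharing at least a threshold amount of DNA. The sample IDs, the centimorgan values, and the `min_shared_cm` cutoff are all made up for illustration.

```python
from collections import defaultdict


def cluster_families(ibd_pairs, min_shared_cm=3.0):
    """Group individuals into connected components.

    ibd_pairs: iterable of (id_a, id_b, shared_cm) tuples.
    Pairs sharing less than min_shared_cm do not link their members.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, shared_cm in ibd_pairs:
        root_a, root_b = find(a), find(b)  # registers both individuals
        if shared_cm >= min_shared_cm and root_a != root_b:
            parent[root_a] = root_b        # merge the two groups

    groups = defaultdict(set)
    for person in parent:
        groups[find(person)].add(person)
    # Largest families first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


toy_pairs = [
    ("p1", "p2", 12.5),
    ("p2", "p3", 8.0),
    ("p4", "p5", 4.1),
    ("p1", "p5", 1.2),   # below the cutoff, so it does not merge the groups
]
print(cluster_families(toy_pairs))
# [['p1', 'p2', 'p3'], ['p4', 'p5']]
```

Connected components tend to over-merge in real data, because a single weak link can join two large groups; dedicated community-detection methods are designed to avoid that.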
A Pioneer in IBD Estimation
So what sets iLASH apart from other IBD estimation methods? Scalability and accuracy. Because iLASH can perform IBD estimation at large scale, biobanks (repositories of biological samples for research) that were previously infeasible to analyze can now be examined for genetic connections at unprecedented speed.
“Before iLASH, finding genetic connections in a dataset of 50,000 individuals would take more than a week (~6 days per chromosome),” said Shemirani. “The same dataset is analyzed by iLASH in an hour!”
To achieve this, iLASH employs locality-sensitive hashing (LSH), which quickly filters out pairs of genetic samples that are unlikely to be related, leaving only the pairs with a high probability of shared DNA. The algorithm is further accelerated by parallel computing, which allows multiple processes to be carried out simultaneously, creating an efficient approach to IBD estimation.
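The filtering idea behind LSH can be sketched in a few lines. This is a generic MinHash-with-banding illustration, not the iLASH implementation: the 0/1 allele strings, the window size, the k-mer length, and the function names are all assumptions chosen to keep the example small. Each person's genome is cut into windows, each window is summarized by a short signature, and only people whose signatures land in the same bucket are kept as candidate pairs for closer inspection.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def stable_hash(seed, token):
    """Deterministic 32-bit hash of a (seed, token) pair."""
    return zlib.crc32(repr((seed, token)).encode("utf-8"))


def window_tokens(segment, k):
    """Positional k-mers ("shingles") of one window of alleles."""
    return {(i, segment[i:i + k]) for i in range(len(segment) - k + 1)}


def minhash(tokens, n_hashes):
    """MinHash signature: the smallest hash of the set under each seed."""
    return [min(stable_hash(seed, t) for t in tokens) for seed in range(n_hashes)]


def candidate_pairs(haplotypes, window=16, k=4, n_hashes=8, bands=4):
    """Return (id_a, id_b, window_start) triples that share an LSH bucket."""
    rows = n_hashes // bands
    length = min(len(h) for h in haplotypes.values())
    buckets = defaultdict(list)
    for sample_id, hap in haplotypes.items():
        for start in range(0, length - window + 1, window):
            sig = minhash(window_tokens(hap[start:start + window], k), n_hashes)
            for b in range(bands):
                # Samples agreeing on every row of a band share a bucket.
                key = (start, b, tuple(sig[b * rows:(b + 1) * rows]))
                buckets[key].append(sample_id)
    pairs = set()
    for (start, _band, _values), ids in buckets.items():
        for a, b in combinations(sorted(set(ids)), 2):
            pairs.add((a, b, start))
    return pairs


toy = {
    "a": "0101100111010001" + "1100101000111010",
    "b": "0101100111010001" + "0011010111000101",  # same first window as "a"
    "c": "1010011000101110" + "0110110001011001",
}
pairs = candidate_pairs(toy)
print(("a", "b", 0) in pairs)   # True: identical windows always collide
```

Only pairs that share a bucket need a detailed comparison, so the expensive work is no longer proportional to the number of all possible pairs. Because each window and each sample is hashed independently, the hashing step also splits naturally across parallel workers.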
As a crucial step, Shemirani and Ambite collaborated with geneticists and researchers from various institutions to ensure iLASH is compatible with common formats used by bioinformaticians, who apply information generated from the algorithm to biological and medical research.
“Without such feedback from real geneticists at the University of Colorado Medical Campus and Icahn School of Medicine at Mount Sinai, we could not have achieved this,” Shemirani said.
Revolutionizing Population Genetics
iLASH has significant real-world applications in both population genetics and personalized medicine.
In population genetics, iLASH’s efficiency and accuracy as an IBD estimation method are unmatched by other types of analyses, and the method has already been adopted by experts across the country.
“We can use iLASH in very large datasets to extract patterns of migration and recent fine-scale ancestry structures for the first time,” said Shemirani.
In fact, Dr. Gillian Belbin, an iLASH co-author and researcher at the Icahn School of Medicine at Mount Sinai, used iLASH to analyze the UK Biobank, a genetic dataset of 500,000 people in the UK. Among other findings, the study showed patterns of common ancestry with Nordic populations who inhabited areas that are historically contact points for Viking populations.
Incorporating Diversity into the Conversation
In the field of medicine, iLASH is not only an effective tool for studying the genetic origins of rare diseases, but also a promising way to better our understanding of diversity in genetics.
“Helping with the discovery of these rare genetic origins for various diseases is just one of the utilities of such studies,” Ambite noted. “For example, they can also help geneticists with calibrating genetic disease risk calculations for diverse non-European populations.”
By building upon previous analyses that were limited to white European populations, iLASH enables researchers to expand existing results to cover a broader range of population groups.
“Including iLASH in genetic study pipelines, such as polygenic risk scores or disease mapping studies, will help to account for population structure and hidden relatedness in the datasets,” explained Shemirani. “This will help partially address the problems that arise from the imbalance, or lack of diversity, of the datasets and studies in terms of population demographics.”
Another upside of iLASH is that it is more cost-effective than many alternatives in medicine, which makes it a much more accessible option.
Though iLASH has proven highly promising in various applications, there is still work to be done. Shemirani named three particular improvements that the team is currently working on.
The primary challenge is to create a distributed version of iLASH to meet increasing scalability demands. As datasets grow larger by the day, iLASH needs the resources required to cover a sizable amount of data accurately and efficiently.
Shemirani and Ambite are also looking to create a cloud service for iLASH, though ethical and security issues surrounding sensitive genetic data pose a problem for this goal.
Finally, adding incremental analysis would allow iLASH to be adopted in commercial settings, where new customers are constantly being added and need to be incorporated into the existing dataset.
Though not all of us are going to find a long-lost twin or celebrity relative, iLASH can help researchers extract crucial genetic information that will inform relevant research in the fields of population genetics and medicine, benefiting us all in the long term.