Jun 7, 2023 10:30 AM - Jun 7, 2023 10:50 AM, Jianye Ge, Xuewen Wang, Novel Approaches, Section Presentation
Calling tandem repeat (TR) variants from DNA sequences is of both theoretical and practical significance. Many software tools have been developed for detecting TRs. However, few studies have addressed the detection of TR alleles from long-read sequences, and the effectiveness of detecting TR alleles from whole genome sequence (WGS) data still needs improvement. Herein, a novel algorithm is described that determines the boundaries of TR regions, and a software program, TRcaller, has been developed to call TR alleles from both short- and long-read sequences, covering both whole genome and targeted sequences generated on multiple sequencing platforms. The results show that TRcaller detects TR alleles with substantially higher accuracy than mainstream software tools while running orders of magnitude faster. A call accuracy of 99.4% was achieved for the 20 CODIS core STR loci across 289 WGS samples (30x Illumina coverage) from the 1000 Genomes Project, compared with 93.4% for HipSTR. To reach a 99.9% calling rate, average depths of at least 25x, 10x, and 5x were needed for Illumina PE150, Illumina PE250, and PacBio CCS reads, respectively. TRcaller takes less than 2 seconds to call STRs at the CODIS core STR set from WGS reads at depths of up to 300x. TRcaller can facilitate scalable, accurate, and ultrafast TR allele calling from large-scale sequence datasets in applications such as forensics, genetic genealogy, medical research, disease diagnosis, and clinical testing.
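The abstract does not spell out how the boundaries of a TR region are determined. As a rough illustration of the general idea of boundary-based allele calling, the sketch below locates a repeat region between two flanking sequences in each read and names the allele from the region length. It is not TRcaller's algorithm: it assumes exact-match flanks, a single motif length per locus, and a simple read-support threshold, and the names `call_allele`, `genotype`, and `min_reads` are hypothetical.

```python
from collections import Counter


def call_allele(read, left_flank, right_flank, motif_len):
    """Name the allele spanned by one read, or return None.

    Assumption: both flanks occur exactly (no mismatches) and the read
    covers the whole repeat region.
    """
    i = read.find(left_flank)
    if i == -1:
        return None
    start = i + len(left_flank)          # first base of the repeat region
    end = read.find(right_flank, start)  # first base of the right flank
    if end == -1:
        return None
    # Length-based naming: full motif copies, plus leftover bases
    # written after a dot (e.g. "9.3" = 9 copies and 3 extra bases).
    full, rem = divmod(end - start, motif_len)
    return str(full) if rem == 0 else f"{full}.{rem}"


def genotype(reads, left_flank, right_flank, motif_len, min_reads=2):
    """Return up to two alleles supported by at least min_reads reads."""
    calls = [call_allele(r, left_flank, right_flank, motif_len) for r in reads]
    counts = Counter(c for c in calls if c is not None)
    supported = [a for a, n in counts.most_common(2) if n >= min_reads]
    return tuple(sorted(supported, key=float))


# Toy reads with made-up flanks around a 4-base motif.
LEFT, RIGHT = "GGTC", "CCTA"
read_a = "AC" + LEFT + "AATG" * 6 + RIGHT + "GG"
read_b = "AC" + LEFT + "AATG" * 6 + "ATG" + "AATG" * 3 + RIGHT + "GG"
print(genotype([read_a, read_a, read_b, read_b, "NNNN"], LEFT, RIGHT, 4))
# → ('6', '9.3')
```

A production caller would also need to tolerate sequencing errors in the flanks, handle reads that only partially span the region, and model stutter, none of which this sketch attempts.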