The original lobSTR pipeline involves running two steps:
lobSTR: Detects repetitive reads and aligns them to STRs in the genome, outputting a BAM file
allelotype: Takes a BAM file generated by
lobSTRand produces genotype calls at each STR
bowtiewere very insentitive to large insertions or deletions from the reference genome. Therefore, these aligners missed a large fraction of polymorphism at STRs and we needed to create an aligner specific to STR regions. However, the more recently developed BWA-MEM has proven to be quite sensitive to large indels. It is now possible to use BAMs generated by
BWA-MEMas input to
allelotype. This has a number of advantages:
BWA-MEM, saving the computational cost of running
BWA-MEMalignment is generally more sensitive, resulting in higher coverage for making STR calls.
Different aligners behave differently in terms of their alignment algorithms, output formats, and scoring conventions. The following points are caveats that we found using other aligners:
BWA-MEMwill sometimes soft clip the end of the read. At STR regions, especially those in which the region flanking the STR also happens to be a bit messy, this can result in a read being called as the reference allele when in reality there is a length variation, but the variable region just happened to get clipped. To remedy this, we suggest either re-doing local realignment by adding the
--realignoption, or simply filtering these reads using
lobSTRdiscards reads that map to multiple locations. Many aligners, including
BWA-MEMmark reads that do not have a unique best alignment with a map quality score of 0. To filter these reads from allelotyping, use the flag
BWA-MEMis often much more sensitive, as it aligns the whole read at once rather than relying on finding alignments for each flanking region separately. As a result, a lot of reads with a flanking region too messy to be mapped by
lobSTRwill end up being mapped by
BWA-MEM. This is mostly a benefit, as it allows making calls based on higher coverage. However, reads that barely span an STR and have messy flanking regions are much more likely to be erroneous. Therefore we recommend requiring that at least 10 bp at each enc of the read match the reference sequence perfectly (by using the option
--min-read-end-match 10). Additionally, the option
--max-repeats-in-endsallows filtering reads for which the flanking region around the STR is too repetitive.
Based on comparisons to capillary electrophoresis and on physical inspection of alignments for erroneous calls, we recommend the following options to
... refers to required arguments to
allelotype (see usage).