Recommendations for setting lobSTR alignment parameters
General usage for running the lobSTR alignment step can be found on the usage and best practices for WGS/WES pages. This page gives advice on more specific topics related to alignment.
Read length recommendations
Depending on the length and type of your input reads, certain specific parameters might improve lobSTR performance.
If aligning to the standard hg19 reference provided in the resource bundle, we recommend at least 100bp reads for the best quality alignments. However, read lengths as low as 50-70bp can give good results if you are primarily interested in STRs with a total length of ~20bp or less. Be aware that short reads will give very biased results for longer STRs: only short alleles that can be spanned by the read length will be detected, leading to a large bias toward reporting the shorter allele at heterozygous loci.
lobSTR can also be used to align longer reads (several hundreds of bp). For very long reads, these parameters can be useful:
Note that lobSTR aligns each read to a primary STR. If a read spans more than one STR, this will be noted in a flag ("XO"). However the allelotype step will only look at the primary STR aligned to (see specific applications below for how to bias lobSTR toward choosing certain types of STRs as the primary STR). We plan to change this in the future.
--maxflank INT: this parameter gives how long to trim flanking regions down to before attempting alignment. Setting this value higher results in more specific alignment but longer run time. Setting it lower results in less specific alignments, some of which will be flagged as multi-mappers and discarded, but dramatically speeds up run time.
--maxflank 50 performs well in most cases, but if you are specifically interested in longer STRs (lengths ~60bp+) you will get higher quality alignments with
--bwaq INT: this parameters is used to trim read ends based on quality scores before attempting STR detection or alignment. Often, errors tend to accumulate at the ends of reads, which are reflected in the base quality scores. Trimming off these low quality ends results in better alignments. Setting
--bwaq 15 usually performs well. If you are using sequence files with base quality scores in the old Phred format (where quality scores are given as Phred + 64 rather than as Phred +33), add the
- With longer reads, there is a higher chance that the flanking regions will contain some errors. See the alignment parameters below to increase the tolerance to gaps and mismatches in the flanking regions.
Based on the characteristics of the STRs you're interested in genotyping, it might be helpful to set some special flags:
- Genotyping Y-STR/CODIS/Marhsfield markers in long reads. During alignment, lobSTR assigns each read to a single STR location. When using long reads (100bp+ or so), a single read might span more than 1 STR. By default it will choose the region with the best entropy score within a 16bp window. This tends to bias the algorithm toward choosing short homopolymers as the primary STR spanned by each read, even if a longer tetranucleotide exists elsewhere in the read. To bias it toward choosing longer and more complex STRs, you can set
--fft-window-size 24 --fft-window-step 12.
- Genotyping long expansions. By default lobSTR trims flanking regions before attempting to align reads to STRs. Additionally, it doesn't return reads with more than 50bp length difference from the reference. For genotyping large expansions (e.g. Fragile X or Huntington's), this is not desirable. Setting
--maxflank 100 --max-diff-ref 1000 will allow you to pick up many more of these events.
Alignment sensitivity and specificity
lobSTR offers the following parameters to control alignment specificity and sensitivity:
--mismatch INT: edit distance allowed during alignment of each flanking region. Set this higher to be more tolerable of alignment errors.
-g INT: maximum number of gap opens allowed in each flanking region. This is set to 1 by default. Gap opens in flanking regions are often a signal of low quality alignment to the STR, so it is not recommended to increase this for short reads.
-e INT: maximum number of gap extensions allowed in each flanking region. This is also set to 1 by default but can be increased if using very long reads.
-r INT: edit distance allowed during alignment of each flanking region as a percentage of the read length. Use this instead of
-m to have a max edit distance per read length rather than an absolute cutoff.
Filtering output reads
You can use the following to filter which reads are output:
--max-diff-ref INT: don't report reads different in length by more than INT bp from the reference allele. If you are specifically interested in big expansions, you must set this to a number significantly higher than the expansion length you are looking for.
--mapq INT: maximum allowed mapq score. This is calculated as the sum of qualities at base mismatches.
-u: require length differences to be a multiple of the repeat unit.
By default, lobSTR does not report alignments for reads that did map to a unique best location. To report these reads, use
--multi. This will report one primary alignment and list alternative locations in the "XA" tag of the BAM file.
Speeding up alignment
To speed up lobSTR alignment, try:
- Running in multithread mode by using the
-p INT parameter to set the number of processors to use.
- As described above, setting
--maxflank INT to a smaller number dramatically reduces alignment time, but at the cost of a slight reduction in alignment quality. We have not tested values of
--maxflank less than 50.
- Don't allow gaps in flanking regions (set
-g 0 -e 0).