lobSTR: a short tandem repeat profiler for next generation sequencing data


How do I cite lobSTR?
Please cite: Gymrek M, Golan D, Rosset S, & Erlich Y. lobSTR: A short tandem repeat profiler for personal genomes. Genome Research. 2012 April 22.

How can I visualize lobSTR alignments?
We have created the package pybamview, which lets you visualize BAM alignments from lobSTR and other sources in your web browser. Other alignment viewers include samtools tview, which is text-based and runs in the terminal, as well as UCSC and IGV, but these do not display the lengths of insertions at STRs.

Why do I need to specify --rg-lib and --rg-sample when I run lobSTR?
lobSTR generates alignment outputs in the BAM format. lobSTR uses these parameters to set the sample and library in the read group for each output alignment. This allows downstream tools (including the lobSTR allelotype step) to know which sample each read came from. Details about the meaning of the read group tag can be found in the SAM format specification and are also described on the GATK forum.
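
As a rough illustration of how a downstream tool can tell which sample a read came from, the sketch below parses an @RG header line of the kind defined in the SAM specification. The function name and the example header values are made up for illustration and are not actual lobSTR output.

```python
def parse_read_group(header_line):
    """Parse a SAM @RG header line into a dict of tag -> value."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Each remaining field has the form TAG:VALUE; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])

# Hypothetical header line; ID, SM and LB are standard SAM read group tags.
example = "@RG\tID:sample1;lib1\tSM:sample1\tLB:lib1"
rg = parse_read_group(example)
print(rg["SM"], rg["LB"])
```

Each alignment record then carries an RG tag whose value matches the ID field, which is how the sample (SM) and library (LB) are looked up per read.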

How does lobSTR determine the STR allele supported by a read?
lobSTR offers two methods for determining the allele supported by a read:
  1. Use all indels from the read, even if they are not in the STR region.
  2. Make an attempt to only count indels that are actually in the STR region.
For example, a read whose only difference from the reference is a 2 bp deletion in the flanking sequence would be called as -2 by method 1 and as 0 by method 2, since there is a length difference but it is not in the STR.
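
The difference between the two methods can be sketched in a few lines. This is a simplified model, not lobSTR's implementation: it assumes each indel is given as a (reference position, size) pair and ignores indels that straddle the STR boundary. The function name, the `include_flank` parameter, and the coordinates are hypothetical.

```python
def allele_length_difference(indels, str_start, str_end, include_flank=True):
    """Return the net length difference (bp) of a read relative to the reference.

    indels: list of (ref_pos, size) pairs; size > 0 for insertions, < 0 for deletions.
    str_start, str_end: half-open reference interval [str_start, str_end) of the STR.
    include_flank: True mimics method 1, False mimics method 2.
    """
    total = 0
    for ref_pos, size in indels:
        in_str = str_start <= ref_pos < str_end
        if include_flank or in_str:
            total += size
    return total

# Hypothetical read: one 2 bp deletion at position 90, STR spans [100, 140).
indels = [(90, -2)]
print(allele_length_difference(indels, 100, 140, include_flank=True))   # -2
print(allele_length_difference(indels, 100, 140, include_flank=False))  # 0
```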

In v2.0.3 and below, #2 is the default, and you can switch to #1 by the option --include-flank.
In v2.0.4 and above, #1 is the default, and you can switch to #2 by the option --dont-include-flank.

We made this switch because we found that method 1, while less precise, gave more robust results when compared with data from other sources, such as capillary electrophoresis. However, some users have reported that they get more accurate results with the --dont-include-flank option, and the best option may depend on your application.

Can lobSTR detect STRs of periods below 2 or above 6?
The definition of what constitutes an STR is not clear cut. Currently lobSTR is limited to STR motifs of periods 1 through 6. We recently expanded our reference to include mononucleotides and a larger set of STRs.

How are mismatches calculated?
The parameters that control alignment mismatches/gaps are:

Each flanking region is aligned separately, so each parameter above is applied to both the left and right flanking regions. That means that if you allow an edit distance of 2 (-m 2), an edit distance of 2 is allowed in the left flanking sequence and another 2 in the right flanking sequence.
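
To make the per-flank rule concrete, here is a minimal sketch that applies one edit distance threshold to each flank independently. It uses a plain Levenshtein distance as a stand-in; the function names, the `max_edit` parameter, and the sequences are hypothetical, and lobSTR's aligner may count edits differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]

def flanks_pass(left_read, left_ref, right_read, right_ref, max_edit=2):
    """Apply the same threshold independently to the left and right flanks."""
    return (edit_distance(left_read, left_ref) <= max_edit and
            edit_distance(right_read, right_ref) <= max_edit)

# Two edits in each flank: four edits in total, yet the read passes with max_edit=2.
print(flanks_pass("ACGTAC", "ACGTTT", "GGCCAA", "GGCTTA", max_edit=2))
```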

How is the SAM format generated?
After an STR alignment is found, lobSTR performs a local realignment using the Needleman-Wunsch algorithm with an affine gap penalty. This step allows lobSTR to generate accurate CIGAR strings for the entire read, including the STR region, that can be used downstream for viewing in samtools tview and for calling SNPs in STR regions.
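
For readers unfamiliar with affine gap penalties, the sketch below computes the score of a Needleman-Wunsch style global alignment using three dynamic programming matrices (the Gotoh formulation). It returns only the score; producing a CIGAR string would additionally require a traceback. The scoring values are arbitrary assumptions and are not lobSTR's parameters.

```python
NEG_INF = float("-inf")

def affine_global_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score of the best global alignment of a and b under affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open; continuing one costs gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open, Y[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - gap_open, X[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])

# One 2 bp gap: three matches (+3) minus gap_open + gap_extend (-3).
print(affine_global_score("ACGGT", "ACT"))
```

Because extending a gap is cheaper than opening a new one, a multi-base insertion or deletion tends to be placed as one contiguous gap rather than as several scattered ones.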

How do you obtain the genomic coordinates of markers?
We use the table of STRs generated by Tandem Repeats Finder, available on the UCSC Genome Browser.

How should I filter my results to obtain only high quality genotypes?
lobSTR reports a score for each genotype result based on the likelihood ratios of each possible genotype, ranging from 0 to 1. We have found that this score reliably reflects the accuracy of calls.
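
A score-based filter can be as simple as the sketch below. The record layout, the locus names, and the cutoff of 0.8 are all hypothetical; no particular threshold is recommended here, so choose one that suits your data.

```python
def filter_genotypes(calls, min_score=0.8):
    """Keep genotype calls whose score is at least min_score.

    calls: list of dicts with a 'score' key in the range 0 to 1.
    min_score: hypothetical cutoff; choose one suited to your data.
    """
    return [call for call in calls if call["score"] >= min_score]

# Made-up calls for illustration only.
calls = [
    {"locus": "locusA", "genotype": (0, 4), "score": 0.99},
    {"locus": "locusB", "genotype": (-2, 0), "score": 0.55},
]
print([c["locus"] for c in filter_genotypes(calls)])
```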

What does the remove duplicates procedure do? How can I turn it off? 
All reads starting at the same 5' location with the same length are collapsed into a single read. If the reads have multiple copy numbers of an STR, the copy number exhibited by the majority of the PCR duplicates is retained. If there is no majority vote, quality scores are used to break ties. To turn off removing duplicates, use the --no-rmdup option to allelotype.
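
The collapsing procedure can be sketched as follows. This is a simplified model under stated assumptions: reads are plain dicts with hypothetical keys, the most frequent copy number wins (a plurality rather than a strict majority), and ties are broken by summed quality, which is one possible reading of "quality scores are used to break ties".

```python
from collections import defaultdict

def collapse_duplicates(reads):
    """Collapse reads sharing the same start position and length.

    reads: list of dicts with keys 'start', 'length', 'copy_number', 'quality'.
    Returns one representative read per (start, length) group.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["start"], read["length"])].append(read)

    kept = []
    for group in groups.values():
        # Tally votes and summed quality per copy number.
        votes = defaultdict(int)
        quality = defaultdict(float)
        for read in group:
            votes[read["copy_number"]] += 1
            quality[read["copy_number"]] += read["quality"]
        # Most votes wins; summed quality breaks ties (an assumed tie-break rule).
        best = max(votes, key=lambda cn: (votes[cn], quality[cn]))
        # Keep the highest-quality read that shows the winning copy number.
        kept.append(max((r for r in group if r["copy_number"] == best),
                        key=lambda r: r["quality"]))
    return kept

# Three duplicates at the same position: copy number 10 wins two votes to one.
reads = [
    {"start": 100, "length": 101, "copy_number": 10, "quality": 30},
    {"start": 100, "length": 101, "copy_number": 10, "quality": 25},
    {"start": 100, "length": 101, "copy_number": 11, "quality": 40},
]
print(collapse_duplicates(reads))
```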

How can I turn off trimming?
lobSTR trims reads based on quality scores to remove erroneous read ends. The --bwaq option acts the same as the -q parameter for BWA. Additionally, by default, lobSTR trims all flanking regions to 25bp before aligning. To avoid this step, set --maxflank to a value near the read length of the input.
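
For reference, the kind of 3' quality trimming that BWA's -q option performs can be sketched as below: walking in from the 3' end, it accumulates (threshold - quality) and cuts the read where that running sum peaks. The function name and the `min_len` default are assumptions, and whether lobSTR reproduces every detail of this rule is also an assumption.

```python
def bwa_style_trim_length(quals, threshold, min_len=35):
    """Return the read length to keep after BWA-style 3' quality trimming.

    quals: list of Phred quality scores, ordered 5' to 3'.
    threshold: the trimming quality threshold.
    min_len: shortest length a read may be trimmed to (assumed value).
    """
    if threshold < 1:
        return len(quals)
    running = 0
    best = 0
    keep = len(quals)
    # Walk from the 3' end, accumulating (threshold - quality).
    for pos in range(len(quals) - 1, min_len - 1, -1):
        running += threshold - quals[pos]
        if running < 0:
            break
        if running > best:
            best = running
            keep = pos
    return keep

# Made-up qualities: 40 good bases followed by 5 poor ones.
quals = [30] * 40 + [2] * 5
print(bwa_style_trim_length(quals, threshold=10))
```

With these made-up qualities the five low-quality bases at the 3' end are removed and the first 40 bases are kept.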

What does the --noise_model option take?
The --noise_model option takes the name of a file to read the noise model parameters from or write them to. If you specify --command classify, the allelotype program reads the given noise model file and uses those parameters. If you specify --command train, allelotype writes the newly trained parameters to the file given to --noise_model. Note that training is only reliable when the sample is male and contains a large number of reads aligned to sex chromosomes, so in most cases we recommend just using the --command classify option with a provided noise model.

The lobSTR indexer did not produce any output.
In versions below 2.0.0, lobSTR requires chromosome names to not contain "_". If lobSTR does not return any output from the indexing step, it may be because this illegal character is present. This has been changed in version 2.0.0 to allow "_" but disallow "$" in chromosome names.
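
A quick way to check a reference for offending names before indexing is sketched below. It assumes the chromosome name is the first whitespace-delimited token of each FASTA header line, which may not match how lobSTR parses headers; the function name and the example headers are hypothetical.

```python
def illegal_chromosome_names(fasta_lines, forbidden="$"):
    """Return chromosome names from FASTA header lines that contain a forbidden character."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            # The name is the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            if any(ch in name for ch in forbidden):
                bad.append(name)
    return bad

# Made-up headers; use forbidden="_" to check against versions below 2.0.0.
lines = [">chr1", "ACGT", ">chr6_random", "ACGT", ">chr$1 description"]
print(illegal_chromosome_names(lines, forbidden="_"))
print(illegal_chromosome_names(lines, forbidden="$"))
```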

What read lengths can lobSTR align?
Because most STRs in lobSTR's reference are ~40bp, and a read must fully span an STR plus some flanking region in order to produce a unique alignment, lobSTR works best with reads of 100bp or more. Some high quality alignments can be produced with reads as short as 45bp, but alignment errors are much more frequent with these short reads. 36bp reads cannot be reliably aligned by lobSTR.