lobSTR: a short tandem repeat profiler for next generation sequencing data

Only basic lobSTR usage is described here. Documentation for additional scripts and more usage details can be found on the documentation page.

Basic usage (as of v4.0.0)

To run lobSTR using default parameters:

Single-end reads:

lobSTR -f FILE1,FILE2,.. \
   --index-prefix PATH_TO_INDEX/lobSTR_ \
   -o OUTPUT_PREFIX \
   --rg-sample SAMPLE \
   --rg-lib LIB
	
Paired-end reads:

lobSTR --p1 FILE1,FILE2,.. --p2 FILE1,FILE2,.. \
   --index-prefix PATH_TO_INDEX/lobSTR_ \
   -o OUTPUT_PREFIX \
   --rg-sample SAMPLE \
   --rg-lib LIB
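
When the input files are held as separate names (for example one file per lane), the comma-separated lists expected by -f, --p1 and --p2 can be built with a small helper. This is a minimal sketch: join_csv and the lane file names are hypothetical, and keeping both lists in the same order is assumed to be what pairs each first-end file with its mate file. The command is printed rather than executed, so the sketch runs without lobSTR installed.

```shell
# Join any number of arguments into one comma-separated string.
# The subshell body keeps the IFS change local to the function.
join_csv() (
  IFS=,
  printf '%s\n' "$*"
)

# Hypothetical file names, for illustration only.
P1=$(join_csv lane1_1.fq lane2_1.fq)
P2=$(join_csv lane1_2.fq lane2_2.fq)

# Print the resulting command instead of running it.
echo "lobSTR --p1 $P1 --p2 $P2 -q --index-prefix PATH_TO_INDEX/lobSTR_ -o OUTPUT_PREFIX --rg-sample SAMPLE --rg-lib LIB"
```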
	
Sort and index:
samtools sort OUTPUT_PREFIX.aligned.bam OUTPUT_PREFIX.sorted
samtools index OUTPUT_PREFIX.sorted.bam
	
Run allelotype:
allelotype \
  --command [train|classify] \
  --bam OUTPUT_PREFIX.sorted.bam \
  --noise_model NOISEMODELPREFIX \
  --out OUTPUT_PREFIX \
  --strinfo STRINFOFILE \
  --index-prefix PATH_TO_INDEX/lobSTR_
	
To run an example using files available in the download:
lobSTR \
  --p1 PATH_TO_LOBSTR/tests/tmp_1.fq \
  --p2 PATH_TO_LOBSTR/tests/tmp_2.fq \
  -q \
  --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ \
  -o test \
  --rg-sample mysample --rg-lib mylibrary

samtools sort test.aligned.bam test.sorted
samtools index test.sorted.bam

allelotype \
  --command classify \
  --bam test.sorted.bam \
  --noise_model PATH_TO_LOBSTR/models/illumina_v3.pcrfree \
  --out test \
  --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab \
  --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_
	

The index and strinfo files are available in the resource bundle that can be found on the downloads page.

Note that the arguments --rg-sample and --rg-lib are now required to run lobSTR. These flags add the appropriate read group information to the output BAM file, which is important for multi-sample calling. For more information on these, see the FAQ page.
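
To confirm that the read group information was written, the @RG lines of the BAM header can be inspected for the SM and LB tags. The sketch below is one possible check, not part of lobSTR: rg_tag is a hypothetical helper, and the header line fed to it uses made-up values (including the ID field, whose real format is not described here). With samtools installed, the real header would come from samtools view -H test.sorted.bam.

```shell
# Print the value of one tag (e.g. SM or LB) from every @RG header line on stdin.
rg_tag() {
  awk -v tag="$1" -F '\t' '
    /^@RG/ {
      for (i = 2; i <= NF; i++)
        if (substr($i, 1, 3) == tag ":") print substr($i, 4)
    }'
}

# Made-up header line standing in for: samtools view -H test.sorted.bam
printf '@RG\tID:rg1\tSM:mysample\tLB:mylibrary\n' | rg_tag SM
printf '@RG\tID:rg1\tSM:mysample\tLB:mylibrary\n' | rg_tag LB
```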

This will create the example output files. To see a description of the output file formats, see file formats. To see all available parameters, run lobSTR --help and allelotype --help.
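
A quick way to confirm that a run finished is to check for the files it should have written. The sketch below assumes the output names given in the option descriptions on this page (PREFIX.aligned.bam and PREFIX.aligned.stats from lobSTR, PREFIX.vcf and PREFIX.genotypes.tab from allelotype classify); the exact set may differ between versions. missing_outputs is a hypothetical helper, not part of lobSTR.

```shell
# Print every expected output file that does not exist for the given prefix.
# The suffix list is an assumption taken from the option descriptions.
missing_outputs() {
  for suffix in aligned.bam aligned.stats vcf genotypes.tab; do
    [ -e "$1.$suffix" ] || printf '%s\n' "$1.$suffix"
  done
}

# Check the example run, which used "test" as its output prefix.
missing_outputs test
```

An empty result means all of the assumed files are present.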


Running lobSTR on Curoverse

You can now run the lobSTR v3 pipeline on Curoverse, which is built with the Arvados open source project and makes it easy to manage data and pipelines. To create a free account, go to https://curoverse.com and click "Log In." A step-by-step tutorial for running lobSTR v3 on Curoverse is posted here. If you run into any difficulties or have any feature suggestions, email support@curoverse.com or join the #arvados channel on IRC.

Detailed description of command line options

lobSTR

| Parameter | Description | Type | Default |
|---|---|---|---|
| -f, --files | file or comma-separated list of single ends in fasta, fastq, or BAM files to align. Also use this parameter to specify paired-end BAM files to align. | string | - |
| --p1 | file or comma-separated list of files containing the first end of paired end reads in fasta or fastq format | string | - |
| --p2 | file or comma-separated list of files containing the second end of paired end reads in fasta or fastq format | string | - |
| -o, --out | (REQUIRED) prefix for output files. will output prefix.aligned.bam and prefix.aligned.stats. | string | - |
| --index-prefix | (REQUIRED) prefix for lobSTR's bwa reference (must run lobstr_index.py to create index or download from lobSTR website). If the index is in PATH_TO_INDEX, this argument is PATH_TO_INDEX/lobSTR_. | string | - |
| --rg-sample | (REQUIRED) Use this in the read group SM tag | string | - |
| --rg-lib | (REQUIRED) Use this in the read group LB tag | string | - |
| -h, --help | display help screen | bool | - |
| -v, --verbose | print out useful progress messages | bool | - |
| --quiet | don't print anything to stderr or stdout | bool | - |
| --version | print out lobSTR program version | bool | - |
| -q, --fastq | reads are in fastq format | bool | - |
| --bam | reads are in single-end BAM format | bool | - |
| --gzip | the input files are gzipped (fastq or fasta input) | bool | - |
| --bampair | Reads are in BAM format and are paired-end. NOTE: BAM file MUST be sorted or collated by read name. (e.g. samtools sort -n file.bam prefix) | bool | - |
| --bwaq | trim reads based on quality score. Same as the -q parameter in BWA | int | 10 |
| --oldillumina | specifies that base pair quality scores are reported in the old Phred format (Illumina 1.3+, Illumina 1.5+) where quality scores are given as Phred + 64 rather than Phred + 33 | bool | - |
| --multi | Report reads mapping to multiple genomic locations | bool | - |
| --noweb | [deprecated] Do not report any user information to Amazon S3. [This feature has been discontinued] | bool | - |
| -p | number of threads to use | int | 1 |
| --min-read-length | don't process reads shorter than this | int | 45 |
| --max-read-length | don't process reads longer than this | int | 1024 |
| --fft-window-size | size of sliding entropy window | int | 16 |
| --fft-window-step | step size of the sliding window | int | 4 |
| --entropy-threshold | threshold score to call a window periodic | float | 0.45 |
| --minflank | minimum length of flanking region to try to align | int | 8 |
| --maxflank | length to trim flanking regions to before aligning | int | 100 |
| --max-diff-ref | only report reads differing by at most this number of bp from the reference allele | int | 50 |
| --extend | Number of bp the reference was extended when building the index. Must be same as --extend parameter used to run lobstr_index.py | int | 1000 |
| --mapq | maximum allowed mapq score calculated as the sum of qualities at base mismatches | int | 100 |
| -u | only report reads different by an integer number of copy numbers from the reference allele | bool | - |
| -m, --mismatch | allowed edit distance in either flanking region | int | 1 |
| -g | Maximum number of gap opens allowed in each flanking region | int | 1 |
| -e | Maximum number of gap extends allowed in each flanking region | int | 1 |
| -r | fraction of missing alignments given 2% uniform base error rate. Ignored if -m is set | float | -1 |
| --max-hits-quit-aln | Stop alignment search after this many hits found. Use -1 for no limit. | int | 1000 |
| --min-flank-allow-mismatch | don't allow mismatches if aligning flanking regions shorter than this | int | 30 |
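
The --oldillumina flag matters because the same quality character decodes to very different scores under the two offsets. The sketch below illustrates the arithmetic only and says nothing about how lobSTR parses qualities internally; phred_score is a hypothetical helper.

```shell
# Decode one quality character into a Phred score for a given ASCII offset.
# printf '%d' "'c" prints the numeric character code of c.
phred_score() {
  code=$(printf '%d' "'$1")
  echo $((code - $2))
}

phred_score h 64   # 40 under the Phred+64 encoding
phred_score I 33   # 40 under the Phred+33 encoding
phred_score h 33   # 71: the same character read with the wrong offset
```

Reading Phred+64 data with the Phred+33 offset inflates every score by 31, which would also skew any quality-based trimming such as --bwaq.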

allelotype

| Parameter | Description | Type | Default |
|---|---|---|---|
| --command | (REQUIRED) One of train or classify. train: train the noise model based on the provided BAM alignment. This only works on male samples with a large number of reads aligned to sex chromosome STRs. It is currently recommended that you use the provided stutter model file. Training outputs the files $noisemodelprefix.stepmodel, $noisemodelprefix.stuttermodel, and $noisemodelprefix.stutterproblem. classify: generate allelotypes from the provided alignment file using the provided noise model. Classifying outputs the files $output_prefix.vcf and $output_prefix.genotypes.tab. | string | - |
| --bam | (REQUIRED) comma-separated list of BAM files to analyze. Each should have a unique read group and be sorted and indexed. Recommended input is results from either running lobSTR or BWA-MEM | string | - |
| --out | (REQUIRED) Prefix to name output files | string | - |
| --strinfo | (REQUIRED) File containing statistics for each STR. Available in the resource bundle download. | string | - |
| --noise_model | (REQUIRED) Prefix of the files to write noise model parameters to (--command train) or read them from (--command classify). | string | - |
| --index-prefix | (REQUIRED) prefix for lobSTR's bwa reference (must run lobstr_index.py to create index or download from lobSTR website). If the index is in PATH_TO_INDEX, this argument is PATH_TO_INDEX/lobSTR_. | string | - |
| --no-rmdup | Don't remove PCR duplicates before allelotyping. | bool | - |
| --realign | Redo local realignment. Useful if using alignments generated by other tools. | bool | - |
| --min-het-freq | minimum frequency to make a heterozygous call | float | 0.1 |
| --haploid | comma-separated list of chromosomes to force homozygous calls. | string | - |
| --gridk | Search genotype grid including all observed alleles +/- k bp. | int | 0 |
| --dont-include-pl | Do not print the PL field in the VCF file. | bool | - |
| -h, --help | display help screen | bool | - |
| -v, --verbose | print out useful progress messages | bool | - |
| --quiet | don't print anything to stderr or stdout | bool | - |
| --version | print out lobSTR program version | bool | - |
| --include-gl | Include GL field in the VCF | bool | - |
| --min-border | Filter reads that do not extend past both ends of the STR region by this many bp. | int | 5 |
| --mapq | maximum allowed mapq score calculated as the sum of qualities at base mismatches | int | 100 |
| --max-matedist | Filter reads with a mate distance larger than this many bp. | int | 100000 |
| --max-diff-ref | only report reads differing by at most this number of bp from the reference allele | int | 50 |
| --min-bp-before-indel | Filter reads with an indel occurring less than this many bp from either end of the read | int | 7 |
| --min-read-end-match | Filter reads whose alignments don't exactly match the reference for at least this many bp at both ends | int | 5 |
| --maximal-end-match | Filter reads whose prefix/suffix matches to reference are less than or equal to those obtained when shifting the read ends by distances within this many bp. | int | 15 |
| --chrom | only process loci on this chromosome. | string | - |
| --unit | only report reads different by an integer number of copy numbers from the reference allele | bool | - |
| --max-repeats-in-ends | Filter reads with more than this number of occurrences of the repeat motif in the 4*period bp on either end of the read. -1 means no filter is applied. | int | -1 |
| --filter-mapq0 | Filter reads with map quality 0. Only use for alignments not generated by lobSTR. | bool | - |
| --filter-clipped | Filter reads with hard or soft clipped bases at the ends. | bool | - |
| --filter-reads-with-n | Filter reads that have one or more N bases | bool | - |
| --chunksize | Number of loci to read into memory at a time | int | 1000 |
| --output-bams | Output BAM files: out.reads.bam (contains all reads used for analysis, before collapsing duplicates) and out.filtered.bam (contains all reads removed by filters), where out is the argument to --out. | bool | - |
| --noweb | [deprecated] Do not report any user information to Amazon S3. [This feature has been discontinued] | bool | - |
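
Because --noise_model takes a file prefix rather than a single file, a mistyped prefix is easy to miss. The sketch below checks that the model files exist before classifying. It assumes that classification reads the .stepmodel and .stuttermodel files that training is described as writing; noise_model_ready is a hypothetical helper, not part of lobSTR.

```shell
# Succeed only if every assumed noise model file exists for the given prefix.
noise_model_ready() {
  for suffix in stepmodel stuttermodel; do
    [ -f "$1.$suffix" ] || { echo "missing: $1.$suffix" >&2; return 1; }
  done
}

# Prefix taken from the example run on this page.
if noise_model_ready PATH_TO_LOBSTR/models/illumina_v3.pcrfree; then
  echo "noise model found"
else
  echo "noise model not found"
fi
```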

lobstr_index.py

| Parameter | Description | Type | Default |
|---|---|---|---|
| --str | (REQUIRED) A bed file containing a list of STR loci output from Tandem Repeat Finder. This file does not necessarily need to be created by TRF. The necessary columns are described above | string | - |
| --ref | (REQUIRED) A fasta file containing the reference genome | string | - |
| --out_dir | (REQUIRED) the directory to store the index created | string | - |
| -h, --help | Display a helpful usage message | bool | - |
| -v, --verbose | Display helpful progress messages | bool | - |
| --extend | The number of base pairs of flanking sequence to include on either side of the STR. This should be at least as long as the read length you intend to align. IMPORTANT: if you change the extend parameter here, you MUST remember to set the --extend parameter to the same value when you call lobSTR to ensure that the output coordinates are accurate | int | 1000 |
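
One way to keep the two --extend values in step is to define the value once and reuse it in both commands. This is a minimal sketch under stated assumptions: the input and output names are hypothetical, 500 is an arbitrary non-default value, and invoking the script through python is assumed. Both commands are printed rather than executed, so the sketch runs without lobSTR installed.

```shell
# Single definition of the flank extension, shared by both commands.
EXTEND=500   # arbitrary non-default value, for illustration only

# Hypothetical input and output names.
INDEX_CMD="python lobstr_index.py --str strs.bed --ref genome.fa --out_dir my_index --extend $EXTEND"
ALIGN_CMD="lobSTR -f reads.fq -q --index-prefix my_index/lobSTR_ -o sample1 --rg-sample sample1 --rg-lib lib1 --extend $EXTEND"

# Print the commands instead of running them.
echo "$INDEX_CMD"
echo "$ALIGN_CMD"
```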