Lobstr-code

lobSTR: a short tandem repeat profiler for next generation sequencing data

Building a custom lobSTR reference

Overview

lobSTR comes with a pre-built reference and index for humans (hg19). This reference was built using a process described in Willems et al. by running Tandem Repeats Finder on the hg19 reference genome. Additionally, it contains Y-STR and CODIS markers (described here). The resource bundle for hg19 is available on the download page. However, you may want to use a custom set of STR loci in your reference.

To use your own reference, you will need to create a bed file with the loci of interest, build a lobSTR index, and build an STR info file. These steps are described below.

Note: all scripts mentioned below are available in the scripts/ directory of the lobSTR download. In addition, bedtools must be installed and on your PATH. We have tested with version v2.22.1-17-gd6547b3; older versions may not be compatible.

Reference bed file

The first step is to create a bed file with your custom set of STR loci. One way to do this is by running Tandem Repeats Finder on your reference genome of interest. You will need to make a bed file with the following columns present:
  1. Column 1: chromosome
  2. Column 2: start coordinate of the STR
  3. Column 3: end coordinate of the STR
  4. Column 4: period of the STR
  5. Column 5: reference copy number
  6. Column 9: STR score. This score measures the purity of the STR sequence and is based on the suggested Tandem Repeats Finder scoring scheme with match=2, mismatch=-7 and indel=-7. Therefore the maximum possible score for a perfectly pure STR sequence (e.g. ATATATATATAT) is 2*(length of STR region).
  7. Column 15: STR repeat unit
Note that some columns are not used, because lobSTR was originally designed to take input directly from Tandem Repeats Finder. You can put any value in the unused columns, as long as every line has at least 15 columns with the required information in the positions listed above.
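As an illustration of the column layout above, here is a minimal Python sketch that assembles one 15-column line and computes the maximum score for a pure repeat. The helper names (make_bed_row, max_str_score), the "0" placeholder for unused columns, the tab delimiter, and the example coordinates are assumptions made for this example; they are not part of lobSTR.

```python
# Hypothetical helpers, not part of the lobSTR scripts.
FILLER = "0"  # arbitrary value for the unused columns (assumption)


def max_str_score(str_sequence):
    """Maximum score of a perfectly pure STR: 2 * (length of STR region)."""
    return 2 * len(str_sequence)


def make_bed_row(chrom, start, end, period, copy_number, score, motif):
    """Return one tab-separated line with the required fields in place."""
    columns = [FILLER] * 15
    columns[0] = chrom             # column 1: chromosome
    columns[1] = str(start)        # column 2: start coordinate of the STR
    columns[2] = str(end)          # column 3: end coordinate of the STR
    columns[3] = str(period)       # column 4: period of the STR
    columns[4] = str(copy_number)  # column 5: reference copy number
    columns[8] = str(score)        # column 9: STR score
    columns[14] = motif            # column 15: STR repeat unit
    return "\t".join(columns)


# Made-up locus: six copies of "AT"; the coordinates are illustrative only.
row = make_bed_row("chr1", 1000, 1011, 2, 6,
                   max_str_score("ATATATATATAT"), "AT")
print(row)
```

Writing one such line per locus to a file gives a bed file in the layout described above. Real loci will usually score below the maximum, since mismatches and indels are penalized.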

Build the lobSTR index

To build the lobSTR index, create a clean directory where the index will be stored. Then run:
python scripts/lobstr_index.py \
  --str str_bed_file \
  --ref ref_genome.fa \
  --out output_directory
	  
This will create an index with the prefix output_directory/lobSTR_.
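If you prefer to drive this step from Python, the following sketch prepares a clean output directory and assembles the same command. Only the script path and the --str, --ref, and --out options come from the command above; the helper names (prepare_clean_dir, build_index_command) and the temporary demo directory are assumptions for illustration.

```python
import os
import tempfile


def prepare_clean_dir(path):
    """Create the directory if needed; refuse to reuse a non-empty one."""
    os.makedirs(path, exist_ok=True)
    if os.listdir(path):
        raise ValueError("index directory is not empty: " + path)
    return path


def build_index_command(str_bed, ref_fasta, out_dir,
                        script="scripts/lobstr_index.py"):
    """Assemble the lobstr_index.py call as an argument list."""
    return ["python", script,
            "--str", str_bed,
            "--ref", ref_fasta,
            "--out", out_dir]


# Demo with a throwaway directory; substitute your own paths.
demo_dir = prepare_clean_dir(os.path.join(tempfile.mkdtemp(), "lobstr_index"))
cmd = build_index_command("str_bed_file", "ref_genome.fa", demo_dir)
print(" ".join(cmd))
# To actually build the index: subprocess.run(cmd, check=True)
```

The sketch only prints the command; the index is built once the command is actually executed from the lobSTR download directory.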

Generating an STRInfo file

If you are using a custom index, you will also need to generate a file with information on each STR locus, to pass to the --strinfo option of the allelotyper:
python scripts/GetSTRInfo.py str_bed_file ref_genome.fa > strinfo.tab
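The shell redirect above can also be reproduced from Python. This is a minimal sketch: run_to_file is a hypothetical helper, and the demo runs a harmless stand-in command rather than GetSTRInfo.py so that it works without lobSTR installed.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(command, output_path):
    """Run a command and write its standard output to output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


# The command from the text, as an argument list.
strinfo_cmd = ["python", "scripts/GetSTRInfo.py",
               "str_bed_file", "ref_genome.fa"]
# With lobSTR available: run_to_file(strinfo_cmd, "strinfo.tab")

# Stand-in demo so the sketch runs anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tab")
run_to_file([sys.executable, "-c", "print('ok')"], demo_path)
with open(demo_path) as handle:
    demo_output = handle.read().strip()
print(demo_output)
```

Because check=True is set, a failing script raises an exception, which helps avoid passing a partial strinfo.tab on to the allelotyper unnoticed.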