lobSTR Reference and Index
lobSTR aligns reads specifically to places in the reference genome known to contain short tandem repeats. Based on a reference set of STRs, lobSTR builds a special index data structure to allow it to efficiently align reads to STR-containing regions. The structure of the reference BED file is described on the usage page under "Building a custom index". The sections below describe the structure of the lobSTR index created from a reference set of STRs.
Index Structure: Version 3
This section describes the index structure used in lobSTR v3 and up. Here there is a single BWA reference, rather than a different reference for each repeat unit. In the version 2 index, each individual reference STR was treated as a separate entry in the fasta used to build the index; for version 3, multiple STRs that are close to each other in the reference are merged into a single entry to save space. As in version 2, the index consists of a collection of files in a single directory with the same prefix ("lobSTR_" by default).
- BWA index:
- lobSTR_ref.fasta: a fasta file with one entry for each STR-containing region in the genome. Note that each region may contain multiple reference STRs. The ID for each entry consists of the following fields separated by "$"s: entry ID, chromosome, region start, region end.
- lobSTR_mergedref.bed: This file describes the STRs from the original reference STR BED file that were merged into a single reference region. It contains the following columns:
- chromosome
- region start
- region end
- describes each reference STR contained in that entry. For each reference STR, the following fields are separated by "_"s: start, end, repeat unit in the reference, canonical repeat unit. Multiple reference STRs are separated by ";"s.
- lobSTR_ref_map.tab: This file describes the reference STRs contained in each entry of lobSTR_ref.fasta. It contains the following columns:
- entry ID in the fasta file
- all repeat units of any STR contained in the region encompassed by that entry. Multiple repeat units are separated by ";"s.
- same as Column 4 in lobSTR_mergedref.bed
- lobSTR_chromsizes.tab: contains the size of each chromosome in the reference genome
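The field layouts described above can be read with a few lines of Python. The following is a minimal sketch, not part of lobSTR itself: the names `RegionHeader`, `RefSTR`, `parse_region_header` and `parse_str_descriptions` are hypothetical, and the example values are made up for illustration. It assumes the fields appear exactly in the order listed above and that coordinates are integers.

```python
from typing import List, NamedTuple


class RegionHeader(NamedTuple):
    """Fields of a lobSTR_ref.fasta entry ID (version 3 index)."""
    entry_id: str
    chrom: str
    start: int
    end: int


class RefSTR(NamedTuple):
    """One reference STR from the "_"-separated description field."""
    start: int
    end: int
    ref_motif: str
    canonical_motif: str


def parse_region_header(header: str) -> RegionHeader:
    # Drop the leading ">" of a fasta header line, then split on "$".
    entry_id, chrom, start, end = header.lstrip(">").strip().split("$")
    return RegionHeader(entry_id, chrom, int(start), int(end))


def parse_str_descriptions(field: str) -> List[RefSTR]:
    # Multiple reference STRs are separated by ";", fields within one by "_".
    strs = []
    for item in field.strip().split(";"):
        if not item:
            continue  # tolerate a trailing ";"
        start, end, ref_motif, canonical_motif = item.split("_")
        strs.append(RefSTR(int(start), int(end), ref_motif, canonical_motif))
    return strs


# Made-up example values, for illustration only.
print(parse_region_header(">12$chr1$10000$10500"))
print(parse_str_descriptions("10100_10120_AC_AC;10300_10340_GATA_AGAT"))
```

Both functions raise a `ValueError` if a record does not have the expected number of fields, which makes a mismatched index version easy to spot.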
Index Structure: Version 2
This section describes the index structure used for lobSTR v2.0.3+. A new index structure was introduced with lobSTR v3 (see above). The lobSTR index consists of a collection of files within a single directory with the same prefix ("lobSTR_" by default). It contains the following files:
In total, the hg19 reference download consists of 4,115 files: 9 BWA index files for each of the 457 repeat units, plus lobSTR_ref.fasta and lobSTR_strdict.txt.
- A separate BWA-style index for each class of repeat unit. This is made up of the following files for each repeat unit:
These file types are generated using BWA.
lobSTR_ref.fa: This file contains one entry per STR in the reference genome. The sequence for each entry consists of the STR +/- 1kb of flanking region. The header consists of the following fields separated by "$"s:
- ID number of the entry
- Start coordinate of the region in the reference genome
- End coordinate of the region in the reference genome
- Canonical repeat motif
- Copy number in the reference genome
- Repeat motif in the reference genome
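As an illustration, a version 2 header can be split into named fields as sketched below. This is only a sketch under stated assumptions: it assumes the header holds exactly the six fields listed above, in that order, and the names `V2_FIELDS` and `parse_v2_header` as well as the example header are hypothetical. If a real index contains additional fields, the field list would need to be adjusted.

```python
# Assumed field order, taken from the list above (hypothetical names).
V2_FIELDS = [
    "entry_id",
    "start",
    "end",
    "canonical_motif",
    "ref_copy_number",
    "ref_motif",
]


def parse_v2_header(header: str) -> dict:
    # Drop the leading ">" of a fasta header line, then split on "$".
    values = header.lstrip(">").strip().split("$")
    if len(values) != len(V2_FIELDS):
        raise ValueError(
            f"expected {len(V2_FIELDS)} fields, found {len(values)}: {header!r}"
        )
    record = dict(zip(V2_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # The copy number is parsed as a float in case it is fractional.
    record["ref_copy_number"] = float(record["ref_copy_number"])
    return record


# Made-up example header, for illustration only.
print(parse_v2_header(">7$5000$7040$AC$20$CA"))
```

The explicit field-count check is there so that a header with a different layout fails loudly rather than being silently misread.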
lobSTR_strdict.txt: This file contains additional metadata about the index used by lobSTR. The header of the file gives the size of each chromosome in the original reference genome. Below the header, the file lists each repeat unit for which there is a reference index.