lobSTR: a short tandem repeat profiler for next generation sequencing data

Building a custom noise model


In the allelotype step, we use a model of PCR stutter noise to determine the likelihood of each genotype. This model specifies the probability of observing PCR stutter at a given locus based on its sequence properties (motif length, total STR length, GC content, and STR purity), together with the expected step size distribution. The model differs between sequencing technologies (e.g. Illumina vs. IonTorrent) and between library preparation protocols (PCR-free vs. PCR-amplified). In the lobSTR download (see download), we provide a model for Illumina PCR-free data. This default noise model should work reasonably well in most cases. However, if your data come from a different platform or protocol, you may want to generate your own noise model.

Input training data

The model is trained on data from haploid chromosomes, usually the male sex chromosomes. On these chromosomes only a single allele should be present at each locus, so reads supporting any allele other than the modal allele are most likely due to stutter. For training, you therefore need a reasonably high-coverage male genome (or, for another organism, a sample with haploid chromosomes) that you have already run through the lobSTR alignment step (see the usage and wgs best practices pages). With the sorted and indexed bam file in hand, you are ready to build a noise model.
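Before training, it can help to confirm that the bam file and its index are actually in place. The following is a minimal sketch, not part of lobSTR: the function name check_bam_ready is hypothetical, and it assumes the index sits next to the bam file under one of the two usual samtools naming conventions (file.bam.bai or file.bai). It does not verify that the file is sorted.

```shell
# Hypothetical helper, not part of lobSTR: check that a bam file and its index exist.
check_bam_ready() {
  bam="$1"
  # The bam file itself must exist and be non-empty.
  if [ ! -s "$bam" ]; then
    echo "bam not found or empty: $bam" >&2
    return 1
  fi
  # Accept either index naming convention: file.bam.bai or file.bai.
  if [ -s "${bam}.bai" ] || [ -s "${bam%.bam}.bai" ]; then
    echo "ok: $bam is indexed"
    return 0
  fi
  echo "no index found for $bam (try: samtools index $bam)" >&2
  return 1
}
```

For example, check_bam_ready my_male_genome.bam returns a non-zero status and prints a hint if the index is missing.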

Building the model

To build the noise model, use the allelotype command with the parameter --command train. You must specify which chromosomes should be treated as haploid.
allelotype \
  --command train \
  --bam my_male_genome.bam \
  --haploid chrX,chrY \
  --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ \
  --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab \
  --noise_model my_noise_model
This generates the noise model files my_noise_model.stepmodel and my_noise_model.stuttermodel, which can then be used as input to allelotype for genotyping. The structure of these files is described here.
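After training finishes, a quick check that both model files were written can catch a failed run before you start genotyping. This is a minimal sketch under the assumption that both files share the prefix given to --noise_model, as in the example above; the function name check_noise_model is hypothetical and not part of lobSTR.

```shell
# Hypothetical helper, not part of lobSTR: confirm that both noise model files
# exist and are non-empty for a given prefix.
check_noise_model() {
  prefix="$1"
  for ext in stepmodel stuttermodel; do
    if [ ! -s "${prefix}.${ext}" ]; then
      echo "missing or empty: ${prefix}.${ext}" >&2
      return 1
    fi
  done
  echo "ok: ${prefix}.stepmodel and ${prefix}.stuttermodel found"
}
```

For the training command above, this would be called as check_noise_model my_noise_model.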