In the allelotype step, we use a model of PCR stutter noise to determine the likelihood of each genotype. The model specifies the probability of observing PCR stutter at a given locus based on sequence properties (motif length, total STR length, GC content, and STR purity), together with the expected distribution of stutter step sizes. The model differs between sequencing technologies (e.g. Illumina vs. IonTorrent) and between protocols (PCR-free vs. not). The lobSTR download (see download) includes a model for Illumina PCR-free data. This default noise model should work reasonably well in most cases. However, if your data come from a different platform or protocol, you may want to generate your own noise model.
Input training data
The model is trained on data from haploid chromosomes, usually the sex chromosomes of a male sample. On these chromosomes only a single allele should be present at each locus, so reads supporting any allele other than the modal allele are likely due to stutter. For training you therefore need a reasonably high coverage male genome (or, for another organism, a sample with haploid chromosomes) that you have already run through the lobSTR alignment step (see usage and wgs best practices pages). With the sorted and indexed BAM file in hand, you are ready to build a noise model.
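The reasoning above can be illustrated with a minimal Python sketch. This is not lobSTR's training code: the function name estimate_stutter, the input layout (a dict mapping each haploid locus to the allele lengths seen in its reads), and the pooling of all loci into one estimate are assumptions made for illustration only. The real model also conditions on sequence properties such as motif length and GC content, which this sketch ignores.

```python
from collections import Counter


def estimate_stutter(reads_by_locus):
    """Estimate a pooled stutter rate and step-size distribution.

    reads_by_locus: dict mapping a haploid locus name to a list of allele
    lengths (for example, in repeat units), one entry per read.
    Hypothetical input layout, not a lobSTR file format.
    """
    total_reads = 0
    stutter_reads = 0
    step_counts = Counter()

    for alleles in reads_by_locus.values():
        if not alleles:
            continue  # no coverage at this locus
        counts = Counter(alleles)
        # On a haploid chromosome the modal allele is taken as the true one.
        # Ties are broken toward the shorter allele to keep this deterministic.
        modal = max(counts, key=lambda a: (counts[a], -a))
        for allele, n in counts.items():
            total_reads += n
            if allele != modal:
                # Any read off the modal allele is counted as stutter.
                stutter_reads += n
                step_counts[allele - modal] += n

    stutter_prob = stutter_reads / total_reads if total_reads else 0.0
    step_dist = (
        {step: n / stutter_reads for step, n in step_counts.items()}
        if stutter_reads
        else {}
    )
    return stutter_prob, step_dist


# Toy example with two made-up chrY loci.
example = {
    "chrY:100": [12, 12, 12, 11, 12],
    "chrY:200": [8, 8, 9, 8, 8, 7, 8, 8, 8, 8],
}
prob, steps = estimate_stutter(example)
print(prob, steps)
```

In the toy example, 3 of the 15 reads fall off the modal allele, so the pooled stutter rate is 0.2, with two of those reads one unit shorter and one read one unit longer.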
Building the model
To build the noise model, use the allelotype command with the parameter --command train. You must specify which chromosomes should be treated as haploid.
Training generates two noise model files, my_noise_model.stepmodel and my_noise_model.stuttermodel, which can then be used as input to allelotype for genotyping. The structure of these files is described here.