This page describes how to run lobSTR on multiple samples at once. For more specific use cases or advice on setting specific parameters, see the documentation page.
The figure below gives an overview of the steps required to run lobSTR on multiple samples. The alignment step (lobSTR) is run separately on each sample. Then the resulting sorted and indexed BAMs are all used as input to a single run of allelotype, which generates a single VCF file containing STR calls across all samples.
For multi-sample calling, the alignment (lobSTR) step is run separately on each sample. This step is described in the best practices for whole genome and whole exome sequencing page. Importantly, you must set the read group information correctly for each sample: downstream, the allelotype step assigns reads to samples according to the sample annotated in the read group (--rg-sample). Note that if two different lobSTR runs have the same --rg-sample but different --rg-lib values, their reads are treated as coming from the same sample.
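To check read group settings before running the alignments, it can help to list which runs will be merged downstream. The following is a minimal Python sketch, not part of lobSTR. The run records, BAM file names, and the helper group_runs_by_sample are hypothetical; only the --rg-sample and --rg-lib options come from this page.

```python
from collections import defaultdict


def group_runs_by_sample(runs):
    """Map each --rg-sample value to the (bam, rg_lib) pairs that share it.

    Runs sharing an --rg-sample value are treated as one sample downstream,
    regardless of their --rg-lib value.
    """
    groups = defaultdict(list)
    for run in runs:
        groups[run["rg_sample"]].append((run["bam"], run["rg_lib"]))
    return dict(groups)


# Hypothetical planned lobSTR runs (file names are illustrative only).
runs = [
    {"bam": "sample1_libA.sorted.bam", "rg_sample": "sample1", "rg_lib": "libA"},
    {"bam": "sample1_libB.sorted.bam", "rg_sample": "sample1", "rg_lib": "libB"},
    {"bam": "sample2.sorted.bam", "rg_sample": "sample2", "rg_lib": "libA"},
]

groups = group_runs_by_sample(runs)
for sample, members in groups.items():
    print(sample, members)
```

Here the two sample1 runs use different libraries but end up in a single group, which mirrors how their reads would be pooled under one sample.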
The following are example alignment commands assuming you are genotyping 3 samples.
The allelotype step takes in BAMs from one or more samples generated in the alignment step. It calls STR genotypes across all samples at once to improve calling accuracy (see how it works below). The output is a single VCF file with STR genotypes across all samples.
Running allelotype on multiple samples
This creates the output files my_output.vcf and my_output.allelotype.stats. The format of the VCF output is described on the file formats page. The contents of the stats file are described here.
How multi-sample calling works
Multi-sample calling processes one STR locus at a time. For each locus, it reads in all sequence reads that lobSTR aligned to the locus, across all samples, to ascertain every allele for which there is any evidence. Once all alleles are ascertained, allelotype calculates the maximum likelihood genotype for each sample by scoring all pairs of possible alleles using a model of noise expected at STR loci.
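The two stages described above (pooling alleles across samples, then scoring allele pairs per sample) can be illustrated with a toy Python sketch. This is not lobSTR's actual noise model: the stutter probability, the one-unit error assumption, the likelihood floor, and the function names are all assumptions made for illustration. Alleles and reads are represented simply as repeat counts.

```python
import math
from itertools import combinations_with_replacement


def read_likelihood(read_len, allele, stutter=0.05):
    """Toy P(read | allele): exact match with probability 1 - stutter,
    off by one repeat unit with probability stutter / 2 per direction."""
    diff = abs(read_len - allele)
    if diff == 0:
        return 1.0 - stutter
    if diff == 1:
        return stutter / 2.0
    return 1e-6  # small floor so distant reads are penalised, not impossible


def call_genotypes(reads_by_sample, stutter=0.05):
    """Return (candidate alleles, {sample: best allele pair})."""
    # Stage 1: ascertain candidate alleles from reads pooled across all samples.
    alleles = sorted({r for reads in reads_by_sample.values() for r in reads})
    calls = {}
    # Stage 2: for each sample, score every unordered pair of candidate alleles.
    for sample, reads in reads_by_sample.items():
        best, best_ll = None, -math.inf
        for a, b in combinations_with_replacement(alleles, 2):
            # Each read is assumed equally likely to come from either allele.
            ll = sum(
                math.log(0.5 * read_likelihood(r, a, stutter)
                         + 0.5 * read_likelihood(r, b, stutter))
                for r in reads
            )
            if ll > best_ll:
                best, best_ll = (a, b), ll
        calls[sample] = best
    return alleles, calls


# Hypothetical reads, given as observed repeat counts per sample.
reads_by_sample = {
    "sample1": [10, 10, 10, 12, 12, 12],
    "sample2": [10] * 8 + [11],
    "sample3": [14, 14, 14, 14],
}
alleles, calls = call_genotypes(reads_by_sample)
print(alleles)
print(calls)
```

In this example sample2 has a single read with 11 repeats among eight reads with 10. Under the toy model that read is better explained as noise than as a second allele, so the sample is called homozygous, while all samples are scored against the same pooled allele list.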
A big advantage of using multi-sample calling for STR genotypes is that STRs are highly multi-allelic, and unlike SNPs, we do not know beforehand which alleles we will see in the sample. Therefore, if you were to run allelotype separately on each sample, the resulting VCF files would have different alternate allele annotations and you would not be able to easily merge VCFs across callsets.
To make allelotype faster, you can run it separately on each chromosome and analyze multiple chromosomes in parallel. For example, use the argument --chrom chr1 to analyze only chromosome 1.
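A per-chromosome run can be scripted by generating one command per chromosome. The Python sketch below only appends the --chrom argument mentioned above; the remaining allelotype arguments are deliberately left as a placeholder to fill in from your own all-chromosome command. The chromosome names assume a human reference with "chr"-prefixed names, and the helper name is hypothetical.

```python
def per_chromosome_commands(base_args, chroms):
    """Return one argument list per chromosome, each ending in --chrom <name>."""
    return [list(base_args) + ["--chrom", chrom] for chrom in chroms]


# Placeholder: extend this list with the arguments of your usual allelotype run.
base_args = ["allelotype"]

# Assumed chromosome naming (chr1..chr22, chrX, chrY).
chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

commands = per_chromosome_commands(base_args, chroms)
print(" ".join(commands[0]))
```

Each argument list can then be passed to a job scheduler or to subprocess.run. It is probably wise to give each job its own output name so that parallel runs do not write to the same files.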
Tips on setting allelotype parameters can be found here.
Now with a list of STR variant calls, you might be interested in: