lobSTR: a short tandem repeat profiler for next generation sequencing data


Filtering lobSTR VCFs


This page describes how to use lobSTR_filter_vcf.py to filter VCF files generated by lobSTR based on a number of metrics. It then describes our recommended filters for multi-sample calling. Note that this script only works with VCFs generated by lobSTR v4.0.0 or later.

Using lobSTR_filter_vcf.py

If you followed the instructions on the install page, lobSTR_filter_vcf.py should be installed in $PREFIX/share/lobSTR/scripts/lobSTR_filter_vcf.py. This tool applies locus-level and call-level filters, which are annotated in the output VCF file.

Basic usage

To run lobSTR_filter_vcf.py:

    lobSTR_filter_vcf.py --vcf VCFFILE > OUTFILE

where VCFFILE is a VCF file generated by lobSTR and OUTFILE is an output VCF file. By default, no filters are applied. A number of options enable specific locus-level and call-level filters. All locus-level filters are of the form --loc-XX and all call-level filters are of the form --call-XX. For example:

    lobSTR_filter_vcf.py --vcf VCFFILE --loc-max-ref-length 80 --call-cov 10 > OUTFILE

will filter all loci with a reference length greater than 80 bp and all calls with less than 10x coverage.
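
When the script is run from a pipeline rather than by hand, the same command line can be assembled programmatically. The following is a minimal sketch, not part of lobSTR: the helper names build_filter_command and run_filter are hypothetical, and it assumes lobSTR_filter_vcf.py is on your PATH. Only the --vcf, --loc-XX and --call-XX options described above are used.

```python
import shlex
import subprocess


def build_filter_command(vcf_path, loc_filters=None, call_filters=None,
                         script="lobSTR_filter_vcf.py"):
    """Build the argument list for one lobSTR_filter_vcf.py run.

    loc_filters and call_filters map the XX part of --loc-XX / --call-XX
    to a value, e.g. {"max-ref-length": 80}. (Hypothetical helper.)
    """
    cmd = [script, "--vcf", vcf_path]
    for name, value in (loc_filters or {}).items():
        cmd += [f"--loc-{name}", str(value)]
    for name, value in (call_filters or {}).items():
        cmd += [f"--call-{name}", str(value)]
    return cmd


def run_filter(cmd, out_path):
    """Run the command and write the filtered VCF (stdout) to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


# Reproduce the example command from the text above.
cmd = build_filter_command(
    "VCFFILE",
    loc_filters={"max-ref-length": 80},
    call_filters={"cov": 10},
)
print(shlex.join(cmd))
```

Calling run_filter(cmd, "OUTFILE") would then be the equivalent of redirecting the script's output with "> OUTFILE".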

Output files

The output file from this script is in standard VCF format. Loci remaining after filtering will have "PASS" in the FILTER column of the VCF. Non-passing loci will have a comma-separated list of filters that failed. If it was not present, an "FT" field will be added to the FORMAT field. This is a string field that will say:

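To get a quick overview of a filtered file, the FILTER column can be tallied per filter name. The sketch below is only an illustration: the function name summarize_filters and the filter names FilterA and FilterB in the example records are made up, and the code assumes a tab-delimited VCF whose seventh column is FILTER. Because this page describes the list as comma-separated while the VCF specification uses semicolons, the sketch splits on both.

```python
import re
from collections import Counter


def summarize_filters(vcf_lines):
    """Count how many loci carry each FILTER value (hypothetical helper)."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        filter_value = fields[6]  # FILTER is the 7th VCF column
        # Accept either "," or ";" as the separator between filter names.
        for name in re.split(r"[;,]", filter_value):
            counts[name] += 1
    return counts


# Made-up records; FilterA and FilterB are placeholder filter names.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
    "chr1\t100\t.\tACAC\t.\t.\tPASS\t.\tGT\t0/0\n",
    "chr1\t200\t.\tATAT\t.\t.\tFilterA,FilterB\t.\tGT\t0/0\n",
    "chr2\t300\t.\tGAGA\t.\t.\tFilterA\t.\tGT\t0/0\n",
]
counts = summarize_filters(example)
for name, n in sorted(counts.items()):
    print(name, n)
```

For a real file, pass an open file handle instead of the example list, for instance summarize_filters(open("OUTFILE")).
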
All command line arguments

The table below lists the filtering options that are currently available. If you would like additional filter options added, feel free to contact us or submit a pull request on GitHub.

Option                Description                                                                 Type    Default
--vcf                 (REQUIRED) Input VCF file (generated by allelotype)                         string  -
--loc-log-score       Min mean -log10(1-Q) cutoff to include a locus                              float   0.0
--loc-cov             Min mean coverage to include a locus                                        int     0
--loc-max-ref-length  Max reference length of a locus to include                                  int     10000
--loc-call-rate       Min call rate to include a locus                                            float   0.0
--call-dist-end       Max mean absolute difference in distance from read ends to include a call   float   100.0
--call-log-score      Min mean -log10(1-Q) cutoff to include a call                               float   0.0
--call-cov            Min coverage to include a call                                              int     0
--ignore-samples      Ignore these samples when applying filters (file with one sample per line)  string  -
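
The two log-score options are stated in terms of -log10(1-Q). As a rough guide to choosing a cutoff, the sketch below evaluates that expression for a few values. It assumes Q is a quality score between 0 and 1 and that "mean" is a plain arithmetic mean over the scores; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the script.

```python
import math


def log_score(q):
    """Return -log10(1 - q) for a quality score q assumed to lie in [0, 1]."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if q == 1.0:
        return math.inf  # 1 - q is zero, so the score is unbounded
    return -math.log10(1.0 - q)


def mean_log_score(qs):
    """Arithmetic mean of the log scores (an assumed definition of 'mean')."""
    if not qs:
        raise ValueError("need at least one quality score")
    return sum(log_score(q) for q in qs) / len(qs)


# Q = 0.9 gives a score near 1, Q = 0.99 near 2, Q = 0.999 near 3.
for q in (0.9, 0.99, 0.999):
    print(f"Q={q}: {log_score(q):.2f}")
```

Under these assumptions, a cutoff of 1.0 corresponds to Q of about 0.9, and each additional unit corresponds to one more "nine" in Q.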

Recommended filters

Based on comparison with capillary electrophoresis data, we recommend the following quality filters for high-coverage whole genome sequencing data:

Alternative filters may be appropriate for other types of datasets.