Lobstr-code

lobSTR: a short tandem repeat profiler for next generation sequencing data

home
download
install
usage
documentation
faq
changelog
genotyping y-str/codis
validation sets
contact-us

Genotyping Y-STRs and CODIS markers

Overview

One of the most common questions we receive is about how to use lobSTR to genotype Y-STRs and the CODIS set from sequencing data. Both are useful for identification in forensics settings, and Y-STRs are also used both in the genetic genealogy community and population genetics. Here we give a short tutorial on how to go from the lobSTR VCF output to a list of Y-STR and CODIS genotypes in standard nomenclature. As an example we will use the Venter genome.

Preliminaries

Before starting this tutorial, you should have:

The examples below use reads from the Venter Genome. These can be downloaded from TraceDB. For the example below, we combined all traces to a single file and trimmed the first 50bp of each read due to the high error rate at the beginning of Sanger reads. The processed reads can be downloaded from https://s3.amazonaws.com/mgymrek_data/venter_reads_19621.fa.gz

Running lobSTR and allelotype

The first step is to use lobSTR to generate a VCF file. The full steps for running lobSTR are available on the usage and best practices for WGS/WES pages. Follow those steps to generate a lobSTR VCF file. The steps for doing this on the Venter genome are below. Note that we use the parameters --fft-window-size 24 --fft-window-step 12 as described in Best practices for alignment to bias lobSTR toward focusing on the longer, more complex Y-STRs rather than on shorter repeats such as homopolymers. Also note that we used the noise model for Illumina PCR free data, even though this is Sanger data. With such low coverage it won't make a big difference, and we didn't have a high enough coverage male Sanger sequencing dataset to train with.
lobSTR \
  --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ \
  -f venter_reads_19621.fa \
  --rg-sample venter --rg-lib venter \
  --fft-window-size 24 --fft-window-step 12 \
  --out venter

samtools sort venter.aligned.bam venter.sorted
samtools index venter.sorted.bam

allelotype \
  --index-prefix hg19_v3.0.2/lobstr_v3.0.2_hg19_ref/lobSTR_ \
  --strinfo hg19_v3.0.2/lobstr_v3.0.2_hg19_strinfo.tab \
  --command classify \
  --noise_model models/illumina_v3.pcrfree \
  --bam venter.sorted.bam \
  --min-border 5 --min-bp-before-indel 7 --maximal-end-match 15 --min-read-end-match 5 \
  --out venter
	  
This will output a file venter.vcf with STR genotype calls.

Extracting Y-STRs and CODIS markers and converting to standard nomenclature

The sources of Y-STR nomenclature are described in Gymrek et al 2013. CODIS nomenclature follows that on the NIST STR fact sheets. Marker nomenclature is also described in a table on the Y-STR/CODIS page.

lobSTR reports alleles as the number of base pairs length difference from reference (GB field of the VCF). However standardized nomenclature for these markers is given in the number of repeat units. Therefore, it is necessary to convert lobSTR results to standard nomenclature before comparing to external datasets. Y-STR and CODIS markers are included in the lobSTR reference bed file and were extracted to separate files for convenience.

You can use the intersectBed tool from bedtools to extract these markers from your VCF and convert to standard notation:

intersectBed -a venter.vcf -b lobSTR_ystr_hg19.bed -wa -wb | \
   cut -f 1,2,10,14- | sed 's/:/\t/g' | cut -f 1,2,4,7,11- | sed 's/\//\t/' | \
   cut -f 4 --complement | awk '{print $0 "\t" $6+$4/$5}'
	  
This is a bit of a long command. Here's what it does: Similarly for the CODIS markers, but looking at both alleles reported:
intersectBed -a venter.vcf -b lobSTR_codis_markers_hg19.bed -wa -wb | \
   cut -f 1,2,10,14- | sed 's/:/\t/g' | cut -f 1,2,4,7,11- | sed 's/\//\t/' | \
   awk '{print $0 "\t" $7+$4/$6 "\t" $7+$5/$6}'
	  

Markers with multiple forms

Several Y-STR markers, such as DYS464, DYS385, DYS459, and others, are present at multiple locations on the Y chromosome. That means that more than one allele can be present for these markers. Alleles are always reported from shortest to longest. For example, if at DYS385 the alleles 11 and 14 are present, then you would set DYS385a=11 and DYS385b=14.

These markers are especially tricky to type from sequencing data. If only a single allele is called, you cannot be sure that there is only a single allele present, as the other allele might be missed. Therefore, we standardly discard markers with multiple forms unless we see two alleles with good support (high coverage). Generally we discard DYS464, which can have 6+ alleles, altogether. Often these markers require manual inspection.

Markers with known issues

Example output from the Venter genome

The table below shows the results of running the above commands for both the Y-STR and CODIS loci on the Venter genome. The table gives the resulting allele for each marker that was genotyped, along with notes for some tricky markers. Note that the syntax for supporting reads is "lobSTR allele | read count".
MarkerSupporting readsRef alleleGenotypeYsearch GenotypeNotes
DYS640 0|2 9 9 11 See known issues
DYS570 0|3 17 17 17 -
DYS455 0|1 11 11 11 -
DYS520 4|1 20 21 20 This marker does not match Ysearch. It could be due to stutter, an annotation difference, or mismatch with the Venter entry on Ysearch.
DYS458 4|3 16 17 17 -
DYS450 0|3 8 8 8 -
DYS449.1 4|2 13 14 30 (DYS449.1 + DYS449.2) Discard, since we can only make a call by adding DYS449.1 + DYS449.2
DYS454 0|4 11 11 11 -
DYS481 0|3 22 22 22 -
DYS531 3|1;4|1 11 12 12 -
DYS590 5|2 8 9 9 -
DYS568 0|3 11 11 11 -
DYS391 -4|1 11 10 10 -
DYS635 0|1 23 23 - -
DYS439 -4|1 13 12 12 -
DYS389B.2 4|1 12 13 29 (DYS389B.1 + DYS389B.2) Discard, since we can only make a call by adding DYS389B.1 + DYS389B.2
DYS389I 4|1 12 13 13 -
DYS388 0|2 12 12 12 -
DYS442 0|1 17 17 12 See known issues
DYS438 10|1 10 12 12 -
DYS495 3|1 15 16 - -
DYS436 0|3 12 12 12 -
DYS413a/b 0|3 23 23 23,23 There are three reads mapped to this locus supporting that there are 23 copies. It could be that DYS413a and DYS413b are both “23” but we don’t have enough information to know that, so we’ll only report DYS413a as 23.
DYS641 0|1 10 10 10 -
DYS472 -1|1;0|3 8 8 8 -
DYS565 0|2 12 12 12 -
DYS390 -4|1 24 23 23 -
DYS511 4|1 9 10 10 See known issues
DYS492 3|2 12 13 13 -
DYS638 0|1 11 11 - -
DYS534 4|1 15 16 16 -
DYS617 0|2 12 12 12 -
DYS426 0|1 12 12 12 -
DYS537 0|1 10 10 10 -
YCAIIa/b -8|3;0|4 23 19,23 19,23 There are three reads supporting the allele 19 and 4 reads supporting the allele 23. Therefore we can report YCAIIa=19 and YCAIIb=23.
DYS425 6|2;12|1 12 14 12 See known issue.
DYS395S1a/b 0|2;3|1 15 15,16 15,16 There are two reads supporting the allele 15 and one supporting the allele 16. So we report DYS395S1a=15 and DYS395S1b=16.
DYS385a/b -12|1;-11|1;0|1 14 11,14 11,14 There are two unit-alleles supported: -12 (11) and 0 (14). So we report DYS385a=11, DYS385b=14
DYS461 0|1 12 12 - -
DYS462 0|1 11 11 - -
DYS494 -3|2 10 9 - -
DYS549 -4|1 13 12 - -
DYS594 0|1 10 10 10 -
DYS485 0|2 16 16 - -
DYS714 -10|1 27 25 - -
DYS578 0|2 9 9 9 -
DYS556 0|2 11 11 - -
DYS392 0|2 13 13 13 -
DYS557 0|1 16 16 16 -
DYS459a/b -4|1 10 9 9,10 There is only one read mapped to this locus supporting that there are 9 copies. Since we only have one read, we can’t tell if there are multiple alleles.
Below are the results for the CODIS markers in Venter. Note, unlike the Y-STRs these can be heterozygous. We list below which alleles have evidence in the Venter reads. These will be much less accurate than the Y-STR markers since they require higher coverage to get right. Additionally, we have fewer ground truth datasets for these so there are likely to be some annotation discrepancies with standard nomenclature. If a non-unit and a unit allele were both reported and are close together, only the unit allele is kept.
MarkerSupporting readsRef alleleAlleles with evidenceNotes
D13S317 0|1 11 11 -
PentaE 35|1;37|1 5 12 -
D18S51 -16|2 18 14 -
TPOX 0|4 8 8 -
D3S1358 -1|1 16 15.75 This is likely to be 16, rather than the non-unit allele 15.75.
FGA -4|1;16|1 22 21,26 -
CSF1PO -8|5;-7|1 13 11 -
D7S820 -12|2 13 10 -
D8S1179 -4|1 13 12 -