This page describes several STR call sets generated using capillary electrophoresis, the gold standard in STR genotyping. We use these calls as ground truth for evaluating lobSTR genotype results.
The comparisons below use lobSTR calls from high-coverage genomes of 184 samples sequenced by the Simons Genome Diversity Project (SGDP).
Capillary electrophoresis Y-STR genotypes are reported as the number of repeat units. To convert lobSTR calls to this format, see the tutorial on calling Y-STRs using lobSTR.
The lobSTR reference has been carefully designed to follow standard nomenclature at commonly genotyped Y-STRs. Three markers showed annotation discrepancies between lobSTR and HGDP. For markers DYS594 and DYS481, subtract 1 repeat unit from the lobSTR genotypes. The nomenclature for DYS640 could not be reconciled between the two call sets, so we recommend removing this marker from analysis.
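The marker adjustments described above can be sketched in a few lines of Python. This is only an illustration: the function name `harmonize_ystr` and the input layout (a dictionary mapping marker name to repeat count) are assumptions, not part of lobSTR or its output format.

```python
# Hypothetical helper: bring lobSTR Y-STR repeat counts in line with the
# HGDP nomenclature, following the three marker notes above.

# Markers whose lobSTR repeat count must be shifted (in repeat units).
REPEAT_OFFSETS = {"DYS594": -1, "DYS481": -1}

# Markers whose nomenclature could not be reconciled and are dropped.
EXCLUDED_MARKERS = {"DYS640"}


def harmonize_ystr(calls):
    """Return a new dict of marker -> repeat count with corrections applied.

    `calls` is assumed to map marker names to lobSTR genotypes that have
    already been converted to a number of repeat units.
    """
    harmonized = {}
    for marker, repeats in calls.items():
        if marker in EXCLUDED_MARKERS:
            continue  # skip markers recommended for removal
        harmonized[marker] = repeats + REPEAT_OFFSETS.get(marker, 0)
    return harmonized
```

Markers not listed in either table pass through unchanged, so the same helper can be applied to every sample before comparing against the capillary calls.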
We compared lobSTR calls for the SGDP data to the HGDP Y-STR genotypes. 57 samples and 39 markers overlapped between the two datasets. The call sets showed 99.3% concordance. The following files contain comparison results:
The Marshfield sets are reported as PCR product sizes and must be converted to lobSTR format before comparison. Here we describe the conversion process for the Rosenberg dataset, but it should apply to other Marshfield callsets.
Determining the genomic location of each marker
The file Pemberton_AdditionalFile1_11242009, available from the Rosenberg website, provides PCR primers for the Marshfield markers. We used these primers as input to UCSC's in silico PCR tool to determine the genomic location of each marker, then overlapped these locations with the lobSTR reference set for hg19. Only markers with a single listed repeat motif that matched the motif in the lobSTR reference were used for analysis.
Converting calls to lobSTR format
Calls were converted from PCR product size to the number of base pairs difference from the reference using the formula (allele_product_size - ref_product_size), where allele_product_size is the allele listed in the Rosenberg data and ref_product_size is the PCR product size in hg19. All genotypes listed as "-9" (no call) were removed. In some cases the hg19 product size did not match the size annotated in the Pemberton data, and in others in silico PCR returned no results. When the two sizes disagreed, we used the hg19 product size. When no product size was available through in silico PCR, we used the product size annotated by Pemberton. The file mismatched_product_sizes.tab lists the loci with product size discrepancies.
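As a rough sketch of this conversion, the Python below applies the formula to one diploid genotype. The function names and argument layout are hypothetical, and it assumes allele sizes have already been parsed as integers, with `None` standing for a locus where in silico PCR returned no product.

```python
NO_CALL = -9  # value used in the Rosenberg data for a missing genotype


def reference_product_size(hg19_size, pemberton_size):
    """Pick the reference PCR product size for a marker.

    The hg19 in silico PCR size is preferred whenever it exists; the
    Pemberton annotation is used only when in silico PCR returned nothing.
    """
    return hg19_size if hg19_size is not None else pemberton_size


def convert_genotype(genotype, hg19_size=None, pemberton_size=None):
    """Convert a pair of PCR product sizes to bp differences from reference.

    Returns None for no-call genotypes so the caller can drop them.
    """
    if any(allele == NO_CALL for allele in genotype):
        return None
    ref_size = reference_product_size(hg19_size, pemberton_size)
    if ref_size is None:
        raise ValueError("no reference product size available for this marker")
    # allele_product_size - ref_product_size, applied to each allele
    return tuple(allele - ref_size for allele in genotype)
```

For example, a genotype of (152, 156) at a marker with an hg19 product size of 148 becomes (4, 8), regardless of the Pemberton annotation.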
Many markers showed consistent differences of one or more repeat units between lobSTR and the capillary calls, likely due to annotation differences, and need to be corrected before performing comparisons. The file marshfield_marker_corrections.tab gives the genomic location, name, and correction that must be applied to lobSTR genotypes before comparison.
We compared lobSTR calls for the SGDP data to the Rosenberg Marshfield genotypes. 105 samples and 481 markers overlapped between the two datasets after filtering for calls with a minimum coverage of 5x and a minimum lobSTR quality score of 0.9. Overall, the call sets showed 93% genotype concordance. The plot below shows a comparison of genotype calls (r² = 0.95). Genotypes are reported as the mean length difference in base pairs from the reference for the two alleles at each diploid locus.
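The filtering and summary statistics in this comparison can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that r² is the squared Pearson correlation between the per-locus mean length differences of the two call sets; the page does not state exactly how it was computed.

```python
from statistics import mean

MIN_COVERAGE = 5    # minimum coverage (5x) stated above
MIN_QUALITY = 0.9   # minimum lobSTR quality score stated above


def passes_filters(coverage, quality):
    """Keep a lobSTR call only if it meets both thresholds."""
    return coverage >= MIN_COVERAGE and quality >= MIN_QUALITY


def mean_length_difference(genotype):
    """Mean bp difference from reference over the two alleles of a genotype."""
    return mean(genotype)


def r_squared(xs, ys):
    """Squared Pearson correlation of two equal-length sequences (assumed metric)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)
```

In use, one would filter the lobSTR calls, compute `mean_length_difference` for each remaining locus in both call sets, and pass the two resulting lists to `r_squared`.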
The following files contain comparison results: