Suggested track name:  Paired end accessibility

Description:

This track shows which regions of the genome are more or less accessible to next 
generation sequencing methods using short, paired end reads.  It summarizes whole 
genome sequencing data from Phase 1 of the 1000 Genomes Project.  Regions are 
reported at two levels of stringency:  "pilot" regions (see below) cover 94% of 
non-N bases in the genome and "strict" regions cover 72% of non-N bases.  Every 
site that meets the "strict" criteria also passes the "pilot" criteria.  

This track will be useful (a) for comparing accessibility using current technologies 
to accessibility in the 1000 Genomes Pilot Project, and (b) for population genetic 
analyses (such as estimates of mutation rate) that must focus on genomic regions 
with very low false positive and false negative rates.  By contrast, SNP calls from the 
1000 Genomes Project are filtered using the VQSR method (implemented in GATK) 
without regard to the thresholds applied here.  VQSR evaluates only those sites 
that show some evidence of polymorphism;  it says nothing about the remaining 
sites.

Methods:

The total depth of mapped sequence reads, the average mapping quality score 
and the fraction of reads with mapping quality zero (meaning that the read maps 
equally well to more than one location in the genome) are tabulated from 1103 
.bam files in the 1000 Genomes Phase 1 low coverage data release:  
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/phase1.alignment.index.  
This combines low coverage whole genome sequence information from 1094 
individuals, giving a genome average total depth of coverage of 5132 reads.  
Both "pilot" and "strict" tracks are .bed file conversions of the "pass" regions 
from the .fasta mask files in the directory  
ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/analysis_results/supporting/accessible_genome_masks.  
See the README file in that directory for details.
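
As an illustration of the mask-to-.bed conversion described above, the following 
Python sketch turns the mask string for one chromosome into 0-based, half-open 
.bed intervals.  It is a minimal example under stated assumptions, not the script 
used to build this track.  The function name mask_to_bed is hypothetical, and the 
use of the letter "P" to mark passing bases is an assumption;  the README in the 
mask directory is the authority on the actual character codes.

```python
from itertools import groupby


def mask_to_bed(chrom, mask, pass_char="P"):
    """Yield (chrom, start, end) for each maximal run of pass_char in mask.

    Coordinates are 0-based and half-open, as in the .bed format.
    pass_char="P" is an assumption about how the mask encodes passing bases.
    """
    pos = 0
    for char, run in groupby(mask):
        length = sum(1 for _ in run)
        if char == pass_char:
            yield (chrom, pos, pos + length)
        pos += length


if __name__ == "__main__":
    # Toy mask string;  real masks span a whole chromosome.
    for chrom, start, end in mask_to_bed("chr1", "NNPPPLLPPQP"):
        print(f"{chrom}\t{start}\t{end}")
```

Runs of any other character are skipped, so adjacent passing bases collapse into 
a single interval and every non-passing base breaks the interval.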

The "pilot" criteria require a depth of coverage between 2566 and 10264 inclusive 
(between one half and twice the average depth) and that no more than 20% of 
covering reads have mapping quality zero.  These are equivalent to the criteria 
used for analyses in the 1000 Genomes Pilot paper (2010).  The "strict" criteria 
require a depth of coverage between 2566 and 7698 inclusive, no more than 0.1% 
of reads with mapping quality zero, and an average mapping quality of 56 or greater.  
This definition is quite stringent and restricts attention to the most uniquely 
mappable regions of the genome.  In regions that pass the strict criteria, only 
~2% of sites called in an initial analysis are rejected as likely false positives 
by VQSR.  Since approximately 
one half of 1000 Genomes Project individuals are males, the depth of coverage 
is generally lower on the X chromosome.  Coverage thresholds on the X are 
adjusted by a factor of 3/4 and on Y by a factor of 1/2.  The "pilot" criteria were 
not evaluated for the Y chromosome.
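
The thresholds above can be restated as a short Python sketch.  It is offered 
only as an illustration of the criteria in the text, not as the code used to 
produce the masks.  The function names and argument names are hypothetical.  The 
sketch also assumes that the X and Y factors scale both depth bounds exactly, 
with no rounding;  the text does not say how the scaled bounds were rounded.  
Note that the upper "strict" bound of 7698 is 1.5 times the average depth of 5132.

```python
from fractions import Fraction

MEAN_DEPTH = 5132  # genome-average total depth given in the text

# Depth-threshold scaling factors for the sex chromosomes, as given in the text.
CHROM_SCALE = {"X": Fraction(3, 4), "Y": Fraction(1, 2)}


def depth_bounds(chrom, low_mult, high_mult):
    """Return inclusive (low, high) depth bounds for a chromosome.

    Assumption: the chromosome factor scales both bounds without rounding.
    """
    scale = CHROM_SCALE.get(chrom, Fraction(1))
    return (MEAN_DEPTH * low_mult * scale, MEAN_DEPTH * high_mult * scale)


def passes_pilot(chrom, depth, mq0_fraction):
    """Pilot criteria: depth within [0.5x, 2x] of average, at most 20% MQ0 reads."""
    if chrom == "Y":
        return None  # the pilot criteria were not evaluated for Y
    low, high = depth_bounds(chrom, Fraction(1, 2), 2)
    return low <= depth <= high and mq0_fraction <= 0.20


def passes_strict(chrom, depth, mq0_fraction, mean_mapq):
    """Strict criteria: depth within [0.5x, 1.5x], at most 0.1% MQ0, mean MQ >= 56."""
    low, high = depth_bounds(chrom, Fraction(1, 2), Fraction(3, 2))
    return low <= depth <= high and mq0_fraction <= 0.001 and mean_mapq >= 56


if __name__ == "__main__":
    print(depth_bounds("1", Fraction(1, 2), 2))
    print(passes_pilot("1", 9000, 0.10), passes_strict("1", 9000, 0.0, 60))
```

Because the strict depth range lies inside the pilot range and the strict limit 
on mapping quality zero reads is tighter, any autosomal or X site that passes 
passes_strict also passes passes_pilot, matching the nesting described above.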

1000 Genomes Phase 1 sequencing was done between 2008 and 2010 using Illumina 
(86.4%), AB SOLiD (13%) and Roche LS 454 (0.6%) sequencing technologies.  45% 
of the Illumina coverage is in approximately 100 bp paired end reads, 31.5% in 76 bp 
reads, 15% in 51 bp reads and 8.5% in 36 bp reads.  All AB SOLiD data are 50 bp 
mate paired reads.  Paired end sequence reads from the three technologies were 
mapped against the hg19 human genome reference sequence using bwa version 0.5.5 
(Illumina), bfast version 0.6.4e (AB SOLiD) and ssaha version 2.5 (Roche 454).  
Full details are at  
ftp://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/README.phase1_alignment_data  
and in supplementary materials to the Phase 1 paper (2012).  The mapping target 
consists of the 22 autosomes plus X and 
Y chromosomes (both pseudo-autosomal regions on the Y are masked by Ns), the 
revised CRS mitochondrial sequence (NC_012920), and 59 unplaced contigs.  It does 
NOT include the human herpesvirus 4 sequence (used for cell line transformation) 
or approximately 5 Mb of additional "decoy" sequence compiled from other human 
entries in GenBank.  These two were added to the mapping target in July 2011 and 
will be included in the mapping during 1000 Genomes Phase 2.  For details, see:  
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/README_human_reference_20110707.

Credits:  

Mary Kate Trost, Goncalo Abecasis, Tom Blackwell;  University of Michigan Center 
for Statistical Genetics.

References:

The 1000 Genomes Project Consortium (2010).  A map of human genome variation 
from population-scale sequencing.  Nature, v.467, n.7319, pp.1061-1073.

The 1000 Genomes Project Consortium (2012).  An integrated map of genetic 
variation from 1,092 human genomes.  Nature, v.491, n.7422, pp.56-65.