Drosophila simulans w501 genome sequence.

We created a new assembly of the Drosophila simulans genome using 142 million paired short-read sequences together with previously published data for strain w501. The new assembly is longer than the previous reference on every chromosome arm, by ∼17% on average. The largest gain in sequence content is on the X chromosome: an additional 6.41 Mb, making it 44% larger than in the previous reference. This substantial increase in coverage of the X chromosome translates to a 17% gain in the total number of full-length orthologous gene matches. We also see proportionally large gains in assembly length on chromosome 4 relative to the previous reference. Overall, our assembly represents a higher-quality genomic sequence with greater coverage, fewer misassemblies, and, by several measures, fewer sequence errors.

As of February 2015, our assembly and its accompanying annotation have been integrated into FlyBase as D. simulans Release 2 (http://flybase.org/static_pages/feature/previous/articles/2015_02/Dsim_r2.01.html).

FlyBase used the GenBank version of the genome, available here: http://www.ncbi.nlm.nih.gov/genome?term=txid7240[orgn]. One arguably annoying aspect of GenBank is that it does not accept sequences shorter than 200 nt, so the “complete” genome there is actually missing 11,973 unassembled contigs totaling ~1.8 Mb. If you would like to include these contigs to improve mapping of short reads to the genome, you can download them here: [unassembled_contigs.fa.gz].
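One way to include the extra contigs is to append them to the reference FASTA before building an aligner index. The following is a minimal sketch, not the procedure used for the published assembly: the function names and the file name dsim_r2_genome.fa.gz are hypothetical, and it assumes both inputs are ordinary FASTA files (plain or gzip-compressed) with no sequence names in common.

```python
import gzip
from pathlib import Path


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def merge_fasta(inputs, output):
    """Concatenate FASTA files into one file.

    Returns (number of records, total number of bases) and raises
    ValueError if the same sequence name appears more than once,
    since duplicate names confuse most read mappers.
    """
    n_records = 0
    total_bases = 0
    seen = set()
    with open_text(output, "wt") as out:
        for path in inputs:
            with open_text(path) as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        fields = line[1:].split()
                        name = fields[0] if fields else ""
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name}")
                        seen.add(name)
                        n_records += 1
                    else:
                        total_bases += len(line)
                    out.write(line + "\n")
    return n_records, total_bases


# Hypothetical usage (file names are placeholders, not real download names):
# n, bases = merge_fasta(
#     ["dsim_r2_genome.fa.gz", "unassembled_contigs.fa.gz"],
#     "dsim_r2_plus_unassembled.fa",
# )
# print(f"{n} sequences, {bases} bases written")
```

The merged file can then be indexed with whichever short-read aligner you use. Reads that originate from the unassembled contigs then have a place to map, rather than going unmapped or being forced onto a similar region of the main assembly.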

A folder containing branch-specific divergence estimates for each gene, broken down by annotated site class, can be downloaded here: [Hu_etal_divergence_estimates.zip].

Associated manuscript:

Hu TT, Eisen MB, Thornton KR, Andolfatto P. 2013. A second generation assembly of the Drosophila simulans genome provides new insights into patterns of lineage-specific divergence. Genome Research, 23:89-98. http://www.ncbi.nlm.nih.gov/pubmed/22936249.