Genome Assemblies: Difference between revisions

From EchinoWiki
imported>Echinobase
imported>Echinobase
(13 intermediate revisions by the same user not shown)
Line 7: Line 7:


=== '''Assembly_3.1 (Spur_3.1)'''===
=== '''Assembly_3.1 (Spur_3.1)'''===
=== '''Assembly 2.6(Spur 2.6)'''===
=== '''Assembly 2.6 (Spur 2.6)'''===
=== '''Assembly_2.5(Spur_2.5)'''===
=== '''Assembly_2.5 (Spur_2.5)'''===
=== '''Assembly_2.1(Spur_2.1)'''===
=== '''Assembly_2.1 (Spur_2.1)'''===
=== '''Assembly_0.5(Spur_0.5)'''===
 
Spur_2.1 combines BAC reads and WGS reads and uses BAC tiling path information. Contamination identified in Spur_2.0 was removed. Compared to previous assembly releases, Spur_2.1 is more contiguous and has fewer false duplications. The Spur_2.1 release was assembled from 2-fold average coverage in sequence reads from Bacterial Artificial Chromosomes (BACs) and 6-fold coverage in Whole Genome Shotgun (WGS) reads with the HGSC Atlas-2.0 genome assembly system at Baylor College of Medicine. The BAC reads were produced by the Clone-Array Pooled Shotgun Sequencing (CAPSS) method from BAC clones selected based on a minimal FingerPrinted Contigs (FPC) tiling path. In CAPSS, pooled BAC reads are assigned to individual BACs by deconvolution. Each BAC assembly was enriched with WGS reads that overlap the individual BAC reads. The mixed read sets were assembled locally with Atlas. Sets of overlapping BAC clones were identified based on shared WGS reads and sequence overlaps. The overlapping enriched BACs were then merged to form the backbone of the genome assembly. The merged BAC assemblies were further scaffolded using information from mate pairs, BAC clone vector locations, and the BAC tiling path. Finally, contigs from the WGS assembly Spur_0.5 were used to fill gaps in the BAC assembly to produce the Spur_2.0 release. Extensive contamination analysis was done on the Spur_2.0 release, and the Spur_2.1 release was produced by removing the contaminated sequences from it. The Spur_2.1 release includes a set of contigs (continuous blocks of sequence) and scaffolds. Scaffolds include sequence contigs that can be ordered and oriented with respect to each other (multi-contig scaffolds) as well as contigs that could not be linked (single-contig scaffolds or singletons). The N50 of the scaffolds associated with BACs is 216 kb. The N50 of all scaffolds is 142 kb. The total length of all contigs greater than 1 kb is 804 Mb.
When the gaps between contigs in scaffolds are included, the total span of the assembly is 907 Mb. The estimated size of the genome based on the assembly is 814 Mb. The Spur_2.1 assembly was compared with other available sea urchin sequence data (ESTs, Unigene clusters) to determine the extent of coverage (completeness). A preliminary examination showed that over 90% of the sequences in these data sets are represented, indicating that the shotgun libraries used to sequence the genome were comprehensive. Typical errors in draft genome sequences include misassemblies of repeat sequences, collapses of repeat regions, and artificial duplications in polymorphic regions. However, base accuracy in contigs is usually very high, with most errors near the ends of contigs. These data can be downloaded from
 
ftp://ftp.hgsc.bcm.edu/Spurpuratus/fasta/Spur_v2.1/
 
=== '''Assembly_0.5 (Spur_0.5)'''===
 
Spur_0.5 is a preliminary assembly of the California purple sea urchin, S. purpuratus, using whole genome shotgun (WGS) reads with the Atlas genome assembly system at the Baylor College of Medicine Human Genome Sequencing Center. The products of the Atlas assembler are a set of contigs and scaffolds. The total length of all contigs greater than 1 kb is 768 Mb, the N50 of the contigs larger than 1 kb is 10.18 kb, and the N50 of the scaffolds is 47.98 kb. The total span of the assembly is 1.13 Gb, which is 240 Mb larger than the estimated genome size. The sequence coverage is 6.2X. A preliminary examination showed that over 90% of the sequences in other available sea urchin sequence data sets (Unigene clusters) are represented in the Spur_0.5 assembly. Comparison to 25 NCBI HTGS_PHASE2 BACs (total 2.9 Mb) revealed several types of inconsistency. Several cases of short non-merging overlaps were observed, most at the tails of scaffolded contigs; this may be due to polymorphism such that the merging criteria were not met. Several short contigs were found aligning in the middle of long alignment gaps of large scaffolded contigs (7 cases). These large gaps come from scaffolding with only short (2 to 6 kb) and large (50 kb, 150 kb) inserts but no mid-sized (10 to 15 kb) inserts, resulting in unfilled large gaps and artificial expansion of the total sequence size in the supercontigs. Other minor inconsistencies included three cases of differences between genome contigs and PHASE2 BACs, and two possible misjoins. Checking the three contigs in detail did not identify misassemblies. One possible misjoin is in a repeat region, and the other is a possible local misordering of a short 2 kb contig in the middle of a long scaffolded contig.


== '''''Patiria miniata''''' ==
== '''''Patiria miniata''''' ==
Line 53: Line 60:
With the goal of resolving the polymorphism issues in the data while maintaining sequence continuity, the Pmin_1.0 assembly was generated in the following steps:
With the goal of resolving the polymorphism issues in the data while maintaining sequence continuity, the Pmin_1.0 assembly was generated in the following steps:


1) 454 reads were assembled by CABOG using settings less stringent than the default (unitigger=bog utgErrorRate=0.03 ovlErrorRate=0.08 cnsErrorRate=0.08 cgwErrorRate=0.14 doExtendClearRanges=0).
#454 reads were assembled by CABOG using settings less stringent than the default (unitigger=bog utgErrorRate=0.03 ovlErrorRate=0.08 cnsErrorRate=0.08 cgwErrorRate=0.14 doExtendClearRanges=0).
 
#Both contig and degenerate sequences from the previous step were chopped into fake reads with ~11x coverage (500 bp long; 460 bp overlap; 80 bp minimal length) for contigs (ctgs) and 8x coverage (450 bp long; 400 bp overlap; 80 bp minimal length) for degenerates (degs). The fake reads were then assembled by Newbler with the -large option.
2) Both contig and degenerate sequences from the previous step were chopped into fake reads with ~11x coverage (500 bp long; 460 bp overlap; 80 bp minimal length) for contigs (ctgs) and 8x coverage (450 bp long; 400 bp overlap; 80 bp minimal length) for degenerates (degs). The fake reads were then assembled by Newbler with the -large option.
#Both 454 and Illumina paired-end reads were mapped to the contigs from the previous step. We used BLAT to map the 454 data and bwa (aln + samse) to map the Illumina data, both with default options. Based on the mapping locations of the paired ends, contigs were then ordered and oriented into scaffolds using ATLAS-Link.
 
#ATLAS-GapFill was then used to assemble the reads locally in an attempt to fill the gaps between contigs within the scaffolds. This final step produced 770.5 Mb of sequence with a contig N50 of 9.5 kb and a scaffold N50 of 50.3 kb.
3) Both 454 and Illumina paired-end reads were mapped to the contigs from the previous step. We used BLAT to map the 454 data and bwa (aln + samse) to map the Illumina data, both with default options. Based on the mapping locations of the paired ends, contigs were then ordered and oriented into scaffolds using ATLAS-Link.
 
4) ATLAS-GapFill was then used to assemble the reads locally in an attempt to fill the gaps between contigs within the scaffolds. This final step produced 770.5 Mb of sequence with a contig N50 of 9.5 kb and a scaffold N50 of 50.3 kb.


<u>Conditions for use</u>
<u>Conditions for use</u>
Line 74: Line 78:


*Any redistribution of the data should carry this notice.
*Any redistribution of the data should carry this notice.
<u>Description of files</u>
There are 2 directories.
<ol style="list-style-type:upper-roman">
  <li>Contigs/ directory</li>
This directory has 2 files for assembled contigs in the genome. There is
no chromosome assignment for the contigs in Pmin_1.0.
Pmin_1.0.20120411.contigs.agp (agp file)
Pmin_1.0.20120411.contigs.fa (fasta file)
The Pmin_1.0.20120411.contigs.agp file describes the positions and
orientations of the contigs in the group. It takes the standard NCBI
format.
  <li>LinearScaffolds/ directory</li>
This directory has 1 file
Pmin_1.0.20120411.linear.fa
The sequences are linearized scaffolds where the gaps between adjacent
contigs within a scaffold are filled with 'N's and the captured gap size
is estimated from the clone insert size.
</ol>
<u>Sequence Statistics</u>
{| class="wikitable"
! Scaffolds/Contigs
! Number
! N50 (kb)
! Bases (Mb)
! Gap (Mb)
|-
| All Scaffolds
| 60,336
| 50.3
| 811.6
| 41.1
|-
| All Contigs
| 181,436
| 9.5
| 770.5
| N/A
|}
<u>History</u>
Pmin_1.0 (Apr 2012): This release was the first assembly of the Patiria miniata genome.


== '''''Lytechinus variegatus''''' ==
== '''''Lytechinus variegatus''''' ==

Revision as of 15:08, 4 December 2019

Echinoderm Genome Assemblies by Species


Strongylocentrotus purpuratus

Assembly_3.1 (Spur_3.1)

Assembly 2.6 (Spur 2.6)

Assembly_2.5 (Spur_2.5)

Assembly_2.1 (Spur_2.1)

Spur_2.1 combines BAC reads and WGS reads and uses BAC tiling path information. Contamination identified in Spur_2.0 was removed. Compared to previous assembly releases, Spur_2.1 is more contiguous and has fewer false duplications. The Spur_2.1 release was assembled from 2-fold average coverage in sequence reads from Bacterial Artificial Chromosomes (BACs) and 6-fold coverage in Whole Genome Shotgun (WGS) reads with the HGSC Atlas-2.0 genome assembly system at Baylor College of Medicine. The BAC reads were produced by the Clone-Array Pooled Shotgun Sequencing (CAPSS) method from BAC clones selected based on a minimal FingerPrinted Contigs (FPC) tiling path. In CAPSS, pooled BAC reads are assigned to individual BACs by deconvolution. Each BAC assembly was enriched with WGS reads that overlap the individual BAC reads. The mixed read sets were assembled locally with Atlas. Sets of overlapping BAC clones were identified based on shared WGS reads and sequence overlaps. The overlapping enriched BACs were then merged to form the backbone of the genome assembly. The merged BAC assemblies were further scaffolded using information from mate pairs, BAC clone vector locations, and the BAC tiling path. Finally, contigs from the WGS assembly Spur_0.5 were used to fill gaps in the BAC assembly to produce the Spur_2.0 release. Extensive contamination analysis was done on the Spur_2.0 release, and the Spur_2.1 release was produced by removing the contaminated sequences from it. The Spur_2.1 release includes a set of contigs (continuous blocks of sequence) and scaffolds. Scaffolds include sequence contigs that can be ordered and oriented with respect to each other (multi-contig scaffolds) as well as contigs that could not be linked (single-contig scaffolds or singletons). The N50 of the scaffolds associated with BACs is 216 kb. The N50 of all scaffolds is 142 kb. The total length of all contigs greater than 1 kb is 804 Mb.
When the gaps between contigs in scaffolds are included, the total span of the assembly is 907 Mb. The estimated size of the genome based on the assembly is 814 Mb. The Spur_2.1 assembly was compared with other available sea urchin sequence data (ESTs, Unigene clusters) to determine the extent of coverage (completeness). A preliminary examination showed that over 90% of the sequences in these data sets are represented, indicating that the shotgun libraries used to sequence the genome were comprehensive. Typical errors in draft genome sequences include misassemblies of repeat sequences, collapses of repeat regions, and artificial duplications in polymorphic regions. However, base accuracy in contigs is usually very high, with most errors near the ends of contigs. These data can be downloaded from

ftp://ftp.hgsc.bcm.edu/Spurpuratus/fasta/Spur_v2.1/
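
The N50 values quoted for these releases can be reproduced from a list of sequence lengths. The following is a minimal sketch of the usual N50 definition, not the HGSC pipeline's own code; the function name `n50` and the toy lengths are illustrative assumptions.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the length L such that sequences of length >= L
    together hold at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # only reached if every length is zero or negative


# Hypothetical scaffold lengths, for illustration only.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_scaffolds)
print(toy_n50)
```

Restricting the input to sequences above a cutoff (for example, contigs longer than 1 kb) changes the result, which is why the releases state the cutoff alongside each figure.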

Assembly_0.5 (Spur_0.5)

Spur_0.5 is a preliminary assembly of the California purple sea urchin, S. purpuratus, using whole genome shotgun (WGS) reads with the Atlas genome assembly system at the Baylor College of Medicine Human Genome Sequencing Center. The products of the Atlas assembler are a set of contigs and scaffolds. The total length of all contigs greater than 1 kb is 768 Mb, the N50 of the contigs larger than 1 kb is 10.18 kb, and the N50 of the scaffolds is 47.98 kb. The total span of the assembly is 1.13 Gb, which is 240 Mb larger than the estimated genome size. The sequence coverage is 6.2X. A preliminary examination showed that over 90% of the sequences in other available sea urchin sequence data sets (Unigene clusters) are represented in the Spur_0.5 assembly. Comparison to 25 NCBI HTGS_PHASE2 BACs (total 2.9 Mb) revealed several types of inconsistency. Several cases of short non-merging overlaps were observed, most at the tails of scaffolded contigs; this may be due to polymorphism such that the merging criteria were not met. Several short contigs were found aligning in the middle of long alignment gaps of large scaffolded contigs (7 cases). These large gaps come from scaffolding with only short (2 to 6 kb) and large (50 kb, 150 kb) inserts but no mid-sized (10 to 15 kb) inserts, resulting in unfilled large gaps and artificial expansion of the total sequence size in the supercontigs. Other minor inconsistencies included three cases of differences between genome contigs and PHASE2 BACs, and two possible misjoins. Checking the three contigs in detail did not identify misassemblies. One possible misjoin is in a repeat region, and the other is a possible local misordering of a short 2 kb contig in the middle of a long scaffolded contig.
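
As a rough illustration of the span inflation described for Spur_0.5, the sketch below derives the implied genome size estimate and the share of the span not covered by contigs from the quoted figures. The helper name `span_breakdown` is hypothetical, and because the 768 Mb total counts only contigs longer than 1 kb, the non-contig figure is an upper bound rather than an exact gap size.

```python
def span_breakdown(span_mb: float, contig_mb: float, excess_mb: float) -> dict:
    """Derive simple summary figures from quoted assembly totals (all in Mb)."""
    non_contig_mb = span_mb - contig_mb
    return {
        # span minus the stated excess over the estimated genome size
        "estimated_genome_mb": span_mb - excess_mb,
        # span not accounted for by contigs longer than 1 kb
        "non_contig_mb": non_contig_mb,
        "non_contig_fraction": non_contig_mb / span_mb,
    }


# Figures quoted for Spur_0.5: 1.13 Gb span, 768 Mb in contigs > 1 kb,
# span 240 Mb larger than the estimated genome size.
spur_05 = span_breakdown(span_mb=1130.0, contig_mb=768.0, excess_mb=240.0)
print(spur_05)
```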

Patiria miniata

V2.0 Assembly

We sought to improve the Patiria miniata genome assembly with additional PacBio sequences. We generated a new PacBio read dataset at the Duke University Sequencing Center using DNA from our reference individual. The read dataset contains 2 million reads and 15.8 billion bp, with a read N50 of 10.4 kb. We used PBJelly2 to combine the PacBio reads with the previously assembled contigs. The result was an improvement in contig size and number, with only a small reduction in the number of scaffolds (Table). The P. miniata Gene v2.0 set was generated using the MAKER2 pipeline from the v2.0 genome assembly.


Statistic            Pm v1.0    Pm v2.0
Scaffold number      60,183     57,698
Scaffold N50 (bp)    52,6141    76,341
Contig number        179,756    131,779
Contig N50 (bp)      9,466      18,676
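
To put the table in relative terms, the sketch below computes the percent change from v1.0 to v2.0 for the rows whose values are unambiguous. The helper `percent_change` is an illustrative name, not part of any assembly tool, and the v1.0 scaffold N50 is left out because its value is not legible as printed.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Values copied from the Pm v1.0 / Pm v2.0 comparison table.
changes = {
    "scaffold_number": percent_change(60_183, 57_698),
    "contig_number": percent_change(179_756, 131_779),
    "contig_n50": percent_change(9_466, 18_676),
}

for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

On these figures the contig count falls by roughly a quarter while the contig N50 nearly doubles, which matches the description of a large contig-level gain with only a small change in scaffold number.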

V1.0 Assembly

What's New

Pmin_1.0 is the latest (as of Apr 11, 2012) assembly of the genome of Patiria miniata. The assembly tools CABOG (Celera Assembler), Newbler, ATLAS-Link, and ATLAS-GapFill were used to assemble a combination of 454 reads (fragment and 2.5 kb insert paired ends; ~15x coverage) and Illumina reads (300 bp insert and 2.5 kb insert paired ends; ~70x coverage).

Introduction

This information is for the first release (Pmin_1.0) of the draft genome sequence of Patiria miniata. This is a draft sequence and may contain errors, so users should exercise caution. Typical errors in draft genome sequences include misassemblies of repeated sequences, collapses of repeated regions, and unmerged overlaps (e.g. due to polymorphisms) creating artificial duplications.

With the goal of resolving the polymorphism issues in the data while maintaining sequence continuity, the Pmin_1.0 assembly was generated in the following steps:

  1. 454 reads were assembled by CABOG using settings less stringent than the default (unitigger=bog utgErrorRate=0.03 ovlErrorRate=0.08 cnsErrorRate=0.08 cgwErrorRate=0.14 doExtendClearRanges=0).
  2. Both contig and degenerate sequences from the previous step were chopped into fake reads with ~11x coverage (500 bp long; 460 bp overlap; 80 bp minimal length) for contigs (ctgs) and 8x coverage (450 bp long; 400 bp overlap; 80 bp minimal length) for degenerates (degs). The fake reads were then assembled by Newbler with the -large option.
  3. Both 454 and Illumina paired-end reads were mapped to the contigs from the previous step. We used BLAT to map the 454 data and bwa (aln + samse) to map the Illumina data, both with default options. Based on the mapping locations of the paired ends, contigs were then ordered and oriented into scaffolds using ATLAS-Link.
  4. ATLAS-GapFill was then used to assemble the reads locally in an attempt to fill the gaps between contigs within the scaffolds. This final step produced 770.5 Mb of sequence with a contig N50 of 9.5 kb and a scaffold N50 of 50.3 kb.
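
The chopping in step 2 can be pictured as a sliding window over each sequence. The sketch below is one plausible reading of the stated parameters (read length, overlap, minimal length), not the script actually used for Pmin_1.0; the function name `chop_into_reads` and the way the sequence tail is handled (stop once a read reaches the end, drop a tail shorter than the minimum) are assumptions.

```python
from typing import List


def chop_into_reads(seq: str, read_len: int, overlap: int, min_len: int) -> List[str]:
    """Cut seq into overlapping pseudo-reads.

    Consecutive reads start (read_len - overlap) bases apart. Chopping stops
    once a read reaches the end of seq; a final piece shorter than min_len
    is discarded (assumed tail handling).
    """
    if not 0 <= overlap < read_len:
        raise ValueError("overlap must be >= 0 and smaller than read_len")
    step = read_len - overlap
    reads = []
    for start in range(0, len(seq), step):
        piece = seq[start:start + read_len]
        if len(piece) < min_len:
            break
        reads.append(piece)
        if start + read_len >= len(seq):
            break
    return reads


# Toy 1,000 bp sequence chopped with the contig settings from step 2.
toy_contig = "ACGT" * 250
toy_reads = chop_into_reads(toy_contig, read_len=500, overlap=460, min_len=80)
print(len(toy_reads), len(toy_reads[0]), len(toy_reads[-1]))
```

The degenerate-sequence settings from the same step would be passed as `read_len=450, overlap=400, min_len=80`.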

Conditions for use

These data are made available before scientific publication with the following understanding:

  • The data may be freely downloaded, used in analyses, and repackaged in databases.
  • Users are free to use the data in scientific papers analyzing particular genes and regions if the providers of this data (Baylor College of Medicine Human Genome Sequencing Center) are properly acknowledged. Please cite the BCM-HGSC web site or publications from BCM-HGSC referring to the genome sequence.
  • The BCM-HGSC plans to publish the assembly and genomic annotation of the dataset, including large-scale identification of regions of evolutionary conservation and other features.
  • Any redistribution of the data should carry this notice.

Description of files

There are 2 directories.

  1. Contigs/ directory
     This directory has 2 files for assembled contigs in the genome. There is no chromosome assignment for the contigs in Pmin_1.0.
       Pmin_1.0.20120411.contigs.agp (AGP file)
       Pmin_1.0.20120411.contigs.fa (FASTA file)
     The Pmin_1.0.20120411.contigs.agp file describes the positions and orientations of the contigs in the group. It follows the standard NCBI format.
  2. LinearScaffolds/ directory
     This directory has 1 file:
       Pmin_1.0.20120411.linear.fa
     The sequences are linearized scaffolds in which the gaps between adjacent contigs within a scaffold are filled with 'N's; the captured gap size is estimated from the clone insert size.
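
The relationship between the AGP file, the contig FASTA, and the linearized scaffolds can be sketched as follows. This is a minimal illustration that assumes the tab-delimited NCBI AGP column layout (object, start, end, part number, component type, then either component ID, start, end, orientation or gap length and gap details) and rows already sorted by part number. The function names and the toy scaffold and contig names are hypothetical, and this is not the code used to produce Pmin_1.0.20120411.linear.fa.

```python
from typing import Dict, Iterable, List

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def linearize(agp_lines: Iterable[str], contigs: Dict[str, str]) -> Dict[str, str]:
    """Build scaffold sequences from AGP rows, filling gaps with 'N's."""
    parts: Dict[str, List[str]] = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        scaffold, component_type = cols[0], cols[4]
        if component_type in ("N", "U"):
            piece = "N" * int(cols[5])  # gap row: column 6 is the gap length
        else:
            start, end = int(cols[6]), int(cols[7])  # 1-based, inclusive
            piece = contigs[cols[5]][start - 1:end]
            if cols[8] == "-":
                piece = reverse_complement(piece)
        parts.setdefault(scaffold, []).append(piece)
    return {name: "".join(pieces) for name, pieces in parts.items()}


# Toy input with made-up names; real files would be read from disk.
toy_agp = [
    "Scaffold1\t1\t4\t1\tW\tContig1\t1\t4\t+",
    "Scaffold1\t5\t7\t2\tN\t3\tscaffold\tyes\tpaired-ends",
    "Scaffold1\t8\t11\t3\tW\tContig2\t1\t4\t-",
]
toy_contigs = {"Contig1": "ACGT", "Contig2": "AAAC"}
toy_scaffolds = linearize(toy_agp, toy_contigs)
print(toy_scaffolds["Scaffold1"])
```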

Sequence Statistics

Scaffolds/Contigs    Number     N50 (kb)    Bases (Mb)    Gap (Mb)
All Scaffolds        60,336     50.3        811.6         41.1
All Contigs          181,436    9.5         770.5         N/A
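
The two rows of the table are mutually consistent: scaffold bases minus gap bases gives the contig total. The short sketch below checks that and derives two secondary figures; the variable names are illustrative, and the values are copied from the table.

```python
# Values from the Sequence Statistics table (Mb unless noted).
scaffold_bases_mb = 811.6
scaffold_gap_mb = 41.1
contig_bases_mb = 770.5
n_scaffolds = 60_336
n_contigs = 181_436

# Scaffold bases with the 'N' gaps removed should match the contig total.
ungapped_mb = scaffold_bases_mb - scaffold_gap_mb
gap_fraction = scaffold_gap_mb / scaffold_bases_mb
contigs_per_scaffold = n_contigs / n_scaffolds

print(f"ungapped: {ungapped_mb:.1f} Mb (contigs: {contig_bases_mb} Mb)")
print(f"gap share of scaffold bases: {gap_fraction:.1%}")
print(f"mean contigs per scaffold: {contigs_per_scaffold:.2f}")
```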

History

Pmin_1.0 (Apr 2012): This release was the first assembly of the Patiria miniata genome.

Lytechinus variegatus

Assembly LvPtE5C

Assembly LvMSCB

Assembly 2.2 (Lvar_2.2)

Assembly 0.4 (Lvar_0.4)