Infographic: The Sequencing and Assembly of the Human Genome

ABOVE: MODIFIED FROM © ISTOCK.COM, FILO

After sequencing fragments of DNA to obtain reads, most genomic pipelines follow one of two paths. The reads can be assembled de novo, building longer stretches called contigs from scratch, with overlapping sequences at the ends of the reads dictating which pieces belong next to each other (below left). Alternatively, reads can be aligned to a reference genome to identify small genetic variations (below right). If de novo assembly is like assembling a puzzle without the picture on the box, alignment is like piecing the puzzle together while looking at that picture. However, because a single reference genome fails to capture all of the genetic diversity across humans, some sections of DNA might not align well to the reference genome.
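The two paths can be illustrated with a toy sketch. The function names (`overlap`, `greedy_assemble`, `align_and_call`), the `min_len` threshold, and the example sequences are all hypothetical choices made for illustration. Real assemblers and aligners use far more sophisticated methods (overlap or de Bruijn graphs, indexed references, gap-aware scoring), so this only conveys the idea of joining reads by their overlapping ends versus placing a read against a reference.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Toy de novo assembly: repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:  # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


def align_and_call(read: str, reference: str):
    """Toy alignment: slide the read along the reference (no gaps) and
    return the best offset plus mismatches as (position, ref_base, read_base)."""
    best_offset, best_mismatches = 0, None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = [
            (offset + i, r, q)
            for i, (r, q) in enumerate(zip(window, read))
            if r != q
        ]
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches or []


# Made-up example reads; each overlaps the next by four bases.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)

# A read carrying one substitution relative to the assembled sequence.
offset, variants = align_and_call("GCGTTCC", contigs[0])
```

In this example the three reads collapse into one contig, and the aligned read reports a single mismatch, which is the kind of small variation that reference-based alignment is used to find.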

Infographic comparing assembly versus alignment


The evolution of sequencing

Numerous sequencing methods have been developed over the past few decades, but the major advances include Sanger sequencing, sequencing by synthesis, nanopore long-read sequencing from Oxford Nanopore, and, most recently, high-fidelity single-molecule real-time sequencing from PacBio. These differ in read length, efficiency, and accuracy, with technologies generally evolving to support faster, cheaper, and more precise sequencing.

modified from © istock.com, Dmitry Kovalchuk

SANGER SEQUENCING
The first widely adopted sequencing technology, and now largely replaced in large-scale genome projects, Sanger sequencing relies on tagging the ends of DNA fragments of various sizes with complementary fluorescent nucleotides. Fragments are then separated by size using gel electrophoresis, and the fluorescence of each fragment's final nucleotide is read by a laser. The full sequence is inferred by piecing together the end nucleotides of the different-sized fragments.
YEARS IN USE: 1980–2010
READ LENGTH: ~500–1,000 bases
CONS: Low throughput, time intensive

SEQUENCING BY SYNTHESIS
Sequencing by synthesis (SBS) is the most commonly used type of sequencing today. It relies on synthesizing complementary DNA strands using fluorescently tagged nucleotides and capturing the output signal with a high-resolution camera. Hundreds of thousands of DNA fragments can be read at once, but SBS is limited to short lengths of DNA, making it challenging to assemble whole genomes de novo.
YEARS IN USE: 2002–today
READ LENGTH: ~100–500 bases
CONS: Limited to short reads

NANOPORE SEQUENCING
Oxford Nanopore devices pull DNA through a bioengineered pore to produce electrical current fluctuations that are then translated into a sequence. This approach generates long reads that can be used for de novo genome assembly or to identify larger structural variations that short reads may not be able to detect, but it is less accurate than other sequencing technologies.
YEARS IN USE: 2014–today
READ LENGTH: ~10 kb–1 Mb
CONS: Error-prone

HIGH-FIDELITY SEQUENCING
Only recently released by PacBio, high-fidelity (HiFi) single-molecule real-time (SMRT) sequencing relies on fluorescence strategies similar to those used in SBS. Like nanopore sequencing, HiFi produces long reads that can be used for de novo genome assembly or to identify structural variants, but it achieves improved accuracy by circularizing a long DNA molecule so that it can be read dozens of times in a single run.
YEARS IN USE: 2020–today
READ LENGTH: ~10 kb
CONS: Currently very expensive
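Why repeated passes over the same circularized molecule improve accuracy can be shown with a small sketch. This is not PacBio's actual consensus algorithm. It assumes, for simplicity, that every pass has already been aligned to the same coordinates and has the same length, and it takes a plain majority vote per position. The function name `consensus` and the example passes are hypothetical.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority-vote consensus over equal-length, pre-aligned passes.

    Assumption: the passes are already aligned column by column.
    Real data contains insertions and deletions and needs a proper
    alignment step before any voting.
    """
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")
    # zip(*passes) yields one tuple of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


# Five made-up noisy passes over the same molecule; three carry one error each.
noisy_passes = ["ACGTAC", "ACGAAC", "ACGTAC", "TCGTAC", "ACGTCC"]
corrected = consensus(noisy_passes)
```

Because the random errors land at different positions in different passes, each one is outvoted by the other reads of that position, which is the intuition behind reading the same molecule many times.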

Visualizing a pangenome

Unlike a linear reference genome, a graph genome allows a single region of the genome to take on a diverse set of sequences. For regions with high genetic diversity, a graph genome can better capture the many human DNA sequences that might exist.
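A graph genome can be pictured as a small data structure. In the sketch below, the node IDs, sequences, and the `spell_paths` helper are hypothetical, and the graph is assumed to be acyclic. Each node holds a stretch of sequence, edges say which node may follow which, and a region with two alternative alleles appears as two parallel nodes.

```python
# Nodes hold sequence segments; nodes 2 and 3 are alternative alleles
# at the same position (a made-up single-base variant).
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTGA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes, edges, start, end):
    """Yield the sequence spelled by every path from `start` to `end`.

    Assumes the graph has no cycles; otherwise the walk would not terminate.
    """
    stack = [(start, nodes[start])]
    while stack:
        node, seq = stack.pop()
        if node == end:
            yield seq
            continue
        for nxt in edges.get(node, []):
            stack.append((nxt, seq + nodes[nxt]))


haplotypes = sorted(spell_paths(nodes, edges, start=1, end=4))
```

A linear reference would store only one of these two sequences. The graph keeps both, so a read carrying either allele has a path it can match.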