Last June, in a legendary bit of scientific heroism, a graduate student from the University of California, Santa Cruz stitched together a massive collection of DNA sequence pieces to create the first public draft of the human genome sequence. The student, James Kent, had worked around the clock to create a computer program that could assemble the draft in time for the Human Genome Project to declare completion on June 26, 2000.
Now, Kent and his colleague David Haussler describe the creation of that computer program, revealing the surprisingly simple ideas behind the most important puzzle-solving exercise in recent history. The program, called GigAssembler, had to trim and assemble the nearly 400,000 pieces of human DNA sequence generated by the Human Genome Project over a decade.
To perform this daunting task, GigAssembler used a so-called "greedy" algorithm, which joins the best-fitting sequence pieces first and then works down to weaker fits. GigAssembler can consult a wide variety of information to determine how pieces fit, including sequence overlap, gene data, and "maps" generated by the Human Genome Project. For example, if two sequence segments encode parts of the same gene, GigAssembler scores the pair as a fit.
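
The "best fit first" idea can be illustrated with a toy sketch. This is not GigAssembler's code: it scores fit by exact sequence overlap alone, ignoring the gene data and maps, and the function names `overlap` and `greedy_assemble` and the `min_len` threshold are assumptions made for the example. At each step it finds the pair of pieces with the longest overlap, merges them, and repeats until nothing overlaps.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the longest such match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Toy greedy assembly: repeatedly merge the best-overlapping pair.

    Hypothetical simplification: exact-match overlaps only, no sequencing
    errors, no reverse strands, no gene or map evidence.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_a, best_b = 0, None, None
        # Score every ordered pair and remember the best fit seen so far.
        for a, b in permutations(frags, 2):
            n = overlap(a, b, min_len)
            if n > best_n:
                best_n, best_a, best_b = n, a, b
        if best_n == 0:
            break  # no remaining pair overlaps enough to merge
        # Replace the two pieces with their merged contig.
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_n:])
    return frags


pieces = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(pieces))
```

Because the strongest fits are committed first, an early wrong merge is never undone in this sketch; the real program needed additional strategies to resolve conflicting fits.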
Using these principles, as well as cleverly designed strategies to resolve conflicting fits, GigAssembler successfully assembled the first public genome draft containing 2.7 billion base pairs (88% of the genome). Since then, GigAssembler has performed further assemblies incorporating up to 92% of the human genome.