A Genome Into the Computer

What is a genome?

A genome is the sequence of DNA that makes up the genetic code of every living organism. In real life, DNA is made up of simple chemical compounds bound together sequentially. In the abstract world of the computer, DNA is represented as a string of characters (for example “ACGGTGCT”).

simple view of a genome

The genome is the underlying medium on which all information about an organism is stored. There are many resources in books and on the web to help you gain a more complete understanding of the biological significance of a genome (Wikipedia, NCBI, a tour). However, the focus of these posts is not what the genome does, but what the genome is and how we obtain and work with it on a computer.

Obtain a genome … IRL

When I mention that I work with genomes on a computer, most people think I’m crazy and don’t quite understand how I can work with something found inside a cell, on a Dell (check out that rhyme). Fortunately, there is a plethora of resources out there that explain how we obtain a genome in real life.

Professor Wikipedia is always a good place to start on a new topic, and the previous link provides a somewhat technical overview of DNA sequencing. The US government was heavily involved in many of the earliest and largest DNA sequencing projects, and the website genome.gov provides a concise textual introduction to genome sequencing.

Now for a quick overview in my own words: the process of reading DNA from a cell and transferring that information to a computer is known as “sequencing”. The ability to sequence DNA has been around for almost 30 years, but the techniques and the speed at which it is done have changed drastically since its inception.

From the 1990s to the early 2000s, DNA sequencing was very expensive, and sequencing the genome of a large mammal (like the human) cost hundreds of millions of dollars. Several for-profit companies began automating the process and developing new techniques. Now it is possible to sequence the human genome for about 10 to 20 thousand dollars.

overview of DNA sequencing

I’ve created a simple graphic to explain how sequencing is done. In essence it’s a three-step process. First, the DNA is isolated from one or more cells and loaded into the machine. Next, the machine uses some tricks to get each base in the DNA to emit a color, and it takes a whole bunch of pictures. Finally, the image files are processed, an algorithm figures out which bases they represent, and the resulting string of DNA bases is written to a file on a computer.
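
The final step, turning pictures into bases, can be sketched in a few lines of Python. This is only a toy illustration and not how any real sequencer's software works: it assumes each round of imaging has already been reduced to four intensity values, one per color channel, and that the channels are ordered A, C, G, T. The names `CHANNELS` and `call_bases` and the intensity values are made up for this example.

```python
# Toy base caller: one tuple of four channel intensities per imaging cycle.
# The channel-to-base order below is an assumption made for illustration.
CHANNELS = "ACGT"


def call_bases(cycles):
    """Return the base string by picking the brightest channel in each cycle."""
    bases = []
    for intensities in cycles:
        # index of the strongest signal among the four channels
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        bases.append(CHANNELS[brightest])
    return "".join(bases)


# Made-up intensities for four cycles.
cycles = [
    (0.9, 0.1, 0.0, 0.1),  # strongest in the A channel
    (0.1, 0.8, 0.1, 0.0),  # strongest in the C channel
    (0.0, 0.2, 0.7, 0.1),  # strongest in the G channel
    (0.1, 0.0, 0.2, 0.9),  # strongest in the T channel
]
print(call_bases(cycles))  # ACGT
```

Picking the brightest channel is the simplest possible rule; it ignores noise and any measure of how confident each call is.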

Summary

Thank you for reading the second post in my series dedicated to the computer and the genome. The reader should now have at least a basic understanding of what a genome is from a biological perspective (by following the provided links), and of how DNA is sequenced and represented on a computer. The following post will go into how raw sequence data is assembled into a genome.

Genome on a Computer

A genome on a computer

For the purpose of this and subsequent posts we will build a hypothetical genome, using very simple assumptions which we will change as needed:

  • the genome is made up of only the characters A, C, G, T (bases)
  • the genome is 10-mer unique
  • it has a length of roughly 100 base pairs (the example sequence used below is 99 bases long)

genome from NCBI

The first assumption is relatively benign and is made by many of the most complex tools used in computational genomics. The third assumption is made simply because 100 characters is a convenient amount to work with on a blog (the human genome is more than 3 gigabases; that is a 3 followed by 9 zeros).
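
The first assumption is easy to check in code. The sketch below is one way to do it; the function name `is_valid_genome` is made up for this post, and it treats anything other than an upper-case A, C, G or T as invalid (real sequence files often contain other symbols, such as N for an unknown base).

```python
# Hypothetical helper: check the "only A, C, G, T" assumption.
ALPHABET = set("ACGT")


def is_valid_genome(sequence):
    """Return True if every character in sequence is one of A, C, G, T."""
    return set(sequence) <= ALPHABET


print(is_valid_genome("ACGGTGCT"))  # True
print(is_valid_genome("ACGNTGCT"))  # False, because of the N
```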

The second assumption is slightly more advanced and is the first sighting of a very challenging aspect of working with genomes. We will have to use some jargon to discuss it: a k-mer is any sequence of k consecutive bases. The second assumption therefore simply means that no run of 10 consecutive bases appears more than once in the genome.
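
To make the jargon concrete, here is a small sketch on a made-up 8-base string (not the genome used in these posts). The helper name `kmers` is an assumption for illustration. A sequence of length n contains n - k + 1 k-mers, and the toy string below is 4-mer unique but not 3-mer unique, because ACG appears twice.

```python
def kmers(sequence, k):
    """Return every k-mer of sequence, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


toy = "ACGTACGA"

three_mers = kmers(toy, 3)
print(three_mers)                               # ACG shows up at positions 0 and 4
print(len(three_mers) == len(set(three_mers)))  # False: not 3-mer unique

four_mers = kmers(toy, 4)
print(len(four_mers) == len(set(four_mers)))    # True: 4-mer unique
```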

example

Throughout these posts I will try to include as many visual examples as possible. Many of them will use Python, and I will post all the information necessary to recreate the examples on your own.

This first example will give the reader a visualisation of what a genome is on a computer. First I will define an arbitrary genome sequence (taken from here for those who are interested):

genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"

I claimed the genome was 10-mer unique and the following code should demonstrate that.

for x in range(0, len(genome) - 9):
    print(genome[x:x+10])

The output of that script is below and you can visually inspect it to be sure that there are no 10-mer duplicates.

AGACAGACAT
GACAGACATA
ACAGACATAG
CAGACATAGG
AGACATAGGA
GACATAGGAG
ACATAGGAGA
CATAGGAGAT
ATAGGAGATT
TAGGAGATTG
AGGAGATTGC
GGAGATTGCT
GAGATTGCTG
AGATTGCTGT
GATTGCTGTA
ATTGCTGTAG
TTGCTGTAGA
TGCTGTAGAA
GCTGTAGAAA
CTGTAGAAAC
TGTAGAAACA
GTAGAAACAA
TAGAAACAAA
AGAAACAAAA
GAAACAAAAA
AAACAAAAAT
AACAAAAATA
ACAAAAATAT
CAAAAATATA
AAAAATATAC
AAAATATACG
AAATATACGA
AATATACGAG
ATATACGAGT
TATACGAGTA
ATACGAGTAT
TACGAGTATA
ACGAGTATAA
CGAGTATAAT
GAGTATAATA
AGTATAATAT
GTATAATATT
TATAATATTG
ATAATATTGC
TAATATTGCA
AATATTGCAT
ATATTGCATA
TATTGCATAA
ATTGCATAAA
TTGCATAAAT
TGCATAAATT
GCATAAATTA
CATAAATTAG
ATAAATTAGG
TAAATTAGGG
AAATTAGGGT
AATTAGGGTG
ATTAGGGTGT
TTAGGGTGTG
TAGGGTGTGC
AGGGTGTGCA
GGGTGTGCAC
GGTGTGCACA
GTGTGCACAA
TGTGCACAAA
GTGCACAAAA
TGCACAAAAT
GCACAAAATA
CACAAAATAT
ACAAAATATC
CAAAATATCA
AAAATATCAG
AAATATCAGA
AATATCAGAG
ATATCAGAGA
TATCAGAGAG
ATCAGAGAGA
TCAGAGAGAT
CAGAGAGATG
AGAGAGATGA
GAGAGATGAG
AGAGATGAGC
GAGATGAGCT
AGATGAGCTG
GATGAGCTGG
ATGAGCTGGC
TGAGCTGGCA
GAGCTGGCAA
AGCTGGCAAC
GCTGGCAACA

We can verify programmatically that the example genome is indeed 10-mer unique with the following code:

# save the genome to a variable.
genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"

# make dictionary to track kmers.
kmer_cnt = {}

# loop over every kmer.
for x in range(0, len(genome) - 9):
	# get kmer.
	kmer = genome[x:x+10]

	# add to dictionary
	if kmer not in kmer_cnt:
		kmer_cnt[kmer] = 0
	kmer_cnt[kmer] += 1

# count number of kmers that appeared more than once.
cnt = 0
for kmer in kmer_cnt:
	if kmer_cnt[kmer] > 1:
		cnt += 1

# report count.
print("there were %i repeated k-mers" % cnt)
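
As an alternative sketch, the same check can be written with `collections.Counter` from the Python standard library, with k as a parameter so that other values can be tried. The function name `repeated_kmers` is made up for this post; it returns the repeated k-mers themselves rather than just a count.

```python
from collections import Counter


def repeated_kmers(sequence, k):
    """Return a dict of the k-mers that occur more than once, with their counts."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"

# An empty dict here means the genome is 10-mer unique.
print(repeated_kmers(genome, 10))

# Shorter k-mers do repeat: for example, AGAC starts at positions 0 and 4.
print("AGAC" in repeated_kmers(genome, 4))
```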

Summary

This post should give you an idea of what a genome is from a computational perspective. Basically, on a computer we represent a genome as a string of A, C, G, T characters, and our current working example is 10-mer unique, so no run of 10 consecutive characters appears more than once.

The next post will discuss how in real life a genome can be obtained using a genome sequencer and how we work with that information on a computer.