A Genome Into the Computer

What is a genome?

A genome is the sequence of DNA that makes up the genetic code of every living organism. In real life, DNA is made up of simple chemical compounds bound together sequentially. In the abstract world of the computer, DNA is represented as a string of characters (for example “ACGGTGCT”).
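To make the "string of characters" idea concrete, here is a minimal sketch in Python. It uses the example string from the paragraph above; the variable name `dna` is just an illustrative choice.

```python
# A toy DNA sequence stored as an ordinary Python string.
dna = "ACGGTGCT"

# The length of the string is the number of bases.
print(len(dna))

# Count how many times each of the four bases occurs.
for base in "ACGT":
    print(base, dna.count(base))
```

Nothing here is specific to biology: once the DNA is a string, every ordinary string operation is available.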

simple view of a genome

The genome is the underlying medium upon which all information about an organism is stored. There are many resources in books and on the web to help you gain a more complete understanding of the biological significance of a genome (Wikipedia, NCBI, a tour). However, the focus of these posts is not what the genome does, but rather what the genome is and how we obtain and work with it on a computer.

Obtain a genome … IRL

When I mention that I work with genomes on a computer, most people think I’m crazy and don’t quite understand how I can work with something found inside a cell, on a Dell (check out that rhyme). Fortunately there is a plethora of resources out there that explain how we obtain a genome in real life.

Professor Wikipedia is always a good place to start on a new topic, and the previous link provides a somewhat technical overview of DNA sequencing. The US government was heavily involved in many of the earliest and largest DNA sequencing projects, and the website genome.gov provides a concise textual introduction to genome sequencing.

Now for a quick overview in my own words: the process of reading DNA from a cell and transferring that information to a computer is known as “sequencing”. The ability to sequence DNA has been around for almost 30 years, but the techniques and the speed at which it is done have changed drastically since its inception.

From the 1990s to the early 2000s DNA sequencing was very expensive, and sequencing the genome of a large mammal (like the human) cost hundreds of millions of dollars. Several for-profit companies began automating the processes and developing new techniques. Now it is possible to sequence the human genome for about 10-20 thousand dollars.

overview of DNA sequencing

I’ve created a simple graphic to explain how sequencing is done. In essence it’s a three-step process. First, the DNA is isolated from one or more cells and loaded into the machine. Next, the machine uses some tricks to get each base in the DNA to emit a color, and it takes a whole bunch of pictures. Finally, the image files are processed, an algorithm figures out which bases they represent, and the string of DNA bases is written to a file on a computer.
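The last step, turning colors into letters, can be sketched as a toy in Python. Everything here is an assumption made for illustration: the color names, the color-to-base assignment in `COLOR_TO_BASE`, and the function name `call_bases` are all hypothetical, and real instruments work from noisy image data rather than clean color labels.

```python
# Hypothetical assignment of one color per base; real instruments differ.
COLOR_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def call_bases(colors):
    """Translate a list of detected colors into a DNA string."""
    return "".join(COLOR_TO_BASE[color] for color in colors)


# Pretend these are the colors seen in eight successive pictures.
observed = ["green", "blue", "yellow", "yellow", "red", "yellow", "blue", "red"]
read = call_bases(observed)
print(read)
```

In a real pipeline the resulting string would then be written to a file, which is the point at which the DNA has made it "into the computer".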

Summary

Thank you for reading the second post in my series dedicated to the computer and the genome. You should now have at least a basic understanding of what a genome is from a biological perspective (by following the provided links), and of how DNA is sequenced and represented on a computer. The following post will go into how raw sequenced data is assembled into a genome.

Genome on a Computer

A genome on a computer

For the purpose of this and subsequent posts we will build a hypothetical genome, using very simple assumptions which we will change as needed:

  • the genome is made up of only the following characters: A, C, G, T (bases)
  • the genome is 10-mer unique
  • it has a length of 100 bases

genome from ncbi

The first assumption is relatively benign and is made by many of the most complex tools used in computational genomics. The third assumption is made simply because 100 characters is a convenient amount to work with on a blog (the human genome is more than 3 gigabases; that’s a 3 with 9 zeros).

The second assumption is slightly more advanced and is the first sighting of a very challenging aspect of working with genomes. We will have to use some jargon to discuss it: a k-mer is any sequence of k consecutive bases. The second assumption therefore simply means that no run of 10 consecutive bases in the genome is the same as the run of 10 starting at any other position.
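As a small illustration of the jargon, here is a sketch that lists every k-mer of a short string. The helper name `kmers` is my own choice for this illustration and not part of any library.

```python
def kmers(sequence, k):
    """Return every k-mer of `sequence`, in order of starting position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# All 3-mers of a short toy sequence.
print(kmers("ACGGTGCT", 3))
```

A sequence of length n has n - k + 1 k-mers, so the 8-character toy string has six 3-mers; a sequence shorter than k has none.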

example

Throughout these posts I will try to include as many visual examples as possible. Many of them will use Python, and I will post all the information necessary to recreate the examples on your own.

This first example will give the reader a visualisation of what a genome is on a computer. First I will define an arbitrary genome sequence (taken from here for those who are interested). Note that this particular string is 99 characters long, one short of the 100 assumed above.

genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"

I claimed the genome was 10-mer unique and the following code should demonstrate that.

for x in range(len(genome) - 9):
    print(genome[x:x+10])

The output of that script is below and you can visually inspect it to be sure that there are no 10-mer duplicates.

AGACAGACAT
GACAGACATA
ACAGACATAG
CAGACATAGG
AGACATAGGA
GACATAGGAG
ACATAGGAGA
CATAGGAGAT
ATAGGAGATT
TAGGAGATTG
AGGAGATTGC
GGAGATTGCT
GAGATTGCTG
AGATTGCTGT
GATTGCTGTA
ATTGCTGTAG
TTGCTGTAGA
TGCTGTAGAA
GCTGTAGAAA
CTGTAGAAAC
TGTAGAAACA
GTAGAAACAA
TAGAAACAAA
AGAAACAAAA
GAAACAAAAA
AAACAAAAAT
AACAAAAATA
ACAAAAATAT
CAAAAATATA
AAAAATATAC
AAAATATACG
AAATATACGA
AATATACGAG
ATATACGAGT
TATACGAGTA
ATACGAGTAT
TACGAGTATA
ACGAGTATAA
CGAGTATAAT
GAGTATAATA
AGTATAATAT
GTATAATATT
TATAATATTG
ATAATATTGC
TAATATTGCA
AATATTGCAT
ATATTGCATA
TATTGCATAA
ATTGCATAAA
TTGCATAAAT
TGCATAAATT
GCATAAATTA
CATAAATTAG
ATAAATTAGG
TAAATTAGGG
AAATTAGGGT
AATTAGGGTG
ATTAGGGTGT
TTAGGGTGTG
TAGGGTGTGC
AGGGTGTGCA
GGGTGTGCAC
GGTGTGCACA
GTGTGCACAA
TGTGCACAAA
GTGCACAAAA
TGCACAAAAT
GCACAAAATA
CACAAAATAT
ACAAAATATC
CAAAATATCA
AAAATATCAG
AAATATCAGA
AATATCAGAG
ATATCAGAGA
TATCAGAGAG
ATCAGAGAGA
TCAGAGAGAT
CAGAGAGATG
AGAGAGATGA
GAGAGATGAG
AGAGATGAGC
GAGATGAGCT
AGATGAGCTG
GATGAGCTGG
ATGAGCTGGC
TGAGCTGGCA
GAGCTGGCAA
AGCTGGCAAC
GCTGGCAACA

We can verify programmatically that the example genome is indeed 10-mer unique with the following code:

# save the genome to a variable.
genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"

# make a dictionary to count how often each k-mer appears.
kmer_cnt = {}

# loop over every 10-mer.
for x in range(len(genome) - 9):
    # get the k-mer starting at position x.
    kmer = genome[x:x+10]

    # add it to the dictionary.
    kmer_cnt[kmer] = kmer_cnt.get(kmer, 0) + 1

# count the number of k-mers that appeared more than once.
cnt = sum(1 for n in kmer_cnt.values() if n > 1)

# report the count.
print("there were %i repeated k-mers" % cnt)
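The same check can be generalised to any k. The sketch below is one possible way to do it, not the only one: the function names `is_kmer_unique` and `smallest_unique_k` are my own, and a set is used in place of a dictionary because only membership matters here, not the counts.

```python
GENOME = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"


def is_kmer_unique(sequence, k):
    """Return True if no k-mer occurs at more than one position."""
    seen = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in seen:
            return False  # stop at the first repeat
        seen.add(kmer)
    return True


def smallest_unique_k(sequence):
    """Return the smallest k for which the sequence is k-mer unique."""
    for k in range(1, len(sequence) + 1):
        if is_kmer_unique(sequence, k):
            return k
    return None  # only reached for an empty sequence


print(is_kmer_unique(GENOME, 10))
print(smallest_unique_k(GENOME))
```

Short k-mers repeat easily (there are only 64 possible 3-mers, for instance), which is why uniqueness only appears once k is large enough.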

Summary

This post should give you an idea of what a genome is from a computational perspective. Basically, on a computer we represent a genome as a string of A, C, G and T characters, and our current working example is 10-mer unique, so no run of 10 consecutive characters appears more than once.

The next post will discuss how in real life a genome can be obtained using a genome sequencer and how we work with that information on a computer.

The Computer and The Genome

Introduction

Welcome to the introductory post of a series I will be writing which looks at what a genome is (on a computer) and the computational issues associated with obtaining and publishing them. This is the main focus of my research towards my PhD and I will continue to update and expand on these posts as I learn more.

This and subsequent posts will be directed toward a non-technical audience with little background in computer science and biology, so you will be spared nuanced points and arcane vocabulary as much as possible. However, I will post bits of Python code to demonstrate points, and readers are encouraged to follow along (Python is found on basically all computers). I welcome comments, clarifications and contributions to this article. Please register to post a comment or send me an email.

This series is inspired by a short book The Computer and The Brain by mathematician John von Neumann.

John von Neumann

Tammar Genome Published … finally

Apologies for not updating the blog recently; however, I want to point out that the Tammar Genome has finally been published in Genome Biology! I contributed to the effort by developing a genome improvement algorithm to incorporate new sequencing data, and by analyzing the repeats and small RNA.

There is a Nature blog post, and an Australian news article, which give a brief overview and link to other overviews. I will be starting a series of blog posts which should highlight my work on the tammar genome and give some insight into genome assembly and scaffolding in general.

about ready to run for it

meow look at my genome

SGI Altix UV progress report

After a semester of working with the computer, I am extremely happy with it. The really astonishing thing about the machine is that, from where I sit, there is no difference between it and the desktop PC I’m using to write this post: the same code that I write or execute here will work on the machine. I get amazing performance in Bowtie, BWA and other threaded alignment tools.

Gnome view of SGI Altix

SGI Altix UV 100

HTS Analysis with SGI

Tha Beast

Uncrating and bringing into the building.

Stay tuned for a series of updates describing my experience using an SGI Altix UV 100 with 512GB RAM and 48 cores to process and analyze sequencing data generated on 454, SOLiD and Illumina platforms. Check out this picture of our lab manager and myself taking the computer for a joy ride around campus… or rather, removing it from its packaging to fit it through the building’s doors. In subsequent posts I will describe the whole process of choosing, obtaining and using this machine to support data analysis and bioinformatics algorithm development.

RepeatMasker on the cluster

Many organisms’ genomes contain a high percentage of so-called “repetitive elements”, and the study of these elements is a very active area of research. After a research group has created a draft assembly of an organism’s genome, the next step is usually to start annotating various genomic features such as genes and repetitive elements. One tool, RepeatMasker, from the Institute for Systems Biology, has emerged as a de facto standard in de novo and database-driven repeat identification and classification.

I’ve added some more code to the NextGenScripts page. One little script helps split FASTA-formatted files into smaller pieces. The other is a script to run RepeatMasker on a cluster, which greatly reduces the program’s run time. None of my code covers how to install or work with RepeatMasker, so please follow the above link to install the software first.
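For readers who want a feel for what splitting a FASTA file involves, here is a minimal sketch in Python. It is not the script from the NextGenScripts page: the function names `read_fasta` and `split_records`, the in-memory example data and the batch size of 2 are all assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, per_file):
    """Group records into lists of at most `per_file` records each."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == per_file:
            yield batch
            batch = []
    if batch:
        yield batch  # leftover records form a final, smaller batch


# A tiny made-up FASTA file held in memory.
example = """>seq1
ACGT
ACGT
>seq2
GGCC
>seq3
TTAA
""".splitlines()

batches = list(split_records(read_fasta(example), 2))
for n, batch in enumerate(batches):
    # A real script would write each batch to its own file here.
    print(n, [header for header, _ in batch])
```

Splitting by whole records keeps every sequence intact, so each piece can be handed to a separate cluster job and the results combined afterwards.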

Preferred Tools and NextGenScripts

I added a new page titled Preferred Tools which is nothing more than a list of tasks and the tools I use to accomplish them. I hope it is useful to people out there. Secondly I started adding some code to my repositories. You can check out both pages by clicking links in the right hand menu.

Download Whole Genomes Quickly from NCBI

Here is a quick command-line script that should work on any Linux system with wget. This command will download chromosomes 1-8 for the opossum (Monodelphis domestica). To get the non-numeric chromosomes, you would need to change the list of numbers to “X Y M Un”.


for i in 01 02 03 04 05 06 07 08; do wget "ftp://ftp.ncbi.nih.gov/genomes/Monodelphis_domestica/CHR_${i}/mdm_ref_chr${i}.fa.gz"; done;

You can do this for any species or really any organized FTP site. Here is the link to the NCBI genome ftp site.
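Since the other posts here use Python, the same idea can be sketched there too. This version only builds the list of URLs and does not download anything. The URL pattern is copied from the command above; the names `BASE` and `chromosome_urls` are my own, and I have not checked whether the FTP site still uses this layout.

```python
# Directory taken from the wget command above; the layout may have changed.
BASE = "ftp://ftp.ncbi.nih.gov/genomes/Monodelphis_domestica"


def chromosome_urls(labels):
    """Build one download URL per chromosome label."""
    return ["%s/CHR_%s/mdm_ref_chr%s.fa.gz" % (BASE, label, label)
            for label in labels]


# Zero-padded labels 01..08, matching the shell loop.
numeric = ["%02d" % n for n in range(1, 9)]

for url in chromosome_urls(numeric):
    print(url)
```

Passing a different list of labels, such as `["X", "Un"]`, produces the URLs for the non-numeric chromosomes in the same way.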

Welcome

Welcome to my updated portfolio. Please stay tuned as I add information about my current work, publications and even some code.