A genome on a computer
For the purpose of this and subsequent posts we will build a hypothetical genome, using very simple assumptions which we will change as needed:
- genome is only made up of the following characters A,C,G,T (base pairs)
- the genome is 10-mer unique
- it has a length of 100 base pairs
The first assumption is relatively benign and is made by many of the most complex tools used in computational genomics. The third assumption is made simply because 100 characters is a convenient amount to work with on a blog (the human genome is > 3 gigabases, that 3 with 9 zeros) .
The second assumption is slightly more advanced and is the first sighting of a very challenging aspect of working with genomes. We will have to use some jargon to discuss it; k-mer is used to describe any sequence of k bases. So therefore the second assumption simply means that there is no 10 sequential bases in the genome that are the same as any other 10.
example
Through these posts I will try to include as many visual examples as possible. Many of them will be using python, I will post all information necessary to recreate the examples on your own.
This first example will give the reader a visualisation of what a genome is on a computer. First I will define an arbitrary genome sequence (taken from here for those who are interested)
genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"
I claimed the genome was 10-mer unique and the following code should demonstrate that.
for x in range(0, len(genome) - 9):
print genome[x:x+10]
The output of that script is below and you can visually inspect it to be sure that there are no 10-mer duplicates.
AGACAGACAT GACAGACATA ACAGACATAG CAGACATAGG AGACATAGGA GACATAGGAG ACATAGGAGA CATAGGAGAT ATAGGAGATT TAGGAGATTG AGGAGATTGC GGAGATTGCT GAGATTGCTG AGATTGCTGT GATTGCTGTA ATTGCTGTAG TTGCTGTAGA TGCTGTAGAA GCTGTAGAAA CTGTAGAAAC TGTAGAAACA GTAGAAACAA TAGAAACAAA AGAAACAAAA GAAACAAAAA AAACAAAAAT AACAAAAATA ACAAAAATAT CAAAAATATA AAAAATATAC AAAATATACG AAATATACGA AATATACGAG ATATACGAGT TATACGAGTA ATACGAGTAT TACGAGTATA ACGAGTATAA CGAGTATAAT GAGTATAATA AGTATAATAT GTATAATATT TATAATATTG ATAATATTGC TAATATTGCA AATATTGCAT ATATTGCATA TATTGCATAA ATTGCATAAA TTGCATAAAT TGCATAAATT GCATAAATTA CATAAATTAG ATAAATTAGG TAAATTAGGG AAATTAGGGT AATTAGGGTG ATTAGGGTGT TTAGGGTGTG TAGGGTGTGC AGGGTGTGCA GGGTGTGCAC GGTGTGCACA GTGTGCACAA TGTGCACAAA GTGCACAAAA TGCACAAAAT GCACAAAATA CACAAAATAT ACAAAATATC CAAAATATCA AAAATATCAG AAATATCAGA AATATCAGAG ATATCAGAGA TATCAGAGAG ATCAGAGAGA TCAGAGAGAT CAGAGAGATG AGAGAGATGA GAGAGATGAG AGAGATGAGC GAGATGAGCT AGATGAGCTG GATGAGCTGG ATGAGCTGGC TGAGCTGGCA GAGCTGGCAA AGCTGGCAAC GCTGGCAACA
We can verify programatically that the example genome is indeed 10 unique by the following code;
# save the genome to a variable.
genome = "AGACAGACATAGGAGATTGCTGTAGAAACAAAAATATACGAGTATAATATTGCATAAATTAGGGTGTGCACAAAATATCAGAGAGATGAGCTGGCAACA"
# make dictionary to track kmers.
kmer_cnt = {}
# loop over every kmer.
for x in range(0, len(genome) - 9):
# get kmer.
kmer = genome[x:x+10]
# add to dictionary
if kmer not in kmer_cnt:
kmer_cnt[kmer] = 0
kmer_cnt[kmer] += 1
# count number of kmers that appeared more than once.
cnt = 0
for kmer in kmer_cnt:
if kmer_cnt[kmer] > 1:
cnt += 1
# report count.
print "there was %i repetative kmers" % cnt
Summary
This post should give you an idea of what a genome is from a computational perspective. Basically on a computer we represent a genome by a string of A,C,G,T characters, and our current working example is 10-mer unique, so no consecutive 10 characters will ever be repeated.
The next post will discuss how in real life a genome can be obtained using a genome sequencer and how we work with that information on a computer.
