Skeptics: Does the DNA of one sperm contain 37.5 MB of information?
[+45] [2] Oliver_C
[2012-09-18 10:36:01]
[ biology physiology sperm dna ]
[ https://skeptics.stackexchange.com/questions/10954/does-the-dna-of-one-sperm-contain-37-5-mb-of-information ]

sperm factoid [ Source [1]]

In an episode of the BBC show QI - Quite Interesting [2] (Series J, Episode 1) Stephen Fry said [3]:

How much information do you think is in the DNA of one little sperm...?

It's 37.5 MB...

...a normal male ejaculation, if there is such a thing, is equivalent of 15,875 GB. That's about 7500 laptops worth of information...


The show's Twitter page [4] summarized it:

A sperm has 37.5 MB of DNA info.
One ejaculation transfers 15,875 GB of data, equivalent to that held on 7,500 laptops.


(with "200 million sperm per ejaculation" [5] one would actually get 7150 TB;
but I'm more interested in where the 37.5 MB number comes from)
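
The arithmetic in that parenthetical can be checked with a short sketch. It only uses the two quoted figures (37.5 MB per sperm, 200 million sperm per ejaculation); the function name is made up for illustration, and the point is mainly that the result depends on whether binary or decimal prefixes are used.

```python
MB_PER_SPERM = 37.5              # figure quoted on QI
SPERM_PER_EJACULATION = 200e6    # figure from the linked source [5]

def total_terabytes(mb_per_sperm, sperm_count, binary=True):
    """Total data in TB, using 1024-based or 1000-based prefixes."""
    step = 1024 if binary else 1000
    total_mb = mb_per_sperm * sperm_count
    return total_mb / step / step  # MB -> GB -> TB

binary_tb = total_terabytes(MB_PER_SPERM, SPERM_PER_EJACULATION, binary=True)
decimal_tb = total_terabytes(MB_PER_SPERM, SPERM_PER_EJACULATION, binary=False)
print(f"binary prefixes:  {binary_tb:.0f} TB")
print(f"decimal prefixes: {decimal_tb:.0f} TB")
```

With binary prefixes this lands near the 7150 TB mentioned above; with decimal prefixes it is 7500 TB.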


My Question:

(15) I was very annoyed when this was broadcast, as the claim of 7500TB is clearly false. Each sperm is approximately a random shuffle of 1/2 the DNA of the parent and so 200 million selections of 1/2 the parental DNA is not going to multiply the information contained by 200 million! The 37.5MB sounds reasonable order of magnitude, exact values will depend on how you encode the information etc. I would crunch the numbers myself but would that be acceptable as an answer? - Nick
(2) Wikipedia seems to claim a figure between 2MB (haploid difference from standard reference) and 700-800MB (complete haploid genome). Not sure what set of approximations QI used to get 37.5MB. - Nick
If it's 2 bits per base pair and there are 3.2 billion base pairs, then it would be 763 MB. Makes me wonder where "37.5" comes from. - Oliver_C
@Oliver_C the entropy in DNA is less than 2 bits per base pair, ~1.75, but that only lowers it to 625MB-667MB. If you take just the differences from the human reference sequence, you can get down to 2MB. 37.5MB seems a rather odd size. - Nick
(6) "equivalent of 15,875 GB" is total BS, exact copy of same information is not extra information. It's like saying that copying "lorem ipsum" few thousand times is "equivalent to contents of the Library of Congress" - vartec
(1) It depends on the encoding used and the contents of the DNA. I can create an encoding whereby if the first bit is 1, then the DNA is my DNA, if it's 0 then the DNA follows, and in that case it would be 1 bit for my DNA and more for the others. So the answer is: it depends - Andreas Bonini
@coleopterist Dr. Frank Scali has divided the 763MB figure down to only include protein-coding DNA (which is ~1% of the genome). I don't think that is a good approach to information transfer, since the rest of the genome contains functional regulatory elements (see the ENCODE project, genome.ucsc.edu/ENCODE, which estimates 80% is functional). However, 100% of the bases are readable as information. He has also naively multiplied up by MB/sperm to get total data, which is incorrect, as the sperm all share the same parental cell DNA - Nick
This really depends upon the encoding scheme: if you are using ASCII to encode, then you are talking 8 bits per base pair, which is going to up the numbers a lot. A lot of genomics data is passed around as ASCII, so it's not unreasonable to use it for the calculation either. - rjzii
(1) I'd just like to point out that even though a lot of data is repeating in the calculation, physically speaking, it is still passed down (you cannot do such thing as compress sperm DNA information). - Zonata
(6) @RobZ: Depends if they really talk about information or data. In coding theory the information content won't be changed by a lossless encoding scheme. It just determines how much data you need to represent it. Of course, 200 million times the same messages gives you 200 million times the data size, but no more additional information. - Martin Scharrer
(1) Of course it is absolutely naïve to calculate an amount of information from the length of the DNA of a cell. Information stored in DNA does not merely depend on its sequence, but it depends on WHERE the sequence is, how the DNA is spatially folded, whether it is modified (e.g. methylated), what transcription factors and what proteins are present in that specific cell and so on. These types of DNA/computer comparisons, albeit very common, are just (pointless IMO) exercises in style. - nico
I've heard the sperm count of humans is poor compared to other animals. - Golden Cuy
(2) @Vartec - Actually it's like saying that you could copy Lorem Ipsum enough times to fill a datastore with the same volume of information as contained in the LOC. There is no claim that the data being transferred is not redundant at all. - Chad
@Chad: the quoted question is "How much information", more copies isn't the same as more information. - vartec
(1) @Vartec - the question is only about one sperm, not an entire ejaculate... however I would still say that even though it is mostly the same information over and over, you cannot know what it is until you read it, and you can read each one, so each one is in fact a unique copy of information. If you download the same 1 MB file 1024 times, it is still 1 GB of data that was downloaded. If it were a pointer to the information then I would agree. - Chad
I would not qualify the transport of a data storage unit as data transfer... - inf3rno
I can't even understand. Who pulled this 37 MB out of the air and was like, yeah, it's the equivalent... blaahhhh. How do you even convert it into computer data lol?? - user14801
I think the real question here is: why can't we use DNA to encode information? In other words, store 37MB of our own data by creating an artificial sperm or modifying an existing one. If we modify existing ones, the, uh, large supply should make, uh, "disk drives" much cheaper and make the "dick drive" pun finally real :) - user14703
[+46] [2012-09-18 16:08:36] Tor-Einar Jarnbjo [ACCEPTED]

I am not sure where these numbers come from and the answer depends on how you encode the genome data and if you define all the redundancy (unnecessary, repetitive data) as "information".

First of all, the human genome contains somewhere around 3.1 (men) to 3.2 (women) billion base pairs. Since the X chromosome is three times longer than the Y chromosome, women have a higher total genome length than men.

Source: "Human Genome Assembly Information" from the "Genome Reference Consortium" [1]

A base pair is made of two of the four nucleobases adenine, cytosine, guanine and thymine, but only the four combinations AT, TA, CG and GC are possible as the A and T nucleobases won't bond with the C and G nucleobases and vice versa. These four combinations can be encoded with two bits, so that 6.2-6.4 gigabits or about 750 megabytes are required to store an exact copy of the genome.
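
As a rough check on that figure, here is a minimal sketch of the two-bit idea. The names `genome_size_bytes`, `CODES` and `pack` are invented for this example, and the bit assignment for each base is arbitrary. It assumes one strand is stored, since the base on one strand fixes its partner on the other.

```python
BITS_PER_BASE_PAIR = 2  # four possible pairs (AT, TA, CG, GC) fit in 2 bits

# Arbitrary 2-bit code per base on one strand (an assumption of this sketch).
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def genome_size_bytes(base_pairs, bits_per_pair=BITS_PER_BASE_PAIR):
    """Bytes needed to store `base_pairs` at a fixed number of bits each."""
    return base_pairs * bits_per_pair / 8

def pack(sequence):
    """Pack a string of A/C/G/T into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a final partial chunk
        out.append(byte)
    return bytes(out)

for label, base_pairs in [("3.1 billion bp", 3.1e9), ("3.2 billion bp", 3.2e9)]:
    size = genome_size_bytes(base_pairs)
    print(f"{label}: {size / 1e6:.0f} MB (decimal), {size / 2**20:.0f} MiB (binary)")

print(len(pack("ACGTACGT")), "bytes for 8 bases")
```

A real format would also have to record the sequence length (the last byte may be padded) and handle unknown bases, which this sketch ignores.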

Now, even if you need 750 megabytes to store the "raw data" from a human genome, at least a computer scientist will have a hard time defining all of this as "information". E.g. if you record 74 minutes of complete silence on a CD, the disc contains roughly 750 megabytes of "data" as well, but actually no "information". Large parts of the human genome are repetitive, only a very small part actually differs between individuals, and of those differences, several base pair sequences occur in only a few well-defined varieties.
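
The data-versus-information distinction can be illustrated with a general-purpose compressor. This is only a sketch: it uses Python's standard `zlib` on 1 MB of sample data rather than on a real genome or a real CD image, and a compressed size is just an upper bound on information content, not a measurement of it.

```python
import os
import zlib

SIZE = 1_000_000  # 1 MB of sample data, small enough to run quickly

silence = bytes(SIZE)              # all zero bytes, like digital silence
repeated = b"ACGT" * (SIZE // 4)   # a highly repetitive sequence
random_data = os.urandom(SIZE)     # essentially incompressible

samples = [("silence", silence), ("repeated", repeated), ("random", random_data)]
for name, data in samples:
    compressed = zlib.compress(data, 9)  # 9 = strongest compression level
    print(f"{name:>8}: {len(data)} bytes -> {len(compressed)} bytes")
```

The silent and repetitive inputs shrink to a tiny fraction of their size, while the random input does not shrink at all, even though all three occupy the same amount of "data".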

There is actually some research in the field "how to store a human genome as compactly as possible", since genome databases most likely are going to expand rapidly and scientists need efficient ways to share data. Some tools are available for this purpose, e.g. DNAzip, which using a ~5 gigabyte dictionary (permanent data) can compress a human genome down to roughly 4 megabytes.

Source: "Human genomes as email attachments" [2]
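
To make the "shared dictionary plus small per-person file" idea concrete, here is a toy sketch. It is not DNAzip's actual algorithm: the function names and the tiny sequences are made up, and it only handles single-base substitutions between equal-length strings.

```python
def diff_against_reference(reference, sample):
    """List (position, base) pairs where `sample` differs from `reference`.

    Toy model: both strings must have equal length, so only substitutions
    are handled (no insertions or deletions).
    """
    if len(reference) != len(sample):
        raise ValueError("toy model needs equal-length sequences")
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]

def rebuild(reference, differences):
    """Reconstruct the sample from the shared reference plus its differences."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGT"   # stands in for the large shared dictionary
sample    = "ACGTACCTACGTACGA"   # stands in for one individual's genome
differences = diff_against_reference(reference, sample)
print(differences)
print(rebuild(reference, differences) == sample)
```

Only the short list of differences has to be stored per individual, but the receiver needs the same reference to decode it, so the reference is part of the total cost.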

[1] http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
[2] http://bioinformatics.oxfordjournals.org/content/25/2/274.full

C, A, G and T are nucleotides, not proteins. Proteins are long strings of amino acids; nucleotides are small cyclic molecules. - matt_black
@matt_black: Aren't they actually nucleobases, to be very precise? - Tor-Einar Jarnbjo
(1) @Tor-EinarJarnbjo: A, C, G and T can be used to identify both the nucleobase (for instance adenine) and the nucleoside (for instance adenosine). - nico
(8) The second number is interesting but not really an answer to the question: the information content is certainly more than 4 MB since you can’t just ignore the dictionary size. - Konrad Rudolph
The correct information content is comparable to the size of the genome, about 1Gbyte. There is only a small factor of redundant or useless information. - Ron Maimon
Speculation: 37.5MB is 5% of 750MB. Why 5%? Until recently it was believed that most of our DNA is "junk", and I often heard that 95% was junk. So whoever came up with "37.5MB" might have dismissed 95% of the 750MB as non-information. - Oliver_C
@RonMaimon No, it’s substantially less. Maybe not 37MB (I don’t remember where this number comes from but it’s frequently quoted in bioinformatics – maybe Oliver is right but I doubt it: most scientists have known quite long that “junk DNA” isn’t holding up to scrutiny). Nevertheless, DNA contains quite a few low-complexity regions and can be compressed down to at least 700 MB. - Konrad Rudolph
(2) I have to say that I’m unhappy that this is the accepted answer. The 37 MB number is in the ballpark of often-quoted numbers in bioinformatics. Whether or not it’s correct it requires some explanation, and this is entirely lacking here. Unfortunately, I can’t for the life of me remember how the number was derived. - Konrad Rudolph
[+9] [2012-10-20 18:24:52] terdon

For a simpler answer, you can just look at the size of an ASCII-encoded text file containing the human genome's information. This, of course, is not the information content of the genome which, as you can see from the answer above and the comments in this thread, is not that easy to define.

In any case, when biologists work on the genome sequence, it tends to be in the form of FASTA sequences [1]. The human genome as a multi-FASTA file is ~3 GB. See, for example, the file UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa obtained when extracting this archive [2].

Again, I stress that this is not the information content of the genome. For those of us who are not information theorists though, it gives an easy way of picturing the genome's size in a format we are familiar with: text.
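
For a back-of-the-envelope version of that file size, here is a small sketch. It assumes one ASCII byte per base, the roughly 3.1 billion bases quoted in the other answer, and a line width of 60 characters; the line width and the function name are assumptions for illustration, not taken from the actual genome.fa file.

```python
def fasta_size_bytes(bases, line_width=60, header_bytes=0):
    """Approximate size of an ASCII FASTA file: 1 byte per base plus newlines."""
    full_lines, remainder = divmod(bases, line_width)
    newlines = full_lines + (1 if remainder else 0)
    return bases + newlines + header_bytes

size = fasta_size_bytes(3_100_000_000)
print(f"{size / 1e9:.2f} GB")
```

The newline overhead is small next to the base count, so the estimate stays close to one byte per base.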

[1] http://en.wikipedia.org/wiki/FASTA_format
[2] ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/NCBI/build37.2/Homo_sapiens_NCBI_build37.2.tar.gz

A good compression algorithm should be able to compress the human genome file and produce something with minimal output size. That might give a better indication as there one would have a file containing a dictionary plus the data to expand the dictionary, which would be the (minimal?) amount of "information" needed. - Raf