Skeptics: Does the DNA of one sperm contain 37.5 MB of information?
[+21] [2] Oliver_C
[2012-09-18 10:36:01]
[ biology physiology sperm dna ]
[ ]

sperm factoid [ Source [1]]

In an episode of the BBC show QI - Quite Interesting [2] (Series J, Episode 1) Stephen Fry said [3]:

How much information do you think is in the DNA of one little sperm...?

It's 37.5 MB...

...a normal male ejaculation, if there is such a thing, is equivalent of 15,875 GB. That's about 7500 laptops worth of information...

The show's Twitter page [4] summarized it:

A sperm has 37.5 MB of DNA info.
One ejaculation transfers 15,875 GB of data, equivalent to that held on 7,500 laptops.

(with "200 million sperm per ejaculation" [5] one would actually get 7150 TB;
but I'm more interested in where the 37.5 MB number comes from)
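
As a rough check of the arithmetic above, here is a minimal Python sketch. It assumes the two quoted figures (37.5 MB per sperm, 200 million sperm) and binary units (1 TB = 1024**2 MB), which is what the 7150 TB figure appears to use; the variable names are only illustrative.

```python
MB_PER_SPERM = 37.5                    # figure quoted on QI
SPERM_PER_EJACULATION = 200_000_000    # figure from the linked source [5]

# Naive multiplication: data volume, not unique information
total_mb = MB_PER_SPERM * SPERM_PER_EJACULATION
total_tb_decimal = total_mb / 1000**2  # 1 TB = 10**6 MB
total_tb_binary = total_mb / 1024**2   # 1 TB = 2**20 MB

# Sperm count implied by QI's own 15,875 GB total (assuming 1 GB = 1024 MB)
implied_sperm = 15_875 * 1024 / MB_PER_SPERM

print(f"{total_tb_decimal:.0f} TB (decimal), {total_tb_binary:.0f} TB (binary)")
print(f"QI's total implies about {implied_sperm:,.0f} sperm")
```

Under these assumptions, QI's 15,875 GB total corresponds to well under half a million sperm rather than 200 million.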

My Question:

(9) I was very annoyed when this was broadcast, as the claim of 7500 TB is clearly false. Each sperm is approximately a random shuffle of 1/2 the DNA of the parent, so 200 million selections of 1/2 the parental DNA are not going to multiply the information contained by 200 million! The 37.5 MB sounds like a reasonable order of magnitude; exact values will depend on how you encode the information, etc. I would crunch the numbers myself, but would that be acceptable as an answer? - Nick
(2) Wikipedia seems to claim a figure between 2MB (haploid difference from standard reference) and 700-800MB (complete haploid genome). Not sure what set of approximations QI used to get 37.5MB. - Nick
If it's 2 bits per base pair and there are 3.2 billion base pairs, then it would be 763 MB. Makes me wonder where "37.5" comes from. - Oliver_C
@Oliver_C the entropy in DNA is less than 2 bits per base pair, ~1.75, but that only lowers it to 625-667 MB. If you take just the differences from the human reference sequence, you can get down to 2 MB. 37.5 MB seems a rather odd size. - Nick
(3) "equivalent of 15,875 GB" is total BS, exact copy of same information is not extra information. It's like saying that copying "lorem ipsum" few thousand times is "equivalent to contents of the Library of Congress" - vartec
(1) It depends on the encoding used and the contents of the DNA. I can create an encoding whereby if the first bit is 1, then the DNA is my DNA, if it's 0 then the DNA follows, and in that case it would be 1 bit for my DNA and more for the others. So the answer is: it depends - Andreas Bonini
A Dr.Frank Scali has attempted to debunk this. His figures are ~9.16MB/sperm and ~3000TB/orgasm. (P.S. Oliver, please improve your accept rate if possible.) - coleopterist
@coleopterist Dr. Frank Scali has divided the 763 MB figure down to include only protein-coding DNA (which is ~1% of the genome). I don't think that is a good approach to information transfer, since the rest of the genome contains functional regulatory elements (see the ENCODE project, which estimates 80% is functional). However, 100% of the bases are readable as information. He has also naively multiplied up by MB/sperm to get total data, which is incorrect, as the sperm all share the same parental cell DNA. - Nick
This really depends upon the encoding scheme. If you are using ASCII to encode, then you are talking 8 bits per base pair, which is going to up the numbers a lot. A lot of genomics data is passed around as ASCII, so it's not unreasonable to use it for the calculation either. - rob
(1) I'd just like to point out that even though a lot of data is repeated in the calculation, physically speaking, it is still passed down (there is no such thing as compressing sperm DNA information). - Zonata
(5) @RobZ: Depends if they really talk about information or data. In coding theory the information content won't be changed by a lossless encoding scheme. It just determines how much data you need to represent it. Of course, 200 million times the same messages gives you 200 million times the data size, but no more additional information. - Martin Scharrer
(1) Of course it is absolutely naïve to calculate an amount of information from the length of the DNA of a cell. Information stored in DNA does not merely depend on its sequence; it depends on WHERE the sequence is, how the DNA is spatially folded, whether it is modified (e.g. methylated), what transcription factors and what proteins are present in that specific cell, and so on. These types of DNA/computer comparisons, albeit very common, are just (pointless IMO) exercises in style. - nico
I've heard the sperm count of humans is poor compared to other animals. - Andrew Grimm
@Vartec - Actually it's like saying that you could copy Lorem Ipsum enough times to fill a datastore with the same volume of information as contained in the LOC. There is no claim that the data being transferred is not redundant at all. - Chad
@Chad: the quoted question is "How much information", more copies isn't the same as more information. - vartec
(1) @Vartec - the question is only about one sperm, not an entire ejaculate... however I would still say that even though it is mostly the same information over and over, you cannot know what it is until you read it, and you can read each one, so each one is in fact a unique copy of information. If you download the same 1 MB file 1024 times, it is still 1 GB of data that was downloaded. If it were a pointer to the information, then I would agree. - Chad
I would not qualify the transport of a data storage unit as data transfer... - inf3rno
I can't even understand. Who pulled this 37 MB out of the air and was like, yeah, it's the equivalent... blaahhhh. How do you even convert it into computer data lol?? - user14801
[+25] [2012-09-18 16:08:36] Tor-Einar Jarnbjo [ACCEPTED]

I am not sure where these numbers come from and the answer depends on how you encode the genome data and if you define all the redundancy (unnecessary, repetitive data) as "information".

First of all, the human genome contains somewhere around 3.1 (men) to 3.2 (women) billion base pairs. Since the X chromosome is about three times longer than the Y chromosome, women have a higher total genome length than men.

Source: "Human Genome Assembly Information" from the "Genome Reference Consortium" [1]

A base pair is made of two of the four nucleobases adenine, cytosine, guanine and thymine, but only the four combinations AT, TA, CG and GC are possible as the A and T nucleobases won't bond with the C and G nucleobases and vice versa. These four combinations can be encoded with two bits, so that 6.2-6.4 gigabits or about 750 megabytes are required to store an exact copy of the genome.
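
The two-bit estimate above can be sketched in a few lines of Python. This is only an illustration: the bit codes assigned to A, C, G and T are an arbitrary assumption (one base per position is enough, since its partner on the other strand is implied), and megabytes are taken as 1024**2 bytes.

```python
# Arbitrary 2-bit code per base; the complementary base is implied
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    """Pack a base sequence into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return bytes(out)

def genome_size_mb(base_pairs: float, bits_per_base_pair: float = 2) -> float:
    """Raw storage in megabytes (1 MB = 1024**2 bytes)."""
    return base_pairs * bits_per_base_pair / 8 / 1024**2

low = genome_size_mb(3.1e9)
high = genome_size_mb(3.2e9)
print(f"{low:.0f} to {high:.0f} MB")
```

Passing a lower `bits_per_base_pair` (for example an entropy estimate) gives the correspondingly smaller figures discussed in the comments.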

Now, even if you need 750 megabytes to store the "raw data" from a human genome, at least a computer scientist will have a hard time defining all of this as "information". E.g. if you record 74 minutes of complete silence on a CD, the disc contains roughly 750 megabytes of "data" as well, but actually no "information". Large parts of the human genome are repetitive, only a very small part actually differs between individuals, and of those differences, many base pair sequences occur in only a few well-defined varieties.

There is actually some research in the field of "how to store a human genome as compactly as possible", since genome databases are most likely going to expand rapidly and scientists need efficient ways to share data. Some tools are available for this purpose, e.g. DNAzip, which, using a ~5 gigabyte dictionary (permanent data), can compress a human genome down to roughly 4 megabytes.

Source: "Human genomes as email attachments" [2]
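
To illustrate the general idea of storing a genome as differences from a shared reference, here is a minimal Python sketch. It is not DNAzip's actual algorithm: it assumes both sequences have the same length and handles only single-base substitutions (no insertions or deletions), and the function names are hypothetical.

```python
def diff_against_reference(reference: str, individual: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where the individual differs from the reference."""
    if len(reference) != len(individual):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (pos, base)
        for pos, (ref_base, base) in enumerate(zip(reference, individual))
        if ref_base != base
    ]

def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the individual's sequence from the reference plus the diff."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTAC"
individual = "ACGTTCGTAA"
diff = diff_against_reference(reference, individual)
restored = apply_diff(reference, diff)
print(diff, restored == individual)
```

The diff is small only because both sides already hold the reference, which plays the role of the dictionary here.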


C, A, G and T are nucleotides, not proteins. Proteins are long strings of amino acids; nucleotides are small cyclic molecules. - matt_black
@matt_black: Aren't they actually nucleobases, to be very precise? - Tor-Einar Jarnbjo
yes, indeed. Thanks for the correction. - matt_black
(1) @Tor-EinarJarnbjo: A, C, G and T can be used to identify both the nucleobase (for instance adenine) and the nucleoside (for instance adenosine). - nico
(6) The second number is interesting but not really an answer to the question: the information content is certainly more than 4 MB since you can’t just ignore the dictionary size. - Konrad Rudolph
The correct information content is comparable to the size of the genome, about 1Gbyte. There is only a small factor of redundant or useless information. - Ron Maimon
Speculation: 37.5MB is 5% of 750MB. Why 5%? Until recently it was believed that most of our DNA is "junk", and I often heard that 95% was junk. So whoever came up with "37.5MB" might have dismissed 95% of the 750MB as non-information. - Oliver_C
@RonMaimon No, it’s substantially less. Maybe not 37MB (I don’t remember where this number comes from but it’s frequently quoted in bioinformatics – maybe Oliver is right but I doubt it: most scientists have known quite long that “junk DNA” isn’t holding up to scrutiny). Nevertheless, DNA contains quite a few low-complexity regions and can be compressed down to at least 700 MB. - Konrad Rudolph
I have to say that I’m unhappy that this is the accepted answer. The 37 MB number is in the ballpark of often-quoted numbers in bioinformatics. Whether or not it’s correct it requires some explanation, and this is entirely lacking here. Unfortunately, I can’t for the life of me remember how the number was derived. - Konrad Rudolph
@KonradRudolph: It's derived by compressing the genes, using a model where the non-coding RNA is junk. This is known to be nonsense today. The information content is within about a factor of 4 of 1 gigabyte for sure, nearly all the RNA is coding and active in regulation (through RNA networks), and it is not very compressible without a huge fixed-length dictionary which should count as part of the data. Not all bioinformaticists understand this yet, although most biologists would agree. It was proposed by John Mattick in 2000-2002. - Ron Maimon
(2) @Ron “Not all bioinformaticists understand this yet, although most biologists would agree” – you must be kidding. Or you really don’t know. If anything, it’s the other way round. And uses of “junk DNA” were known long before 2000. - Konrad Rudolph
@KonradRudolph: The bioinformaticists don't understand the magic of nongenetic information--- they focus on genes. The biologists are more open minded. The uses known before 2000 were generally of a form that could allow massive compression, only Mattick suggested the RNA is computing with a 1 gigabyte RAM (or at least, this is how I phrase it, the RAM business is probably due to me, but Mattick called it "Network complexity" and predicted many fragile double-binding interactions of noncoding RNA with pure computational consequences, which is essential for an RNA brain). - Ron Maimon
(3) @Ron With all due respect, you haven’t got the faintest clue what you’re talking about. You also seem to think that bioinformaticians and biologists aren’t talking to each other, or that biologists are intentionally keeping information from bioinformaticians or that the latter are phenomenally stupid. - Konrad Rudolph
@Ron Incidentally, you are most probably right about where the 37 MB number comes from (even though I initially said something else further up). If you had a source for that information it would make an excellent answer here … - Konrad Rudolph
@KonradRudolph: I don't use sources, I use my brain. If you like source over brain, you won't like what I have to say. Regarding the bioinformaticians, I am one now (I got a job two weeks ago), and generally, the bioinformatics literature is clueless about the gigabyte-size RNA computer in the nucleus. The biologists generally don't recognize it either, but they at least have a name for it now: "heterogeneous nuclear RNA". This can be easily seen to be the "intelligent designer" inside the cell, and it can easily be predicted without observation to make an enormous computer, as Mattick did. - Ron Maimon
(4) @Ron It’s really hard to gauge your level of knowledge here but you are either unaware of, or for some reason not mentioning, DNA regulation. Which is incredibly well known and studied – sometimes under the fancy name “regulome”. But whatever the name, the study of transcription regulation has been around forever. The study of regulation via (and of) small RNAs is more recent but also very well established. Nothing of this is arcane. - Konrad Rudolph
(2) @RonMaimon Your statement that bioinformatics doesn't deal with non-coding RNA is just plain wrong. Bioinformatics has been used for quite a while now to find regulatory RNA, e.g. riboswitches or miRNAs. The importance of RNA in regulation is well known, despite all those articles in the mainstream press that rediscover that "junk" DNA isn't junk every year. - Fabian
@KonradRudolph: I know the little noncoding things like miRNA and the "regulome", these are a tiny tip of the iceberg. I am certain by now that the function of the transcribed non-coding RNA (about 80-90% of the genome) is to self-double-bind and self-splice in a sequence dependent manner to make a computer with about 100Megs of data per cell, more in neurons, and terabytes in egg cells, and this size computation doesn't come from the known regulatory motifs, so it is a prediction of the theory about new regulation. It requires noncoding RNA to double bind and rewrite itself constantly. - Ron Maimon
@Fabian: The bioinformatics only deals with a small fraction of the noncoding RNA honestly. e.g., people think HERV were originally exogenous, the unique motifs of primates are not given their proper function, and generally, bioinformatics, aside from John Mattick, is clueless that there is a big RNA brain in the nucleus, directing everything. This is not the statement that there are regulatory elements, but a statement of closed loop computation with megabytes of RAM. This is a prediction nobody else made (maybe Mattick), and it could have been made (and was) with no direct data. - Ron Maimon
@Fabian: The only people I found who honestly recognized the gap between the biologist's dogmatic fictions and the actual information content and behavior of cells were the intelligent design advocates, particularly Behe. They were right about intelligence (at least if you identify computation with intelligence, as I do), but it is possible now to see that the intelligent genome designer is not some supernatural agent, but ordinary RNA. In the absence of RNA computing, biology is nonsense. - Ron Maimon
(2) @RonMaimon Dry or wet lab, we're all biologists! We read many of the same papers. Anyway, your claim that ncRNA is unknown to bioinformaticians is simply wrong. While you correctly cite John Mattick, you should realize that in 2000-2 he had no real evidence, just a few interesting examples and a VERY elegant theory which was actually expressed in terms of network theory. Much closer, in fact, to bioinformatics than traditional biology. Or look at the ENCODE papers. Most 1st and last authors were bioinformaticians, not wet-lab biologists. - terdon
@terdon: Mattick gets it, and he's a theorist, I agree, but he's ahead of everyone else. I read the ENCODE papers, as far as I can tell these guys don't get it, and further they seem to be obsessed with encrypting their data formats into unreadable binary (like bigbed) to the extent that you are stuck with their ridiculous proprietary software and suboptimal genome browser. Mattick's "network theory" is just a primitive way of saying computation, and in computational terms, the paradox of protein memory-less-ness is starker. I wrote a paper on this, but didn't put it up after reading Mattick. - Ron Maimon
@KonradRudolph I agree with your comment re dictionary size. For example, the dictionary could contain the data to encode a typical/average/nominal human, and the "4 megabytes" would therefore be simply the difference between an actual individual human compared to the average human. - ChrisW
A 'reductio ad absurdum' would be to say, "There are 7 billion genetically-unique humans in existence, therefore we can encode the human genome in 33 bits: because 33 bits is enough to uniquely identify each of 8,589,934,592 people." - ChrisW
[+3] [2012-10-20 18:24:52] terdon

For a simpler answer, you can just look at the size of an ASCII-encoded text file containing the human genome's information. This, of course, is not the information content of the genome, which, as you can see from the answer above and the comments in this thread, is not that easy to define.

In any case, when biologists work on the genome sequence, it tends to be in the form of FASTA sequences [1]. The human genome as a multi-FASTA file is ~3 GB. See, for example, the file UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa obtained when extracting this archive [2].

Again, I stress that this is not the information content of the genome. For those of us who are not information theorists though, it gives an easy way of picturing the genome's size in a format we are familiar with: text.
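
As a back-of-the-envelope check of that text-file size, here is a minimal Python sketch. It assumes one ASCII byte per base plus one newline per line of sequence; the 60-character line width and the 3.1 billion base count are assumptions for illustration (header lines are ignored), not properties read from the file mentioned above.

```python
def fasta_size_bytes(bases: int, line_width: int = 60) -> int:
    """Approximate size of a FASTA text file: one byte per base plus newlines."""
    newlines = -(-bases // line_width)  # ceiling division
    return bases + newlines

size_gb = fasta_size_bytes(3_100_000_000) / 1e9
print(f"about {size_gb:.2f} GB as plain text")
```

This is roughly four times the two-bit figure, simply because ASCII spends 8 bits on each base.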