Question

sperm factoid (Source ^[1])

In an episode of the BBC show QI - Quite Interesting ^[2] (Series J, Episode 1) Stephen Fry said ^[3]:

How much information do you think is in the DNA of one little sperm...?

It's 37.5 MB...

...a normal male ejaculation, if there is such a thing, is equivalent of 15,875 GB. That's about 7500 laptops worth of information...

The show's Twitter page ^[4] summarized it:

A sperm has 37.5 MB of DNA info.
One ejaculation transfers 15,875 GB of data, equivalent to that held on 7,500 laptops.

(with "200 million sperm per ejaculation" ^[5] one would actually get 7150 TB;
but I'm more interested in where the 37.5 MB number comes from)
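The arithmetic behind that parenthetical can be checked with a short sketch. It uses only the figures quoted above; the helper names are made up for illustration, and whether QI meant decimal (1000-based) or binary (1024-based) units is an assumption, so both are shown.

```python
# Figures quoted in the question.
MB_PER_SPERM = 37.5
SPERM_PER_EJACULATION = 200_000_000
QI_TOTAL_GB = 15_875

def total_tb(sperm_count: int, mb_per_sperm: float, base: int) -> float:
    """Total data in TB, where `base` (1000 or 1024) is the MB->GB->TB step."""
    return sperm_count * mb_per_sperm / base / base

def implied_sperm_count(total_gb: float, mb_per_sperm: float, base: int) -> float:
    """How many sperm the quoted total would correspond to."""
    return total_gb * base / mb_per_sperm

print(total_tb(SPERM_PER_EJACULATION, MB_PER_SPERM, 1000))            # 7500.0
print(round(total_tb(SPERM_PER_EJACULATION, MB_PER_SPERM, 1024), 1))  # 7152.6
print(round(implied_sperm_count(QI_TOTAL_GB, MB_PER_SPERM, 1000)))    # 423333
print(round(implied_sperm_count(QI_TOTAL_GB, MB_PER_SPERM, 1024)))    # 433493
```

So the "7150 TB" figure corresponds to binary units, and the 15,875 GB quoted on the show would match roughly 0.42 to 0.43 million sperm rather than 200 million.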

My Question:

Does the DNA of one sperm contain 37.5 MB of information?

[1] http://twaggies.com/2012/08/no-865-qikipedia/
[2] http://www.qi.com/
[3] http://www.youtube.com/watch?v=Zcc7h0GAAas
[4] http://twitter.com/qikipedia/status/224269273730789376
[5] http://www2.oakland.edu/biology/lindemann/spermfacts.htm

Answer 1

I am not sure where these numbers come from, and the answer depends on how you encode the genome data and on whether you count all the redundancy (unnecessary, repetitive data) as "information".

First of all, the human genome contains somewhere around 3.1 (men) to 3.2 (women) billion base pairs. Since the X chromosome is about three times as long as the Y chromosome, women have a higher total genome length than men.

Source: "Human Genome Assembly Information" from the "Genome Reference Consortium" ^[1]

A base pair is made of two of the four nucleobases adenine, cytosine, guanine and thymine, but only the four combinations AT, TA, CG and GC are possible as the A and T nucleobases won't bond with the C and G nucleobases and vice versa. These four combinations can be encoded with two bits, so that 6.2-6.4 gigabits or about 750 megabytes are required to store an exact copy of the genome.
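As a minimal sketch of the two-bit encoding described above: because each base pair is determined by the base on one strand, it is enough to store one letter per position. The bit assignment and the function names below are arbitrary choices for illustration, not a standard format.

```python
# Arbitrary 2-bit code for the base on one strand; its partner is implied.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    """Pack bases four to a byte, two bits each; the last byte is zero padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out[:n])

def packed_megabytes(base_pairs: float) -> float:
    """Storage in decimal megabytes at 2 bits per base pair."""
    return base_pairs * 2 / 8 / 1e6

assert unpack(pack("GATTACA"), 7) == "GATTACA"
print(packed_megabytes(3.1e9), packed_megabytes(3.2e9))  # 775.0 800.0
```

In decimal megabytes this gives 775 to 800 MB; in binary units (MiB) it is closer to 740 to 760, which is consistent with the "about 750" above.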

Now, even if you need 750 megabytes to store the "raw data" from a human genome, at least a computer scientist will have a hard time defining all of this as "information". E.g. if you record 74 minutes of complete silence on a CD, the disc contains roughly 750 megabytes of "data" as well, but actually no "information". Large parts of the human genome are repetitive, only a very small part actually differs between individuals, and among those differences, several base pair sequences occur in only a few well-defined variants.
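The data-versus-information distinction can be made concrete with a general-purpose compressor. This is only a rough illustration: zlib's output size is a crude upper bound on information content, not a measurement of it, and the 1 MB buffers are stand-ins chosen for speed.

```python
import os
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller = more redundant)."""
    return len(zlib.compress(data, 9)) / len(data)

silence = bytes(1_000_000)            # all zero bytes, like digital silence
repeats = b"ACGT" * 250_000           # a perfectly repetitive "sequence"
noise = os.urandom(1_000_000)         # random bytes, essentially incompressible

for name, data in [("silence", silence), ("repeats", repeats), ("noise", noise)]:
    print(name, round(compressed_ratio(data), 4))
```

The silent and repetitive buffers shrink to a tiny fraction of their size, while the random buffer does not shrink at all, even though all three occupy the same amount of "data".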

There is actually some research on how to store a human genome as compactly as possible, since genome databases are most likely going to expand rapidly and scientists need efficient ways to share data. Some tools are available for this purpose, e.g. DNAzip, which, using a ~5 gigabyte dictionary (permanent data), can compress a human genome down to roughly 4 megabytes.

Source: "Human genomes as email attachments" ^[2]
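The general idea of compressing against shared permanent data can be sketched as storing only the positions where one genome differs from a reference that both sender and receiver already have. This is a toy illustration of that idea, not DNAzip's actual algorithm: it handles only single-base substitutions between equal-length strings, and the function names are made up.

```python
from typing import List, Tuple

def diff_against_reference(reference: str, genome: str) -> List[Tuple[int, str]]:
    """List (position, base) pairs where genome differs from the reference."""
    if len(reference) != len(genome):
        raise ValueError("this toy sketch only handles substitutions")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]

def rebuild(reference: str, variants: List[Tuple[int, str]]) -> str:
    """Reconstruct the genome from the shared reference plus the variants."""
    bases = list(reference)
    for pos, base in variants:
        bases[pos] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGT"
genome = "ACGTACCTACGTACGA"
variants = diff_against_reference(reference, genome)
print(variants)  # [(6, 'C'), (15, 'A')]
assert rebuild(reference, variants) == genome
```

Only the variant list has to be transmitted, which is why the result can be so much smaller than the raw sequence when individuals differ in a small fraction of positions.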

[1] http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
[2] http://bioinformatics.oxfordjournals.org/content/25/2/274.full

Answer 2

For a simpler answer, you can just look at the size of an ASCII-encoded text file containing the human genome's information. This, of course, is not the information content of the genome which, as you can see from the answer above and the comments in this thread, is not that easy to define.

In any case, when biologists work on the genome sequence, it tends to be in the form of FASTA sequences ^[1]. The human genome as a multi-FASTA file is ~3 GB. See, for example, the file UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa obtained when extracting this archive ^[2].
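The ~3 GB figure follows from one ASCII byte per base plus a little overhead for header lines and line breaks. A small sketch of that estimate, assuming a 60-character line width and a single record (both assumptions; real files use various widths and one record per chromosome):

```python
import math

def format_fasta(header: str, seq: str, line_width: int = 60) -> str:
    """Render one FASTA record: a header line, then wrapped sequence lines."""
    lines = [header] + [seq[i:i + line_width] for i in range(0, len(seq), line_width)]
    return "\n".join(lines) + "\n"

def fasta_bytes(seq_length: int, line_width: int = 60, header: str = ">chr") -> int:
    """Predicted size in bytes of one ASCII FASTA record."""
    newlines = math.ceil(seq_length / line_width)
    return len(header) + 1 + seq_length + newlines

# The prediction matches an actually formatted record.
record = format_fasta(">x", "ACGT" * 30)
assert len(record.encode("ascii")) == fasta_bytes(120, 60, ">x")

print(fasta_bytes(3_100_000_000) / 1e9)  # roughly 3.15 (GB)
```

The estimate is about four times the two-bit figure, since each base takes a full byte instead of two bits.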

Again, I stress that this is not the information content of the genome. For those of us who are not information theorists though, it gives an easy way of picturing the genome's size in a format we are familiar with: text.

[1] http://en.wikipedia.org/wiki/FASTA_format
[2] ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/NCBI/build37.2/Homo_sapiens_NCBI_build37.2.tar.gz