Is the Scientific Discovery of DNA Fingerprint by Chance or by Design?

DNA fingerprinting is a microbiological technique widely used to find a DNA sequence specific to a microbe. It involves slicing the genome of the microbe into DNA fragments of manageable size, sorting the DNA pieces by length, and finally identifying a DNA sequence unique to the microbe using probe-based assays. This unique DNA sequence is referred to as the DNA fingerprint of the microbe under study. In this paper, we introduce a probabilistic model to estimate the chance of identifying the DNA fingerprint from the genome of a microbe when the DNA fingerprinting method is employed. We derive a closed-form functional relationship between the chance of finding the fingerprint and factors that can be experimentally controlled fully, in part, or not at all. Because the odds of finding a specific DNA fingerprint can be improved by experimental design only to a certain degree, we show that, in a broader sense, the discovery of a DNA fingerprint is a process governed more by chance than by design. Nevertheless, the results can potentially be used to guide experiments in maximizing the chance of finding a DNA fingerprint of interest.


Introduction
In recent years, the application of polymerase chain reaction (PCR) fingerprinting assays has become more common in the accurate and rapid identification of microorganisms (Ben-Ezra, J., Johnson, D.A., Rossi, J., Cook, N., and Wu, A. (1991)). This DNA probe-based technology allows for both discrimination between species and differentiation of isolates belonging to a single species. It is based on either direct amplification of a DNA sequence specific to a microorganism (Belkum, A. (1994)) or generation of a highly reproducible amplified genomic pattern (Sobral, B.W.S. and Honeycutt, R.J. (1993)), which can thus be used as a fingerprint for the species. The development of the former DNA amplification method requires a unique fingerprint of the microorganism.
Several techniques have been developed in the past decade to facilitate the discovery of DNA fingerprints (Belkum, A. (1994)). The methods typically involve slicing a large number of copies of the species genome into small pieces using a site-specific enzyme. The DNA fragments are then sorted according to their base length using gel electrophoresis. Subsequently, a few size classes are selected and subjected to PCR amplification using primers specifically designed from knowledge of the species genome. If the PCR method results in replication of a DNA sequence that can be proven to be specific to the genome, the sequence is deemed to be a fingerprint of the microorganism.
The discovery of a DNA fingerprint is a laborious process. It is affected by experimental conditions such as the efficiency of the restriction enzyme, the number of copies of the species genome used in the experiment, and the lengths of both the genome and the DNA fingerprint. In this paper, we introduce a probabilistic model to estimate the chance of identifying a specific DNA sequence of any given length from the genome of a microbe when the DNA fingerprinting method is employed. We establish a functional relationship between the probability of finding a specific DNA sequence, the maximum number of fragments into which the DNA sequence can be sliced by the restriction enzyme, the cutting efficiency of the restriction enzyme, and the number of copies of the microbe genome used in the fingerprinting experiment. It is shown that the chance of discovering a DNA fingerprint can be greatly improved if the enzyme cutting efficiency can be experimentally controlled within a certain range. The model can potentially be used to guide experiments in maximizing the chance of finding a DNA fingerprint of interest. It can also be used to assess the reproducibility of a specific DNA fingerprint discovery. Because the results developed in the paper also imply that the odds of finding a DNA fingerprint can be improved by experimental design only to a certain degree, we show that, in a broader sense, the discovery of a specific DNA fingerprint of a microbe is governed more by chance than by design.

Definitions
To facilitate our discussion, we first introduce a few concepts concerning DNA and DNA fingerprinting. DNA is a chemical structure in the chromosomes of living organisms that carries genetic information. It takes the form of a double helix with two strands of genetic material spiraled around each other. Each strand consists of a sequence of four bases, adenine (A), thymine (T), guanine (G) and cytosine (C), known as nucleotides. The two strands of DNA are chemically bound at each base. The base A will only bond with T, and G with C. In the literature, a DNA sequence is usually described as follows:

C-T-T-A-G-A-C-A-T-A-T
G-A-A-T-C-T-G-T-A-T-A
DNA strands are read in a particular direction, from the top to the bottom ends. The two ends are referred to as the 5′ (five prime) and 3′ (three prime) ends, respectively. To include the directional information of a DNA sequence, the above sequence is often expressed as
5′-C-T-T-A-G-A-C-A-T-A-T-3′
3′-G-A-A-T-C-T-G-T-A-T-A-5′
In this paper, we use the notation $C_1 C_2 \ldots C_n$ to denote a DNA sequence of n paired bases, with each $C_i$ being one of the pairs A-T, G-C, T-A or C-G. The genome of a microbe is the entire DNA sequence in the chromosomes of the microorganism cell, which includes all of its genetic information. The following definitions are also necessary for the development of our method.

Definition 1. (DNA Fingerprint).
A sequence of paired nucleotides that is unique to the DNA of a microbe. In this paper, we use Ω to denote a DNA fingerprint.

Definition 2. (Restriction Enzyme).
A chemical compound that locates a specific sequence on a DNA and cuts the molecule at that point.

Definition 3. (Restriction Site).
A specific sequence on a DNA molecule at which a restriction enzyme cuts the DNA.

Definition 4. (Polymerase Chain Reaction [PCR]).
A technique for rapidly multiplying certain segments of DNA; it can produce a million- or billion-fold increase in DNA material within hours.

Definition 5. (Partial Digestion).
A collection of DNA fragments generated by cutting the DNA sequence of a microbe genome at specific sites using a restriction enzyme. The cleaving sites, formally called restriction sites, are locations on the DNA where a specific short DNA sequence resides. The word "partial" reflects the fact that a DNA sequence, in a given period of reaction time, might not be completely fragmented at all cutting sites.

Definition 6. (Full Digestion).
A collection of DNA fragments of a DNA sequence which is completely fragmented at all restriction sites by a restriction enzyme.
For the rest of the paper, we use the notations Φ, Ω, c and R to denote the entire DNA sequence of a microorganism, DNA fingerprint of the microbe genome, restriction site and enzyme that cuts the restriction site, respectively.

Modeling of DNA fingerprinting process
The scientific process that leads to the discovery of a DNA fingerprint usually involves the following steps: (1) Isolating the genomic DNA of the microorganism of interest; (2) Cutting the DNA into manageable pieces of different sizes, using a restriction enzyme; (3) Sorting the DNA pieces by size. The process by which this size separation, or "size fractionation," is done is called gel electrophoresis; (4) Selecting a few sorted DNA pieces and amplifying the segments by PCR with a specially designed primer that binds to a particular sequence of DNA; (5) Amplifying the particular sequence. If this sequence turns out to be specific to the microorganism genome, it can serve as a fingerprint of the microorganism.
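Steps (2) and (3) can be illustrated with a toy simulation. The sketch below is only an illustration under simplifying assumptions that are not part of the original text: the genome is a plain string, the hypothetical helper `digest` cuts immediately after each non-overlapping occurrence of the restriction site, and each site is cut independently with probability p.

```python
import random
from typing import List


def digest(sequence: str, site: str, p: float, rng: random.Random) -> List[str]:
    """Toy digestion: cut right after each occurrence of `site` with probability p.

    p = 1.0 gives a full digestion; 0 < p < 1 gives a partial digestion.
    Real enzymes cut at enzyme-specific positions; cutting after the site
    is an assumption made only to keep the example short.
    """
    if not site:
        raise ValueError("restriction site must be a non-empty string")
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        end = pos + len(site)
        if rng.random() < p:          # this restriction site is cut
            fragments.append(sequence[start:end])
            start = end
        pos = sequence.find(site, end)  # next non-overlapping occurrence
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


if __name__ == "__main__":
    rng = random.Random(1)
    genome = "AAGATCTTGATCCCGATCGG"   # made-up sequence for illustration
    pieces = digest(genome, "GATC", 0.5, rng)
    # Step (3): sort the fragments by length, as gel electrophoresis would.
    print(sorted(pieces, key=len))
```

Whatever the value of p, concatenating the fragments in order recovers the original sequence, which is a convenient sanity check on the simulation.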
In the following, we express the DNA sequence of a microbe as
$$\Phi = B_1\, c\, B_2\, c\, \cdots\, c\, B_n, \tag{2.1}$$
where c is a restriction site on Φ at which the restriction enzyme R slices Φ. The subsequences $B_i$, $1 < i < n$, do not contain c, while $B_1$ and $B_n$ may contain one c at the 5′ and 3′ ends, respectively. In a full digestion, Φ is cut into n pieces at all of its restriction sites c. Let Ω be the fingerprint of Φ, a substring that is unique to Φ. Without loss of generality, we assume that Ω takes the form
$$\Omega = c\, B_\ell\, c\, B_{\ell+1}\, c\, \cdots\, c\, B_{\ell+m-1}\, c, \tag{2.2}$$
where $\ell > 1$ and $\ell + m - 1 < n$. That is, the fingerprint Ω contains m + 1 restriction sites c, with one c between $B_{\ell-1}$ and $B_\ell$ and another c between $B_{\ell+m-1}$ and $B_{\ell+m}$. In addition, there are m − 1 sites c in between $B_\ell$ and $B_{\ell+m-1}$. In a full digestion of the sequence, all of these sites are cut, breaking Ω into m pieces. We refer to these sites as $c_0, c_1, \ldots, c_m$. Define $X_i$, $i = 0, 1, \ldots, m$, as random variables that take the value 0 or 1, with $P[X_i = 1] = P[\text{the restriction site } c_i \text{ is cut by the restriction enzyme}] = p$.
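The probability p represents the cutting efficiency of the enzyme. It is reasonable to assume that all $X_i$ are independent. Therefore these m + 1 variables $X_i$ are independent and identically distributed (iid) according to a Bernoulli distribution. For a partial digestion of Φ to contain the fingerprint Ω, the two sites flanking Ω must be cut while the m − 1 interior sites remain uncut, that is, we need
$$X_0 = X_m = 1 \quad\text{and}\quad X_1 = X_2 = \cdots = X_{m-1} = 0. \tag{2.3}$$
The probability of this event is
$$P[X_0 = 1,\ X_1 = 0,\ \ldots,\ X_{m-1} = 0,\ X_m = 1] = p^2 (1-p)^{m-1}. \tag{2.4}$$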
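As a sanity check on this model, the single-copy probability can be compared against a Monte Carlo simulation of the m + 1 independent Bernoulli cuts. The sketch below assumes that the fingerprint is recovered intact exactly when the two flanking sites $c_0$ and $c_m$ are cut and the m − 1 interior sites are not, which gives the closed form $p^2(1-p)^{m-1}$; the function names and the parameter values in the demo are illustrative choices, not taken from the original text.

```python
import random


def intact_fingerprint_probability(p: float, m: int) -> float:
    """Closed form: both flanking sites cut, the m - 1 interior sites uncut."""
    return p ** 2 * (1.0 - p) ** (m - 1)


def simulate_intact_fraction(p: float, m: int, trials: int, seed: int = 0) -> float:
    """Fraction of simulated genome copies that release the fingerprint intact."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # cut[i] is True when restriction site c_i (i = 0, ..., m) is cut.
        cut = [rng.random() < p for _ in range(m + 1)]
        if cut[0] and cut[m] and not any(cut[1:m]):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p, m = 0.3, 4  # illustrative values only
    print("closed form:", intact_fingerprint_probability(p, m))
    print("simulated  :", simulate_intact_fraction(p, m, trials=200_000))
```

With enough trials the simulated fraction should agree with the closed form to within sampling error.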

Upper bound on chance of DNA fingerprint discovery
Note that a typical DNA fingerprinting experiment involves many copies, say r, of the DNA genome under study. Based on the result in (2.4), the following results can be readily verified.
The probability for a DNA probe-based fingerprinting experiment to lead to the discovery of a fingerprint is bounded by
$$1 - \left[1 - p^2 (1-p)^{m-1}\right]^{r}, \tag{2.5}$$
which achieves its maximum at
$$p = \frac{2}{m+1}. \tag{2.6}$$
Here r is the number of copies of the microbe genome used in the experiment, and m is the maximum number of fragments in a full digestion of the fingerprint DNA Ω. The bound (2.5) is the probability that at least one of r independently digested copies releases Ω intact, and (2.6) follows by setting the derivative of $2\log p + (m-1)\log(1-p)$ to zero. The results in (2.5) and (2.6) suggest that if we could adjust the experimental conditions so that the cutting efficiency of the restriction enzyme equals 2/(m + 1), which is roughly inversely proportional to the number of fragments in a full digestion of the fingerprint, we could maximize our chance of discovering the fingerprint. It is also important to note that the probability in (2.5) is determined not only by the controllable experimental factor r, but also by the enzyme efficiency p, which can be manipulated only partially and indirectly through other experimental factors such as reaction temperature and duration of reaction, and by the maximum number of fragments m that a full digestion of the fingerprint Ω produces. Here m + 1 is the number of restriction sites in the fingerprint (2.2). The factor m, inherent to the microbe DNA, is beyond the experimenters' control. Therefore, regardless of how well the experiment is designed, the discovery of the DNA fingerprint is always a chance event.
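The dependence of the bound on p, m and r can be explored numerically. The sketch below assumes that the bound has the form $1 - [1 - p^2(1-p)^{m-1}]^r$, that is, the probability that at least one of r independently digested copies releases the fingerprint intact; this form is an assumption reconstructed from the surrounding text, and the function names are illustrative. A grid search is used to check that the maximizing efficiency is close to 2/(m + 1).

```python
import math


def single_copy_probability(p: float, m: int) -> float:
    """Probability that one genome copy releases the fingerprint intact."""
    return p ** 2 * (1.0 - p) ** (m - 1)


def discovery_bound(p: float, m: int, r: int) -> float:
    """Assumed bound 1 - (1 - q)^r, computed stably for tiny q and large r."""
    q = single_copy_probability(p, m)
    if q >= 1.0:
        return 1.0
    # log1p and expm1 avoid the cancellation in 1 - (1 - q) ** r.
    return -math.expm1(r * math.log1p(-q))


def optimal_efficiency(m: int) -> float:
    """Stationary point of p^2 (1 - p)^(m - 1) on (0, 1)."""
    return 2.0 / (m + 1)


def grid_argmax(m: int, steps: int = 10_000) -> float:
    """Brute-force check of the maximizer on a uniform grid over (0, 1)."""
    grid = (i / steps for i in range(1, steps))
    return max(grid, key=lambda p: single_copy_probability(p, m))


if __name__ == "__main__":
    m = 39  # illustrative value
    print("analytic optimum:", optimal_efficiency(m))
    print("grid optimum    :", grid_argmax(m))
```

Because the bound is an increasing function of the single-copy probability for any fixed r, maximizing the single-copy probability over p also maximizes the bound.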

Reproducibility of DNA Fingerprint
Companies and scientists apply for patents on the DNA fingerprints they discover in order to protect their intellectual property rights. A patent application requires the parties to submit documents detailing the experiments that led to the successful finding of the fingerprints. In a recent lawsuit against a company that holds a patent on the DNA fingerprint of a microorganism, the patent was argued to be invalid on the ground that five repeat runs, by the plaintiff, of one of the key experiments, which had resulted in a partial digestion of the microorganism genome containing the DNA fingerprint, did not reproduce the fingerprint. This experiment initially involved the digestion of one billion copies ($r = 10^8$) of the microorganism genome, using a restriction enzyme. The fingerprint and the restriction site consist of 2,500 and 3 pairs of nucleotides, respectively. There are 40 restriction sites residing on the fingerprint, with one at each of the 5′ and 3′ ends. In other words, this fingerprint can be fractionated into 39 pieces (m = 39) in a full digestion.
In the following, we apply the method developed in the previous section to determine the actual chance of reproducing the DNA fingerprint in five repeats of the original experiment. Let
$$f(p) = 1 - \left[1 - p^2 (1-p)^{38}\right]^{10^8}.$$
By (2.5), f(p) is an upper bound on the probability for a single repeat of the original experiment to contain the DNA fingerprint when the enzyme cutting efficiency is p. A plot of f(p) against p is depicted in Figure 1. As shown in Figure 1, when the value of p determined by the original experimental conditions is no greater than 0.32, f(p) is close to 1. It drops to nearly 0 for p > 0.4.
In addition, regardless of how well one might design DNA fingerprinting experiments, the chance of success is in part predetermined by factors such as m, which are out of scientists' control, and by others such as p, which can be controlled only partially and indirectly. Therefore, while well-designed experiments can improve one's odds of success in DNA fingerprinting, ultimately it is the inherent properties of a DNA sequence that dictate the chance of success. In other words, the discovery of a DNA fingerprint of a microbe is governed more by chance than by design. Lastly, although the results were derived under the assumption that the genome of interest possesses a single copy of the DNA fingerprint Ω, they can be readily generalized to the case in which the genome has multiple copies of the DNA fingerprint.

Figure 1: Upper bound on probability of reproducing DNA fingerprint in a single repeat of the original experiment.
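The shape of the curve in Figure 1 can be approximated numerically. The sketch below is a hedged reconstruction, not the authors' code: it assumes that the bound has the form $f(p) = 1 - [1 - p^2(1-p)^{m-1}]^r$ with m = 39 and r = 10^8 as written in the reproducibility example, and it treats the five repeat runs as independent, which is an additional assumption.

```python
import math

M = 39        # fragments in a full digestion of the fingerprint (from the text)
R = 10 ** 8   # genome copies, taking r = 10^8 as written in the text


def f(p: float, m: int = M, r: int = R) -> float:
    """Assumed upper bound on the success probability of a single repeat."""
    q = p ** 2 * (1.0 - p) ** (m - 1)
    if q >= 1.0:
        return 1.0
    # Stable evaluation of 1 - (1 - q) ** r for very small q and large r.
    return -math.expm1(r * math.log1p(-q))


def at_least_one_success(p: float, repeats: int = 5) -> float:
    """Upper bound on reproducing the fingerprint in any of `repeats` runs,
    assuming the runs are independent (an assumption, not from the text)."""
    return 1.0 - (1.0 - f(p)) ** repeats


if __name__ == "__main__":
    for p in (0.05, 0.20, 0.32, 0.40, 0.50):
        print(f"p = {p:.2f}  f(p) = {f(p):.4g}  "
              f"five repeats = {at_least_one_success(p):.4g}")
```

Under these assumptions the bound stays near 1 for moderate p and falls off sharply as p approaches 0.4, so whether five failed repeats are surprising depends strongly on the cutting efficiency actually achieved in the experiment.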