Concept: Statistics power the probability of recovery. But they must be tailored to fit Earned Schedule.
Practice: The first step in statistical analysis is to identify and describe the data. Its “shape” determines the math that can be used on it.
What data should we analyze? Recovery depends on performance, and schedule performance is measured by Schedule Performance Index for time (SPIt).
SPIt measures the amount of Earned Schedule per unit of Actual Time (ES/AT). Periodic SPIt equals periodic ES: it is the Earned Schedule at time t+1 less the Earned Schedule at time t, divided by 1 (for a single period, AT equals 1). As a single point, rather than an accumulation, each periodic SPIt acts as an observation in the data pool.
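In symbols, writing ESt for the cumulative Earned Schedule through period t:

$$SPI_t^{\,periodic} = \frac{ES_{t+1} - ES_t}{1} = ES_{t+1} - ES_t$$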
It seems a reasonable idea: gather historical periodic SPIt and use it as the data for statistical analysis. Unfortunately, it doesn’t work. Walt Lipke investigated periodic SPIt from a number of projects and found that the distribution looks like this:
Figure 1: Distribution of periodic SPIt
The distribution is far from normal, and that rules out some of the most valuable statistical techniques.
Walt did not stop there, however. He found that SPIt can be mathematically transformed into a form that looks like this:
Figure 2: Distribution of ln SPIt
The natural log of SPIt (ln SPIt) approximates a normal distribution.
Using the log adds some complexity to subsequent calculations. Periodic and cumulative SPIt must both be converted to their log forms, and, ultimately, another transformation is required to determine the probability. But the ability to apply techniques such as variance, standard deviation, and confidence limits outweighs the complexity.
Before proceeding, we need to address another complication from the data.
Statistical analysis assumes the pool of data is large; ideally, infinite. Projects, however, are finite. So, the formulas that follow use the math for a finite sample rather than a whole population. (For details on formulas for a population, see the post on Probability of Recovery: Mathematics of Statistics.)
The sample mean, or average, is simply the sum of the observations divided by the number of them:
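$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$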
…where X̅ is the sample mean; xi is an observation; and n is the number of observations.
As mentioned above, observations are periodic efficiencies, and they are transformed mathematically. They are the natural log of periodic SPIt, or ln periodic SPIt for short.
Sample variance measures how much the data is spread from the mean. It’s calculated with the formula:
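$$S^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^2}{n-1}$$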
…where S² is the variance (S is squared because we’re squaring the differences); xi is an individual observation; X̅ is the mean; the sum (Σ) of the squared differences runs from i = 1 to n; and the divisor n−1 adjusts for the finite sample.
The formula for sample standard deviation is:
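$$S = \sqrt{\frac{\sum_{i=1}^{n}\left(x_i - \bar{X}\right)^2}{n-1}}$$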
…where S is the standard deviation; the square root is applied to both sides of the variance equation; and the other terms have already been defined.
For convenience in calculations, Walt uses an alternate version of the formula for standard deviation. Here’s the formula for the sample:
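$$S = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 \;-\; n\bar{X}^2}{n-1}}$$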
For simplification, substitute p = periodic observation and c = cumulative observation. The standard deviation becomes:
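$$S = \sqrt{\frac{\sum (\ln p)^2 \;-\; n\,(\ln c)^2}{n-1}}$$

(Here ln c stands in for the sample mean X̅, a substitution explained below.)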
Given the statistical basics, we can move to the probability of recovery.
That probability is derived from the following equation:
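$$PR_{score} = \frac{\bar{X} - V}{S/\sqrt{n}}$$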
…where PRscore is the value that will be converted to probability, X̅ is the sample mean, V is a selected value, and the term S / √n is the Standard Error (SE).
The equation is a variation on a familiar metric: the Z-score. Given a normal distribution, the Z-score measures how far a given point is from the mean. Analogously, the PRscore measures how far the mean is from a selected value. In both cases, the measurement can be converted into a probability.
In this context, the sample mean, X̅, is the log of the cumulative SPIt (ln cumulative SPIt), or in the simplified notation simply ln c. Why? Because, informally, SPIt = ES/AT. ES is the cumulative total of periods that are earned. The total is divided by the number of periods, AT. That’s the same as the mean: sum of observations / number of observations.
S is the sample standard deviation. There will be an additional “tweak” to S, but for now, only the variable, V, needs further explanation.
V is the threshold value against which performance values such as ln p are compared. Performance values that fall short of V are not powerful enough to ensure on-time completion. So, they induce lower probabilities of recovery.
The math for setting the threshold value is non-trivial. It uses the To Complete Schedule Performance Index (TSPI) plus algebraic manipulation. Before delving into the derivation, here’s the intuition behind it.
Recall that TSPI measures the performance required to complete the project on schedule. Once TSPI exceeds 1.1, the project becomes unrecoverable.
The threshold value, V, is schedule performance relative to 1.1: take the time already earned toward 1.1 and divide it by the time remaining until 1.1 is breached.
The derivation starts with a formula for TSPI derived in an earlier post:
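$$TSPI = \frac{1 - ES\%}{SR - \dfrac{ES\%}{SPI_t}}$$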
…where ES% = ES/PD; SR = ED/PD; and SPIt = ES/AT.
Next, solve the equation for SPIt.
Substitute the threshold value, 1.1, for TSPI and multiply both sides by SR – (ES%/SPIt):
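$$1.1\left(SR - \frac{ES\%}{SPI_t}\right) = 1 - ES\%$$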
Distribute the 1.1:
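$$1.1\,SR - \frac{1.1\,ES\%}{SPI_t} = 1 - ES\%$$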
Rearrange terms:
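$$1.1\,SR - 1 + ES\% = \frac{1.1\,ES\%}{SPI_t}$$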
Multiply both sides by SPIt and divide both sides by (1.1 SR – 1 + ES%):
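$$SPI_t = \frac{1.1\,ES\%}{1.1\,SR - 1 + ES\%}$$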
In this context, SPIt is the threshold performance against which current performance is measured—it’s the variable V. It must be transformed in the same way that both periodic SPIt and cumulative SPIt were transformed before: take its natural log, i.e., ln V.
It’s time to “tweak” S.
As stated earlier, the sample size for projects is finite. Beyond that, the sample size for projects is notably small. Also, as projects move to completion, the probability of recovery converges on a limit (e.g., 0% or 100%).
To address the constraints, first, find the Standard Error, which estimates the variability in the sample mean:
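$$S_M = \frac{S}{\sqrt{N}}$$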
…where SM is the Standard Error; S is the sample standard deviation; and N is the number of observations.
Apply an adjustment factor to SM. It ensures convergence.
The adjustment factor is given by the following equation:
…where PD is the Planned Duration; ES is the Earned Schedule; and N is the Actual Time.
Now, we’re ready to solve for PRscore:
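$$PR_{score} = \frac{\ln c - \ln V}{AF \times \dfrac{S}{\sqrt{N}}}$$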
…where ln c is the log of the cumulative SPIt (i.e., the mean); ln V is the threshold performance for comparison; and the “tweaked” S (the Standard Error multiplied by the adjustment factor, AF) normalizes the difference.
The result is interpreted as a t-score. The t-distribution has a shape similar to the normal distribution, but it’s slightly heavier in the tails. It looks like this:
Figure 3: The t-distribution compared to the normal distribution
The t-distribution supports the same statistical techniques as the normal distribution. Both are symmetric, “bell-shaped” curves centered on the mean; the t-distribution simply places more of its area in the tails. That difference in the area under the curve implies differences in inferred probability.
Why switch from the normal distribution? With a small sample size, the t-distribution gives better results for probabilities: it accounts for the extra uncertainty that comes from estimating the standard deviation from only a few observations.
The t-score represents how far an observation is from the mean, where the distance is measured by the number of standard deviations.
In this context, the observation is the natural log of V (ln V). The PRscore is the distance ln V lies from the mean (ln c), measured in standard deviations. The probability is the area under the curve starting at ln V and extending to the right, excluding only the left tail.
Figure 4: Probability as the area under the curve to the right of the threshold
Once PRscore is known, the probability can easily be computed using the T.DIST function in Excel. It takes the following parameters: x (i.e., the PRscore), degrees of freedom (i.e., n−1), and cumulative (distribution function) = TRUE.
The cumulative distribution applies because it is historical (i.e., cumulative) efficiency, rather than periodic efficiency, that is being analyzed. That’s also why Figure 4 illustrates the probability as one-sided.
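For readers who want to experiment outside Excel, here is a minimal sketch in Python, assuming SciPy is available. The function name and parameters (periodic_spi_t, es_pct, sr, af) are illustrative, not from the original post, and the adjustment factor is passed in as a plain multiplier rather than derived:

```python
import math
from scipy.stats import t as t_dist

def probability_of_recovery(periodic_spi_t, es_pct, sr, af=1.0):
    """Sketch of the PRscore calculation described above.

    periodic_spi_t : list of periodic SPIt observations (each must be > 0)
    es_pct         : ES% = ES / PD
    sr             : SR = ED / PD
    af             : adjustment factor applied to the Standard Error
                     (defaults to 1.0 here as a placeholder)
    """
    n = len(periodic_spi_t)
    logs = [math.log(x) for x in periodic_spi_t]

    # Sample mean of the log observations (the post equates this with
    # ln c, the log of cumulative SPIt).
    ln_c = sum(logs) / n

    # Sample standard deviation with the n - 1 divisor.
    s = math.sqrt(sum((x - ln_c) ** 2 for x in logs) / (n - 1))

    # Standard Error, "tweaked" by the adjustment factor.
    se = af * s / math.sqrt(n)

    # Threshold value V = 1.1 ES% / (1.1 SR - 1 + ES%), then its natural log.
    v = 1.1 * es_pct / (1.1 * sr - 1 + es_pct)
    ln_v = math.log(v)

    # PRscore, interpreted as a t-score with n - 1 degrees of freedom.
    pr_score = (ln_c - ln_v) / se

    # Equivalent of Excel's T.DIST(pr_score, n - 1, TRUE).
    return t_dist.cdf(pr_score, df=n - 1)

# Example: five periods of observed SPIt, 45% of planned duration earned,
# schedule ratio of 1.0 (all values illustrative).
print(probability_of_recovery([0.85, 0.90, 0.95, 0.88, 0.92], es_pct=0.45, sr=1.0))
```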