The Earned Schedule Exchange


June 29, 2022
Probability of Recovery: Statistics in Action

Concept: Statistics power probability of recovery. But, they must be tailored to fit Earned Schedule.

WindowOfOpportunity_w_PrRcv_Normal_Distribution.png

Practice: The first step in statistical analysis is to identify and describe the data. Its “shape” determines the math that can be used on it.

What data should we analyze? Recovery depends on performance, and schedule performance is measured by Schedule Performance Index for time (SPIt).

SPIt measures Earned Schedule relative to Actual Time (ES/AT). Periodic SPIt equals periodic ES: it's the Earned Schedule at t+1 less the Earned Schedule at t, divided by 1 (for a single period, AT equals 1). As a single point, rather than an accumulation, periodic SPIt acts as an observation in the data pool.
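As an illustration, here's a minimal sketch in Python (hypothetical ES values, assuming ES is recorded cumulatively at the end of each period):

```python
# Cumulative Earned Schedule observed at the end of each period (hypothetical).
es_cumulative = [0.9, 1.7, 2.8, 3.5, 4.6]

# Periodic SPIt: the ES earned within each single period, ES(t+1) - ES(t).
# Because AT = 1 for a single period, periodic SPIt equals periodic ES.
periodic_spi_t = [es_cumulative[0]] + [
    later - earlier for earlier, later in zip(es_cumulative, es_cumulative[1:])
]
print([round(x, 2) for x in periodic_spi_t])  # [0.9, 0.8, 1.1, 0.7, 1.1]
```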

It’s a reasonable idea—gather historical periodic SPIt and use it as the data for statistical analysis. Unfortunately, it doesn’t work. Walt Lipke investigated periodic SPIt from a number of projects and found that their distribution looks like this:

 Skew_Curve.jpg

Figure 1

That rules out some of the most valuable statistical techniques.

Walt did not stop there, however. He found that SPIt can be mathematically transformed into a form that looks like this.

Norm_Curve.jpg

Figure 2

The natural log of SPIt (ln SPIt) approximates a normal distribution.

Using the log causes some complexity in subsequent calculations. Periodic and cumulative SPIt must both be converted to their log form. Ultimately, another transformation is required to determine the probability. But, the ability to apply techniques such as variance, standard deviation, and confidence limits outweighs the complexity.
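Continuing the sketch (hypothetical values), the conversion is a direct application of the natural log to both the periodic and the cumulative index:

```python
import math

periodic_spi_t = [0.9, 0.8, 1.1, 0.7, 1.1]  # hypothetical periodic SPIt values
cumulative_spi_t = sum(periodic_spi_t) / len(periodic_spi_t)  # ES / AT over 5 periods

ln_periodic = [math.log(x) for x in periodic_spi_t]  # ln periodic SPIt
ln_cumulative = math.log(cumulative_spi_t)           # ln cumulative SPIt
```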

Before proceeding, we need to address another complication from the data.

Statistical analysis assumes the pool of data is large, very large, indeed, infinite. Projects, however, are finite. So, the formulas which follow use the math for a finite sample rather than a whole population. (For details on formulas for a population, see the post on Probability of Recovery: Mathematics of Statistics.)

The sample mean, or average, is simply the sum of the observations divided by the number of them:

x̄ = ( Σ xi ) / n

…where x̄ is the sample mean; xi is an observation; the sum (Σ) is of all observations from 1 to n; and n is the number of observations.

As mentioned above, observations are periodic efficiencies, and they are transformed mathematically. They are the natural log of periodic SPIt, or ln periodic SPIt for short.

Sample variance measures how much the data is spread from the mean. It’s calculated with the formula:

S² = Σ ( xi − x̄ )² / ( n − 1 )

…where S² is the sample variance (S is squared because we're squaring the differences); xi is an individual observation; x̄ is the mean; the sum (Σ) is of the squared differences from 1 to n; and the divisor, n − 1, is one less than the number of observations. (The n − 1 divisor is what distinguishes the sample formula from the population formula.)

The formula for sample standard deviation is:

S = √[ Σ ( xi − x̄ )² / ( n − 1 ) ]

…where S is the sample standard deviation: the square root of the variance. The other terms have already been defined.
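For concreteness, here's a sketch using Python's statistics module on the hypothetical observations from earlier; note that variance and stdev apply the sample divisor, n − 1:

```python
import math
import statistics

# Observations: natural logs of the hypothetical periodic SPIt values.
observations = [math.log(x) for x in [0.9, 0.8, 1.1, 0.7, 1.1]]

x_bar = statistics.mean(observations)          # sample mean: sum / n
s_squared = statistics.variance(observations)  # sample variance: divisor is n - 1
s = statistics.stdev(observations)             # sample standard deviation
```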

For convenience in calculations, Walt uses an alternate version of the formula for standard deviation. Here’s the formula for the sample:

S = √[ ( Σ xi² − n x̄² ) / ( n − 1 ) ]

For simplification, substitute p for a periodic observation and c for the cumulative observation; both enter the formula as natural logs. The standard deviation becomes:

S = √[ ( Σ (ln p)² − n (ln c)² ) / ( n − 1 ) ]
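Here's a sketch of that alternate form, assuming (as stated below) that ln c, the log of the cumulative SPIt, stands in for the mean:

```python
import math

periodic = [0.9, 0.8, 1.1, 0.7, 1.1]    # hypothetical periodic SPIt values
n = len(periodic)
ln_p = [math.log(x) for x in periodic]  # ln of each periodic observation
ln_c = math.log(sum(periodic) / n)      # ln cumulative SPIt, i.e., ln(ES / AT)

# S = sqrt( (sum of (ln p)^2  -  n * (ln c)^2) / (n - 1) )
s = math.sqrt((sum(x * x for x in ln_p) - n * ln_c ** 2) / (n - 1))
```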

Given the statistical basics, we can move to the probability of recovery. 

That probability is derived from the following equation:

PRscore = ( x̄ − V ) / ( S / √n )

…where PRscore is the value that will be converted to probability; x̄ is the sample mean; V is a selected comparison value; and the term S / √n is the Standard Error (SE).

The equation is a variation on a familiar metric: the Z-score. Given a normal distribution, the Z-score measures how far a given point is from the mean. Analogously, the PRscore measures how far the mean is from a selected value. In both cases, the measurement can be converted into a probability.

In this context, the sample mean, x̄, is the log of the cumulative SPIt (ln cumulative SPIt), or, in the simplified notation, simply ln c. Why? Because, informally, SPIt = ES/AT. ES is the cumulative total of periods that are earned. The total is divided by the number of periods, AT. That's the same as the mean: sum of observations / number of observations.

S is the sample standard deviation. There will be an additional “tweak” to S, but for now, only the variable, V, needs further explanation.

V is the threshold value against which performance values such as ln c are compared. Performance values that fall short of V are not powerful enough to ensure on-time completion. So, they induce lower probabilities of recovery.

The math for setting the threshold value is non-trivial. It uses the To Complete Schedule Performance Index (TSPI) plus algebraic manipulation. Before delving into the derivation, here’s the intuition behind it.

Recall that TSPI measures the performance required to complete the project on schedule. Once TSPI exceeds 1.1, the project becomes unrecoverable.

The threshold value, V, is schedule performance relative to 1.1: take the time already earned toward 1.1 and divide it by the time remaining until 1.1 is breached.

The derivation starts with a formula for TSPI derived in an earlier post.

TSPI = ( 1 − ES% ) / ( SR − ES% / SPIt )

…where ES% = ES/PD; SR = ED/PD (ED being the Estimated Duration); and SPIt = ES/AT.

Next, solve the equation for SPIt.

Substitute the threshold value, 1.1, for TSPI and multiply both sides by SR – (ES%/SPIt):

1.1 × ( SR − ES% / SPIt ) = 1 − ES%

Distribute the 1.1:

1.1 SR − 1.1 ( ES% / SPIt ) = 1 − ES%

Swap terms:

1.1 ( ES% / SPIt ) = 1.1 SR − 1 + ES%

Multiply both sides by SPIt and divide both sides by 1.1 SR – 1 + ES%:

SPIt = 1.1 ES% / ( 1.1 SR − 1 + ES% )

In this context, SPIt is the threshold performance against which current performance is measured—it’s the variable V. It must be transformed in the same way that both periodic SPIt and cumulative SPIt were transformed before: take its natural log, i.e., ln V.
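In code, the threshold calculation is straightforward. The sketch below uses the derived formula with hypothetical inputs; ED, the Estimated Duration, comes from the definition of SR above:

```python
import math

def threshold_ln_v(es: float, pd: float, ed: float) -> float:
    """ln V, where V = 1.1 * ES% / (1.1 * SR - 1 + ES%)."""
    es_pct = es / pd   # ES% = ES / PD
    sr = ed / pd       # SR = ED / PD
    v = (1.1 * es_pct) / (1.1 * sr - 1 + es_pct)
    return math.log(v)

# Hypothetical project: 4.6 periods earned against a 10-period plan,
# with the estimated duration equal to the planned duration.
print(threshold_ln_v(es=4.6, pd=10, ed=10))
```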

It’s time to “tweak” S.

As stated earlier, the sample size for projects is finite. Beyond that, the sample size for projects is notably small. Also, as projects move to completion, the probability of recovery converges on a limit (e.g., 0% or 100%).

To address the constraints, first, find the Standard Error. It estimates the variability in the sample mean.

SM = S / √n

…where SM is the Standard Error; S is the sample standard deviation; and n is the number of observations.

Apply an adjustment factor to SM. It ensures convergence.

The adjustment factor is given by the following equation:

EQ_Prcv_calc_AF_N.png

…where PD is the Planned Duration; ES is the Earned Schedule; and AT is the Actual Time.

Now, we’re ready to solve for PRscore.

PRscore = ( ln c − ln V ) / ( AF × SM )

…where ln c is the log of the cumulative SPIt (i.e., the mean); ln V is the threshold performance for comparison; and the "tweaked" S (the Standard Error scaled by the adjustment factor) normalizes the difference.

The result is interpreted as a t-score. The t-distribution has a shape similar to the normal distribution, but it’s slightly heavier in the tails. It looks like this:

 t_vs_Normal_distribution_w_dots.png

Figure 3

The t-distribution supports the same statistical techniques as the normal distribution, and the sample mean, variance, and standard deviation are calculated in the same way. What differs is how the area under the curve is distributed: more of it sits in the tails, which implies differences in inferred probability.

Why switch from the normal distribution? With a small sample size, the t-distribution gives better results for probabilities.

The t-score represents how far an observation is from the mean, where the distance is measured by the number of standard deviations.

In this context, the observation is the natural log of V (ln V). The PRscore is the distance between ln V and the mean (ln c), measured in standard deviations. The probability is the one-sided area under the curve: it runs up to the PRscore and excludes only the tail beyond it.

t_distribution_w_90_PCC.png

Figure 4

Once PRscore is known, the probability can easily be computed using the T.DIST function in Excel. It takes the following parameters: x (i.e., PRscore), degrees of freedom (i.e., n − 1), and cumulative (distribution function) = TRUE.

The cumulative distribution applies because it is historical (i.e., cumulative) efficiency, rather than periodic efficiency, that is being analyzed. That's also why Figure 4 illustrates the probability as one-sided.
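For readers outside Excel, here's a hedged sketch of the whole calculation in Python. scipy.stats.t.cdf plays the role of T.DIST; the adjustment factor AF is passed in rather than computed from its equation above, and all inputs are hypothetical:

```python
import math
from scipy.stats import t as t_dist

def probability_of_recovery(periodic_spi_t, ln_v, af):
    """PRscore = (ln c - ln V) / (AF * SM), converted via the t-distribution."""
    n = len(periodic_spi_t)
    ln_p = [math.log(x) for x in periodic_spi_t]
    ln_c = math.log(sum(periodic_spi_t) / n)   # the mean: ln cumulative SPIt
    # Alternate-form sample standard deviation (see the formula above).
    s = math.sqrt((sum(x * x for x in ln_p) - n * ln_c ** 2) / (n - 1))
    sm = s / math.sqrt(n)                      # Standard Error
    pr_score = (ln_c - ln_v) / (af * sm)       # the "tweaked" S in the denominator
    return t_dist.cdf(pr_score, df=n - 1)      # Excel: T.DIST(PRscore, n - 1, TRUE)

# Hypothetical inputs: the periodic SPIt values used earlier, the ln V from the
# threshold sketch, and a placeholder adjustment factor of 1.0.
print(probability_of_recovery([0.9, 0.8, 1.1, 0.7, 1.1], ln_v=-0.101, af=1.0))
```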



June 28, 2022
Probability of Recovery: the Mathematics of Statistics

Concept:  The probability of recovery depends on statistical analysis. 

If you’re already familiar with normal distribution, standard deviation, and confidence intervals, you can skip to the post, Probability of Recovery: Statistics in Action. Otherwise, here is the math behind the stats.

WindowOfOpportunity_w_PrRcv_Normal_Distribution.png


Practice: 
Statistical analysis uses mathematics to identify patterns and trends in data sets.

The first step in statistical analysis is to define and describe the character of the data. Its “shape” determines the math that can be used on it. For instance, the bell-shaped curve of the normal (or Gaussian) distribution (Figure 1) supports the most common statistical techniques.

Normal_Distribution.png

Figure 1

Given the normal distribution, the average, or mean, is calculated as follows:

μ = ( Σ xi ) / N

…where μ is the mean; xi is an observation; the sum (Σ) is of all observations from 1 to N; and the sum is divided by N, the number of observations. (Want more info? Check out Khan Academy’s post on normal distributions.)

How much do the observations spread out around the mean? In other words, how far, on average, do the observations deviate from the mean? To find out, first, sum the differences between each observation and the mean.

But, wait, sometimes the difference is positive, and other times it’s negative. We don’t want the positives and negatives to cancel each other. We want to know the total of differences regardless of the sign. So, square each difference, and then add it to the total.

Next, to get the average, we need to divide by the number of observations.

In mathematical terms, the result, called the variance, is expressed succinctly, if somewhat scarily, as:

σ² = Σ ( xi − μ )² / N

…where σ² is variance (σ is squared because we're squaring the differences); xi is an individual observation; μ is the mean; the sum (Σ) is of squared differences from 1 to N; and N is the number of observations.
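A sketch of the computation, following the steps just described (hypothetical data):

```python
data = [198, 204, 191, 210, 197]   # hypothetical observations (heights in cm)
n = len(data)
mu = sum(data) / n                 # the mean

# Square each difference from the mean, total them, and divide by N.
variance = sum((x - mu) ** 2 for x in data) / n
print(variance)  # 42.0
```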

There’s an alternate formula for variance. It’s useful for spreadsheet calculations. The formula appears at the conclusion of its derivation.

Start with the numerator of the variance.

Σ ( xi − μ )²

Expand the square:

Σ ( xi² − 2 μ xi + μ² )

Distribute Σ across the terms and move constants in front of Σ:

Σ xi² − 2 μ Σ xi + μ² Σ 1

Simplify the last term: summing 1 from 1 to N equals N. To complete the derivation, note that Σ xi = Nμ (the definition of the mean, rearranged), substitute it for the remaining Σ, combine the μ² terms, and divide by N to recover the variance:

σ² = ( Σ xi² − N μ² ) / N
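A quick sketch confirms that the derived form matches the original definition (same hypothetical data as above):

```python
data = [198, 204, 191, 210, 197]   # hypothetical heights in cm
n = len(data)
mu = sum(data) / n

direct = sum((x - mu) ** 2 for x in data) / n             # definition of variance
alternate = (sum(x * x for x in data) - n * mu ** 2) / n  # derived alternate form

assert abs(direct - alternate) < 1e-9   # both give 42.0
```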

Variance measures how much the individual observations are spread out within the data set. Why not stop there? An example illustrates why an additional calculation is needed.

Say that the data set contains the heights of all Olympic male basketball players measured in centimeters. The variance gives the spread in terms of squared centimeters.  For the 2012 Olympics, the variance is 73.43 square centimeters (11.38 square inches).  Because it’s measured in square centimeters, it’s difficult to understand what the variance is telling us.

What we need is a measurement that’s in the same units as the original data. For that, take the square root of the variance.

σ = √[ Σ ( xi − μ )² / N ]

That’s called the standard deviation. For the example, it equals 8.57 cm (3.37 inches). That’s a lot more meaningful. Given that the mean height is 200 cm (78.7 inches or about 6½ feet), the actual heights differ from the mean by about 9 cm (3½ in).
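The arithmetic is easy to check (the 73.43 cm² figure comes from the post's 2012 data):

```python
import math
print(math.sqrt(73.43))  # 8.5691..., i.e., about 8.57 cm
```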

The standard deviation is also useful because it supports additional metrics. For instance, from it, we can calculate a range of plausible values for the parameter of interest (in the example, that’s the height). The range is called the confidence interval.

The high and low values of the range are called the Confidence Limits (CL). This time, to explain, let’s start with a scary formula and deconstruct it.

CL = μ ± Z × ( σ / √N )

…where we’ve already met the terms μ for mean, σ for standard deviation, and N for the number of observations. The other terms are defined below.

The formula says that the range of plausible values starts at the mean and goes above or below it. How far above and below? That’s determined by the margin of error—the likely “swing” in value given the shape of the data.

The margin of error (the part after ±) comprises two components. The first term, Z, represents the desired confidence level. In statistics, confidence level refers to the long-term success rate of the method, i.e., how often it will cover the parameter of interest. A common level is 95%.

In visual terms, the confidence level is represented by the area under the curve.

Normal_Distribution_w_Confidence_Level.png

Figure 2

The same area can be expressed in terms of σ, standard deviation. It looks like this:

Normal_Distribution_w_Confidence_Level_w_Z_score_and_std_dev.png

Figure 3

As you can see, the 95% confidence level is about two standard deviations from the mean. As a check, we can estimate the number of standard deviations with geometry.

Normal_Distribution_w_Confidence_Level_w_Z_score_and_std_dev_est.png

Figure 4

We know that the total area under the curve equals 100%. The area is divided into two equal parts. So, the curve peaks at half the total, 50%. A triangle approximates the area covered by 2σ (shaded).

To calculate the Area of the triangle, we have: (height * base) / 2. We know the Area in this case. It’s half of 95% or 47.5%. We know the height: 50%. So, we can solve for the base. It’s (2 * Area) / height, or (2 * 47.5%) / 50%. That gives us 1.9, which is roughly 2σ. (Correction made 24/07/22.)

The exact number is ±1.96, as calculated using integral calculus. Given the difficulty of such a calculation, most people get the value from a table of pre-calculated numbers.

The most common table has values for a two-sided normal distribution. You can’t just look up 95% and read off the value, however.

First, the table represents the confidence level as a decimal: .95. Second, the table holds values for only one side of the curve at a time. That means it excludes one of the tails, but it also means it includes the other tail (Figure 5). The number to look up is .95 + .025 = .975, which is ±1.96.

Normal_Distribution_w_Confidence_Level_w_Z_score_and_std_dev_w_PCC_one_tail.png

Figure 5

This way of expressing the confidence level is called the Z score. It’s useful when doing calculations because it’s numeric rather than geometric, and that makes computation easier. Plus, it’s based on standard deviation, which is a known quantity.
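In place of the printed table, a statistics library can produce the same number. A sketch with scipy:

```python
from scipy.stats import norm

# Inverse of the cumulative normal distribution at .975
# (the 95% confidence level plus the included tail).
print(norm.ppf(0.975))  # 1.9599..., i.e., the familiar 1.96
```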

The second term in the margin of error, σ/√N, represents the amount of spread in the data. If we take that expression and square the top and bottom, we get: σ²/N. Recall that σ² is the variance. The revised expression shows that the variability of the estimated mean is inversely proportional to the number of observations. That makes intuitive sense: the more observations there are, the smaller the swing in the estimate.

Next, the confidence level is scaled to the amount of variance. That’s expressed by multiplying the two terms: Z *(σ /√N). Again, it makes intuitive sense. The margin of error should increase for a given confidence level if the amount of variance increases. In the normal distribution, more variance implies greater dispersion around the mean. So, the outer boundaries of observations will likely be wider.
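Putting the pieces together, here's a sketch of the confidence limits with hypothetical values (scipy's norm.ppf supplies Z):

```python
import math
from scipy.stats import norm

mu, sigma, n = 200.0, 8.57, 144   # hypothetical mean, standard deviation, observations
z = norm.ppf(0.975)               # Z for a 95% two-sided confidence level

margin = z * (sigma / math.sqrt(n))   # Z * (sigma / sqrt(N))
low, high = mu - margin, mu + margin
print(f"95% confidence interval: {low:.2f} to {high:.2f} cm")
```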

With that, we have an explanation of the “scary” formula for confidence intervals. And with that, we end the introduction to the math behind statistical analysis in Earned Schedule.

Up next: how Walt Lipke applies the math to probability of recovery.

 




