The Precision, Accuracy, and Utility of Numerical Evaluations in Best-Value Procurements (Part 1)
In July 2017, there was a virtual conversation among local government procurement professionals about a “technical tie” on a procurement and how to handle it. Total weighted scores in an RFP evaluation were separated by 4 points out of 700 possible. A using department asked for additional reviews because the evaluation appeared to be a “statistical tie.” Could they? Should they?
There was a wide range of advice. "You have a photo finish, not a tie." "Additional review will only lead to a protest." "Assuming the evaluation criteria was proper, the spreadsheet should be a valid reflection of what the evaluation team wanted ... clear winner." "Points are merely a subjective conversion of facts to numbers ... Poll the evaluation committee on the reasons they recommended it over the rest ... seek consensus on the award decision."
Part 1 of this article looks at numerical scoring in proposal evaluations. These evaluations, however subjective, are measurements. Are there problems of precision in the evaluation methodology? Are the ratings accurate? Part 2 takes up the follow-on questions: Does the evaluation scoring system effectively model the best-value decision? And how does one improve what is essentially a subjective measurement?
False Precision?
These evaluation issues arise most often in best-value procurements, defined by the National Institute of Governmental Purchasing as “[a] procurement method that emphasizes value over price. The best value might not be the lowest cost.”[1]
Time and again, the U.S. Government Accountability Office (GAO), which decides the lion's share of federal bid protest cases, has said in its opinions about best-value procurements that adjectival or numerical scoring systems are merely guides to intelligent decision-making.[2] The discussion here may illuminate why. Serco, Inc. v. United States,[3] a federal case involving both adjectival and numerical ratings, illustrates their potential weaknesses.
Serco involved a procurement by the U.S. General Services Administration for master contracts permitting task orders for a wide variety of technology equipment and services available to the entire federal government. The agreement contemplated $50 billion in orders, so there was broad industry interest. The solicitation was announced as a multiple award procurement, and 62 suppliers submitted technically acceptable proposals. The problem arose when GSA “drew the line” on awards. It ultimately found a “natural break” in the evaluations between the 28th and 29th offerors and made awards accordingly. Multiple suppliers below the cutoff line protested.
The court eventually sustained the challenge to the awards, and its rationale is instructive for any numerical evaluation. First, the court found fault with the way past performance information was collected. GSA had retained an independent contractor to survey references using a prepared script. However, the interviewers were not trained in the evaluation standards and descriptive terms used to discriminate among levels of past performance. Moreover, the evaluation plan said the company would send the interview reports to the suppliers for comments on their content, but it did so for only one offeror. The court noted that surveys like these are permitted, but the way this one was conducted did not give the evaluation committee the information it needed to evaluate past performance meaningfully. This was essentially a cascading error: an inaccurate measurement fed into a system of aggregated scoring that magnified the error.
Second, with respect to both past performance and the technical evaluation, the agency converted the adjectival ratings to whole-number (1-5) scores, weighted and averaged them, and reported the resulting overall ratings to three decimal places (thousandths), even though the inputs were whole numbers. Citing mathematics and statistics texts, the court found both inaccuracy (the improperly conducted past performance survey) and systemic error caused by "attaching talismanic significance to technical calculations that suffer from false precision." Turning adjectival ratings into whole numbers and then averaging them introduces something akin to rounding error that limits the precision of the result, and that error propagates through the evaluation computations. As a result, the natural break points relied on by GSA were essentially statistically insignificant.
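The arithmetic the court criticized is easy to reproduce. Below is a minimal sketch, using hypothetical ratings and weights rather than the actual Serco figures, of how whole-number adjectival conversions, once averaged and weighted, yield overall scores carrying three decimal places:

```python
# Minimal sketch of "false precision," using hypothetical ratings and weights
# (not the actual Serco figures). Adjectival ratings become whole numbers on a
# 1-5 scale, are averaged across three evaluators, then weighted into an
# overall score reported to three decimal places.

ADJECTIVAL = {"marginal": 2, "good": 3, "very good": 4, "excellent": 5}
WEIGHTS = {"technical": 0.40, "management": 0.35, "past_performance": 0.25}

def overall(ratings_by_factor):
    """ratings_by_factor maps each factor to the evaluators' adjectival ratings."""
    score = 0.0
    for factor, ratings in ratings_by_factor.items():
        factor_avg = sum(ADJECTIVAL[r] for r in ratings) / len(ratings)
        score += WEIGHTS[factor] * factor_avg
    return score

offeror_a = {"technical": ["very good", "excellent", "very good"],
             "management": ["good", "very good", "good"],
             "past_performance": ["excellent", "very good", "excellent"]}
offeror_b = {"technical": ["excellent", "very good", "very good"],
             "management": ["very good", "very good", "good"],
             "past_performance": ["very good", "very good", "excellent"]}

a, b = overall(offeror_a), overall(offeror_b)
print(f"A = {a:.3f}, B = {b:.3f}, gap = {abs(a - b):.3f}")  # A = 4.067, B = 4.100, gap = 0.033
# The inputs can resolve only whole-point differences, yet the arithmetic
# produces three-decimal scores; a gap of a few hundredths (or thousandths) is
# an artifact of the computation, not a measured distinction between proposals.
```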
Yet the court left open a way to overcome that subjectivity in the technical comparisons: a meaningful best-value analysis. It was disturbed, though, that offerors whose evaluated prices were more than two standard deviations above the mean price of the 62 offerors received awards without any meaningful analysis of whether the perceived technical superiority was worth the price premium. The court noted in its factual findings that the cost-technical tradeoff documentation followed recurring patterns and repeated language, with bare conclusions that technical merit justified the increased price. "Simple math, indeed, reveals that the agency apparently was willing to pay [a] premium of as much as $3.6 million for a technical-ranking advantage of a mere one-tenth of a point—seemingly, a huge premium in a procurement in which the lowest-priced award went for $21,059,803."
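The court's arithmetic can be restated directly from the figures quoted in the opinion (the $3.6 million premium, the one-tenth-of-a-point ranking advantage, and the $21,059,803 lowest-priced award). The script below is only a back-of-the-envelope restatement of those numbers, not part of the opinion:

```python
# Back-of-the-envelope restatement of the court's arithmetic, using only the
# figures quoted in the Serco opinion.
premium = 3_600_000            # price premium the agency was willing to pay
ranking_advantage = 0.1        # technical-ranking advantage, in points
lowest_priced_award = 21_059_803

print(f"Implied price of one full ranking point: ${premium / ranking_advantage:,.0f}")
# -> $36,000,000 per point
print(f"Premium as a share of the lowest-priced award: {premium / lowest_priced_award:.1%}")
# -> roughly 17.1%
# Without a documented tradeoff rationale, the scores alone cannot show that a
# one-tenth-of-a-point edge is worth a premium of this size.
```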
Inaccuracy in the Approximation of True Merit?
Serco used statistical concepts like precision. But what else does decision science teach about the use of numerical ratings and their ability to approximate the true merits of proposals?[4]
The figure is a spreadsheet, sometimes called a decision matrix, used in an e-procurement system procurement conducted in the days when "self-funded models" were popular. (I'll discuss the color codes in Part 2.) The idea of self-funded models was that e-procurement suppliers would be willing to charge government entities a percentage of each order to capture their costs of system development and implementation. (That model accounts for the absence of more traditional price or evaluated costs on the spreadsheet.) But the spreadsheet illustrates a limitation of numerical evaluations. Despite multiple rounds of evaluations, the final scores of the top two proposals differed by only 6 points out of 1,485. The issue is the same as in the "technical tie" scenario that opened this article: do those numbers represent a true differentiation between proposals?
Numbers on spreadsheets act as proxy variables for what is being evaluated: the true value of the factor. But the assignment of a numerical score to a technical or experience factor is subjective. Agencies commonly use numerical scales to evaluate qualitative criteria, with each criterion defined in terms of a target condition or goal. The application of those standards to proposals by evaluation committee members, however, is largely a subjective process.
Daniel Kahneman in Thinking, Fast and Slow[5] makes a compelling case that even expert judgments are overrated: they suffer from overconfidence, inconsistency, and uncertainty. Douglas Hubbard, who writes about empirical methods of reducing uncertainty in decision-making, agrees.[6]
David Ullman, who studies decision-making in engineering design and product development, likewise maintains that averaging scores in a decision matrix injects error because it does not account for uncertainty. Moreover, averaging produces a regression to the mean, so that the judgments of more knowledgeable individuals are marginalized. Uncertainty can be the result of too little information, misinterpretation of criteria, or variations in evaluators' knowledge.
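A small, hypothetical illustration of Ullman's point about averaging: the committee mean barely registers a knowledgeable evaluator's dissent.

```python
# Hypothetical committee scores for one factor: four generalists rate it 4,
# while the one evaluator with deep domain knowledge sees a serious weakness
# and rates it 2.
scores = [4, 4, 4, 4, 2]

committee_average = sum(scores) / len(scores)
print(committee_average)  # 3.6 -- barely distinguishable from a unanimous "4"
# The averaged score carries no trace of the disagreement or its reason, and
# the expert's judgment counts the same as everyone else's. Discussing the
# outlier, rather than silently averaging it away, surfaces that information.
```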
The behavioral branch of decision science identifies other impediments to “accurately” measuring subjective criteria—especially relevant in typical procurement evaluations. They include the bandwagon and halo-horns effects. Closely allied with team groupthink, the bandwagon effect is the influence of group dynamics on individual evaluators. Individual evaluations performed before group discussions—a practice promoted by NIGP’s Global Practice on The Evaluation Process[7]—mitigate the effect of that error. The halo-horns effect explains the tendency early in an individual evaluation to find favor or disfavor with a particular offeror’s approach, influencing subsequent scoring. That effect may be a reason why some agencies stagger evaluators’ starting points at other than the first proposal in the list. However, this group approach does not eliminate the possible effect on an individual evaluator’s scoring of a single proposal, where early factor scoring may create favorable or unfavorable impressions that strongly influence later factor scoring of the proposal.[8]
Hubbard (2014) is especially hard on weighted evaluations. He takes on the use of 1-5 rating scales, describing them as ordinal systems. In one sense, the use of ordinal rating schemes, especially where the ratings are simple rankings, directly implicates procurement evaluations. Where three offerors are ranked 1, 2, and 3, for example, a mathematical comparison of the ranks is meaningless for tradeoffs; taken literally, it says the second-ranked proposal is only half as good as the first, when the actual relative merit may differ by only fractions of a percentage point.
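The distortion is easy to see with hypothetical numbers: suppose the underlying merit of the top two proposals is nearly identical, but the proposals are then scored by rank.

```python
# Hypothetical underlying merit (on a 0-100 scale) versus ordinal rank treated
# as if it were a measurement.
merit = {"Offeror A": 92.0, "Offeror B": 91.6, "Offeror C": 74.0}

# Rank the offerors: 1 = best, 3 = worst.
ranked = sorted(merit, key=merit.get, reverse=True)
ranks = {name: i + 1 for i, name in enumerate(ranked)}
print(ranks)  # {'Offeror A': 1, 'Offeror B': 2, 'Offeror C': 3}

# A mathematical comparison of the ranks says B is "half as good" as A (2 vs 1),
# while the underlying merit differs by less than half a percentage point.
print(merit["Offeror B"] / merit["Offeror A"])  # ~0.996
```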
Kahneman and Ullman, on the other hand, believe that more traditional structured evaluations using a limited number of relevant discriminating factors can improve subjective evaluations.[9] The use of well-constructed evaluation criteria and standards is important and may insulate some weighted ratings from Hubbard's strongest criticisms of ordinal ratings. Ullman devotes two chapters of Making Robust Decisions to the development and use of criteria.
For example, consider the top two levels of a 5-point technical rating scale: excellent (5) and good (4).
For any given technical factor or subfactor, the evaluation criteria include the soundness of the approach, understanding of the project, and the risk of unacceptable or late performance. The evaluation standards include language defining the difference between a good and an excellent rating. While this approach helps calibrate evaluators' subjective judgments, as Hubbard properly points out, the numerical differences may not reflect the value differences. Said another way, a proposal rated excellent (5) may not be worth 25 percent more than a proposal rated good (4), even though that is what a literal mathematical comparison of the ratings implies.
Hubbard advocates more empirical methods that quantify "uncertainty" using percentages. Inconsistency and errors like overconfidence can then be mitigated through "calibration" of the experts. Hubbard's and Kahneman's suggested approaches, though, involve developing multiple scenarios, using tools like regression analysis, and collecting dozens of examples in order to derive meaningful predictive algorithms.[10] Likewise, the last half of Ullman's book uses a decision-support computer program to help measure and visualize the effect of evaluator uncertainty, which Ullman attributes to differences in evaluator knowledge. These empirical methods probably extend well beyond the time and resources available to most procurement offices. Moreover, I'm still not convinced that the variation among individual procurements in a typical procurement office—what Kahneman calls low-validity environments—permits the creation of useful algorithms using empirical methods.[11] Still, some procurement offices use the development of and discussion about the evaluation criteria during evaluation planning as a sort of calibration.
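As one illustration (a minimal sketch, not Hubbard's actual method), scoring with explicit uncertainty might mean asking calibrated evaluators for a 90 percent confidence range of each proposal's merit rather than a single number, and then checking whether the ranges even separate:

```python
# Minimal sketch of scoring with explicit uncertainty (hypothetical figures):
# calibrated evaluators give a 90% confidence range for each proposal's merit
# instead of a point score. Overlapping ranges signal that the apparent
# "difference" may not be real.
ranges = {
    "Offeror A": (70, 88),   # hypothetical lower/upper bounds of merit
    "Offeror B": (74, 90),
}

(a_lo, a_hi), (b_lo, b_hi) = ranges["Offeror A"], ranges["Offeror B"]
overlap = min(a_hi, b_hi) - max(a_lo, b_lo)
if overlap > 0:
    print(f"Ranges overlap by {overlap} points; the evaluation cannot distinguish A from B.")
else:
    print("Ranges do not overlap; the comparison reflects a real difference.")
```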
Despite these criticisms, such evaluation systems are in widespread use. This discussion of precision and accuracy is intended to convince the reader that the evaluations are subjective even though numbers are used, and it points the way toward practices that make numerical methods better at validly modeling award decisions. As will be discussed in Part 2 of this article, the use of individual evaluations followed by collaborative committee discussion of the scores helps lessen the impact of errors such as the bandwagon and halo-horns effects, overconfidence, and inconsistency. Part 2[12] will look more closely at the tradeoffs involved in these awards and highlight practices that improve the fidelity of the evaluations in choosing the proposal that represents the best value to the government.
Notes
[1] National Institute of Governmental Purchasing, NIGP Dictionary of Procurement Terms, available at www.nigp.org/home/find-procurement-resources/dictionary-of-terms.
[2] Wackenhut v. United States, No. 08-660C, December 15, 2008.
[3] Serco, Inc. v. United States, United States Court of Federal Claims, No. 07-691C, March 5, 2008.
[4] David Ullman, Making Robust Decisions: Decision Management for Technical, Business & Service Teams (Trafford Publishing, 2006), p. 66. To Ullman, the level of “approximation” is the model’s fidelity, that is, the measure of how well the model represents the state of the real-world object.
[5] Daniel Kahneman, Thinking, Fast and Slow (Farrar, Straus and Giroux, 2011).
[6] Douglas Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business, 3rd Ed. (Wiley, 2014).
[7] Available at http://www.nigp.org/home/find-procurement-resources/guidance/global-best-practices.
[8] Kahneman, chapter 7, where he uses the grading of school papers as an example. He later used a weighted matrix evaluation system for evaluating fitness for combat duty for the Israeli army, using six characteristics that included responsibility, sociability, and masculine pride, assessed using standard factual interview questions, e.g., number of different jobs, punctuality, frequency of interactions with friends, and interest and participation in sports. The intent was to evaluate those dimensions as objectively as possible. To avoid the halo effect, he instructed evaluators to go through the six traits in a fixed sequence and rate each on a five-point scale.
[9] Kahneman, chapter 21. Ullman, chapter 6.
[10] An example Kahneman gives is developing a large database that includes information on plans and outcomes for hundreds of transportation projects that can be used to provide “base-rate” information about the likely overruns of cost and time and underperformance of the project.
[11] The closest I’ve seen is the Performance Information Procurement System (PIPS), developed by Arizona State University’s former Performance-Based Studies Research Group. The PIPS process—developed for construction but later used for complex services RFPs, including IT—focuses on past performance and the supplier’s assessment of risk as predictors of successful performance. Dean Kashiwagi’s PIPS model now is promoted by Kashiwagi Solution Model Inc. at www.ksm-inc.com.
[12] Available at https://www.garudax.id/pulse/part-2-precision-accuracy-utility-numerical-richard-pennington