Measuring Quality during testing

For years I have been guilty of something; well, a lot of things actually, just ask my wife; but I mean something specific to testing, and that is fuelling a misleading metric. For years now I have been using the test pass/fail rate as a way of measuring quality, thinking it was telling me something useful. But over the last few months I have started to realise that this metric is bloody dangerous!

So this is the metric that I have been using:
Quality = Total No. of Passed Tests / Total No. of Tests Executed
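
To make that concrete, here is a minimal sketch in Python of the calculation (the function name and the example numbers are mine, purely for illustration, not from any real tool):

def quality(passed, executed):
    # Passed tests divided by tests executed so far, as a percentage.
    return 0 if executed == 0 else round(100 * passed / executed)

print(quality(5, 10))    # 50, e.g. 5 passes out of the 10 tests run so far
print(quality(75, 100))  # 75, e.g. 75 passes once all 100 tests have been run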

The thinking behind this metric is that at any point during the test cycle I am able to report how good the quality of the application under test is looking, assuming that my test pack is focused on testing the aspects that all stakeholders have agreed are a good measure of overall quality (I am not planning on getting into a debate about what “quality” is in this post). My management and stakeholders love it… but that is the problem.

They should hate it, because it misleads them and basically tells them diddly squat about the quality. Yet they love it. I am in many a conversation where people are getting excited by high quality, or losing their lunch over low quality, before testing is finished. So much hangs on this metric with them, with testers and developers getting praised or slated based on it, and it is all my fault! I had a hand in bringing this metric into the company, and now I have a responsibility to shut it down before it does more damage.

OK, so let me show why this metric is misleading. Below is a table showing a test pack of 100 tests. The tests can be run at a rate of 10 tests a day, so it takes 10 days to complete them all. Scenario 1 finds a number of issues early, meaning a lot of failed tests early in the cycle, while scenario 2 has a more even distribution throughout the cycle.

         Scenario 1                         Scenario 2
         Passed  Failed  No Run  Quality    Passed  Failed  No Run  Quality
Day 1       5       5      90      50%         9       1      90      90%
Day 2       7      13      80      35%        16       4      80      80%
Day 3      14      16      70      47%        26       4      70      87%
Day 4      15      25      60      38%        34       6      60      85%
Day 5      25      25      50      50%        43       7      50      86%
Day 6      35      25      40      58%        50      10      40      83%
Day 7      45      25      30      64%        56      14      30      80%
Day 8      55      25      20      69%        60      20      20      75%
Day 9      65      25      10      72%        67      23      10      74%
Day 10     75      25       0      75%        75      25       0      75%
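
If you want to play with the numbers yourself, here is a rough Python sketch (my own illustrative script, nothing more) that replays the two scenarios from the table and prints the metric for each day:

# Cumulative (passed, failed) totals per day, taken from the table above.
scenario_1 = [(5, 5), (7, 13), (14, 16), (15, 25), (25, 25),
              (35, 25), (45, 25), (55, 25), (65, 25), (75, 25)]
scenario_2 = [(9, 1), (16, 4), (26, 4), (34, 6), (43, 7),
              (50, 10), (56, 14), (60, 20), (67, 23), (75, 25)]

def quality(passed, executed):
    return 0 if executed == 0 else round(100 * passed / executed)

for day, ((p1, f1), (p2, f2)) in enumerate(zip(scenario_1, scenario_2), start=1):
    print(f"day {day}: scenario 1 = {quality(p1, p1 + f1)}%, "
          f"scenario 2 = {quality(p2, p2 + f2)}%")
# Both scenarios finish on 75%, yet on day 5 one reads 50% and the other 86%.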

As you can see, if we look at day 5 of the 10-day cycle we see two very different statuses. In scenario 1 the quality is at 50%, and my stakeholders are running around panicking, insisting on twice-daily “intensive care” meetings and coming down on the development leads about the “poor” quality of the application. In scenario 2, however, the quality is at 86% and my stakeholders are relaxed, happy and praising the development leads for a great job… high fives all round!

The crux of all this, though, is that by the end of the cycle on day 10 the quality in both scenarios is at 75%, meaning that in scenario 1 my stakeholders were led to overreact, burn resource unnecessarily and come down on the developers harder than was needed. Scenario 2, by contrast, has my stakeholders too relaxed and praising everyone, when in fact the application actually ends up in a worse position than they thought.

What makes it worse is that I can’t even be sure of the final 75% measurement, and here is why: granularity of tests. Let’s assume that our application has 5 functional areas that are to be tested, and because I am a creative kind of guy let’s call them “functional area 1”, “functional area 2”, “functional area 3”, “functional area 4” and “functional area 5”. Now, using my crystal ball, I am also going to tell you that by the end of the testing cycle functional areas 1, 4 and 5 pass and areas 2 and 3 fail. Now look at the table below, where again two scenarios are displayed: in scenario 2 the total number of tests is 100, while in scenario 1 the tests have been written in a much more granular form, giving a total of 1,000.

                    S1 – No. of Tests   S2 – No. of Tests   S1 – % of Tests   S2 – % of Tests   Result
Functional Area 1          200                 25               20.00%            25.00%        Passed
Functional Area 2          120                  5               12.00%             5.00%        Failed
Functional Area 3          220                 15               22.00%            15.00%        Failed
Functional Area 4          210                 30               21.00%            30.00%        Passed
Functional Area 5          250                 25               25.00%            25.00%        Passed

As you can see, in these two scenarios not only are there more tests in scenario 1, but the distribution of tests across the functional areas is different as well, so when we apply the quality metric on the last day the two scenarios end up with completely different results, as shown in the table below.

Scenario 1                           Scenario 2
Passed  Failed  No Run  Quality      Passed  Failed  No Run  Quality
  660     340     0       66%          80      20      0       80%
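
Again, a small sketch (the names are mine, purely for illustration) shows how the same functional outcome, with areas 1, 4 and 5 passing and areas 2 and 3 failing, produces two different “quality” figures purely because of how the tests were sliced:

# Number of tests written against each functional area in the two scenarios.
scenario_1 = {"area 1": 200, "area 2": 120, "area 3": 220,
              "area 4": 210, "area 5": 250}   # 1,000 granular tests
scenario_2 = {"area 1": 25, "area 2": 5, "area 3": 15,
              "area 4": 30, "area 5": 25}     # 100 coarser tests

passing_areas = {"area 1", "area 4", "area 5"}  # areas 2 and 3 fail

def quality(tests_per_area, passing):
    passed = sum(n for area, n in tests_per_area.items() if area in passing)
    return round(100 * passed / sum(tests_per_area.values()))

print(quality(scenario_1, passing_areas))  # 66
print(quality(scenario_2, passing_areas))  # 80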

So if you are using a metric like this to track the in-flight quality of your application, then beware of its failings. Even if you are aware of them, be aware that your management and stakeholders probably aren’t, and are likely to be making decisions on these numbers which will most likely be wrong.

I am not even sure there is a way of measuring what the quality of your system will be before you have finished, and even then the result may be subjective, so you are better off, in my opinion, just reporting the facts of your findings in words. By all means use numbers if you want to support those words, but the words should play the lead role. Then let this information be the driver for your stakeholders to make the decisions on quality.

4 thoughts on “Measuring Quality during testing”

  1. What if we prioritize the test cases into High, Medium and Low and then execute the test cases based on priority? We can get a feel for whether the critical requirements are working and then get an overall feel for the application… Again, this is also not a foolproof mechanism.

    • George, firstly thank you for your comments. I think that although prioritisation is a good thing for a lot of reasons, I don’t believe it eliminates the issue with this particular measurement, as the measurement is not weighted against the prioritisation that has been set.
