BlueZannetti
July 11th, 2004, 01:59 PM
With the increasing level of small subsample testing of new AV and AT programs appearing here and elsewhere, I thought that it would be useful to consider the expected variability of these tests from a statistical perspective. There are two overriding questions:
1. Do the performance differences seen in these tests reflect real world performance differences?
2. Do minor or even moderate differences in test results actually discriminate between products, or are those results really the same?
The tests that I am focusing on are those that survey a few hundred to a thousand or so trojans/viruses/other malware with a group of products and then attempt to rate the products based on the results obtained. Although there has been some criticism about whether the samples consist entirely of verified malware, let's assume they do. In this regard, the results below will be a best case analysis.
The technical problem is as follows:
a. A total global population of N pieces of malware exists
b. These tests use a subsample of n pieces to reflect the global population
c. The scanning programs accurately identify some fraction (m/N) of the N pieces of malware (hopefully close to 100%). The remaining fraction ((N-m)/N) is missed, i.e. the program result is "defective".
d. How big does n have to be to obtain a reliable estimate of ((N-m)/N), that is, to reliably predict actual performance?
This problem is mathematically identical to subsampling a large population of objects for defects. How big a sample do you need to get a good handle on the average defect rate? Without trotting out all the technical details, the response follows a hypergeometric probability distribution function (Google if you're interested in details).
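For anyone who would rather not Google it, the standard hypergeometric results can be written out as below. The symbols X (the number of missed samples in a testbed of n) and K = N - m (the number of samples in the global population that the product misses) are introduced only for this sketch; N, n and m are as defined in the problem statement.

```latex
\[
P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \qquad K = N - m
\]
\[
\mathrm{E}[X] = \frac{nK}{N}, \qquad
\mathrm{Var}[X] = n\,\frac{K}{N}\left(1 - \frac{K}{N}\right)\frac{N-n}{N-1}
\]
\[
\frac{\sqrt{\mathrm{Var}[X]}}{\mathrm{E}[X]} = \sqrt{\frac{(N-K)(N-n)}{n\,K\,(N-1)}}
\]
```

The last ratio, the standard deviation of the miss count relative to its mean, is the measure of test-to-test variability used in the rest of this post.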
Let's look at a couple of numerical scenarios. Assume a product covers 99% of all threats. Assume that the global population of malware objects to be covered is 50,000. That means there are 500 genuine malware objects in that set of 50,000 which the product will not accurately flag. In terms of testing, we need to assess how many samples need to be tested to yield the expected result that this product is 99% effective (i.e., of x objects tested, 0.01x are missed). Some results are below. The first column is the number of samples in the testbed. The second column is the % of the global population represented. The final column is the standard deviation of the result relative to the mean, given as a percentage. I put the key result in this form since, on average, the correct result will be obtained if enough random samples are taken, but the distribution of those results is strongly dependent on testbed size, and this measure gives us some indication of how variable any single test may be.
Samples in    Fraction of total    Std. dev. of result
test bed      population           relative to mean
100           0.20%                99.4%
250           0.50%                62.8%
500           1.00%                44.3%
2500          5.00%                19.4%
5000          10.0%                13.3%
10000         20.0%                 8.9%
25000         50.0%                 4.4%
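The table above can be reproduced with a few lines of Python. This is only a sketch using the textbook hypergeometric mean and variance; the names relative_sd, K and testbed_sizes are my own and do not come from any library.

```python
from math import sqrt

def relative_sd(N, K, n):
    """Std. dev. of the miss count divided by its mean (hypergeometric).

    N: global population, K: items the product misses, n: testbed size.
    """
    return sqrt((N - K) * (N - n) / (n * K * (N - 1)))

N = 50_000   # global malware population from the scenario
K = 500      # items missed by a product with 99% coverage
testbed_sizes = [100, 250, 500, 2500, 5000, 10000, 25000]

# (testbed size, % of population sampled, relative std. dev. in %)
table = [(n, 100 * n / N, 100 * relative_sd(N, K, n)) for n in testbed_sizes]

for n, pct_pop, pct_sd in table:
    print(f"{n:>6}  {pct_pop:6.2f}%  {pct_sd:6.1f}%")
```

Setting K = 50 gives the corresponding figures for a product with 99.9% coverage.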
If the results are normally distributed (in this case they are not, strictly speaking, since results are bounded at zero on the lower end), roughly 68% of the time test results are expected to be within +/- one standard deviation (it's ~95% for +/- 2 standard deviations).
Obviously, you will obtain reliable predictive results only when the standard deviation is small relative to the mean. For this scenario, a 500 random sample testbed is completely abysmal in predicting performance. Whereas the test should indicate a 1% miss rate for the product, values between 0.56 and 1.44% (1% +/- 44.3% of 1%) cover only about 68% of the expected results, and 95% of the expected results are somewhere in the range of a 0.11 to 1.89% miss rate. Put another way, you can fully expect more than a quarter of the results (95% minus 68%) to lie between 0.11-0.56% or 1.44-1.89%, depending upon the specific samples in the testbed. Think about it. We could have 2 products, both with a nominal failure rate of 1%. There is a fairly substantial probability (casually taking 1-2 sigma mid-points here) that Product A could exhibit a failure rate of 0.34% while Product B's failure rate is 1.66% - almost 5 times higher! Yet, these are statistically identical products responding to a small sample challenge that happens to be either on the high or low failure rate wing of the distribution for the single challenge test. A result outside the one standard deviation window turns up roughly one-third of the time. These are fairly large windows.
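The normal approximation used above is rough for such small miss counts, so here is a sketch that computes the exact hypergeometric probabilities for the 500-sample testbed instead. The function name miss_count_pmf and the cut-offs (2 or fewer misses, 8 or more misses, i.e. observed rates of 0.4% or less and 1.6% or more) are my own illustrative choices, picked because miss counts in a real test are whole numbers. It needs Python 3.8+ for math.comb.

```python
from math import comb

def miss_count_pmf(N, K, n):
    """Exact hypergeometric pmf: entry k is P(exactly k misses in the testbed)."""
    total = comb(N, n)
    return [comb(K, k) * comb(N - K, n - k) / total
            for k in range(min(n, K) + 1)]

N, K, n = 50_000, 500, 500   # population, items missed (99% coverage), testbed
pmf = miss_count_pmf(N, K, n)

mean_misses = sum(k * p for k, p in enumerate(pmf))
p_low = sum(pmf[:3])    # 2 or fewer misses: observed miss rate <= 0.4%
p_high = sum(pmf[8:])   # 8 or more misses: observed miss rate >= 1.6%

print(f"mean misses in testbed: {mean_misses:.2f}")
print(f"P(miss rate <= 0.4%) = {p_low:.3f}")
print(f"P(miss rate >= 1.6%) = {p_high:.3f}")
```

Both tails belong to the same product with a true 1% miss rate, so the printed probabilities show how often a single test would make it look markedly better or worse than it is.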
Results do improve significantly when the sample size is increased to 2500 and 5000 (note - these are the sample sizes of tests such as those run by av-comparatives.org (www.av-comparatives.org/)). Even here, though, fine differences in performance may be within the expected noise of the test.
Note, as the defect rate goes down (i.e. better scan performance), the need for larger testbeds increases dramatically. If the scenario above were played out with products having 99.9% coverage, the results are:
Samples in    Fraction of total    Std. dev. of result
test bed      population           relative to mean
100           0.20%                315.8%
250           0.50%                199.4%
500           1.00%                140.6%
2500          5.00%                 61.6%
5000          10.0%                 42.4%
10000         20.0%                 28.3%
25000         50.0%                 14.1%
By the same token, if the defect rate is high, much smaller testbeds can yield reasonable performance metrics.
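To put rough numbers on how testbed size scales with the defect rate, the relative standard deviation formula can be turned around to give the smallest testbed that reaches a chosen noise level. This is a sketch under my own assumptions: the 10% target is purely illustrative, and required_testbed is a hypothetical helper, not something taken from a statistics package.

```python
from math import ceil, sqrt

def relative_sd(N, p, n):
    """Relative std. dev. of the miss count for miss rate p and testbed size n."""
    return sqrt((1 - p) / (n * p) * (N - n) / (N - 1))

def required_testbed(N, p, target):
    """Smallest testbed size whose relative std. dev. is at or below target.

    Obtained by solving relative_sd(N, p, n) = target for n and rounding up.
    """
    n = ceil((1 - p) * N / (target ** 2 * p * (N - 1) + (1 - p)))
    return min(n, N)

N = 50_000      # global malware population from the scenario
target = 0.10   # illustrative goal: std. dev. no more than 10% of the mean

sizes = {p: required_testbed(N, p, target) for p in (0.10, 0.01, 0.001)}
for p, n in sizes.items():
    print(f"miss rate {100 * p:5.1f}%: testbed of at least {n} samples")
```

The required size grows quickly as the miss rate shrinks, which is the same trend the two tables show.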
There are a multitude of details that one can quibble with in the analysis above. It is approximate, I realize this. I'm only trying to provide some objective indication as to what can happen if a randomly selected small sample testbed is employed. Performance can obviously be skewed well outside these bounds with a nonrandom testbed, and those results can be either on the positive or negative side of things.
The numbers above do not mean results presented in this forum and others are wrong. However, they do underscore the expected variability based on the use of a small challenge sample.
Fine differences between products cannot reasonably be expected to be uncovered in these tests, and it is possible that greater differences (or similarities) are also not revealed. In some cases the noise is such that these tests will clearly lose all discriminating power. Furthermore, it is very possible that the results can mislead casual readers if the intrinsic noise in the result is not appreciated.
Just a couple of thoughts to consider....
Blue