BlueZannetti
September 10th, 2009, 09:40 PM
Whenever a test result is released from AV-Comparatives.org (http://www.av-comparatives.org/), there's substantial commentary over which products are up, which are down, and which reside at the top of the numerical heap.
Much less attention is paid to the question of the intrinsic reproducibility of the test protocol. What's the noise level in the result? Everyone would probably agree that a 50% detection level is worse than a 95% detection result. However, what about 95% vs 99%? Are they really different? Is there any way to develop even a casual sense of "what's the same" vs. "what's different"?
As with any experimental determination, AV detection performance evaluations are affected by noise. The measurement noise derives from a number of discrete sources, including:
The timing with which signatures are added to the product relative to sample harvest and signature set lockdown.
The specific sample set harvested to establish the test bed. Is it an intrinsically "difficult" set with which all products have difficulty, or the converse?
Decisions made regarding what constitutes a malware sample by the vendor and the tester.
Sample set heterogeneity. In other words, how different are the various members of the sample set? Think of an extreme case in which all members are recast children of a single parent which one product detects and another does not, potentially yielding 100% and 0% detection respectively. That's an unrealistic extreme, but lesser variations are expected to play out in real life. If you doubt that this is a real factor, take a moment to look at the distribution of detection statistics at Shadowserver.org (http://www.shadowserver.org/wiki/pmwiki.php/Shadowserver/Shadowserver).
Any other uncontrolled sources of variation.
At this point there's a fairly rich body of data available at AV-Comparatives.org (http://www.av-comparatives.org/). One useful approach to normalizing the collected results is to examine the distribution of detection level changes between successive tests for the same product, pooled across all products examined. For example, take the change in % detected for Kaspersky from Feb 2004 to August 2004, then from August 2004 to Feb 2005, and so on. If the performance of the AVs were actually stationary (i.e. actual detection % constant), noise in the data would arise solely from test noise. That situation is not applicable here. The actual performance of the products is expected to vary over time, so this type of analysis looks at a combination of product and test variability.
Global average shifts in performance should be apparent from the location of the centroid of the distribution as well. If products are improving over time, a bias above 0 should be seen. By the same token, a centroid in the negative region would signify a net drop in performance over time.
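As a rough sketch of the bookkeeping described above, the snippet below pools test-to-test changes across products and reports the centroid. The product names and detection rates are made up for illustration (they are not AV-Comparatives figures), and `successive_changes` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical detection rates (%) per product, in chronological test order.
# These values are invented for illustration; they are not real test results.
results = {
    "Product A": [95.0, 97.5, 96.0, 98.5],
    "Product B": [88.0, 91.0, 90.5, 93.0],
}

def successive_changes(series):
    """Change in detection rate between each test and the one before it."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Pool the changes from every product into a single distribution.
deltas = [d for series in results.values() for d in successive_changes(series)]

# A centroid above 0 suggests a net improvement; below 0, a net decline.
centroid = mean(deltas)
print(deltas)
print(f"centroid = {centroid:+.2f} percentage points per test interval")
```

With real data, each list would hold one product's results from consecutive tests, and a product missing from a given test would need its pair skipped rather than bridged.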
An integral form representation of that distribution is shown in the plot below for 141 sequential data pairs spanning results published for the On-Demand tests from Feb 2004 to Feb 2009.
[Attachment 212130: plot of the integrated distribution of test-to-test changes in detection rate]
A number of numerical results are immediately apparent:
The distribution itself is reasonably represented by a classical normal distribution, although the wings are a bit wide for a genuinely Gaussian distribution. Since the actual result for any single test is bounded at 100% and we're close to that bound for a number of results, a truly normal distribution is not expected.
The centroid of the distribution is at +0.2% per test interval. Since the tests are run twice a year (February and August), this means that, on average, the detection rates of products were improving by ~0.4%/year from 2004 to 2009. So, despite the periodic proclamations of the "death of AV's", their performance has shown steady improvement as an industry class over the past 5 years.
If you attempt to estimate the standard deviation (σ) for the distribution, it comes out to ~ 3%. This holds reasonably well for ±1σ and ±2σ levels on the integrated curve (as it must for σ to be a useful population metric).
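The ±1σ and ±2σ check mentioned above can be sketched as follows: for a normal distribution, about 68% of values fall within ±1σ and about 95% within ±2σ, so the observed fractions can be compared against those figures. The sample values here are invented for illustration, and `normal_coverage` and `empirical_coverage` are hypothetical helper names, not part of the original analysis.

```python
import math
from statistics import mean, pstdev

def normal_coverage(k):
    """Fraction of a normal distribution lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))

def empirical_coverage(deltas, k):
    """Fraction of observed changes lying within +/- k sigma of their mean."""
    mu, sigma = mean(deltas), pstdev(deltas)
    inside = sum(1 for d in deltas if abs(d - mu) <= k * sigma)
    return inside / len(deltas)

# Made-up test-to-test changes (percentage points), for illustration only.
sample = [-6.0, -3.0, -2.0, -1.0, 0.0, 0.0, 1.0, 2.0, 3.0, 6.0]

for k in (1, 2):
    print(f"+/-{k} sigma: normal {normal_coverage(k):.3f}, "
          f"observed {empirical_coverage(sample, k):.3f}")
```

If the observed fractions track the normal values reasonably closely, σ is a fair summary of the spread; large departures would suggest that the tails matter more than σ alone conveys.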
Now, this is certainly not a rigorous statistical analysis. The base test itself appears to have evolved over the years. Think of it more as a numerically objective view of the collected data.
My own view is that this casual analysis reaffirms the notion that differences in detection levels of "a few percentage points" are actually inconsequential over the long term and shouldn't form the basis of a product shift, or user anxiety for that matter. Certain products are up some years, and down others. A specific example might be KAV/KIS over the August 2007 - February 2009 timeframe. Part of the variation seems related to testbed difficulty - a number of products experienced correlated dips in performance, but the correlation did not hold across all products. Naturally, if a change is not a transient excursion but a sustained shift in performance over time, that's a detail which warrants further evaluation.
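One rough way to apply this is to express a gap between two detection rates in units of the ~3% σ estimated above. Note the assumption: σ here describes test-to-test changes for a single product, so using it as a yardstick for a gap between two products is only a heuristic, not a significance test. `gap_in_sigma` is a hypothetical helper name.

```python
# Approximate standard deviation of test-to-test changes from the post,
# in percentage points. Treating it as a noise yardstick is an assumption.
SIGMA = 3.0

def gap_in_sigma(rate_a, rate_b, sigma=SIGMA):
    """Absolute difference between two detection rates, in units of sigma."""
    return abs(rate_a - rate_b) / sigma

# The two comparisons raised at the start of the post.
for a, b in [(95.0, 99.0), (50.0, 95.0)]:
    print(f"{a}% vs {b}%: {gap_in_sigma(a, b):.1f} sigma")
```

On this scale the 95% vs 99% gap sits at a bit over one σ, within the range the test-to-test scatter alone could plausibly produce, while the 50% vs 95% gap is many σ wide.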
Blue