AV Comparatives Detection Statistics - A Crude Meta-analysis

Discussion in 'other anti-virus software' started by BlueZannetti, Sep 10, 2009.

Thread Status:
Not open for further replies.
  1. BlueZannetti

    BlueZannetti Registered Member

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    Whenever a test result is released from AV-Comparatives.org, there's substantial commentary over which products are up, which are down, and which reside at the top of the numerical heap.

    Much less attention is paid to the question of the intrinsic reproducibility of the test protocol. What's the noise level in the result? Everyone would probably agree that a 50% detection level is worse than a 95% detection result. However, what about 95% vs. 99%? Are they really different? Is there any way to develop even a casual sense of "what's the same" vs. "what's different"?

    As with any experimental determination, AV detection performance evaluations are impacted by noise. The measurement noise is derived from a number of discrete sources including:
    • The time dependence by which signatures are added to the product relative to sample harvest and signature set lock down.
    • The specific sample set harvested to establish the test bed. Is it an intrinsically "difficult" set with which all products have difficulty, or the converse?
    • Decisions made by the vendor and by the tester regarding what constitutes a malware sample.
    • Sample set heterogeneity. In other words, how different are the various members of the sample set? Think of an extreme case in which all members are recast children of a single parent which one product detects and another does not, potentially yielding 100% and 0% detection respectively. That's an unrealistic extreme, but lesser variations are expected to play out in real life - if you don't think that's a real factor, take a moment to look at the distribution of detection statistics at Shadowserver.org
    • Any other uncontrolled sources of variation.
    At this point there's a fairly rich body of data available at AV-Comparatives.org. One useful approach to normalizing the collected results is to examine the distribution of detection level changes between successive tests for the same product, measured across all products examined. For example, the change in % detected for Kaspersky from Feb 2004 to August 2004, then from August 2004 to Feb 2005, and so on. If the performance of the AVs were actually stationary (i.e. actual detection % constant), noise in the data would arise solely from test noise. That situation is not applicable here. The actual performance of the products is expected to show variation over time, so this type of analysis looks at a combination of product and test variability.
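
    For the curious, the bookkeeping behind that is nothing fancy. A minimal sketch in Python is below - the product names and percentages are made-up placeholders rather than actual AV-Comparatives figures, and this is only an illustration of the successive-difference idea, not the spreadsheet itself:

    Code:
    from statistics import mean

    # Hypothetical % detected per product, in chronological test order.
    # Placeholder numbers, not real AV-Comparatives results.
    results = {
        "ProductA": [93.2, 95.1, 94.7, 96.0],
        "ProductB": [88.4, 90.2, 89.1, 91.5],
    }

    deltas = []
    for product, scores in results.items():
        # Change in % detected between each pair of successive tests.
        deltas.extend(later - earlier for earlier, later in zip(scores, scores[1:]))

    print(f"{len(deltas)} sequential data pairs, centroid = {mean(deltas):+.2f}%")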

    Global average shifts in performance should be apparent from the location of the centroid of the distribution as well. If products are improving over time, a bias above 0 should be seen. By the same token, a centroid in the negative region would signify a net drop in performance over time.

    An integral form representation of that distribution is shown in the plot below for 141 sequential data pairs spanning results published for the On-Demand tests from Feb 2004 to Feb 2009.

    [Attached image: AVComp stat distribution.png]
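
    If you want to reproduce the shape of that curve, a rough sketch of building the integral form (an empirical cumulative distribution) and overlaying a fitted normal is below. The deltas here are synthetic stand-ins drawn around the quoted centroid and σ, since the real values have to be compiled from the published On-Demand reports, and numpy/scipy are just one convenient choice of tooling, not necessarily what produced the plot:

    Code:
    import numpy as np
    from scipy.stats import norm

    # Synthetic stand-in data: 141 test-to-test changes drawn around the
    # quoted centroid (+0.2%) and sigma (~3%). The real values come from
    # the published On-Demand reports.
    rng = np.random.default_rng(0)
    deltas = rng.normal(loc=0.2, scale=3.0, size=141)

    x = np.sort(deltas)
    ecdf = np.arange(1, x.size + 1) / x.size              # fraction of deltas <= x
    fit = norm.cdf(x, loc=deltas.mean(), scale=deltas.std(ddof=1))

    # In the real data, "wide wings" show up as the empirical curve sitting
    # outside the fitted Gaussian at the extremes.
    for xi, e, f in zip(x[::20], ecdf[::20], fit[::20]):
        print(f"delta {xi:+6.2f}%   empirical {e:.2f}   normal fit {f:.2f}")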

    There are a number of immediate numerical results apparent:
    • The distribution itself is reasonably represented by a classical normal distribution, although the wings are a bit wide for a genuinely Gaussian distribution. Since the actual result for any single test is bounded at 100% and we're close to that bound for a number of results, a truly normal distribution is not expected.
    • The centroid of the distribution is at +0.2%. With roughly two On-Demand tests per year, this means the detection rates of products were, on average, improving by ~ 0.4%/year from 2004 to 2009. So, despite the periodic proclamations of the "death of AVs", their performance has shown steady improvement as an industry class over the past 5 years.
    • If you attempt to estimate the standard deviation (σ) for the distribution, it comes out to ~ 3%. This holds reasonably well for ±1σ and ±2σ levels on the integrated curve (as it must for σ to be a useful population metric).
    • Now, this is certainly not a rigorous statistical analysis. The base test itself has evolved over the years. Think of it more as a numerically objective view of the collected data.
    • My own view is that this casual analysis reaffirms the notion that differences in detection levels of "a few percentage points" are actually inconsequential over the long term and shouldn't form the basis of a product shift, or user anxiety for that matter (a rough numerical illustration follows this list). Certain products are up some years and down others. A specific example might be KAV/KIS over the August 2007 - February 2009 timeframe. Part of the variation seems related to testbed difficulty - a number of products experienced correlated dips in performance, but the correlation did not hold across all products. Naturally, if the trend is not a transient excursion, but more of a sustained shift in performance over time, that's a detail which warrants further evaluation.
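
    The rough numerical illustration mentioned above: taking the centroid and σ from this casual analysis at face value, you can ask how unusual a given test-to-test swing really is. This is a back-of-the-envelope noise-band check rather than a formal significance test, and the specific swings are made-up examples:

    Code:
    from math import erf, sqrt

    MU, SIGMA = 0.2, 3.0   # % points, taken from the casual analysis above

    def within_noise(change_pct, n_sigma=2.0):
        """True if a test-to-test change sits inside MU +/- n_sigma * SIGMA."""
        return abs(change_pct - MU) <= n_sigma * SIGMA

    for swing in (1.0, 4.0, 8.0):        # made-up example swings, in % points
        z = (swing - MU) / SIGMA
        tail = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # one-sided normal tail area
        print(f"{swing:+.1f}% swing: z = {z:+.2f}, inside the 2-sigma noise band: "
              f"{within_noise(swing)}, chance of a larger upward swing ~ {tail:.2f}")
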
    Blue
     
  2. cqpreson

    cqpreson Registered Member

    Joined:
    May 18, 2009
    Posts:
    348
    Location:
    China
    Thank you for the information. The test is improving, so AVs need to improve as well.
     
  3. bollity

    bollity Registered Member

    Joined:
    May 9, 2009
    Posts:
    190
    A math lecture.
     
  4. Fajo

    Fajo Registered Member

    Joined:
    Jun 13, 2008
    Posts:
    1,814
    OK, massive wall of text. Now my head hurts.
    English please. :doubt:


    THANKS BLUE!!!!!!! :D
     
  5. trjam

    trjam Registered Member

    Joined:
    Aug 18, 2006
    Posts:
    9,102
    Location:
    North Carolina USA
    Sounds to me like someone who normally does well didn't do so this time around. :cautious:
     
  6. Pedro

    Pedro Registered Member

    Joined:
    Nov 2, 2006
    Posts:
    3,502
    However, wouldn't it be more accurate to compare those products which are closer, those in the 95-99% range? Wouldn't including the others actually add noise to your analysis?
    My reasoning is that if you're including AVs with lower performance, which probably have bigger % changes from test to test, you're also pulling in their larger variance.

    Or in other words, you're arguing small differences in % are not that important, but you base that on data that has bigger differences.
    I actually think you're right, a 95% rate vs. 97% is not that big a deal, although if it is consistent it does matter. But if we do this analysis, we need some precautions so we don't end up making the data fit our prior judgement, and I think that applies here.
    What do you think?

    BTW, great post as always Blue!
     
    Last edited: Sep 11, 2009
  7. BlueZannetti

    BlueZannetti Registered Member

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    Well, look at the case of Avira over the years. That's a product which is a high performer, right? Look at its stats from Feb 2004 - Aug 2005 (% detection range is 71.6 - 93.0%). It occupies the wings as well.
    I took as global a view as possible and looked at it over the long haul. Let's take a look at the test average over the years:

    [Attached image: PopAvg revised.png]

    You can see that the % detected, averaged over all products, dropped 3.8% moving from Feb 2008 to Aug 2008. Some good performers followed this average behavior (BitDefender, F-Secure, Kaspersky, McAfee (yikes!), MS One Care, NOD32) while others maintained or inched up (Avira, AVK/G-Data, Avast, AVG, Norton, TrustPort). The average bounced back somewhat by Feb 2009 (0.7% recovery), as did many of the component test subjects. For some of the products, that dip was a one-off event and they recovered completely.
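
    The population-average bookkeeping, for what it's worth, is just averaging each test column of the compiled spreadsheet. A minimal sketch is below - the product names and percentages are invented placeholders, not the actual figures from the reports:

    Code:
    # Hypothetical % detected per product for three successive tests.
    test_dates = ["Feb 2008", "Aug 2008", "Feb 2009"]
    detections = {
        "ProductA": [97.5, 93.9, 94.8],
        "ProductB": [95.1, 95.4, 95.9],
        "ProductC": [92.0, 88.1, 88.6],
    }

    previous = None
    for i, date in enumerate(test_dates):
        avg = sum(scores[i] for scores in detections.values()) / len(detections)
        change = "" if previous is None else f" (change {avg - previous:+.1f}%)"
        print(f"{date}: population average = {avg:.1f}%{change}")
        previous = avg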

    In many discussions held here, a separation in scores of the order of 4% is often made much of. However, if you look at the numbers, that type of difference can appear as a very short-term transient (i.e. as noise). That noise may be a genuine case of the vendor taking their eye off the situation (example - McAfee from Feb 2008 - Feb 2009), or it could be an embedded bias in the test bed as it is developed. Either way, it is a short-term transient that can be gone by the time results are in and we are able to react. As such, one has to consider whether it's prudent to react to small fluctuations (I'd say no).

    You are right, consistency matters. However, to get a sense of how that really shakes out, one has to data mine the test reports over the long term, not take the latest example, proclaim a winner and then immediately move on.

    Blue
     
    Last edited: Sep 19, 2009
  8. dw2108

    dw2108 Registered Member

    Joined:
    Jan 24, 2006
    Posts:
    480
    Was this an LTG extrap? If so, back to SSM!

    Dave
     
  9. Pedro

    Pedro Registered Member

    Joined:
    Nov 2, 2006
    Posts:
    3,502
    You won't hear anything from me there, absolutely agreed.
    The Avira example threw me off; I need to get back to this when I get out of work.

    One quick question: did you compile those numbers yourself, or are they available at AVC? I only ever read the individual test reports, and in PDF :p
     
  10. BlueZannetti

    BlueZannetti Registered Member

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    I've basically compiled the overall stats myself for a while, updating the spreadsheet as new results appear.

    Blue
     
    Last edited: Sep 12, 2009
  11. sourav_gho

    sourav_gho Registered Member

    Joined:
    May 22, 2009
    Posts:
    141
    Hey Blue,
    One of the best comparisons could also be the RAP test quadrant graph compiled from February to August 2009. Since it is compiled over a long period of time, I think it is also an effective measure of an AV.
     


  12. BlueZannetti

    BlueZannetti Registered Member

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    Absolutely. That's a nice quick way to graphically assess overall effectiveness of a product and it provides a similar breakout.

    I was trying to answer a somewhat different question - the noise in the test protocol (regardless of source) - which ultimately does come down to assigning an overall product effectiveness from the perspective of their numerical detection level.

    Both approaches tend to develop the examination as a cluster analysis problem. The VB RAP approach is fairly explicit about this, presenting the results as a graphical, two-dimensional, cluster-based view.

    It's pretty easy to visually discriminate the product sets at the above/below 70% reactive detection level. You're left with a rather ill-defined swarm of points in that upper right-hand quadrant. One could leave it at that, or try to apply some additional discrimination tests within that set (i.e. are there subfamilies within that set?).

    The approach I used is at the one-dimensional level. I really only considered what's equivalent to the Y-axis ranking of the VB RAP presentation and then tried to put some objectively determined error bars on those ranked points along the Y axis. Large error bars will obviously take the discrete points and merge them into less defined zones (this is really one of the reasons that readers should pay more attention to the certification levels provided than to the raw numerical results). The same thing would happen in two dimensions: rather than small points, results would be shown as rather large circles with extensive overlap if they're close together.
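
    As a concrete, if oversimplified, sketch of that "error bars merge points into zones" idea: treat each score as score ± σ and call two products separable only when their intervals don't overlap. The σ here is the rough ~3% noise estimate from the first post and the scores are invented examples, so treat it purely as an illustration of the bookkeeping:

    Code:
    SIGMA = 3.0   # rough test-to-test noise estimate (% points) from the first post

    # Invented example scores, not actual test results.
    scores = {"ProductA": 98.7, "ProductB": 97.9, "ProductC": 93.5, "ProductD": 90.0}

    def same_zone(a, b, sigma=SIGMA):
        """True if the intervals a +/- sigma and b +/- sigma overlap."""
        return abs(a - b) <= 2 * sigma

    products = list(scores)
    for i, p in enumerate(products):
        for q in products[i + 1:]:
            verdict = "same zone" if same_zone(scores[p], scores[q]) else "separable"
            print(f"{p} ({scores[p]}%) vs {q} ({scores[q]}%): {verdict}")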

    Both approaches attempt to provide some level of quantification to the question of "How far apart do the results have to be to be considered really different?"

    Blue
     