Methodology for AV tests

JerryM · Sep 20, 2009

It is evident that various testing organizations, or published tests, do not reach the same conclusions as to the best AV in many instances.
I have not been involved, and have not explored the details. I do wonder how the tests are conducted.

Are tests conducted with a clean system, and then as if browsing or opening an attachment at attempt is made to introduce malware during that time? Or is malware placed on a computer and then the AVs tested to see if they can detect it?

It would seem to me that the most important requirement for an AV is prevention. I am not sure how such things as firewalls and anti-malware components of IS type products enter into the equation/tests. But with the additional components/modules of an IS application it would seem to be superior in protection to a stand alone AV. I do realize that layered protection can make up the differences, but the suites are never tested, and it might not be feasible considering all the possibilities.

So if I choose an AV or IS based upon only tests like AVC how have I erred, assuming that the application runs well on my system? Yet when tests are published many claim that the tests are not particular meaningful and that detection rates are not the primary criterion. But if detection is not the primary purpose of an AV what is?
Why would I not prefer an AV with 98% detection rates over one that has a 94% rate?

I do recognize that such things as scan speeds, updates, and FPs are important, but such things cannot trump detection rates, within reasonable limits, as I see it.

So far I have never had an infection without such things as sandboxes.

Sorry I don't have the knowledge to ask a more coherent question.

Regards,
Jerry

Thankful · Sep 20, 2009

If you go www.av-comparatives.org and click on
'Comparatives/Reviews' there is a file 'methodology.pdf' that may answer many of your questions.
The file appears under title 'Comparatives & Reviews' and within sentence
'Our sorting and testing methodology and the FAQ's can be found here (PDF). [08/2008]'

JerryM · Sep 20, 2009

Thanks, Thankful, I'll take a look.

Regards,
Jerry

kwismer · Sep 21, 2009

JerryM said:

It is evident that various testing organizations, or published tests, do not reach the same conclusions as to the best AV in many instances.
Click to expand...

long ago (15 years or more if i'm not mistaken) a very smart guy by the name of vesselin bontchev wrote "there is no best av, there are only very good ones and ones you should probably just ignore". he wrote this, by the way, while working with the virus test center at university of hamburg. in fact, i believe it was actually IN the documentation of one of the VTC's tests.

I have not been involved, and have not explored the details. I do wonder how the tests are conducted.
Click to expand...

there are all kinds of differences from one testing organization to another. and even though the anti-malware testing standards organization hopes to produce testing procedures that testing organizations will follow (and thus make their methodologies more consistent with each other) there will still be differences in the malware sample sets themselves (you need samples to test with and every organization has their own malware library).

as such, don't ever expect one organization to be able to reproduce the results of another organization - it's just never going to happen.

So if I choose an AV or IS based upon only tests like AVC how have I erred, assuming that the application runs well on my system? Yet when tests are published many claim that the tests are not particular meaningful and that detection rates are not the primary criterion. But if detection is not the primary purpose of an AV what is?
Why would I not prefer an AV with 98% detection rates over one that has a 94% rate?
Click to expand...

there isn't a significant enough difference (detection-wise) between those two options to make a decision. you need more data. detection capability is definitely important, but (referring to the earlier vess quote) those are both very good products so you're going to need additional criteria in order to narrow things down to just one.

scott1256ca · Sep 21, 2009

You would think that 98% should be more significant than 94% if they are testing over a sample size of 1.6million samples

Don't forget that the testers don't all test the exact same build with the exact same sig database, or on the same date (or same time with same cloud if appropriate), so the results are bound to vary because of this as well.

Read the post by BlueZanneti (sp?) talking about analyzing the results over time as well. That gives you an idea of how to determine which AV's are likely to do well in the near future.

These tests are generally a snapshot of how well the AV would protect on that day, where you want protection over months or years (when you might change your AV)

Also, I see some mention the focus on prevention in a companies suite, as opposed to just how well the AV does. While I don't disagree that having an AV shouldn't be your only protection, I also don't see any reason why I should put up with a mediocre AV just because the rest of the suite is decent. It isn't an excuse for a mediocre AV.

If can disable the mediocre AV, and plug in Avast! or Avira to get protection on at least that level. Why should I not? I know there can be some advantages to one manufacturer providing the suite in terms of overlap of responsibilities (i.e. if the AV has a classical HIPS component, and so does the firewall, which should I use, or should I use both?), but a little testing should tell me if they are compatible or not. I guess the worry is that using 2 components will leave some gap in my security that I am unaware of.

I'd rather see the tests include the range of possible intrusions, but if you apply my mix and match technique, there would be too many combinations to test and find the weaknesses.

BlueZannetti · Sep 21, 2009

scott1256ca said:

You would think that 98% should be more significant than 94% if they are testing over a sample size of 1.6million samples
Click to expand...

You might think that..., but there's a good chance it would be wrong...

Don't forget that the testers don't all test the exact same build with the exact same sig database, or on the same date (or same time with same cloud if appropriate), so the results are bound to vary because of this as well.
Click to expand...

Correct. But this is giving you a sense of the intrinsic variability in the result due to the fluid nature of all factors (test bed, quality of the test bed, signatures, and so on)

Read the post by BlueZanneti (sp?) talking about analyzing the results over time as well. That gives you an idea of how to determine which AV's are likely to do well in the near future.
Click to expand...

Well..., that wasn't really the objective of the post on AV Comparatives Detection Statistics - A Crude Meta-analysis. The stepping off point for that analysis was the assumption that over short timeframes (and I'm positing that 6 months is short), the inherent capabilities of a product are stationary.

That is, all factors being equal (test bed quality, heterogeneity, field harvest of samples at the vendor, etc.), you shouldn't see dramatic shifts in product performance, therefore the shifts you see are "noise" in the result. Now, that "noise" can be a vendor taking their eyes off the ball for a short time to yield a real, but very transient, spike down is detection statistics. McAfee from Feb 2008 - Feb 2009 is an example of this as detection went from 94.9% to 84.4% to 99.1%. Naturally, the opposite of really focusing for a short time to get a high level of detection can occur as well. The punchline though is that both changes are transient and if you're making product decisions based on this, particularly if you're looking at only the latest test, you're really simply reacting to noise.

I tried to put some firm numbers around the "noise" (all sources, including vendor based), and it seems that the standard deviation of the AV-Comparatives test is ~3%, which reflects the more casual statement attributed to Vesselin Bontchev (still at F-Prot I assume) above.

These tests are generally a snapshot of how well the AV would protect on that day, where you want protection over months or years (when you might change your AV)
Click to expand...

Exactly!

Also, I see some mention the focus on prevention in a companies suite, as opposed to just how well the AV does. While I don't disagree that having an AV shouldn't be your only protection, I also don't see any reason why I should put up with a mediocre AV just because the rest of the suite is decent. It isn't an excuse for a mediocre AV.
Click to expand...

Except that the AV is really the heart for everything else in these types of products.

If can disable the mediocre AV, and plug in Avast! or Avira to get protection on at least that level. Why should I not? I know there can be some advantages to one manufacturer providing the suite in terms of overlap of responsibilities (i.e. if the AV has a classical HIPS component, and so does the firewall, which should I use, or should I use both?), but a little testing should tell me if they are compatible or not. I guess the worry is that using 2 components will leave some gap in my security that I am unaware of.
Click to expand...

More seriously, you need to weigh whether an intrinsic compatibility issue may emerge on an update and wreak havoc out of the blue.

I'd rather see the tests include the range of possible intrusions, but if you apply my mix and match technique, there would be too many combinations to test and find the weaknesses.
Click to expand...

Operationally, as you note, too many combinations....

Blue

BlueZannetti · Sep 21, 2009

One additional point....

One aspect of any test that I'm sure everyone knows in principle, but we sometimes lose sight of it, is that the detection statistics weigh all samples equally. The problem with this is that, a priori, we have no way of ascertaining whether or not this is true and plenty of reasons to believe that it's false.

This statement is true even if the testbed were extensively culled of junk samples, which is extremely difficult given the shear size of testbeds now employed. The ballooning testbed size is an attempt to be as comprehensive as possible, but it does tend to let junk into the evaluation. However, even if all of the junk were eliminated, we're still left with the implicit equal weighting applied to all samples, which is a real issue in looking at the numbers.

The problem is..., there's no reasonable scheme to use to apply a rational weighting factor (taking into account intrinsic maliciousness (whatever that is, perhaps it's damage potential), frequency seen in the field, propensity to propagate, and so on).

In other words, the raw numbers are a lot less cooked than many seem to believe.

Blue

dawgg · Sep 21, 2009

JerryM said:

Yet when tests are published many claim that the tests are not particular meaningful and that detection rates are not the primary criterion. But if detection is not the primary purpose of an AV what is?
Click to expand...

IMO, prevention (of infection) is the primary purpose of security software. AVs are a branch of this, historically being only "detection", but now also diversing into using other forms and methods of protection.

So, detection a proxy of "prevention" abilities of AVs, arguably one of the most important, but not the only of importance.

Dynamic tests (of live malware in the wild) is possibly a better test to see real-world, user experience of infections because they show actual protection abilities of software (be it AV or IS), although these are much harder to do with a significant number of malware and has much more space for caveats and criticism - especially because these tests often lack assessment of "usability" with legitimate software and even when there is assessment of usability, there is a large subjective judgment on the tester's part (do the alerts give enough description, are they easy for a newbie to read and define what should be the next course of action - which brings the other argument of what settings should be used - Advanced of Automatic etc)

I think AVC will be releasing a dynamic test "soon", as it says in the AVC August 2009 report.

andyman35 · Sep 21, 2009

All these tests are an interesting general guide but that's it. As BlueZanetti mentioned the maliciousness of the missed samples isn't shown on the whole.

So,for example AV A may score 99.4% but miss all the most damaging malware in it's 0.6% failures,while AV B,with a 90.5% rating may detect all the nastiest stuff and it's 9.5% failures be comprised of relatively harmless items.

Of course you'd hope that by aggregating numerous tests over a period of time this would lessen such a likelihood.

Inspector Clouseau · Sep 21, 2009

Correct. Most of the "modern" malware is in a so called "1 to 2 days life cycle". I'm not saying that detecting these samples after weeks isn't important but it's more important that you detect them as they appear. Originally that was the purpose of the so called Wildlist Organisation. However, they only listed malware that has been spotted "in the wild" and the list wasn't updated fast enough to reflect real world detections. It doesn't make sense to update this list a month later.

Testing AV Products is easy. Testing AV products in the proper way is next to impossible as long you don't work in this business yourself. For the following reasons:

1. Every AV tester needs to determine between clean and malicious files without the help of AV scanners. (He needs to be able to disassemble / reverse engineer malware and he has to draw it's own conclusion why something is malicious or isn't.) Because otherwise you rely on work for the very same companies that you are going to test. That's basically the same if you wanna test lets say the quality of food. You simply can't give that food to the very same companies (or their competition for that matter) to get results if it's good or not. You have to determine that yourself. Sure, pre-selecting samples with the help of av scanners is a bonus and almost impossible to avoid with the lots of malware currently. However, if someone complains that some sample is not detected you need to be able to defend your findings and "Kaspersky (or someone else) detected it" is NOT a technical suitable explanation. Because they may find parts of a recently virus cleaned file. If the full Virusbody (including the jump into the malicious code) isn't present anymore the file should NOT BE CONSIDERED TO BE MALICIOUS. Because it isn't. Even if other scanners say so. The only way to find that out is firing up IDA Pro or Ollydebug and disassemble that and find out based on what others finding that. Then you have to research if this code

a) is malicious as such
b) gets activated under specific circumstances

Good luck with that without being a professional researcher (or having researchers working for you)

The next important point is (someone pointed that already out) that you need to know priorities. Means what is current malware. Because testing old stuff (and believe me... even 3 months old stuff you can already consider STONE OLD currently) isn't the problem currently. Just go and download server side created malware. You'd be shocked how many AV companies that do well in these AV tests miss this kind of stuff. So the whole test becomes almost meaningless to you if your favorite AV, the one you're using in your corporate network, scores constantly high in all kinds of av tests and then people getting infected with so called 0-1 day malware. Sure, that stuff gets probably added A MONTH LATER to the so called AV Tester Collections but still where's the test result when that stuff came out BRAND NEW?

There's so many details that you need to know to be a successful AV tester it's more than all text together that has been posted here in the past week in the entire forum. You know that stuff or you don't. Most of the testers don't.

Mike

Sjoeii · Sep 21, 2009

Interesting Mike.
So this is something Vipre is working hard on?

subset · Sep 21, 2009

Inspector Clouseau said:

Most of the "modern" malware is in a so called "1 to 2 days life cycle".
Click to expand...

Panda points out that 52 % of new viruses only last for 24 hours and that there are about 37.000 pieces of new malware every day.
http://www.pandasecurity.com/homeusers/media/press-releases/viewnews?noticia=9804
Are these informations approximately correct?

If so, the scan time for big AV testers collection would most likely take longer than the lifetime of the major part of the in the wild malware is.

Cheers

Miyagi · Sep 21, 2009

Which pretty much debates the AV testing as a whole. But before I get in deep kimchee, the end-users need to understand the methodology rather than picking up someone's shrimp tail out of the sea. Dissect the testing like how BlueZannetti did with his latest graph and ask questions to the experts.

JerryM · Sep 21, 2009

Miyagi said:

Which pretty much debates the AV testing as a whole. But before I get in deep kimchee, the end-users need to understand the methodology rather than picking up someone's shrimp tail out of the sea. Dissect the testing like how BlueZannetti did with his latest graph and ask questions to the experts.
Click to expand...

While that would be great, many users myself included, do not have the expertise to conduct tests and do not really understand the methodology. Accordingly we have to rely on published tests, and it is sometimes difficult to know how to properly discern what is best.

For me I have relied on AV Comparatives for several years, as I have confidence in the tests. Although I agree that prevention is the primary criterion, it would seem that detection rates somewhat parallel prevention.

Regards,
Jerry

lafiel · Sep 22, 2009

subset said:

Panda points out that 52 % of new viruses only last for 24 hours and that there are about 37.000 pieces of new malware every day.
http://www.pandasecurity.com/homeusers/media/press-releases/viewnews?noticia=9804
Are these informations approximately correct?

If so, the scan time for big AV testers collection would most likely take longer than the lifetime of the major part of the in the wild malware is.

Cheers
Click to expand...

It can't be helped.
A real collections that tested by AV vendors consists of billions of samples.
And they all must be detected, because no one will give you the assurance you do not catch "24 hours life" malware that was spread half year or two years ago.

Rocko · Sep 22, 2009

NSS Labs' new anti-malware test methodology addresses many of the topics discussed on this thread: live testing, reputation system impact, proactive protection measurement, and over-time testing, as well as dynamic execution.

You'll find the results and a lengthy methodology discussion in the test report here: http://nsslabs.com/anti-malware

The short key findings from this difficult real-world test is that nobody is getting 99 or 100%.
# Products vary greatly in their ability to stop socially engineered malware
# Proactive 0-hour protection ranged from 29% to 64%
# Overall protection varied between 62% and 93%

dawgg · Sep 22, 2009

Rocko said:

NSS Labs' new anti-malware test methodology addresses many of the topics discussed on this thread: live testing, reputation system impact, proactive protection measurement, and over-time testing, as well as dynamic execution.

You'll find the results and a lengthy methodology discussion in the test report here: http://nsslabs.com/anti-malware

The short key findings from this difficult real-world test is that nobody is getting 99 or 100%.
# Products vary greatly in their ability to stop socially engineered malware
# Proactive 0-hour protection ranged from 29% to 64%
# Overall protection varied between 62% and 93%
Click to expand...

Thanks, interesting test.

Fly · Sep 22, 2009

Rocko said:

NSS Labs' new anti-malware test methodology addresses many of the topics discussed on this thread: live testing, reputation system impact, proactive protection measurement, and over-time testing, as well as dynamic execution.

You'll find the results and a lengthy methodology discussion in the test report here: http://nsslabs.com/anti-malware

The short key findings from this difficult real-world test is that nobody is getting 99 or 100%.
# Products vary greatly in their ability to stop socially engineered malware
# Proactive 0-hour protection ranged from 29% to 64%
# Overall protection varied between 62% and 93%
Click to expand...

I have some doubts about the validity of those tests.

Trend Micro performs rather good. Too good to be true, I think.

Avira is not even part of the test.

I'll dismiss this test as irrelevant.

scott1256ca · Sep 22, 2009

re: "NSS Labs"

Does Trend Micro's "in the cloud" detection rely particularly heavily on the "reputation" component? This seems to have given them quite a boost in detection. I wonder what it would do for false positives (which seems to be excluded from this test). The claim is that it boosted their detection by 23% which to me would imply that this is looking at blacklisted elements. Would that be correct?

Trend Micro also benefits greatly from the "time to block" numbers. This certainly favours "in the cloud" testing, and maybe that is a good thing. What I didn't see here, and maybe I missed it, was anything related to WHO was in the cloud for those products which use one. In other words, if a particular vendor had a large cloud, that would certainly influence the results. The size of that cloud would have to be representative of the cloud the average user would see. Would that be correct? I'm not very familiar with "in the cloud" products so maybe I am making incorrect assumptions.

Also this testing assigns no weighting as to "how bad" the malware is. At least not that I saw. As has been pointed out in this thread, that might be pretty significant, if one product views a set of behaviours as threatening and blocks the "malware" and another product doesn't view it as threatening and allows it. Of course, then you need to decide if it was malware or not.

Kees1958 · Sep 22, 2009

Inspector Clouseau said:

Correct. Most of the "modern" malware is in a so called "1 to 2 days life cycle". I'm not saying that detecting these samples after weeks isn't important but it's more important that you detect them as they appear.
Mike
Click to expand...

Mike may I ask you a hypothetical question:

When you take a top tier AV and only get the 'in the wild blacklist' (say of the last two months), what would be its practical/effective reduction in protection?

Would it be
wild (=last 2 months) versus wild + old + zoo = 98% versus 99% protection?

To make your statements even better to understand
What when I would reduce the in the wild to only the blacklist fingerprints of the last two weeks in stead of the last two months, what would be the protection reduction of
wild (=last two weeks) versus etc = still 98% versus 99% or what?

Thanks in advance

regards Kees

trjam · Sep 22, 2009

I think it would be just as close for some for malware that is one or two days old. Look at Matt with Remove-Malware, this isnt about his videos but the malware he uses. He comments that the links are fresh sometimes a few hours old and the AVs hold up well to them. So I think it is the same or close to it.

But IC is right, even Matts samples are inactive in just a few hours, they move so quick.

subset · Sep 22, 2009

lafiel said:

And they all must be detected, because no one will give you the assurance you do not catch "24 hours life" malware that was spread half year or two years ago.
Click to expand...

Yes, but if you would compare the detections (and infections...) of a considerable number of users within e.g. one month after a test - how many of these samples were part of the collection the tester has used?
Less or more than half of it?

Malware industry says "speed kills".
AV industry says "that's why we like on demand testers most, they give us time to respond".

Cheers

Log in or Sign up

Methodology for AV tests

JerryM Registered Member

Thankful Savings Monitor

JerryM Registered Member

kwismer Registered Member

scott1256ca Registered Member

BlueZannetti Registered Member

BlueZannetti Registered Member

dawgg Registered Member

andyman35 Registered Member

Inspector Clouseau AV Expert

Sjoeii Registered Member

subset Registered Member

Miyagi Registered Member

JerryM Registered Member

lafiel AV Expert

Rocko Registered Member

dawgg Registered Member

Fly Registered Member

scott1256ca Registered Member

Kees1958 Registered Member

trjam Registered Member

subset Registered Member

Log in or Sign up

Methodology for AV tests

JerryM Registered Member

Thankful Savings Monitor

JerryM Registered Member

kwismer Registered Member

scott1256ca Registered Member

BlueZannetti Registered Member

BlueZannetti Registered Member

dawgg Registered Member

andyman35 Registered Member

Inspector Clouseau AV Expert

Sjoeii Registered Member

subset Registered Member

Miyagi Registered Member

JerryM Registered Member

lafiel AV Expert

Rocko Registered Member

dawgg Registered Member

Fly Registered Member

scott1256ca Registered Member

Kees1958 Registered Member

trjam Registered Member

subset Registered Member

Useful Searches