Methodology for AV tests

Discussion in 'other anti-virus software' started by JerryM, Sep 20, 2009.

Thread Status:
Not open for further replies.
  1. JerryM

    JerryM Registered Member

    Joined:
    Aug 31, 2003
    Posts:
    4,306
    It is evident that various testing organizations, or published tests, do not reach the same conclusions as to the best AV in many instances.
    I have not been involved, and have not explored the details. I do wonder how the tests are conducted.

    Are tests conducted with a clean system, and then, as if browsing or opening an attachment, an attempt is made to introduce malware during that time? Or is malware placed on a computer and then the AVs tested to see if they can detect it?

    It would seem to me that the most important requirement for an AV is prevention. I am not sure how such things as firewalls and anti-malware components of IS type products enter into the equation/tests. But with the additional components/modules of an IS application, it would seem to be superior in protection to a stand-alone AV. I do realize that layered protection can make up the differences, but the suites are never tested as a whole, and it might not be feasible considering all the possibilities.

    So if I choose an AV or IS based only upon tests like AVC, how have I erred, assuming that the application runs well on my system? Yet when tests are published, many claim that the tests are not particularly meaningful and that detection rates are not the primary criterion. But if detection is not the primary purpose of an AV, what is?
    Why would I not prefer an AV with 98% detection rates over one that has a 94% rate?

    I do recognize that such things as scan speeds, updates, and FPs are important, but such things cannot trump detection rates, within reasonable limits, as I see it.

    So far I have never had an infection, even without such things as sandboxes.

    Sorry I don't have the knowledge to ask a more coherent question.

    Regards,
    Jerry
     
  2. Thankful

    Thankful Savings Monitor

    Joined:
    Feb 28, 2005
    Posts:
    6,564
    Location:
    New York City
    If you go to www.av-comparatives.org and click on
    'Comparatives/Reviews', there is a file 'methodology.pdf' that may answer many of your questions.
    The file appears under the title 'Comparatives & Reviews', within the sentence
    'Our sorting and testing methodology and the FAQ's can be found here (PDF). [08/2008]'
     
  3. JerryM

    JerryM Registered Member

    Joined:
    Aug 31, 2003
    Posts:
    4,306
    Thanks, Thankful, I'll take a look.

    Regards,
    Jerry
     
  4. kwismer

    kwismer Registered Member

    Joined:
    Jan 4, 2008
    Posts:
    240
    long ago (15 years or more if i'm not mistaken) a very smart guy by the name of vesselin bontchev wrote "there is no best av, there are only very good ones and ones you should probably just ignore". he wrote this, by the way, while working with the virus test center at university of hamburg. in fact, i believe it was actually IN the documentation of one of the VTC's tests.

    there are all kinds of differences from one testing organization to another. and even though the anti-malware testing standards organization hopes to produce testing procedures that testing organizations will follow (and thus make their methodologies more consistent with each other) there will still be differences in the malware sample sets themselves (you need samples to test with and every organization has their own malware library).

    as such, don't ever expect one organization to be able to reproduce the results of another organization - it's just never going to happen.

    there isn't a significant enough difference (detection-wise) between those two options to make a decision. you need more data. detection capability is definitely important, but (referring to the earlier vess quote) those are both very good products so you're going to need additional criteria in order to narrow things down to just one.
     
  5. scott1256ca

    scott1256ca Registered Member

    Joined:
    Aug 18, 2009
    Posts:
    144
    You would think that 98% should be more significant than 94% if they are testing over a sample size of 1.6 million samples :)
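
    A rough back-of-the-envelope sketch of why sample size alone can mislead here: with 1.6 million samples the pure counting (binomial) error on a detection rate is tiny, so a 98% vs 94% gap is far larger than sampling noise - but that says nothing about whether the testbed itself is representative. This is only an illustration (Python, with the testbed size taken from the figure above):

        import math

        def binomial_stderr(rate, n):
            """Standard error of a detection rate measured over n samples."""
            return math.sqrt(rate * (1.0 - rate) / n)

        n = 1_600_000                      # approximate testbed size
        for rate in (0.98, 0.94):
            se = binomial_stderr(rate, n)
            print(f"{rate:.0%} detection: sampling error ~ +/- {2 * se:.4%} (95% CI)")

        # The pure sampling error is a few hundredths of a percent, so the
        # 98% vs 94% gap is not counting noise; the real uncertainty comes
        # from which samples end up in the testbed, not from the sample count.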

    Don't forget that the testers don't all test the exact same build with the exact same sig database, or on the same date (or same time with same cloud if appropriate), so the results are bound to vary because of this as well.

    Read the post by BlueZannetti talking about analyzing the results over time as well. That gives you an idea of how to determine which AVs are likely to do well in the near future.

    These tests are generally a snapshot of how well the AV would protect on that day, whereas you want protection over months or years (when you might change your AV).

    Also, I see some mention of the focus on prevention in a company's suite, as opposed to just how well the AV does. While I agree that an AV shouldn't be your only protection, I also don't see any reason why I should put up with a mediocre AV just because the rest of the suite is decent. A decent suite isn't an excuse for a mediocre AV.

    If I can disable the mediocre AV and plug in Avast! or Avira to get at least that level of protection, why should I not? I know there can be some advantages to one manufacturer providing the suite in terms of overlap of responsibilities (i.e. if the AV has a classical HIPS component, and so does the firewall, which should I use, or should I use both?), but a little testing should tell me if they are compatible or not. I guess the worry is that using two components will leave some gap in my security that I am unaware of.

    I'd rather see the tests include the range of possible intrusions, but if you apply my mix-and-match approach, there would be too many combinations to test to find the weaknesses.
     
  6. BlueZannetti

    BlueZannetti Registered Member

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    You might think that..., but there's a good chance it would be wrong...
    Correct. But this is giving you a sense of the intrinsic variability in the result due to the fluid nature of all factors (test bed, quality of the test bed, signatures, and so on).
    Well..., that wasn't really the objective of the post on AV Comparatives Detection Statistics - A Crude Meta-analysis. The stepping off point for that analysis was the assumption that over short timeframes (and I'm positing that 6 months is short), the inherent capabilities of a product are stationary.

    That is, all factors being equal (test bed quality, heterogeneity, field harvest of samples at the vendor, etc.), you shouldn't see dramatic shifts in product performance, therefore the shifts you see are "noise" in the result. Now, that "noise" can be a vendor taking their eyes off the ball for a short time to yield a real, but very transient, spike down in detection statistics. McAfee from Feb 2008 - Feb 2009 is an example of this as detection went from 94.9% to 84.4% to 99.1%. Naturally, the opposite of really focusing for a short time to get a high level of detection can occur as well. The punchline though is that both changes are transient, and if you're making product decisions based on this, particularly if you're looking at only the latest test, you're really simply reacting to noise.

    I tried to put some firm numbers around the "noise" (all sources, including vendor-based), and it seems that the standard deviation of the AV-Comparatives test is ~3%, which reflects the more casual statement attributed to Vesselin Bontchev (still at F-Prot I assume) above.
    Exactly!
    Except that the AV is really the heart of everything else in these types of products.
    More seriously, you need to weigh whether an intrinsic compatibility issue may emerge on an update and wreak havoc out of the blue.
    Operationally, as you note, too many combinations....
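
    Going back to the ~3% noise estimate, here is a minimal sketch (Python) of how one might read a single result against that noise - using only the McAfee figures and the ~3% estimate quoted above:

        from statistics import mean

        # Detection rates quoted above for one vendor across three
        # consecutive tests (Feb 2008 - Feb 2009), plus the ~3% noise
        # estimate for the AV-Comparatives test as a whole.
        scores = [0.949, 0.844, 0.991]
        test_noise = 0.03

        avg = mean(scores)
        for s in scores:
            sigmas = (s - avg) / test_noise
            print(f"{s:.1%}: {sigmas:+.1f} sigma from this product's own average")

        # A result within a sigma or two of the running average is best read
        # as noise; a one-off excursion like 84.4% is real but transient, so
        # a decision based only on the latest test mostly reacts to that swing.
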

    Blue
     
  7. BlueZannetti

    BlueZannetti Registered Member

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    One additional point....

    One aspect of any test that I'm sure everyone knows in principle, but we sometimes lose sight of, is that the detection statistics weight all samples equally. The problem with this is that, a priori, we have no way of ascertaining whether or not this weighting is appropriate, and there are plenty of reasons to believe that it isn't.

    This statement is true even if the testbed were extensively culled of junk samples, which is extremely difficult given the sheer size of the testbeds now employed. The ballooning testbed size is an attempt to be as comprehensive as possible, but it does tend to let junk into the evaluation. However, even if all of the junk were eliminated, we're still left with the implicit equal weighting applied to all samples, which is a real issue in looking at the numbers.

    The problem is..., there's no reasonable scheme for applying a rational weighting factor (taking into account intrinsic maliciousness (whatever that is, perhaps it's damage potential), frequency seen in the field, propensity to propagate, and so on).

    In other words, the raw numbers are a lot less cooked than many seem to believe.
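
    Purely to make the arithmetic of that point concrete - and with a weighting that is entirely invented, since no rational scheme exists - here is what a weighted detection figure could look like next to the usual unweighted one (Python):

        # Hypothetical samples: (detected_by_product, weight). The weight
        # stands in for whatever factor one wished existed - prevalence in
        # the field, damage potential, propensity to propagate - none of
        # which the published percentages actually use.
        samples = [
            (True,  1.0),   # common, low-impact junk
            (True,  1.0),
            (True,  1.0),
            (True,  1.0),
            (False, 25.0),  # rare but highly damaging sample that was missed
        ]

        unweighted = sum(1 for d, _ in samples if d) / len(samples)
        weighted = sum(w for d, w in samples if d) / sum(w for _, w in samples)

        print(f"unweighted detection: {unweighted:.0%}")   # 80%
        print(f"weighted detection  : {weighted:.0%}")     # ~14%

    The point is only that the same raw results can look very different once samples stop being counted as equals.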

    Blue
     
  8. dawgg

    dawgg Registered Member

    Joined:
    Jun 18, 2006
    Posts:
    818
    IMO, prevention (of infection) is the primary purpose of security software. AVs are a branch of this, historically being only "detection", but now also diversifying into other forms and methods of protection.

    So, detection is a proxy for the "prevention" abilities of AVs - arguably one of the most important criteria, but not the only one of importance.


    Dynamic tests (of live malware in the wild) are possibly a better way to see the real-world user experience of infections, because they show the actual protection abilities of the software (be it AV or IS). However, these are much harder to do with a significant number of malware samples and leave much more space for caveats and criticism - especially because such tests often lack an assessment of "usability" with legitimate software, and even when usability is assessed, there is a large subjective judgment on the tester's part (do the alerts give enough description, are they easy for a newbie to read, do they define what the next course of action should be - which brings up the other argument of what settings should be used - Advanced or Automatic, etc.).

    I think AVC will be releasing a dynamic test "soon", as it says in the AVC August 2009 report.
     
    Last edited: Sep 21, 2009
  9. andyman35

    andyman35 Registered Member

    Joined:
    Nov 2, 2007
    Posts:
    2,336
    All these tests are an interesting general guide, but that's it. As BlueZannetti mentioned, the maliciousness of the missed samples isn't shown on the whole.

    So, for example, AV A may score 99.4% but miss all the most damaging malware in its 0.6% failures, while AV B, with a 90.5% rating, may detect all the nastiest stuff, with its 9.5% failures comprised of relatively harmless items.

    Of course you'd hope that by aggregating numerous tests over a period of time this would lessen such a likelihood.
     
  10. Inspector Clouseau

    Inspector Clouseau AV Expert

    Joined:
    Apr 2, 2006
    Posts:
    1,329
    Location:
    Maidenhead, UK
    Correct. Most of the "modern" malware is in a so-called "1-to-2-day life cycle". I'm not saying that detecting these samples after weeks isn't important, but it's more important that you detect them as they appear. Originally that was the purpose of the so-called Wildlist Organisation. However, they only listed malware that had been spotted "in the wild", and the list wasn't updated fast enough to reflect real-world detections. It doesn't make sense to update this list a month later.

    Testing AV products is easy. Testing AV products in the proper way is next to impossible as long as you don't work in this business yourself, for the following reasons:

    1. Every AV tester needs to distinguish between clean and malicious files without the help of AV scanners. (He needs to be able to disassemble / reverse engineer malware and has to draw his own conclusions as to why something is or isn't malicious.) Otherwise you rely on the work of the very same companies that you are going to test. That's basically the same as if you wanted to test, let's say, the quality of food. You simply can't give that food to the very same companies (or their competition for that matter) to find out whether it's good or not. You have to determine that yourself. Sure, pre-selecting samples with the help of AV scanners is a bonus and almost impossible to avoid with the amount of malware around currently. However, if someone complains that some sample is not detected, you need to be able to defend your findings, and "Kaspersky (or someone else) detected it" is NOT a technically suitable explanation, because they may be finding parts of a recently cleaned file. If the full virus body (including the jump into the malicious code) isn't present anymore, the file should NOT BE CONSIDERED TO BE MALICIOUS. Because it isn't. Even if other scanners say so. The only way to find that out is firing up IDA Pro or OllyDbg, disassembling the file, and finding out what the others are basing their detection on. Then you have to research whether this code

    a) is malicious as such
    b) gets activated under specific circumstances

    Good luck with that without being a professional researcher (or having researchers working for you)

    The next important point (someone already pointed this out) is that you need to know priorities - meaning what current malware is. Testing old stuff (and believe me... even 3-month-old stuff can already be considered STONE OLD these days) isn't the problem. Just go and download server-side-created malware. You'd be shocked how many AV companies that do well in these AV tests miss this kind of stuff. So the whole test becomes almost meaningless to you if your favorite AV, the one you're using in your corporate network, consistently scores high in all kinds of AV tests and then people get infected with so-called 0-1 day malware. Sure, that stuff probably gets added A MONTH LATER to the so-called AV tester collections, but still, where's the test result from when that stuff came out BRAND NEW?

    There are so many details that you need to know to be a successful AV tester - more than all the text that has been posted in this entire forum in the past week. You either know that stuff or you don't. Most of the testers don't.

    Mike
     
  11. Sjoeii

    Sjoeii Registered Member

    Joined:
    Aug 26, 2006
    Posts:
    1,240
    Location:
    52°18'51.59"N + 4°56'32.13"O
    Interesting Mike.
    So this is something Vipre is working hard on?
     
  12. subset

    subset Registered Member

    Joined:
    Nov 17, 2007
    Posts:
    825
    Location:
    Austria
    Panda points out that 52% of new viruses only last for 24 hours and that there are about 37,000 pieces of new malware every day.
    http://www.pandasecurity.com/homeusers/media/press-releases/viewnews?noticia=9804
    Is this information approximately correct?

    If so, scanning a big AV tester's collection would most likely take longer than the lifetime of the greater part of in-the-wild malware. :argh:
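
    A quick back-of-the-envelope check of that suspicion - the testbed size and scan throughput below are assumptions purely for illustration, not measured values:

        # Rough numbers only; testbed_size and scan_rate are assumed figures.
        testbed_size = 1_500_000        # samples in a large on-demand test set
        scan_rate = 20.0                # assumed files scanned per second
        malware_lifetime_h = 24.0       # "most new malware lasts ~24 hours"

        scan_hours = testbed_size / (scan_rate * 3600.0)
        print(f"scan time ~ {scan_hours:.0f} h per product vs. a ~{malware_lifetime_h:.0f} h sample lifetime")

        # ~21 hours per product here - and that ignores sample collection,
        # verification and retesting, which stretch the turnaround well past
        # the lifetime of most of the samples being tested.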

    Cheers
     
  13. Miyagi

    Miyagi Registered Member

    Joined:
    Mar 12, 2005
    Posts:
    426
    Location:
    None
    Which pretty much calls AV testing as a whole into question. But before I get in deep kimchee, end users need to understand the methodology rather than picking up someone's shrimp tail out of the sea. Dissect the testing the way BlueZannetti did with his latest graph and ask the experts questions. :cool: :D
     
  14. JerryM

    JerryM Registered Member

    Joined:
    Aug 31, 2003
    Posts:
    4,306
    While that would be great, many users, myself included, do not have the expertise to conduct tests and do not really understand the methodology. Accordingly we have to rely on published tests, and it is sometimes difficult to know how to properly discern what is best.

    For my part, I have relied on AV Comparatives for several years, as I have confidence in the tests. Although I agree that prevention is the primary criterion, it would seem that detection rates somewhat parallel prevention.

    Regards,
    Jerry
     
  15. lafiel

    lafiel AV Expert

    Joined:
    Sep 16, 2009
    Posts:
    6
    Location:
    Minsk, Belarus
    It can't be helped.
    The real collections tested by AV vendors consist of billions of samples.
    And they all must be detected, because no one will give you any assurance that you won't catch "24-hour life" malware that was spread half a year or two years ago.
     
  16. Rocko

    Rocko Registered Member

    Joined:
    Mar 16, 2005
    Posts:
    1
    NSS Labs' new anti-malware test methodology addresses many of the topics discussed on this thread: live testing, reputation system impact, proactive protection measurement, and over-time testing, as well as dynamic execution.

    You'll find the results and a lengthy methodology discussion in the test report here: http://nsslabs.com/anti-malware

    The short key findings from this difficult real-world test are that nobody is getting 99 or 100%:
    # Products vary greatly in their ability to stop socially engineered malware
    # Proactive 0-hour protection ranged from 29% to 64%
    # Overall protection varied between 62% and 93%
     
  17. dawgg

    dawgg Registered Member

    Joined:
    Jun 18, 2006
    Posts:
    818
    Thanks, interesting test. :thumb:
     
  18. Fly

    Fly Registered Member

    Joined:
    Nov 1, 2007
    Posts:
    2,201
    I have some doubts about the validity of those tests.

    Trend Micro performs rather well. Too good to be true, I think.

    Avira is not even part of the test.

    I'll dismiss this test as irrelevant.
     
  19. scott1256ca

    scott1256ca Registered Member

    Joined:
    Aug 18, 2009
    Posts:
    144
    re: "NSS Labs"

    Does Trend Micro's "in the cloud" detection rely particularly heavily on the "reputation" component? This seems to have given them quite a boost in detection. I wonder what it would do for false positives (which seem to be excluded from this test). The claim is that it boosted their detection by 23%, which to me would imply that this is looking at blacklisted elements. Would that be correct?

    Trend Micro also benefits greatly from the "time to block" numbers. This certainly favours "in the cloud" testing, and maybe that is a good thing. What I didn't see here, and maybe I missed it, was anything related to WHO was in the cloud for those products which use one. In other words, if a particular vendor had a large cloud, that would certainly influence the results. The size of that cloud would have to be representative of the cloud the average user would see. Would that be correct? I'm not very familiar with "in the cloud" products so maybe I am making incorrect assumptions.

    Also this testing assigns no weighting as to "how bad" the malware is. At least not that I saw. As has been pointed out in this thread, that might be pretty significant, if one product views a set of behaviours as threatening and blocks the "malware" and another product doesn't view it as threatening and allows it. Of course, then you need to decide if it was malware or not.
     
  20. Kees1958

    Kees1958 Registered Member

    Joined:
    Jul 8, 2006
    Posts:
    5,857
    Mike, may I ask you a hypothetical question:

    When you take a top-tier AV and give it only the "in the wild" blacklist (say of the last two months), what would be its practical/effective reduction in protection?

    Would it be
    wild (=last 2 months) versus wild + old + zoo = 98% versus 99% protection?

    To make your statements even easier to understand:
    What if I reduced the in-the-wild set to only the blacklist fingerprints of the last two weeks instead of the last two months - what would the protection reduction be?
    Wild (= last two weeks) versus etc. = still 98% versus 99%, or what?

    Thanks in advance

    regards Kees
     
  21. trjam

    trjam Registered Member

    Joined:
    Aug 18, 2006
    Posts:
    9,102
    Location:
    North Carolina USA
    I think it would be just as close for some, even for malware that is one or two days old. Look at Matt with Remove-Malware - this isn't about his videos but the malware he uses. He comments that the links are fresh, sometimes only a few hours old, and the AVs hold up well against them. So I think it is the same or close to it.

    But IC is right, even Matt's samples are inactive in just a few hours - they move so quickly.
     
  22. subset

    subset Registered Member

    Joined:
    Nov 17, 2007
    Posts:
    825
    Location:
    Austria
    Yes, but if you compared the detections (and infections...) of a considerable number of users within, e.g., one month after a test - how many of those samples would have been part of the collection the tester used?
    Less or more than half of them?

    Malware industry says "speed kills".
    AV industry says "that's why we like on demand testers most, they give us time to respond". :p

    Cheers
     