Testing with small sample populations...

Discussion in 'other anti-virus software' started by BlueZannetti, Jul 11, 2004.

Thread Status:
Not open for further replies.
  1. BlueZannetti

    BlueZannetti Administrator

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    With the increasing level of small subsample testing of new AV and AT programs appearing here and elsewhere, I thought that it would be useful to consider the expected variability of these tests from a statistical perspective. There are two overriding questions:

    1. Do the performance differences seen in these tests reflect real world performance differences?
    2. Do minor or even moderate test result differences discriminate or are those results really the same?

    The tests that I am focusing on are those that survey a few hundred to a thousand or so trojans/viruses/other malware with a group of products and then attempt to rate the products based on the results obtained. Although there has been some criticism as to whether the samples consist entirely of verified malware, let's assume they do. In this regard, the results below will be a best case analysis.

    The technical problem is as follows:

    a. A total global population of N pieces of malware exists
    b. These tests use a subsample of n pieces to reflect the global population
    c. The scanning programs accurately identify some fraction (m/N) of the N pieces of malware (hopefully close to 100%). The remaining fraction ((N-m)/N) is misidentified, i.e. the program result is "defective".
    d. How big does n have to be to obtain a reliable estimate of ((N-m)/N), that is, to reliably predict actual performance?

    This problem is mathematically identical to subsampling a large population of objects for defects. How big a sample do you need to get a good handle on the average defect rate? Without trotting out all the technical details, the response follows a hypergeometric probability distribution function (Google if you're interested in details).

    Let's look at a couple of numerical scenarios. Assume a product covers 99% of all threats. Assume that the global population of malware objects to be covered is 50,000. That means there are 500 genuine malware objects in that set of 50,000 which the product will not accurately flag. In terms of testing, we need to assess how many samples need to be tested to yield the expected result that this product is 99% effective (i.e. of x objects tested, 0.01x are missed). Some results are below. The first column is the number of samples in the testbed. The second column is the % of the global population represented. The final column is the standard deviation of the number of misses relative to its mean, given as a percentage. I put the key result in this form since, on average, the correct result will be obtained if enough random samples are taken, but the distribution of those results is strongly dependent on testbed size, and this measure gives us some indication of how variable any single test may be.



    # samples     Fraction of         Std. dev. of test
    in testbed    total population    relative to mean

    100           0.20%               99.4%
    250           0.50%               62.8%
    500           1.00%               44.3%
    2500          5.00%               19.4%
    5000          10.0%               13.3%
    10000         20.0%               8.9%
    25000         50.0%               4.4%
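
    The figures in the table can be checked with a few lines of code. This is a minimal sketch, assuming the last column is the hypergeometric standard deviation of the miss count divided by its mean; the function name `relative_std_dev` is made up for illustration and is not from the original calculation.

```python
from math import sqrt

def relative_std_dev(N, K, n):
    """Std. dev. of the miss count divided by its mean.

    N: size of the global malware population
    K: number of items in that population the scanner misses
    n: number of samples drawn at random (without replacement) for the test
    """
    p = K / N                                        # true miss rate
    mean = n * p                                     # expected misses in the testbed
    variance = n * p * (1 - p) * (N - n) / (N - 1)   # hypergeometric variance
    return sqrt(variance) / mean

N = 50_000   # global population assumed in the post
K = 500      # 99% coverage -> 500 items missed
for n in (100, 250, 500, 2500, 5000, 10000, 25000):
    print(f"{n:>6}  {n / N:>7.2%}  {relative_std_dev(N, K, n):>7.1%}")
```

    Calling the same function with K = 50 corresponds to a product with 99.9% coverage.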

    If the results are normally distributed (in this case they are not, strictly speaking, since results are bounded at zero on the lower end), roughly 68% of the time test results are expected to be within +/- one standard deviation (it's ~95% for +/- 2 standard deviations).

    Obviously, you will obtain reliable predictive results when the standard deviation is small relative to the mean. For this scenario, a 500 random sample testbed is completely abysmal in predicting performance. Whereas the test should indicate a 1% miss rate for the product, values between 0.56 and 1.44% cover only 68% of the expected results, and 95% of the expected results are somewhere in the range of a 0.11 to 1.89% miss rate. Put another way, you can fully expect roughly a quarter of the results to lie between 0.11-0.56% or 1.44-1.89%, depending upon the specific samples in the testbed. Think about it. We could have 2 products, both with a nominal failure rate of 1%. There is a fairly substantial probability (casually taking 1-2 sigma mid-points here) that Product A could exhibit a failure rate of 0.34% while Product B's failure rate is 1.66% - almost 5 times higher! Yet, these are statistically identical products responding to a small sample challenge that happens to be either on the high or low failure rate wing of the distribution for the single challenge test. You'll see results like this roughly a quarter of the time. These are fairly large windows.
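
    The sigma ranges above lean on the normal approximation. As a rough cross-check, the exact hypergeometric probabilities for the 500-sample testbed can be computed directly. This is only a sketch under the same assumptions (50,000 items, 500 of them missed, random sampling); the helper name `miss_count_pmf` is illustrative. With 500 samples the miss count is a whole number, so the observed miss rate can only move in steps of 0.2%.

```python
from math import comb

def miss_count_pmf(k, N, K, n):
    """P(exactly k misses) when n samples are drawn without replacement
    from N items, K of which the scanner misses."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 50_000, 500, 500   # scenario from the post: 1% miss rate, 500 samples
pmf = [miss_count_pmf(k, N, K, n) for k in range(min(K, n) + 1)]

p_low = sum(pmf[:3])    # 0, 1 or 2 misses: observed miss rate of 0.4% or less
p_high = sum(pmf[8:])   # 8 or more misses: observed miss rate of 1.6% or more

for k in range(11):
    print(f"{k:>2} misses ({k / n:.1%} miss rate): probability {pmf[k]:.3f}")
print(f"P(miss rate <= 0.4%) = {p_low:.3f}")
print(f"P(miss rate >= 1.6%) = {p_high:.3f}")
```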

    Results do improve significantly when the sample size is increased to 2500 and 5000 (note - these are the sample sizes of tests such as those run by av-comparatives.org/). Even here, though, fine differences in performance may be within the expected noise of the test.

    Note, as the defect rate goes down (i.e. better scan performance), the need for larger testbeds increases dramatically. If the scenario above were played out with products having 99.9% coverage, the results would be:



    # samples     Fraction of         Std. dev. of test
    in testbed    total population    relative to mean

    100           0.20%               315.8%
    250           0.50%               199.4%
    500           1.00%               140.6%
    2500          5.00%               61.6%
    5000          10.0%               42.4%
    10000         20.0%               28.3%
    25000         50.0%               14.1%

    By the same token, if the defect rate is high, much smaller testbeds can yield reasonable performance metrics.
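
    The question can also be turned around: given a miss rate and a tolerable relative standard deviation, how large must the testbed be? The sketch below inverts the hypergeometric variance formula under the same assumptions as the tables (random sampling from a known global population). The function names and the 10% target are illustrative choices, not part of the original analysis.

```python
from math import ceil, sqrt

def relative_std_dev(N, p, n):
    """Hypergeometric std. dev. of the miss count divided by its mean."""
    return sqrt((1 - p) * (N - n) / (n * p * (N - 1)))

def required_sample_size(N, p, target_rel_sd):
    """Smallest testbed size whose relative std. dev. is at or below the target.

    Obtained by solving (1 - p)(N - n) / (n p (N - 1)) = target_rel_sd**2 for n.
    """
    r2 = target_rel_sd ** 2
    return ceil((1 - p) * N / (p * (N - 1) * r2 + (1 - p)))

N = 50_000
for miss_rate in (0.10, 0.01, 0.001):   # 90%, 99% and 99.9% coverage
    n = required_sample_size(N, miss_rate, 0.10)
    print(f"miss rate {miss_rate:.1%}: about {n} samples for a 10% relative std. dev.")
```

    The lower the miss rate, the larger the share of the whole population that has to be tested to reach the same relative precision.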

    There are a multitude of details that one can quibble with in the analysis above. It is approximate, I realize this. I'm only trying to provide some objective indication as to what can happen if a randomly selected small sample testbed is employed. Performance can obviously be skewed well outside these bounds with a nonrandom testbed and those results can be either on the positive or negative side of things.

    The numbers above do not mean results presented in this forum and others are wrong. However, they do underscore the expected variability based on the use of a small challenge sample.

    Fine differences between products cannot reasonably be expected to be uncovered in these tests and it is possible that greater differences (or similarities) are also not revealed. In some cases the noise is such that these tests will clearly lose all discriminating power. Furthermore, it is very possible that the results can mislead casual readers if the intrinsic noise in the result is not appreciated.

    Just a couple of thoughts to consider....

    Blue
     
  2. o0--0o

    o0--0o Guest

    In my opinion, whether it is appropriate to use a malware archive with a small number of samples depends on the purpose of the test. For instance:

    1. If you want to test whether a scanner has a comprehensive signature database, the test archive should include many (!) samples.

    2. If you want to test whether a scanner has an up-to-date signature database, the test archive should only include samples which have been released during the last few months. Virtually the same applies if you want to test whether a scanner's heuristics can detect newly released samples before they are included in the scanner's signature database.

    3. If you want to test whether a scanner detects replicating ITW malware, only ITW samples should be included in the archive.

    4. If you want to test whether a scanner has an unpacking engine or detects modified ("camouflaged") malware, you need a specifically designed malware archive. Such an archive should include modified derivatives of *known* malware samples.


    In summary, different test archives tell you different things about a scanner. Therefore, it is important that a tester provides you with information about the malware samples contained in the test archive. The tester should be able to explain why s/he uses a specific malware archive. In other words, it does not make much sense if a tester merely tells you that a scanner detects x% of the samples contained in a malware archive about which no information is available.
     
  3. BlueZannetti

    BlueZannetti Administrator

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    All these qualifiers are absolutely correct; in a sense I was only addressing situation (1) explicitly above - the comprehensive test.

    On points 2 - 4: all excellent points. Another way to think about this is that for these types of restricted tests, even if the testbed has a small number of samples, the global population for that specialized test is also small, so the relative fraction of covered malware is reasonably high and the tester will get a good statistical sampling of the defined population. For example, I haven't studied the www.av-comparatives.org tests in detail, but I do have an impression that they are very cognizant of all these points and do design the challenge tests with them in mind. I'm sure there are a number of other efforts out there that also aim for these goals.

    The last paragraph is a key one for all of us and it's why the details are important.

    Thanks for the comments, they put some framing around the calculated numbers that is needed, and which I had neglected to provide.

    Blue
     
  4. IBK

    IBK AV Expert

    Joined:
    Dec 22, 2003
    Posts:
    1,819
    Location:
    Innsbruck (Austria)
  5. BlueZannetti

    BlueZannetti Administrator

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    Thanks IBK - that's a nice read and a heck of a lot more rigorous with respect to AV/AT testing than my simple calculations. The numbers I posted are rigorous with respect to defect assessment, regardless of application, within the approximations I made.

    My hope in posting this at all is to put small scale tests of these types of applications into some form of accessible perspective for users who have just seen their favorite program either identified as trailing the pack or clearly leading. In many cases there is neither reason to panic nor suddenly feel invincible.

    I've posted a more complete graphic of these types of calculations below (same basic scenario as initially examined - 99% coverage, global population of 50,000 samples, etc.) just to emphasize the very strong dependence when the sample size is on the low side.

    Blue
     

    Attached Files:

    • Test.jpg (29.1 KB)
  6. Firefighter

    Firefighter Registered Member

    Joined:
    Oct 28, 2002
    Posts:
    1,670
    Location:
    Finland
    To BlueZannetti from Firefighter!

    In my mind these figures make the results look too inaccurate, when they are actually more accurate estimates than that.

    According to Process Control Chart Tool Kit v.6.0.2 (J. Heitzenberg; Sof-Ware Tools, Boise, Idaho), when the population is 32000 (= the number of different infections), the Reliability/Confidence level is 90% and the Precision/Accuracy level is 2%, the sample size is only 1606.

    If we compare that population of 32000 with an infinite population, the same proggie gives, for the same Reliability/Confidence level and Precision/Accuracy level as above, a sample size of only 1691.
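
    The internals of that tool are not described here. As an assumption, the standard sample-size formula for estimating a proportion (normal approximation, worst case p = 0.5, with a finite population correction) reproduces both quoted figures, 1691 and 1606. The function name `sample_size` and its parameters are illustrative and are not the tool's interface.

```python
from statistics import NormalDist

def sample_size(confidence, margin, population=None, p=0.5):
    """Sample size needed to estimate a proportion to within +/- margin.

    confidence: two-sided confidence level, e.g. 0.90
    margin:     absolute precision on the proportion, e.g. 0.02 for +/- 2 points
    population: finite population size, or None for an infinite population
    p:          assumed true proportion (0.5 is the worst case)
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z value
    n0 = z * z * p * (1 - p) / margin ** 2           # infinite-population size
    if population is None:
        return round(n0)
    return round(n0 / (1 + (n0 - 1) / population))   # finite population correction

print(sample_size(0.90, 0.02))           # infinite population
print(sample_size(0.90, 0.02, 32_000))   # population of 32000
```

    Note that under this assumption the 2% is an absolute margin on the detection rate, which is a different yardstick from the standard deviation relative to the miss rate used in the tables earlier in the thread.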

    Best regards,
    Firefighter!
     
  7. BlueZannetti

    BlueZannetti Administrator

    Joined:
    Oct 19, 2003
    Posts:
    6,590
    Firefighter!

    I have no doubts about the numbers you quote. I realize even the simple estimates that I made are inflated. The errors clearly can't be normally distributed since the defect count is bounded below by zero, and this will definitely tighten the error estimates on the low end if it is taken into account.

    My objective was rather simple: to illustrate in as simple a way as possible some of the issues related to testing for scanning defects (missed diagnosis of viruses in this case) with a small testbed. From a personal perspective, I wanted to see qualitative trends only. I realize that in attempting to capture qualitative behavior only, it is possible to completely miss what I'm looking for by rendering the model too approximate. I don't believe that's the case here, but you are correct, at a quantitative level these results are rather pessimistic in the small subset region.

    The article cited by IBK above is the place to look at if quantitative results are desired. Quoting from that study "In other words, for two such scanners, the probability of an unfair test for a 500 sample subset (1/60 of all 30,000 viruses, i.e. 1.7%) is more than 30%. And only with subsets of more than 40% of all viruses (~11,000) the "unfair" tests would occur in < 1% of tests. Moreover, the probability of an unfair test decreases very slowly when the subset size is increased - much more slowly than for scanners of very different quality"

    I do believe that the tests that you and others provide here are a starting point for discussion on program performance. I assume that the tests reflect reality for the testbed used - that all testbed samples are genuine malware. How closely that testbed mimics a field-use situation, however, is always up in the air. Neither of us can put firm numbers around that aspect of any small scale test. In particular, are detection rate reversals possible when using a small testbed? The answer to that question is an emphatic yes. So if a tester lists a number of products in terms of effectiveness, let's say it is determined that A is better than B, which is better than C and so on to yield A>B>C>D>...>G, it is very possible that the actual global order could be something like D>F>B>A>...>C, and this is all tied to using a small testbed.
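
    A small simulation gives a feel for how often such a reversal can happen. Everything in this sketch is hypothetical: two made-up products, A missing 1.0% and B missing 1.2% of a 50,000-item population, with half of A's misses shared with B, and a fixed random seed. It is not a reconstruction of the study quoted above.

```python
import random

def reversal_rate(n, trials=300, seed=0):
    """Fraction of random n-sample tests in which the truly worse product (B)
    shows fewer misses than the truly better product (A)."""
    rng = random.Random(seed)
    N = 50_000
    reversals = 0
    for _ in range(trials):
        subset = rng.sample(range(N), n)
        # Hypothetical miss sets: A misses items 0..499 (1.0% of N),
        # B misses items 250..849 (1.2% of N), so 250 misses are shared.
        misses_a = sum(1 for i in subset if i < 500)
        misses_b = sum(1 for i in subset if 250 <= i < 850)
        if misses_b < misses_a:
            reversals += 1
    return reversals / trials

for n in (500, 2500, 10_000):
    print(f"testbed of {n:>6}: B looks better than A in {reversal_rate(n):.1%} of tests")
```

    Ties are not counted as reversals here; counting them would make the small-testbed case look worse still.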

    Blue
     
  8. Firefighter

    Firefighter Registered Member

    Joined:
    Oct 28, 2002
    Posts:
    1,670
    Location:
    Finland
    To BlueZannetti from Firefighter!

    I agree with you that test samples sized 500 or so are too small, but when you have some 1500 - 2000 RANDOMLY picked nasties, you have enough data to estimate how good products are against infections. This holds against ALL KINDS OF nasties. You would have to enlarge the infection subgroup sizes from my 100 - 800 to something closer to 1000 - 1500 to say more exactly how good certain AVs are against trojans or riskware or anything like that.

    Best regards,
    Firefighter!
     