Redundant Filters for Block Ads

Discussion in 'other software & services' started by Nzyme, Jan 3, 2017.

  1. Nzyme

    Nzyme Registered Member

    Joined:
    Mar 8, 2014
    Posts:
    4
    Hello people, I have been using ABP and uBO for quite some time now with many filters from many sources enabled. I would like to know if there is a program/web service that can remove the duplicate entries from the filter subscriptions and compile one ultimate list with all unique entries?

    For example: EasyList has around 10,000 entries and Adguard has 5,000. Is there a way to check which entries exist in both filter lists?

    What I would like is to use many filter subscriptions with all duplicate entries removed and get one list file which I can then use with ad blocking programs like Adguard, uBO, etc.
     
  2. Brummelchen

    Brummelchen Registered Member

    Joined:
    Jan 3, 2009
    Posts:
    5,931
    uBlock sorts this out internally, no doubles. watch the counts on each list: "x used out of y" means the rest is already present in another list.
     
  3. TheWindBringeth

    TheWindBringeth Registered Member

    Joined:
    Feb 29, 2012
    Posts:
    2,171
    If the lists you want to merge use identical, single-line (non-order-dependent), text-based, filter syntax then an ordinary line deduping tool should work. You may have such a feature built into the text editor that you use. I'm certain I've seen public scripts that can download+merge/dedupe as well (can't remember where).

    On the other hand, and for example, if you attempt to merge a list designed for one blocker with a list designed for another blocker... and those blockers support different filter syntax/interpretation... then you may run into problems. If a blocker supports some form of list specific isolation (such as exception rules that only apply to the list in which they are embedded) then you could also run into problems when merging lists. Also, you will see some lists address example.com blocking via ||example.com^ while other lists prefer to use ||example.com^$third-party.
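A minimal sketch of the plain line-dedupe approach described above, assuming all source lists share the same single-line, order-independent syntax. The function name `merge_filter_lists` and the sample lists are hypothetical, and treating `!` lines and bracketed `[...]` lines as comments and headers is an assumption about ABP-style lists, not a full parser.

```python
from typing import Iterable, List


def merge_filter_lists(lists: Iterable[Iterable[str]]) -> List[str]:
    """Merge several filter lists, dropping comments, headers and exact duplicates.

    Order of first appearance is kept, so the result is easy to compare
    against the source lists.
    """
    seen = set()
    merged = []
    for lines in lists:
        for raw in lines:
            line = raw.strip()
            # Assumption: "!" starts a comment and "[...]" is a list header.
            if not line or line.startswith("!"):
                continue
            if line.startswith("[") and line.endswith("]"):
                continue
            if line not in seen:
                seen.add(line)
                merged.append(line)
    return merged


# Hypothetical sample data standing in for two downloaded lists.
list_one = ["[Adblock Plus 2.0]", "! Title: sample A", "||example.com^", "||ads.example.net^"]
list_two = ["! Title: sample B", "||example.com^$third-party", "||ads.example.net^"]

print(merge_filter_lists([list_one, list_two]))
```

Note that `||example.com^` and `||example.com^$third-party` both survive here: they differ as text, so a line-based dedupe cannot tell that they target the same domain.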
     
  4. trott3r

    trott3r Registered Member

    Joined:
    Jan 21, 2010
    Posts:
    1,283
    Location:
    UK
    Care to mention a text editor that does this in windows?
     
  5. TheWindBringeth

    TheWindBringeth Registered Member

    Joined:
    Feb 29, 2012
    Posts:
    2,171
    It is such a common operation that all of the full-featured ones should. If it isn't literally built in then a plugin/extension that does it should be available. Searches:

    remove duplicate lines <-- Will find various types of tools and approaches
    remove duplicate lines $nameOfTextEditor <-- Will find discussions of how it is done in a particular one
     
    Last edited: Jan 3, 2017
  6. Brummelchen

    Brummelchen Registered Member

    Joined:
    Jan 3, 2009
    Posts:
    5,931
    it makes no sense to work manually on filter lists used in ABP or uB - with the next update - and updates are important - all of it is gone.

    when merging some HOSTS lists here under windows i use textpad, which sorts out doubles when sorting alphabetically. notepad++ should be able to do that too.
     
  7. TheWindBringeth

    TheWindBringeth Registered Member

    Joined:
    Feb 29, 2012
    Posts:
    2,171
    I believe OP is talking about creating a custom (user-maintained) filter list (aggregated from multiple sources). Which shouldn't be wiped out by filtering tool updates and/or updates retrieved for other lists. He would, of course, want to keep his own custom list updated and thus he should be thinking about the maintenance burden.

    There are some benefits to a manual comparison/merge process. Particularly in the very beginning. For it basically forces you to look at, and develop some understanding of, the source or reference lists. Lists which may block different types of things, have different degrees of coverage and/or freshness, have different tolerances for breaking things (beware exception rules if you favor protection), etc. You'd still use tools to perform set operations (show the entries that appear in both list A and list B, show the entries that appear in List B but not in list A, merge [selected] entries from list B into list A, etc) but you'd use them in a manual fashion.

    Although far more time consuming than doing a full/blind automated merge, it can help you decide which lists you do/don't want to use as sources, which lists you want to adjust before merging, etc.
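The set operations mentioned above (entries in both A and B, entries only in B, merging selected entries from B into A) can be sketched with Python sets. This is only an illustration: `load_entries` and the sample lists are hypothetical, and the `"tracker" in e` test is a stand-in for a decision a person would make while reviewing.

```python
def load_entries(lines):
    """Return the set of non-blank, non-comment entries from an iterable of lines."""
    return {ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("!")}


# Hypothetical sample data; real lists would be read from downloaded files.
list_a = load_entries(["! list A", "||example.com^", "||ads.example.net^"])
list_b = load_entries(["! list B", "||ads.example.net^", "||tracker.example.org^"])

in_both = list_a & list_b      # entries that appear in both list A and list B
only_in_b = list_b - list_a    # entries that appear in list B but not in list A

# Merge only the entries from B that were picked during a manual review.
selected = {e for e in only_in_b if "tracker" in e}  # stand-in for a human choice
merged = sorted(list_a | selected)

print(sorted(in_both))
print(sorted(only_in_b))
print(merged)
```

Working with `only_in_b` first keeps the review small: only entries that list A does not already have need to be looked at.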

    OP might want to look over what's available at https://github.com/ryanbr/fanboy-adblock
     
  8. Nzyme

    Nzyme Registered Member

    Joined:
    Mar 8, 2014
    Posts:
    4
    Thank you guys. I think all your responses make sense to me.
     
  9. Brummelchen

    Brummelchen Registered Member

    Joined:
    Jan 3, 2009
    Posts:
    5,931
    @TheWindBringeth
    ofc i understand this method, but i won't recommend it, especially not for uB. it's much easier to use the pre-defined lists and then set exceptions. furthermore i recommend using the advanced mode ("[x] experienced user") to block/unblock/noop sites in uB.

    noop'ing means "allow domain except not allowed elements". i have blocked domains but those are only few (a handful). less time for maintenance means more free time for something else ;)

    i have 83 whitelisted domains out of ~78.000 domains in uB (~58.000 in the windows HOSTS file). keeping an own list current is really much more work than keeping a short whitelist. that's why my windows hosts file is not updated regularly.
     
  10. summerheat

    summerheat Registered Member

    Joined:
    May 16, 2015
    Posts:
    2,199
    @Nzyme : As already mentioned uBlock Origin automatically removes duplicate filters, so there's no need to do this manually. (I don't know if Adblock Plus does that, too). This is also true for the hosts files used in uBO. However, you can add, e.g., Steven Black's hosts file which consolidates and deduplicates various hosts files available in uBO. The only advantage is to reduce the download size of the filterlist updates (by deactivating the respective other hosts files in uBO) - nothing else.
     
  11. inka

    inka Registered Member

    Joined:
    Oct 21, 2009
    Posts:
    426
    TextPad32 does it for me. Surely many other editors provide sort|removeDuplicate functionality.
     
  12. inka

    inka Registered Member

    Joined:
    Oct 21, 2009
    Posts:
    426
    Dunno whether it is still available but Wladimir Palant (adblockplus site, i guess) had an interactive webpage where you could paste, click GO, and it would remove redundant filter lines.

    Yay, just checked and it's still available (called "find useless filters"):
    https://adblockplus.org/en/tools
    note: calcs are performed client-side, so if you paste in 50K lines, BE PATIENT.
    Might run several minutes, don't presume the browser has fatally "locked up".
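To illustrate what "redundant" can mean beyond exact duplicates, here is a rough heuristic sketch. It is not the algorithm used by the adblockplus.org tool. It only flags a blocking filter such as `||example.com^$third-party` when the same pattern also exists without any options. The function name `find_redundant` and the `NARROWING_OPTIONS` set are hypothetical, and the assumption that these options only narrow what a filter matches should be checked against the blocker's own documentation.

```python
# Assumption: these options only narrow what a blocking filter matches.
NARROWING_OPTIONS = {"third-party", "script", "image", "stylesheet",
                     "subdocument", "xmlhttprequest"}


def find_redundant(filters):
    """Return blocking filters covered by an option-less filter with the same pattern."""
    # Option-less blocking filters, e.g. "||example.com^".
    plain = {f for f in filters if "$" not in f and not f.startswith("@@")}
    redundant = []
    for f in filters:
        # Exception rules ("@@...") and option-less filters are never flagged.
        if f.startswith("@@") or "$" not in f:
            continue
        pattern, _, options = f.rpartition("$")
        narrowing = all(
            opt.lstrip("~") in NARROWING_OPTIONS or opt.startswith("domain=")
            for opt in options.split(",")
        )
        if narrowing and pattern in plain:
            redundant.append(f)
    return redundant


sample = [
    "||example.com^",
    "||example.com^$third-party",   # covered by the plain filter above
    "||ads.example.net^$script",    # no plain counterpart, so it is kept
    "@@||example.com^$document",    # exception rule, never flagged
    "||example.com^$important",     # not a narrowing option, so it is kept
]
print(find_redundant(sample))
```

Anything the heuristic is unsure about is left alone, which is the safer default when the goal is to avoid weakening the merged list.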
     
  13. summerheat

    summerheat Registered Member

    Joined:
    May 16, 2015
    Posts:
    2,199
    Yes, and you can also do it on a Linux system (I don't know if there's something similar for Windows) using sort with the unique flag. It's very fast. I had done that in the past in order to consolidate various hosts files. Excerpt:
    Code:
    # Do some work on the file (GNU sed syntax: \r, \t and \| are extensions):
    # 1. Remove MS-DOS carriage returns
    # 2. Delete all lines that don't begin with 127.0.0.1 or 0.0.0.0
    # 3. Delete any lines containing the word localhost because we'll obtain that from the original hosts file
    # 4. Strip the leading 127.0.0.1 or 0.0.0.0 so that only the host name is left
    # 5. Delete any comments on lines
    # 6. Clean up leftover leading and trailing blanks
    # 7. Wrap each name as address=/name/0.0.0.0 (dnsmasq syntax)
    # 8. Finally, delete all lines with an empty name (address=//0.0.0.0) to make sure that all remnants are removed
    # Pass all this through sort with the unique flag to remove duplicates and save the result
    echo "Parsing, cleaning, de-duplicating, sorting..."
    
    sed -e 's/\r$//'                              \
        -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)/!d'    \
        -e '/localhost/d'                         \
        -e 's/^\(127\.0\.0\.1\|0\.0\.0\.0\)//'    \
        -e 's/#.*$//'                             \
        -e 's/^[ \t]*//;s/[ \t]*$//'              \
        -e 's/^/address=\//'                      \
        -e 's/$/\/0.0.0.0/'                       \
        -e '/\/\//d'                              \
        < "$temphosts1" |
        sort -u > "$temphosts2"
    But again - why should one do this in the first place? The ABP-compatible filterlists are not static but updated very often, and their deduplication is done by uBO. The deduplication of hosts files (only in order to reduce download size as their deduplication is also done by uBO) can be done by using Steven Black's hosts file - see my post above. So what the OP wants to achieve is as useful as a hole in the head. ;)
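For readers on Windows, where sed and sort may not be at hand, a rough Python counterpart of the hosts clean-up steps might look like the sketch below. It is an illustration rather than a port of the script above: it emits plain hosts-file lines (`0.0.0.0<TAB>name`) instead of dnsmasq entries, the function name `consolidate_hosts` and the sample lines are hypothetical, and lower-casing the names is an added assumption based on host names being case-insensitive.

```python
def consolidate_hosts(lines):
    """Normalise hosts-file lines to '0.0.0.0<TAB>name' and drop duplicates."""
    entries = set()
    for raw in lines:
        # Drop trailing comments, carriage returns and surrounding blanks.
        line = raw.split("#", 1)[0].strip()
        # localhost entries are expected to come from the original hosts file.
        if "localhost" in line:
            continue
        parts = line.split()
        # Keep only lines of the form "<127.0.0.1|0.0.0.0> name [name ...]".
        if len(parts) < 2 or parts[0] not in ("127.0.0.1", "0.0.0.0"):
            continue
        for name in parts[1:]:
            entries.add("0.0.0.0\t" + name.lower())
    return sorted(entries)


# Hypothetical sample data standing in for lines read from several hosts files.
sample = [
    "# comment\r\n",
    "127.0.0.1 localhost\r\n",
    "127.0.0.1  ads.example.com # tracker\r\n",
    "0.0.0.0 ads.example.com\n",
    "0.0.0.0 Tracker.example.org\n",
    "192.168.1.1 router\n",
]
for entry in consolidate_hosts(sample):
    print(entry)
```

The set does the work of `sort -u`: the two spellings of the ads.example.com entry collapse into one line.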
     
  14. inka

    inka Registered Member

    Joined:
    Oct 21, 2009
    Posts:
    426
    Hey, nice sed -fu there, summerheat !
    Like a hole in the head? I don't care to (factcheck and) argue the point, but I worry that redundancies, at scale, would delay the startup of each session. When (and how often) does filterlist update occur? Does the extension "discard after reading" each subscribed list, maintaining only a single resultant merged copy? I think not -- else we would not be able to selectively toggle on/off discrete rulesets.

    liveboot scenario brings further consideration:
    After subscribing to multiple lists, and noticing that the ABP extension saves multiple (original+3? by default) backup copies, and noticing that ABP writes to disk each time I visit the ruleset edit GUI (regardless whether any edits have been performed, dammit)... cumulatively that incurs a significant overhead in a liveboot linux session (and bloats the persistence savefile).
     
  15. harsha_mic

    harsha_mic Registered Member

    Joined:
    Mar 11, 2009
    Posts:
    815
    Location:
    India
    Nope. Based on the selected filter lists, it removes duplicates and creates a compiled filter list, which is then loaded at the next start-up. It does not slow down at all! Well, at least when compared to other blockers like ABP.
    More details about it can be found here, best explained by @gorhill
     
  16. summerheat

    summerheat Registered Member

    Joined:
    May 16, 2015
    Posts:
    2,199
    I also had a version for dnsmasq. For a while I used some really huge hosts files that produced more than 1 million (consolidated!) entries in dnsmasq.conf. But you can imagine that I got way too many false positives ... :D:D:D

    See harsha_mic's answer and additionally this site.
     
  17. gorhill

    gorhill Guest

    Image below shows load speed of ABP (top), uBO without selfie (center), uBO with a selfie (bottom). I adjusted the time scale (x axis) to be the same (1500 ms) so the difference can be appreciated at a glance (full size):

    [Image: a.png]

    ABP: EasyList + EasyPrivacy (no acceptable ads).
    uBO: EasyList + EasyPrivacy + Peter Lowe's + Malware filters + uBlock filters. Duplicate filters are discarded, but they are not numerous with these default filter lists (meaning the discarding helped load time only marginally here).

    So if using uBO, there are more important things to worry about than load time because of duplicates.
     
  18. boredog

    boredog Registered Member

    Joined:
    Feb 1, 2015
    Posts:
    2,499
    adguard already has two of those filters and you can check a bunch more. but if you check too many you will get a warning that you have too many checked and that it could slow down your browsing experience

    https://easylist.to/
     

  19. Brummelchen

    Brummelchen Registered Member

    Joined:
    Jan 3, 2009
    Posts:
    5,931
    what does "selfie" mean here? thx
     
  20. guest

    guest Guest

    I have found something about 'selfies':
     
  21. Minimalist

    Minimalist Registered Member

    Joined:
    Jan 6, 2014
    Posts:
    14,885
    Location:
    Slovenia, EU
    Is there any information where selfies are stored?
    I use Sandboxie, which deletes browser session data when the browser is closed. It would be wise to retain selfies so they are not recreated each time I open the browser.
     