Hello people, I have been using ABP and uBO for quite some time now with many filters from many sources enabled. I would like to know if there is a program/web service that can remove the duplicate entries from the filter subscriptions and compile one ultimate list with all unique entries. For example, EasyList has around 10,000 entries and Adguard has 5000. Is there a way to check which entries exist in both filter lists? What I would like is to use many filter subscriptions with all duplicate entries removed and get one list file which I can then use with ad blocking programs like Adguard, uBO, etc.
uBlock dedupes internally, so there are no doubles. Watch the count shown for each list: "x out of y" means the remaining entries are already present in another list.
If the lists you want to merge use identical, single-line (non-order-dependent), text-based, filter syntax then an ordinary line deduping tool should work. You may have such a feature built into the text editor that you use. I'm certain I've seen public scripts that can download+merge/dedupe as well (can't remember where). On the other hand, and for example, if you attempt to merge a list designed for one blocker with a list designed for another blocker... and those blockers support different filter syntax/interpretation... then you may run into problems. If a blocker supports some form of list specific isolation (such as exception rules that only apply to the list in which they are embedded) then you could also run into problems when merging lists. Also, you will see some lists address example.com blocking via ||example.com^ while other lists prefer to use ||example.com^$third-party.
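For what it's worth, the "ordinary line deduping" approach can be sketched in a few lines of Python. This is only a minimal sketch, assuming Adblock Plus-style text lists where `!` starts a comment and a `[...]` line is a header; the names `is_rule` and `merge_lists` and the sample lists are made up for illustration. It also reports, per list, how many rules were new versus already seen in an earlier list.

```python
def is_rule(line: str) -> bool:
    """True if the line carries a filter (not blank, comment, or header)."""
    s = line.strip()
    if not s or s.startswith("!"):
        return False
    # Assumption: a line like "[Adblock Plus 2.0]" is a list header.
    return not (s.startswith("[") and s.endswith("]"))


def merge_lists(lists: dict) -> tuple:
    """Merge lists in the given order, keeping the first copy of each rule.

    Returns (merged_rules, stats) where stats maps list name -> (kept, total).
    """
    seen = set()
    merged = []
    stats = {}
    for name, text in lists.items():
        kept = total = 0
        for line in text.splitlines():
            if not is_rule(line):
                continue
            rule = line.strip()
            total += 1
            if rule not in seen:
                seen.add(rule)
                merged.append(rule)
                kept += 1
        stats[name] = (kept, total)
    return merged, stats


if __name__ == "__main__":
    sample = {
        "list_a": "[Adblock Plus 2.0]\n! Title: A\n||example.com^\n||ads.example.net^\n",
        "list_b": "! Title: B\n||example.com^$third-party\n||ads.example.net^\n",
    }
    merged, stats = merge_lists(sample)
    print("\n".join(merged))
    for name, (kept, total) in stats.items():
        print(f"{name}: {kept} out of {total}")
```

Note that exact-line matching keeps both `||example.com^` and `||example.com^$third-party`, since they are different strings. Collapsing near-duplicates like that would need real knowledge of the filter syntax, which this sketch does not attempt.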
It is such a common operation that all of the full-featured text editors should have it. If it isn't literally built in, then a plugin/extension that does it should be available. Searches:
remove duplicate lines <-- Will find various types of tools and approaches
remove duplicate lines $nameOfTextEditor <-- Will find discussions of how it is done in a particular one
It makes no sense to work manually on filter lists used in ABP or uB: at the next update (and updates are important) all of it is gone. When merging HOSTS lists under Windows, I use TextPad, which can drop doubles while sorting alphabetically. Notepad++ should be able to do this too.
I believe OP is talking about creating a custom (user-maintained) filter list (aggregated from multiple sources). Which shouldn't be wiped out by filtering tool updates and/or updates retrieved for other lists. He would, of course, want to keep his own custom list updated and thus he should be thinking about the maintenance burden. There are some benefits to a manual comparison/merge process. Particularly in the very beginning. For it basically forces you to look at, and develop some understanding of, the source or reference lists. Lists which may block different types of things, have different degrees of coverage and/or freshness, have different tolerances for breaking things (beware exception rules if you favor protection), etc. You'd still use tools to perform set operations (show the entries that appear in both list A and list B, show the entries that appear in List B but not in list A, merge [selected] entries from list B into list A, etc) but you'd use them in a manual fashion. Although far more time consuming than doing a full/blind automated merge, it can help you decide which lists you do/don't want to use as sources, which lists you want to adjust before merging, etc. OP might want to look over what's available at https://github.com/ryanbr/fanboy-adblock
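The set operations mentioned above (in both A and B, in B but not in A) are easy to try out in Python. A rough sketch, assuming one rule per line and `!` comments; `parse_rules`, `compare_lists` and the sample data are hypothetical names for illustration only.

```python
def parse_rules(text: str) -> set:
    """Return the set of non-blank, non-comment lines of a filter list."""
    rules = set()
    for line in text.splitlines():
        s = line.strip()
        if s and not s.startswith("!"):
            rules.add(s)
    return rules


def compare_lists(text_a: str, text_b: str) -> dict:
    """Compare two lists with plain set operations."""
    a, b = parse_rules(text_a), parse_rules(text_b)
    return {
        "both": a & b,      # entries that appear in list A and list B
        "only_a": a - b,    # entries that appear in A but not in B
        "only_b": b - a,    # entries that appear in B but not in A
    }


if __name__ == "__main__":
    list_a = "! list A\n||example.com^\n||ads.example.net^\n"
    list_b = "||ads.example.net^\n||tracker.example.org^\n"
    for label, rules in compare_lists(list_a, list_b).items():
        print(label)
        for rule in sorted(rules):
            print("   ", rule)
```

Reading through the `only_b` output by hand before merging any of it into list A is the manual step described above; the code just saves you from eyeballing the diff.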
@TheWindBringeth Of course I understand this method, but I wouldn't recommend it, especially not for uB. It is much easier to use the predefined lists and then set exceptions. Furthermore, I recommend using the advanced mode ("[x] experienced user") to block/unblock/noop sites in uB. Noop'ing means "allow the domain, except for elements that are not allowed". I have blocked domains, but only a few (a handful). Less time for maintenance means more free time for something else. I have 83 whitelisted domains out of ~78,000 domains in uB (~58,000 in the Windows HOSTS file). Keeping your own list current is much more work than keeping a short whitelist. That's why my Windows hosts file is not updated regularly.
@Nzyme : As already mentioned uBlock Origin automatically removes duplicate filters, so there's no need to do this manually. (I don't know if AdBlock Plus does that, too). This is also true for the hosts files used in uB0. However, you can add, e.g., Steven Black's hosts file which consolidates and deduplicates various hosts files available in uB0. The only advantage is to reduce the download size of the filterlist updates (by deactivating the respective other hosts files in uB0) - nothing else.
Dunno whether it is still available but Wladimir Palant (adblockplus site, i guess) had an interactive webpage where you could paste, click GO, and it would remove redundant filter lines. Yay, just checked and it's still available (called "find useless filters"): https://adblockplus.org/en/tools note: calcs are performed client-side, so if you paste in 50K lines, BE PATIENT. Might run several minutes, don't presume the browser has fatally "locked up".
Yes, and you can also do it on a Linux system (I don't know if there's something similar for Windows) using sort with the unique flag. It's very fast. I had done that in the past in order to consolidate various hosts files. Excerpt:

Code:

    # Do some work on the file:
    # 1. Remove MS-DOS carriage returns
    # 2. Delete all lines that don't begin with 127.0.0.1 or 0.0.0.0
    # 3. Delete any lines containing the word localhost because we'll obtain that from the original hosts file
    # 4. Strip the leading address so that only the host name remains
    # 5. Delete any comments on lines
    # 6. Clean up leading and trailing blanks
    # 7. Wrap each name as address=/name/0.0.0.0 (dnsmasq syntax); 0.0.0.0 because then we don't have to wait for the resolver to fail
    # 8. Finally, delete lines that ended up with an empty name (address=//0.0.0.0)
    # Pass all this through sort with the unique flag to remove duplicates and save the result
    echo "Parsing, cleaning, de-duplicating, sorting..."
    sed -e 's/\r$//' \
        -e '/^\(127\.0\.0\.1\|0\.0\.0\.0\)[ \t]/!d' \
        -e '/localhost/d' \
        -e 's/^\(127\.0\.0\.1\|0\.0\.0\.0\)//' \
        -e 's/#.*$//' \
        -e 's/^[ \t]*//;s/[ \t]*$//' \
        -e 's/^/address=\//' \
        -e 's/$/\/0.0.0.0/' \
        -e '/\/\//d' \
        < "$temphosts1" | sort -u > "$temphosts2"

But again - why should one do this in the first place? The ABP-compatible filterlists are not static but updated very often, and their deduplication is done by uB0. The deduplication of hosts files (only in order to reduce download size as their deduplication is also done by uB0) can be done by using Steven Black's hosts file - see my post above. So what the OP wants to achieve is as useful as a hole in the head.
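Since the post wonders about Windows: roughly the same kind of cleanup can be done in Python, which runs anywhere. This is a minimal sketch and not a port of the script above. It emits plain hosts-file lines (`0.0.0.0<TAB>name`), the function name `clean_hosts` is made up, and lowercasing the names is an extra normalization I am assuming is wanted because host names are case-insensitive.

```python
BLOCK_ADDRESSES = ("127.0.0.1", "0.0.0.0")


def clean_hosts(text: str) -> list:
    """Return sorted, de-duplicated '0.0.0.0<TAB>name' lines from hosts text."""
    names = set()
    for raw in text.splitlines():          # splitlines() also copes with \r\n
        line = raw.split("#", 1)[0].strip()  # drop comments and outer blanks
        parts = line.split()                 # collapses runs of spaces/tabs
        if len(parts) < 2 or parts[0] not in BLOCK_ADDRESSES:
            continue                         # not a blocking entry
        for name in parts[1:]:
            if "localhost" in name:
                continue                     # keep localhost out of the result
            names.add(name.lower())          # assumption: treat names case-insensitively
    return ["0.0.0.0\t" + name for name in sorted(names)]


if __name__ == "__main__":
    sample = (
        "127.0.0.1 localhost\r\n"
        "127.0.0.1 ads.example.com # some comment\r\n"
        "0.0.0.0   Ads.Example.com\n"
        "0.0.0.0 tracker.example.org\n"
    )
    print("\n".join(clean_hosts(sample)))
```

Collecting the names in a set plays the role of `sort -u`: duplicates collapse on insertion and `sorted()` gives a stable order at the end.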
Hey, nice sed -fu there, summerheat ! Like a hole in the head? I don't care to (factcheck and) argue the point, but I worry that redundancies, at scale, would delay the startup of each session. When (and how often) does filterlist update occur? Does the extension "discard after reading" each subscribed list, maintaining only a single resultant merged copy? I think not -- else we would not be able to selectively toggle on/off discrete rulesets. liveboot scenario brings further consideration: After subscribing to multiple lists, and noticing that the ABP extension saves multiple (original+3? by default) backup copies, and noticing that ABP writes to disk each time I visit the ruleset edit GUI (regardless whether any edits have been performed, dammit)... cumulatively that incurs a significant overhead in a liveboot linux session (and bloats the persistence savefile).
Nope. Based on the selected filter lists, it removes duplicates and creates a compiled filter list, which is then loaded at the next start-up. It does not slow things down at all! Well, at least when compared to other blockers like ABP. More details can be found here, where it is best explained by @gorhill
I also had a version for dnsmasq. For a while I used some really huge hosts files that produced more than 1 million (consolidated!) entries in dnsmasq.conf. But you can imagine that I got way too many false positives ... See harsha_mic's answer and additionally this site.
The image below shows the load speed of ABP (top), uBO without a selfie (center), and uBO with a selfie (bottom). I adjusted the time scale (x axis) to be the same (1500 ms) so the difference can be appreciated at a glance. ABP: EasyList + EasyPrivacy (no acceptable ads). uBO: EasyList + EasyPrivacy + Peter Lowe's + Malware filters + uBlock filters. Duplicate filters are discarded, but they are not numerous with these default filter lists (meaning the discarding only marginally helped load time here). So if using uBO, there are more important things to worry about than load time because of duplicates.
Adguard already has two of those filters, and you can check a bunch more. But if you check too many, you will get a warning that you have too many checked, which could slow down your browsing experience. https://easylist.to/
Is there any information on where selfies are stored? I use Sandboxie, which deletes browser session data when the browser is closed. It would be wise to retain selfies so they are not recreated each time I open the browser.