TheWindBringeth
March 30th, 2012, 08:45 PM
I think URL checkers (Malicious URL Blocking, Safe Browsing, SmartScreen Filter, etc.) have the potential to be a severe threat to individual and corporate privacy. The degree to which a checker presents such a threat depends on how it operates, what information it sends and to whom, who can read that information, etc. I feel like there is much to consider, and as I re-evaluate my own approaches this year, I think I should spend a fair amount of time on this issue.
I think there is just one approach that doesn't create a privacy issue: blocking things based on "definitions" that are pulled via secure connection and without passing a user/instance unique identifier to the definitions provider. Given the potential for large numbers of threatening URLs, I feel this approach would likely have limitations in terms of coverage or granularity, which, given multiple overlapping lines of defense and an ability to cope with false positives, might not be a showstopper. So I consider this ideal approach still on the table, for me at least.
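To make the "definitions" idea concrete, here is a minimal sketch of a purely local check. It assumes the definitions have already been downloaded (over a secure connection, with no identifier sent) into a plain set of blocked hostnames. The names host_suffixes, is_blocked, and blocked_hosts are hypothetical, and matching on hostname only is exactly the kind of granularity limit mentioned above.

```python
from urllib.parse import urlsplit


def host_suffixes(hostname):
    """Return the hostname and each of its parent domains, most specific first."""
    labels = hostname.lower().rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def is_blocked(url, blocked_hosts):
    """Check a URL against a locally stored set of blocked hostnames.

    Nothing leaves the machine: the lookup is a set membership test
    against definitions that were fetched ahead of time.
    """
    hostname = urlsplit(url).hostname
    if hostname is None:
        # No network host (e.g. file: or mailto: URLs), nothing to look up.
        return False
    return any(suffix in blocked_hosts for suffix in host_suffixes(hostname))
```

For example, with blocked_hosts = {"bad.example"}, a URL on www.bad.example is blocked, while notbad.example is not, since matching is done on whole domain labels rather than raw string suffixes.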
At the other end of the spectrum I think there are a number of behaviors which when combined would create the most privacy issues in gross terms (I'll consider net terms later). That is what I'm trying to think through now. I've created the preliminary list below. Is there anything I'm missing?
a) Sends full URLs (scheme, hostname, port, path, and query string)
b) Fails to strip username:password@ if present
c) Performs query via non-secured connection with no or weak encryption/authentication
d) Is proxy based and thus has visibility into all URLs regardless of application being used
e) Sends a user/instance unique ID with each query
f) Sends regular cookies with the query
g) Query response includes active content or otherwise presents a dynamic threat
h) Sends referrer or other information about recently visited sites
i) Performs queries on many schemes (not just HTTP, but others such as ftp, mailto, ...)
j) Performs queries on URLs associated with sites visited via HTTPS
k) Performs queries on local filesystem URLs
l) Performs queries on local private network URLs (e.g. no private IP address checking)
m) Queries ahead (checks URLs you haven't actually clicked on)
n) Doesn't utilize its own exclusion list to reduce reported URLs (checks everything)
o) Doesn't utilize caching (queries each and every time or very frequently)
p) No user control over what is checked (can't create exclusions or set it to ask)
q) No logging to make it easy for a user to review what has been sent
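As a rough illustration of the opposite of items a), b), i), k), and l), here is a sketch of what a more careful checker might do before sending anything. This is my own assumption about a reasonable pre-filter, not a description of how any existing product works; minimize_for_query, is_local_host, and CHECKED_SCHEMES are hypothetical names.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: only web schemes are ever queried; file:, mailto:, ftp: are skipped.
CHECKED_SCHEMES = {"http", "https"}


def is_local_host(hostname):
    """True for localhost, .local names, and private/loopback/link-local IP literals."""
    if hostname == "localhost" or hostname.endswith(".local"):
        return True
    try:
        addr = ipaddress.ip_address(hostname)
    except ValueError:
        # A DNS name rather than an IP literal. A name that resolves to a
        # private address is NOT caught by this sketch.
        return False
    return addr.is_private or addr.is_loopback or addr.is_link_local


def minimize_for_query(url):
    """Return a reduced URL that is safer to send, or None to skip the query."""
    parts = urlsplit(url)
    if parts.scheme not in CHECKED_SCHEMES:
        return None
    hostname = parts.hostname
    if not hostname or is_local_host(hostname):
        return None
    # Rebuild from hostname and port only, which drops username:password@.
    host = f"[{hostname}]" if ":" in hostname else hostname
    port = f":{parts.port}" if parts.port else ""
    # The query string and fragment are deliberately left out.
    return f"{parts.scheme}://{host}{port}{parts.path}"
```

With this, http://user:pw@example.com:8080/a/b?token=1 would be reduced to http://example.com:8080/a/b, while file: URLs, mailto: URLs, and http://192.168.1.1/admin would not be queried at all. Dropping the path as well, or sending only a hash, would reduce exposure further at the cost of granularity.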