Apache as a proxy: passing content to scripts

Gullible Jones · Jan 17, 2014

I'm trying to implement web content filtering with Apache as a proxy, and would like to pass HTML to a filtering script. Is this possible? If so, can anyone point me to some documentation on it?

Edit: turns out Google actually mentions this in a Q&A PDF. Unfortunately something keeps mangling the link, but I'll post the basics here for futurity.

A (local) Apache proxy configuration looks something like this (note that you have to load mod_ext_filter, mod_deflate, mod_proxy, and mod_proxy_http):

[Update: corrected use of filters]
Code:
Listen 127.0.0.1:8080

<VirtualHost 127.0.0.1:8080>
  ProxyRequests On
  ProxyVia On
  <Proxy *>
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
  </Proxy>

# Filters go here
ExtFilterDefine filter_blah mode=output intype=text/html \
cmd="/usr/local/bin/filter_blah.pl"

SetOutputFilter INFLATE;filter_blah

</VirtualHost>
Edit: too bad regexes for HTML are a pain...

TheWindBringeth · Jan 16, 2014

Since I never tried to do this myself I ran a few searches after I saw your question. FWIW, one article I found was the following about using ModSecurity in a forward proxy arrangement to check for malware links in a returned html document. Although it doesn't match your "pass HTML to a filtering script" I'll post the link:

http://blog.spiderlabs.com/2011/04/modsecurity-advanced-topic-of-the-week-malware-link-removal.html

Because it is interesting and hints that ModSecurity might have some interesting features. I need to look at it more

Gullible Jones · Jan 17, 2014

Thanks, that's rather interesting...

I was hoping to implement a Privoxy type of thing using Apache (because reinventing the wheel is fun). Unfortunately HTML is really brutal to process; arbitrarily deep nesting tends to make regexes vomit.

The wise thing to do would be to use someone else's library... But that wouldn't teach me much, so I'll probably be reading up on parsers. Assuming I have the time (which I may not).

JeffreyCole · Jan 17, 2014

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

biased · Jan 17, 2014

Why would not the already developed things work? Squid is a nice good examples. And many others. And dansgaurd.

Filtering can be less or as much and deep as you want. Even ran squid on windows 7 haha. But nice on unix box, or i like pfsense too.

is it for learning and challenge to go hand made proxy router?

it is enough for mine self that learning the squid things is more than good. so much to know. so much.

i get you and that ha! time is when you do harder things, just becuase and learn !

Gullible Jones · Jan 17, 2014

@biased: Squid doesn't have an official Windows build any more. I was thinking less gateway and more local web filter (like Proxomitron). Though it would probably be better to have such things running on a dedicated gateway.

@JeffreyCole: thank you very much, and LOL! I'm not versed in the theory of this stuff, so I didn't know that regexes can't handle recursively structured data formats.

Edit: anyway 90% of what I want can be gotten by blanket disabling Javascript on unencrypted sites. I mean, when was the last time you really absolutely needed JS on an HTTP site? But doing this from the browser is easy, and therefore no fun.

Edit 2: okay, if I want this to actually be good for anything, I am going to have to use a library. Cursory Googling says there's no way I can write a decent HTML parser with my current scripting know-how.

Gullible Jones · Jan 17, 2014

I cobbled up a simplistic filter this morning based on Python's HTMLParser:
Code:
                            
#!/usr/bin/env python                                                           

import HTMLParser
import sys

class MyFilter(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.in_killzone = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script" or tag == "iframe":
            self.in_killzone = 1
        else:
            print self.get_starttag_text()

    def handle_endtag(self, tag):
        if tag == "script" or tag == "iframe":
            self.in_killzone = 0
        else:
            print "</"+tag+">"

    def handle_data(self, data):
        if self.in_killzone > 0:
            pass
        else:
            print data

p = MyFilter()
p.feed(sys.stdin.read())
p.close()
As you can see this just strips out Javascript and iframes on unencrypted pages.

Unfortunately it doesn't work on some sites! MSNBC for instance seems to have a way of thwarting it, despite being unencrypted - the browser still gets passed stuff with script tags in it, even though feeding that HTML directly to the filter script results in the tags being removed. I suspect that scripts are being passed to the browser by other means, and then generating the final HTML once already past the proxy.

I'll see about creating a whitelist based sanitizer next I guess...

Edit: hmm...

http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785

This could become rather involved. I haven't got much else to do today, so I'll get cracking on it.

Edit 2: oh wow that's some bad Python I wrote there. Wow.

Edit 3: wait wait wait. This works fine if I don't recompress stuff. But then compressed pages get messed up. Anyone know how to correct the headers for formerly compressed pages?

Edit 4: Okay I was doing it completely wrong. I will update the OP.

Gullible Jones · Jan 17, 2014

Okay, here is a better script:

Code:

#!/usr/bin/env python

from HTMLParser import HTMLParser
import sys

VALID_TAGS = ['html',
              'strong',
              'em',
              'p',
              'ul',
              'li',
              'br',
              'table',
              'div',
              'dt',
              'dd',
              'dl',
              'a',
              'style',
              'tr',
              'th',
              'form',
              'link',
              'body',
              'input',
              'td',
              'tbody',
              'option',
              'img',
              'field',
              'meta',
              'head',
              'fieldset',
              'legend',
              'thead',
              'tbody',
              'pre',
              'textarea',
              'span',
              'h1',
              'h2',
              'h3',
              'h4',
              'font',
              'nav',
              'section',
              'article',
              'footer',
              ]

class Whitelist(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.killzone = 0

    def handle_starttag(self, tag, attrs):
        try:
            del attrs.onload
        except AttributeError:
            pass
        if tag not in VALID_TAGS:
            self.killzone = self.killzone + 1
        else:
            print self.get_starttag_text()

    def handle_endtag(self, tag):
        if tag not in VALID_TAGS:
            self.killzone = self.killzone - 1
        else:
            print 
/
+tag+



    def handle_data(self, data):
        if self.killzone 
 0:
            pass
        else:
            print data

p = Whitelist()
p.feed(sys.stdin.read())
p.close()

Currently posting through it, it works quite well on most sites. Though there is a problem in that the onLoad attribute for some tags can load scripts; I have to figure out a way of ditching that without preventing the page from rendering...

Edit: note that the line for printing the ending tag doesn't show properly above, due to the symbols in it. I'll leave it though, it's fairly obvious what symbols have to be used.

[Update: added exception handling so that removal of onload attributes works.]

TheWindBringeth · Jan 17, 2014

Gullible Jones said:

Though there is a problem in that the onLoad attribute for some tags can load scripts;
Click to expand...

https://en.wikipedia.org/wiki/DOM_events

Gullible Jones said:

I have to figure out a way of ditching that without preventing the page from rendering...
Click to expand...

Have you considered using Content Security Policy directives, like gorhill?

Gullible Jones · Jan 17, 2014

The problem was exception handling, i.e. my mistake. I will update the script again.

Log in or Sign up

Apache as a proxy: passing content to scripts

Gullible Jones Registered Member

TheWindBringeth Registered Member

Gullible Jones Registered Member

JeffreyCole Developer

biased Registered Member

Gullible Jones Registered Member

Gullible Jones Registered Member

Gullible Jones Registered Member

TheWindBringeth Registered Member

Gullible Jones Registered Member

Log in or Sign up

Apache as a proxy: passing content to scripts

Gullible Jones Registered Member

TheWindBringeth Registered Member

Gullible Jones Registered Member

JeffreyCole Developer

biased Registered Member

Gullible Jones Registered Member

Gullible Jones Registered Member

Gullible Jones Registered Member

TheWindBringeth Registered Member

Gullible Jones Registered Member

Useful Searches