Apache as a proxy: passing content to scripts

Discussion in 'other software & services' started by Gullible Jones, Jan 16, 2014.

Thread Status:
Not open for further replies.
  1. Gullible Jones

    Gullible Jones Registered Member

    Joined:
    May 16, 2013
    Posts:
    1,461
    I'm trying to implement web content filtering with Apache as a proxy, and would like to pass HTML to a filtering script. Is this possible? If so, can anyone point me to some documentation on it?

    Edit: turns out Google actually mentions this in a Q&A PDF. Unfortunately something keeps mangling the link, so I'll post the basics here for posterity.

    A (local) Apache proxy configuration looks something like this (note that you have to load mod_ext_filter, mod_deflate, mod_proxy, and mod_proxy_http):

    [Update: corrected use of filters]

    Code:
    Listen 127.0.0.1:8080
    
    <VirtualHost 127.0.0.1:8080>
      ProxyRequests On
      ProxyVia On
      <Proxy *>
        Order deny,allow
        Deny from all
        Allow from 127.0.0.1
      </Proxy>
    
    # Filters go here
    ExtFilterDefine filter_blah mode=output intype=text/html \
    cmd="/usr/local/bin/filter_blah.pl"
    
    SetOutputFilter INFLATE;filter_blah
    
    </VirtualHost>
    Edit: too bad regexes for HTML are a pain...
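    For anyone trying this: the mod_ext_filter contract is just a pipe. Apache feeds the response body to the command's stdin and sends the command's stdout back to the client; the script never touches headers or sockets. A minimal filter script sketch (Python 3 here; the "stamp" transform is hypothetical, just to make the filter's effect visible in page source):

```python
#!/usr/bin/env python3
# Sketch of a mod_ext_filter script: Apache pipes the response body
# to stdin and sends stdout to the client, so the script only ever
# deals with the body text.
import sys

def filter_body(html):
    # Hypothetical transform: stamp the page so "view source" shows
    # that the filter actually ran.
    return html.replace("</body>", "<!-- filtered --></body>")

def main():
    sys.stdout.write(filter_body(sys.stdin.read()))

# In the installed script, finish with an unconditional call: main()
```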
     
    Last edited: Jan 17, 2014
  2. TheWindBringeth

    TheWindBringeth Registered Member

    Joined:
    Feb 29, 2012
    Posts:
    2,088
    Since I've never tried this myself, I ran a few searches after seeing your question. FWIW, one article I found covers using ModSecurity in a forward-proxy arrangement to check a returned HTML document for malware links. It doesn't quite match your "pass HTML to a filtering script", but I'll post the link anyway:

    http://blog.spiderlabs.com/2011/04/modsecurity-advanced-topic-of-the-week-malware-link-removal.html

    It's interesting, and it hints that ModSecurity has some features worth exploring. I need to look at it more :)
     
  3. Gullible Jones

    Gullible Jones Registered Member

    Joined:
    May 16, 2013
    Posts:
    1,461
    Thanks, that's rather interesting...

    I was hoping to implement a Privoxy type of thing using Apache (because reinventing the wheel is fun). Unfortunately HTML is really brutal to process; arbitrarily deep nesting tends to make regexes vomit.

    The wise thing to do would be to use someone else's library... But that wouldn't teach me much, so I'll probably be reading up on parsers. Assuming I have the time (which I may not).
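    For the record, this is exactly what a parser gives you that a regex can't: a stack (or counter) for arbitrary nesting. A minimal sketch using Python 3's html.parser, which is the same stdlib module as Python 2's HTMLParser after the rename:

```python
from html.parser import HTMLParser  # named "HTMLParser" in Python 2

class DepthCounter(HTMLParser):
    """Track the maximum nesting depth of <div> elements, something a
    plain regular expression cannot do for arbitrary depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

p = DepthCounter()
p.feed("<div><div><p>hi</p></div></div><div></div>")
p.close()
# p.max_depth is now 2
```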
     
  4. JeffreyCole

    JeffreyCole Developer

    Joined:
    Dec 29, 2012
    Posts:
    433
  5. biased

    biased Registered Member

    Joined:
    Jul 22, 2013
    Posts:
    34
    Why wouldn't the already-developed tools work? Squid is a good example. And there are many others. And DansGuardian.

    Filtering can be as light or as deep as you want. I even ran Squid on Windows 7, haha. But it's nicer on a Unix box, or I like pfSense too.

    Is it for learning, and the challenge of a hand-made proxy/router?

    For myself, learning Squid alone is more than enough. So much to know. So much.

    I get you though, ha! Time well spent is doing the harder things, just because :thumb: and to learn!
     
  6. Gullible Jones

    Gullible Jones Registered Member

    Joined:
    May 16, 2013
    Posts:
    1,461
    @biased: Squid doesn't have an official Windows build any more. I was thinking less gateway and more local web filter (like Proxomitron). Though it would probably be better to have such things running on a dedicated gateway.

    @JeffreyCole: thank you very much, and LOL! I'm not versed in the theory of this stuff, so I didn't know that regexes can't handle recursively structured data formats.

    Edit: anyway 90% of what I want can be gotten by blanket disabling Javascript on unencrypted sites. I mean, when was the last time you really absolutely needed JS on an HTTP site? But doing this from the browser is easy, and therefore no fun. :)

    Edit 2: okay, if I want this to actually be good for anything, I am going to have to use a library. Cursory Googling says there's no way I can write a decent HTML parser with my current scripting know-how.
     
    Last edited: Jan 17, 2014
  7. Gullible Jones

    Gullible Jones Registered Member

    Joined:
    May 16, 2013
    Posts:
    1,461
    I cobbled up a simplistic filter this morning based on Python's HTMLParser:

    Code:
    #!/usr/bin/env python
    
    import HTMLParser  # Python 2 stdlib; renamed html.parser in Python 3
    import sys
    
    class MyFilter(HTMLParser.HTMLParser):
        """Drop <script> and <iframe> elements, echo everything else."""
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            # Nonzero while inside a <script> or <iframe> element
            self.in_killzone = 0
    
        def handle_starttag(self, tag, attrs):
            if tag == "script" or tag == "iframe":
                self.in_killzone = 1
            else:
                print self.get_starttag_text()
    
        def handle_endtag(self, tag):
            if tag == "script" or tag == "iframe":
                self.in_killzone = 0
            else:
                print "</" + tag + ">"
    
        def handle_data(self, data):
            # Swallow text inside the kill zone, echo everything else
            if self.in_killzone == 0:
                print data
    
        # Pass character entities through so "&amp;" etc. aren't lost
        def handle_entityref(self, name):
            if self.in_killzone == 0:
                print "&" + name + ";"
    
        def handle_charref(self, name):
            if self.in_killzone == 0:
                print "&#" + name + ";"
    
    p = MyFilter()
    p.feed(sys.stdin.read())
    p.close()
    As you can see, this just strips out JavaScript and iframes on unencrypted pages.

    Unfortunately it doesn't work on some sites! MSNBC for instance seems to have a way of thwarting it, despite being unencrypted - the browser still gets passed stuff with script tags in it, even though feeding that HTML directly to the filter script results in the tags being removed. I suspect that scripts are being passed to the browser by other means, and then generating the final HTML once already past the proxy.

    I'll see about creating a whitelist based sanitizer next I guess...

    Edit: hmm...

    http://stackoverflow.com/questions/699468/python-html-sanitizer-scrubber-filter/812785#812785

    This could become rather involved. I haven't got much else to do today, so I'll get cracking on it.

    Edit 2: oh wow that's some bad Python I wrote there. Wow.

    Edit 3: wait wait wait. This works fine if I don't recompress stuff. But then compressed pages get messed up. Anyone know how to correct the headers for formerly compressed pages?

    Edit 4: Okay I was doing it completely wrong. I will update the OP.
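    For anyone who hits the same wall: the mangling happens because gzipped bodies reach the external filter still compressed, and after the script rewrites them the old Content-Encoding header no longer matches the body. Chaining mod_deflate's INFLATE filter ahead of the external filter, as in the corrected config in the OP, should decompress the body and adjust the encoding headers before the script sees anything:

```apache
# INFLATE (mod_deflate) decompresses the response and fixes up the
# encoding headers; the external filter then receives plain text.
SetOutputFilter INFLATE;filter_blah
```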
     
    Last edited: Jan 17, 2014
  8. Gullible Jones

    Gullible Jones Registered Member

    Joined:
    May 16, 2013
    Posts:
    1,461
    Okay, here is a better script:

    Code:
    #!/usr/bin/env python
    
    from HTMLParser import HTMLParser
    import sys
    
    VALID_TAGS = ['html',
                  'strong',
                  'em',
                  'p',
                  'ul',
                  'li',
                  'br',
                  'table',
                  'div',
                  'dt',
                  'dd',
                  'dl',
                  'a',
                  'style',
                  'tr',
                  'th',
                  'form',
                  'link',
                  'body',
                  'input',
                  'td',
                  'tbody',
                  'option',
                  'img',
                  'field',
                  'meta',
                  'head',
                  'fieldset',
                  'legend',
                  'thead',
                  'pre',
                  'textarea',
                  'span',
                  'h1',
                  'h2',
                  'h3',
                  'h4',
                  'font',
                  'nav',
                  'section',
                  'article',
                  'footer',
                  ]
    
    class Whitelist(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            # Depth of nesting inside non-whitelisted elements
            self.killzone = 0
    
        def handle_starttag(self, tag, attrs):
            if tag not in VALID_TAGS:
                self.killzone = self.killzone + 1
            else:
                # Drop onload (it can load scripts), then rebuild the
                # tag from the remaining (name, value) attribute pairs
                kept = [(k, v) for (k, v) in attrs if k != "onload"]
                out = tag
                for k, v in kept:
                    if v is None:
                        out = out + " " + k
                    else:
                        out = out + ' %s="%s"' % (k, v)
                print "<" + out + ">"
    
        def handle_endtag(self, tag):
            if tag not in VALID_TAGS:
                # Clamp at zero so stray end tags can't unblock content
                self.killzone = max(self.killzone - 1, 0)
            else:
                print "</" + tag + ">"
    
        def handle_data(self, data):
            if self.killzone > 0:
                pass
            else:
                print data
    
    p = Whitelist()
    p.feed(sys.stdin.read())
    p.close()
    Currently posting through it, it works quite well on most sites. Though there is a problem in that the onLoad attribute for some tags can load scripts; I have to figure out a way of ditching that without preventing the page from rendering...
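    For what it's worth, onload is only one of many event-handler attributes (onclick, onerror, onmouseover, and so on), so a more thorough approach is to drop every attribute whose name starts with "on" and rebuild the start tag from whatever remains. A sketch of such a helper, in Python 3 syntax (clean_starttag is a hypothetical name, not part of the script above):

```python
def clean_starttag(tag, attrs):
    # attrs is the list of (name, value) pairs that
    # HTMLParser.handle_starttag receives; value is None for bare
    # attributes like "checked". Attribute values are not re-escaped.
    parts = [tag]
    for name, value in attrs:
        if name.lower().startswith("on"):
            continue  # drop onload, onclick, onerror, ...
        parts.append(name if value is None else '%s="%s"' % (name, value))
    return "<" + " ".join(parts) + ">"
```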

    Edit: note that the forum software tends to mangle lines containing angle brackets, so double-check the end-tag print line if you copy this.

    [Update: added exception handling so that removal of onload attributes works.]
     
    Last edited: Jan 17, 2014
  9. TheWindBringeth

    TheWindBringeth Registered Member

    Joined:
    Feb 29, 2012
    Posts:
    2,088
    https://en.wikipedia.org/wiki/DOM_events

    Have you considered using Content Security Policy directives, like gorhill?
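    For comparison: a proxy can get a similar "no scripts" effect without parsing any HTML by injecting a Content-Security-Policy response header, which needs mod_headers loaded. A minimal sketch:

```apache
# Tell the browser itself to refuse all script execution on pages
# served through this proxy (requires mod_headers).
Header set Content-Security-Policy "script-src 'none'"
```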
     
  10. Gullible Jones

    Gullible Jones Registered Member

    Joined:
    May 16, 2013
    Posts:
    1,461
    The problem was exception handling, i.e. my mistake. I will update the script again. :)
     