ERE Regular Expressions - need a tutor

Discussion in 'all things UNIX' started by Sully, Feb 11, 2013.

Thread Status:
Not open for further replies.
  1. Sully

    Sully Registered Member

    Joined:
    Dec 23, 2005
    Posts:
    3,719
    I don't use regular expressions much at all. And when I do, they are often a bit proprietary to the language, especially in the Windows arena.

    Recently I have been delving deeper into squid and squidguard, attempting to fine-tune my blocking/filtering. However, I haven't had much luck. Since I don't use regex daily, or do much Unix-based programming or computing at all, it is an exercise in frustration.

    Is there anyone here who has a good handle on regex that might learn me a thing or two?

    Sul.
     
  2. Sadeghi85

    Sadeghi85 Registered Member

    Joined:
    Dec 20, 2009
    Posts:
    747
  3. Sully

    Sully Registered Member

    Joined:
    Dec 23, 2005
    Posts:
    3,719
    Didn't know about that testing tool. Thanks.

    Actually found a great ebook on the topic. I was hoping there could be someone who could explain and show based on what I am trying to do (URL stuff). The books/tutorials take a more generic approach, much of it based on searching through files or gobs of text. URLs seem to require a bit more finesse from what I have seen, I think because a lot of metacharacters are used in the URL itself.

    But, thank you for the links. Your time is appreciated :)

    Sul.
     
  4. Sadeghi85

    Sadeghi85 Registered Member

    Joined:
    Dec 20, 2009
    Posts:
    747
    Any specific example? I might be able to help.

    btw, that PowerGREP help file is very concise, definitely read it. ;)
     
  5. Sully

    Sully Registered Member

    Joined:
    Dec 23, 2005
    Posts:
    3,719
    Well, if I have a link like this
    Code:
    http://www.google.com/imgres?um=1&hl=en&safe=off&sa=N&tbo=d&biw=1115&bih=758&tbm=isch&tbnid=E9cdnv058XgDCM:&imgrefurl=http://www.thejunglestore.com/dog&docid=-RZ8g1ltB2fanM&imgurl=http://www.thejunglestore.com/core/media/media.nl%253Fid%253D37516%2526c%253D432681%2526h%253D3c579cf84403f4536d5b&w=1024&h=768&ei=lzMbUYPQBeXAiwKPnoGAAw&zoom=1&ved=1t:3588,r:0,s:0,i:151&iact=rc&dur=732&sig=103448870397244548943&page=1&tbnh=181&tbnw=224&start=0&ndsp=15&tx=136&ty=68
    I have found the methods to use squid to block a domain or a path. However, this one needs a regex, it seems. It should be simple enough, and I have seen literally a hundred examples of using regex on this kind of URL, and also on ones like these:
    Code:
    www.google.com/images?q=
    www.google.com/imghp?q=
    tb0.gstatic.com/images?q=
    I have tried a lot of combinations, but nothing works thus far. I am slowly beginning to understand some of the principles of regex, and have tried permutations such as the following, which should work on the whole or part of the URL.
    Code:
    .*/imgres\?q=.*
    /(www.){0,1}(google\.).*\/(imgres)|(images)\?{0,1}/
    (?:imgres|images).
    /(www\.){0,1}(google\.).*\/(imgres|images)\?{0,1}/
    /(www\.){0,1}(google\.).*\/(imgres)|(images)\?{0,1}/
    These are url_regex patterns.
    I also tried these, which should work on the path only:
    Code:
    google.com/imghp
    google.com/images
    google\.com/imgres*
    And I got this one to work, which works on a destination domain. This effectively blocks only google.com but allows mail.google.com
    Code:
    ^(www\.)?(google\.com?(\...)?)
    I know that if I just block the dstdomain like this, it will block all the images.
    Code:
    tb0.gstatic.com
    tb1.gstatic.com
    However, I quite often find myself digging deeper than I need to just to learn. And that is the case here: I am wondering how to structure a regex that will match on
    Code:
    google.com/imgres?
    only. There must be some small thing I am missing here.
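    A quick way to see why some of these attempts miss is to try them outside squid. The sketch below uses Python's re module, which is an assumption: it is not the POSIX ERE engine that squid uses, although these particular patterns should behave the same way in both. The URL is a shortened copy of the link above, found() is a helper name made up for the example, and the path-only check assumes that a path-only ACL sees just the part after the host name.

```python
import re
from urllib.parse import urlsplit

# Shortened copy of the image-result link quoted above.
URL = "http://www.google.com/imgres?um=1&hl=en&safe=off"

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text (unanchored search)."""
    return re.search(pattern, text) is not None

# This attempt requires the literal text "/imgres?q=", but the link has "/imgres?um=".
print(found(r".*/imgres\?q=.*", URL))             # False

# A pattern that names the host cannot match if only the path is tested.
path_only = urlsplit(URL).path                    # "/imgres"
print(found(r"google\.com/imgres*", path_only))   # False
print(found(r"google\.com/imgres*", URL))         # True (against the full URL)

# Matching just the path piece plus a literal "?" is enough for this link.
print(found(r"/imgres\?", URL))                   # True
```

    Note that in the third attempt the * applies only to the final s, so it reads as "imgre followed by zero or more s", not "imgres followed by anything".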

    Sul.
     
  6. Sadeghi85

    Sadeghi85 Registered Member

    Joined:
    Dec 20, 2009
    Posts:
    747

    Using url_regex:

    Code:
    ^https?://(www\.)?google\.com/imgres.*
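    The pattern above can be sanity-checked before it goes into squid. This is only a sketch in Python's re module (an assumption, since squid uses its own regex library), and the test URLs other than the imgres one are made up for illustration. is_blocked() is a hypothetical helper name, not anything squid provides.

```python
import re

# The url_regex pattern from the post, compiled once.
PATTERN = re.compile(r"^https?://(www\.)?google\.com/imgres.*")

def is_blocked(url: str) -> bool:
    """Return True if the URL would be caught by the pattern."""
    return PATTERN.search(url) is not None

print(is_blocked("http://www.google.com/imgres?um=1&hl=en"))  # True
print(is_blocked("https://google.com/imgres?um=1"))           # True
print(is_blocked("http://mail.google.com/imgres?um=1"))       # False
print(is_blocked("http://www.google.com/images?q=dog"))       # False
```

    Because of the leading ^, the host must come right after the scheme, which is what keeps mail.google.com out.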
    
     
  7. NGRhodes

    NGRhodes Registered Member

    Joined:
    Jun 23, 2003
    Posts:
    2,331
    Location:
    West Yorkshire, UK
  8. m00nbl00d

    m00nbl00d Registered Member

    Joined:
    Jan 4, 2009
    Posts:
    6,623
    I've only been learning JavaScript Regex, but don't you need to escape //? So, it would become:

    Code:
    ^https?:\/\/(www\.)?google\.com/imgres.*
    
     
  9. Sadeghi85

    Sadeghi85 Registered Member

    Joined:
    Dec 20, 2009
    Posts:
    747
    Thanks for the link.

    In JavaScript, forward slash is the pattern delimiter, such as this:

    Code:
    string = string.replace(/regex/g, '');
    therefore all forward slashes in the pattern must be escaped. Here, there is no need to escape slashes.
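    To illustrate the point with a language that has no slash delimiters, here is a small Python sketch (Python is an assumption here, chosen only because its patterns are ordinary strings). It compares the plain pattern with the slash-escaped one against a shortened copy of the imgres link from earlier in the thread.

```python
import re

# The same pattern, written without and with escaped slashes.
plain = r"^https?://(www\.)?google\.com/imgres.*"
escaped = r"^https?:\/\/(www\.)?google\.com/imgres.*"

url = "http://www.google.com/imgres?um=1&hl=en"

# Python's re treats an escaped slash as a literal slash,
# so both spellings match the same text.
print(re.search(plain, url) is not None)    # True
print(re.search(escaped, url) is not None)  # True
```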
     
  10. m00nbl00d

    m00nbl00d Registered Member

    Joined:
    Jan 4, 2009
    Posts:
    6,623
    Yes, I know they are the pattern delimiters in JavaScript, which is why they must be escaped. I just wasn't sure about this regex version. :)

    Thanks for the clarification. :thumb:
     
  11. Sully

    Sully Registered Member

    Joined:
    Dec 23, 2005
    Posts:
    3,719
    I haven't had time yet to put this on squid and see what happens, but let me see if I get this right, as to what I understand now.
    Code:
    ^https?://(www\.)?google\.com/imgres.*
    • ^ - means this is the start of the line, so the literal http should be at the beginning of the line
    • https is a literal search string
    • ? after https would mean a choice - either http OR https? (really just a choice of the s or not)
    • :// are more literal strings, part of the URL
    • () parentheses are wrapped around www. to constrain that string, while the ? following it indicates that the preceding is optional. You enclose the www in parentheses because ? would be one character, but when www is enclosed, the ? matches or does not match anything within the parentheses?
    • www\. - the \ escapes the . which is a wildcard. Since we want to match the literal . we must escape it with \
    • google\.com - another literal string where we again escape the . to use it literally.
    • /imgres.* - the /imgres is literal. The . means any character and the * means unlimited, so effectively any number of characters, or the rest of the URL.
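    The breakdown above can also be written as an annotated pattern. This sketch uses Python's re.VERBOSE mode, which is a Python feature and not something squid supports; it is here only to line each piece up with its meaning. The second part shows why the parentheses around www\. matter.

```python
import re

# The same pattern, one piece per line (whitespace and # comments are
# ignored by the regex engine in VERBOSE mode).
ANNOTATED = re.compile(r"""
    ^             # start of the URL
    https?        # "http", then an optional "s"
    ://           # literal characters
    (www\.)?      # the whole group "www." is optional
    google\.com   # literal host, with the dot escaped
    /imgres       # literal path
    .*            # any characters, any number of them
""", re.VERBOSE)

print(ANNOTATED.search("http://www.google.com/imgres?um=1") is not None)  # True
print(ANNOTATED.search("https://google.com/imgres") is not None)          # True

# With the group, ? makes all of "www." optional.
print(re.match(r"(www\.)?google", "google") is not None)  # True
# Without the group, ? only makes the dot optional; "www" is still required.
print(re.match(r"www\.?google", "google") is not None)    # False
```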

    Does that describe it correctly?

    Thanks for the time you took to help.

    Sul.
     
  12. Sadeghi85

    Sadeghi85 Registered Member

    Joined:
    Dec 20, 2009
    Posts:
    747
    Yes, all of that is correct. :)
     
  13. Sully

    Sully Registered Member

    Joined:
    Dec 23, 2005
    Posts:
    3,719
    Thanks for the help.

    It's funny how I already had the knowledge to "sort of" do that, and only needed to see how my specific example would be structured to actually understand it.

    One final question then. I see many examples, even supposedly related to squid, that use the \/ sequence, as m00nbl00d mentioned. Like this:
    Code:
    /(www\.){0,1}(google\.).*\/(imgres)|(images)\?{0,1}/
    Why would they structure it with .*\/ then? I got that from the internet, and it was supposed to be for squid syntax, but if I understand you it is for some other regex, like Java or Perl, but not ERE?

    Sul.
     
  14. Sadeghi85

    Sadeghi85 Registered Member

    Joined:
    Dec 20, 2009
    Posts:
    747
    It's not about regex flavors; a forward slash by itself isn't a special character in any regex flavor that I know of.

    Some languages use the slash as the pattern delimiter, so it must be escaped so as not to confuse the language's interpreter; the regex engine receives an unescaped slash in that case. In other words, you're escaping it for the language's interpreter, not for the regex engine.

    It doesn't hurt to always escape slashes though, as the regex engine treats an escaped slash as just a slash.

    I don't use Squid, so I don't know if it needs escaped slashes or not, my guess is that it doesn't but I could be wrong.
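    One way to see what the delimiter-style pattern from the earlier posts does when nothing strips the outer slashes is to feed it to an engine as a plain string. The sketch below uses Python's re module as a stand-in, which is an assumption and may differ from squid's engine in other respects. The example.com URL is made up, and hit() is a hypothetical helper name.

```python
import re

def hit(pattern: str, url: str) -> bool:
    """Return True if the pattern matches anywhere in the URL."""
    return re.search(pattern, url) is not None

google = "http://www.google.com/imgres?um=1&hl=en"
other = "http://example.com/images/logo.png"   # made-up, unrelated site

# As found on the internet: | has the lowest precedence, so this reads as
# "everything up to (imgres)" OR "(images), optional ?, then a slash".
loose = r"/(www\.){0,1}(google\.).*\/(imgres)|(images)\?{0,1}/"
print(hit(loose, other))     # True: the right-hand branch matches "images/"

# Grouping the alternatives fixes that, but the outer slashes are now
# literal characters, so a "/" is required right after "imgres?".
grouped = r"/(www\.){0,1}(google\.).*\/(imgres|images)\?{0,1}/"
print(hit(grouped, google))  # False

# Same pattern with the outer slashes removed.
bare = r"(www\.){0,1}(google\.).*\/(imgres|images)\?{0,1}"
print(hit(bare, google))     # True
print(hit(bare, other))      # False
```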
     