Tuesday, March 15, 2011

Regexp pattern matching IP and UserAgent in a Huge File

Hi, all. I have a huge log file that has a structure like this:

ip=X.X.X.X
userAgent=Firefox
-----
Referer=hxxp://www.bla.org

I want to create a custom output like this: ip:userAgent

For example:

X.X.X.X:Firefox

The pattern should ignore lines that don't start with ip= or userAgent=. (The two must form a pair, as shown above.)

I am a newbie administrator and our client needs a sorted file immediately. Any help will be wonderful. Thanks.

From stackoverflow
  • You can use:

    ^ip=((?:[0-9]{1,3}\.){3}[0-9]{1,3})$
    

    And

    ^userAgent=(.*)$
    

    Take group 1 from each match and you will have the desired data.

  • ^ip=(\d+(?:\.\d+){3})[\r\n]+userAgent=(.+)$
    

    Apply in global + multiline mode.

    Group 1 will contain the IP, group 2 will contain the user agent string.

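    Edit: The above expression can be simplified a bit. We can drop the IP address format checking, assuming that there will be nothing but real IP addresses in the log file. A character class is used here rather than a repeated group such as (\d+\.?)+, because a repeated capturing group keeps only its last repetition and group 1 would then hold just the final octet:

    ^ip=([\d.]+)[\r\n]+userAgent=(.+)$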
    
  • Give this a try (it is in no way robust if your log file contains lines that differ from the example snippet above):

    sed -n -e '/^ip=/ {s///
    N
    s/\nuserAgent=/:/
    p 
    }' HugeFile > customoutput
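
For anyone who would rather script this than use sed, here is a minimal Python sketch of the same idea. It reuses the two patterns from the first answer and reads the file one line at a time, so the whole log never has to fit in memory. The function names `ip_agent_pairs` and `convert`, the file paths, and the UTF-8 encoding are assumptions made for illustration; nothing in the question specifies them.

```python
import re
from typing import Iterable, Iterator

# Patterns from the first answer; group 1 holds the value.
IP_RE = re.compile(r"^ip=((?:[0-9]{1,3}\.){3}[0-9]{1,3})$")
UA_RE = re.compile(r"^userAgent=(.*)$")


def ip_agent_pairs(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'ip:userAgent' for each ip= line directly followed by a userAgent= line."""
    pending_ip = None  # IP captured on the previous line, if it was an ip= line
    for raw in lines:
        line = raw.rstrip("\r\n")  # tolerate both Unix and Windows line endings
        ua = UA_RE.match(line)
        if pending_ip is not None and ua:
            yield f"{pending_ip}:{ua.group(1)}"
        ip = IP_RE.match(line)
        # Any line that is not an ip= line breaks the pairing.
        pending_ip = ip.group(1) if ip else None


def convert(src_path: str, dst_path: str) -> None:
    """Stream src_path (hypothetical path) and write one pair per line to dst_path."""
    # Encoding is an assumption; undecodable bytes are replaced rather than raising.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for pair in ip_agent_pairs(src):
            dst.write(pair + "\n")
```

Calling `convert("HugeFile", "customoutput")` would mirror the sed command above. Unlike the sed version, an ip= line that is not immediately followed by a userAgent= line is skipped instead of being printed half-converted. The output is in file order; since a sorted file was requested, it can be passed through the standard `sort` utility afterwards.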
    
