
Getting three highest website hitters via awk

Tags: awk

So I'm trying to write an awk script that finds the three IP addresses with the most hits, in descending order. I'm working from an Apache web log that looks like this:

192.168.198.92 - - [22/Dec/2002:23:08:37 -0400] "GET    / HTTP/1.1" 200 6394 www.yahoo.com    "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-"
192.168.198.92 - - [22/Dec/2002:23:08:38 -0400] "GET    /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com    "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-"
192.168.72.177 - - [22/Dec/2002:23:32:14 -0400] "GET    /news/sports.html HTTP/1.1" 200 3500 www.yahoo.com    "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE ...)" "-"
192.168.72.177 - - [22/Dec/2002:23:32:14 -0400] "GET    /favicon.ico HTTP/1.1" 404 1997 www.yahoo.com    "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)..." "-"
192.168.72.177 - - [22/Dec/2002:23:32:15 -0400] "GET    /style.css HTTP/1.1" 200 4138 www.yahoo.com    "http://www.yahoo.com/index.html" "Mozilla/5.0 (Windows..." "-"
192.168.72.177 - - [22/Dec/2002:23:32:16 -0400] "GET    /js/ads.js HTTP/1.1" 200 10229 www.yahoo.com    "http://www.search.com/index.html" "Mozilla/5.0 (Windows..." "-"
192.168.72.177 - - [22/Dec/2002:23:32:19 -0400] "GET    /search.php HTTP/1.1" 400 1997 www.yahoo.com    "-" "Mozilla/4.0 JJohnJoJJJJJoJJoJJJJJoJJohJJJJJJJJJJJJohnJohJoJoJJJoJJ

To do this I do:

$1 ~ /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ {
hitCounter[$1]++
notIndexed=1
for(i in ips) {
if (i==$1) { notIndexed=0 }
}
if(notIndexed==1) {
ips[indexx]=$1
indexx++
}
}

This rule detects an IP address and increments its hit count in the "hitCounter" array, which is indexed by IP. After that I check the list of IPs, "ips", to see whether the current IP is already in it. If not, the IP is added to "ips" and the index is increased by one. In theory, each index in "ips" should then correspond to an entry in "hitCounter". Finally I have...

END {

indexxx=0
for (i in hitCounter) {
if (i>hitCounter[firstIP])
    firstIP=ips[indexxx]
else if (i>hitCounter[secondIP])
    secondIP=ips[indexxx]
else
    thirdIP=ips[indexxx]

indexxx++
}

}

Here I go through the hit counts in "hitCounter" and compare each one to the counts of the three current leaders. If a count is greater than that of one of the three leader variables, I set that variable to the current IP.

This seems like it should work to me and I should get "192.168.72.177 192.168.198.92" as the output but instead I get "192.168.198.92 192.168.198.92".

Why?

EDIT: Sorry, this is how I print the final results which is placed right after the "hitCounter" foreach loop...

print "The most hits were from "firstIP" "secondIP" "thirdIP
user578086 asked Dec 05 '25 15:12


1 Answer

Instead of searching for the IP each time to see if it exists in the list of IP addresses, I'd do this:

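$1 ~ /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/ {
    hitCounter[$1]++
}


END {

    # Keep firstIP, secondIP and thirdIP ordered by hit count, highest first.
    # An unset variable is "", and hitCounter[""] compares as 0.
    for (ip in hitCounter) {
        if (hitCounter[ip] > hitCounter[firstIP]) {
            # New leader: shift the previous first and second down one place.
            thirdIP = secondIP
            secondIP = firstIP
            firstIP = ip
        } else if (hitCounter[ip] > hitCounter[secondIP]) {
            thirdIP = secondIP
            secondIP = ip
        } else if (hitCounter[ip] > hitCounter[thirdIP]) {
            thirdIP = ip
        }
    }

    print "The most hits were from " firstIP " " secondIP " " thirdIP

}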
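As an alternative to tracking three variables by hand, the counting can stay in awk while the ranking is handed to sort and head. This is only a sketch of that swapped-in approach, not the method above: the sample input is hypothetical (including the 10.0.0.1 address), the array name hits is my own, and the regex uses + instead of {1,3} on the assumption that some awk builds lack interval expressions.

```shell
# Hypothetical sample: 5 hits, 2 hits and 1 hit (only field 1 is used).
sample=$(printf '%s\n' \
    '192.168.72.177 a' '192.168.198.92 a' '192.168.72.177 a' \
    '192.168.72.177 a' '10.0.0.1 a' '192.168.198.92 a' \
    '192.168.72.177 a' '192.168.72.177 a')

# Count hits per IP in awk, print "count ip", then rank numerically.
top3=$(printf '%s\n' "$sample" |
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { hits[$1]++ }
         END { for (ip in hits) print hits[ip], ip }' |
    sort -rn | head -n 3)

echo "$top3"
```

With a real log, the printf line would be replaced by the log file as awk's input. IPs with equal counts come out in whatever order sort gives them.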

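I think part of your confusion was in thinking that i was the value rather than the key in for (i in hitCounter). The loop variable receives each key (here, the IP address), so the count has to be looked up as hitCounter[i]. Also, for-in visits the keys in an unspecified order, so a separate counter such as indexxx will not line up with the "ips" array.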
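A minimal sketch of the key-versus-value point, using a one-element array so the output does not depend on iteration order (the count of 5 is just an illustrative value):

```shell
out=$(awk 'BEGIN {
    hitCounter["192.168.72.177"] = 5
    # i receives the key (the IP string); the count is hitCounter[i].
    for (i in hitCounter)
        print "key=" i " value=" hitCounter[i]
}')

echo "$out"
```

This prints key=192.168.72.177 value=5, which shows that comparing i against a hit count compares an IP string with a number.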

Dennis Williamson answered Dec 08 '25 07:12


