I am trying to get a more dtrace-style distribution output when running awk over large log files after a DDoS, so that the output is easier to read:
# tail -1000 access_log | awk '{ print $1 }' | sort | uniq -c | sort -nr | awk '{printf("\n%s ",$0) ; for (i = 0; i<$1 ; i++) {printf("*")};}'
  43 192.168.0.1 *******************************************
  38 192.168.0.2 **************************************
Hopefully it could look something like:
       value  ------------- Distribution ------------- count    
 192.168.0.1  @@@@@@@@@                                43 
 192.168.0.2  @@@@@@@@                                 38 
Here the @s are a scaled-down summary of the count, instead of one * per request. Having it scale automatically on each run would be an added bonus, so that I don't have to do the maths by hand to work out how to rank each count.
Your pipeline is actually pretty good. It really just needs to scale large counts. I replaced your tail -1000 access_log | awk '{ print $1 }' | with an unsorted file of IP addresses from one of my web servers, and added head -20 to print only the most active IP addresses.
$  sort ip.txt | uniq -c | sort -nr | \
>  awk 'NR==1{scale=$1/50} \
>       {printf("\n%-23s ",$0) ; \
>        for (i = 0; i<($1/scale) ; i++) {
>            printf("*")}; \
>        }' | head -20
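The important parts are:
NR==1{scale=$1/50} calculates the scaling factor that fits the maximum count into 50 characters. The input is sorted in descending order, so the first line holds the largest count.
printf("\n%-23s ",$0) uses the width specifier %-23s to left-align the count and IP address within a 23-character field. The leading \n starts each row on a new line, so the output begins with a blank line and head -20 shows 19 rows.
My output looks like this. I masked the IP addresses.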
   824 xx.xxx.xx.39    **************************************************
   149 xx.xxx.xxx.176  **********
   138 xx.xxx.xxx.191  *********
   137 xx.xxx.xxx.41   *********
   105 xx.xxx.xxx.8    *******
    97 xx.xxx.xxx.21   ******
    96 xx.xxx.xx.220   ******
    91 xx.xx.xxx.198   ******
    87 xx.xxx.xxx.195  ******
    85 xx.xxx.xx.221   ******
    79 xxx.xxx.xxx.86  *****
    69 xx.xx.xx.12     *****
    68 xxx.xxx.xxx.159 *****
    65 xx.xxx.xxx.66   ****
    63 xx.xxx.xx.28    ****
    60 xx.xxx.xxx.104  ****
    59 xxx.xxx.xxx.242 ****
    59 xxx.xx.xxx.66   ****
    56 xx.xxx.xxx.202  ****
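For the dtrace-like layout the question asks for (value on the left, @ bar in the middle, count on the right), the same scaling idea can be rearranged. The following is only a sketch under a few assumptions: the function name dist_hist is made up for illustration, the bar width is fixed at 40 characters to match the header, and the input is the "count value" output of uniq -c | sort -nr, so the first line carries the maximum.

```shell
# dist_hist: hypothetical helper, reads "count value" lines sorted by
# descending count and prints a dtrace-like distribution table.
dist_hist() {
    awk '
        BEGIN { width = 40 }   # bar width; matches the 40-character header
        NR == 1 {
            max = $1           # first line holds the largest count
            printf "%16s  %s  %s\n", "value", \
                "------------- Distribution -------------", "count"
        }
        {
            # Scale each count against the maximum, rounding to nearest.
            n = int($1 * width / max + 0.5)
            bar = ""
            for (i = 0; i < n; i++) bar = bar "@"
            # Build the format string by hand; "%-*s" is not portable in awk.
            fmt = "%16s  %-" width "s  %d\n"
            printf fmt, $2, bar, $1
        }'
}
```

It would slot into the original pipeline as tail -1000 access_log | awk '{ print $1 }' | sort | uniq -c | sort -nr | dist_hist. Unlike the loop above, which rounds partial characters up, this sketch rounds to the nearest character, so very small counts can end up with an empty bar.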
This kind of output has a human-factors problem. People judge graphs like these by the area of the lines (the asterisks). Since each run is scaled to its own largest count, you can't visually compare two of these graphs with any reliability.
Your eyes and brain want you to judge the length of the lines. (I'm not sure where I learned this. Maybe from Tufte's books, or from studying statistics.) But the scaling might mean that the longest line on one graph represents 800, while an identical line on another graph might represent only 100. Your eyes and brain want to believe those two are roughly equal, even though one is eight times as big as the other, and even though you can see the raw numbers.
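One way around this, if graphs from different runs need to be compared, is to fix the scale instead of deriving it from the largest count. The sketch below is an assumption about how that could look, not something from the answer above: hist_fixed is a hypothetical name, its argument is the number of requests each * stands for (default 20), and bars are clipped at 60 characters with a trailing > so a huge count cannot overrun the terminal.

```shell
# hist_fixed UNITS: hypothetical helper, draws one "*" per UNITS requests
# so that the same bar length means the same count on every run.
hist_fixed() {
    awk -v units="${1:-20}" -v cap=60 '
        {
            n = int($1 / units)
            if (n * units < $1) n++                # round partial units up
            over = ""
            if (n > cap) { n = cap; over = ">" }   # mark clipped bars
            bar = ""
            for (i = 0; i < n; i++) bar = bar "*"
            printf "%-23s %s%s\n", $0, bar, over
        }'
}
```

For example, sort ip.txt | uniq -c | sort -nr | hist_fixed 20 | head -20 draws 42 asterisks for a count of 824 and 5 for a count of 100, whatever the largest count in that run happens to be. The trade-off is that the unit has to be chosen up front.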