During the course of data analysis, there is a point where you will inevitably start generating averages, standard deviations, percentiles and so on, in an attempt to succinctly summarise your data. Statistical functions such as avg(), stdev(), median() and percX() help with this.
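For example, a quick way to generate several of these summary statistics in one go might look like this (a sketch, assuming your events contain a numeric field named g, as used later in this post):

``| stats avg(g), stdev(g), median(g), perc95(g)``

This produces a single row containing the mean, standard deviation, median and 95th percentile of g across all matching events.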

However, such statistics on their own sometimes won't give you a true picture of what your data actually represent. For example, suppose we compute some statistics for a set of data, and we get:

From these values we might conclude that our data has many values in the 50-60 range.

We can determine whether or not this is actually the case by using the bin command to plot a histogram of our values, i.e. a plot of the counts of a particular field of our data, grouped into bins.

The bin command will group values of a similar magnitude into the same “bin”. The size and number of the bins can be modified by specifying the span and bins options respectively. For example, we can plot 100 bins, each with a width of 0.1, as follows (note that g is the name of the field in this set of data):

``| bin g bins=100 span=0.1 as bin_values``

If we then perform a stats count on these new bin values and sort them:

``| stats count as bin_count by bin_values
| sort + bin_values``
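Putting the pieces together, the complete search might look something like the following sketch (the index name here is a placeholder for wherever your data actually lives):

``index=my_data
| bin g bins=100 span=0.1 as bin_values
| stats count as bin_count by bin_values
| sort + bin_values``

Each row of the result is one histogram bar: bin_values is the bin and bin_count is how many events fell into it, sorted in ascending bin order so the output plots as a histogram.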

We get the following plot:

The data are clearly distributed around two different means: the first curve is centred on a value of ~35 and the second on a value of ~76. What is remarkable is that, of the 400,000 data items used in this example, not one has a value between 40 and 69 – a fact not at all obvious from the initial set of statistics that we compiled.

In summary, if you are using statistical commands as part of your analysis, use the bin command to quickly check that your data is actually distributed in the way that you believe it to be.

For 2021 we’ve committed to posting a new Splunk tip every week!
