I would like to run a Splunk query over a long period of time (e.g., months or years), but there is so much data that I can only search over a few hours or days at a time.
However, for the question I want to answer, a uniform or statistically unbiased sample of the data would be enough. In other words, I would rather the query return N events spread out over the past month than N consecutive events.
One approach I considered is to search only events with date_minute=0, which quickly narrows the search to about 1/60th of the events. That helps, but it is not very flexible.
Is there a better way to sample events efficiently in Splunk?
If the search is too slow for your needs, I would suggest using report acceleration or data model acceleration. Alternatively, you can build your own tsidx files with tscollect (report and data model acceleration create such files automatically) and then run tstats over them.
I found a related discussion on sampling on the Splunk Answers page below.
http://answers.splunk.com/answers/3743/is-it-possible-to-get-a-sample-set-of-search-results-rather-than-the-full-search-results
An alternative to filtering by date_minute or date_second is to filter events in a where clause using the _serial property or the random() function. For example:
* | where (_serial % 60) = 0 | ...
or
* | where (random() % 60) = 0 | ...
However, in both cases the search appears to do a full scan of the data. This may still be worthwhile if you need the flexibility and the sampled result feeds into a more expensive query. Otherwise, the date_second approach is significantly faster, apparently because events are indexed by that field. For example, the two queries above ran in 3m 20s on a subset of data, whereas the query below ran in 11s on the same data.
* date_second=0 | ...
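To compare what the three filters above actually select, here is a minimal offline sketch in Python. It is not Splunk code: the function names and the synthetic timestamps are made up for illustration, and it assumes event times are spread uniformly over a 30-day span.

```python
import random

SECONDS_PER_MINUTE = 60

def sample_by_serial(timestamps, n=60):
    # Analogue of `where (_serial % 60) = 0`: keep every n-th event by position.
    return [t for i, t in enumerate(timestamps) if i % n == 0]

def sample_by_random(timestamps, n=60, rng=None):
    # Analogue of `where (random() % 60) = 0`: keep each event with probability 1/n.
    rng = rng or random.Random()
    return [t for t in timestamps if rng.randrange(n) == 0]

def sample_by_second(timestamps, second=0):
    # Analogue of `date_second=0`: keep events whose second-of-minute matches.
    return [t for t in timestamps if int(t) % SECONDS_PER_MINUTE == second]

def make_events(count, span_seconds, seed):
    # Hypothetical data: event times drawn uniformly over the span, in time order.
    rng = random.Random(seed)
    return sorted(rng.uniform(0, span_seconds) for _ in range(count))

events = make_events(count=600_000, span_seconds=30 * 24 * 3600, seed=42)
by_serial = sample_by_serial(events)
by_random = sample_by_random(events, rng=random.Random(7))
by_second = sample_by_second(events)

# Print the size of each sample and the fraction of events it keeps.
for name, sample in [("serial", by_serial), ("random", by_random), ("second", by_second)]:
    print(name, len(sample), round(len(sample) / len(events), 4))
```

With uniformly spread timestamps, each filter keeps roughly 1/60th of the events. The assumption matters for the date_second filter: if some sources tend to log on particular seconds (for example, scheduled jobs that fire on the minute), that filter would over- or under-represent them, whereas the random() filter would not.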