Even though I only started with iDelta a few months ago, I was thrilled to be able to attend Splunk .conf in Las Vegas along with the rest of my team. As a relative Splunk newbie, I had a lot to learn, and it was great to have the opportunity to be taught by so many people with considerable experience!
A definite highlight for me was just being able to get a sense of how versatile Splunk can be by wandering the source=*Pavilion, where there were displays of various Splunk products and use cases from Splunk and partners. There were also a variety of games, involving remote-controlled robots, AI facial expression recognition, and drone flights, all of which made novel use of Splunk to record various statistics!
The talks were also really interesting; one that gave me a lot of lightbulb moments was by Martin Müller, from Consist Software Solutions GmbH, called “Fields, Indexed Tokens and You”. This talk explained what goes on behind the scenes of a Splunk search, with tips on how to make searches more efficient.
Fields, Indexed Tokens and You: Key takeaways
Splunk breaks up events into tokens which it can then search on: Splunk segments each event twice, first on major breakers (such as spaces and newlines) and then each resulting piece on minor breakers (such as hyphens, colons, and full stops), producing a set of indexed tokens for every event in the log. These tokens are stored in the tsidx file, and act as pointers to the raw event data.
Say we have an event which begins: 2019-10-21 18:55:05.001. “2019-10-21” would be treated as a whole token as it contains no major breakers, but the minor breakers would also create tokens “2019”, “10” and “21”. A search for any of these four tokens would find this event, even if you didn’t mean “10” in the context of a date.
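To make that concrete, here is a rough Python sketch of that two-pass segmentation. The breaker sets below are illustrative stand-ins, not Splunk's actual defaults:

```python
import re

# Illustrative breaker sets -- Splunk's real defaults are longer.
MAJOR_BREAKERS = r"[ \t\n\[\]<>(){}]"   # split the raw event into major tokens
MINOR_BREAKERS = r"[-/:.=@#]"           # further split each major token

def tokenize(raw):
    """Return the set of indexed tokens for a raw event string."""
    tokens = set()
    for major in re.split(MAJOR_BREAKERS, raw):
        if not major:
            continue
        tokens.add(major)                  # the whole major token...
        for minor in re.split(MINOR_BREAKERS, major):
            if minor:
                tokens.add(minor)          # ...plus its minor pieces
    return tokens

print(tokenize("2019-10-21 18:55:05.001"))
```

Running this on the timestamp above yields "2019-10-21" as a whole token plus the minor pieces "2019", "10" and "21", matching the behaviour described in the talk.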
The most efficient searches have a scan count close to the event count: When you run a search, Splunk uses the minor breakers to split your search string into tokens to look up in the tsidx. Whenever a token matches, it points to a raw event, which Splunk then reads to check whether it satisfies the full requirements of the query. Scanning raw events is resource-intensive, so the token lookup is used to limit the number of raw events that need to be checked.
The job inspector can tell you how many events were scanned (i.e., matched a token from the search) versus how many events were returned (i.e., matched the full requirements of the query). A large gap between these two numbers indicates an inefficient search.
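As a sketch of that two-phase process (cheap token filter first, expensive raw-event check second), here is a toy version in Python. The events, the crude tokenizer, and the `search` helper are all made up for illustration:

```python
import re

def tokenize(raw):
    """Crude stand-in for Splunk's breaker-based segmentation."""
    return {t for t in re.split(r"[ \t:=./-]", raw) if t}

EVENTS = [
    "2019-10-21 18:55:05.001 httpStatusCode=200 path=/home",
    "2019-10-21 18:55:06.321 httpStatusCode=404 path=/missing",
    "2019-10-22 09:12:44.007 httpStatusCode=200 path=/login",
]

def search(token, predicate):
    """Token lookup first (cheap), raw-event check second (expensive)."""
    scanned, results = 0, []
    for event in EVENTS:
        if token in tokenize(event):    # tsidx-style token lookup
            scanned += 1                # the raw event must now be read
            if predicate(event):        # full check against the query
                results.append(event)
    return results, scanned, len(results)

# The token "200" narrows the scan; the predicate confirms the full match.
hits, scan_count, event_count = search("200", lambda e: "httpStatusCode=200" in e)
print(scan_count, event_count)   # 2 2 -> scan count equals event count
```

Here every scanned event is also returned, which is the efficient case the talk recommends aiming for; a vaguer token (say, "2019") would scan all three events to return the same two.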
Be careful of wildcards: One of the first lessons you learn in Splunk is to avoid leading wildcards, but this talk showed that there is more to be aware of. Using a wildcard in place of punctuation can break your search if that symbol is also a breaker – for example, a search for 2019*10 won’t match any of the tokens created from 2019-10-21.
Even wildcard characters at the end of search terms can cause problems if you don’t think about them. A common search is for web events where the response code is successful, which will have the format 2xx, e.g. 200 or 201. Searching for httpStatusCode=2* seems like a reasonable way to find these events, but Splunk will actually interpret this as a search for any token beginning with a 2. For the next 980 years, that will mean every event containing a date!
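Both wildcard pitfalls can be demonstrated by matching glob patterns against a simulated token set. Again, the tokenizer and sample event below are illustrative, not Splunk's real implementation:

```python
import re
from fnmatch import fnmatch

def tokenize(raw):
    """Crude stand-in for Splunk's breaker-based segmentation."""
    tokens = set()
    for major in raw.split():
        tokens.add(major)                                      # whole major token
        tokens.update(t for t in re.split(r"[-:.=/]", major) if t)  # minor pieces
    return tokens

tokens = tokenize("2019-10-21 18:55:05.001 httpStatusCode=201")

# "2019*10" matches no single token: the hyphen is a breaker, so the date
# is indexed as "2019-10-21", "2019", "10", "21" -- none shaped like 2019*10.
print(any(fnmatch(t, "2019*10") for t in tokens))     # False

# "2*" matches the status code token -- but also every date token.
print(sorted(t for t in tokens if fnmatch(t, "2*")))  # ['201', '2019', '2019-10-21', '21']
```

A more token-friendly habit is to search on terms that survive segmentation intact, rather than wildcarding across breaker characters.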
These are only three of the main things I took away from this great talk – for more information, you can replay the .conf slides and audio here: https://conf.splunk.com/files/2019/summit/FN1003.mp4