Simple Haskell: Parsing Logs (Part II)
Regular expressions are probably one of the most useful tools any programmer can know, but they can be brittle, especially when they get long or need to be composed. This post presents an alternative using parsers and parser combinators, along with a short solution built on a data streaming library.
All of the following code snippets are present in this github repo: logan-c-brown/log-parse-2
Parsing, not Searching
While regular expressions can be very good at finding data, putting multiple regular expressions together is not so straightforward. Regular expressions cannot be checked for validity at compile time, so they typically rely heavily on integration and unit tests to ensure they are not mechanically broken. An alternative is a parser, which defines a grammar to validate or break up syntax. For example, JSON must adhere to certain rules for how it “looks” to be considered JSON.
The same methodology applied to JSON can be applied to creating a custom grammar for log files. attoparsec is a pretty solid parser library – it will be used for the following examples.
Looking back at one of the sample log files, the lines look something like this.
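The original sample is not reproduced here; a couple of illustrative lines, assembled from the fragments quoted later in this post (the hostname and the exact timestamps are made up), might look like:

```
Mar 14 09:28:51 web-01 kernel: [165049.217848] CPU0: Core temperature/speed normal
Mar 14 09:28:53 web-01 kernel: [165051.003210] Unhandled alg: 0x71b
```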
In the log files above, a common pattern is that every line starts with a date in the form MONTH DAY TIME. The parser for the log files should therefore begin with a date, which can be modeled as follows.
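The original snippet is not reproduced here; below is a minimal sketch of what such a date parser could look like (the LogDate type, its field names, and the dateP name are assumptions):

```haskell
{-# LANGUAGE OverloadedStrings #-}   -- lets string literals be Text

import Control.Applicative (optional, (<|>))
import Data.Attoparsec.Text
import qualified Data.Attoparsec.Text as P
import Data.Char (isDigit, isSpace)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical model of the log timestamp: month, day, and time as Text.
data LogDate = LogDate
  { month :: Text
  , day   :: Text
  , time  :: Text
  } deriving (Show)

-- Parse "MONTH DAY TIME", e.g. "Mar 14 09:28:51".
dateP :: Parser LogDate
dateP =
  LogDate
    <$> P.take 3               -- three-character month abbreviation
    <*  skipSpace              -- ignore whitespace after the month
    <*> P.takeWhile1 isDigit   -- day of the month
    <*  skipSpace
    <*> timeOfDayP             -- HH:MM:SS
  where
    -- two-character groups separated by ':'
    timeOfDayP = T.intercalate ":" <$> P.take 2 `sepBy` char ':'
```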
The data type Text is used to represent each of the three components of the log’s timestamp. Text is a more efficient representation of textual data than String, so it is preferred. To parse the date using the grammar above, the applicative functions <*>, <*, *>, and <$> are used to break up the Text data.
- <* takes the left argument and ignores the right one; for example, Just "a" <* Just "b" yields Just "a".
- <*> applies a wrapped function to a wrapped value; for example, Just (2 +) <*> Just 2 yields Just 4. Take a look at the applicative laws here.
- <$> also applies functions; for example, (2+) <$> Just 10 yields Just 12. This is just like fmap.
- take consumes a fixed number of characters.
- skipSpace consumes and ignores spaces.
- sepBy ensures that every 2 characters is separated by ':'.
That is the date parser for the format in the log! The great part about using a parser is that not only can undefined behavior at runtime be ruled out, but the type system can also be used to merge together many other different parsers.
[\S]+ (but for Parsers)
After the date comes the machine name, which should not contain spaces. The function not . isSpace is identical to the anonymous function (\x -> not (isSpace x)). P.takeWhile1 is used so that parsing fails unless at least one non-space character is matched.
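A sketch of that machine-name parser (the name machineP is an assumption); it reuses the imports from the date-parser sketch above:

```haskell
-- One or more non-space characters; unlike takeWhile, this fails on empty input.
machineP :: Parser Text
machineP = P.takeWhile1 (not . isSpace)
```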
Returning non-Text Data
The EventTime type represents the bracketed machine time in the log, as seen in kernel: [165049.217848]. A regular expression for this would look like kernel: \[(\d+)\]. The advantage of using a parser is that the type of the data being parsed can be encoded in the parser. Whereas the regular expression will only return the string inside the capture group (\d+), we can enforce at compile time that the “capture group” of our parser is always a number. On the other hand, if the regular expression were changed to ([^\]]+), there would be no way for the compiler to know which other parts of the code are now broken.
An example of this type enforcement is the double parser function used inside the timeP parser below.
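A sketch of what timeP could look like, with EventTime assumed here to be a simple wrapper around Double:

```haskell
newtype EventTime = EventTime Double
  deriving (Show)

-- Parse e.g. "kernel: [165049.217848]" into an EventTime.
timeP :: Parser EventTime
timeP = do
  _ <- string "kernel: ["
  skipSpace               -- the bracketed number may be padded with spaces
  t <- double             -- fails unless the text is a valid number
  _ <- char ']'
  pure (EventTime t)
```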
The parser above looks a bit different from the date parser. To avoid lots of symbols in the code, it can be visually helpful to use the do syntactic sugar to combine separate parsers. timeP parses text as follows:
- Ensures that the text fed into the parser begins with “kernel: [”
- The next few characters could be spaces
- The next characters must be a valid number, and the result is transformed from Text data into the type Double
- The next character must be the closing bracket, ]
- The result is returned as an EventTime wrapped in the applicative, Parser, using the pure function.
After the event time comes a device name, which does not exist in all of the logs.
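The original device-name parser is not shown; below is a sketch of one way to model an optional field. The exact shape of a device name is an assumption, and attoparsec’s automatic backtracking means a failed attempt consumes no input:

```haskell
-- The device name is optional, so the parser returns a Maybe.
deviceP :: Parser (Maybe Text)
deviceP = optional $ do
  name <- P.takeWhile1 (\c -> c /= ':' && not (isSpace c))
  _    <- char ':'
  -- Don't mistake the upcoming "CPU0:" prefix for a device name.
  if "CPU" `T.isPrefixOf` name
    then fail "not a device name"
    else pure name
```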
Parsing the CPU Information
After the device name comes a CPU number. A conversion to a numeric type is not performed here, because it’s fairly unlikely anyone will regularly be doing addition or subtraction on CPU numbers.
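A sketch of the CPU-number parser along those lines; the cpuNumP name and the exact “CPUn:” shape are assumptions:

```haskell
-- The CPU number is kept as Text, since we never do arithmetic on it.
cpuNumP :: Parser Text
cpuNumP = string "CPU" *> P.takeWhile1 isDigit <* char ':'
```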
In addition to the CPU number, information about the status of the CPU will also be stored. This information appears as something like Core temperature/speed normal in the log. Because a CPU should not occupy more than one state, an ADT is a good choice for a model. Ignore the Generic type class for now; it will be explained during the JSON parsing section.
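A sketch of the CpuState ADT; Online appears later in the post, while the other constructor names are assumptions:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import GHC.Generics (Generic)

-- One constructor per possible CPU state.
data CpuState
  = Online
  | TemperatureNormal
  | TemperatureAboveThreshold
  deriving (Show, Generic)
```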
Each of the CPU states has some sort of representation in the log file. In the next code snippet the <|> function is used to choose the first successful parse of the text that mentions the state of the CPU. For the Maybe type, <|> can be thought of as coalesce, for those familiar with SQL.
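A sketch of cpuStateP; the “is up” and “Core temperature/speed normal” phrases come from the post, while the remaining phrase is an assumption:

```haskell
cpuStateP :: Parser CpuState
cpuStateP =
      (string "is up"                             >> pure Online)
  <|> (string "Core temperature/speed normal"     >> pure TemperatureNormal)
  <|> (string "Core temperature above threshold"  >> pure TemperatureAboveThreshold)
```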
In the cpuStateP parser, if the text matches “is up” then Online is returned. Type-safety ensures that the code can’t return anything other than a CpuState unless all of the parser transformations using (>>) are changed.
The complete cpu-information parser is combined as:
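A sketch of that combined parser; the CpuInfo record and its field names are assumptions:

```haskell
data CpuInfo = CpuInfo
  { cpuNumber :: Text
  , cpuState  :: CpuState
  } deriving (Show, Generic)

-- "CPU0: Core temperature/speed normal" -> CpuInfo "0" TemperatureNormal
cpuInfoP :: Parser CpuInfo
cpuInfoP = CpuInfo <$> (cpuNumP <* skipSpace) <*> cpuStateP
```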
Remainder (r)
The rest of the parser is either this CPU state or a generic event message such as Unhandled alg: 0x71b. These are modeled as separate types so that in the future more parsing can be added for wifi status, power status, or anything else in the logs just by adding more alternatives onto this parser. For now the logs are bucketed into either a cpu-related log or a generic event log.
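A sketch of that remainder parser; the LogEvent, CpuEvent, and GenericEvent names are assumptions:

```haskell
-- A log body is either CPU information or a generic event message.
data LogEvent
  = CpuEvent CpuInfo
  | GenericEvent Text
  deriving (Show, Generic)

eventP :: Parser LogEvent
eventP = (CpuEvent <$> cpuInfoP)
     <|> (GenericEvent <$> P.takeText)   -- fall back to the rest of the line
```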
Putting the Parser Together
All that is left is to combine all of our separate parsers into one big one. This is done exactly as it was for the rest of the parsers – here the do syntax is used. The model of a log is completed here: a “successfully” parsed log becomes a KernelLog, and an unknown log holds the original text representation of the log together with the String that attoparsec returns during unsuccessful parses.
All the parsers are combined and are syntactically separated by spaces.
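A sketch of the full model and the combined parser, built from the pieces above; all the record and constructor names are assumptions:

```haskell
data KernelLog = KernelLog
  { logDate    :: LogDate
  , logMachine :: Text
  , logTime    :: EventTime
  , logDevice  :: Maybe Text
  , logEvent   :: LogEvent
  } deriving (Show)

data Log
  = Known KernelLog
  | UnknownLog Text String   -- the original text plus attoparsec's error message
  deriving (Show)

kernelLogP :: Parser KernelLog
kernelLogP = do
  date    <- dateP
  skipSpace
  machine <- machineP
  skipSpace
  evTime  <- timeP
  skipSpace
  device  <- deviceP
  skipSpace
  event   <- eventP
  pure (KernelLog date machine evTime device event)
```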
A function to parse from Text to a Log representation, where all unsuccessful parses are thrown into an UnknownLog.
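A sketch of that function, using attoparsec’s parseOnly (the parseLog name is an assumption):

```haskell
parseLog :: Text -> Log
parseLog line =
  case parseOnly kernelLogP line of
    Right kernelLog -> Known kernelLog
    Left err        -> UnknownLog line err
```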
JSON
Of all the serialization formats ever created, JSON is one of the most ubiquitous nowadays, given that nearly every language supports encoding and decoding it, and it’s fairly compact, human-readable, and robust. It would suck to do all that log parsing just to dump the logs back out in their original format for some other program to pick back up and parse again.
To make the flow of information easier, JSON will be used to store the parsed log data. However, to separate the internal and external representations of log messages, a new set of models is created. By creating separate types, if the JSON keys on the exported data ever needed to be changed, we would not have to run through our application and rename the types; rather, the types should just be changed on the export, and a translation layer converts between the export format and the parsing format. Here the function to do that is toOutputLog.
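A sketch of the export model; the field names (which become the JSON keys under generic derivation) are assumptions:

```haskell
-- The external representation of a log, kept separate from the parser's model.
data OutputLog = OutputLog
  { machineName :: Text
  , eventTime   :: Maybe Double
  , cpu         :: Maybe CpuState
  , message     :: Text
  } deriving (Show, Generic)
```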
Haskell Writes Code for You
Thanks to type classes, all the boilerplate code to parse and unparse logs does not have to be written by hand. For those familiar with the circe library in Scala, this is called Automatic Derivation. In order to do this in Haskell, each type used in the export model must be able to derive Generic, including CpuState as seen earlier.
Once Generic is derived with the help of the DeriveGeneric GHC extension and the GHC.Generics module, GHC can write all of the JSON encoding functions for us simply by defining the type class instances for OutputLog.
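With Generic derived, the instances can be left empty and the default implementations are filled in; a sketch, assuming the aeson library (the usual home of ToJSON) is the one in use:

```haskell
import Data.Aeson (FromJSON, ToJSON)

instance ToJSON   OutputLog
instance FromJSON OutputLog

-- CpuState also derived Generic earlier, so it gets its instances the same way.
instance ToJSON   CpuState
instance FromJSON CpuState
```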
If you are ever dealing with converting to and from JSON, I highly encourage this methodology to remove the boilerplate, and the possible typos, of writing a JSON encoder and decoder manually. The conversion function between the internal and external data representations uses pattern matching against the types in OutputLog.
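A sketch of that translation layer, pattern matching on the internal Log constructors sketched earlier; how the fields map across is an assumption:

```haskell
toOutputLog :: Log -> OutputLog
toOutputLog (Known (KernelLog _ machine (EventTime t) _ event)) =
  case event of
    CpuEvent info    -> OutputLog machine (Just t) (Just (cpuState info)) T.empty
    GenericEvent msg -> OutputLog machine (Just t) Nothing msg
toOutputLog (UnknownLog raw err) =
  OutputLog T.empty Nothing Nothing (raw <> T.pack (' ' : err))
```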
All that is left is to write all of the JSON to a file.
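A minimal sketch of that step, appending one JSON-encoded log per line (the writeLog name is an assumption; this is the open-per-log approach that the next section improves on):

```haskell
import Data.Aeson (encode)
import qualified Data.ByteString.Lazy.Char8 as BL8

writeLog :: FilePath -> OutputLog -> IO ()
writeLog path out = BL8.appendFile path (encode out `BL8.snoc` '\n')
```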
Streaming Data
One of the problems with the previous solution was that the program repeatedly opened and closed files for every log. To solve this, all the logs destined for each file can be grouped together first, and then each group is written to a file. The downside of this approach is that all the logs need to be stored in memory before being written – a not so great scenario for large log files. Alternatively, a reactive approach to creating and writing log files can ensure minimal memory overhead, because not all logs need to be held in memory at once. Another solution could involve chunking the input stream to regulate memory consumption. All of these are great ideas, but we’ll go with grouping the logs first, to keep the solution simple.
Polymorphism
First, a function to group all the logs together is needed. Haskell has functions like this, such as group and groupBy, but they don’t completely group the data because they only group adjacent elements, a design that favors lazy evaluation of lists. Instead a new function is made, groupL.
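A sketch of groupL using Data.Map to do the bucketing; the exact signature in the repo may differ:

```haskell
import qualified Data.Map.Strict as Map

-- Group every element of a list by a key, regardless of its position in the list.
-- flip (++) keeps elements in their original order within each group.
groupL :: Ord k => (a -> k) -> [a] -> [(k, [a])]
groupL key xs =
  Map.toList (Map.fromListWith (flip (++)) [(key x, [x]) | x <- xs])
```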
groupL is a bit different from the functions used previously in that it doesn’t place any restrictions on the types of its arguments, other than that the function used to create the grouping identifier must produce something with a defined Ordering. These kinds of functions are called polymorphic, and it is preferable to write polymorphic functions where possible because they enable easier composition and help satisfy the DRY principle.
For example, groupL is not restricted to grouping only Log data.
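It groups plain strings by length just as happily as it groups logs:

```
ghci> groupL length ["to", "a", "be", "I"]
[(1,["a","I"]),(2,["to","be"])]
```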
Conduit
Now that the logs are grouped, a function is needed to write each group of logs to a file. For this, conduit is used. Although unnecessary for this problem, conduit has some cool syntax and functionality specifically for streaming data to files. Again, polymorphism is retained as much as possible to keep the input flexible.
Outline of sinkGroup:
- Create a conduit source from the list of logs (or anything else that has a ToJSON instance).
- Stringify the logs: convert each log to JSON as a bytestring, a very efficient data format, and append a newline to each string.
- Create a sink where all of the data is streamed to, which is a file accepting bytestring data.
- Tell conduit to pipe our data from a conduit source to a resource sink with runConduitRes.
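A sketch of sinkGroup along those lines, using the Conduit module’s yieldMany, mapC, and sinkFile (the combinators in the repo may differ):

```haskell
import Conduit (mapC, runConduitRes, sinkFile, yieldMany, (.|))
import Data.Aeson (ToJSON, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BL8

-- Stream any ToJSON values into a file, one JSON document per line.
sinkGroup :: ToJSON a => FilePath -> [a] -> IO ()
sinkGroup path logs =
  runConduitRes $
        yieldMany logs                                    -- source: the list of logs
     .| mapC (BL.toStrict . (`BL8.snoc` '\n') . encode)   -- JSON bytestring + newline
     .| sinkFile path                                     -- sink: the output file
```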
The .| syntax is similar to the pipe operator in bash, and its meaning is nearly the same.
The final piece of the puzzle, the main function used in part (I), is slightly modified to accept the new grouping.
sequence_ :: (Foldable t, Monad m) => t (m a) -> m () is used to force evaluation of all of the conduits.
T.pack :: String -> T.Text, from Data.Text, is used to convert the lazy String to the Text format used in the parser.
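A sketch of how the pieces could fit together in main; the input path, the grouping key (machine name), and the output-file naming are all assumptions:

```haskell
main :: IO ()
main = do
  contents <- readFile "kern.log"
  let logs   = map (toOutputLog . parseLog . T.pack) (lines contents)
      groups = groupL machineName logs
  -- one output file per machine; sequence_ forces every conduit to run
  sequence_ [ sinkGroup (T.unpack name ++ ".json") grp | (name, grp) <- groups ]
```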
This wraps up the creation of a simple kernel log parser.