2013-12-10 06:27:49 UTC
Ron

Suppose you want to find the most frequently occurring element within a data stream. A simple brute-force algorithm would require counting the occurrences of each element, then calculating which one had the highest count. This could cause some non-trivial memory problems down the line if the data stream is large. What if your data stream is infinite? It is not possible to do a complete pass over an infinite stream, so it is also impossible to find "the most occurring element" in an infinite data stream with a brute-force algorithm.
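For a finite stream, the brute-force approach might look something like this sketch (the function name is mine, not from this post):

```haskell
import qualified Data.Map.Lazy as M
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Brute force: count every element, then take the highest count.
-- Memory grows with the number of distinct elements, and it needs
-- a complete pass, so it cannot work on an infinite stream.
mostCommon :: Ord a => [a] -> a
mostCommon = fst . maximumBy (comparing snd) . M.toList
           . M.fromListWith (+) . map (\x -> (x, 1 :: Integer))
```

Space-saving exists precisely to avoid this unbounded counting map.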

Since it is not possible to find the "most occurring element" in an infinite data stream, the best we can do is approximate it. Luckily there is a wonderful algorithm called space-saving that can approximate the top-k elements in an infinite stream.

Let's run through the algorithm and test it out on some data. I chose Data.Map.Lazy for this example for its ubiquity and ease of use. If you are looking for the best performance, then you would likely need to use a different data structure. Start by naming the module and importing Data.Map.Lazy and a few helper functions from Data.Char and Data.List.

> module Main where
> import qualified Data.Map.Lazy as M
> import Data.Char (toLower)
> import Data.List (minimumBy, maximumBy)

Now we can define the type for our spaceSaving function. The function first requires an Int called k, which defines how many elements the algorithm will keep track of. Next it requires a stream of data of the polymorphic a type; make sure your a has an instance of Ord. Finally the algorithm outputs a Map of data, containing the top-k elements, each paired with a score.

> spaceSaving :: (Ord a) 
>             => Int              -- The number of results to keep
>             -> [a]              -- The stream 
>             -> M.Map a Integer   -- A map of the top elements as the key, and their count

Our entry into the space-saving function requires a k value and a stream of data. Upon being set into motion, the algorithm begins analyzing the data stream. It keeps track of at most k elements and stores them in a map. If there are fewer than k elements in the map, it adds the current element to the map with a score of 1. If it sees a duplicate of an element that already exists in the map, it increments the score for that specific element. It continues doing this until the map has k elements in it. Once the map is full, duplicate elements are updated as before; however, when an element not currently in the map is detected, the "updateLowestKey" function runs instead.

> spaceSaving k = spaceSave M.empty 
>   where
>     spaceSave m [] = m
>     spaceSave m (s:ss) 
>       | M.member s m = update $ M.adjust (+ 1) s m 
>       | M.size m < k = update $ M.insert s 1 m 
>       | otherwise    = update $ updateLowestKey m s 
>      where
>       update f = spaceSave f ss

The function updateLowestKey takes the current lazy map and the current element, then returns an updated map.

> updateLowestKey :: (Ord k) 
>                  => M.Map k Integer -- Input map
>                  -> k              -- Input Element
>                  -> M.Map k Integer -- Updated map 

The updateLowestKey function finds the element with the current lowest score in the map. It then replaces that lowest-scored element with the element passed to updateLowestKey, giving the new element the evicted score incremented by 1 before passing the map back to the spaceSave loop.

> updateLowestKey m e = M.insert e (lowestValue + 1) lowestKeyRemoved
>   where
>     (lowestKey, lowestValue) = minimum' m
>     lowestKeyRemoved = M.delete lowestKey m

There are two utility functions required for this algorithm to work correctly with Data.Map.Lazy. We need functions to find the entries with the minimum and maximum values in a map, each returned as a key/value pair. I found this easiest by converting the map to a list, then using compare to find the minimum or maximum value. If you are using a large value for k, then you might run into a bottleneck with these utility functions. Such a bottleneck would likely require ditching Data.Map.Lazy and either choosing a new data structure or rolling your own.

> minimum' :: (Ord a) => M.Map k a -> (k, a)
> minimum' = minimumBy comp . M.toList
>
> maximum' :: (Ord a) => M.Map k a -> (k, a)
> maximum' = maximumBy comp . M.toList
>
> comp :: (Ord a) => (k, a) -> (k, a) -> Ordering
> comp (_, x) (_, y) = compare x y

Our main function uses getContents to read lazily from stdin, then splits the input into words while making everything lowercase. We make everything lowercase because spaceSaving is case-sensitive. Depending on your input data, you will want to change how pFile is built. After the calculation is complete, we print the most frequently occurring element to stdout.

> main :: IO ()
> main = do
>   contents <- getContents
>   let pFile = words . map toLower $ contents
>   print . maximum' $ spaceSaving 100 pFile

Now let's test it! You can compile the code like this:

[ron ss]# ghc -O2 -o spacesaving Spacesaving.hs

And we can test the code by piping in some text. In this scenario I will use the "Chinese Classics".

[ron ss]# curl latermuse.com/classic.txt | ./spacesaving

The program should spit this out:

fromList [("the",1172)]

This means the word "the" is probably the most frequently occurring word in our text. If you want to see the entire map, you can modify the print line in the main function and remove the maximum' function. Experimenting with k can produce different results, and it is worthwhile to test the algorithm using different values for k.
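A sketch of that change (spaceSaving and updateLowestKey are repeated, slightly condensed, so the snippet compiles on its own):

```haskell
import qualified Data.Map.Lazy as M
import Data.Char (toLower)
import Data.List (minimumBy)

spaceSaving :: Ord a => Int -> [a] -> M.Map a Integer
spaceSaving k = spaceSave M.empty
  where
    spaceSave m [] = m
    spaceSave m (s:ss)
      | M.member s m = spaceSave (M.adjust (+ 1) s m) ss
      | M.size m < k = spaceSave (M.insert s 1 m) ss
      | otherwise    = spaceSave (updateLowestKey m s) ss

updateLowestKey :: Ord k => M.Map k Integer -> k -> M.Map k Integer
updateLowestKey m e = M.insert e (lowestValue + 1) (M.delete lowestKey m)
  where
    (lowestKey, lowestValue) =
      minimumBy (\(_, x) (_, y) -> compare x y) (M.toList m)

-- Show the whole top-k map instead of only its maximum entry.
fullMap :: String -> M.Map String Integer
fullMap = spaceSaving 100 . words . map toLower
```

With this, the final line of main becomes print . fullMap $ contents.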

For anyone interested in working more with this algorithm, there is a way to record and calculate the error ratio of chosen elements. You could also implement the algorithm in a state monad, or find a way to query the data while it is still being computed. You might also consider writing an IO version of spaceSaving that writes your current maximum element to stdout or into a file. Enjoy!
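As a starting point for the error ratio idea, one common space-saving refinement (a sketch under my own naming, not code from this post) stores alongside each count the count it inherited when it evicted another element; that stored value bounds how far the count can overestimate the true frequency:

```haskell
import qualified Data.Map.Lazy as M
import Data.List (minimumBy)

-- Each monitored element maps to (count, error), where error is an
-- upper bound on how much the count overestimates the true frequency.
spaceSavingErr :: Ord a => Int -> [a] -> M.Map a (Integer, Integer)
spaceSavingErr k = foldl step M.empty
  where
    step m s
      | M.member s m = M.adjust (\(c, e) -> (c + 1, e)) s m
      | M.size m < k = M.insert s (1, 0) m
      | otherwise    =
          let (lowKey, (lowCount, _)) =
                minimumBy (\(_, (c1, _)) (_, (c2, _)) -> compare c1 c2)
                          (M.toList m)
          -- The newcomer inherits the evicted count; that inherited
          -- amount is exactly its possible overestimation.
          in M.insert s (lowCount + 1, lowCount) (M.delete lowKey m)
```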

Grab this code on Github.

View Comments on Reddit.

Thanks to winterkoninkje for awesome refactoring

2013-11-20 14:59:40 UTC
Ron

After demonstrating a simple script that extracts the total number of unique IPs in an Apache server log, I decided that today I will make something a bit more complicated to show more of Haskell's power. The following literate Haskell program is an IP microscope. This microscope will zoom in on the unique IPs in an Apache log and show exactly which pages each IP has visited and how many visits the IP made to each page.

Let's get started by naming our module and importing our libraries.

> module Main where
> import qualified Data.Map.Lazy as M
> import Control.Arrow

Parsed is a data type that will hold information about the lines being parsed by our program. '_visitorIp' is the IP address of a user in String form, and '_visitedPage' is the page that the IP address visited.

> data Parsed = Parsed {
>     _visitorIp :: String,
>     _visitedPage :: String
>   } deriving (Show, Eq, Ord)

I'll admit this parser is pretty weak and could fail in many ways, but it does the job for this script. If you wanted to make this parser more robust, you would definitely want to avoid using 'head' and (!!) because they can both fail with exceptions (exceptions can cause some major headaches if you aren't used to them). The parser works by first grabbing the 'head' of the words in the string passed to it (which should be the IP address), then grabbing the word at index 10 (which should be the web URL). The parser then packs the IP and URL into the "Parsed" data type. Remember, this parser is weaksauce and your mileage may vary if you choose to use it.

> parse :: String -> Parsed
> parse x = Parsed (parseIP x) (parsePage x)
>   where
>     parseIP = head . words 
>     parsePage = (!! 10) . words
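To illustrate, here is the parser run on an invented line in Apache's combined log format (the definitions are repeated so the snippet stands alone). Note that word index 10 lands on the quoted referer field, which is what shows up as the visited page in the output below:

```haskell
data Parsed = Parsed
  { _visitorIp   :: String
  , _visitedPage :: String
  } deriving (Show, Eq, Ord)

parse :: String -> Parsed
parse x = Parsed (parseIP x) (parsePage x)
  where
    parseIP   = head . words
    parsePage = (!! 10) . words

-- An invented combined-log-format line for illustration:
sampleLine :: String
sampleLine = "1.2.3.4 - - [10/Oct/2013:13:55:36 -0700] \"GET /index.html HTTP/1.1\" 200 2326 \"http://www.latermuse.com/\" \"Mozilla/5.0\""
```

Here parse sampleLine yields Parsed "1.2.3.4" "\"http://www.latermuse.com/\"".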

This 'countParsed' function is pretty tricky. It zips a list of "Parsed" data with an infinite list of 1s, then lazily builds a 'Map' (from Data.Map.Lazy) from the zipped list. While building the 'Map', it will automatically add up the counts of all the duplicate 'Parsed' entries before spitting out the 'Map'. You will end up with a Map from the Parsed data to an Integer count of how many times the Parsed data was duplicated across your entire input.

> countParsed :: [Parsed] -> M.Map Parsed Integer
> countParsed = M.fromListWith (+) . flip zip [1,1..]
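The same counting idiom works on any Ord type. As a stand-alone illustration (my own example, not part of the script):

```haskell
import qualified Data.Map.Lazy as M

-- Pair every element with 1, then let fromListWith sum the duplicates.
countOccurrences :: Ord a => [a] -> M.Map a Integer
countOccurrences = M.fromListWith (+) . flip zip [1,1..]
```

For instance, countOccurrences "abca" evaluates to fromList [('a',2),('b',1),('c',1)].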

To prevent myself from becoming confused, I wrapped up the previous functions into a 'calculate' function that composes them together and finally converts the data from Map back into a normal list. This is probably not the most optimized way to go about this, but it is pretty readable so I opted to keep it. Sometimes readability is more important than optimization, especially with a script that already runs within a reasonable amount of time for the given data set.

> calculate :: String -> [(Parsed, Integer)]
> calculate = M.toList . countParsed . map parse . lines

Using Control.Arrow's fanout function (&&&), I further caress my data into a type that is more suitable for my needs. 'visitorIp' extracts the IP address from the Parsed data type, and 'pageViews' pairs each website URL with its total page views in a singleton list of tuples. This might feel a bit convoluted at the moment, but it will make for some easier function composition down the line. The 'Control.Lens' library would likely provide a more concise version of this function, but I chose not to use Lens for this script.

> formatData :: (Parsed, Integer) -> (String, [(String, Integer)])
> formatData = visitorIp &&& pageViews
 
> visitorIp :: (Parsed, Integer) -> String
> visitorIp (parsed,_) = _visitorIp parsed
 
> pageViews :: (Parsed, Integer) -> [(String, Integer)]
> pageViews (parsed, num) = [(_visitedPage parsed, num)]

This 'converge' function is the climax of our script. It successfully appends all of our page view data together and consolidates it under each parent IP. I am again abusing the Data.Map library to correctly move all the data into their assigned seats before preparing for output.

> converge :: [(String, [a])] -> M.Map String [a]
> converge = M.fromListWith (++)

Pretty printing is pretty easy for this kind of data. You just concat a bunch of stuff together and use 'putStrLn' to output it.

> prettyPrint :: (String, [(String, Integer)]) -> IO ()
> prettyPrint (ip, pages) = putStrLn $ 
>     concat [ ip, " has visited these pages:\n"
>            , concatMap prettyPages pages
>            ]
>   where
>     prettyPages (page, views) = 
>           concat [ "    ["
>                  , show views
>                  , pluralView views
>                  , page
>                  , "\n"
>                  ]
>     pluralView 1 = " View] "
>     pluralView _ = " Views] "

We are here at the end of our Haskell script. In the 'main' function, getContents will grab all the data sent into the program via stdin, and 'calculations' will do all of the pure calculating and data manipulation. The last line of our 'main' function will wrangle and prettyPrint our calculations. This function is pretty self-explanatory if you have made it this far into the blog post.

> main :: IO ()
> main = do
>   contents <- getContents
>   let calculations = M.toList . converge . map formatData . calculate $ contents
>   mapM_ prettyPrint calculations

So what happens when we actually run this script? Let's give it a whirl. I'll run this script by grepping for "latermuse.com" in the access_log and piping it into the runhaskell binary, which will interpret my script. (Note: For publishing online, I have censored the last digits of each IP number and have replaced many results with a cool ellipsis.)

[latermuse httpd]# grep "latermuse.com" access_log | runhaskell microscope.hs
117.32.153.XXX has visited these pages:
    [2 Views] "http://www.latermuse.com/static/ron-simpleuniqueipsystemscript.html"
    [2 Views] "http://www.latermuse.com/"

12.69.21.XXX has visited these pages:
    [1 View] "http://www.latermuse.com/static/ron-usererrordataisnotintarformat.html"
    [3 Views] "http://www.latermuse.com/static/ron-simpleuniqueipsystemscript.html"
    [1 View] "http://www.latermuse.com/static/ron-shiftjisconversionlibraryforhaskell.html"

    ......

So there you have it. A slightly more complicated command-line tool written in Haskell.

You can get the full source code of this script by clicking here.

View comments on Reddit

Thanks /u/dave4420 for your code review!

2013-11-19 16:17:09 UTC
Ron

When writing system scripts, sometimes you need a bit more power than normal; why not use Haskell to help out? I wanted to count how many unique IP addresses have accessed latermuse.com according to my access_logs, so I wrote a quick Haskell program to help facilitate my needs. I will take you through my program and talk about how it works in detail.

Let's start by popping out our imports and naming our module Main.

> module Main where
> import qualified Data.Set as S
> import Control.Applicative

We know that each line in our input will start with an IP number, then be delimited with a space. This makes things easy. We can just run the 'words' function over the string and take the 'head' of that to extract the IP address.

> parseIP :: String -> String
> parseIP = head . words

Since 'getContents' lazily reads from stdin and continues reading until EOF, we can use 'calculate' to split the content from stdin into lines, then parse each line for the IP address. After grabbing the IP addresses, 'S.fromList' throws them into a Set. Each member of a Set must be unique, so grabbing the size of our set will give us the total number of unique IP addresses.

> calculate :: String -> Int
> calculate = S.size . S.fromList . map parseIP . lines
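A quick sanity check on invented input (the definitions are repeated so this compiles on its own):

```haskell
import qualified Data.Set as S

parseIP :: String -> String
parseIP = head . words

-- Count the distinct first words across all input lines.
calculate :: String -> Int
calculate = S.size . S.fromList . map parseIP . lines
```

For example, calculate "1.2.3.4 a\n1.2.3.4 b\n5.6.7.8 c" evaluates to 2, since only two distinct IPs appear.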

Using Control.Applicative's infix fmap operator (<$>), we can map 'calculate' directly over our input. We then use (=<<) to bind the output of 'calculate' into 'print', which pipes the final result of our program to stdout, displaying it in the terminal.

> main :: IO ()
> main = print =<< calculate <$> getContents

Now we just run some quick commands in the terminal, and we can quickly count how many unique IP addresses have been logged as visitors to our website.

[latermuse httpd]# ls
    access_log    access_log.3  error_log.1  error_log.4
    access_log.1  access_log.4  error_log.2  analyzelogs.hs
    access_log.2  error_log     error_log.3

[latermuse httpd]# wc -l access*
    303787 access_log
    754845 access_log.1
    750058 access_log.2
    810321 access_log.3
    837264 access_log.4
    3456275 total

[latermuse httpd]# grep "latermuse.com" access* | runhaskell analyzelogs.hs
209

So there you have it. According to my access_logs, 209 unique IP addresses have accessed latermuse.com.

You can grab the source code used in this post by clicking here.

View comments on Reddit

2013-11-18 11:01:44 UTC
Ron

This afternoon I was trying to install Control.Lens on a new machine when my internet died before cabal was able to finish working. When the internet came back online, running "cabal install lens" gave me an error.

Meow:~ ron$ cabal install lens
Resolving dependencies...
Failed to install MonadCatchIO-transformers-0.3.0.0
Failed to install generic-deriving-1.6.2
Failed to install reflection-1.3.2
cabal: Error: some packages failed to install:
MonadCatchIO-transformers-0.3.0.0 failed while unpacking the package. The
exception was:
user error (data is not in tar format)
generic-deriving-1.6.2 failed while unpacking the package. The exception was:
user error (data is not in tar format)
lens-3.10 depends on reflection-1.3.2 which failed to install.
reflection-1.3.2 failed while unpacking the package. The exception was:
user error (data is not in tar format)

When I saw the "user error (data is not in tar format)" exception, I figured the downloads for those libraries were corrupted when the internet died during the install. I did a quick scan of my ~/.cabal/config file and found the location for the downloaded binary tar files.

Meow:~ ron$ cat ~/.cabal/config | grep "packages"
--   ~/Library/Haskell for packages installed --user (the default)

I headed over to the ~/Library/Haskell folder and started poking around. I found the repo-cache folder, which contained all the tar.gz files for downloaded libraries. A quick cleanup of the broken libraries, and I was able to install Control.Lens correctly.

Meow:~ ron$ cd ~/Library/Haskell/repo-cache/hackage.haskell.org/
Meow:hackage.haskell.org ron$ rm -rf generic-deriving/
Meow:hackage.haskell.org ron$ rm -rf reflection/
Meow:hackage.haskell.org ron$ rm -rf MonadCatchIO-transformers/
Meow:hackage.haskell.org ron$ cabal install lens
Resolving dependencies...
Downloading MonadCatchIO-transformers-0.3.0.0...
2013-11-15 10:48:11 UTC
Ron

Working with ShiftJIS in Haskell is an exercise in patience. It took me a few hours to figure out the proper way to convert ShiftJIS into text that I could use in Haskell. After a bit of hacking, I figured out an easy way to do it by converting ShiftJIS ByteStrings into Data.Text's UTF8 implementation. This small ShiftJIS library for Haskell has functions for converting between ShiftJIS and UTF8, reading/writing ShiftJIS files, and downloading websites that are in the ShiftJIS encoding. You can get the library here.

2013-11-15 08:22:58 UTC
Ron

Hello. This is my new Haskell blog. I just finished writing the first version of my hblog software. Hblog is a minimalist blogging platform written in Haskell.

With hblog, you just write your posts using markdown and save them in a flat text file. You then run the software and it compiles everything into a beautifully minimalistic blog format, complete with RSS feed.

I will be adding some stuff to it as I see fit in the next few weeks. Please check it out and send any pull requests if you are interested.
