I wrote a tokenizer/lexer (difference?) for Python in Haskell: this is my code on GitHub.
I already tested it out with some of CPython's standard library scripts, and it doesn't fail. However, it probably outputs non-conforming output. But I don't care about this right now.
What I'm looking for is a general (functional) style review. This is my first time using Parsec and Haskell for something relatively big, so I expect my functional style to be suboptimal.
Thanks.
Significant code snippets:
parseTokens :: LexemeParser [PositionedToken]
parseTokens = do
P.skipMany ignorable
tokens <- concat <$> P.many parseLogicalLine
P.eof
endIndents <- length . filter (>0) <$> getIndentLevels
dedents <- mapM position $ replicate endIndents Dedent
endMarker <- position EndMarker
return (tokens ++ dedents ++ [endMarker])
parseIndentation :: LexemeParser [PositionedToken]
parseIndentation = mapM position =<< go
where
go = do
curr <- length <$> P.many (P.char ' ')
prev <- getIndentLevel
case curr `compare` prev of
EQ -> return []
GT -> return [Indent] <* pushIndentLevel curr
LT -> do
levels <- getIndentLevels
when (not $ elem curr levels) (fail $ "indentation must be: " ++ show levels)
let (pop, levels') = span (> curr) levels
putIndentLevels levels'
return $ replicate (length pop) Dedent
parseLogicalLine :: LexemeParser [PositionedToken]
parseLogicalLine = do
P.skipMany ignorable
indentation <- parseIndentation
(indentation ++) <$> begin
where
begin = do
tokens <- P.many1 parsePositionedToken
let continue = (tokens ++) <$> begin
implicitJoin <- (> 0) <$> getOpenBraces
if implicitJoin then do
whitespace
parseComment <|> (P.optional backslash *> eol)
P.skipMany ignorable
whitespace
continue
else do
whitespace
explicitJoin <- P.optionMaybe (backslash *> eol)
case explicitJoin of
Nothing -> ((\t -> tokens ++ [t]) <$> position NewLine) <* P.optional eol
otherwise -> whitespace *> continue
(I was largely inspired by PureScript's lexer, though I probably ended up with something a lot worse.)