fullstop 0.1 - ridiculously simple sentence segmentation in Haskell
Eric Kow
eric.kow at gmail.com
Wed Mar 3 15:59:24 EST 2010
Dear Haskell NLP people,
I'd like to announce a new sentence segmentation library I've uploaded to
Hackage: fullstop.
In lieu of a description, I present to you a set of test cases that
currently pass:
> testSuite =
>   testGroup "NLP.FullStop"
>     [ testGroup "basic sanity checking"
>         [ testProperty "concat (segment s) == id s, modulo whitespace" prop_segment_concat
>         ]
>     , testGroup "segmentation"
>         [ testCaseSegments "simple" ["Foo.", "Bar."] "Foo. Bar."
>         , testCaseSegments "condense" ["What?!", "Yeah"] "What?! Yeah"
>         , testCaseSegments "URLs" ["Check out http://www.example.com.", "OK?"]
>                                   "Check out http://www.example.com. OK?"
>         , testCaseNoSplit "titles" "Mr. Doe, Mrs. Durand and Dr. Singh"
>         , testCaseNoSplit "initials" "E. Y. Kow"
>         , testCaseNoSplit "numbers" "version 2.3.99.2"
>         ]
>     ]
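The helpers used in the test suite (prop_segment_concat, testCaseSegments,
testCaseNoSplit) are not shown in this message. As a rough sketch of what
such checks could look like, here they are as plain Boolean functions using
only base. The names Segmenter, propSegmentConcat, segmentsTo and
doesNotSplit are hypothetical, the segmenter is passed in as a parameter
rather than imported, and "modulo whitespace" is read here as "after deleting
all whitespace", which is an assumption.

```haskell
import Data.Char (isSpace)

-- Hypothetical alias: any function with the shape of the library's segment.
type Segmenter = String -> [String]

-- One reading of "concat (segment s) == id s, modulo whitespace":
-- the segments, glued back together, contain the same non-space characters
-- in the same order as the input.
propSegmentConcat :: Segmenter -> String -> Bool
propSegmentConcat seg s = noSpace (concat (seg s)) == noSpace s
  where
    noSpace = filter (not . isSpace)

-- The input is expected to be cut into exactly the given segments.
segmentsTo :: Segmenter -> [String] -> String -> Bool
segmentsTo seg expected input = seg input == expected

-- The input is expected to come back as a single, unchanged segment.
doesNotSplit :: Segmenter -> String -> Bool
doesNotSplit seg input = seg input == [input]
```

With the real segment function in scope, segmentsTo segment ["Foo.", "Bar."]
"Foo. Bar." would correspond to the "simple" case above. Wrapping these in a
test framework is left out of the sketch.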
The library is extremely simple and stupid. I'm hoping that somebody here
will be sufficiently offended by it to upload something better in its place.
Here's the whole segmenter:
> import Data.Char (isDigit, isSpace)
> import Data.List (groupBy, isSuffixOf)
> import Data.List.Split
>
> segment = map (dropWhile isSpace) . squish . breakup
>
> breakup = split
>         . condense       -- "huh?!"
>         . dropFinalBlank -- strings that end with terminator
>         . keepDelimsR    -- we want to preserve terminators
>         $ oneOf stopPunctuation
>
> stopPunctuation = [ '.', '?', '!' ]
>
> squish = squishBy (\_ y -> not (startsWithSpace y))
>        . squishBy (\x _ -> looksLikeAnInitial x)
>        . squishBy (\x _ -> any (`isSuffixOf` x) titles)
>        . squishBy (\x y -> endsWithDigit x && startsWithDigit y)
>  where
>   looksLikeAnInitial [_,'.'] = True
>   looksLikeAnInitial _       = False
>   --
>   startsW f []    = False
>   startsW f (x:_) = f x
>   --
>   startsWithDigit = startsW isDigit
>   startsWithSpace = startsW isSpace
>   --
>   endsWithDigit xs =
>     case reverse xs of
>       ('.':x:_) -> isDigit x
>       _         -> False
>
> squishBy f = map concat . groupBy f
>
> titles :: [String]
> titles = [ "Mr.", "Mrs.", "Dr." ]
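To make the squishBy step a little more concrete, here is a small sketch
that needs only base. It copies squishBy, titles and two of the helpers from
the segmenter and applies a single squishBy pass to hand-written chunk lists.
The chunk lists are my assumption of what breakup yields for those inputs;
they are typed in by hand here, not computed.

```haskell
import Data.Char (isDigit)
import Data.List (groupBy, isSuffixOf)

-- Same definition as in the segmenter.
squishBy :: (String -> String -> Bool) -> [String] -> [String]
squishBy f = map concat . groupBy f

-- Helpers copied from the segmenter's where clause, lifted to top level.
startsWithDigit :: String -> Bool
startsWithDigit []    = False
startsWithDigit (x:_) = isDigit x

endsWithDigit :: String -> Bool
endsWithDigit xs =
  case reverse xs of
    ('.':x:_) -> isDigit x
    _         -> False

titles :: [String]
titles = [ "Mr.", "Mrs.", "Dr." ]

-- Assumed breakup output for "version 2.3.99.2" (hand-written).
versionChunks :: [String]
versionChunks = [ "version 2.", "3.", "99.", "2" ]

-- Chunks around a digit-dot-digit boundary are glued back together.
rejoinedVersion :: [String]
rejoinedVersion =
  squishBy (\x y -> endsWithDigit x && startsWithDigit y) versionChunks
-- ["version 2.3.99.2"]

-- Assumed breakup output for "Mr. Doe, Mrs. Durand and Dr. Singh".
titleChunks :: [String]
titleChunks = [ "Mr.", " Doe, Mrs.", " Durand and Dr.", " Singh" ]

-- A chunk ending in a title absorbs the chunks that follow it.
rejoinedTitles :: [String]
rejoinedTitles = squishBy (\x _ -> any (`isSuffixOf` x) titles) titleChunks
-- ["Mr. Doe, Mrs. Durand and Dr. Singh"]
```

Loading this in GHCi and evaluating rejoinedVersion or rejoinedTitles shows
the merged chunks. Note that Data.List.groupBy compares each candidate with
the first element of the current group, not with its immediate neighbour, so
in the title pass the x in the predicate is always the chunk that opened the
group. One consequence is that once a group opens with a title, every
following chunk is merged into it by that pass.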
Enjoy!
PS. This message has a secondary purpose, to remind everybody that this
mailing list exists and should be put to use ;-) We now have 15 Haskell NLP
packages on Hackage. I'm looking forward to somebody combining them
in clever ways to make something new and fun!
--
Eric Kow <http://www.nltg.brighton.ac.uk/home/Eric.Kow>
PGP Key ID: 08AC04F9