High-precision arithmetic for calculating N-gram probabilities
Dan Maftei
ninestraycats at gmail.com
Sun Oct 16 13:26:42 BST 2011
Thank you Wren and David. Here is what I've gleaned so far:
1. Standard practice is to use Doubles for probabilities. Is the main reason
that the small loss of precision isn't worth reclaiming with more complicated
types? Further, do we often sum probabilities in NLP? If so, Ratios are a bad
idea, due to the many gcd calls. And even if our only manipulation is
multiplication, GHC's Ratio re-normalizes (and hence calls gcd) after every
arithmetic operation, so the performance hit remains.
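For example (the probabilities here are made up for illustration), both
folds below compute the same product, one exactly and one approximately:

    import Data.List (foldl')
    import Data.Ratio ((%))

    -- GHC's Ratio reduces (calls gcd) after every operation, (*)
    -- included, to keep values in normal form.
    probsR :: [Rational]
    probsR = [1 % 3, 2 % 7, 5 % 11]

    main :: IO ()
    main = do
      print (foldl' (*) 1 probsR)  -- 10 % 231, exact
      print (foldl' (*) 1 (map fromRational probsR) :: Double)
        -- ~4.33e-2, approximate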
2. I'm unclear what log(1 + _) and (exp _ - 1) refer to, or what is
inaccurate about composing the standard log and exp functions naively. Same
goes for what a and b represent in logSum (== a + log(1 + exp(b - a))). If I
want to multiply probabilities, I use exp $ foldl' (+) 0 (map log probs),
which is equivalent to foldl' (*) 1 probs. Is there something inaccurate or
incorrect about that? NB: the probability of a large text underflows a
Double, since foldl' (+) 0 (map log probs) gives a log-probability in the
negative thousands, and exp of that is zero. However, the perplexity of said
text is computable: it just defers the exp until after we multiply the summed
logs by -1/N: exp $ negate (1 / n) * logSums.
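A sketch of that perplexity computation, with probs standing for the list
of n-gram probabilities of the text:

    import Data.List (foldl')

    -- The summed log probabilities stay an ordinary (large, negative)
    -- Double; exp is applied only after scaling by -1/N, so nothing
    -- underflows the way exp logSums itself would.
    perplexity :: [Double] -> Double
    perplexity probs = exp $ negate (1 / n) * logSums
      where
        n       = fromIntegral (length probs)
        logSums = foldl' (+) 0 (map log probs)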
3. I am still unsure how to neatly randomize output, given the imprecision of
Doubles (I have a working function, but it requires a guard which is only
necessary because of that imprecision). David gave "2 * abs(high - low)/(high
+ low) <= epsilon", but I can't figure out what high and low refer to, or
where that epsilon comes from. Isn't it machine-dependent? If not, I could
use the value Wikipedia lists for double precision, 1.11e-16.
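Here is a sketch of how I read David's test; taking high and low as the
larger and smaller of the two Doubles is my own guess, as is the epsilon:

    -- Two Doubles count as equal when their relative difference falls
    -- below epsilon. Caveat: if x + y is zero the division yields NaN,
    -- so a caller would need to handle that case separately.
    approxEqual :: Double -> Double -> Bool
    approxEqual x y = 2 * abs (high - low) / (high + low) <= epsilon
      where
        high    = max x y
        low     = min x y
        epsilon = 1.11e-16  -- 2^-53; machine-dependent in general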
Dan
> Message: 1
> Date: Sat, 15 Oct 2011 20:36:45 -0400
> From: wren ng thornton <wren at freegeek.org>
> Subject: Re: High-precision arithmetic for calculating N-gram
> probabilities
> To: nlp at projects.haskell.org
> Message-ID: <4E9A271D.6090505 at freegeek.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 10/13/11 12:34 PM, David Hall wrote:
> > Also, you'll probably want to
> > use log(double) and operate using + as times and logSum (== a + log(1
> > + exp(b-a))) as sum. This will avoid underflow.
>
> Actually, for (a + log(1 + exp(b - a))) you want to first ensure that a
> >= b. Otherwise you'd want to do (b + log(1 + exp(a - b))).
>
> Also, there are more accurate implementations of log(1+_) and (exp _-1)
> than the naive versions, which become extremely inaccurate for small
> arguments.
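To make that concrete: a sketch of such a logSum, using the accurate log1p
that base's Numeric module exports in recent GHC (the logfloat package
mentioned below ships its own portable version, if I read its docs right):

    import Numeric (log1p)  -- accurate for tiny x, unlike log (1 + x)

    -- Stable log (exp a + exp b): factor out the larger argument so
    -- that exp only ever sees a non-positive number, per wren's note.
    logSum :: Double -> Double -> Double
    logSum a b
      | a >= b    = a + log1p (exp (b - a))
      | otherwise = b + log1p (exp (a - b))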
>
>
> If you want to deal with probabilities in the log-domain, the logfloat
> package[1] will handle all these details for you. In addition, it
> defines things using newtypes so that you get type safety instead of
> mixing up which things are in the log-domain and which are in the
> normal-domain; and it provides a Num instance so that you can say what
> you mean instead of what the implementation is doing. In fact, the
> package was specifically designed for NLP tasks needing to do this sort
> of thing. While the primary goal is to maximize precision, the secondary
> goal is to maximize performance within those constraints; and the
> implementation has been tested thoroughly on these criteria.
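For reference, a minimal sketch of what using the package might look like,
going by the Data.Number.LogFloat docs (worth double-checking against the
current release):

    import Data.Number.LogFloat (LogFloat, logFloat, logFromLogFloat)

    -- logFloat moves each probability into the log domain; the Num
    -- instance then makes (*) perform log-domain addition underneath.
    textProb :: [Double] -> LogFloat
    textProb = product . map logFloat

    main :: IO ()
    main = print (logFromLogFloat (textProb [0.5, 0.25, 0.125]) :: Double)
      -- prints log 0.015625, the log of the product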
>
> The only caveat is that the performance tuning was done a while ago,
> back when GHC had horrible code generation for doubles, and so it uses
> -fvia-C, which has since been deprecated. I've yet to install GHC7 in
> order to re-tune for the new backends.
>
>
> [1] http://hackage.haskell.org/package/logfloat
>
> --
> Live well,
> ~wren
>
>
>
> ------------------------------
>
> Message: 2
> Date: Sat, 15 Oct 2011 20:40:42 -0400
> From: wren ng thornton <wren at freegeek.org>
> Subject: Re: High-precision arithmetic for calculating N-gram
> probabilities
> To: nlp at projects.haskell.org
> Message-ID: <4E9A280A.10609 at freegeek.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 10/13/11 2:40 PM, Dan Maftei wrote:
> > Yeah, log sums are actually a necessity when calculating perplexity. :)
> >
> > If I ever get around to profiling Rationals vs. Doubles I'll let people
> > know
> > what I found. But upon reflection, I see that they shouldn't make a
> > significant difference in performance.
>
> The big performance problem with Rationals is that they're stored in
> normal form. That means the implementation has to run gcd every time you
> manipulate them. If you're really interested in performance, then you'd
> want to try working with unnormalized ratios rather than using the
> Rational type. Of course, then you have to be careful about not letting
> the numbers get too large, otherwise the cost of working with full
> Integers will be the new performance sinkhole.
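A minimal sketch of such an unnormalized ratio (the names are hypothetical);
gcd runs only when normalize is called explicitly:

    import Data.Ratio ((%))

    -- A numerator/denominator pair that is never reduced automatically.
    data URatio = URatio !Integer !Integer

    -- Multiplication with no gcd call; the components simply grow.
    umul :: URatio -> URatio -> URatio
    umul (URatio n1 d1) (URatio n2 d2) = URatio (n1 * n2) (d1 * d2)

    -- Reduce to normal form exactly once, when the caller decides the
    -- Integers have grown large enough to be worth it.
    normalize :: URatio -> Rational
    normalize (URatio n d) = n % d  -- (%) runs gcd here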
>
> --
> Live well,
> ~wren