[Haddock] Help with parsing Haskell modules for documentation

Wed Apr 16 19:32:16 BST 2014

On Tue, Apr 15, 2014 at 3:47 AM, Mateusz Kowalczyk
<fuuzetsu at fuuzetsu.co.uk> wrote:

> On 14/04/14 19:53, Michael Pankov wrote:
> > Hello everyone,
> >
> > I'm working on an experimental tool that integrates with Git and tracks
> > updates to documentation as the source code is changing. It's early in
> > development and I'm not ready to show anything yet, but would like to ask
> > for some help instead.
> >
> > On the most basic level, I intend to notify the programmer in case they
> > change the source code and do not change the documentation comment of
> > top-level functions. I do understand that this will create a lot of false
> > positives, and it's quite limited, but that's the first step I want to
> > take.
> >
> > Then, I'm going to try to detect mismatches between a function's argument
> > list in the source and the one in its documentation, and notify about that.
> >
> > Parsing the module itself already proved to be difficult to do in a
> > sensible or moderately complete way. I tried to use the Language.Haskell.Exts
> > parser. But there are cases when you have multiple functions with the same
> > type signature, don't have any type signature at all, etc.
>
> We use the GHC API, although as far as I know, Language.Haskell.Exts is able
> to extract Haddock comments as well. I don't know how well it handles
> other cases (no signature for example).

Well, Language.Haskell.Exts is able to parse the module with comments:
there's parseFileWithComments (
http://hackage.haskell.org/package/haskell-src-exts-1.15.0/docs/Language-Haskell-Exts.html).
It returns a module AST and a list of comments.

But the problem is this: what comment exactly should be considered the
documentation comment? I mean, there can be bunches of comments with
newlines between, there also can be multiple functions with same type
signature (foo, bar :: Int -> Int). There may also be other corner cases.

It seems Haddock simply treats the comment immediately preceding a
declaration as that declaration's documentation. That is probably sensible;
I just have all these unknowns buzzing in my head and feel nearly
overwhelmed by the parts I may miss.

> We do not have to worry about
> any of that stuff as we use GHC itself. While there are many things
> wrong with Haddock being so attached to GHC, in return we get the
> ability to do things like ask for type signatures of everything and use
> that when generating documentation.
>
> I don't have much HSE experience but to me it seems that what HSE does
> and what Haddock needs aren't exactly lined up. All we care about on our
> end is that we can extract a lot of information about identifiers and
> documentation that is attached to them. HSE seems like it's intended
> for source manipulation rather than information extraction but again I'm
> not experienced with it so I can't say for sure.
>

Yes, I think you're right. As I wrote above, HSE creates an entire AST and a
separate list of comments, which is not exactly convenient for a project
like the one I intend to develop.

There's also an annotated AST in HSE (
http://hackage.haskell.org/package/haskell-src-exts-1.15.0/docs/Language-Haskell-Exts-Annotated.html).
It stores SrcSpanInfo in the node by default, and it's not quite
transparent to me how to store anything else (in my case, the corresponding
comment would be useful).
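
To make the "store something else in the node" idea concrete, here is a
minimal sketch. It assumes the annotated node types are Functors over their
annotation (as far as I can tell they are, but I have not checked every
type). Everything in it is a toy stand-in made up for illustration (Span,
Decl, Comment, commentBefore, attachComments), not the real HSE types:

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Toy stand-ins, NOT the real HSE types: a span is (first line, last line).
type Span = (Int, Int)

-- A miniature annotated declaration, parameterised over its annotation
-- in the same spirit as the annotated HSE AST is parameterised over `l`.
data Decl l
  = TypeSig l [String] String   -- annotation, declared names, type as text
  | FunBind l String            -- annotation, name of the bound function
  deriving (Show, Eq, Functor)

-- Hypothetical comment record: the span it occupies and its text.
data Comment = Comment Span String
  deriving (Show, Eq)

-- Find a comment that ends on the line right before the given span starts.
commentBefore :: [Comment] -> Span -> Maybe String
commentBefore cs (start, _) =
  case [txt | Comment (_, end) txt <- cs, end + 1 == start] of
    (txt : _) -> Just txt
    []        -> Nothing

-- Replace every annotation by the pair (original span, attached comment).
attachComments :: [Comment] -> Decl Span -> Decl (Span, Maybe String)
attachComments cs = fmap (\sp -> (sp, commentBefore cs sp))
```

If the assumption holds, the same fmap over a real annotated module would
turn SrcSpanInfo annotations into (SrcSpanInfo, Maybe comment) pairs, given a
lookup function written against the real comment type.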

>
> I do think that we could use HSE to achieve some of what we're doing
> now: I'm actually told that there is a version of Haddock out there that
> uses HSE instead of GHC API directly although it's an internal project
> in some company so I did not actually witness it myself.
>
> > I started to look into Haddock's source code to see how it handles this
> > stuff, but it's pretty hard for me to even find the place. To me, it seems
> > like there should be a map of entities to their comments.
>
> You're correct. We ask GHC to do all the heavy lifting with regards to
> renaming, type-checking and attaching comments.
>
> > Maybe someone could point me to the right source files and functions?
>
> Hm, it's rather spread out so it's difficult to point to the exact
> location. More or less how it works is that we parse the flags passed to
> Haddock, set any GHC flags according to that and ask GHC to rename and
> type-check things for us. We then get TypecheckedModule (this is a GHC
> API type) out of it which we further process. Amongst many things,
> TypecheckedModule contains list of all declarations &c. All these have a
> potential Haddock string attached to them.

Do I understand correctly that GHC matches the Haddock documentation to the
names by itself?

Because in the case of HSE you have to bind the comments to names afterwards.
And the only sensible way to do that seems to be to search for comments
whose source spans end just before the source span of the entity we're
interested in (say, a function). But in HSE a function is not represented
by a single node. There's the type signature, there are the equations, etc.,
and when I looked at the documentation I got the impression that there are
several possible ways a function can be represented in the HSE syntax tree.
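
A rough sketch of the binding step I have in mind, under loud assumptions:
the Node and Comment types and the functions firstLines and docsFor are all
hypothetical simplifications, not HSE's API. The idea is to collect every
node that mentions a name (signature or equation), take the earliest line on
which it appears, and attach the comment that ends on the line directly
above. A signature declaring several names shares its comment among them:

```haskell
import qualified Data.Map.Strict as Map

-- Toy model of the pieces a "function" is scattered over; NOT HSE types.
data Node
  = Sig Int [String]      -- start line, names declared by the signature
  | Equation Int String   -- start line, name being defined
  deriving (Show, Eq)

-- Hypothetical comment: (line on which it ends, text).
type Comment = (Int, String)

-- Earliest line on which each name appears, in a signature or an equation.
firstLines :: [Node] -> Map.Map String Int
firstLines nodes = Map.fromListWith min (concatMap entries nodes)
  where
    entries (Sig line names)    = [(n, line) | n <- names]
    entries (Equation line n)   = [(n, line)]

-- For each name, the comment (if any) ending directly above its first line.
docsFor :: [Comment] -> [Node] -> Map.Map String (Maybe String)
docsFor comments nodes = Map.map lookupDoc (firstLines nodes)
  where
    byEndLine      = Map.fromList comments
    lookupDoc line = Map.lookup (line - 1) byEndLine
```

This deliberately ignores blank lines between comment and declaration,
nested declarations, and the difference between documentation and ordinary
comments, which are exactly the corner cases I am unsure about.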

Is it the same with GHC?

> What we do is simply take
> these declarations, parse a comment and create various maps from Name
> (GHC type) to ‘Doc a’ (Haddock type). We store this and more information
> in a file for future invocations (this is what the .haddock files are).
>
> I suppose you should be looking how we work with the GHC API output to
> achieve these interface files. You should be looking at close to
> everything under the Interface directory as well as how we invoke the
> functions inside of it. The createInterface function in Create.hs
> might be a fair starting point even though it's not exactly the smallest
> function.
>
> A small usage of GHC API is at [1], perhaps it will help you to get
> started, perhaps it won't. It does show how to go from a filename to a
> TypecheckedModule though.
>

I probably will take a look. I'm still hesitant to rely on GHC, though.

>
> > I also think that having a Haddock API would be great and I noticed it's in
> > quite an incomplete state now.
>
> Yes, there are plans for 2.15.x to improve the state of
> Haddock-as-a-library. Hopefully by GHC 7.10 things will be much nicer.
>
> > Using Haddock's API is not my primary interest, however. I could at
> > least try looking at Haddock's way of handling the ambiguities.
>
> I'm unsure what ambiguities you mean.

Well, the ones I stated above: ambiguities of the HSE AST. Maybe I'm
missing something. And maybe it's the other way with GHC.

Seems it would be great to have a lightweight parser which only gets names,
types, and comments in a nice map… But surely I won't be able to pull that
off. :)
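
Just to show the shape of what I mean, here is a deliberately naive,
line-based sketch. It is not a real parser: it only recognises unindented
"name :: type" lines, treats the "--" lines directly above (with no blank
line in between) as the documentation, and skips anything else. Entry,
parseSig and extract are names I made up for this sketch:

```haskell
import qualified Data.Map.Strict as Map
import Data.Char (isSpace)
import Data.List (isPrefixOf, stripPrefix)

-- What gets recorded per top-level name: its type as text, and its doc.
data Entry = Entry { entryType :: String, entryDoc :: Maybe String }
  deriving (Show, Eq)

trim :: String -> String
trim = f . f where f = reverse . dropWhile isSpace

splitCommas :: String -> [String]
splitCommas s = case break (== ',') s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitCommas rest

-- Split "foo, bar :: Int -> Int" into (["foo","bar"], "Int -> Int").
-- Indented lines, and lines whose "names" contain spaces, are rejected.
parseSig :: String -> Maybe ([String], String)
parseSig line = case line of
  (c : _) | not (isSpace c) -> go "" line
  _                         -> Nothing
  where
    go _ [] = Nothing
    go acc rest@(c : cs)
      | " :: " `isPrefixOf` rest =
          let ns = map trim (splitCommas (reverse acc))
          in if all validName ns then Just (ns, trim (drop 4 rest)) else Nothing
      | otherwise = go (c : acc) cs
    validName n = not (null n) && not (any isSpace n)

-- Walk the lines once, accumulating "--" lines; a blank line or any other
-- non-signature line discards the pending comment.
extract :: String -> Map.Map String Entry
extract = go [] . lines
  where
    go _ [] = Map.empty
    go pending (l : ls)
      | Just txt <- stripPrefix "--" l = go (pending ++ [cleanComment txt]) ls
      | Just (ns, ty) <- parseSig l =
          Map.union (Map.fromList [(n, Entry ty (doc pending)) | n <- ns]) (go [] ls)
      | otherwise = go [] ls
    doc [] = Nothing
    doc cs = Just (unwords cs)
    cleanComment = trim . dropBar . trim
    dropBar ('|' : r) = r
    dropBar r         = r
```

Functions without a signature, multi-line signatures, block comments and
"-- ^" style comments are all out of scope here, so this is far from the
"moderately complete" parser I would actually need.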

> Any source code gets parsed by GHC
> itself so if your code itself is not ambiguous then the information we
> get back isn't either. Going the other way, String -> actual identifier,
> we first ask GHC to parse the identifier (makes sure it's valid) and
> then we ask it to give us things it knows about in the current
> environment with that name and then we make a best guess which one is
> meant. See [2] for an example when GHC folk changed something up and our
> guess was no longer correct. Also see bugfix commits for the mentioned
> tickets to actually see the code we use to decide this.
>
> > Thanks,
>
> Sorry for not being much help. I think your project has a potential to
> be quite useful.
>

Thanks for the info and links; I'll take a look.

>
> [1]: https://ghc.haskell.org/trac/ghc/ticket/8945
> [2]:
>
> http://stackoverflow.com/questions/17912567/haddock-link-to-functions-in-non-imported-modules
>
> --
> Mateusz K.
>
> _______________________________________________
> Haddock mailing list
> Haddock at projects.haskell.org
> http://projects.haskell.org/cgi-bin/mailman/listinfo/haddock
>