Documentation tooling: the open source way


Ankit Gadgil speaking at DevRelCon London 2019

Ankit Gadgil works for Red Hat by day and as a volunteer in the Mozilla community in his spare time.

In this talk he shares how Red Hat and Mozilla tool their documentation and automate it to make the lives of writers easier.


Ankit: Thank you everybody. Let’s look at this picture. This is how I see documentation everywhere, mostly. Nowadays, we are giving importance to documentation but most of the products that we use or want to use have been this way for… I don’t know how much time. So let’s talk about documentation tooling.

I think documentation is a word that needs more love from all of us and there is a reason why. It’s because your product needs documentation. If your product is aimed toward developers and users that want to use it, they are lazy people. If they are going to use your product, they need the information of how to use it. If you have a user that is a developer, they’re not going to read your documentation if it is crappy because they’ll just use another product that has good documentation. And that’s what I’m going to talk about.

Developers prefer products with good documentation. The reason is that it allows them to integrate the product more easily, as they have clear steps to follow. If your product does not have good documentation, they might just move on. It clearly has an impact on adoption. Look at your documentation as your sales pitch because your users are going to look at it and you can tell them anything about your product that you would want them to know.

It also reduces customer support tickets. This is really relevant, specifically in terms of Red Hat. The reason is, look at your documentation as a field where your users can easily find FAQs or good steps to follow for a specific problem they’re facing. This will reduce the number of tickets. At a random company, say a ticket costs around $100 to fix; multiply that by the number of incoming tickets that could be solved through documentation. You clearly have a financial impact there. This will also help you engage more with your customer and your user. Your user will feel more connected to the product. They will always say, “This product has amazing documentation.”
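The arithmetic behind that claim can be sketched in a few lines of Python. The $100 figure is the speaker's example; the ticket count below is a made-up number used only to show the multiplication.

```python
def support_savings(tickets_deflected: int, cost_per_ticket: float = 100.0) -> float:
    """Estimated savings when documentation answers a question before a ticket is filed."""
    return tickets_deflected * cost_per_ticket


# Hypothetical volume: 250 tickets a month that good documentation could have answered.
monthly_savings = support_savings(250)
print(f"Estimated monthly savings: ${monthly_savings:,.2f}")
```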

Let’s look at how two different open source organizations do documentation and the tooling of it. So let’s look at the Red Hat way. Red Hat actually invests heavily in documentation because it is the center of its customer experience and engagement. And to do this, they use this website. Red Hat’s primary model is a subscription model, and for its users to use its products efficiently, they need proper documentation. Here is where the documentation tooling part comes in. Red Hat also invests a lot of energy in documentation tooling because they want their documentation to be written by their documentation writers and instantly be seen on

The whole automation part is really important and has to be fast. The product that makes this happen is an internal product. There is also an open source version that we’re trying to make available. It’s called Pantheon. What it basically does… Why am I calling it Pantheon I? You’ll come to know in a few slides. It’s basically a docs automation tooling platform. It’s based on Titles, Pages and Books. What I mean by Titles, Pages and Books is that the way technical writers write documentation for Red Hat today is through clear single-page documentation, classified under Titles and Books. The product is built in Python and Angular with MongoDB as its backend, and also uses automated jobs for AsciiDoc building. Writers at Red Hat write the documentation in AsciiDoc markup, and this particular tool, Pantheon, helps build this AsciiDoc into various exportable formats such as PDF, EPUB, HTML and others.

This is the pipeline the whole documentation goes through. Right now, all of the documentation lives inside the internal GitLab for Red Hat. It’s pushed into Pantheon by its documentation writer. When I say “pushed,” the content is not pushed; the metadata is created in Pantheon and the writers just tell Pantheon where to pull all this information from. Now, just forget that there is GitLab. Replace that with content repositories and you have a system where you have metadata sitting inside Pantheon and it can easily pull content from any repository. Pantheon then automates the AsciiDoc build process in Jenkins, stores all the metadata in MongoDB, or your favorite system, sends out artifacts like HTML to AWS and then sends them on to the presentation layer. I have Drupal there just to show that there is a presentation layer out there that is not doing all of this stuff, but just consuming the HTML artifacts.
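A minimal sketch of that pipeline shape, assuming the tooling stores only metadata and pulls content on demand. This is not Pantheon's code: `TitleMetadata`, `run_pipeline`, the example URL and the stand-in steps are all hypothetical, and the real Git, Jenkins, MongoDB and AWS integrations are replaced with plain Python callables so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class TitleMetadata:
    """Metadata kept by the tooling; the content itself stays in the repository."""
    title: str
    repo_url: str      # hypothetical repository location
    source_path: str   # hypothetical path to the AsciiDoc entry file
    formats: list = field(default_factory=lambda: ["html", "pdf", "epub"])


def run_pipeline(meta, fetch, build, store_metadata, upload):
    """Pull the content, build every format, record the metadata, ship the artifacts."""
    source = fetch(meta.repo_url, meta.source_path)
    artifacts = {fmt: build(source, fmt) for fmt in meta.formats}
    store_metadata(meta)
    for fmt, artifact in artifacts.items():
        upload(f"{meta.title}.{fmt}", artifact)
    return artifacts


# Stand-in steps so the sketch runs without any external service.
stored, uploaded = [], {}
meta = TitleMetadata("install-guide", "https://example.com/docs.git", "master.adoc")
result = run_pipeline(
    meta,
    fetch=lambda url, path: f"= Install Guide (pulled from {url}, {path})",
    build=lambda source, fmt: f"[{fmt}] {source}",
    store_metadata=stored.append,
    upload=uploaded.__setitem__,
)
print(sorted(uploaded))
```

Passing the steps in as callables mirrors the point made in the talk: swap GitLab for any content repository, or MongoDB for your favorite store, and the pipeline itself does not change.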

The next generation of Pantheon, we’re calling it Pantheon II. The logo has changed, obviously. It’s because it’s based on a different type of documentation. It’s based on modules and assemblies, and we’ll come back to modules and assemblies in the middle of this talk. It’s built on Apache Sling, React and MongoDB. Why did we choose Apache Sling? It has a lot of out-of-the-box functionality that we needed for content management, and it lets the system be customizable enough. And also, it houses content. The best part about Pantheon II right now is that it builds previews instantly as the content is being pushed to the documentation tooling system. And then, documentation writers can also use this system to manage the content as well. All of this is open source by the way. You can go to GitHub and have a look at these repositories.

So the architecture of the second system is more like this: these are the documents. It brings in all the documents, compiles them in real time and sends them out to the management system. The delivery system then asks the management system whether something has been updated, and if so it just pulls it. And also, at the same time, there is another trigger that goes to the third system that we have, and it sends out information like, “We need to index this documentation because now it’s published.” Look at it as a system where writers just write documentation and that’s it, and then the program managers tell the management system if they want to publish a particular document. When they hit that button, the management system works with the delivery system to put it out onto the website, and also sends a trigger to search saying, “Please index this.”

Another beautiful thing here is the management system can also un-publish a documentation. When they hit the un-publish button, the trigger works in a different way. It tells the delivery system that it no longer needs to be displayed there. And then, it also tells search that they need to de-index it so the next time a user is searching for that same documentation, they will not land on that particular page, even in case of caching. Pantheon II is open source. It’s available on this particular website. It’s an ongoing effort right now and yes, contributions are welcome.
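The publish and un-publish triggers described above can be sketched as follows. The class and method names are hypothetical and are not taken from Pantheon II; the delivery system and search index are modelled as in-memory objects purely for illustration.

```python
class DeliverySystem:
    """Holds the pages that are currently visible on the website."""
    def __init__(self):
        self.pages = {}

    def show(self, doc_id, html):
        self.pages[doc_id] = html

    def hide(self, doc_id):
        self.pages.pop(doc_id, None)


class SearchIndex:
    """Tracks which documents can be found through search."""
    def __init__(self):
        self.indexed = set()

    def index(self, doc_id):
        self.indexed.add(doc_id)

    def deindex(self, doc_id):
        self.indexed.discard(doc_id)


class ManagementSystem:
    """Fans each publish or un-publish action out to delivery and to search."""
    def __init__(self, delivery, search):
        self.delivery = delivery
        self.search = search

    def publish(self, doc_id, html):
        self.delivery.show(doc_id, html)
        self.search.index(doc_id)

    def unpublish(self, doc_id):
        self.delivery.hide(doc_id)
        self.search.deindex(doc_id)


delivery, search = DeliverySystem(), SearchIndex()
management = ManagementSystem(delivery, search)
management.publish("install-guide", "<h1>Install guide</h1>")
published = "install-guide" in delivery.pages and "install-guide" in search.indexed
management.unpublish("install-guide")
print(published, delivery.pages, search.indexed)
```

Keeping both triggers inside one method is what guarantees that a page removed from the site is also removed from search results.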

Let’s talk about the Mozilla way. Let’s move towards a different open source system. Mozilla also creates documentation. It also has tooling behind it, but the users here are different. For Red Hat, they are paying customers. But for Mozilla, they are the community: the people who want to know about JavaScript, the people who write JavaScript, code in JavaScript or just want to use JavaScript. The site is called MDN, and the documentation tooling around it is done by a system called Kuma. This system is also open source. You can go to GitHub and have a look at it. I put this slide in because I like the slogan there. It’s “Resources for developers, by developers.” MDN is powered by Kuma. It’s built in Python, NodeJS and React. It has an amazing startup guide. When I was starting in open source contributions, I actually went to Kuma and started working on a few bugs, and I personally felt that it was a very clear startup guide, with a point-by-point walkthrough of how to get someone started, with best first bugs and everything.

This is the architecture for Kuma. What happens is a user submits some new content and Kuma stores it as revision content. Kuma then bleaches and filters the content and creates the bleached content. KumaScript is then run on the content and there’s something called KumaScript content that comes out. This architecture is a bit different from how Red Hat does it. The rendered content is divided into Body HTML, which is basically the body fragment, because there is clear chroming of headers and footers on every page and the content is just… Which is replaced using HTML. The quick links are something like… They’re like a ToC page where you can easily go from one link to another or to a particular section of a page. Quick links and ToC work similarly. The only difference is that quick links work in the sidebar. And then there is the summary text and HTML for SEO purposes. Kuma is also open for contributions. There’s the link to it. I’ll share the slides at the end of the session so that you can have a look. I personally felt Kuma was a very easy project to start contributing to. It had detailed information on best first bugs and they have great documentation available on readthedocs as well.
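A rough sketch of the stages described above, using only Python's standard library. This is not Kuma's code: the function names, the regular expressions and the simplified `{{name}}` placeholder syntax are all assumptions made for illustration, and a real system would use a proper HTML sanitizer rather than a regular expression.

```python
import re


def bleach_stage(revision_html):
    """Stand-in for sanitising: drop <script> elements from the submitted revision."""
    return re.sub(r"<script\b.*?</script>", "", revision_html, flags=re.S | re.I)


def macro_stage(bleached_html, macros):
    """Stand-in for macro rendering: replace {{name}} placeholders with macro output."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: macros.get(match.group(1), match.group(0)),
        bleached_html,
    )


def split_rendered(rendered_html):
    """Divide rendered content into the body fragment, ToC entries and summary text."""
    toc = re.findall(r"<h2[^>]*>(.*?)</h2>", rendered_html, flags=re.S)
    first_paragraph = re.search(r"<p[^>]*>(.*?)</p>", rendered_html, flags=re.S)
    summary_html = first_paragraph.group(1) if first_paragraph else ""
    summary_text = re.sub(r"<[^>]+>", "", summary_html)
    return {"body_html": rendered_html, "toc": toc, "summary_text": summary_text}


revision = "<h2>Syntax</h2><p>The {{name}} method joins arrays.</p><script>alert(1)</script>"
bleached = bleach_stage(revision)
rendered = macro_stage(bleached, {"name": "<code>concat()</code>"})
parts = split_rendered(rendered)
print(parts["toc"], parts["summary_text"])
```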

So, the next section of the talk. “What exactly is the future?” This is what I think, or what we have been working on, as the future of documentation. It’s called modular documentation. How is it different from the documentation that has been done up until now? I’m not generalizing here, but most of the documentation that I’ve seen has been pages and pages of technical documentation, with PDFs and images and code snippets and all that stuff put in. And this way, it’s very hard to actually tool it and automate it. A lot of organizations write their documentation in XML tags. Oh my God, I can’t imagine understanding those XML tags. Some organizations also use AsciiDoc, which is a bit more readable. And some of them are just using plain Google Docs, just text files, which is amazing. Let’s talk about modular documentation a bit.

Modular documentation is basically documentation which is based on modules. Each module is like a unit entity. It’s a self-contained system in itself. And also, it has its own value. Now, when you look at a module, it will always stand out by itself and always give out a specific value. For example, you have a login module. And now, this particular module will only tell you how a system should be logged into. Now you can, as a product, put in context into it. Saying that there is product one and there is product two. They both are going to use a single module but product one is going to send in the context related to product one, and product two is going to send in context related to product two. Now, the module in itself will transform into product one login page and product two login page.
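The login example can be sketched with Python's standard `string.Template`. The module text, the product names and the URLs are invented for illustration; the point is only that one module source plus two contexts yields two product-specific pages.

```python
from string import Template

# One reusable module. The $placeholders are filled in by each product's context.
LOGIN_MODULE = Template(
    "Logging in to $product\n"
    "1. Open $url in a browser.\n"
    "2. Enter your $product credentials and select Log in.\n"
)


def render_module(module, context):
    """Apply a product's context to a module; raises KeyError if a value is missing."""
    return module.substitute(context)


product_one_login = render_module(
    LOGIN_MODULE, {"product": "Product One", "url": "https://one.example.com"}
)
product_two_login = render_module(
    LOGIN_MODULE, {"product": "Product Two", "url": "https://two.example.com"}
)
print(product_one_login)
print(product_two_login)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a product that forgets to supply part of the context fails loudly instead of publishing a page with a stray placeholder.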

Now, here is where the reusability of the whole thing comes in. Right now, the modular documentation concept has three types of modules. The first is the concept module. A concept module is a module that explains a specific idea or concept. The types are basically a grouping of modules so that you don’t mix up different types of modules with each other. A procedure module is like a step-by-step explanation for a particular user. And a reference module is data a user might need but does not need to remember. But then again, you might wonder why all these modules should be individual. The reason is that you can pull every individual module’s content into a box, which could be called an assembly. And then, use these assemblies to basically tool your own documentation.

Here we can see that we are using Module A, Module B, Module C and all the related information to create Assembly One. A user can just keep that assembly and then reuse it in Assembly Three. Now, these are two completely different documentation systems, but as Assembly One is being used in Assembly Three, it becomes part of that documentation as well. A real-world example could be if you were trying to use OpenShift. Say the OpenShift documentation has login, how to use, and logout. That whole documentation can then be put inside the Red Hat Enterprise Linux documentation as: Red Hat Enterprise Linux login, how to use OpenShift, logout. You don’t need to write documentation multiple times; you just provide context to small modules and then create assemblies. If you look at modules as Lego bricks, assemblies can look like rocket ships.
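A minimal sketch of modules and nested assemblies, assuming a simple in-memory model. The `Module` and `Assembly` classes and the module bodies are hypothetical; only the three module types and the Assembly One inside Assembly Three arrangement come from the talk.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Module:
    name: str
    kind: str   # "concept", "procedure" or "reference"
    body: str


@dataclass
class Assembly:
    name: str
    parts: List[Union[Module, "Assembly"]] = field(default_factory=list)

    def flatten(self) -> Iterator[Module]:
        """Yield every module in reading order, descending into nested assemblies."""
        for part in self.parts:
            if isinstance(part, Assembly):
                yield from part.flatten()
            else:
                yield part


module_a = Module("Module A", "concept", "What logging in means.")
module_b = Module("Module B", "procedure", "1. Open the console. 2. Log in.")
module_c = Module("Module C", "reference", "Table of login options.")
module_d = Module("Module D", "procedure", "1. Select Log out.")

assembly_one = Assembly("Assembly One", [module_a, module_b, module_c])
# Assembly Three reuses Assembly One as a whole instead of copying its modules.
assembly_three = Assembly("Assembly Three", [assembly_one, module_d])

names = [module.name for module in assembly_three.flatten()]
print(names)
```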

What modular documentation is not. Specifically, it’s not documentation split into small parts that are meaningless. Every small part is a module and it has meaning; it has its own value. Nor is it a collection of modules that have no relationship to one another. That’s why we have assemblies, right? Because every assembly has modules that are related to each other or need a similar context.

I’m going to show you some diagrams, okay? And I’m going to go slow with this so that I’m able to go through all the diagrams well enough. The first diagram is about how we can tool this modular documentation system. It’s divided into four sections: content creation, content transformation, management and presentation. You might relate this to the earlier Pantheon II diagram that I’ve shown you. So let’s dive a bit deeper into this. The creation phase is where the content creators are actually writing documentation. They’re using various tools to write their own documentation and then just storing it in the repositories. The main aim of modular documentation tooling is to let documentation writers focus on writing documentation and not do the whole tooling themselves. The tooling should always be automated so that technical writers can really focus on writing documentation and producing high quality content.

Once the creation phase is done, that is where the transformation phase comes in. It doesn’t matter what kind of markup you’re using or what kind of systems you’re using to write your documentation; there has to be some way to transform it into a readable format for the user. A lot of times, users want to see RSS feeds. They want publications to be read. They need PDFs to be read afterwards. They need HTML formats to be seen in a beautifully presented way. The transformation layer takes care of all of that. You can put some kind of tooling into it so that it’s automated using a cron job or, say, a Jenkins system, or any other system out there which gives you different formats. So this is where the tooling space comes in.

The third system is management. I put a term there called “Module Library.” This might be something new. A module library is like a content system that hosts different types of modules, so that they are easily findable by the users of the modules. Say I’m a content writer and I’ve written N number of modules. Now, another content writer wants to use my modules. How do they find them? They find them inside the management system itself. They find them inside a module library. It also has a distribution service attached to it so that a module is easily findable and easily changeable, and whenever there is a change in the module, all its revisions can easily be found and automatically updated in the different documentation systems that are using it. Every module has its own metadata associated with it, and the library works along that metadata and keeps updating it according to revisions.

And the last system is called the presentation layer. The presentation layer is basically nothing but a final content artifact presentation. This is where the consumers come in, be it customers, users or any other system, or a third party system that is trying to consume this particular documentation. The best part about the presentation layer is it could be anything. It could be a web stack, it could be a single page application, it could be anything that is trying to consume. It could be a user as well. What I’m trying to say is it’s content display on demand.

Let’s dive one level deeper so that the concept of migration of modules is understood. There is always a module, but that module can actually have multiple sources. We need to have a default module source that works and is written by content writers. Say the module source is “Migrating to OpenShift” and then you put in some context to tell it, “Migrate X to OpenShift.” So there could be X, Y or Z that you’re migrating to OpenShift. That then creates module instances. Every module becomes an instance and, further on, whenever there is a change in the source module itself, the instance is automatically updated, and then there’s the module variant. So there is a step-by-step system of how modules work in assemblies, and while tooling, we have to make sure that this whole system is captured. Let’s consider an example. Say I have written a module and somebody else is using it. Tomorrow, I go and change the source of the module itself. Now, what do the users or the other content writers get? Do they get the original instance? Do they get my updated instance? Say there are three updates that I made, so three revisions were created. What do they get now? Do we let content writers choose which instance they need? That can be very heavy. So we have to take a call where they will always get to see the instance of the module that they had exported, and they can also update it to the latest version, but to no version in between.
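The decision described above, where a writer keeps the revision they exported and may jump only to the latest one, can be sketched like this. `ModuleSource` and `ModuleInstance` are hypothetical names, and revisions are kept in a plain list for illustration.

```python
class ModuleSource:
    """The default source of a module, with every revision kept in order."""
    def __init__(self, name, body):
        self.name = name
        self.revisions = [body]

    @property
    def latest(self):
        return len(self.revisions) - 1

    def revise(self, body):
        self.revisions.append(body)

    def export(self):
        """Hand out an instance pinned to the revision that is current right now."""
        return ModuleInstance(self, self.latest)


class ModuleInstance:
    """What another writer uses: pinned until they explicitly ask for the latest."""
    def __init__(self, source, revision):
        self.source = source
        self.revision = revision

    @property
    def body(self):
        return self.source.revisions[self.revision]

    def update_to_latest(self):
        # Deliberately no way to pick an in-between revision.
        self.revision = self.source.latest


source = ModuleSource("migrating-to-openshift", "Draft steps")
instance = source.export()
for text in ("First update", "Second update", "Third update"):
    source.revise(text)

before_update = instance.body
instance.update_to_latest()
after_update = instance.body
print(before_update, "->", after_update)
```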

This is more of a use case diagram where, say, a writer creates modules and stores them in content repositories. Similarly, they can create YAML configurations. Now, when I say YAML configuration, this is like a master file which has links to all the other documentation you can use or are trying to reference. These modules will then be pushed into the module library. The module library is where the magic happens. Every module gets its own context package. A context package is nothing but small metadata, JSON, that’s it. It just tells us what the module is about, what exactly the context is and the version it is using. Or however many metadata fields you want.
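A context package along those lines might look like the following. The field names and values are assumptions chosen for illustration; the talk only says that the package is small JSON metadata describing what the module is about, its context and its version.

```python
import json

REQUIRED_FIELDS = ("module", "about", "context", "version")

# Hypothetical context package for one module in the library.
context_package = {
    "module": "logging-in",
    "about": "How to log in to the product",
    "context": {"product": "Product One"},
    "version": 3,
}


def missing_fields(package):
    """Return the required fields that the package does not provide."""
    return [name for name in REQUIRED_FIELDS if name not in package]


encoded = json.dumps(context_package, indent=2, sort_keys=True)
decoded = json.loads(encoded)
print(encoded)
print("missing:", missing_fields(decoded))
```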

There are other additional context packages that can be added to this module library, but it depends upon how heavy you want the module to be. And then you get the module and its own AsciiDoc or XML file. The module library also helps users create modules that can be referenced by other modules. Say there are two modules, one of them needs a reference from the original base module, and there is a change. The module that the writer is using gets automatically updated. The users also have a choice to not update it automatically and keep on using the original version they want. The D section here has some internal logic that provides chroming, analytics and other things that your websites might want to use through modules. And ultimately, the modules can then be compiled into HTML formats and pushed out for the user to consume.

There was a lot of information to take in, I understand, but when I share the slides, maybe this will be easier. I’ll also share my speaker notes so that you can go through all the information that I just spoke about. For example, what this section means and what that section means is where I’ll share some more detail as well.

Last diagram, I promise: a simple architecture. This is how it looks from a bird’s-eye perspective. It’s nothing, it’s simple. People write documentation and stop there. The automation process now begins: it transforms the documentation and pushes it out to the management system, and the management system here is the core. The management system does its magic, has the module library and everything. Now, the delivery system can either ask the management system what has changed or just wait for a push from the management system. The choice is yours; it depends on the way you want to deal with the tooling. But then again, the consumer. In the first diagram, you might recall, I said search, and here I said consumer. The reason is the consumer can be anybody. It could be search, or it could be a third party that is listening to what has happened to your modules or to your documentation. An example I can give is, say you have product documentation that is going out and you think that you can create APIs that can easily use that documentation and display it on a different website or do some other stuff. For example, make your search better through indexing, or trigger a particular event that will update all your internal teams that new documentation for this particular product is available. That could be your consumer.
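The two delivery options, asking what has changed versus waiting for a push, can be sketched side by side. The class and method names are hypothetical; consumers are plain callables, so search, a delivery system or a third-party listener all look the same to the management system.

```python
class ManagementSystem:
    """Core system: records changes, pushes to subscribers, answers pull requests."""
    def __init__(self):
        self.revision = 0
        self.changes = []       # list of (revision, doc_id) pairs
        self.subscribers = []   # callables invoked on every change (push model)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def record_change(self, doc_id):
        self.revision += 1
        self.changes.append((self.revision, doc_id))
        for notify in self.subscribers:
            notify(doc_id)

    def changes_since(self, revision):
        """Pull model: a consumer asks what changed after the revision it last saw."""
        return [doc_id for rev, doc_id in self.changes if rev > revision]


management = ManagementSystem()

# Push: a consumer (search, a third party, an internal notifier) waits to be told.
pushed = []
management.subscribe(pushed.append)

management.record_change("install-guide")
management.record_change("release-notes")

# Pull: the delivery system last synced at revision 1 and asks what is new.
pulled = management.changes_since(1)
print(pushed, pulled)
```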

And, last but not least, whenever you’re designing your tool for your documentation, there are very important points to take care of. Four important points. The first is that it’s really important to stay focused on whom you’re building it for. If it is too complicated, it’s not working for you, because maybe it’s taking you too much time to build it and users have already moved on, right? The second is that the data you get from usage is really important. I’m not saying invade the privacy of users, but you can clearly state that you’re collecting some insights from usage data. The third is resilience. Your code should be very easy to change. Again, best practices come in when you’re writing a tool for your documentation. And also, it should be, say, a future-proof architecture. Say tomorrow there is a new style of modular documentation that has come up. Your tool should be able to evolve into a system that can host the new style of modular documentation. Basically, your tool should have a product life cycle that has clear update criteria. And that’s it. Thank you so much.
