JavaScript word-extractor package release 1.0

Posted by Stuart on May 17, 2021 · 4 mins read

I’ve just released version 1.0 of my JavaScript module for reading text from Word files. This has been a long time coming. A lot has changed, and a fair number of teeny bugs which nobody had noticed except me have also been fixed.

Essentially, you can point this module at a file (or a Node.js Buffer) and it’ll give you an object containing the file body, headers, footers, endnotes and footnotes, and the annotation comments, without needing to install Word or anything else. It’s fast, too. If you need to process a ton of Word files, this is the kind of tool you need.
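
As a rough sketch of what that looks like in code: the accessor names below (`getBody`, `getHeaders`, `getFooters`, `getFootnotes`, `getEndnotes`, `getAnnotations`) and the `includeFooters` option are taken from the package README as best I can tell, so check them against the version you install. The `summarize` and `extractSummary` helpers and the `report.doc` file name are made up for illustration.

```javascript
// Collect each text section of an extracted document into a plain object.
// `doc` is whatever extractor.extract() resolves to. The accessor names are
// assumed from the package README; verify them against your installed version.
function summarize(doc) {
  return {
    body: doc.getBody(),
    // Assumption: getHeaders() includes footer text unless told otherwise.
    headers: doc.getHeaders({ includeFooters: false }),
    footers: doc.getFooters(),
    footnotes: doc.getFootnotes(),
    endnotes: doc.getEndnotes(),
    annotations: doc.getAnnotations(),
  };
}

// Extract from a file path or a Buffer and return the summary object.
// The package is required lazily so summarize() can be reused on its own.
async function extractSummary(source) {
  const WordExtractor = require("word-extractor");
  const extractor = new WordExtractor();
  const doc = await extractor.extract(source);
  return summarize(doc);
}

// Usage (hypothetical file name):
// extractSummary("report.doc").then((s) => console.log(s.body));
```

Because `extract` takes either a path or a Buffer, the same helper should work whether the file is on disk or arrived as an upload.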

The API hasn’t changed at all, but because almost all of the internals have been rewritten, it’s a major version update, so that if anything is broken, you won’t automatically slurp up the new version through semantic versioning.

The major change is that the module now supports “Open Office”-style (docx) files transparently alongside classic “OLE”-style (doc) files.

Several people had requested support for .docx files, or at least (rightly) complained that the module didn’t really explain why it didn’t handle newer files. And I’d not really had the time to write the code for it until recently. Anyway, it seemed time to handle the two transparently, even though the extraction process is entirely different.
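
To give a feel for how two such different formats can be told apart without trusting the file extension, here is a minimal sketch that sniffs the container type from the first few bytes. It is not necessarily how the module does it internally; `sniffContainer` is a hypothetical helper. The byte signatures themselves are the standard ones: OLE compound files start with `D0 CF 11 E0 A1 B1 1A E1`, and .docx files are zip archives, which start with `PK\x03\x04`.

```javascript
// Leading bytes ("magic numbers") of the two container formats.
const OLE_MAGIC = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);
const ZIP_MAGIC = Buffer.from([0x50, 0x4b, 0x03, 0x04]); // "PK\x03\x04"

// True when `buffer` begins with the bytes in `magic`.
function startsWith(buffer, magic) {
  return (
    buffer.length >= magic.length &&
    buffer.subarray(0, magic.length).equals(magic)
  );
}

// Return "ole", "zip", or "unknown" for a Buffer holding the start of a file.
// Note this identifies only the container: an OLE file could be a .xls, and a
// zip could be anything, so a real extractor still has to look inside.
function sniffContainer(buffer) {
  if (startsWith(buffer, OLE_MAGIC)) return "ole";
  if (startsWith(buffer, ZIP_MAGIC)) return "zip";
  return "unknown";
}
```

Once the container is known, each format can be routed to its own extraction path while the caller sees a single entry point.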

Also, I felt a bit guilty about the lack of maintenance on this module, especially when I pushed at it and found how many things weren’t quite right. As is usual for coders, things start to happen when guilt beats out inertia.

There’s still room for more work, but it’s decently well tested. The main issue for now is that in some files, there are small whitespace differences between the text extracted from an OLE-based and an Open Office-based version of the same file. The text is identical, but there are some extra newlines I still need to puzzle out. Text boxes might also be nice: the module doesn’t yet extract text from them at all.
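
If those newline differences matter to you in the meantime (say, you are comparing the .doc and .docx exports of the same document), one workaround on the caller’s side is to normalize whitespace before comparing. This is only a sketch; `normalizeWhitespace` and `sameText` are hypothetical helpers, not part of the module.

```javascript
// Normalize line endings and collapse blank lines, so that two extractions
// differing only in extra newlines compare equal.
function normalizeWhitespace(text) {
  return text
    .replace(/\r\n?/g, "\n")    // CRLF or bare CR -> LF
    .replace(/[ \t]+\n/g, "\n") // drop trailing spaces and tabs on each line
    .replace(/\n{2,}/g, "\n")   // collapse runs of newlines into one
    .trim();                    // ignore leading and trailing whitespace
}

// Compare two extracted texts, ignoring the whitespace differences above.
function sameText(a, b) {
  return normalizeWhitespace(a) === normalizeWhitespace(b);
}
```

This deliberately throws away paragraph spacing, so it suits equality checks and tests rather than anything you want to display.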

If you have any requests or comments, you can find the module on npm at https://www.npmjs.com/package/word-extractor, or on GitHub at https://github.com/morungos/node-word-extractor. Feel free to create an issue there if you have any ideas, or find something that doesn’t work as well as you’d like. I especially like to add breaking cases to the test data, so if you can make a tiny Word file that doesn’t work, send it my way.