Excel is still one of the most common tools in science, so I routinely work with Excel data files, both as input and as output.
Today’s task was to import data for the tracker. There’s a small Perl script that reads data from a set of spreadsheets and merges them into a single data block. I used Perl because Java is boring for a small and simple task like this, and performance isn’t much of an issue for a command-line import tool.
There are many different modules for reading Excel files, but I'm currently only interested in "modern" .xlsx files, which are actually zip archives containing XML data. The modules of interest are:

- Spreadsheet::XLSX
- Spreadsheet::XLSX::Reader::LibXML
- Spreadsheet::ParseXLSX
There is actually a fourth module, Data::XLSX::Parser, but I wasn't aware of it at the time I did this work.
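The zip-of-XML structure is easy to verify for yourself. A minimal sketch using Archive::Zip (a general-purpose CPAN module, not one of the spreadsheet readers; the file name is hypothetical):

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# An .xlsx file is just a zip archive, so a plain zip reader can open it.
my $zip = Archive::Zip->new();
$zip->read('data.xlsx') == AZ_OK    # hypothetical input file
    or die "Cannot read data.xlsx as a zip archive";

# Lists members such as xl/workbook.xml and xl/worksheets/sheet1.xml.
print "$_\n" for $zip->memberNames;
```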
All three have broadly similar interfaces, but their internals differ significantly.
So my import script was using Spreadsheet::XLSX, and it was normally OK, except for its total inability to handle some useful cases (such as Excel's special values like #N/A). This module also does everything through raw access to hashes, so it's fairly brittle.
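To illustrate what that hash-based access looks like in practice, here is a sketch of a read loop in the style of the module's synopsis (the file name is hypothetical):

```perl
use strict;
use warnings;
use Spreadsheet::XLSX;

my $excel = Spreadsheet::XLSX->new('data.xlsx')    # hypothetical input file
    or die "Cannot parse data.xlsx";

# Everything is reached through raw hash and array access, so any change
# to the module's internal data layout silently breaks this code.
for my $sheet (@{ $excel->{Worksheet} }) {
    printf "Sheet: %s\n", $sheet->{Name};
    for my $row ($sheet->{MinRow} .. $sheet->{MaxRow}) {
        for my $col ($sheet->{MinCol} .. $sheet->{MaxCol}) {
            my $cell = $sheet->{Cells}[$row][$col] or next;
            print $cell->{Val}, "\t";
        }
        print "\n";
    }
}
```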
I wanted to switch to something more solid. I started with Spreadsheet::XLSX::Reader::LibXML (I'm a long-time user of LibXML), but it very quickly proved extremely slow: it is still running an import I started three hours ago, and it's still on one spreadsheet. The same import took about three minutes with Spreadsheet::XLSX.
Turning to Spreadsheet::ParseXLSX, I then found that it broke immediately on my data files. There is an outstanding issue for this (see: https://github.com/doy/spreadsheet-parsexlsx/issues/38), and I was midway through building a test case to help when I noticed that the error went away when I removed a chart.
Thanks to GitHub, I could fork the repository, build a test case, and submit a pull request, which is now waiting to be merged. By far the longest step was installing all the Dist::Zilla plugins. In the meantime, I can delete the charts by hand.
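For comparison, Spreadsheet::ParseXLSX follows the object interface of Spreadsheet::ParseExcel rather than exposing raw hashes. A minimal read loop looks roughly like this (the file name is hypothetical):

```perl
use strict;
use warnings;
use Spreadsheet::ParseXLSX;

my $parser   = Spreadsheet::ParseXLSX->new;
my $workbook = $parser->parse('data.xlsx')    # hypothetical input file
    or die "Cannot parse data.xlsx";

for my $worksheet ($workbook->worksheets) {
    my ($row_min, $row_max) = $worksheet->row_range;
    my ($col_min, $col_max) = $worksheet->col_range;
    for my $row ($row_min .. $row_max) {
        for my $col ($col_min .. $col_max) {
            my $cell = $worksheet->get_cell($row, $col) or next;
            print $cell->value, "\t";    # proper accessor, not a raw hash
        }
        print "\n";
    }
}
```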
So for now, the choice seems to depend on the task: