While Harvest provides powerful reports sometimes nothing beats having all of your data in a spreadsheet.  Unfortunately, the defacto spreadsheet tool (Excel) has some tricky file formats, making programatic export more difficult than it should be. We recently made improvements to how we handle Excel exports in Harvest, and I’d like to share a few options for exporting spreadsheets with Ruby.

Perhaps the simplest way is to export Comma Separated Value (CSV) and let the Excel process these. Generating CSV files is generally fast, the format is well known but not trouble free. If you only export simple ASCII characters, Excel will work fine, but drop one non-English character in and Excel gets confused. Excel cannot select the encoding, so if you want to export UNICODE data (such as when you’re working in non-English languages) using CSV files with Excel won’t do. Sure all other spreadsheet programs can import UTF8 encoded CSV files, some of these programs are free (OpenOffice) so if you have the option of ignoring Excel CSV will do.

Next is using TSV files (Tab Separated Values), these solve the UNICODE problem but the user must remember to explicitly convert to native Excel format and this is not what they usually do with spreadsheets. Clearly, we must produce native Excel files, and this means either XLS (format used up till Office 2003 included) or the new XLSX documents based on the controversial OpenXML. The older format is undocumented but many hours of reverse engineering has been spent on it and there are quite a few libraries out there. For Ruby the most notable is Spreadsheet.

XLS & Spreadsheet has some drawbacks. First it has a column limit of 256 and 64K on the number of rows. Generally you will break down a lot sooner when exporting large number of rows as Spreadsheet is very slow to construct the global string table. See all strings in an XLS file are stored in a global string table then the row data merely refers to string by the index in the table. A nice optimization trick that makes opening XLS files significantly faster, but it also makes writes slow. Constructing the string table takes time and with the particular implementation Spreadsheet consumes a lot of memory as it keeps the entire document in memory all rows included, then upon save it constructs a Hash of all strings in the document. In short Spreadsheet is a great library for parsing and constucting decorated documents but breaks down above 20k rows.

We have implemented an XLSX generator to get Excel exports in Harvest. Unlike XLS the format is documented, albeit the thousands of pages of documentation provides only marginal help. However, it does support another string storage model, where all values are stored inline. This makes file generation faster while’st opening such documents for the first time will be somewhat slower. The resulting export will be about the same size since XLSX is basically a zip file of other xml documents. To use, first you need to install the gem:

  $gem install simple_xlsx_writer

Creating OpenXML files is fairly straight forward:

  require 'rubygems'
  require 'simple_xlsx'

  SimpleXlsx::Serializer.new("test.xlsx") do |doc|
    doc.add_sheet("People") do |sheet|
      sheet.add_row(%w{DoB Name Occupation})
      sheet.add_row([Date.parse("July 31, 1912"),
                     "Milton Friedman",
                     "Economist / Statistician"])
    end
  end

The gem will recognize a few basic ruby data types and generate cells of the right type. There is no other support for formating the document as in different colors, font styles etc just raw output compatible with the following programs:

  • Open Office 3.2 (Linux, Mac, Windows)
  • Neo Office 3.2 (Mac)
  • Microsoft Office 2007 (Windows)
  • Microsoft Office 2010 (Windows)
  • Microsoft Office 2008 for Mac (versions 12.2.5 or above)
  • Microsoft Excel Viewer (Windows)

One notable exception from this list is Numbers from iWork ’09 as it does not yet support documents in the inline string storage model. Apple may fix this eventually.