šŸ–¼ļø Adventures in netpbm

Continuing on a journey of computing archaeology, it’s time to extend Pillow with a plugin that can load images that are supported by netpbm.

The first step was a bridge for the anyto* applications, but it turned out it wasn’t so reliable - the tools rely on file and many ancient and obscure formats lack proper libmagic detection, some types of test data is hard to come by. So I ended up having to write my own detection rules, which was a bit of a pig.

I grabbed a bit of test data for formats I really care about (like seascape.iff), created synthetic test data for as many as I could using netpbm itself, pushed an alpha release to pypi and asked for it to be added to Pillow’s docs, so everyone in Pythonland can open the majority of files with as little friction as possible.

This was the easy part.

Then it was time to find test data around the web, figure out detection rules for data that file didn’t support, add functions where that’s infeasible, link in MIME types wherever they’re wrong around the web, expand my test coverage and open bug reports and pull requests to help steer various projects towards consistency, update wikidata where I could be bothered, submit new types and amendments to PRONOM, find bugs in my bridge and CI pipeline. And of course document the research and work here as I go, partly because I have a short memory, partly a lack of interesting things to write about, and a short memory, but also to link sources for future archivists, and a short memory, and also for the trophy cabinet šŸ†.


Andrew Toolkit Raster

atk

A nice simple file format to load, a properly registered MIME type - at least for the data format itself (application/andrew-inset) - with test data available in the project’s source since the 90’s. It has an ArchiveTeam page but didn’t have a Wikidata entry.

netpbm can load and save these, but file thinks all Andrew files are LaTeX documents, and the example raster files are buried in the distro’s source tree.

Let’s fix both of those things:


AutoCAD Slide

sld

AutoCAD slides are vector screen dumps of AutoCAD created by MSLIDE and viewed with VSLIDE. While the specs have been removed from AutoDesk’s website, Martin Reddy has a backup.

This is when I discovered Robert Schultz’s awesome test data collection, which I’m now mining for test data. These were detected as data by file - which means no detection rule. Its PRONOM entry lists 3 different MIME types (application/sld, application/x-sld and image/x-sld), and AutoDesk never actually registered an image/vnd.sld with IANA. Unsurprising since other people registered their other formats on their behalf, but they were registered and I always wanted to do that. PRONOM’s first entry has a weak collision with application/vnd.ogc.sld+xml, and the bad MIME bleeds through into Archive Team’s wiki, and across github like in MegaMimes and Vitam, presumably via DNSCore’s seed data repo.

Fixing as much of this as possible, I contacted PRONOM, sent a PR to MegaMimes, submitted a libmagic rules file, updated WikiData sources with links to existing usage, started the IANA registration process for image/vnd.sld to replace the image/x-sld I’ve been lobbying for. 40 years late, but better late than never right?

Once the IANA registration is complete and/or detection rules are in libmagic, Apache Tika and freedesktop shared-mime-info will one day follow suit. If not, I’ll give them a nudge.


FIASCO (Fractal Image And Sequence Codec)

fiasco

This novel fractal video format was created back in ā€˜94-ā€˜99 by Ullrich Hafner as part of his PhD thesis. It uses Weighted Finite Automata to compress the data, and the results are superb - it crushes other formats of the time at low bitrates, the above heavily compressed image would be unrecognisable as a 3.7K JPEG.

There’s actually two file types here that share a similar magic: one is the compressed binary data containing an image or some video, the other is an ASCII basis dictionary used during compression. Writing an image loader means I only care about the first type, but detection rules deserve to support both. netpbm uses the .wfa extension for compressed files, while the standalone FIASCO tools use .fco for both types. So we’ll document both in pillow-netpbm and detect by magic.

The files detect as data as they lack a detection rule. There’s also no PRONOM identifier and no MIME type, but since it is known by both ArchiveTeam and Wikidata, and was even mentioned in Linux Journal, that warrants both detection rules and a PRONOM identifier. TBH the format really deserved wide support back in the 2000s, I suspect it would have out-performed DivX during the era of early swashbuckling video if compression was faster.

So, let’s fix the omission:


MRF (Monochrome Recursive Format)

mrf

Created by Russell Marks in 1997 for zgv, this simple monochrome bitmap compression format uses quadtree decomposition - images are divided into 64x64 tiles, each recursively subdivided until we end up with a single colour. Brian Raiter liked it enough to create a colour extension called PRF (Polychrome Recursive Format), which, unlike netpbm, is supported by both XnView and Konvertor.

MRF has no MIME type registered and nobody is using image/x-mrf anywhere in the wild. It’s unsupported by libmagic and unknown by PRONOM. But again has a Wikidata entry and an ArchiveTeam page, and of course Sembiance has more test data for us to play with.

So we’ll pilfer his test data again and write a magic rule, use the rule in the detector, and link this one in to PRONOM like the others:


YBM Face File

ybm

Created by Bennet Yee at CMU around 1988 for his face and xbm programs - small monochrome portraits for UNIX user avatars. Jamie Zawinski and Jef Poskanzer wrote the netpbm converters in 1991. The name doesn’t seem to be official - ā€œYBMā€ appears to be a netpbm convention for ā€œYee BitMapā€, to distinguish it from X BitMap (XBM).

The format is simple: a 6-byte header (!! magic, BE 16-bit width, BE 16-bit height), then 1bpp bitmap data packed into 16-bit BE words with reversed bit order. It has a Wikidata entry and an ArchiveTeam page, but no PRONOM identifier or MIME type. Sembiance has test data, of course.

The 2-byte magic !! followed by two 16-bit big-endian ints is too generic for a reliable libmagic rule, since it lacks a way to do bit-twiddling arithmetic to make sure the file is the proper size. So rather than matching almost everything starting with !! and rarely ever see a positive match, we’ll skip submission to libmagic. I did make a rule file anyway, and submitted the format details to PRONOM for posterity.


NEO (Atari NEOchrome)

neo

NEOchrome was Atari Corporation’s own paint program, released in 1985 by Dave Staugas and Jim Eisenstein. Images are planar 16 colour files very similar to Atari DEGAS but with extra header data and a size of exactly 32128 bytes.

Unfortunately, the header partially overlaps with DEGAS files. This causes false detection as DEGAS in libmagic and scrambles images in my own degas loader. netpbm’s anytopnm doesn’t do automatic detection of DEGAS or NEOchrome images, which is also worth flagging.

The format has an ArchiveTeam entry but no PRONOM identifier. Wikidata lists both image/x-neo and image/x-neochrome as unregistered MIME types, but I couldn’t find evidence of the latter being used; dexvert and ksquirrel use image/x-neo. Sembiance’s dexvert test data contains 14 test files for us to plunder, including the one above.

Let’s fix some of this:


JBIG (Joint Bi-level Image experts Group)

jbig/bie

JBIG1 is a black and white image compression standard used for fax and scanned documents. The standalone file format is called a BIE (Bi-level Image Entity). Files use .jbg, .jbig, or .bie extensions - all the same format. The reference implementation is Markus Kuhn’s jbigkit.

image/x-jbig is the de-facto MIME used by both freedesktop and Apache Tika, it’s in PRONOM, Wikidata, and ArchiveTeam’s wiki.

So, it’s well-known, but files are misidentified by file as Targa image data because like Targa there’s no magic string. I wrote a detection rule which is too fragile for libmagic. Crappy file extension detection is all we have in pillow-netpbm for now. Oh, and two of Sembiance’s JBIG test files are mislabelled JPEGs, so I guess I can help there at least.


CompuServe RLE

cis/rle

Back in the 80s before GIF killed it, Compuserve had a 7-bit image format for use over serial links. Basically ESC G followed by M or H for medium or high resolution, then alternating lengths of off and on pixels with space being zero.

file identifies these as ā€œASCII text with escape sequencesā€, which they are. But we can do better than that since ESC G isn’t used for anything else. It’s a well-known format with no MIME type and a missing magic rule, so I won’t link to all the things, just add the magic rule:


Usenix FaceSaver

facesaver

FaceSaver was a system by Metron Computerware for capturing and distributing grayscale face photos of USENIX conference attendees, starting in 1987. It’s an internet message format text file with headers like FirstName: and E-mail:, and hex-encoded pixel data as the body. So, it detects as ASCII text. The above picture is RMS attending USENIX 1990, the original file contains his MIT AI lab email address - legendary piece of test data.

We have a Wikidata entry and a mention on ArchiveTeam’s wiki, but no MIME, PRONOM, or libmagic rule. Easy enough to detect - let’s fix the last two:


Sun Icon

sun-icon

The icon and cursor format for SunView from the mid-1980s. These are identified as ASCII text by file, and have a a C-style comment header (/* Format_version=1, Width=64, ... */) followed by comma-separated hex values. Usually monochrome 64x64 bitmaps, though 8-bit grayscale was added later for OpenWindows.

So, no libmagic rule, only icons in Sembiance’s test data. No PRONOM identifier and no MIME type. Has a Wikidata entry and ArchiveTeam wiki page.

Let’s submit a detection rule with image/x-sun-icon to match the existing image/x-sun-rasterfile convention, add to PRONOM, steal some test data from an early version of Solaris and update the wiki page.


SBIG CCDOPS

sbig

SBIG sold CCD cameras for amateur astronomy starting in 1988. Their CCDOPS software used a format with a 2048-byte ASCII header starting with ST-N Image(where N is the camera model), followed by 14-bit or 16-bit little-endian pixel data with optional delta compression. The header contains metadata in Key = Value pairs separated by LF-CR (wut?!) line endings.

file identifies these as ā€œdataā€ as there’s no libmagic rule. There’s a Wikidata entry and a sparse ArchiveTeam page that explicitly asks for expansion, but no PRONOM identifier or MIME type. The separate Application Note spec has been offline for years since SBIG was bought by Diffraction Limited, though the CCDOPS Version 3.5 manual is still hosted and contains the full Type 3 spec in Appendix A. A bit crappy of them, but SBIG cameras use FITS instead nowadays - so whatever 🤷

Sembiance has a single sample to work from, and we can generate them with netpbm. So let’s add a magic rule and a PRONOM entry:


Interleaf

interleaf

Interleaf was an aerospace/defense doc system bought out by BroadVision around 2000. The software is long gone, but the image format has lived on in netpbm’s leaftoppm and ppmtoleaf since ā€˜94.

Magdir/interleaf matches the document format but not the image format, so file thinks images are ā€œdataā€. Netpbm, Sebiance’s samples and wikipedia confirm a magic of \x89OPS for the images. There’s an ArchiveTeam page and a Wikidata entry, but no PRONOM identifier and no MIME type.

Let’s fix it in PRONOM and file:


GEM Raster

gem

GEM was Digital Research’s graphical desktop environment, first released in 1985 for the IBM PC and later the Atari ST. Its raster image format (IMG) was used by GEM Paint and Ventura Publisher. The format is well-standardized for monochrome images, but color support spawned several incompatible variants: XIMG, STTT, TIMG, and HyperPaint.

GEM has an ArchiveTeam page, a PRONOM entry, a Wikidata entry, and libmagic rules already exist in Magdir/images — but they’re broken. Every GEM file comes back as ā€œdataā€. The root cause: GEM v1 images share a 0x0001 version word with Atari DEGAS mid-res bitmaps, and the existing rule tree nests the GEM header-size checks as siblings of DEGAS file-size tests that use >>-0 offset (end-of-file). This corrupts the offset context for the GEM checks that follow — they end up reading from past the end of the file instead of from the header.

The fix is straightforward: give GEM v1 its own top-level 0 beshort 0x0001 entry, independent of DEGAS. Since this touches the same block as the NEOchrome fix, I’ve combined both into a single updated patch. Sembiance has a generous collection of over 100 GEM files to test with, including XIMG and TIMG variants — though netpbm’s gemtopnm only handles v1.


MacPaint

macpaint

MacPaint was the original bitmap editor for the Macintosh, created by Bill Atkinson and released in 1984. Images are always 576x720 monochrome, PackBits compressed. The format comes in three flavors: MacBinary-wrapped (128-byte finder header, PNTGMPNT type/creator at offset 65), versioned (4-byte version number + 304 bytes of brush/pattern data), and null-header (512 bytes of zeros before the pixel data).

libmagic’s existing rule in Magdir/apple detects the MacBinary variant via PNTGMPNT at offset 65, but the other two variants come back as ā€œdataā€. The versioned files start with 0x00000002 or 0x00000003 followed by 0xFFFFFFFF (default brush patterns) — distinctive enough for a magic rule. The null-header variant is indistinguishable from any other file that starts with zeros.

The format has a PRONOM entry, a Wikidata entry, and uses image/x-macpaint as its MIME type. Sembiance has a good collection of 31 files across all three variants.


Group 3 Fax

g3

CCITT Group 3 is the standard fax compression format, using Modified Huffman coding for monochrome scan lines. Raw G3 files have no header, no magic bytes, and no signature of any kind — just Huffman-coded pixel data starting with an EOL marker. The first byte varies depending on the content of the first scan line.

libmagic’s existing rule in Magdir/modem attempts detection via 0x0100 or 0x1400 (common first EOL byte patterns), then runs a deep exclusion tree to avoid false positives against TrueType fonts, DEGAS bitmaps, GEM images, Panorama databases, and various other formats that happen to start with similar bytes. It catches some files but misses most — three of our four Sembiance test files come back as ā€œdataā€. There’s no fix here that wouldn’t also produce false positives; raw G3 is fundamentally undetectable by content.

netpbm’s g3topbm only supports MH (Modified Huffman) compression, not MR or MMR. It’s also worth noting that many ā€œ.faxā€ files in the wild are actually TIFF-wrapped G3, not raw — Sembiance’s CROW.FAX is a TIFF with compression=bi-level group 3.

The format has a PRONOM entry, a Wikidata entry, MIME type image/g3fax, and an ArchiveTeam page. Extension-only detection is all we can do in pillow-netpbm.


The rest

So, this was the lowish hanging fruit and historical formats that kind of matter, there’s a bunch more that I couldn’t find test data for, but I think 3 weeks of shovelling reports onto other projects is enough. It’s slow work and puts an unfair burden on others, given the pace at which I work. So I think I’ll go back to silo-mode for a while - at least until some of the above are merged.