Changing with the times as the world moves from paper to digital, the Harvard Library has adopted forensic techniques to save material stored on obsolete formats. Jennifer Weintraub, a digital archivist and librarian at Schlesinger Library, eyes a floppy disk (photo 1), a format many of today’s College students have never even seen. Even the technology to retrieve the data must be preserved (photo 2). Some of the formats (photo 3) require a Zip drive — yet another relic to preserve.

Kris Snibbe/Harvard Staff Photographer

Saving the digital record

Harvard Library adopts forensic techniques to safeguard material stored on obsolete formats

When digital becomes dinosaur, most people simply get inconvenienced. But librarians and archivists get seriously concerned.

Ensuring that digital content — whether it’s a short story by John Updike or a very rare audio recording of a vanished Native American language — lives on past its initial platform is one of the most pressing issues in preservation science. Harvard is one of a handful of cultural institutions in the first wave of adopting a technology and process to preserve its digital content.

Libraries and archives at Harvard hold thousands of unique items across hundreds of digital formats, including aging technology such as CDs, floppy disks, tapes, and cassettes. To retrieve content prior to total obsolescence or decay of digital formats, librarians are using digital forensic software commonly employed by the police or the FBI to solve crimes, which enables them to identify content noninvasively and migrate it to a more stable platform.

Digital forensics was developed to create authentic, unimpeachable source data suitable as evidence in criminal trials. Library staff members hold themselves to the same high standards and model some of their workflow on law-enforcement practices. After all, altering a document literally rewrites history.

But librarians and archivists face mounting urgency in this task. For centuries, data meant print. Paper was far and away the best medium to record and preserve information, and for good reason. It is relatively affordable, easy to make, and stands up well to benign neglect. Open a book that was placed on a shelf 200 years ago, and its pages will still provide the same information, tell the same stories.

With the comparatively quick shift to digital content delivery, and the even faster evolution of digital hardware from Mark III-era behemoths to today’s sleek iPhones, an increasing amount of content is born digital: created, disseminated, and accessed completely on computers, existing as 1s and 0s instead of printed type, engraved texture, or magnetic coding, all of which are sturdier than newer technologies.

Now, collections might come in with digital material that is already on the brink of decline. Digital degradation doesn’t follow a steady curve the way the decay of books does. Items can be fine for decades, and then quickly decline from perfectly accessible to completely useless. For some formats, experts don’t know what that plateau and drop-off might be, and it can even vary among individual items kept in the same condition. The situation poses problems for preservation, access, and collection development.

“People outside of the field hadn’t anticipated how quickly this would become such a pressing issue,” said Megan Sniffin-Marinoff, University archivist. “It happened practically overnight.”

The presence of digital materials in incoming collections has risen exponentially over the past decade or so as professionals who started their careers on paper and migrated to digital hand over their work to the library to preserve. All-paper collections are becoming rare.

When the first hybrid hard-copy and born-digital collections came into the library in the 1980s, the digital formats were treated as objects or artifacts instead of content. A disk might have been noted but was not accessed, and was tucked back in with the papers it arrived alongside.

“We certainly have that issue of these hidden problems riddled throughout the collections,” Sniffin-Marinoff said. “I don’t think people were imagining the extent of the implications. It’s added a layer of complexity to our work that’s pretty unbelievable.”

Archivists are now much savvier when assessing incoming collections. They try to uncover these issues in a collection as soon as it comes in — sometimes even before it arrives at the door. University archives staff members take a mobile forensics kit to the offices, basements, attics, and studies of donors and are equipped to survey materials onsite, like members of a forensics SWAT team.

Harvard’s first collection to be preserved via digital forensics was at the Business School’s Baker Library. One recent acquisition left librarians pondering how to capture the significant portion of born-digital information and integrate it with the print items in the collection.

“Our collections range from the Medici family to Lehman Brothers,” said Rachel Wise, Harvard Business School archivist in Baker Library’s Historical Collections, who started the digital forensics program. To get more specialized knowledge for the recent acquisitions, officials hired a consultant and worked with other institutions to learn about essential tools and workflows. Since the library acquired its initial digital collection, the program has grown to include discs from new collections such as the Wang Laboratories Inc. records and faculty research collections.

Essentially, to retrieve content from an obsolete format, three components need to align: the hardware, the software, and the technician. Once the staff has procured the hardware — a drive or reader — the digital forensics software does much of the remaining work.

The first and most crucial step is imaging, which creates a copy of the source medium that replicates the structure and contents of a storage device independent of the file system.

To retrieve data from a 5¼-inch floppy disk, a drive is connected to a “writeblocker,” a device that ensures information only flows one way, preventing data from being overwritten. The writeblocker is plugged into a computer, which extracts all the content and builds the disk image on the new drive.
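The imaging step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's actual tooling: it assumes the source medium is exposed as a readable file or device path, and the names `image_disk` and `sha256_of`, the chunk size, and the choice of SHA-256 are all hypothetical. Opening the source read-only only mimics in software what a hardware writeblocker enforces.

```python
import hashlib


def image_disk(source_path, image_path, chunk_size=512 * 1024):
    """Copy source to image byte for byte; return the SHA-256 of the bytes read.

    Hypothetical helper for illustration. Real forensic imagers also log
    read errors and record metadata about the source medium.
    """
    digest = hashlib.sha256()
    # "rb" opens the source read-only; nothing is written back to it.
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def sha256_of(path, chunk_size=512 * 1024):
    """Hash a file in chunks so a large image need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the value returned by `image_disk` equals `sha256_of` run on the finished image, the copy matches what was read from the source, which is the kind of verifiable record that evidence-grade work depends on.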

HBS was the first to use a FRED (Forensic Recovery of Evidence Device), a black computer tower with myriad plugin and reader capabilities that combines drive readers and writeblockers. Several other FREDs are now in use across the library. Staff members have come together for training sessions and to share best practices.

The Harvard Law School Library has less variety of format types in its collections so far, and can manually create the imaging environment by using pristine, functioning, but now-obscure drives and hooking them up to a regular office computer with the writeblocker. Entering the library’s digital collections workspace is like going to a computer museum, or hopping into a time machine. Zip and floppy-disk drives of various sizes are strewn about.

Once a disk is imaged and the content is off the original carrier, the content can be processed. Since imaging is the more urgent and sensitive step, most librarians try to image as much as they can first, and process information later. Occasionally, bit rot (or data rot) corrupts files.
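One common way to catch bit rot after the fact is a fixity check: record a checksum for each stored file, then recompute it later and compare. The Python sketch below assumes the images sit in an ordinary folder; the function names and the manifest layout are hypothetical, and the article does not say which method Harvard's staff use.

```python
import hashlib
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(folder):
    """Record a checksum for every file under folder, keyed by relative path."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(folder, manifest):
    """List files that are missing or no longer match the manifest."""
    root = Path(folder)
    changed = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or checksum(target) != expected:
            changed.append(name)
    return changed
```

A manifest built right after imaging and rechecked on a schedule turns silent corruption into something staff can see and act on while a good copy still exists.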

In the processing step, analysis with a forensic tool kit is performed on the imaged disk, preserving the original. Then a decision must be made about how the material will be accessed.

There are two options: migration and emulation. Migration moves the information forward from one format to another, like converting a mid-’90s Corel WordPerfect file to an Adobe PDF. It is the easiest option for allowing researchers to view material, but may not perfectly recreate the original document. There could be a change as small as a shift in margin or spacing, or as large as rearranging the text.

At the Law School Library, the curator uses XENA, open-source software developed by the National Archives of Australia, which recognizes hundreds of old and unusual file formats and quickly migrates them forward to current standard formats.
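Format recognition of this kind often starts with a file's opening bytes, sometimes called its signature or magic number. The Python sketch below shows the general idea only; it is not a description of how XENA works internally, and the `SIGNATURES` table and `identify` function are hypothetical and cover just a handful of formats.

```python
# Opening byte sequences for a few well-known formats (illustrative subset).
SIGNATURES = [
    (b"\xffWPC", "WordPerfect document"),
    (b"%PDF", "PDF document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
    (b"PK\x03\x04", "ZIP container"),
    (b"{\\rtf", "Rich Text Format"),
]


def identify(data):
    """Return a label for the first signature that matches the start of data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Calling `identify` on the first few bytes of each recovered file gives a first guess at what it is, independent of a file extension that may be missing or wrong. Production tools rely on far larger signature registries.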

Emulation recreates the original computing environment in which content was created, like an entire software suite, thus enabling a document to be viewed in its native form. Some types of files must be migrated because, while at a bit level the 1s and 0s are preserved, the lack of an appropriate operating system precludes the possibility of emulation. In either case, seeing a document in a form as close as possible to its original is immensely valuable to researchers.

“There is something about looking at these that evokes a different time and how faculty did research,” said Wise.

Librarians sample the content after it’s extracted to make sure the work was successful. Afterward, more typical steps — common for paper records — are taken for access. The items are described and cataloged and opened up to researchers.

Working in digital formats makes some things easier and some harder. “We can make high-level decisions,” explained Wise. “There’s a lot of intelligence you’re gaining in the process.”

But frequently there are more decisions for archivists to make. In paper collections, the subjects often unintentionally “curated” content as it was created. After all, it would be impossible to keep every scrap of paper from a 60-year career without turning into a packrat. That’s not so with born-digital material; as device sizes decreased, storage size increased. Archivists and librarians don’t know how big a digital collection might be until they open it, and the volume is usually much greater than anticipated.

“It’s a sort of closet that goes on and on,” Wise said. In addition to the increase in content in general, duplication is common, since disks were used as transport and backups as well as delivery tools. While the digital forensic software can “de-dupe” content, curators have to be careful. Sensitive personal information may need to be kept for an accurate original record of the contents, but must be removed from the files accessed by patrons until enough time has passed for all of a collection to be released.
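De-duplication is usually done by comparing content hashes rather than file names, since the same file can travel under many names. The Python sketch below is a simplified illustration with a hypothetical `find_duplicates` function, not the forensic software's own routine. It only reports groups of identical files and deletes nothing, leaving the decision to the curator.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(folder):
    """Group files under folder by content hash; return groups of two or more."""
    groups = defaultdict(list)
    root = Path(folder)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Reads each file whole; fine for a sketch, not for huge files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(str(path.relative_to(root)))
    return [names for names in groups.values() if len(names) > 1]
```

Each returned group lists files whose bytes are identical, so a backup copy and a transport copy of the same draft show up together even if they were renamed along the way.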

“There’s a lot of digging to be done,” said Margaret Peachy, curator of digital collections at the Law School Library. In addition to duplicates, copyrighted materials like music the creator listened to or e-publications he or she read are on the drives, and need to be removed. “It’s easier in some ways and a lot harder in others than paper processing,” she said.

There’s also the issue of which items to reformat first. Since researcher requests do play into prioritizing library workflows, sometimes that affects the sequence. Whose it is or what it is also plays a role in determining priority. Where it makes sense due to cost and rarity of materials, Harvard outsources some of this kind of work to vendors.

Each School has approached the problem differently, depending on the needs of the collections and their community of users, but they meet regularly to share best practices and learn from other areas, like Harvard’s Media Preservation group.

Consideration for digital has seeped into many areas of library management. Gift agreements with the library have been re-engineered to include access to donor accounts on password-protected websites, like Facebook. Patron-access policies also have been adjusted, since some materials are restricted to viewing inside a reading room.

One thing on which all the librarians and archivists agree: Even when the content is retrieved, the original media may need to be retained. Advancements now allow retrieval of content on formats that previously were written off as lost causes, such as the IRENE system for audiovisual materials, which photographically retrieves enough information to produce sound.

Elizabeth Walters is Harvard Library’s preservation librarian for audiovisual materials, but sees almost everything while surveying collections. She said the problem won’t disappear soon. “If it’s physical media, there’s hardly a format left that’s not obsolete or obsolescing,” she said.

As such, the churn of technology in collections will be something to keep addressing, but the Harvard Library is increasingly equipped to incorporate that in its collecting, processing, and preservation workflows. Maybe one day an iPad will cross an archivist’s desk and be kept as an artifact right next to a Bernoulli disk drive. In any case, the true treasures of Harvard’s collections — writings, recordings, and images — will be migrated forward, safe for generations.