
title: We Should Build Our Own Wayback Machines (Reprise)
date: 2023-03-29T09:52:00+02:00
categories: webdesign
tags: archiving
In October 2022, I wondered: Should We Build Our Own Wayback Machines? If you even remotely care about some of the websites you've encountered over the last (almost) thirty years of the lovely World Wide Web, the answer is undeniably yes. Sites appear and disappear, and when the latter happens to your favorite thing that made you smile on every single visit, that's just sad. Depending on your evaluation of nostalgia, freezing websites in time and clinging on to them might sound like a good idea, especially if they're yours and you've lost the original source code---or, in the case of a drag-and-drop website builder, you never owned it in the first place.

The aforementioned post explores the concept of and possible solutions to web archiving, and until now, I just relied on a very simple command: wget. Apparently, it comes with mirror flags that allow you to recursively download an entire website using just:

wget -m -np [url]

The result is a local copy, starting with index.html, that you can simply open in your browser. Great!
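
For trickier static sites, a more complete mirror usually also needs converted links and page requisites. A hedged variant using standard GNU Wget flags (this exact combination is my own addition, not the command from the original post):

wget -m -np -k -p -E --wait=1 [url]

Here -k rewrites links for offline browsing, -p pulls in images and stylesheets, -E appends .html extensions where needed, and --wait=1 goes easy on the server.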

Except that yesterday I discovered this method doesn't work for more intricate websites, such as a Wordpress instance hosted on wordpress.com. Wget chokes on AJAX-heavy sites, more complex image URLs, and CDN-like redirects, resulting in few or no images being downloaded. I gave Archivebox a second chance, but it still doesn't support recursive crawling, rendering it quite useless.

Then I found Munin, a social media archiver that uses SquidWarc to do the heavy (crawling) lifting. SquidWarc spits out a (series of) .warc file(s): the "Web ARChive" standard, which is recognized by plenty of archival and library software and even has its own ISO standard. That makes it a great, standardized way to compress and save websites, as the risk of vendor lock-in is minimal. Certain versions of wget even support flags that make it output web archive files.
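
For instance, sufficiently recent GNU Wget builds accept a --warc-file option that writes a compressed WARC alongside the regular mirror; a minimal sketch (the file name is a placeholder of mine):

wget -m -np --warc-file=mysite [url]

This should leave a mysite.warc.gz next to the usual directory tree.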

But if you're really into web archiving, you'll want multiple .warc files, as they act as snapshots of the moments in time when you captured a website, just like Archive.org's Wayback Machine. Additional context, such as screenshots, full (plain) text, configuration, and a list of captured URLs, can be zipped up into a single file with another extension: the .wacz file, as Ed Summers writes about on his blog. The Z extension is fairly new, but if an older archive viewer isn't able to process it, just rename it to .zip and extract---again, vendor lock-in is minimal.
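
Since a .wacz is essentially a ZIP container, peeking inside doesn't even require renaming it; a quick sketch (the file name is a placeholder):

unzip -l mysite.wacz
unzip mysite.wacz -d mysite-extracted

The listing typically shows the WARC data plus the extra context files (indexes, pages, extracted text) mentioned above.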

SquidWarc is but one of the many open source tools that crawl and generate archives. Pywb, part of the Webrecorder tool suite, offers a more complete package, including browsing snapshots the Archive.org way. The Webrecorder tools are really, really cool: for instance, there's also a client-side ReplayWeb.page app at https://replayweb.page/ that allows interactive browsing of web archives and supports both .warc and .wacz files. An Electron app also exists, as does an archive-as-you-browse Chromium plugin.
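
To get a feel for Pywb's replay side, a local setup can be as simple as the following; a hedged sketch based on pywb's wb-manager and wayback commands (collection and file names are mine):

pip install pywb
wb-manager init my-collection
wb-manager add my-collection mysite.warc.gz
wayback

The archived pages should then be browsable at http://localhost:8080/my-collection/, assuming pywb's default port.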

After fiddling with a few of the above tools, I settled on Webrecorder's Browsertrix-crawler Docker container, which uses JavaScript and Pywb to archive a website. The crawler comes with a slew of configuration variables: how long to wait for a page to load, threshold values, include and exclude regex URLs, crawl depth, browser profile and behavior, ... Admittedly, it's a bit under-documented at the moment. I managed to create a few .wacz files of my wife's Wordpress sites using the following command:

cat ./config.yml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin

Where the config file is something like:

seeds:
  - url: https://mysite.wordpress.com/
    scopeType: "host"
    exclude:
      - https://mysite.wordpress.com/wp-admin/*
      - respond$
      - nb=1$
      - https://mysite.wordpress.com/wp-login.php
      - pinterest*

generateWACZ: true
text: true

What happens behind the scenes is that a browser is fired up and controlled by Puppeteer; requests, responses, and resources are recorded, and links are followed according to the depth configuration. The exclude regex values don't seem to work that well, and depending on the size of the website, Docker will be running for a long time, but the end result is a single archive that's yours forever!

I previously dismissed the .warc format as too obtuse and yet another standard I'd have to learn, compared to a simple wget command that just downloads the site. While that might or might not work---see above---it also neglects additional information needed to perfectly reproduce the behavior of the website: request/response headers and other metadata are also recorded and archived in Web Archives. Furthermore, the format plays a central role in the development and standardization of tools in the archiving community, and there's even interesting research about it, such as Emily Maemura's All WARC and no playback: The materialities of data-centered web archives research (open access).
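
If you want to see that metadata for yourself, Webrecorder's warcio toolkit ships a small CLI that can index a WARC's records; a hedged sketch (the file name is again a placeholder):

pip install warcio
warcio index mysite.warc.gz

Each output line describes one record---requests, responses, and metadata alike.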

Web Archive files also enable easy archiving to official institutions such as Archive.org or digital university libraries. I tried the other way around: downloading a .warc from the Wayback Machine, but sadly, that's a bit more involved (read: it may even require scraping). Paid services like https://www.waybackmachinedownloader.com/ exist, but it's just as easy to point browsertrix-crawler at a Wayback URL. Removing the Wayback banner on top is a matter of adding id_ between the date and the URL; for instance, https://web.archive.org/web/20041206215212id_/http://jefklak.suidzer0.org/ is an old site of mine.
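
Concretely, that comes down to swapping the seed in the earlier config for the banner-less Wayback URL and rerunning the same Docker command; a sketch reusing my old site's snapshot from above (the "prefix" scope is my assumption, to keep the crawl inside the snapshot path):

seeds:
  - url: https://web.archive.org/web/20041206215212id_/http://jefklak.suidzer0.org/
    scopeType: "prefix"

generateWACZ: true
text: true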

Of course, there's no substitute for the source code of your website(s). Wordpress offers export options for exactly that reason. Still, code can get lost. So can web archives. So can others' sites you admire. A colleague of mine got fed up with maintaining his site and told me he won't be renewing his domain name. His blog contained a personal log on bread baking and beer brewing.

Perhaps that's another reason to occasionally spin up that crawler.