---
title: Should We Build Our Own Wayback Machines?
date: 2022-10-27T13:04:00+02:00
categories:
- webdesign
---
Preserving web content has never really left my mind since taking screenshots of old sites and putting them in [my personal museum](/post/2020/10/a-personal-journey-through-the-history-of-webdesign/). The Internet Archive's [Wayback Machine](https://web.archive.org/) is a wonderful tool that currently stores 748 billion webpage snapshots over time, including dozens of my own webdesign attempts, dating back to 2001. But that data is not in our hands.

Should it be? It should. Ruben says: [archive it if you care about it](https://rubenerd.com/archive-it-if-you-care-about-it/):

> The only way to be sure you can read, listen to, or watch stuff you care about is to archive it. Read a tutorial about yt-dlp for videos. Download webcomics. Archive podcast episodes.

This should include websites! ([And mirrors of your favorite git repos](https://astharoshe.net/2020-10-24-Mirror_your_favourite_git_repositories.html)) And I'm not talking about "clipping" a (portion of a) single page to your Evernote-esque scrapbook tool, but about a proper archive of the whole website. It seems that there are already tools for that, such as [ArchiveBox](https://archivebox.io/), which crawls and downloads using a browser engine, or [Archivarix](https://archivarix.com/), an online Wayback Machine downloader, or even just using `wget`, as [David Heinemann suggests](https://dheinemann.com/posts/2022-02-05-archiving-a-website-with-wget).
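
If you go the `wget` route, here is a minimal sketch of such a mirror run, wrapped in a tiny Python script and assuming GNU wget is installed; the flags are a common mirroring recipe rather than David's exact invocation, and the URL is a placeholder:

```python
import subprocess

# Mirror a site into the current directory: recurse, fetch page assets,
# rewrite links so the local copy works offline, and never ascend above
# the start URL. The target is a placeholder.
subprocess.run([
    "wget",
    "--mirror",            # recursive download with timestamping
    "--page-requisites",   # also grab CSS, images, and other assets
    "--convert-links",     # rewrite links to point at the local copy
    "--adjust-extension",  # save pages with an .html extension
    "--no-parent",         # don't wander up to parent directories
    "https://example.com/",
], check=True)
```
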
The problems I've encountered with personal Wayback Machine snapshots are:

- Proprietary frameworks and their fleeting popularity; e.g. Flash or Java applet embeds that break;
- The Wayback Machine doesn't seem to be very fond of correctly preserving all images---CSS backgrounds or `.php` scripts to embed watermarks in images don't make it into the archive;
- Certain JavaScript snippets can meddle with the archiving system and prevent pages from being crawled (called [data loss by JavaScript](https://matthiasott.com/notes/data-loss-also-by-javascript));
- Binaries are lost. I loved sharing levels and savegames, but I did not archive everything myself locally, and neither did Archive.org.

To combat these problems, Jeff Huang came up with seven guidelines to [design webpages to last](https://jeffhuang.com/designed_to_last/):
1. Return to vanilla HTML/CSS. See above. Many of my old Wayback snapshots now display a "Something's wrong with the database, contact me!" message.
2. Don't minimize that HTML. Minification adds a build step to your workflow that will probably not survive 10+ years.
3. Prefer one page over several. Not sure if I agree, but a one-pager is definitely easier to save.
4. End all forms of hotlinking. `<link/>` only to your own local stuff.
5. Stick to native fonts. I do ignore this rule: if the font is lost, the content isn't, and I won't care.
6. Obsessively compress your images. [Low Tech Magazine](https://solar.lowtechmagazine.com/) even uses dithering to great effect.
7. Eliminate the broken URL risk by using monitoring to check for dead links.

While writing this article, I explored others' usage of the Wayback Machine, but surprisingly few seem to mention that they regularly back up their own website---either by saving their own build artifacts somewhere, or by leveraging the Wayback Machine. David Mead suggested [including a personalized Wayback Machine link in your 404 page](https://davidjohnmead.com/posts/2019-12-04-handling-broken-links/), which sounds good but doesn't really help towards carefully preserving your stuff.

So I wondered: can we self-host a Wayback Machine? Someone on a "datahoarder" subreddit asked that very same question two years ago, but never received a reply. I think ArchiveBox comes _very_ close! It has a docker-compose script, so it is dead easy to throw on our NAS. However, this creates another potential problem: will that piece of software still work after 10-20-30 years? The source code is on [GitHub](https://github.com/ArchiveBox/ArchiveBox): internally, it uses trend-sensitive packages like Django, so you're still better off simply archiving static HTML yourself---given you've got control over the source.

Except that with ArchiveBox, you can archive _any_ website. And you can tell it to archive the same site every week. And it has a [clear strategy laid out](https://archivebox.io/#background--motivation) towards long-term usage. If what you're looking to download doesn't exist anymore, I guess your only option then is a Wayback extractor like Archivarix (whose free tier does not save CSS). The Wayback Machine does come with APIs and wrappers; one of those is called "SavePageNow"---but that tells Wayback to _archive_ a page, not to locally _download_ (or what I'd call _save_) it. Bummer.
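
Both halves are scriptable, for what it's worth. A minimal sketch, assuming Python 3 and the publicly documented availability and SavePageNow endpoints; the target URL is a placeholder:

```python
import json
import urllib.parse
import urllib.request

URL = "https://example.com/"  # placeholder for the page you care about

# 1. Ask the Wayback Machine whether it already holds a snapshot
#    (availability API).
query = urllib.parse.urlencode({"url": URL})
with urllib.request.urlopen("https://archive.org/wayback/available?" + query) as resp:
    closest = json.load(resp).get("archived_snapshots", {}).get("closest", {})
print(closest.get("url", "no snapshot yet"))

# 2. Trigger SavePageNow: this tells *Wayback* to store a fresh copy;
#    nothing is downloaded to your own machine.
with urllib.request.urlopen("https://web.archive.org/save/" + URL) as resp:
    print("SavePageNow responded with HTTP", resp.status)
```
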
Check out the [Web Archiving Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) if you're interested in more lists of archiving software. I was pleasantly surprised by the amount of existing software and the number of people actively involved in this initiative.

By the way, by limiting our understanding of "archiving webpages" to the HTTP protocol, we're also ignoring thousands of Gemini and Gopher sites.

---

The Wayback Machine's timeline, from which you can pick snapshots, is nice to interact with: it gives an immediate idea of the frequency of archival. Once you select a certain snapshot, it cries _it's alive!_ and serves you the site. What's missing, though, are screenshots: sometimes it fails to render the site or times out---or doesn't have any snapshot stored at all. I think that's what I tried to do with my personal museum. Unfortunately, even though I have the source, some websites are impossible to revive: either I'm missing the DB files or I no longer have the right ancient framework versions (and even those are becoming hard to find).
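
That timeline is queryable too: here's a small sketch, assuming the public CDX endpoint, that lists every snapshot Wayback holds for a page so you can gauge the archival frequency yourself (the domain is a placeholder):

```python
import json
import urllib.request

# List every Wayback snapshot of a page via the CDX API; the first row of
# the JSON response is a header, each following row is one snapshot.
CDX = ("https://web.archive.org/cdx/search/cdx"
       "?url=example.com&output=json&fl=timestamp,statuscode")
with urllib.request.urlopen(CDX) as resp:
    rows = json.load(resp)

for timestamp, statuscode in rows[1:]:
    print(timestamp, statuscode)
```
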
Another fun experiment: here are [old bookmarks from 2007](/museum/fav.html). Try randomly clicking on a few of those. 404? Yes? No? I tried creating a script to convert these into HTTP response status codes, but that alone won't work: many still return a 200 yet have become infested with smileys, rifle clubs, and other spam junk because the domain was hijacked; or they just state "database error" (still a 200? Cool!); or they announce "we will return!". Less than 20% of those links are still fully accessible 15 years later, and those are probably the Amazons.
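
For what that script could look like: a rough sketch that fetches each bookmark, records the status code, and additionally flags suspicious 200s via a crude keyword check; the bookmark list and the keywords are illustrative only:

```python
import urllib.error
import urllib.request

# Check a handful of old bookmarks: print the HTTP status, and flag pages
# that answer 200 but whose content hints at a hijacked or broken site.
BOOKMARKS = ["https://example.com/", "https://example.org/"]
SUSPICIOUS = ["database error", "we will return", "domain is for sale"]

for url in BOOKMARKS:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read(50_000).decode("utf-8", errors="replace").lower()
            flagged = any(phrase in body for phrase in SUSPICIOUS)
            print(url, resp.status, "suspicious" if flagged else "ok")
    except urllib.error.HTTPError as err:   # 4xx/5xx answers
        print(url, err.code)
    except urllib.error.URLError as err:    # DNS failures, timeouts, ...
        print(url, "unreachable:", err.reason)
```
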
I'm not sure where this thought experiment is going, but I _am_ sure that Ruben is right: archive it if you care about it.