should we build our own wayback machines? + footer/main min-height css fix

This commit is contained in:
Wouter Groeneveld 2022-10-27 16:12:02 +02:00
parent 84d8331287
commit 554898b620
6 changed files with 67 additions and 27 deletions

content/404.md Normal file

@@ -0,0 +1,12 @@
---
title: "Whoops, 404 Page not found!"
url: 404.html
disableComments: true
---
No worries though! There are a couple of things you can try:
- Try to [dig through the archives](/archives) instead to find something similar to what you were looking for. There's also a **search function** in there.
- If that didn't work, perhaps the Internet Archive's [Wayback Machine](https://web.archive.org/web/*/https://brainbaking.com) can conjure up a long lost article.
Good luck and sorry about the mess!


@@ -13,7 +13,8 @@ Flemish and Dutch businesses, teachers, governments, and shops seem to have a ve
- Graneveld ("Grainfield")---sure, a wheat field can be green;
- Van Groenveld---more evidence of laziness and just making stuff up by adding a German style _von_ preposition;
- Gröneveld---squished those letters together, did you?;
- Geleveld---are you mixing up the colors yellow and green on purpose?.
- Geleveld---are you mixing up the colors yellow and green on purpose?;
- Grunnenveld---on the invoice of the locksmith. I'm all out of ideas here.
These silly "mistakes" (some feel like bad jokes) remind me of the name that appears on the weekly TV guide for Joey and Chandler in Friends: [Miss Chanandler Bong](https://www.youtube.com/watch?v=v1xXlxDHWPU). I'm not at all offended by the confusion but instead surprised by the incapability of correctly typing in a name in a digital form---or even worse, just copying it, since I am the one who usually has to enter it---especially by Dutch speaking people who should have no problem whatsoever spelling it. I guess it could also be attributed to my incomprehensible mumbling on the phone.


@@ -0,0 +1,49 @@
---
title: Should We Build Our Own Wayback Machines?
date: 2022-10-27T13:04:00+02:00
categories:
- webdesign
---
Preserving web content never really left my mind ever since taking screenshots of old sites and putting them in [my personal museum](/post/2020/10/a-personal-journey-through-the-history-of-webdesign/). The Internet Archive's [Wayback Machine](https://web.archive.org/) is a wonderful tool that currently stores 748 billion webpage snapshots over time, including dozens of my own webdesign attempts, dating back to 2001. But that data is not in our hands.
Should it be? It should. Ruben says: [archive it if you care about it](https://rubenerd.com/archive-it-if-you-care-about-it/):
> The only way to be sure you can read, listen to, or watch stuff you care about is to archive it. Read a tutorial about yt-dlp for videos. Download webcomics. Archive podcast episodes.
This should include websites! ([And mirrors of your favorite git repos](https://astharoshe.net/2020-10-24-Mirror_your_favourite_git_repositories.html).) And I'm not talking about "clipping" a (portion of a) single page to your Evernote-esque scrapbook tool, but about a proper archive of the whole website. It seems there are already tools for that, such as [ArchiveBox](https://archivebox.io/), which crawls and downloads using a browser engine; [Archivarix](https://archivarix.com/), an online Wayback Machine downloader; or even plain `wget`, as [David Heinemann suggests](https://dheinemann.com/posts/2022-02-05-archiving-a-website-with-wget).
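For the `wget` route, a mirror along these lines should get you most of the way there (the exact flags David uses may differ; these are the usual suspects):

```sh
# Recursively mirror the site, rewrite links so the copy works offline,
# and pull in page requisites such as CSS, images, and scripts.
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent --wait=1 \
     https://brainbaking.com/
```

The `--wait=1` is just me being polite to my own server.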
The problems I've encountered with personal Wayback Machine snapshots are:
- Proprietary frameworks and their fleeting popularity; e.g. Flash or Applet embeds that break;
- The Wayback Machine doesn't seem to be very fond of correctly preserving all images---CSS backgrounds or `.php` scripts that embed watermarks in images don't make it into the archive;
- Certain JavaScript snippets can interfere with the archiving system and prevent pages from being crawled (called [data loss by JavaScript](https://matthiasott.com/notes/data-loss-also-by-javascript));
- Binaries are lost. I loved sharing levels and savegames, but did not archive everything locally myself, and neither did Archive.org.
To combat these problems, Jeff Huang came up with seven guidelines to [design webpages to last](https://jeffhuang.com/designed_to_last/):
1. Return to vanilla HTML/CSS. See above. Many of my old Wayback snapshots now display a "Something's wrong with the database, contact me!" message.
2. Don't minimize that HTML. Minification adds a build step to your workflow, and that workflow will probably not survive the next 10+ years.
3. Prefer one page over several. Not sure if I agree, but a one-pager is definitely easier to save.
4. End all forms of hotlinking. `<link/>` only to your own local stuff.
5. Stick to native fonts. I do ignore this rule: if the font is lost, the content isn't, and I won't care.
6. Obsessively compress your images. [Low Tech Magazine](https://solar.lowtechmagazine.com/) even uses dithering to great effect.
7. Eliminate the broken URL risk by using monitoring to check for dead links.
While writing this article, I explored others' usage of the Wayback Machine, but surprisingly few seem to mention that they regularly back up their own website---either by saving their own build artifacts somewhere, or by leveraging the Wayback Machine. David Mead suggested [including a personalized Wayback Machine link in your 404 page](https://davidjohnmead.com/posts/2019-12-04-handling-broken-links/), which sounds good but doesn't really help towards carefully preserving your stuff.
So I wondered: can we self-host the Wayback Machine? Someone on a "datahoarder" subreddit asked that very same question two years ago but never received a reply. I think ArchiveBox comes _very_ close! It has a docker-compose script, so it's dead easy to throw on our NAS. However, this creates another potential problem: will that piece of software still work after 10-20-30 years? The source code is on [GitHub](https://github.com/ArchiveBox/ArchiveBox): internally, it uses trend-sensitive packages like Django, so you're still better off simply archiving static HTML yourself---given you've got control over the source.
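For the record, "dead easy" looks roughly like this. The compose file comes from the ArchiveBox repository; the NAS path and service name below are assumptions on my part, so double-check the official docs:

```sh
# Assumes the official docker-compose.yml from the ArchiveBox repo sits in this
# directory; the path and the "archivebox" service name are just examples.
cd /volume1/docker/archivebox
docker compose run archivebox init           # one-time setup of the data folder
docker compose up -d                         # web UI, typically on port 8000
docker compose run archivebox add 'https://brainbaking.com'
```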
Except that with ArchiveBox, you can archive _any_ website. And you can tell it to archive the same site every week. And it has a [clear strategy laid out](https://archivebox.io/#background--motivation) towards long-term usage. If what you're looking to download doesn't exist anymore, I guess your only option is a Wayback extractor like Archivarix (whose free tier does not save CSS). The Wayback Machine does come with APIs and wrappers; one of them is called "SavePageNow". But that tells Wayback to _archive_ a page, not to locally _download_ (or what I'd call _save_) it. Bummer.
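To illustrate the difference, these are the two public endpoints I know of, and neither hands you a local copy: the first merely triggers a fresh crawl, the second tells you what's already archived.

```sh
# Ask the Wayback Machine to take a fresh snapshot (archive, not download):
curl -s "https://web.archive.org/save/https://brainbaking.com/" > /dev/null

# Ask which snapshot it already holds, closest to a given timestamp:
curl -s "https://archive.org/wayback/available?url=brainbaking.com&timestamp=20071001"
```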
Check out the [Web Archiving Community Wiki](https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community) if you're interested in more lists of archiving software. I was pleasantly surprised by the number of tools that already exist and the people actively involved in this initiative.
By the way, by limiting our understanding of "archiving webpages" to the HTTP protocol, we're also ignoring thousands of Gemini and Gopher ones.
---
The Wayback Machine's timeline, from which you pick snapshots, is nice to interact with: it gives an immediate idea of the archiving frequency. Once you select a certain snapshot, it cries _it's alive!_ and serves you the site. What's missing, though, are screenshots: sometimes it fails to render the site or gives a timeout---or doesn't have any snapshot stored at all. I think that's what I tried to do with my personal museum. Unfortunately, even though I have the source, some websites are impossible to revive: either I'm missing the DB files or I no longer have the right ancient framework versions (and even those are becoming hard to find).
Another fun experiment: here are [old bookmarks from 2007](/museum/fav.html). Try randomly clicking on a few of those. 404? Yes? No? I tried creating a script to convert these into HTTP response status codes, but that doesn't really work: many still return a 200 yet have become infested with smileys, rifle clubs, and other spam junk after the domain got hijacked, or they just state "database error" (still a 200? Cool!), or "we will return!". Less than 20% of those links are still fully accessible 15 years later, and those are probably the Amazons.
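The script itself is trivial; the hard part is that a status code tells you nothing about hijacked domains or "database error" pages. A rough sketch, assuming the bookmarks are dumped into a plain `urls.txt`:

```sh
# Print the HTTP status code for every bookmarked URL, following redirects.
# A 200 only proves the domain still answers, not that the old content survived.
while read -r url; do
  code=$(curl -o /dev/null -s -L --max-time 10 -w '%{http_code}' "$url")
  echo "$code  $url"
done < urls.txt
```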
I'm not sure where this thought experiment is going, but I _am_ sure that Ruben is right: archive it if you care about it.


@@ -39,6 +39,7 @@ nav, footer
font-size: 0.85rem
z-index: 9
.support
font-size: 1rem
noscript
@@ -179,6 +180,8 @@ h1, h2, h3
color: $grey
main
min-height: calc(100vh - 112px)
footer
padding-top: 2rem
padding-bottom: 2rem


@@ -1,22 +0,0 @@
{{ partial "header" . }}
<header>
<h1 id="header" class="p-name">Whoops, <em>404</em>!</h1>
<h2>Page not found...</h2>
</header>
<main class="single post">
<hr/>
<article>
No Worries! Try to <a href="/archives">dig through the archives</a> instead to find something similar to what you were looking for. There's also a search function in there. Good luck!
</article>
<hr/>
</main>
{{ partial "footer" . }}


@@ -1,10 +1,7 @@
<footer>
<footer class="bottom">
<p class="copyright text-muted">{{ .Site.Params.copyright | markdownify }}</p>
</footer>
</body>
</html>