a few posts days in advance: wayback machines 2, closed/open software debate

This commit is contained in:
Wouter Groeneveld 2023-03-28 16:07:42 +02:00
parent 29e1929e78
commit 397ab7c131
4 changed files with 139 additions and 2 deletions

View File

@ -7,9 +7,9 @@ tags:
- productivity
---
A few days ago, a stressed-out gamer confessed on [ResetERA](https://www.resetera.com/threads/it-might-be-time-for-me-to-give-up-playing-video-games-for-good.699064/) he was considering giving up on gaming, as he felt the time spent could be put to "better" use. The thread somehow struck a chord here, not because I agree, but because the mentality more and more people fall for is a very dangerous and worrying one. This reply by Benzychenz perfectly sums up my thoughts:
> A mentality to be “productive” all the time is toxic af. If you have enough money to be comfortable, and aren't neglecting your household duties and your family, there is nothing wrong with enjoying a hobby to yourself in some downtime.
As one grows older, spending time with games becomes more and more frowned upon: don't you have anything "better" to do?

View File

@ -0,0 +1,59 @@
---
title: "We Should Build Our Own Wayback Machines (Reprise)"
date: 2023-03-29T09:52:00+02:00
categories:
- webdesign
tags:
- archiving
---
In October 2022, I wondered: [Should We Build Our Own Wayback Machines?](/post/2022/10/should-we-build-our-own-wayback-machines/) If you even remotely care about some of the websites you've encountered over the last nearly thirty years of the lovely World Wide Web, the answer is undeniably _yes_. Sites appear and disappear, and when the latter happens to your favorite thing that made you smile every single visit, that's just sad. Depending on your evaluation of nostalgia, freezing websites in time and clinging on to them might sound like a good idea, especially if they're yours and you lost the original source code---or, in case of a drag-and-drop website builder, you never owned it in the first place.
The aforementioned post explores the concept and possible solutions to web archiving, and until now, I just relied on a very simple command: `wget`. Apparently, it [comes with mirror flags](https://dirask.com/posts/Bash-download-entire-website-with-wget-recursive-download-DLrKLj) that allow you to recursively download an entire website using just:
```
wget -m -np [url]
```
The result is a local copy, starting with `index.html`, that you can simply open in your browser. Great!
Except that yesterday I discovered that this method doesn't work for more intricate websites, such as a Wordpress instance hosted on `wordpress.com`. Wget chokes on AJAX-heavy sites, more complex image URLs, and CDN-like redirects, resulting in few or no images being downloaded. I gave [Archivebox](https://archivebox.io/) a second chance, but it [still doesn't support recursive crawling](https://github.com/ArchiveBox/ArchiveBox/issues/191), rendering it quite useless.
Then I found [Munin](https://github.com/peterk/munin-indexer), a social media archiver that uses [SquidWarc](https://github.com/N0taN3rd/Squidwarc) to do the heavy (crawling) lifting. SquidWarc spits out a (series of) `.warc` file(s): a "Web ARChive" standard that is recognized by many archival and library tools and even has its own ISO standard. That makes it a great standardized way to compress and save websites, since the risk of vendor lock-in is minimal. Certain versions of `wget` even support flags that make it output web archive files.
But if you're _really_ into web archiving, you'll want multiple `.warc` files, as they act as snapshots or moments in time when you captured a website, just like Archive.org's Wayback Machine. Additional context, such as screenshots, full (plain) text, config, and a list of captured URLs, can be zipped up in a single file with another extension: the `.wacz` file, as [Ed Summers writes about in his blog](https://inkdroid.org/2021/11/24/wacz/). The Z extension is fairly new, but if an older archive viewer isn't able to process it, just rename it to `.zip` and extract---again, vendor lock-in is minimal.
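Since a `.wacz` is a plain ZIP underneath, you don't even need to rename it to peek inside; Python's `zipfile` can read it directly. A minimal sketch (the filename is just a placeholder):

```python
import zipfile

def list_wacz(path):
    """Return the member files of a .wacz archive (it's a regular ZIP)."""
    with zipfile.ZipFile(path) as z:
        return z.namelist()

# list_wacz("archive.wacz") typically yields entries such as
# 'datapackage.json' and WARC files under an 'archive/' folder
```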
SquidWarc is but one of the many open source tools that crawl and generate archives. [Pywb](https://github.com/webrecorder/pywb), part of the [Webrecorder tool suite](https://webrecorder.net/), offers a more complete package, including browsing snapshots in an Archive.org way. The Webrecorder tools are really, _really_ cool: for instance, there's also a client-side ReplayWeb page at https://replayweb.page/ that allows interactive browsing through web archives and supports both `.warc` and `.wacz` files. An Electron app also exists, as does an archive-as-you-browse Chromium plugin.
After fiddling with a few of the above tools, I settled on Webrecorder's [Browsertrix-crawler](https://github.com/webrecorder/browsertrix-crawler) Docker container, which uses JavaScript and Pywb to archive a website. The crawler comes with a slew of configuration variables: how long to wait for a page to load, threshold values, include and exclude regex URLs, crawl depth, browser profile and behavior, ... Admittedly, it's a bit under-documented at the moment. I managed to create a few `.wacz` files of my wife's Wordpress sites using the following command:
```
cat ./config.yml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin
```
Where the config file is something like:
```
seeds:
  - url: https://mysite.wordpress.com/
    scopeType: "host"
    exclude:
      - https://mysite.wordpress.com/wp-admin/*
      - respond$
      - nb=1$
      - https://mysite.wordpress.com/wp-login.php
      - pinterest*
generateWACZ: true
text: true
```
What happens behind the scenes is that a browser is fired up and controlled by Puppeteer; requests, responses, and resources are recorded, and links are followed according to the depth configuration. The exclude regex values don't seem to work that well, and depending on the size of the website, Docker will be running for a _long_ time, but the end result is a single archive that's yours forever!
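The exclude values are regexes matched against each discovered URL, so you can dry-run them by hand before committing to a long crawl. A rough, hypothetical approximation in Python (the helper below is not part of browsertrix; patterns are adapted from the config above):

```python
import re

# Adapted exclude patterns from the crawl config above
excludes = [
    r"https://mysite\.wordpress\.com/wp-admin/.*",
    r"respond$",
    r"nb=1$",
    r"https://mysite\.wordpress\.com/wp-login\.php",
    r"pinterest.*",
]

def is_excluded(url):
    """True if any exclude pattern matches somewhere in the URL."""
    return any(re.search(p, url) for p in excludes)
```

For example, `is_excluded("https://mysite.wordpress.com/wp-login.php")` is true, while a regular post URL is not, which makes it easy to spot a pattern that accidentally matches everything.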
I previously dismissed the `.warc` format as too obtuse and yet another standard I'd have to learn, compared to a simple `wget` command that just downloads the site. While that might or might not work---see above---it also neglects additional information needed to perfectly reproduce the behavior of the website: request/response headers and other metadata are also recorded and archived in Web Archives. Furthermore, the format plays a central role in the development and standardization of tools in the archiving community, and there's even interesting research about it, such as Emily Maemura's [All WARC and no playback: The materialities of data-centered web archives research](https://journals.sagepub.com/doi/10.1177/20539517231163172) (open access).
Web Archive files also enable easy archiving to official institutions such as Archive.org or digital university libraries. I tried the other way around---downloading a `.warc` from the Wayback Machine---but sadly, that's a bit more involved (read: it may even require scraping). Paid services like https://www.waybackmachinedownloader.com/ exist, but it's just as easy to point `browsertrix-crawler` to a Wayback URL. Removing the Wayback banner on top is a matter of adding `id_` between the date and the URL; for instance, https://web.archive.org/web/20041206215212id_/http://jefklak.suidzer0.org/ is an old site of mine.
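The `id_` trick can be scripted; a small hypothetical helper, assuming the standard `https://web.archive.org/web/<timestamp>/<url>` snapshot layout:

```python
def raw_snapshot_url(wayback_url):
    """Insert 'id_' after the timestamp to get the banner-free capture."""
    prefix = "https://web.archive.org/web/"
    timestamp, original = wayback_url[len(prefix):].split("/", 1)
    return f"{prefix}{timestamp}id_/{original}"

# raw_snapshot_url("https://web.archive.org/web/20041206215212/http://jefklak.suidzer0.org/")
# → "https://web.archive.org/web/20041206215212id_/http://jefklak.suidzer0.org/"
```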
Of course, there's no substitute for the source code of your website(s). Wordpress offers export options for that reason. Still, code can get lost. So can web archives. So can others' sites you admire. A colleague of mine got fed up with maintaining his site and told me he won't be renewing his domain name. His blog contained a personal log on bread baking and beer brewing.
Perhaps that's another reason to occasionally spin up that crawler.

View File

@ -0,0 +1,64 @@
---
title: "I Pay And Use Both Closed And Open Source Software"
date: 2023-04-01T09:00:00+02:00
categories:
- software
tags:
- pricing
---
Something I never quite understood is the extreme fanboyism as seen in the "FOSS scene"---the Free and Open Source Software scene. Many folks pride themselves on never touching anything that isn't open source, and while I applaud the effort and am glad they're happy with their choice, I just think that view heavily suffers from tunnel vision.
If you're an artist and you create a work, you'll eventually want to create another, hence some form of financial self-insurance is needed. You might be disgusted at the thought of selling your babies, but if that's what it takes to both be able to produce more and to spread your work, then why not? Software development can't be fully compared with the creation of physical objects as ones and zeroes can be copied, creating a whole slew of other ethical, political, and financial problems.
But if you're a FOSS fan, you obviously also want your favorite software to be actively maintained. And that costs time, which obviously has to be compensated for (I'm disregarding the [time is money mantra](/post/2023/03/continuous-productivity-is-toxic/) here). So the first letter in the abbreviation FOSS is actually _very_ misleading, and some people replace it with an L for _Libre_. [Free in FOSS does NOT mean free of cost](https://itsfoss.com/what-is-foss/). Bigger FOSS projects create an enterprise or SaaS branch where companies or power users pay for support and hosting. And then of course there's donating.
It boils down to the following: **if you like the software you use, pay for it**. Whatever the development and license model. Whether it's closed source or open source, whether it's FOSS or Libre: developers need your support. If you're a developer yourself and would like to contribute in the form of patches, even better---that's also a form of payment, possibly worth even more than your hard-earned green.
I too recognize the many advantages of open source software, especially when it comes to the privacy and security of the end users: it's the only way to create a layer of transparency, where other technical experts can audit---sometimes disguised in the form of cursing---the code. Some arguments against closed source software, like [Nora's Open Source for Normal People](https://nora.codes/post/open-source-for-normal-people/), make little sense in the context of, for example, a video game:
> First, the people who made the software are the only ones who know how it works. Second, the people who made the software are the only ones who are really sure about what it does.
True. So? For me as an end user of a particular subset of software---productivity tools such as Alfred, which comes with a support forum and plugin community---that's not an issue. For me as an end user of games, which are supposed to be "finished" products, I don't care either. For more critical software that requires long-term support and comes with security considerations, that's an entirely different case.
Furthermore, when it comes to making money, everyone, including FOSS cheerleaders, knows that living off donations is incredibly difficult, and not everyone has the courage or energy to start a business spin-off based on their open source software. "But, but, RedHat!"---that's the exception. That's why I think we should still respect people who develop closed source software; there's nothing wrong with that---provided their intentions are (1) clearly stated and (2) ethically sound.
Some good examples:
- Sublime HQ has been successfully living off Sublime Text's shareware model since 2008. Most plugins are community-built and open source. Compared to the open source Visual Studio Code, it's much more mature and less sneaky when it comes to telemetry and other Microsoft bullshit. Also, thanks to the license model, the company doesn't need to rely on the usual investor or upscaling nonsense.
- The Affinity Designer/Photo/Publisher apps, available with just a one-off payment, became very competent alternatives to the ridiculously expensive Adobe Cloud pricing model. Compared to the open source Gimp, their UI and integration are already years ahead.
- The video game industry thrives off sales from closed source software, and these heterogeneous teams usually consist of multi-talented folks where a portion of the money goes to art, music, development, design, writing, the closed source software used to build assets, ...
Some bad examples:
- macOS is, compared to Linux or BSD-based OSes, a nightmare for kernel plug-in developers: there is no readily available full-blown documentation, and it's often a mystery which process does exactly what. If it were open source, we'd know more about the M1/M2 chip implementation, and reverse-engineering Linux on MacBooks would make much faster progress.
- Android is Google's open source smartphone OS but notorious for its customer data collection. Still, it's easier to develop for compared to iOS, as it doesn't require an Apple OS or an Apple developer subscription.
- Nintendo's (3)DS OS and platform documentation is hidden behind an expensive development toolkit that would not have been needed had it initially been released as (partially) open source in 2007, just like the PlayDate or the FPGA core system of the Analogue Pocket. Because of that, reverse engineering takes a lot of work and usually lags behind.
I'm fairly certain that to some, the above reasoning looks like a deliciously aged Swiss cheese, but still. What I'm trying to say is: there's nothing wrong with open source software (to an extent). There's also nothing wrong with closed source software (to an extent). So why can't both worlds respect each other a bit more? I'm happy to hear your thoughts on this.
To close out, here's a selection of software I use and paid for.
Open source:
- KeePass
- Firefox
- Rectangle
- mGBA
- GoatCounter
- Navidrome
- Various people at GitHub maintaining open source projects, too many to list
- ...
Closed source:
- Sublime Text
- Alfred
- SpamSieve
- Affinity Designer
- DEVONThink
- Obsidian
- JetBrains development IDEs
- macOS (I sincerely hope that somewhat justifies the price of the MacBook)
- Any video game, too many to list
- ...

View File

@ -41,6 +41,20 @@
"target": "https://brainbaking.com/post/2023/03/continuous-productivity-is-toxic/",
"relativeTarget": "/post/2023/03/continuous-productivity-is-toxic/"
},
{
"author": {
"name": "Simone Silvestroni",
"picture": "/pictures/simonesilvestroni.com"
},
"name": "Note: 24 March 2023",
"content": "@wouter I really really love this post. I too have always been obsessed with productivity: in my case it stems from a necessity to optimize time and energy. Having such eclectic workflows, things must proceed smoothly to overcome potential blockers ...",
"published": "2023-03-24T11:59:49+01:00",
"url": "https://simonesilvestroni.com/note/2023-03-24-11-58-00/",
"type": "mention",
"source": "https://simonesilvestroni.com/note/2023-03-24-11-58-00/",
"target": "https://brainbaking.com/post/2023/03/continuous-productivity-is-toxic/",
"relativeTarget": "/post/2023/03/continuous-productivity-is-toxic/"
},
{
"author": {
"name": "https://d-s.sh/2023/re-continuous-productivity-is-toxic/",