implementing searching in static websites

This commit is contained in:
Wouter Groeneveld 2022-08-04 11:45:03 +02:00
parent afb12ca476
commit b24001dae7
3 changed files with 89 additions and 1 deletions


@@ -3,7 +3,7 @@ date: 2022-08-03T08:44:51+02:00
context: "https://roytang.net/2022/08/twenty-years/"
---
-Excellent summary Roy, cheers! I dug around in your archives and discovered you were into MtG [way back in 2001](https://roytang.net/archives/ancient/tripod/ffmagic/)---that's exactly the same year as I started playing! Since you regularly post updates on your digital Arene grinds, I was wondering how you migrated from analog to digital MtG. I only play with "the real stuff", but as a consequence, I regularly have trouble finding buddies to play with.
+Excellent summary Roy, cheers! I dug around in your archives and discovered you were into MtG [way back in 2001](https://roytang.net/archives/ancient/tripod/ffmagic/)---that's exactly the same year as I started playing! Since you regularly post updates on your digital Arena grinds, I was wondering how you migrated from analog to digital MtG. I only play with "the real stuff", but as a consequence, I regularly have trouble finding buddies to play with.
Since discovering Commander, I much prefer playing it like that: more chaos and politics, more crazy cards, and it's not always the player with the most expensive deck that wins. Most of my stuff is geared towards a budget anyway.


@@ -0,0 +1,10 @@
---
date: 2022-08-03T11:10:41+02:00
context: "https://fundor333.com/social/2022/08/03/1659516036/"
---
Fundor 333 asked:
> In your opinion something like Gitea with a syndication like Mastodon will solve some of the problems and move more people on this “Gitea with Syndication”?
I'd answer: yes and no. Yes, it will solve some problems---hopefully easier collaboration across different instances. With GitHub, that's not a problem, provided that everyone uses GitHub. And no, I don't think it will move more people towards Gitea, since syndication and self-hosting are usually two "complicated" solutions. Note that I didn't say complex. Most people will still find it too troublesome to move. Just look at Mastodon vs. Twitter. The `@user@mastodoninstance` thing already trips most people up.


@@ -0,0 +1,78 @@
---
title: Implementing Searching In Static Websites
date: 2022-08-04T10:59:00+02:00
categories:
- webdesign
tags:
- hugo
- searching
---
In my monthly [July 2022 overview](/post/2022/08/july-2022) write-up, I wrote:
> This website got a new search engine! The baked archives page used to be powered by Lunr.js, which has been replaced by Pagefind.app. I guess this is worth its own blog post; I'll save the details for later.
It's time for those juicy details.
Last month's first HugoConf revealed a lot of interesting JAMStack-related tooling to boost your statically generated blog. For the uninitiated, a "JAMStack" is a _JavaScript, API, and Markup stack_ that (almost) enables static websites to be just as dynamic as true blogging engines such as Wordpress. For example: [a Webmention-based commenting system](/post/2021/05/beyond-webmention-io/) with a queryable API, a few pre- and post-processor scripts like [YouTube link to image converters](/post/2021/06/youtube-play-image-links-in-hugo/), or **search functionality**.
One of those new search tools mentioned during the conference is [Pagefind](https://pagefind.app/). Since I was looking into throwing out [Lunr.js](https://lunrjs.com/) anyway, it was a good opportunity to try out new things. The result is the simple but very fast [search bar in the /archives page](/archives).
How do these tools work?
1. You generate some content in Markdown. Your static site processor, in my case Hugo, converts it to static HTML, ready to be served to visitors.
2. A script needs to be run to create **an index** of your content---either by processing the `.md` source, or the `.html` target. The result is usually a fairly large `.js` file.
3. On a search page, you include two `<script/>` tags: the index file and the tool that uses it. When users enter a query, the search happens **client-side** in JS code, as opposed to submitting a real form like with search engines or Wordpress.
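For Pagefind, step 3 looks roughly like the snippet below: a sketch of dropping in its bundled default UI. The `/_pagefind/` output path and the `PagefindUI` options are assumptions based on the current 0.x defaults; double-check them against the version you install.

```html
<!-- Search page snippet: Pagefind's bundled UI.
     Paths assume pagefind wrote its files to /_pagefind/ in the site root. -->
<link href="/_pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/_pagefind/pagefind-ui.js"></script>

<div id="search"></div>
<script>
  window.addEventListener("DOMContentLoaded", () => {
    // The UI lazily fetches the small index and per-result fragments as you type.
    new PagefindUI({ element: "#search" });
  });
</script>
```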
The problem with step 2 is the index file itself, as it can quickly grow in size. Furthermore, I automatically checked in the changes to the `brainbaking-index.js` file, needlessly bloating the git repository. Even with gzip compression in mind, I found Lunr.js not to be the best approach.
Instead, Pagefind uses **fragmentation**. It never requires the inclusion of a single huge index file. Instead, a tiny JS file (`8.05 kB`) fetches a minimal index file only after you start typing (`45.66 kB`), and for each result to be displayed (usually limited to five), fetches a _fragment_ of the indexed content (between `5` and `8.5 kB`) and, optionally, a (currently non-optimized) thumbnail. The result is a blazing-fast search-as-you-type system that's still self-hosted, highly optimized, and doesn't require a page submit.
Try it out yourself [at /archives](/archives).
There are a few obvious disadvantages of using Pagefind. For one, it's very bleeding edge, currently at [version 0.5.3](https://github.com/CloudCannon/pagefind/releases). Custom placeholder text, proper internationalization support, and more custom options are currently missing, but it is possible to use the lower-level API and come up with something cool yourself. I took a stab at it but decided that most of the default stuff is just fine.
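If you do want to roll your own front-end, that lower-level API is a thin JS module. The sketch below shows the general shape; the `/_pagefind/` path is an assumption from the default output location, and the query and result fields should be verified against the Pagefind docs for your version.

```html
<script type="module">
  // Lower-level Pagefind API sketch: import the generated module directly
  // (path assumes pagefind's default output directory).
  const pagefind = await import("/_pagefind/pagefind.js");

  // search() returns lightweight result handles; each fragment is only
  // fetched when you call .data() on a result.
  const search = await pagefind.search("hugo");
  const results = await Promise.all(
    search.results.slice(0, 5).map((r) => r.data())
  );
  for (const result of results) {
    console.log(result.url, result.excerpt);
  }
</script>
```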
The other downside is that it still requires you to run another executable---this is the JAMStack part, so to speak---after Hugo is done generating. I have a simple shell script that is triggered every hour:
```sh
#!/bin/bash
sites=( brainbaking jefklakscodex redzuurdesem )
export WEBMENTION_TOKEN="supersecret"

echo "building at $(date)... with $1"
for site in "${sites[@]}"
do
    echo "building site $site"
    cd "/var/dev/$site" || exit 1
    git reset --hard
    # only rebuild when the pull brought in new commits, unless --force is passed
    RESULT=$(git pull | grep 'Already up to date')
    if [[ -z "$RESULT" ]] || [[ $1 == "--force" ]]
    then
        /usr/local/bin/hugo --cleanDestinationDir --destination docs
        /usr/local/bin/pagefind --source docs
        rsync --archive --delete docs/ "/var/www/$site/"
        yarn install
        yarn run postdeploy
    else
        echo "nothing to do for $site"
    fi
done
echo "done building."
```
This boils down to:
1. Execute `hugo`, dump HTML output in `docs/`
2. Execute `pagefind`, scour through `docs/` and dump index/JS/fragments in there as well
3. Copy over new files using `rsync` to the deployed location for Nginx to pick up
4. Run `yarn` for an optional post-deploy step. This contains webmention sending.
Pagefind is a self-contained Rust binary, but I had to build it from source for my MacBook, as there's no released ARM64 artifact available. You also have to install it on your web server---although even that is optional: you can run the `pagefind` command locally and simply check in all changes. I did that before with Lunr.js, but do not recommend it: the slightest change to your blog triggers a commit of the index file.
---
Is all this trouble worth it? I'm not sure. [Rubenerd's Archives page](https://rubenerd.com/archives/) resorts to another technique: simply let a _real_ search engine do the searching. By embedding a DuckDuckGo `<form/>`, you delegate all the above to another party, decreasing the complexity of your build pipeline and website theme code. It's worth noting that this alternative works _even with JavaScript disabled in the browser!_ I had to put in a `<noscript/>` tag to bring JS-haters bad news: they can't search.
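A form in that spirit can be surprisingly small. The sketch below uses DuckDuckGo's `sites` parameter to restrict results to one domain; `brainbaking.com` is just an assumption here---substitute your own domain.

```html
<!-- Delegated site search, in the spirit of Rubenerd's archives page.
     Works without any JavaScript: submitting hands the query off to DuckDuckGo. -->
<form action="https://duckduckgo.com/" method="get">
  <input type="search" name="q" placeholder="Search this site...">
  <!-- "sites" scopes the results to a single domain (assumed: brainbaking.com) -->
  <input type="hidden" name="sites" value="brainbaking.com">
  <button type="submit">Search</button>
</form>
```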
On the other hand, DuckDuckGo doesn't immediately index new posts, and you still route users away from your site with a form submit. In the end, Ruben's approach is probably the easiest, albeit the less immersive, option. You'll have to decide for yourself whether or not to go for it. I still like Pagefind's relative simplicity and even [implemented it on my other sites](https://jefklakscodex.com/tags/) that didn't have a search option before.