---
title: "Implementing Searching In Static Websites"
date: 2022-08-04T10:59:00+02:00
categories:
  - webdesign
tags:
  - hugo
  - searching
---

In my monthly July 2022 overview write-up, I wrote:

> This website got a new search engine! The baked archives page used to be powered by Lunr.js, which has been replaced by Pagefind.app. I guess this is worth its own blog post; I'll save the details for later.

It's time for those juicy details.

Last month's first HugoConf revealed a lot of interesting JAMStack-related tooling to boost your statically generated blog. For the uninitiated, a "JAMStack" is a JavaScript, API, and Markup stack that (almost) enables static websites to be just as dynamic as true blogging engines such as Wordpress. Examples include a Webmention-based commenting system with a queryable API, a few pre- and post-processor scripts such as YouTube-link-to-image converters, or search functionality.

One of those new search tools mentioned during the conference is Pagefind. Since I was looking into throwing out Lunr.js anyway, it was a good opportunity to try out new things. The result is the simple but very fast search bar in the /archives page.

How do these tools work?

  1. You generate some content in Markdown. Your static site processor, in my case Hugo, converts it to static HTML, ready to be served to visitors.
  2. A script needs to be run to create an index of your content---either by processing the .md source, or the .html target. The result is usually a fairly large .js file.
  3. On a search page, you include two <script/> tags: the index file and the tool that uses it. A user's query is handled client-side in JS code, as opposed to submitting a real form like with search engines or Wordpress.
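The client-side part of step 3 can be sketched in a few lines of plain JavaScript. This is a toy stand-in, not Lunr.js itself: assume the generated index script defines a simple word-to-pages map (real indexes store term frequencies, stemming data, and more):

```javascript
// Hypothetical shape of a generated index file: a word -> pages map.
// Real Lunr.js indexes are far more elaborate than this.
const searchIndex = {
  "hugo": ["/post/2022/08/searching/", "/post/2022/07/overview/"],
  "webmention": ["/post/2022/07/overview/"],
};

// Runs entirely in the browser: no form submit, no server round-trip.
function search(query) {
  const term = query.trim().toLowerCase();
  return searchIndex[term] ?? [];
}
```

The key point is that everything after page load happens in the visitor's browser, which is exactly why the size of the index file matters so much.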

The problem with step 2 is the index file itself, as it can quickly grow in size. Furthermore, I automatically checked the changes to the brainbaking-index.js file into git, needlessly bloating the repository. Even with gzip compression in mind, I found Lunr.js not to be the best approach.

Instead, Pagefind uses fragmentation. It never requires the inclusion of a single huge index file, but rather a tiny JS file (8.05 kB) that only fetches a minimal index file after you start typing (45.66 kB), and for each result to be displayed (usually limited to five), fetches a fragment of the indexed content (between 5 and 8.5 kB), and, optionally, a (currently non-optimized) thumbnail. The result is a blazing-fast search-as-you-type system that's still self-hosted, highly optimized, and doesn't require a page submit.
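Conceptually, the fragmented approach looks something like the sketch below. This is a simplified model, not Pagefind's actual client: `fetchFile` is a hypothetical stand-in for the HTTP fetches of index chunks and content fragments.

```javascript
// Simplified model of fragmented, lazily-loaded search.
// In reality these would be separate files fetched over HTTP.
const files = {
  "index.json": { hugo: ["frag-1", "frag-2"] },          // tiny word -> fragment-id map
  "frag-1.json": { url: "/a/", excerpt: "... Hugo ..." }, // per-result content fragment
  "frag-2.json": { url: "/b/", excerpt: "... Hugo ..." },
};
async function fetchFile(name) { return files[name]; }    // stand-in for fetch()

async function search(term) {
  const index = await fetchFile("index.json");            // loaded on the first keystroke
  const ids = index[term.toLowerCase()] ?? [];
  // Only the fragments for results that will actually be displayed are fetched.
  return Promise.all(ids.map((id) => fetchFile(`${id}.json`)));
}
```

Because the initial download is just the word-to-fragment map, the cost of a large site is paid per result shown, not up front.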

Try it out yourself at /archives.

There are a few obvious disadvantages of using Pagefind. For one, it's very bleeding edge, currently at version 0.5.3. Custom placeholder text, proper internationalization support, and more custom options are currently missing, but it is possible to use the lower-level API and come up with something cool yourself. I took a stab at it but decided that most of the default stuff is just fine.

The other downside is that it still requires you to run another executable---this is the JAMStack part, so to speak---after Hugo is done generating. I have a simple shell script that is triggered every hour:

```sh
#!/bin/bash

sites=( brainbaking jefklakscodex redzuurdesem )
export WEBMENTION_TOKEN="supersecret"

echo "building at $(date)... with $1"

for site in "${sites[@]}"
do
	echo "building site $site"
	cd /var/dev/$site
	git reset --hard
	RESULT=$(git pull | grep 'Already up to date')
	if [[ -z "$RESULT" ]] || [[ $1 == "--force" ]]
	then
		/usr/local/bin/hugo --cleanDestinationDir --destination docs
		/usr/local/bin/pagefind --source docs
		rsync --archive --delete docs/ /var/www/$site/
		yarn install
		yarn run postdeploy
	else
		echo "nothing to do for $site"
	fi
done
echo "done building."
```

This boils down to:

  1. Execute hugo, dump HTML output in docs/
  2. Execute pagefind, scour through docs/ and dump index/JS/fragments in there as well
  3. Copy over new files using rsync to the deployed location for Nginx to pick up
  4. Run yarn for an optional post-deploy step, which takes care of sending Webmentions.

Pagefind is a self-contained Rust binary, but I had to build it from source for my MacBook as there's no released ARM64 artifact available. You also need it on your web server---although even that is optional: you can run the pagefind command locally and simply check in all changes. I did that before with Lunr.js, but do not recommend it: every slightest change to your blog triggers a commit of the index file.


Is all this trouble worth it? I'm not sure. Rubenerd's Archives page resorts to another technique: simply let a real search engine do the searching. By embedding a DuckDuckGo <form/>, you delegate all of the above to another party, decreasing the complexity of your build pipeline and website theme code. It's worth noting that this alternative even works with JavaScript disabled in the browser! I had to put in a <noscript/> tag to break the bad news to JS-haters: they can't search.
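For comparison, such a delegated search boils down to a few lines of markup. A sketch, assuming DuckDuckGo's `sites` parameter (the one its embeddable search box uses) to restrict results to one domain:

```html
<!-- Hypothetical sketch: delegate searching to DuckDuckGo.
     The hidden "sites" field limits results to this domain. -->
<form action="https://duckduckgo.com/" method="get">
  <input type="hidden" name="sites" value="brainbaking.com">
  <input type="search" name="q" placeholder="Search this site">
  <button type="submit">Search</button>
</form>
```

Submitting routes the visitor to DuckDuckGo's results page---which is exactly the trade-off described below.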

On the other hand, DuckDuckGo doesn't immediately index new posts, and you still route users away from your site with a form submit. In the end, Ruben's approach is probably the easiest, albeit the less immersive, option. You'll have to decide for yourself whether or not to go for it. I still like Pagefind's relative simplicity and even implemented it on my other sites that didn't have a search option before.