---
title: Implementing Searching In Static Websites
date: 2022-08-04T10:59:00+02:00
categories:
- webdesign
tags:
- hugo
- website
- searching
---

In my monthly [July 2022 overview](/post/2022/08/july-2022) write-up, I wrote:

> This website got a new search engine! The baked archives page used to be powered by Lunr.js, which has been replaced by Pagefind.app. I guess this is worth its own blog post, I’ll save the details for later.

It's time for those juicy details.

Last month's first HugoConf revealed many interesting JAMStack-related tools to boost your statically generated blog. For the uninitiated, a "JAMStack" is a _JavaScript, API, and Markup stack_ that (almost) enables static websites to be just as dynamic as true blogging engines such as WordPress. For example: [a Webmention-based commenting system](/post/2021/05/beyond-webmention-io/) with a queryable API, a few pre- and post-processor scripts like [YouTube link to image converters](/post/2021/06/youtube-play-image-links-in-hugo/), or **search functionality**.

One of those new search tools mentioned during the conference is [Pagefind](https://pagefind.app/). Since I was looking into throwing out [Lunr.js](https://lunrjs.com/) anyway, it was a good opportunity to try out new things. The result is the simple but very fast [search bar in the /archives page](/archives).

How do these tools work?

1. You generate some content in Markdown. Your static site generator, in my case Hugo, converts it to static HTML, ready to be served to visitors.
2. A script needs to be run to create **an index** of your content---either by processing the `.md` source or the `.html` target. The result is usually a fairly large `.js` file.
3. On a search page, you include two `<script/>` tags: the index file and the tool that uses it. When users enter a query, the search happens **client-side** in JS code, as opposed to submitting a real form like in search engines or with WordPress.

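In Pagefind's case, step 3's includes come prebuilt. A minimal search page could look roughly like this---a sketch assuming the default `/_pagefind/` bundle paths and `PagefindUI` constructor from the 0.x documentation, so double-check against the version you install:

```html
<link href="/_pagefind/pagefind-ui.css" rel="stylesheet">
<script src="/_pagefind/pagefind-ui.js" type="text/javascript"></script>

<div id="search"></div>
<script>
    window.addEventListener('DOMContentLoaded', () => {
        // Mounts the default search-as-you-type widget into the div.
        new PagefindUI({ element: "#search" });
    });
</script>
```
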
The problem with step 2 is the index file itself, as it can quickly grow in size. Furthermore, I automatically checked in the changes to the `brainbaking-index.js` file, needlessly bloating the git repository. Even with gzip compression in mind, I found Lunr.js not to be the best approach.

Instead, Pagefind uses **fragmentation**. Rather than requiring one huge index file, it ships a tiny JS file (`8.05 kB`) that fetches a minimal index only once you start typing (`45.66 kB`); for each result to be displayed (usually limited to five), it fetches a _fragment_ of the indexed content (between `5` and `8.5 kB`) and, optionally, a (currently non-optimized) thumbnail. The result is a blazing-fast search-as-you-type system that is still self-hosted, highly optimized, and doesn't require a page submit.

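A back-of-the-envelope sum shows how little one search transfers with those numbers (taking `7 kB` as a rough average fragment size and ignoring thumbnails):

```shell
# Approximate kB fetched for a single search rendering five results:
# UI script + minimal index + five content fragments.
awk 'BEGIN { printf "%.2f kB\n", 8.05 + 45.66 + 5 * 7 }'
# → 88.71 kB
```

Under 90 kB for a full interactive search, whereas a monolithic Lunr.js-style index has to be downloaded in its entirety up front.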
Try it out yourself [at /archives](/archives).

There are a few obvious disadvantages of using Pagefind. For one, it's very bleeding edge, currently at [version 0.5.3](https://github.com/CloudCannon/pagefind/releases). Custom placeholder text, proper internationalization support, and more custom options are currently missing, but it is possible to use the lower-level API and come up with something cool yourself. I took a stab at it but decided that most of the default stuff is just fine.

The other downside is that it still requires you to run another executable---this is the JAMStack part, so to speak---after Hugo is done generating. I have a simple shell script that is triggered every hour:

```sh
#!/bin/bash

sites=( brainbaking jefklakscodex redzuurdesem )

for site in "${sites[@]}"
do
    echo "building site $site"
    cd "/var/dev/$site" || continue
    git reset --hard
    RESULT=$(git pull | grep 'Already up to date')
    # Rebuild only if the pull brought in new commits, or when forced.
    if [[ -z "$RESULT" ]] || [[ $1 == "--force" ]]
    then
        /usr/local/bin/hugo --cleanDestinationDir --destination docs
        /usr/local/bin/pagefind --source docs
        rsync --archive --delete docs/ "/var/www/$site/"
        yarn install
        yarn run postdeploy
    else
        echo "nothing to do for $site"
    fi
done
```

This boils down to:

1. Execute `hugo`, dumping the HTML output in `docs/`
2. Execute `pagefind`, which scours through `docs/` and dumps its index/JS/fragments in there as well
3. Copy the new files over with `rsync` to the deployed location for Nginx to pick up
4. Run `yarn` for an optional post-deploy step, which takes care of webmention sending

Pagefind is a self-contained Rust binary, but I had to build it from source for my MacBook as there's no released ARM64 artifact available. You also have to install it on your web server---although even that is optional: you can run the `pagefind` command locally and simply check in all the changes. I did that before with Lunr.js and do not recommend it: the slightest change to your blog triggers a commit of the index file.

---

Is all this trouble worth it? I'm not sure. [Rubenerd's Archives page](https://rubenerd.com/archives/) resorts to another technique: simply letting a _real_ search engine do the searching. By embedding a DuckDuckGo `<form/>`, you delegate all of the above to another party, decreasing the complexity of your build pipeline and website theme code. It's worth noting that this alternative works _even with JavaScript disabled in the browser!_ I had to put in a `<noscript/>` tag to bring JS-haters the bad news: they can't search.

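Such a delegated search form fits in a handful of theme lines. A sketch, assuming DuckDuckGo's `q` and `sites` query parameters (swap the domain for your own):

```html
<form action="https://duckduckgo.com/" method="get">
    <input type="search" name="q" placeholder="Search this site...">
    <!-- Restricts results to a single domain. -->
    <input type="hidden" name="sites" value="example.com">
    <button type="submit">Search</button>
</form>
```

No JavaScript, no index, no build step: the browser submits a plain GET request to DuckDuckGo.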
On the other hand, DuckDuckGo doesn't immediately index new posts, and the form submit still routes users away from your site. In the end, Ruben's approach is probably the easiest, albeit the less immersive, option. You'll have to decide for yourself whether or not to go for it. I still like Pagefind's relative simplicity and even [implemented it on my other sites](https://jefklakscodex.com/tags/) that didn't have a search option before.