Implementing search as a Jekyll plugin
Feb 28, 2014

Search for most dynamically-generated websites is usually implemented as a server-side feature. The server builds and maintains an index of all unique words and their occurrences on the site. When a client runs a search query, the query is sent to the server, which looks it up in its index and returns a list of results, which are then rendered on the client.
An approach like this is clearly infeasible on static websites such as those generated using a static site generator like Jekyll, so one has to come up with alternative solutions. One possible alternative is to send the content of the entire site to the client, and to then search through it client-side. This approach is infeasible for two reasons:
- Downloading the entire site will take a while, even though you would only be sending the text and none of the images, JavaScript files or stylesheets.
- Searching through the entire site will also be pretty slow. Building an index would speed this up, but would be overkill for a client who is probably going to search your site a couple of times per session.
So, we need some sort of preprocessing during the site’s generation that optimizes both network usage (the content that the client has to download) and execution time (how long it takes the client to search through the downloaded content). With respect to the second requirement, a hash map would be the fastest possible data structure for the client to search through, but it would be even worse than downloading the entire site in terms of bandwidth usage. Enter bloom filters.
(I won’t go into details about how bloom filters work here; for that, have a look at my previous post.)
So, for the purposes of implementing search for a statically-generated site (namely this one), here is the approach that I followed:
- Statically generate a bloom filter per post/page in your site. The filter will contain the bitmap itself, along with the probability of false positives, the size of your dataset (the number of unique words inserted into the filter) and the URL of the associated page.
- Statically generate a JSON file that has an array of all the generated bloom filters.
- On the client, download the generated JSON file and recreate the bloom filters. As long as you use the same hashing functions that you used when generating the filters, the data mentioned in the first point is everything that’s required for membership queries against each filter.
- Whenever a search query is received, iterate over all the filters and return those that contain the word(s) in the query.
- Fill a div with the results.
Below I’ll describe my implementation of each of the above steps in some detail.
Generate a bloom filter for each page
Once I had my implementation of the bloom filter class ready, creating one for each page was relatively trivial. Jekyll plugins have a `Generator` category that lets you generate new pages besides the templates that you already have. If you implement a `generate` method, that method gets invoked with the entire site’s contents just prior to the actual generation. `site.posts` contains a list of all posts and `site.pages` contains a list of all pages. All I had to do was iterate over each page and:
- Convert the content to lowercase, so that e.g. `python` and `Python` are not treated as distinct elements in the set
- Remove everything between {% raw %}`{%` and `%}`, and between `{{` and `}}`{% endraw %}. This strips out Liquid template variables and code segments
- Replace all non-alphabetic characters with a space
- Split on whitespace and remove all duplicates
The following two lines accomplished this:

{% raw %}
```ruby
content = post.content.downcase.gsub(/{%.*?%}/, " ").gsub(/{{.*?}}/, " ").gsub(/[^a-zA-Z]/, " ")
words = content.split.uniq
```
{% endraw %}
After that, I iterate over each word and call `filter.insert`, where `filter` is the bloom filter created for that page. Once I’m done with the page, the finished filter is inserted into an array.
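Putting those pieces together, the heart of the per-page loop looks roughly like the sketch below. Note that `SearchFilter` and `build_filters` are stand-ins of my own naming: the stand-in filter has the same `insert`/`to_json` interface as the real bloom filter class but is backed by a plain `Set` instead of a hashed bitmap, and the posts are plain hashes rather than Jekyll’s post objects:

{% raw %}
```ruby
require "set"
require "json"

# Stand-in for the real bloom filter class: same interface, but backed
# by a Set rather than a bitmap plus hash functions.
class SearchFilter
  def initialize(url)
    @url = url
    @words = Set.new
  end

  def insert(word)
    @words.add(word)
  end

  def to_json(options = {})
    { 'u' => @url, 'n' => @words.size }.to_json
  end
end

# The per-page loop from the generator, condensed: normalize the content,
# then insert each unique word into that page's filter.
def build_filters(posts)
  posts.map do |post|
    content = post[:content].downcase
                            .gsub(/{%.*?%}/m, " ")
                            .gsub(/{{.*?}}/m, " ")
                            .gsub(/[^a-zA-Z]/, " ")
    filter = SearchFilter.new(post[:url])
    content.split.uniq.each { |word| filter.insert(word) }
    filter
  end
end
```
{% endraw %}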
Generate a JSON file from all bloom filters
This was slightly trickier. I need a template to fill up with the generated JSON, so I obviously need a file somewhere in my site that will be filled up. I opted for a file named `search.json` in the root of my site. I started off filling it manually by iterating over the generated filters via Liquid tags and slotting member variables into different fields, but then I thought of an easier approach: have the template be empty, and fill it entirely from Ruby!
In my `generate` method, once I’m done creating the array of bloom filters, I convert the array to JSON and insert it directly into the template:

```ruby
json_page = site.pages.detect { |page| page.name == "search.json" }
json_page.data['filters'] = filters.to_json
```
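For reference, the `search.json` template itself then ends up being nothing more than front matter plus a single Liquid variable — a sketch of what “empty” means here (`layout: null` keeps Jekyll from wrapping the output in a layout):

{% raw %}
```liquid
---
layout: null
---
{{ page.filters }}
```
{% endraw %}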
Of course this needed a corresponding `to_json` method for each of my filter objects, which was also trivial:
```ruby
def to_json(options = {})
  {
    'u' => @url,
    'l' => @length,
    'h' => @hash_count,
    'f' => @filter
  }.to_json
end
```
Now, running `jekyll build` generates `_site/search.json`, which contains the bloom filters for all posts in my site.
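The result is a plain JSON array with one object per post, using the short keys from the `to_json` method above. The values below are illustrative only, and I’ve elided the bitmap itself:

```json
[
  { "u": "/2014/02/28/some-post/", "l": 957, "h": 7, "f": "..." },
  { "u": "/2014/01/12/another-post/", "l": 424, "h": 7, "f": "..." }
]
```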
Fetch and recreate the filters on the client
Since I want my page loads to appear faster than they actually are, I opted to fetch the search JSON asynchronously after the page load was done. This was easily done with a simple jQuery `$.get` call that fetches the JSON, iterates over each item in it (each item having the shape defined by the `to_json` method above) and creates a bloom filter out of it.
```javascript
var filters = [];  // populated once the JSON has been fetched

var initialize = function() {
  var index;
  $.get("/search.json", function(data) {
    for ( index = 0; index < data.length; index += 1 ) {
      filters.push( BloomFilter.create(data[index].u,
                                       data[index].l,
                                       data[index].h,
                                       data[index].f) );
    }
  })
  .fail(function(jqXHR, textStatus, errorThrown) {
    console.log("textStatus = " + textStatus + ", errorThrown = " + errorThrown);
  });
};
```
The `BloomFilter.create` method uses the exact same hashing functions as were used to create the filters in the Jekyll plugin. I ended up using Murmur hashes.
Look for matches in a search query
The search function has to do the same sort of setup that was required for inserting words into the filters: it converts the query to lowercase and replaces all non-alphabetic characters with spaces. It then iterates over each bloom filter (and hence over each post) and collects the query words that were found in that filter. The return value is a list of objects with two fields — the URL of the page and the list of matches for that page.
```javascript
var search = function(query) {
  var matches = [];
  var i;
  query = query.toLowerCase().replace(/[^a-zA-Z]/g, " ");
  var words = query.split(/\s+/);
  for ( i = 0; i < filters.length; i += 1 ) {
    var results = words.filter( function(word) {
      return filters[i].has(word);
    } );
    matches.push({
      url: filters[i].url(),
      words: results
    });
  }
  return matches.filter(removeMisses).sort(rsortMatches);
};
```
The `removeMisses` function on the last line removes any object whose list of matched words is empty. That’s required because the `words.filter` call produces an empty array if none of the words in `words` matched, and such a page is of no interest to us. The `rsortMatches` function sorts the matches in reverse order by:
function sorts the matches in reverse order by:
- Number of matches. For a multi-word query, pages that had more matches are sorted to the top of the array
- Alphabetically for pages with equal number of matches
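I haven’t shown those two helpers; a sketch of them (my own implementations, matching the ordering just described) could look like this:

```javascript
// Drop pages where no query word matched.
function removeMisses(match) {
  return match.words.length > 0;
}

// Reverse sort: most matched words first, then alphabetically by URL.
function rsortMatches(a, b) {
  if (b.words.length !== a.words.length) {
    return b.words.length - a.words.length;
  }
  return a.url < b.url ? -1 : (a.url > b.url ? 1 : 0);
}
```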
Display the results to the user
Once I have a list of all the posts that matched the user’s query, along with the words that matched for each page, I display the results to the user in a div. This div starts off hidden on each page and sits alongside the main `#content` div. When I have a list of search results, I hide the `#content` div, append the results to the `#search-results` div and unhide it. Clearing all text from the search input box hides the `#search-results` div and unhides the `#content` div. Pretty simple.
Bonus - Show snippets of text surrounding matches
While the approach outlined above is pretty solid in its own right, it has two shortcomings:
- The probability of false positives is ever-present. Even with the probability set to 0.01, I invariably ran into a couple of false matches during testing
- Relevance of results is usually hard for users to judge based solely on the link to a page. Think of how most of us search on Google: we type the query, then quickly scan the snippets of text displayed below each link to determine whether that page would help us
The solution to both of the above involves downloading a bit more content, but I believe it’s worth it. I display the results of the user’s query as expected the moment I’ve checked all the bloom filters. As soon as that is done, however, I iterate over the matched pages and fire off AJAX requests to my webserver to fetch each post itself. In the callback for each request:
- I run an `indexOf` on the fetched content, and if that returns `-1` then I know that the page was a false positive, so I remove it from the search results. This fixes the first drawback mentioned above.
- If the `indexOf` returned a valid index into the content, I run a regexp that extracts the entire sentence in which the match was found, up to a maximum of ten words before and ten words after the match. I match the regexp a maximum of two times over the post’s content, highlight the searched word in the result, and then replace the simple `matched: keyword1 keyword2` of the search result with something like `... and in the context of blah and bloop, the awesome keyword1 was unmatched ... however, some people have recommended keyword2 as a highly viable alternative, and I am highly tempted to ...`. This (I hope) addresses the second drawback by making it easier for users to judge for themselves whether the content on that page is what they are looking for.
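As a sketch, the context-extraction step for a single keyword could look like the function below. The ten-word window regexp is my own approximation of what I described above, and it assumes the keyword contains no regexp metacharacters — which the query normalization (stripping everything non-alphabetic) guarantees:

```javascript
// Pull a snippet of up to ten words on either side of the first occurrence
// of `keyword` in `content`, or return null for a bloom-filter false positive.
function extractSnippet(content, keyword) {
  if (content.indexOf(keyword) === -1) {
    return null;  // false positive: remove this page from the results
  }
  var pattern = new RegExp("(?:\\S+\\s+){0,10}" + keyword + "(?:\\s+\\S+){0,10}");
  var match = content.match(pattern);
  return match ? "... " + match[0] + " ..." : null;
}
```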