Web Excursions 2023-02-13
The Technology Behind GitHub’s New Code Search
we built our own search engine from scratch, in Rust, specifically for the domain of code search.
We call this search engine Blackbird
we haven’t had a lot of luck using general text search products to power code search.
The user experience is poor, indexing is slow, and it’s expensive to host.
There are some newer, code-specific open source projects out there, but they definitely don’t work at GitHub’s scale
We understand that code search is uniquely different from general text search.
Code is already designed to be understood by machines and we should be able to take advantage of that structure and relevance.
Searching code also has unique requirements:
we want to search for punctuation (for example, a period or open parenthesis);
we don’t want stemming;
we don’t want stop words to be stripped from queries; and
we want to search with regular expressions.
GitHub’s scale is truly a unique challenge.
When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time).
Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle.
“Why don’t you just use grep?”
On a machine with an eight core Intel CPU, ripgrep can run an exhaustive regular expression query on a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.
Code search runs on clusters of 32 machines with 64 cores each.
Even if we managed to put 115 TB of code in memory and assume we can perfectly parallelize the work, we’re going to saturate 2,048 CPU cores for 96 seconds to serve a single query
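Those figures hang together; a quick back-of-the-envelope check, using only the numbers quoted above:

```rust
fn main() {
    // ripgrep figure quoted above: 13 GB scanned in 2.769 s on an 8-core machine
    let gb_per_sec_per_core = 13.0 / 2.769 / 8.0; // ~0.59 GB/sec/core
    // the quoted cluster: 32 machines x 64 cores, scanning 115 TB of code
    let cores: f64 = 32.0 * 64.0;
    let corpus_gb = 115_000.0;
    let seconds = corpus_gb / (gb_per_sec_per_core * cores); // ~96 s for one query
    println!("{gb_per_sec_per_core:.2} GB/s/core -> {seconds:.0} s per exhaustive query");
}
```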
You can see where this is going: we need to build an index.
We scan each document to detect what programming language it’s written in, assign a document ID,
and then create an inverted index where language is the key and the value is a posting list of document IDs.
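A toy version of that language index, just to make the shape concrete (the types and data here are made up, not Blackbird’s actual layout):

```rust
use std::collections::HashMap;

type DocId = u32;

fn main() {
    // pretend we've already detected each document's language and assigned an ID
    let docs = [(0, "Rust"), (1, "Go"), (2, "Rust")];

    // inverted index: language -> posting list of document IDs
    let mut index: HashMap<&str, Vec<DocId>> = HashMap::new();
    for (id, lang) in docs {
        index.entry(lang).or_default().push(id);
    }

    // a `language:Rust` filter is now a single posting-list lookup
    println!("{:?}", index.get("Rust")); // Some([0, 2])
}
```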
For code search, we need a special type of inverted index called an ngram index, which is useful for looking up substrings of content.
An ngram is a sequence of characters of length n.
For example, if we choose n=3 (trigrams), the ngrams that make up the content “limits” are lim, imi, mit, and its.
To perform a search, we intersect the results of multiple lookups to give us the list of documents where the string appears.
With a trigram index you need four lookups: lim, imi, mit, and its in order to fulfill the query for limits.
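Here’s a minimal sketch of that build-and-intersect step, assuming ASCII content and plain trigrams (the real index is considerably more sophisticated):

```rust
use std::collections::HashMap;

type DocId = u32;

// all length-3 substrings of an ASCII string
fn trigrams(s: &str) -> Vec<&str> {
    (0..s.len().saturating_sub(2)).map(|i| &s[i..i + 3]).collect()
}

fn main() {
    // build the trigram index: trigram -> posting list of document IDs
    let mut index: HashMap<&str, Vec<DocId>> = HashMap::new();
    for (id, content) in [(0, "limits"), (1, "limited"), (2, "mitten")] {
        for gram in trigrams(content) {
            index.entry(gram).or_default().push(id);
        }
    }

    // querying "limits": look up each of its trigrams, then intersect the lists
    let postings: Option<Vec<Vec<DocId>>> =
        trigrams("limits").iter().map(|g| index.get(g).cloned()).collect();
    let hits = postings
        .and_then(|lists| {
            lists.into_iter()
                .reduce(|a, b| a.into_iter().filter(|d| b.contains(d)).collect())
        })
        .unwrap_or_default();
    println!("{hits:?}"); // [0] -- only doc 0 contains lim, imi, mit, and its
}
```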
The next problem we have is how to build this index in a reasonable amount of time. Two key insights help: Git’s use of content addressable hashing and the fact that there’s actually quite a lot of duplicate content on GitHub.
Those two insights lead us to the following decisions:
Shard by Git blob object ID, which gives us a nice way of evenly distributing documents between the shards while avoiding any duplication (see the sketch just after this list).
Model the index as a tree and use delta encoding to reduce the amount of crawling and to optimize the metadata in our index.
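A rough sketch of the sharding idea, assuming blob object IDs arrive as hex SHA strings (the shard count and the exact hashing are invented here); because identical content has an identical object ID, duplicates always land on the same shard:

```rust
// pick a shard from a Git blob object ID (hex SHA); identical blobs always
// map to the same shard, so duplicate content is only indexed once
fn shard_for(blob_oid: &str, num_shards: u64) -> u64 {
    u64::from_str_radix(&blob_oid[..16], 16).unwrap_or(0) % num_shards
}

fn main() {
    let oid = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"; // SHA-1 of the empty blob
    println!("blob {oid} -> shard {}", shard_for(oid, 32));
}
```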
We also designed the system so that query results are consistent on a commit-level basis.
Kafka provides events that tell us to go index something.
There are a bunch of crawlers that interact with Git and a service for extracting symbols from code,
and then we use Kafka, again, to allow each shard to consume documents for indexing at its own pace.
In between GitHub.com and the shards is a service that coordinates taking user queries and fanning out requests to each host in the search cluster.
We use Redis to manage quotas and cache some access control data.
The front end accepts the user query and passes it along to the Blackbird query service
where we parse the query into an abstract syntax tree
and then rewrite it,
resolving things like languages to their canonical Linguist language ID and
tagging on extra clauses for permissions and scopes.
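Roughly what that rewrite could look like; the AST shape, the Linguist ID value, and the permission clause below are all invented for illustration, since Blackbird’s real query representation isn’t public:

```rust
#[derive(Debug)]
enum Query {
    Term(String),
    Language(String),    // as typed by the user, e.g. language:Rust
    LanguageId(u32),     // resolved canonical Linguist ID
    RepoIn(Vec<String>), // permission/scope clause tagged on during rewrite
    And(Vec<Query>),
}

// resolve language names to canonical IDs throughout the tree
fn resolve(q: Query) -> Query {
    match q {
        // the ID value here is illustrative, not Linguist's real one
        Query::Language(name) if name.eq_ignore_ascii_case("rust") => Query::LanguageId(1),
        Query::And(children) => Query::And(children.into_iter().map(resolve).collect()),
        other => other,
    }
}

// rewrite: resolve names, then tag on a clause restricting results to
// repositories the requesting user is allowed to see
fn rewrite(q: Query, visible_repos: &[String]) -> Query {
    Query::And(vec![resolve(q), Query::RepoIn(visible_repos.to_vec())])
}

fn main() {
    let parsed = Query::And(vec![
        Query::Term("limits".into()),
        Query::Language("Rust".into()),
    ]);
    println!("{:?}", rewrite(parsed, &["octocat/hello-world".into()]));
}
```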
Next, we fan out and send n concurrent requests: one to each shard in the search cluster.
On each individual shard, we then do some further conversion of the query in order to lookup information in the indices.
Back in the query service, we aggregate the results from all shards, re-sort by score, filter (to double-check permissions), and return the top 100.
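A much-simplified picture of the fan-out and aggregation, with in-process threads standing in for network calls to shard hosts and a fake scoring function:

```rust
use std::thread;

// stand-in for a per-shard index lookup; returns (doc_id, score) pairs
fn search_shard(shard: u32, _query: &str) -> Vec<(u32, f32)> {
    vec![(shard * 1_000, 1.0 / (shard as f32 + 1.0))]
}

fn main() {
    let query = "limits";
    let num_shards = 32;

    // fan out: one concurrent request per shard
    let mut merged: Vec<(u32, f32)> = thread::scope(|s| {
        let handles: Vec<_> = (0..num_shards)
            .map(|shard| s.spawn(move || search_shard(shard, query)))
            .collect();
        // aggregate: gather every shard's results into one list
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });

    // re-sort by score and keep the top 100 (permission re-checks omitted here)
    merged.sort_by(|a, b| b.1.total_cmp(&a.1));
    merged.truncate(100);
    println!("{} results, best: {:?}", merged.len(), merged.first());
}
```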
Now that we’ve seen the full system, let’s revisit the scale of the problem.
Our ingest pipeline can publish around 120,000 documents per second, so working through the corpus of roughly 15.5 billion documents should take about 36 hours.
But delta indexing reduces the number of documents we have to crawl by over 50%, which allows us to re-index the entire corpus in about 18 hours.
we started with 115 TB of content that we want to search.
Content deduplication and delta indexing bring that down to around 28 TB of unique content.
And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content