Regular Expression Matching with a Trigram Index or How Google Code Search Worked
Created on 2022-04-28T18:14:20-05:00
Corpus: entire collection of documents being managed.
Invertex index: a mapping of things which exist in the corpus and links back to the documents which contain those things.
n-gram: a tuple of N symbols in sequence. In advanced cases a symbol may be a wildcard.
trigram: an n-gram where N is 3.
Whether whitespace gets its own symbol or is ignored is search engine dependent. Sometimes people search for whitespace with regular expressions so it *can* be useful.
Search indexing is basically about creating a cheap way to quickly rule out documents before performing a much slower but more precise search on the candidate list.
Ingesting a file
To add a file to the corpus you scan it for every unique trigram within. For every new trigram discovered in the corpus it recieves a new unique ID. This ID mapping is added to the database. A link from each ID is created in the database to the original document.
Searching for files
Analyze the search query for every unique trigram which is needed to satisfy it. For example if the query is "Quinn" then you need to look up the IDs for trigrams QUI, UIN, INN. Search links in the invertex index to find every file which contains all of the necessary trigrams.
Once the smaller candidate list is found; fall back to checking each file for the exact query requested.