8.2.3 Indexed Search

The limitation of binary search is we can only use is when the input data is already sorted. What if we want to search a collection of documents, such as finding all web pages that contain a given word? The web visible to search engines contains billions of web pages most of which contain hundreds or thousands of words. A linear search over such a vast corpus would be infeasible: supposing each word can be tested in 1 millisecond, the time to search 1 trillion words would be over 30 years!

Providing useful searches over large data sets like web documents requires finding a way to structure the data so it is not necessary to examine all documents to perform a search. One way to do this is to build an index that provides a mapping from words to the documents that contain them. Then, we can build the index once, store it in a sorted binary tree, and use it to perform all the searches. Once the index is built, the work required to perform one search is just the time it takes to look up the target word in the index. If the index is stored as a sorted binary tree, this is logarithmic in the number of distinct words.

Strings. We use the built-in String datatype to represent documents and target words. A String is similar to a List, but specialized for representing sequences of characters. A convenient way to make a String it to just use double quotes around a sequence of characters. For example, "abcd" evaluates to a String containing four characters.

The String datatype provides procedures for matching, ordering, and converting between Strings and Lists of characters:

One advantage of using Strings instead of Lists of characters is the built-in procedures for comparing Strings; we could write similar procedures for Lists of characters, but lexicographic ordering is somewhat tricky to get right, so it is better to use the built-in procedures.

Building the index. The entries in the index are Pairs of a word represented as a string, and a list of locations where that word appears. Each location is a Pair consisting of a document identifier (for web documents, this is the Uniform Resource Locator (URL) that is the address of the web page represented as a string) and a Number identifying the position within the document where the word appears (we label positions as the number of characters in the document before this location).

To build the index, we split each document into words and record the position of each word in the document. The first step is to define a procedure that takes as input a string representing an entire document, and produces a list of (word . position) pairs containing one element for each word in the document. We define a word as a sequence of alphabetic characters; non-alphabetic characters including spaces, numbers, and punctuation marks separate words and are not included in the index.

The text-to-word-positions procedure takes a string as input and outputs a list of word-position pairs corresponding to each word in the input. The inner procedure, text-to-word-positions-iter, takes three inputs: a list of the characters in the document, a list of the characters in the current word, and a number representing the position in the string where the current word starts; it outputs the list of (word . position) pairs. The value passed in as w can be null, meaning there is no current word. Otherwise, it is a list of the characters in the current word. A word starts when the first alphabetic character is found, and continues until either the first non-alphabetic character or the end of the document. We use the built-in char-downcase procedure to convert all letters to their lowercase form, so KING, King, and king all correspond to the same word.

(define (text-to-word-positions s)
  (define (text-to-word-positions-iter p w pos)
    (if (null? p)
        (if (null? w) null (list (cons (list->string w) pos)))
        (if (not (char-alphabetic? (car p))) ; finished word
            (if (null? w) ; no current word
                (text-to-word-positions-iter (cdr p) null (+ pos 1))
                (cons (cons (list->string w) pos)
                      (text-to-word-positions-iter (cdr p) null
                       (+ pos (list-length w) 1))))
            (text-to-word-positions-iter (cdr p)
             (list-append w (list (char-downcase (car p))))
  (text-to-word-positions-iter (string->list s) null 0))

The next step is to build an index from the list of word-position pairs. To enable fast searching, we store the index in a binary tree sorted by the target word. The insert-into-index procedure takes as input an index and a word-position pair and outputs an index consisting of the input index with the input word-position pair added.

The index is represented as a sorted binary tree where each element is a pair of a word and a list of the positions where that word appears. Each word should appear in the tree only once, so if the word-position pair to be added corresponds to a word that is already in the index, the position is added to the corresponding list of positions. Otherwise, a new entry is added to the index for the word with a list of positions containing the position as its only element.

(define (insert-into-index index wp)
  (if (null? index)
      (make-tree null (cons (car wp) (list (cdr wp))) null)
      (if (string=? (car wp) (car (tree-element index)))
          (make-tree (tree-left index)
                     (cons (car (tree-element index))
                           (list-append (cdr (tree-element index))
                                        (list (cdr wp))))
                     (tree-right index))
          (if (string<? (car wp) (car (tree-element index)))
              (make-tree (insert-into-index (tree-left index) wp)
                         (tree-element index)
                         (tree-right index))
              (make-tree (tree-left index)
                         (tree-element index)
                         (insert-into-index (tree-right index) wp))))))

To insert all the (word . position) pairs in a list into the index, we use insert-into-index to add each pair, passing the resulting index into the next recursive call:

(define (insert-all-wps index wps)
  (if (null? wps) index
      (insert-all-wps (insert-into-index index (car wps)) (cdr wps))))

To add all the words in a document to the index we use text-to-word-positions to obtain the list of word-position pairs. Since we want to include the document identity in the positions, we use list-map to add the url (a string that identifies the document location) to the position of each word. Then, we use insert-all-wps to add all the word-position pairs in this document to the index. The index-document procedure takes a document identifier and its text as a string, and produces an index of all words in the document.

(define (index-document url text)
   (list-map (lambda (wp) (cons (car wp) (cons url (cdr wp))))
             (text-to-word-positions text))))

We leave analyzing the running time of index-document as an exercise. The important point, though, is that it only has to be done once for a given set of documents. Once the index is built, we can use it to answer any number of search queries without needing to reconstruct the index. %Large search engine companies dedicate large numbers of machines to maintaining the index as new web pages are found.

Merging indexes. Our goal is to produce an index for a set of documents, not just a single document. So, we need a way to take two indexes produced by index-document and combine them into a single index. We use this repeatedly to create an index of any number of documents. To merge two indexes, we combine their word occurrences. If a word occurs in both documents, the word should appear in the merged index with a position list that includes all the positions in both indexes. If the word occurs in only one of the documents, that word and its position list should be included in the merged index.

(define (merge-indexes d1 d2)
  (define (merge-elements p1 p2)
    (if (null? p1) p2
        (if (null? p2) p1
            (if (string=? (car (car p1)) (car (car p2)))
                (cons (cons (car (car p1))
                            (list-append (cdr (car p1)) (cdr (car p2))))
                      (merge-elements (cdr p1) (cdr p2)))
                (if (string<? (car (car p1)) (car (car p2)))
                    (cons (car p1) (merge-elements (cdr p1) p2))
                    (cons (car p2) (merge-elements p1 (cdr p2))))))))
   (lambda (e1 e2) (string<? (car e1) (car e2)))
   (merge-elements (tree-extract-elements d1)
                   (tree-extract-elements d2)))))))

To merge the indexes, we first use tree-extract-elements to convert the tree representations to lists. The inner merge-elements procedure takes the two lists of word-position pairs and outputs a single list.

Since the lists are sorted by the target word, we can perform the merge efficiently. If the first words in both lists are the same, we produce a word-position pair that appends the position lists for the two entries. If they are different, we use string&lt;? to determine which of the words belongs first, and include that element in the merged list. This way, the two lists are kept synchronized, so there is no need to search the lists to see if the same word appears in both lists.

Obtaining documents. To build a useful index for searching, we need some documents to index. The web provides a useful collection of freely available documents. To read documents from the web, we use library procedures provided by DrRacket.

This expression loads the libraries for managing URLs and getting files from the network: (require (lib "url.ss" "net")). One procedure this library defines is string-&gt;url, which takes a string as input and produces a representation of that string as a URL. A Uniform Resource Locator (URL) is a standard way to identify a document on the network. The address bar in most web browsers displays the URL of the current web page.

The full grammar for URLs is quite complex (see Exercise 2.14), but we will use simple web page addresses of the form:1

An example of a URL is http://www.whitehouse.gov/index.html. The http indicates the HyperText Transfer Protocol, which prescribes how the web client (browser) and server communicate with each other. The domain name is www.whitehouse.gov, and the path name is /index.html (which is the default page for most web servers).

The library also defines the get-pure-port procedure that takes as input a URL and produces a port for reading the document at that location. The read-char procedure takes as input a port, and outputs the first character in that port. It also has a side effect: it advances the port to the next character. We can use read-char repeatedly to read each character in the web page of the port. When the end of the file is reached, the next application of read-char outputs a special marker representing the end of the file. The procedure eof-object? evaluates to true when applied to this marker, and false for all other inputs.

The read-all-chars procedure takes a port as its input, and produces a list containing all the characters in the document associated with the port:

(define (read-all-chars port)
  (let ((c (read-char port)))
    (if (eof-object? c) null
        (cons c (read-all-chars port)))))

Using these procedures, we define web-get, a procedure that takes as input a string that represents the URL of some web page, and outputs a string representing the contents of that page.

(define (web-get url)
  (list->string (read-all-chars (get-pure-port (string->url url)))))

To make it easy to build an index of a set of web pages, we define the index-pages procedure that takes as input a list of web pages and outputs an index of the words in those pages. It recurses through the list of pages, indexing each document, and merging that index with the result of indexing the rest of the pages in the list.

(define (index-pages p)
  (if (null? p) null
      (merge-indexes (index-document (car p) (web-get (car p)))
                     (index-pages (cdr p)))))

We can use this to create an index of any set of web pages. For example, here we use Jeremy Hylton's collection of the complete works of William Shakespeare (http://shakespeare.mit.edu) to define shakespeare-index as an index of the words used in all of Shakespeare's plays.

(define shakespeare-index
    (lambda (play)
     (string-append "http://shakespeare.mit.edu/" play "/full.html"))
     ;; List of plays following the site's naming conventions.
    (list "allswell" "asyoulikeit" "comedy_errors" "cymbeline" "lll"
     "measure" "merry_wives" "merchant" "midsummer" "much_ado"
     "pericles" "taming_shrew" "tempest" "troilus_cressida" "twelfth_night"
     "two_gentlemen" "winters_tale" "1henryiv" "2henryiv" "henryv"
     "1henryvi" "2henryvi" "3henryvi" "henryviii" "john" "richardii"
     "richardiii" "cleopatra" "coriolanus" "hamlet" "julius_caesar" "lear"
     "macbeth" "othello" "romeo_juliet" "timon" "titus"))))

Building the index takes about two and a half hours on my laptop. It contains 22949 distinct words and over 1.6 million word occurrences. Much of the time spent building the index is in constructing new lists and trees for every change, which can be avoided by using the mutable data types we cover in the next chapter. The key idea, though, is that the index only needs to be built once. Once the documents have been indexed, we can use the index to quickly perform any search.

Searching. Using an index, searching for pages that use a given word is easy and efficient. Since the index is a sorted binary tree, we use binary-tree-search to search for a word in the index:

(define (search-in-index index word)
   (lambda (el) (string=? word (car el))) ; first element of (word . position)
   (lambda (el) (string<? word (car el)))

As analyzed in the previous section, the expected running time of binary-tree-search is in $\Theta(\log n)$ where $n$ is the number of nodes in the input tree.2 The body of search-in-index applies binary-tree-search to the index. The number of nodes in the index is the number of distinct words in the indexed documents. So, the running time of search-in-index scales logarithmically with the number of distinct words in the indexed documents. Note that the number and size of the documents does not matter! This is why a search engine such as Google can respond to a query quickly even though its index contains many billions of documents.

One issue we should be concerned with is the running time of the procedures passed into binary-tree-search. Our analysis of binary-tree-search assumes the equality and comparison functions are constant time procedures. Here, the procedures as string=? and string&lt;?, which both have worst case running times that are linear in the length of the input string. As used here, one of the inputs is the target word. So, the amount of work for each binary-tree-search recursive call is in $\Theta(w)$ where $w$ is the length of word. Thus, the overall running time of search-in-index is in $\Theta(w \log d)$ where $w$ is the length of word and $d$ is the number of words in the index. If we assume all words are of some maximum length, though, the $w$ term disappears as a constant factor (that is, we are assuming $w < C$ for some constant $C$. Thus, the overall running time is in $\Theta(\log d)$.

Here are some examples: % uses of search-in-index using our index of Shakespeare's plays:

Our search-in-index and index-pages procedures form the beginnings of a search engine service. A useful web search engine needs at least two more capabilities: a way to automate the process of finding documents to index, and a way to rank the documents that contain the target word by the likelihood they are useful. The exploration at the end of this section addresses these capabilities.

Histogram. We can also use our index to analyze Shakespeare's writing. The index-histogram procedure produces a list of the words in an index sorted by how frequently they appear:

(define (index-histogram index)
   (lambda (e1 e2) (> (cdr e1) (cdr e2)))
   (list-map (lambda (el) (cons (car el) (length (cdr el))))
             (tree-extract-elements index))))

The expression,

(list-filter (lambda (entry) (> string-length (car entry) 5))
             (index-histogram shakespeare-index))

evaluates to a list of Shakespeare's favorite 6-letter and longer words along with the number of times they appear in the corpus (the first two entries are from their use in the page formatting):

(("blockquote" . 63345) ("speech" . 31099)
("should" . 1557) ("father" . 1086) ("exeunt" . 1061)
("master" . 861) ("before" . 826) ("mistress" . 787)
... ("brother" . 623)
... ("daughter" . 452)
... ("mother" . 418)
... ("mustardseed" . 13)
... ("excrement" . 5)
... ("zwaggered" . 1))

Exercise 8.14. Define a procedure for finding the longest word in a document. Analyze the running time of your procedure.

Exercise 8.15. Produce a list of the words in Shakespeare's plays sorted by their length.

Exercise 8.16. $\left[\star \right]$ Analyze the running time required to build the index. a. Analyze the running time of the text-to-word-positions procedure. Use $n$ to represent the number of characters in the input string, and $w$ to represent the number of distinct words. Be careful to clearly state all assumptions on which your analysis relies.
b. Analyze the running time of the insert-into-index procedure.
c. Analyze the running time of the index-document procedure.
d. Analyze the running time of the merge-indexes procedure.
e. Analyze the overall running time of the index-pages procedure. Your result should describe how the running time is impacted by the number of documents to index, the size of each document, and the number of distinct words.

Exercise 8.17. $\left[\star \right]$ The search-in-index procedure does not actually have the expected running time in $\Theta(\log w)$ (where $w$ is the number of distinct words in the index) for the Shakespeare index because of the way it is built using merge-indexes. The problem has to do with the running time of the binary tree on pathological inputs. Explain why the input to list-to-sorted-tree in the merge-indexes procedure leads to a binary tree where the running time for searching is in $\Theta(w)$. Modify the merge-indexes definition to avoid this problem and ensure that searches on the resulting index run in $\Theta(\log w)$.

Exercise 8.18. $\left[\star \right]$$\left[\star \right]$ The site http://www.speechwars.com provides an interesting way to view political speeches by looking at how the frequency of the use of different words changes over time. Use the index-histogram procedure to build a historical histogram program that takes as input a list of indexes ordered by time, and a target word, and output a list showing the number of occurrences of the target word in each of the indexes. You could use your program to analyze how Shakespeare's word use is different in tragedies and comedies or to compare Shakespeare's vocabulary to Jefferson's.

Exploration 8.1: Searching the Web

In addition to fast indexed search, web search engines have to solve two problems: (1) find documents to index, and (2) identify the most important documents that contain a particular search term.

For our Shakespeare index example, we manually found a list of interesting documents to index. This approach does not scale well to indexing the World Wide Web where there are trillions of documents and new ones are created all the time. For this, we need a web crawler.

A web crawler finds documents on the web. Typical web crawlers start with a set of seed URLs, and then find more documents to index by following the links on those pages. This proceeds recursively: the links on each newly discovered page are added to the set of URLs for the crawler to index. To develop a web crawler, we need a way to extract the links on a given web page, and to manage the set of pages to index.

a. $\left[\star \right]$ Define a procedure extract-links that takes as input a string representing the text of a web page and outputs a list of all the pages linked to from this page. Linked pages can be found by searching for anchor tags on the web page. An anchor tag has the form:3 $<$href=target$>$. (The text-to-word-positions procedure may be a helpful starting point for defining extract-links.)

b. Define a procedure crawl-page that takes as input an index and a string representing a URL. As output, it produces a pair consisting of an index (that is the result of adding all words from the page at the input URL to the input index) and a list of URLs representing all pages linked to by the crawled page.

c. $\left[\star \right]$$\left[\star \right]$ Define a procedure crawl-web that takes as input a list of seed URLs and a Number indicating the maximum depth of the crawl. It should output an index of all the words on the web pages at the locations given by the seed URLs and any page that can be reached from these seed URLs by following no more than the maximum depth number of links.

For a web search engine to be useful, we don't want to just get a list of all the pages that contain some target word, we want the list to be sorted according to which of those pages are most likely to be interesting. Selecting the best pages for a given query is a challenging and important problem, and the ability to do this well is one of the main things that distinguishes web search engines. Many factors are used to rank pages including an analysis of the text on the page itself, whether the target word is part of a title, and how recently the page was updated.

The best ways of ranking pages also consider the pages that link to the ranked page. If many pages link to a given page, it is more likely that the given page is useful. This property can also be defined recursively: a page is highly ranked if there are many highly-ranked pages that link to it.

The ranking system used by Google is based on this formula:

$$ R(u) = \sum_{v \in L_u} \frac{R(v)}{L(v)} $$

where $L_u$ is the set of web pages that contain links to the target page $u$ and $L(v)$ is the number of links on the page $v$ (thus, the value of a link from a page containing many links is less than the value of a link from a page containing only a few links). The value $R(u)$ gives a measure of the ranking of the page identified by $u$, where higher values indicate more valuable pages.

The problem with this formula is that is is circular: there is no base case, and no way to order the web pages to compute the correct rank of each page in turn, since the rank of each page depends on the rank of the pages that link to it.

One way to approximate equations like this one is to use relaxation. Relaxation obtains an approximate solution to some systems of equations by repeatedly evaluating the equations. To estimate the page ranks for a set of web pages, we initially assume every page has rank 1 and evaluate $R(u)$ for all the pages (using the value of 1 as the rank for every other page). Then, re-evaluate the $R(u)$ values using the resulting ranks. A relaxation keeps repeating until the values change by less than some threshold amount, but there is no guarantee how quickly this will happen. For the page ranking evaluation, it may be enough to decide on some fixed number of iterations.

d. $\left[\star \right]$$\left[\star \right]$ Define a procedure, web-link-graph, that takes as input a set of URLs and produces as output a graph of the linking structure of those documents. The linking structure can be represented as a List where each element of the List is a pair of a URL and a list of all the URLs that include a link to that URL. The extract-links procedure from the previous exploration will be useful for determining the link targets of a given URL.

e. $\left[\star \right]$ Define a procedure that takes as input the output of web-link-graph and outputs a preliminary ranking of each page that measures the number of other pages that link to that page.

f. $\left[\star \right]$$\left[\star \right]$ Refine your page ranking procedure to weight links from highly-ranked pages more heavily in a page's rank by using a algorithm.

e. $\left[\star \right]$$\left[\star \right]$$\left[\star \right]$$\left[\star \right]$ Come up with a cool name, set up your search engine as a web service, and attract more than 0.001\% of all web searches to your site.

  1. We use Name to represent sequences of characters in the domain and path names, although the actual rules for valid names for each of these are different. 

  2. Because of the way merge-indexes is defined, we do not actually get this expected running time. See Exercise 8.17

  3. Not all links match this structure exactly, so this may miss some of the links on a page.}