8.1.3 Quicker Sorting

Although insertion sort is typically faster than best-first sort, its running time is still scales quadratically with the length of the list. If it takes 100 milliseconds (one tenth of a second) to sort a list containing 1000 elements using list-sort-insert, we expect it will take four ($=2^2$) times as long to sort a list containing 2000 elements, and a million times ($=1000^2$) as long (over a day!) to sort a list containing one million ($1000*1000$) elements. Yet computers routinely need to sort lists containing many millions of elements (for example, consider processing credit card transactions or analyzing the data collected by a super collider).

The problem with our insertion sort is that it divides the work unevenly into inserting one element and sorting the rest of the list. This is a very unequal division. Any sorting procedure that works by considering one element at a time and putting it in the sorted position as is done by list-sort-find-best and list-sort-insert has a running time in $\Omega(n^2)$. We cannot do better than this with this strategy since there are $n$ elements, and the time required to figure out where each element goes is in $\Omega(n)$.

To do better, we need to either reduce the number of recursive applications needed (this would mean each recursive call results in more than one element being sorted), or reduce the time required for each application. The approach we take is to use each recursive application to divide the list into two approximately equal-sized parts, but to do the division in such a way that the results of sorting the two parts can be combined directly to form the result. We partition the elements in the list so that all elements in the first part are less than (according to the comparison function) all elements in the second part.

Our first attempt is to modify insert-one to partition the list into two parts. This approach does not produce a better-than-quadratic time sorting procedure because of the inefficiency of accessing list elements; however, it leads to insights for producing a quicker sorting procedure.

First, we define a list-extract procedure that takes as inputs a list and two numbers indicating the start and end positions, and outputs a list containing the elements of the input list between the start and end positions:

(define (list-extract p start end)
  (if (= start 0)
      (if (= end 0) null
          (cons (car p) (list-extract (cdr p) start (- end 1))))
      (list-extract (cdr p) (- start 1) (- end 1))))

The running time of the list-extract procedure is in $\Theta(n)$ where $n$ is the number of elements in the input list. The worst case input is when the value of end is the length of the input list, which means there will be $n$ recursive applications, each involving a constant amount of work.

We use list-extract to define procedures for obtaining first and second halves of a list (when the list has an odd number of elements, we put the middle element in the second half of the list):

(define (list-first-half p)
  (list-extract p 0 (floor (/ (list-length p) 2))))
 
(define (list-second-half p)
  (list-extract p (floor (/ (list-length p) 2)) (list-length p)))

The list-first-half and list-second-half procedures use list-extract so their running times are linear in the number of elements in the input list.

The list-insert-one-split procedure inserts an element in sorted order by first splitting the list in halves and then recursively inserting the new element in the appropriate half of the list:

(define (list-insert-one-split cf el p) ; requires: p is sorted by cf
  (if (null? p) (list el)      
      (if (null? (cdr p))
          (if (cf el (car p)) (cons el p) (list (car p) el))
          (let ((front (list-first-half p)) (back (list-second-half p)))    
            (if (cf el (car back))
                (list-append (list-insert-one-split cf el front) back)
                (list-append front (list-insert-one-split cf el back)))))))

In addition to the normal base case when the input list is null, we need a special case when the input list has one element. If the element to be inserted is before this element, the output is produced using cons; otherwise, we produce a list of the first (only) element in the list followed by the inserted element.

In the recursive case, we use the list-first-half and list-second-half procedures to split the input list and bind the results of the first and second halves to the front and back variables so we do not need to evaluate these expressions more than once.

Since the list passed to list-insert-one-split is required to be sorted, the elements in front are all less than the first element in back. Hence, only one comparison is needed to determine which of the sublists contains the new element: if the element is before the first element in back it is in the first half, and we produce the result by appending the result of inserting the element in the front half with the back half unchanged; otherwise, it is in the second half, so we produce the result by appending the front half unchanged with the result of inserting the element in the back half.

To analyze the running time of list-insert-one-split we determine the number of recursive calls and the amount of work involved in each application. We use $n$ to denote the number of elements in the input list. Unlike the other recursive list procedures we have analyzed, the number of recursive applications of list-insert-one-split does not scale linearly with the length of the input list. The reason for this is that instead of using (cdr p) in the recursive call, list-insert-one-split passes in either the front or back value which is the result of (first-half p) or (second-half p) respectively. The length of the list produced by these procedures is approximately $\frac{1}{2}$ the length of the input list. With each recursive application, the size of the input list is halved. This means, doubling the size of the input list only adds one more recursive application. This means the number of recursive calls is logarithmic in the size of the input.

Recall that the logarithm ($\log_b$) of a number $n$ is the number $x$ such that $b^x = n$ where $b$ is the \emph{base} of the logarithm. In computing, we most commonly encounter logarithms with base 2. Doubling the input value increases the value of its logarithm base two by one: $\log_{2} 2n = 1 + \log_{2} n$. Changing the base of a logarithm from $k$ to $b$ changes the value by the constant factor (see Section 7.3.1), so inside the asymptotic operators a constant base of a logarithm does not matter. Thus, when the amount of work increases by some constant amount when the input size doubles, we write that the growth rate is in $\Theta(\log n)$ without specifying the base of the logarithm. %Thus, the number of recursive applications of list-insert-one-split is in $\Theta(\log n)$ since doubling the size of the input requires one more recursive application.

Each list-insert-one-split application applies list-append to a first parameter that is either the front half of the list or the result of inserting the element in the front half of the list. In either case, the length of the list is approximately $\frac{n}{2}$. The running time of list-append is in $\Theta(m)$ where $m$ is the length of the first input list. So, the time required for each list-insert-one-split application is in $\Theta(n)$ where $n$ is the length of the input list to list-insert-one-split.

The lengths of the input lists to list-insert-one-split in the recursive calls are approximately $\frac{n}{2}$, $\frac{n}{4}$, $\frac{n}{8}$, $\ldots$, 1, since the length of the list halves with each call. The summation has $\log_2 n$ terms, and the sum of the list is $n$, so the average length input is $\frac{n}{\log_2 n}$. Hence, the total running time for the list-append applications in each application of list-insert-one-split is in $\Theta(\log_2 n \times \frac{n}{\log_2 n}) = \Theta(n)$.

The analysis of the applications of list-first-half and list-second-half is similar: each requires running time in $\Theta(m)$ where $m$ is the length of the input list, which averages $\frac{n}{\log_2 n}$ where $n$ is the length of the input list of list-insert-one-split. Hence, the total running time for list-insert-one-split is in $\Theta(n)$.

The list-sort-insert-split procedure is identical to list-sort-insert (except for calling list-insert-one-split):

(define (list-sort-insert-split cf p)
  (if (null? p) null
      (list-insert-one-split cf (car p) (list-sort-insert-split cf (cdr p)))))

Similarly to list-sort-insert, list-sort-insert-split involves $n$ applications of list-insert-one-split, and the average length of the input list is $\frac{n}{2}$. Since list-sort-insert-split involves $\Theta(n)$ applications of list-insert-one-split with average input list length of $\frac{n}{2}$, the total running time for list-sort-insert-split is in $\Theta(n^2)$. Because of the cost of evaluating the list-append, list-first-half, and list-second-half applications, the change to splitting the list in halves has not improved the asymptotic performance; in fact, because of all the extra work in each application, the actual running time is higher than it was for list-sort-insert.

The problem with our list-insert-one-split procedure is that the list-first-half and list-second-half procedures have to cdr down the whole list to get to the middle of the list, and the list-append procedure needs to walk through the entire input list to put the new element in the list. All of these procedures have running times that scale linearly with the length of the input list. To use the splitting strategy effectively, we need is a way to get to the middle of the list quickly. With the standard list representation this is impossible: it requires one cdr application to get to the next element in the list, so there is no way to access the middle of the list without using at least $\frac{n}{2}$ applications of cdr. To do better, we need to change the way we represent our data. The next subsection introduces such a structure; in Section 8.1.5 shows a way of sorting efficiently using lists directly by changing how we split the list.