A General Order(n) Sort

Abstract

Natural numbers expressed conventionally (base 10 or base 2) as a string of digits have a word length of log n, where n is the number expressed. Similarly, using string or numerical keys to retrieve data objects, if duplicate keys are to be avoided, the key length must grow as log n. However, in many practical situations, the key length is arbitrarily bounded, and duplicates are avoided because the number of objects, n, in the database, is bounded by practical considerations. For example, 9-digit social security numbers (SSNs) are used to uniquely identify people by the government of the Unites States of America. Bounding the key word size allows the implementation of an order n sort function for objects with either numerical or general string keys.

A tree data structure and associated algorithms have been created that provide the following functionality and performance properties:

Function Time Complexity
(allows duplicate keys)
Time Complexity
(no duplicate keys)
Comment
1 Insert an object with a key string
into a container.
O(1) O(log n) n is the number of objects stored in
the container.
2 Determine whether an object is in a
container based on some key string
for the object.
O(1) O(log n)  
3 Retrieve an object from a container
based on some key string.
O(1) O(log n)  
4 Determine how many of an object with
the same key exist in a container.
O(1)   Does not apply if duplicate keys are
not allowed.
5 Write the objects from a container
into an array in sorted order based
on their keys.
O(n) O(n log n)  

The storage space required for the objects is O(n). Combining the above functions 1 and 5 gives an O(n) sort algorithm for objects with bounded-length keys (e.g. people sorted by social security number). This algorithm has been implemented and tested, with test results corroborating the O(n) complexity from analysis.

The better-performing duplicates-allowed functions may be used with fixed field key relational databases if some external mechanism is used to preclude duplicate keys (such as using serial numbers for keys as is a common practice in many implementations).

Introduction

The theoretical best time complexity for sorting a set of numbers (or objects with unique number keys) by comparison is Omega(n log n) [2]. By the definition of a set, duplicates are not allowed, so as n increases, the length of the key (field width or number of digits in the key) increases as log n. This would apply, for example, to an application that stores a large set of objects keyed to serial numbers. Stefan Nilsson describes an O(n log log n) integer sorting algorithm that allows duplicates in the April 2000 issue of Dr. Dobb's Journal [3].

Many practical applications have a fixed key length, such as the 9-digit social security number (SSN), name string (last + first + middle), or other property of the stored object. If, as in a relational database, duplicate keys are not allowed, the fixed key length puts an upper limit on storage capacity that is not usually reached in practice.

In such practical applications, a string tree as described here can provide keyed storage of data objects with the depth of the tree equal to the length of the largest key. This constant bound on the depth of the tree allows the implementation of a linear sort function, as described and demonstrated below. Moreover, the sort algorithms described here are stable, that is, they preserve the initial order of items with equal keys [5].

Other known O(n) sorting algorithms such as counting sort, radix sort, and bucket sort [2], impose restrictions on the sorted objects. For example, bucket sort requires both that the numbers to be sorted to be within a predetermined range and that there be no duplicates [1]. The O(n) sort described here imposes no such restrictions. However, as n grows without bound, duplicates will be encountered when the key word length is less than log n.

String Tree Description and Complexity Analysis

The string tree described here is organized similarly to a trie tree [5] with the addition of node queues to store keyed objects, node flags to indicate the presence of stored objects, and parent pointers for key recovery. The node queues allow storage of objects with duplicate keys and enable the O(n) sort.

Insert(object with key k)

Keys used for inserting objects in the container may be any natural number or string, such as serial numbers, social security numbers, names, ages, descriptions, etc. To illustrate how the string tree container works, consider five objects with the following five prime number keys: 1997, 1999, 2003, 2011, and 2017. These can be represented by an integer data type or by a string literal such as a character array.

First, a StringTreeNode object is created with a null key character and null object queue to serve as the tree root. Then, the first digit (left end) of the first object's key is read and a new StringTreeNode is created with that digit being used for the key character. The new node pointer is stored in the root's array of children indexed by key character. The resulting structure is shown below in figure 1:


The first (leftmost) digit of 1997 is stored in the tree.

Then the process is repeated for the the next three digits of 1997:


Figure 2: The node of the last digit of 1997 is flagged (circle symbol) to
indicate that the object corresponding to the key (1997) is stored in this
node's object queue.

Storing the object with key 1997 took four operations to create key nodes and one operation to push the object onto the leaf node queue. Similarly, the object with key 1999 is stored:


Figure 3: The first three tree nodes of 1999 already exist so only the
leaf node for the digit 9 is created to store the object with the key 1999.

Storing the object with key 1999 took three operations to locate the node with the digit 9 at the third level of the tree, one operation to create the leaf with the digit 9, and one operation to push the object onto that node's queue. Next, the object with the key 2003 is stored:


Figure 4: The object with key 2003 is stored.

Finally, the objects with keys 2011 and 2017 are stored in a similar manner:


Figure 5: The objects with keys 2011 and 2017 are stored.

All these insert operations have a time complexity on the order of the depth of the tree, which is the number of digits or characters in the key. For bounded-length keys, the insert operation is performed in constant time. For unbounded keys, such as for storing the prime numbers themselves, the depth of the tree is log n, where n is the largest number stored (the number of primes up to integer m is O(m)). The key itself can be recovered by traversing upward through the parents to the root. A parent pointer is provided in the StringTreeNode for this purpose.

Note that data objects are not necessarily stored at leaves and that the string tree data structure is not a binary tree. For example, if we wanted to add data with the keys 13, 17 and 19 to the tree we would have:


Figure 6: The objects with keys 13, 17 and 19 are inserted in the tree.

The data structure allows for storing objects with duplicate keys because each node has a queue of data objects. Objects with the same key are stored by pushing them into the queue of the same node, and are recovered by popping them off until the queue is empty.

Find(an object with key k)

To locate an object in the tree, the root's array of children is examined at index d where d is the first digit (or character) of k. If the child is not null, the process is repeated on the next digit, and so on until the object is reached. If any child is null before the object is reached, the object is not in the tree and null is returned. The time for finding an object is on the order of the depth of the tree, which is constant for keys of bounded field width.

Count(objects with key k)

First the object is located (see above) and then the queue size is returned. Time complexity is also of the order of the depth of the tree.

Write Objects in Key Order

This function has two variants: one for objects with numerical keys and one for objects with string keys. We want to write the objects to an array in numerical order for the first case, and in alphabetical order in the second case. The following is an example of strings of digits (representing numbers) in numerical order:

1
10
11
101
111

and the following are the same strings in alphabetical order:

1
10
101
11
111

just as these strings are in alphabetical order:

b
ba
bab
bb
bbb

A convenient property of the string tree utility data structure is that a preorder (depth-first) traversal encounters the objects in alphabetical order of the keys, and a breadth-first traversal encounters the objects in numerical order.

The time required to visit every node in the tree is O(n) where n is the number of objects stored in the tree. This applies to both variants of the traversal function because each node of the tree is visited. This holds for trees where data objects are stored only in leaves because the number of internal nodes for any tree with a bounded branching factor (which is a property of the string tree) is a constant function of the number of leaves in the tree [4]. For string trees with unbounded depth (as in the case of unbounded key length (which must hold for the no-duplicates case)), the traversal time is O(n log n).

Implementation

The string tree utility data structure (STUD) was implemented to demonstrate general functionality and to perform performance tests to corroborate the complexity analysis. The STUD was implemented in C++ and the tests were performed with two kinds of data objects, people (with first name, last name, age, weight, and social security number (SSN)), and rocks (with a serial number ID (zero to n) and color represented as a character array, e.g. "red", "green", "blue", etc. (16 different colors).

Functions were written to create arrays of person pointers and rock pointers. The people were created by hard coding data. Rocks were created by assigning colors at random with sequencial keys. The rocks were then permuted randomly in the array to test the sorting functions.

A console menu user interface was created for testing as shown below in figure 7:


Figure 7: The user interface of the STUD demonstration program.

For option A, people's SSNs are not duplicated. Each node in the tree will have at most one data object stored in its queue. Figure 8 shows the output from option A:


Figure 8: People objects sorted by SSN. Data are totally fictional.

The people's ages are duplicated in some cases. Some nodes will have several person objects stored in their queues. Figure 9 shows the output from option B:


Figure 9: People sorted by age.

People's names include spaces and punctuation. The last and first names are concatenated before sorting and the output of option C is shown in figure 10:


Figure 10: People sorted by name.

Option D sorts rocks by sequential ID number (no duplicates) and its output is shown in figure 11:


Figure 11: Rocks sorted by serial number.

Option E sorts rocks by color (lots of duplicates) and its output is shown in figure 12:


Figure 12: Rocks sorted by color.

Here is the function to sort the rocks by color:

// Sort rocks by color:
void rocksByColor(int iNumRocks)
{
  int i = 0;                                    // Loop index.
  int iManual = 0;                              // Flag for manual entry.
  Rock** array;                                 // Array of rock pointers.
  StringTreeNode<Rock>* rockRoot = NULL;        // Root of the tree of rocks.
  double dfProc = 0;                            // Statistical information variable.
  
  if (iNumRocks == 0)                           // Get the number from the user.
  {
    iManual = 1;
    cout << endl;
    cout << "Sort rocks by color:" << endl << endl;
    cout << "How many rocks do you want to sort? ";
    cin >> iNumRocks;
  }

  array = new Rock*[iNumRocks];                 // Create the array of pointers.                 
  buildRocks(array, iNumRocks);                 // Create rocks randomly.

  rockRoot = new StringTreeNode<Rock>('.');     // The root of the tree.

  for (i = 0; i < iNumRocks; i++)               // Insert the rocks into the tree.
  {
    rockRoot->addObject(array[i]->getColor(), array[i]);
  }
  i = 0;
  rockRoot->preOrderStoreObjects(array, i);     // Traverse the tree in preorder
                                                // and store the rocks back in the

  dfProc = ((double) clock()) / CLOCKS_PER_SEC; // 1000 clocks per second.

  if (iManual)
  {
    cout << endl;
    for (i = 0; i < iNumRocks; i++)
    {
      array[i]->print();
    }
  }
  else
  {
    cout << endl << " Processor time = " << dfProc << endl;
  }
}
Notice that the sorting is accomplished by first inserting all the rocks into a new tree (time order n) and then preorder traversing the tree and storing the rocks in the array (time order n). This code illustrates that the STUD is easy to use for a sorting application.

Performance Tests

For performance testing a flag was set in the program to bypass the user interface so that the processor time gathered (with the time.h clock() function) would not include anything but the creation of an array of n randomly generated rocks, insertion of the n rocks into the tree from the array, and the gathering of the rocks in sorted order back into the array.

The resulting program output looked like this:


Figure 13: Output for 120,000 rocks sorted by color. Processor time is in seconds.

Tests were performed for rocks sorted by ID and by color. Raw data for the ID sort runs are shown here:

String Tree Utility Datastructure Performance Tests				
24-Mar-00				
Rocks sorted by unique numerical ID (no duplicates)
Number	Avg.	Time	Time	Time
1000	0.062	0.063	0.062	0.062
2000	0.130	0.141	0.125	0.125
3000	0.1875	0.187	0.188	
4000	0.2575	0.265	0.25	
5000	0.335	0.328	0.334	0.343
6000	0.4065	0.407	0.406	
7000	0.484	0.484	0.484	
8000	0.5545	0.562	0.547	
9000	0.64	0.64	0.64	
10000	0.7425	0.75	0.735	
11000	0.844	0.844	0.844	
12000	0.974	0.984	0.969	0.969
At least two runs were performed for each number of rocks. When graphed, the average data look like this:


Figure 14: Processor time for sorting rocks by numerical ID.

Notice the slight upward curve due to the O(n log n) performance predicted for this algorithm. Based on the difference in time from 1000 rocks to 2000 rocks, O(n log n) predicts a time of 0.96 seconds for 12000 rocks, within one percent agreement with the measured 0.97.

When graphed, the data for sorting rocks by color (many duplicates because there are only 16 colors) look like this:


Figure 15: Processor time for sorting rocks by color.

Notice that this indicates a linear sort, as analysis predicts.

Comparison Test

I wrote an applet in Java to compare sorting person objects by age using QuickSort versus StudSort:


Sort people by age:
compare StudSort and QuickSort.

Source code is available for educational purposes: StudTest.java is the main file. StringTreeNode.java, ListNode.java, Queue.java, and Person.java are necessary object files.

As will be noticed, StudSort is significantly faster with more than 5000 people. Some Java runtime systems will have memory shortages above a few hundred thousand people. Such problems are mitigated by programming in C++. Note also that results with larger numbers of people will be erratic due to the virtual memory paging in many systems. The implementation of QuickSort used here is adapted from Structured C for Engineering and Technology, third edition, by Adamson et al. [1] (page 403), and uses more memory than StudSort.

Conclusion

I have described a new data structure that organizes data by the sequence of characters or digits in the data objects' keys. In the case of primitive data types, the data item itself is its own key. This organization uses the sequence of digits or characters in the key to locate the object in a tree structure. This data structure allows efficient algorithms for insertion, location, deletion, and sorting of data. Duplicates are handled by queuing data objects in the tree node, preserving the order of objects with duplicate keys. The efficiency of these algorithms has been demonstrated by both analysis and by empirical tests.

References

[1] Tom Adamson, James L. Antonakos, and Kenneth C. Mansfield, Jr., Structured C for Engineering and Technology, third edition, Prentice-Hall, Inc., 1998.

[2] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest, Introduction to Algorithms, The MIT Press, 1990.

[3] Stefan Nilsson, "The Fastest Sorting Algorithm?," Dr. Dobb's Journal, April 2000.

[4] Kenneth H. Rosen, Discrete Mathematics and Its Applications, Fourth Edition, McGraw-Hill, 1999.

[5] Robert Sedgewick, Algorithms, Second Edition, Addison-Wesley, 1988.


Copyright 2000-2014 by Rick Wagner, all rights reserved.
United States patent 6,741,999, awarded May 25, 2004.

This page established March 25, 2000; last updated by Rick Wagner, January 29, 2014.