Next: Definitions Up: Data Structures for Previous: The C Interface

Comparing sequence data structures

In order to compare sequence data structures it is necessary to know the relative frequency of the five basic operations in typical editing.

The NewSequence operation is infrequent. Most sequence data structures require the NewSequence operation to scan the entire file and convert it to some internal form. This can be a problem with very large files. To look through a very large file, it must be read in any case but if the sequence data structure does not require sequence preprocessing it can interleave the file reading with text editor use rather than requiring the user to wait for the entire file to be read before editing can begin. This is an advantage and means the user does not have to worry about how many files are loaded or how large they are, since the cost of reading the large file is amortized over its use.

The Close operation is also infrequent and not normally an important factor in comparing sequence data structures.

The ItemAt operation will, of course, be very frequent, but it will also be very localized as well because characters in a text editor are almost always accessed sequentially. When the screen is drawn the characters on the screen are accessed sequentially beginning with the first character on the screen. If the user pages forward or backwards the characters are accessed sequentially. Searches normally begin at the current position and proceed forward or backward sequentially. Overall, there is almost no random access in a text editor. This means that, although the ItemAt operation is very frequent, its efficiency in the general case is not significant. Only the case where a character is requested that is sequentially before or after the last character accessed needs to be optimized.

To test this hypothesis I instrumented a text editor and looked at the ItemAt operations in a variety of text editing situations. The number of ItemAt calls that were sequential (one away from the last ItemAt) was always above 97% and usually above 99%. The number of ItemAts that were within 20 items of the previous ItemAt was always over 99%. The average number of characters from one ItemAt to the next was typically around 2 but sometimes would go up to as high as 10 or 15. Overall these experiments verify that the positions of ItemAt calls are extremely localized.

Thus sequence data structures should be optimized for edits (inserts and deletes) and sequential character access. Since caching can be used with most data structures to make sequential access fast, this means that the efficiency of inserts and deletes is paramount. With regard to Inserts and Deletes one would assume that there would be more Inserts than Deletes but that there would be a mix between them in typical text editing.

In all sequence data structures, ItemAt is much faster than Inserts and Deletes. If we consider typical text editing, the display is changed after each Insert and Delete and this requires some number of ItemAts, often just a few characters around the edit but possibly the whole rest of the window (if a newline in inserted or deleted).

The criteria used for comparing sequence data structures are:

The time taken by each operation
The paging behavior of each operation
The amount of space used by the sequence data structure
How easily it fits in with typical file and IO systems
The complexity (and space taken by) the implementation

Later I will present timings comparing the basic operations for a range of sequence data structures. These timings will be taken from example implementations of the data data structures and a text editor simulator that calls these implementations.

Next: Definitions Up: Data Structures for Previous: The C Interface

Charlie Crowley
Thu Jun 27 15:36:10 MDT 1996