The central data structure in a text editor is the one that manages the sequence of characters that represents the current state of the file that is being edited. Every text editor requires such a data structure but books on data structures do not cover data structures for text sequences. Articles on the design of text editors often discuss the data structure they use [1, 3, 6, 8, 11, 12] but they do not cover the area in a general way. This article is concerned with such data structures.

Figure 1 shows where sequence data structures fit in with other data structures.

Some ordered sets are ordered by something intrinsic in the items in the sets (e.g., the value of an integer, the lexicographic position of a string) and the position of an inserted item depends on its value and the values of the items already in the set. Such data structures are mainly concerned with fast searching. Data structures for this type of ordered set have been studied extensively.

The other possibility is for the order to be determined by where the items are placed when they are inserted into the set. If insert and delete is restricted to the two ends of the ordering then you have a deque. For deques, the two basic data structures are an array (used circularly) and a linked list. Nothing beyond this is necessary due to the simplicity of the ADT interface to deques. If you can insert and delete items from anywhere in the ordering you have a sequence. An important subclass is sequences where reading an item in the sequence (by position number) is extremely localized. This is the case for text editors and it is this subclass that is examined in this paper.

A linked list and an array are the two obvious data structures for a sequence. Neither is suitable for a general purpose text editor (a linked list takes up too much memory and an array is too slow because it requires too much data movement) but they provide useful base cases on which to build more complex sequence data structures. The gap method is a simple extension of an array, it is simply an array with a gap in the middle where characters are inserted and deleted. Many text editors use the gap method since it is simple and quite efficient but the demands on a modern text editor (multiple files, very large files, structured text, sophisticated undo, virtual memory, etc.) encourage the investigation of more complicated data structures which might handle these things more effectively.

The more sophisticated sequence data structures keep the sequence as a recursive sequence of spans of text. The line span method keeps each line together and keeps an array or a linked list of line pointers. The fixed buffer method keeps a linked list of fixed size buffers each of which is partially full of text from the sequence. Both the line span method and the fixed buffer method have been used for many text editors.

A less commonly used method is the piece table method which keeps the text as a sequence of ``pieces'' of text from either the original file and an ``added text'' file. This method has many advantages and these will become clear as the methods are presented in detail and analyzed. A major purpose of this paper is to describe the piece table method and explain why it is a good data structure for text sequences.

Looking at methods in detail suggests a general model of sequence data structures that subsumes them all. Based on examination of this general model I will propose several new sequence data structures that do not appear to have been tried before.

It is difficult to analyze these algorithms mathematically so an experimental approach was taken. I have implemented a text editor simulator and a number of sequence data structures. Using this as an experimental text bed I have compared the performance of each of these sequence data structures under a variety of conditions. Based on these experiments and other considerations I conclude with recommendations on which sequence data structures are best in various situations. In almost all cases, either the gap method or the piece table method is the best data structure.

Thu Jun 27 15:36:10 MDT 1996