The design space for sequence data structures

Next: Experimental comparison of Up: A General Model Previous: A General Model

The design space for sequence data structures

This model can be used to examine the design space for sequence data structures.

The four basic methods seem to cover most of the useful cases. A basic method is one where the number of descriptors is fixed. The array method uses two descriptors and the gap method uses three. We could generalize this to a two gap method using four descriptors and so on but it is not clear that there is any advantage in doing that. The linked list method is basic even though it uses a descriptor for every item in the sequence. The items are linked together so one descriptor serves to define the sequence. As soon as we try to put two or more items into a descriptor it becomes an instance of the recursive fixed size buffer method.

Using a more complex linked structure than a linked list will make reading items (ItemAt) more efficient but makes Insert and Delete less efficient. Since ItemAt access is nearly always sequential this tradeoff is not advantageous.

The recursive methods divide into two types. The first uses fixed size buffers and the second uses variable sized buffers.

There are two issues in the fixed size buffer method The first is the method used in maintaining the items in each fixed size buffer and the second is the method of maintaining the sequence of buffers.

The items in a single (fixed size) buffer are a sequence. The typical implementation keeps them as an array at the beginning of the buffer. In general the gap method is superior to the array method, thus it might be useful to keep characters in a single buffer using the gap method where the gap is kept at the last edit point. This should halve the expected number of bytes moved for one edit at little additional program cost. If the edits exhibit locality (as they typically do) the advantage will be greater.

A linked list is usually used to implement the sequence of descriptors (buffer pointers) and this is generally a good method because of the ease of inserting and deleting descriptors (which are frequent operations). One problem is the extra space used by the links but this is only a problem if there are lots of descriptors. With 1K blocks and editing 5M of files, this is still only 5K descriptors with 10K links. Thus this problem does not seem to be significant in practical cases.

Another issue is virtual memory performance. Linked lists do not exhibit good locality. If the descriptors were kept using the gap method, locality would be improved considerably. Assuming a descriptor takes four words (one for the pointer to the span, one for the length of the span and two for the links) the 5K descriptors would consume 20K words or 80K bytes (assuming four bytes per word). Again this is small enough that virtual memory performance would probably not be a significant problem, but if it were, the gap method could improve things.

The piece table method uses fewer descriptors than the fixed buffer method initially (before any editing) but heavy editing can create numerous pieces. There are advantages to maintaining the pieces since it allows easy implementation of unlimited undo. In addition, pieces can be used to record structures in the text (as described in section 7.1). As a consequence there might be many pieces. This means the problems presented above in the discussion of fixed size buffers (space consumed by links and virtual memory performance) might be significant here. That is, the gap method of keeping the piece table might be preferred.

Another generalization would be to use two levels of recursion, that is, to use one of the recursive sequence data structures to implement the sequence of descriptors. The recursive methods are beneficial when the sequences are quite large so we might use a two-level recursive method if the number of descriptors was quite large. As we mentioned above, this might be the case with a piece table.

So there are four new variations that we have uncovered in this analysis.

The fixed size buffers method using the gap method inside each buffer.
The fixed size buffers method using the gap method for descriptors. This might be better if virtual memory performance was a consideration.
The piece table method using the gap method for descriptors. This might be better if virtual memory performance was a consideration.
A two level recursive method that uses a recursive method to maintain the sequence of descriptors. This would suitable if there is a very large number of descriptors.

Next: Experimental comparison of Up: A General Model Previous: A General Model

Charlie Crowley
Thu Jun 27 15:36:10 MDT 1996