The piece table method

Next: A General Model Up: Recursive Sequence Data Previous: Fixed size buffers

The piece table method

In the piece table method the sizes of the spans are as large as possible but are split as a result of the editing that is done on the sequence. The sequence starts out as one big span and that gets divided up as insertions and deletions are made. We call each span a piece (of the sequence) and its descriptor is called a piece descriptor. The sequence of piece descriptors is called the piece table.

The piece table method uses two buffers. The first (the file buffer) contains the original contents of the file being edited. It is of fixed size and is read-only. The second (the add buffer) is another buffer that can grow without bound and is append-only. All new items are placed in this buffer.

Each piece descriptor points to a span in the file buffer or in the add buffer. Thus a descriptor must contain three pieces of information: which buffer (a boolean), an offset into that buffer (a non-negative integer) and a length (a positive integer).

Figure 7 shows a piece table structure.

Figure 7: The piece table method

The file consists of five pieces.

Initially there is only one piece descriptor which points to the entire file buffer. A delete is handled by splitting a piece into two pieces. One of these pieces points to the items in the old piece before the deleted item and the other piece points to the items after the deleted item. A special case occurs when the deleted item is at the beginning or end of the piece in which case we simply adjust the pointer or the piece length.

An insert is handled by splitting the piece into three pieces. The first piece points to the items of the old piece before the inserted item. The third piece points to the items of the old piece after the inserted item. The inserted item is appended to the end of the add file and the second piece points to this item. Again there are special cases for in insertion at the beginning or end of a piece.

If several items are inserted in a row, the inserted items are combined into a single piece rather than using a separate piece for each item inserted.

Figures 8, 9 and 10 show the effect of a delete and an insert in a piece table. Figure 8 shows the piece table after the file is read in initially. This is a very short file containing only 20 characters. Figure 9 shows the piece table after the word ``large '' has been deleted. Figure 10 shows the piece table after the word ``English '' has been added. Notice that, in general, a delete increases the number of pieces in the piece table by one and an insert increases the number of pieces in the piece table by two.

Figure 8: A piece table before any edits

Figure 9: A piece table after a delete

Figure 10: A piece table after a delete and an insert

Let us look at another example. Suppose we start with a new file that is 1000 bytes long and make the following edits.

Six characters inserted (typed in) after character 900.
Character 600 deleted.
Five characters inserted (typed in) after character 500.

The piece table after these edits will look like this:

The ``logical offset'' column does not actually exist in the piece table but can be computed from it (it is the running total of the lengths). These logical offsets are not kept in the piece table because they would all have to be updated after each edit.

The piece table method has several advantages.

The original file is never changed so it can be a read-only file. This is advantageous for caching systems since the data never changes.
The add file is append-only and so, once written, it never changes either.
Items never move once they have been written into a buffer so they can be pointed to by other data structures working together with the piece table.
Undo is made much easier by the fact that items are never written over. It is never necessary to save deleted or changed items. Undo is just a matter of keeping the right piece descriptors around. Unlimited undoes can be easily supported.
No file preprocessing is required. The initial piece can be set up only knowing the length of the original file, information that can be quickly and easily obtained from the file system. Thus the size of the file does not affect the startup time.
The amount of memory used is a function of the number of edits not the size of the file. Thus edits on very large files will be quite efficient.

The above description implies that a sequence must start out as one big piece and only inserts and deleted can add pieces. Following this rule keeps the number of pieces at a minimum and the fewer pieces there are the more efficient the ItemAt operations are. But the text editor is free to split pieces at other times to suit its purposes.

For example, a word processor needs to keep formatting information as well as the text sequence itself. This formatting information can be kept as a tree where the leaves of the tree are pieces. A word in bold face would be kept as a separate piece so it could be pointed to by a ``bold'' format node in the format tree. The text editor Lara [8] uses piece tables this way.

As another example, suppose the text editor implements hypertext links between any two spans of text in the sequence. The span at each end of the link can be isolated in a separate piece and the link data structure would point to these two pieces. This technique is used in the Pastiche text editor [5] for fine-grained hypertext.

These techniques work because piece table descriptors and items do not move when edits occur and so these tree structures will be maintained with little extra work even if the sequence is edited heavily.

Overall, the piece table is an excellent data structure for sequences and is normally the data structure of choice. Caching can be used to speed up this data structure so it is competitive with other data structures for sequences.

Piece tables are used in the text editors: Bravo [10], Lara [8], Point [4] and Pastiche [5]. Fraser and Krishnamurthy [7] suggest the use of piece tables as a way to implement their idea of ``live text''.

Next: A General Model Up: Recursive Sequence Data Previous: Fixed size buffers

Charlie Crowley
Thu Jun 27 15:36:10 MDT 1996