Algorithms and Data Structures

The File or Sequence

Another elementary structuring method is the sequence. A sequence is typically a homogeneous structure like the array. That is, all its elements are of the same type, the base type of the sequence. We shall denote a sequence s with n elements by
s = <s₀, s₁, s₂, … , sₙ₋₁>
n is called the length of the sequence. This structure looks exactly like the array. The essential difference is that in the case of the array the number of elements is fixed by the array’s declaration, whereas for the sequence it is left open. This implies that it may vary during execution of the program. Although every sequence has at any time a specific, finite length, we must consider the cardinality of a sequence type as infinite, because there is no fixed limit to the potential length of sequence variables.
A direct consequence of the variable length of sequences is that a fixed amount of storage cannot be allocated to sequence variables. Instead, storage has to be allocated during program execution, namely whenever the sequence grows. Storage can possibly be reclaimed when the sequence shrinks. In any case, a dynamic storage allocation scheme must be employed. All structures with variable size share this property, which is so essential that we classify them as advanced structures in contrast to the fundamental structures discussed so far.
What, then, causes us to place the discussion of sequences in this chapter on fundamental structures? The primary reason is that the storage management strategy is sufficiently simple for sequences (in contrast to other advanced structures), if we enforce a certain discipline in the use of sequences. In fact, under this proviso the handling of storage can safely be delegated to a mechanism that can be guaranteed to be reasonably effective. The secondary reason is that sequences are indeed ubiquitous in all computer applications. This structure is prevalent in all cases where different kinds of storage media are involved, i.e. where data are to be moved from one medium to another, such as from disk or tape to primary store or vice versa.
The discipline mentioned is the restriction to sequential access only. By this we mean that a sequence is inspected by strictly proceeding from one element to its immediate successor, and that it is generated by repeatedly appending an element at its end. The immediate consequence is that elements are not directly accessible, with the exception of the one element which currently is up for inspection. It is this accessing discipline which fundamentally distinguishes sequences from arrays. As we shall see in Chapter 2, the influence of an access discipline on programs is profound.
The advantage of adhering to sequential access, which after all is a serious restriction, is the relative simplicity of the storage management it requires. But even more important is the possibility of using effective buffering techniques when moving data to or from secondary storage devices. Sequential access allows us to feed streams of data through pipes between the different media. Buffering implies the collection of sections of a stream in a buffer, and the subsequent shipment of the whole buffer content once the buffer is filled. This results in significantly more effective use of secondary storage. Given sequential access only, the buffering mechanism is reasonably straightforward for all sequences and all media. It can therefore safely be built into a system for general use, and the programmer need not be burdened by incorporating it in the program. Such a system is usually called a file system, because the high-volume, sequential access devices are used for permanent storage of (persistent) data, and they retain them even when the computer is switched off. The unit of data on these media is commonly called a (sequential) file. Here we will use the term file as a synonym for sequence.
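The buffering idea described above can be illustrated by a minimal sketch. It is written in Python rather than in the notation of this text, and the names BlockWriter, device, and capacity are assumptions introduced only for illustration. The "device" is merely a list that records each shipment, so the sketch shows the collecting and shipping of whole blocks, not a real storage medium.

```python
class BlockWriter:
    """Hypothetical buffered writer: collects elements and ships whole blocks."""

    def __init__(self, device, capacity=4):
        self.device = device      # stands in for a secondary storage device
        self.capacity = capacity  # number of elements per shipment
        self.buffer = []

    def write(self, x):
        # Append to the buffer; ship the whole buffer once it is full.
        self.buffer.append(x)
        if len(self.buffer) == self.capacity:
            self.flush()

    def flush(self):
        # Ship whatever is left, e.g. when the file is closed.
        if self.buffer:
            self.device.append(list(self.buffer))
            self.buffer.clear()


device = []
w = BlockWriter(device, capacity=4)
for x in range(10):
    w.write(x)
w.flush()  # ship the final, partially filled buffer
# Ten writes reach the device as three transfers: two full blocks and one partial.
```

The point of the design is that the number of device transfers depends on the block size rather than on the number of elements written, which is possible only because the elements arrive in strictly sequential order.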
On certain storage media sequential access is indeed the only kind of access possible. Among them are evidently all kinds of tapes. But even on magnetic disks each recording track constitutes a storage facility allowing only sequential access. Strictly sequential access is the primary characteristic of every mechanically moving device and of some other ones as well.
It follows that it is appropriate to distinguish between the data structure, the sequence, on one hand, and the mechanism to access elements on the other hand. The former is declared as a data structure, the latter typically by the introduction of a record with associated operators, or, according to more modern terminology, by a rider object. The distinction between data and mechanism declarations is also useful in view of the fact that several access points may exist concurrently on one and the same sequence, each one representing a sequential access at a (possibly) different location.
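The separation of data and access mechanism can be sketched as follows. This is a Python illustration under stated assumptions, not the definition used in this text: the class name Rider and its fields pos and eof are chosen to resemble the terminology above, and the sequence is represented by an ordinary list. Two riders are placed on the same sequence at different positions, and each advances independently.

```python
class Rider:
    """Hypothetical access mechanism: one reading position on a shared sequence."""

    def __init__(self, seq, pos=0):
        self.seq = seq    # the data structure (shared, not copied)
        self.pos = pos    # this rider's own position
        self.eof = False  # set when a read runs past the end

    def read(self):
        # Return the element at the current position and advance.
        if self.pos < len(self.seq):
            x = self.seq[self.pos]
            self.pos += 1
            return x
        self.eof = True
        return None


data = ["a", "b", "c", "d"]  # the sequence itself
r1 = Rider(data)             # first access point, at the beginning
r2 = Rider(data, 2)          # second access point, further along

first = r1.read()   # reads "a"; r1 moves to position 1
third = r2.read()   # reads "c"; r2 moves to position 3
second = r1.read()  # reads "b"; r1 is unaffected by r2
```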
We summarize the essence of the foregoing as follows:
1. Arrays and records are random access structures. They are used when located in primary, random-access store.
2. Sequences are used to access data on secondary, sequential-access stores, such as disks and tapes.
3. We distinguish between the declaration of a sequence variable, and that of an access mechanism located at a certain position within the sequence.

Elementary File Operators

The discipline of sequential access can be enforced by providing a set of sequencing operators through which files can be accessed exclusively. Hence, although we may here refer to the i-th element of a sequence s by writing sᵢ, this shall not be possible in a program.
Sequences (files) are typically large, dynamic data structures stored on a secondary storage device. Such a device retains the data even if a program is terminated, or a computer is switched off. Therefore the introduction of a file variable is a complex operation connecting the data on the external device with the file variable in the program. For this reason we define the type File in a separate module, whose definition specifies the type together with its operators. We call this module Files and postulate that a sequence or file variable must be explicitly initialized (opened) by calling an appropriate operator or function:
VAR f: File
f := Open(name)
where name identifies the file as recorded on the persistent data carrier. Some systems distinguish between opening an existing file and opening a new file:
f := Old(name)
f := New(name)
The disconnection between secondary storage and the file variable then must also be explicitly requested by, for example, a call of Close(f).
Evidently, the set of operators must contain an operator for generating (writing) and one for inspecting (reading) a sequence. We postulate that these operations apply not to a file directly, but to an object called a rider, which itself is connected with a file (sequence), and which implements a certain access mechanism. The sequential access discipline is guaranteed by a restrictive set of access operators (procedures).
A sequence is generated by appending elements at its end after having placed a rider on the file. Assuming the declaration
VAR r: Rider
we position the rider r on the file f by the statement
Set(r, f, pos)
where pos = 0 designates the beginning of the file (sequence). A typical pattern for generating the sequence is:
WHILE more DO
  compute next element x; Write(r, x)
END
A sequence is inspected by first positioning a rider as shown above, and then proceeding from element to element. A typical pattern for reading a sequence is:
Read(r, x);
WHILE ~r.eof DO
  process element x; Read(r, x)
END
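The two patterns can be tried out on an actual file. The following is a minimal sketch in Python rather than in the notation of this text. It assumes that the elements are 32-bit integers, and the function names write_sequence and read_sequence are invented for illustration. Python's built-in open plays the role of New and Old, leaving the with block corresponds to Close, and an incomplete read takes the place of the flag r.eof.

```python
import os
import struct
import tempfile

ELEM = struct.Struct("<i")  # assumed element type: 32-bit little-endian integer


def write_sequence(path, values):
    # Generate the sequence by appending one element after another.
    with open(path, "wb") as f:        # corresponds roughly to New(name)
        for x in values:               # WHILE more DO
            f.write(ELEM.pack(x))      #   Write(r, x)
    # leaving the with block closes the file and flushes buffers


def read_sequence(path):
    # Inspect the sequence with the read-ahead pattern shown above.
    result = []
    with open(path, "rb") as f:            # corresponds roughly to Old(name)
        chunk = f.read(ELEM.size)          # Read(r, x)
        while len(chunk) == ELEM.size:     # WHILE ~r.eof DO
            (x,) = ELEM.unpack(chunk)
            result.append(x)               #   process element x
            chunk = f.read(ELEM.size)      #   Read(r, x)
    return result


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "numbers.seq")
    write_sequence(path, [3, 1, 4, 1, 5])
    restored = read_sequence(path)  # the same five values, in the same order
```

Note that the first read precedes the loop, exactly as in the pattern: the end of the sequence is detected only by attempting to read past it.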
Evidently, a certain position is always associated with every rider. It is denoted by r.pos. Furthermore, we postulate that a rider contain a predicate (flag) r.eof indicating whether a preceding read operation had reached the sequence’s end. We can now postulate and describe informally the following set of primitive operators:
1a. New(f, name) defines f to be the empty sequence.
1b. Old(f, name) defines f to be the sequence persistently stored with given name.
2. Set(r, f, pos) associates rider r with sequence f, and places it at position pos.
3. Write(r, x) places element with value x in the sequence designated by rider r, and advances.
4. Read(r, x) assigns to x the value of the element designated by rider r, and advances.
5. Close(f) registers the written file f in the persistent store (flushes buffers).
Note: Writing an element within a sequence is often a complex operation. In most cases, however, files are created by appending elements at the end.
In order to convey a more precise understanding of the sequencing operators, the following example of an implementation is provided. It shows how they might be expressed if sequences were represented by arrays. This example intentionally builds upon concepts introduced and discussed earlier, and it involves neither buffering nor sequential stores which, as mentioned above, make the sequence concept truly necessary and attractive. Nevertheless, this example exhibits all the essential characteristics of the primitive sequence operators, independently of how the sequences are represented in store.