Collection Connection (Previously TPF Persistent Collections Corner)

Collection Connection (Previously TPF Persistent Collections Corner)
by Michele Dalbo, IBM TPF ID Core Team, and Daniel Jacobs, IBM TPF Development

With the release of program update tape (PUT) 9, TPF collection support (TPFCS) has renamed collection types, renamed APIs ... even the collection support itself has been renamed! So, it was no surprise to the writers of this article that we were also asked to change the name of this column. Anyway, to make up for all of this, we are going to discuss not two, not three, but five different types of collections! Coincidentally enough, they are all connected to the changes being made on PUT 9! Amazing! So let's get started ...

Sequence - The Collection Previously Known As Ordered
Characteristics:
A sequence collection is a collection of ordered elements that are fixed length and are accessed by relative position (index) in the sequence starting with index 1. The elements do not have keys. Sound familiar? It should. So far, that description matches the descriptions of array and log collections. These collection types are similar and have many application programming interfaces (APIs) in common, but there also are some substantial differences.

The elements in a sequence collection are ordered by the arrival of requests; that is, new elements are added to the end of the collection just as they are in a log. However, the elements in a sequence do not wrap as they do in a log. Furthermore, elements can be placed at specific positions in the sequence as can be done in an array. However, inserting elements in a sequence results in the new element being inserted into the collection at that position and shifting all subsequent elements by one position, whereas performing this operation in an array replaces the element at that location. Similarly, elements can be deleted from a sequence at which point all subsequent elements are shifted. Finally, elements in a sequence can be nonunique.

Advantages:
A sequence is a flexible structure and a good choice for representing data that is in list format. Items are naturally added to the end of the list, but items can also be inserted at the beginning or into the middle of the list and removed from the list. As a result, the relative positions of elements in a sequence may change; so, for instances where elements must remain in exact positions, an array should be considered.

Examples:
An example of a sequence might be to represent a waiting list. Let's say there is a large demand for your product and the product is out of stock. You can start a waiting list for the product and represent it with a sequence. Customers are generally added to the end of the list in a first-come, first-served fashion. However, you may want to put some customers higher up on the list because of extenuating circumstances or cash incentives (disclaimer: we in no way encourage the use of bribes). The sequence allows this to be done. Furthermore, customers can be taken off the top of the list as they are given the product or removed from the middle of the list if they are no longer interested in the product.

Another example is a program that maintains a list of the words in a paragraph. The order of the words is obviously important; you can add or remove words at a given position, but you cannot search for individual words except by iterating through the collection and comparing each word to the word you are searching for. You can add a word that is already present in the collection because a given word may be used more than once in a paragraph.

All five of the collections that we have now discussed (array, BLOB, log, and keyed log in previous articles) have been ordered with elements accessed using a relative position (index) in the collection. TPFCS also provides several collection types that allow elements to be ordered based on keys or user-defined fields. These will be discussed ... now!

Key Sorted Set - The Collection Previously Known As Dictionary
Characteristics:
A key sorted set is an ordered collection of elements that have two components: data, and a single key. Elements are sorted in ascending order according to the value of their key, which is not part of the data. The key values must be unique, but the data values can be nonunique. Although TPFCS only supports a single key, the key could consist of multiple subkeys that are concatenated together.

Advantages:
A key sorted set is designed to be used on data that needs to be sorted on a user-defined value for retrieval efficiency. Elements are automatically added to the collection in their appropriate place, and cursors can be used to iterate through such collections in ascending or descending order based on the key. Furthermore, independent APIs can be used to locate specific elements in the collection by key without using cursors.

Examples:
An example of an appropriate use for a key sorted set would be to represent a customer database that is accessed and updated by a customer identification (ID) number. The customer ID number is the key for the collection and the data consists of information about that customer, such as their name and address. No two customers have the same customer ID number. The collection is ordered in ascending order by customer ID so reports can be easily generated by iterating through the customers, and information about an individual customer can be easily retrieved by entering their ID number.

Another example is a program that keeps track of canceled credit card numbers and the individuals to whom they are issued. Each card number occurs only once and the collection is sorted by card number. When a merchant enters a customer's card number into a point-of-sale (POS) terminal, the collection is checked to see if that card number is listed in the collection of canceled cards. If the card number is found, the name of the individual is shown and the merchant is given directions for contacting the credit card company. If the card number is not found, the transaction can proceed because the card is considered to be valid. A list of canceled cards is printed out each month, sorted by card number, and distributed to all merchants who do not have an automatic POS terminal installed.

Key Sorted Bag
Characteristics:
A key sorted bag, introduced with PUT 9, has the same set of characteristics as a key sorted set except for one: nonunique key values are allowed. (That was easy to describe!)

Advantages:
A key sorted bag has all of the advantages of a key sorted set, and between the two collection types you also have the added flexibility of choosing a model based on whether you want to allow duplicate key values or not.

Examples:
One example of a key sorted bag is a customer order database similar to the one described in the key sorted set example. Instead of containing information about customers, this database contains information about orders that a customer places. Once again, the customer ID is the key; and because one customer can place multiple orders, we take advantage of the fact that nonunique key values are allowed.

Another example of using a key sorted bag collection is a program that maintains a list of sorted zip codes for marketing purposes. The key is the zip code. You can add an element whose key is already in the collection (because two people can have the same zip code) and you can generate a list of people sorted by zip code; however, you cannot locate a person except by their key because a key sorted bag does not support element equality.

Sorted Bag - The Collection Previously Known As Sorted
Characteristics:
A sorted bag is an ordered collection of elements that have a data area and no key. The user defines a sort field within the data area, and elements are sorted in ascending order according to the value of their sort field. The values of sort fields and the remaining data can be nonunique. TPFCS only supports a single sort field. The maximum element size is 1000 bytes and the maximum sort field is 248 bytes.

Advantages:
A sorted bag is designed to be used on data that needs to be sorted for retrieval efficiency on a user-defined value that is part of the data. Elements are automatically added to the collection in their appropriate place, and cursors can be used to iterate through such collections in ascending or descending order based on the sort field.

Examples:
As an example of using a sorted bag, consider the customer database from the key sorted set example. Suppose a customer moves and needs to update the address in their customer record, but they do not know the customer ID that was assigned to them. The TO2_asSorted API can be used to maintain a separate collection that lists all customers sorted by a last name/first name concatenation. It is possible that two people with the same first and last name will be in the database (which is OK because sorted bags allow entries with duplicate sort fields), but the customer service representative who is updating the database will be able to quickly call up each of the individual records and find identifying information to select the appropriate record. Furthermore, reports sorted by name can be generated easily.

Another example of a sorted bag might be to keep a record of all transactions made to a bank account. You may remember in our last issue that we recommended using a log for this type of data model. However, let's say that you don't know how many entries per day will be stored in the collection and you don't want your collection elements to automatically wrap. A sorted bag would be a good choice for this model. The entries would be sorted in chronological order (with duplicate sort fields allowed) for easy report generation, but they do not have to be added in chronological order (as they would in a log collection). A maintenance program could be set up to delete elements that are a certain age.

Sorted Set
Characteristics
A sorted set, introduced with PUT 9, has the same set of characteristics as a sorted bag except for one: unique sort field values are required. (That was easy to describe too!)

Advantages:
A sorted set has all of the advantages of a sorted bag; and between the two collection types you also have the added flexibility of choosing a model based on whether you want to allow duplicate sort field values or not.

Examples:
An example of using a sorted set is a program for entering observations on the types of stones found in a riverbed. Observers are only concerned with the different types of stones found, not the quantity of each; so a mineral type is stored only once in the collection. Each time you find a stone on the riverbed, you attempt to enter the mineral type of the stone into the collection; duplicate requests will be ignored. You can search for stones of a particular mineral type and display the contents of the collection, sorted by mineral type, if you want a complete list of observations made to date.

Another example of using a sorted set collection is a program that tests numbers to see if they are prime. Two complementary sorted sets are used: one for prime numbers and one for nonprime numbers. When you enter a number, the program first looks in the set of nonprime numbers. If the value is found there, the number is nonprime. If the value is not found there, the program looks in the set of prime numbers. If the value is found there, the number is prime. Otherwise, the program determines whether the number is prime or nonprime and places it in the appropriate sorted set. The program can also display a list of prime or nonprime numbers beginning at the first prime or nonprime number following a given value because the numbers in a sorted set are sorted from smallest to largest.

Key Sorted versus Sorted Collections
Hopefully, these examples have given you a better idea of how each of these collection types can be used. At this point, you may still be wondering when you would use a keyed collection (key sorted bag or key sorted set) instead of a sorted collection (sorted bag or sorted set), or vice versa. Following are a couple of extra hints.

The sorted collections allow elements to be added using an atomic API (TO2_add), but reading and removing elements requires the use of cursors and the TO2_locate API. In contrast, keyed collections allow you to access a particular collection element using atomic APIs, such as TO2_atKey and TO2_removeKey, in addition to allowing cursors to be used. However, keyed collections also require separate API calls to retrieve the key and data components of an element.

If the field that you want to sort on is already naturally part of your data (such as a name or date), a sorted collection might be appropriate. On the other hand, if the field you want to order your data by is extraneous and possibly assigned to the data (such as a customer ID number), that field should be handled as a key in a keyed collection.

Additionally, by looking at the examples for the key sorted set and the sorted bag, notice two important aspects of these examples. First, if your customer database did not have a key (such as the customer ID), you could use a sorted bag or sorted set with a sort field (such as the name) instead of a key sorted bag or key sorted set and still be able to use most of the functionality provided by keyed collections. Secondly, the TO2_asSorted API can be used to sort most collection types into a sorted bag with a user-defined order.

More to Come
We know you will be anxious to hear about other types of sets and bags that are available, so make a note to read the next edition for information about more of these fascinating collections! If you cannot wait and would like an “at a glance” comparison of all the collection types, see the collection support summary tables in the TPF Application Programming publication.