Creating a successful single column index is relatively straight forward. In my last post, I showed that the performance of a single column index is based on the indexed field being one of the parameters in the query’s WHERE clause and also the cardinality of the indexed field.
Let’s examine the simple two-column index first. Two-column indexes are simpler to construct than three or more column indexes.
Cardinality becomes even more important for multi column indexes. For example, if we have the requirement to index two fields foo and bar of a table, the order in which we specify the columns within the index, is critical to its overall performance.
Suppose we have a total of some 2,000 records, of which we have 750 possible discrete values for the foo field and 1600 discrete values for bar. We then have a cardinality of 750 for foo and 1600 for bar. If we created the index we’d choose the field with the highest cardinality to be the first column of the index. Thus it would be bar first then foo second.
We can illustrate why this is good policy with a very simple example. An example of a real world index in action is a telephone book. If a telephone book is indexed by each subscriber’s first name and surname, it takes much longer to find a person’s telephone number then it normally does. This is because the surname field has a much higher cardinality then the first name field. It follows that, if you found a set of subscribers matching a particular surname (Smith say), you’d have a smaller subset of records to check the second column against, then if you had found a subset by subscriber’s first name (Joe say).
If I was searching for “Joe Smith”, in case 1, I would have to examine the second column for three values and in case 2 only two values would need to be examined. So we can see why the field with that highest cardinality should be the first column of the index.
What about the situation for indexes of three or more columns?
Deciding on which fields to locate in the second and succeeding columns provide a more complex cardinality issue in the case of indexes with three or more columns. Just using the field with next highest cardinality doesn’t necessarily produce the most efficient index. The cardinality must be examined in relation to a typical value for the previous column. In other words, for a typical value of column one, we need to compare the equivalent cardinality of each the prospective fields to find the most attractive candidate.
Let’s take an example:
We need to construct an index from the Hotel, RoomType and Date fields. Using our rules we can assign the first column of the index to the field with the highest cardinality. This is the Hotel field with a cardinality of 8. For the second column we need to assess what we believe to be the average cardinality of both the RoomType and Date fields for a typical hotel field value.
When we check this we get the following table of results.
This gives us an average cardinality of 1.875 versus 1.75, thus the RoomType field is the best candidate for the second column of the index, leaving the Date field for the last column.
I found it relatively easy to construct an SQL statement to help calculate average specific cardinalities like this for real data in real tables, but I felt my query was too database specific (MS-SQL in this case) to detail it here, I contend that any simple solution to do this would suffer from the same issue.
During the short peer-review of this article, it was pointed out to me that commercial database offerings have tools to do this task for you. My feelings are that it can’t hurt to understand the basic principals underpinning this functionality, and it always pays to understand how something works. It was also pointed out to me, that even if an efficient multi-column index is created, the query optimizer may choose to never use it. This is because the optimizer looks at further statistics to determine if the index would be efficient enough or not. So it’s important to verify that your index is used for the queries it was constructed for.
Next time we’ll discuss when not to create a new index, why having too many indexes is bad and when to maintain an index.
It may be database specific, but please detail it anyway…
> found it relatively easy to construct an SQL statement to help calculate average specific cardinalities like this for real data in real tables, but I felt my query was too database specific (MS-SQL in this case) to detail it here, I contend that any simple solution to do this would suffer from the same issue.
I think this is worthy of a separate blog entry Will. So, I’ll knock one up in the next few days.