Monday, July 14, 2014

The importance of what is not said



Scenario #1: the magic coin

Someone flips a coin four times and obtains the following values:
{Head, Tail, Head, Head}
Based on the outcome of the above experiment he reports:
My coin is magic: it has a 3/4 probability of Heads, and 1/4 of Tails.

 

Scenario #2: the voting visionary

An election will be held in your country next weekend. There are two candidates: A and B. Someone interviews ten people and obtains the following voting intentions:
{A, B, B, B, A, A, A, A, A, B}
Based on the above he reports:
The result of the election will be A: 60%, B: 40%.

Scenario #3: the peak usage

You measure the system usage at the peak hour during three days. The measured values are:
{70%, 70%, 85%}
Based on the above you report:
The average system usage at the peak hour is 75%.


Think

Surely you consider the man in the coin scenario and the one in the voting scenario silly, or at least utterly ignorant, because they are making bold predictions based on very little empirical evidence. However, you probably accept with no objections what the man in the system usage scenario (by the way, you!) is telling you, but…
…THINK!
There is nothing significantly different among the three scenarios. So if there is something wrong with the coin and the voting cases, the same must be wrong with the peak usage case.

 

The confidence interval

When you try to estimate a certain quantity based on a limited number of measurements, the reported value is affected by what is called the statistical error. This is inherent to any product of sampling. This statistical error is not an “error” in the common sense of the word; instead, it expresses the precision or reliability of the reported figure. Instead of having a single value, you have an interval, that is, you MUST say:
The value is likely to be between this and that.
The length of this interval corresponds to the imprecision / uncertainty. If the length is small, the precision is high. If the length is large, the precision is low. Clearly it is not the same to say
The head probability is likely to be between 0.2 and 0.8.
a very low precision determination, as to say
The head probability is likely to be between 0.499 and 0.501.
a much higher precision one.

The confidence interval is the official name, in maths or statistics, for such an interval.
The degree of “likelihood” (…is likely to be…) is quantified by what is known as the confidence level. Typical values are 90% or 95%. The loose meaning of a 90% confidence level is that there is a 90% probability that the true value (the one we are trying to determine) lies within the interval (and, consequently, a 10% probability that it falls outside).
The confidence interval is centered around the sample mean, the arithmetic average of the data (the measured values):
Center of the confidence interval =  AVERAGE(data)
where data is the sample, or set of measured data points.

The confidence interval length is typically estimated with the following formula:
Length of the confidence interval =2* T.INV.2T(1-conf, size-1)*STDEV.S(data)/SQRT(size)
where size is the sample size (number of data points), conf is the desired confidence level, and T.INV.2T(), STDEV.S() and SQRT() are worksheet (Excel) functions. (I’m not focusing here on the details; I’m paying more attention to the formula’s consequences and dependencies than to the formula itself.)
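For readers who prefer code to spreadsheets, here is a minimal Python sketch of the same formula, using scipy's t-distribution in place of the Excel functions (the function and variable names are my own, not from the post; the sample data anticipates the peak usage scenario below):

```python
from math import sqrt
from statistics import mean, stdev  # stdev is the sample stdev, like STDEV.S
from scipy.stats import t           # ppf is the inverse CDF of the t-distribution

def confidence_interval(data, conf=0.90):
    """Return (low, high) for the mean of `data` at confidence level `conf`.

    Mirrors the spreadsheet formula:
      half_length = T.INV.2T(1 - conf, n - 1) * STDEV.S(data) / SQRT(n)
    """
    n = len(data)
    center = mean(data)
    # Excel's T.INV.2T(p, df) equals scipy's t.ppf(1 - p/2, df)
    t_crit = t.ppf(1 - (1 - conf) / 2, n - 1)
    half = t_crit * stdev(data) / sqrt(n)
    return center - half, center + half

low, high = confidence_interval([70, 70, 85], conf=0.90)
print(round(low, 1), round(high, 1))  # roughly 60.4 and 89.6
```

Note the translation: Excel's T.INV.2T takes a two-tailed probability, while scipy's `t.ppf` takes a one-tailed cumulative probability, hence the `1 - (1 - conf) / 2`.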

This length depends on the following factors: 

  •  the sample size or number of data points: the length decreases, and the precision increases, when the number of data points increases.
  •  the sample variability/dispersion: the length increases, and the precision decreases, when the data is noisy, erratic, highly variable. 
  •  the confidence level: the length increases when the confidence level increases. Reason? If you want increased certainty in your report, a larger “safety” margin is needed.

To get a better idea of this, look at the following table, which shows the approximate confidence interval length versus the sample size for a variable that can range from 0 to 100, at a confidence level of 90%. To achieve a precision of 3.5% you need around 1000 measurements!

Size   Length
5      64
10     38
20     25
30     20
100    11
1000   3.5
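The 1/SQRT(size) factor in the formula explains the table's shape: to cut the interval length in half you need roughly four times as many data points. A quick sketch of this scaling (assuming, purely as an illustration, a variable uniformly distributed on 0-100, whose standard deviation is about 28.9; the table above may assume a different distribution, so the exact values differ somewhat):

```python
from math import sqrt
from scipy.stats import t

SD = 28.9  # standard deviation of a uniform variable on 0-100 (illustrative assumption)

def ci_length(n, conf=0.90, sd=SD):
    """Approximate confidence interval length for a sample of size n."""
    return 2 * t.ppf(1 - (1 - conf) / 2, n - 1) * sd / sqrt(n)

for n in (5, 10, 100, 1000):
    print(n, round(ci_length(n), 1))

# Quadrupling the sample size roughly halves the length:
print(ci_length(100) / ci_length(400))  # close to 2
```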

Let’s go back and revisit our scenarios, but now equipped with the above ideas and guidelines.

Scenario #1: the magic coin (REVISITED)

Someone flips a coin four times and obtains the following values:
{Head, Tail, Head, Head}
Based on the outcome of the above experiment he MUST report:
My coin has a head probability between 0.3 and 1 with a confidence level of 90%

If you want to increase the precision, that is, reduce the confidence interval length, you must increase the sample size, that is, the number of flips. With 100 flips you are going to obtain something like:
My coin has a probability of heads between 0.45 and 0.55 with a confidence level of 90%
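For a yes/no quantity such as a head probability there are several standard ways to build the interval; one common choice is the Wilson score interval, sketched below in plain Python (standard library only). The exact endpoints depend on the method chosen, so they will not match the rounded figures above exactly:

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, trials, conf=0.90):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided critical value
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# 3 heads out of 4 flips, 90% confidence
low, high = wilson_interval(3, 4)
print(round(low, 2), round(high, 2))  # roughly 0.36 and 0.94
```

Even the most favorable method leaves a huge interval: four flips tell you almost nothing.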

Scenario #2: the voting visionary (REVISITED)

An election will be held in your country next weekend. There are two candidates: A and B. Someone interviews ten people and obtains the following voting intentions:
{A, B, B, B, A, A, A, A, A, B}
Based on the above he MUST report:
The result of the election with a confidence level of 95% will be:
 A between 30% and 90% and B between 10% and 70%.

This has changed a lot from the initial bold prediction; it is much blurrier now. Typical opinion polls, to estimate the true percentage of the vote with reasonable precision at a confidence level of 95%, require a sample size of around 1000 people. Have a look at the fine print next to the results when you see such a poll in your newspaper.
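The "around 1000 people" figure can be derived from the worst case proportion p = 0.5, which maximizes the variance of the estimate. A sketch (assuming a desired margin of error of ±3 percentage points, a typical value for published polls; the function name is mine):

```python
from math import ceil
from statistics import NormalDist

def poll_sample_size(margin, conf=0.95):
    """Sample size so a proportion estimate stays within +/- margin.

    Uses the worst case p = 0.5, which maximizes the variance p * (1 - p).
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # 1.96 for 95% confidence
    return ceil(z**2 * 0.25 / margin**2)

print(poll_sample_size(0.03))  # about 1068 people for +/-3% at 95% confidence
```

A wider margin shrinks the requirement fast: ±5% needs fewer than 400 interviews, which is why cheap polls quote wide margins.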

 

Scenario #3: the peak usage (REVISITED)

You measure the system usage at the peak hour during three days. The measured values are:
{70%, 70%, 85%}
Based on the above you MUST report:
The average system usage at the peak hour is between 60% and 90% with a confidence level of 90%.

If you want to increase the precision, that is, reduce the confidence interval length, you must increase the sample size, that is, the number of daily measurements. With 20 days you are going to obtain something like:
The average system usage at the peak hour is between 74% and 78% with a confidence level of 90%.


Things to remember


  • There is an unavoidable uncertainty in your measurements and calculations.
  • Any estimation based on sampled data or a limited number of measurements must be accompanied by its precision/uncertainty.
  • Avoid too small samples. The sample size should be large enough to obtain a reasonable precision.
  • What is not said (the error) is usually as important as the reported value itself.
  • There are many marketing tricks that do not tell the whole story and deliberately hide the estimation error. Do not act like the malicious or ignorant people who create those marketing messages.


In the next contribution I’ll have a closer look at the system usage case. Stay tuned.

Monday, May 19, 2014

Better 1x10 than 10x1 revisited



After reading “Better 1x10 than 10x1" you're convinced you need a big server. Then you go to a big enterprise computer shop and ask for one system with “10” units of processing capacity. The seller shows you two different models: one called "10x1", equipped with 10 processors of relative “speed” equal to 1. The other one, called "1x10", equipped with only 1 processor but 10 times faster. Both systems have a certificate that indicates they have 10 units of capacity. And both server prices are equal. Which one would you choose?



First of all, you must be aware that the most widely used measure of the capacity of any server, and in general of any service center whatever it may be, is the maximum number of units of work (UOW) per unit of time (UOT) it can service (Footnote 1). This is the maximum throughput it supports, also called the bandwidth. In our particular case, both servers have the same value: 10 UOW/UOT. So this is not going to help us choose a server. Knowing the maximum throughput is necessary, but not sufficient.
To find the answer let’s switch to another point of view: we move away from the overall behavior, represented by the benchmarked maximum throughput, to the individual perception of performance, represented by the response time. Does a particular user experience the “same” performance in both servers?
Suppose that 1 customer (requesting 1 UOW) arrives at an empty (100% free) server. In such a circumstance, the customer sees no queue ahead, does not have to wait, and receives immediate service. In the 10x1-server case it spends, let's say, 10 units of time receiving service (this time is called the service time, S). In the 1x10-server case it spends only 1 unit of time, just because its processor is 10 times faster.
This makes a difference, doesn't it? In the low load zone, where queuing seldom occurs, the 1x10-server shows an obvious advantage over the 10x1-server, due to the increased speed of its processor.
In the high load zone, there is a stream of users (UOWs) arriving and long queues build up. The user has to wait patiently for the queue to advance until it is his turn. So a user request spends much more time waiting to be serviced than being serviced. The time spent in the queue, waiting to be serviced, is called the waiting time (W). This waiting time, being much greater than the service time, dominates the user experience.
The waiting time is essentially the same for the two kinds of servers. If you arrive at a waiting line and see 1000 customers ahead of you, you’ll have to wait the same whether the queue advances 10 customers every 10 units of time, as is the case for the 10x1-server, or 1 customer every unit of time, as is the case for the 1x10-server. This means that in the high load zone both types of server deliver essentially the same individual user performance.
Summarizing: in the low load zone the 1x10-server is the winner, and moving into the high load zone the 1x10-server’s advantage gets blurred. At the very high load end, both servers perform equally.
You now have a good reason to prefer one kind of server, the 1x10-server, over the other, the 10x1-server: the response time will be better in the low load zone. This may or may not be significant in your particular case of interest. It should be more noticeable in long-running, non-parallelizable tasks than in short-running ones, but the fact is there and you should be aware of it.
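This low-load vs high-load behavior can be made concrete with textbook queueing formulas, under the usual assumptions of Poisson arrivals and exponential service times (assumptions the post does not state, so take the exact numbers as illustrative). The 1x10-server behaves like an M/M/1 queue with service rate 10; the 10x1-server like an M/M/10 queue with per-processor rate 1. The Erlang C formula gives the probability that an arriving job has to wait:

```python
from math import factorial

def erlang_c(c, a):
    """Probability an arriving job must wait in an M/M/c queue.

    c = number of servers, a = offered load (arrival rate / per-server rate).
    """
    top = a**c / factorial(c) * c / (c - a)
    bottom = sum(a**k / factorial(k) for k in range(c)) + top
    return top / bottom

def mean_response_time(lam, mu, c):
    """Mean response time (service + waiting) in an M/M/c queue."""
    return 1 / mu + erlang_c(c, lam / mu) / (c * mu - lam)

for lam in (1, 9):  # low load (10% busy) and high load (90% busy)
    fast = mean_response_time(lam, mu=10, c=1)   # the 1x10-server
    slow = mean_response_time(lam, mu=1, c=10)   # the 10x1-server
    print(f"load {lam}/10: 1x10 -> {fast:.2f}, 10x1 -> {slow:.2f}")
```

At 10% load the single fast processor responds about nine times faster (0.11 vs 1.00 time units); at 90% load the gap shrinks to well under 2x (1.00 vs about 1.67), matching the qualitative argument.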



Server vendors typically put a premium price on 1x10-servers. I would like to think that this increase in price is because of their better performance, but I suppose that the real reason lies in the higher research & development and production costs of faster processors and components.
A final word here: in the previous “Better 1x10 than 10x1" blog entry we exposed a reason to prefer bigger servers: the random nature of user arrivals or demands. And here we’ve seen that it is better to achieve that big capacity with fewer, faster processors. The pattern “1x10 better than 10x1” repeats, but from a different point of view.
Think about it.



Footnote:
(1) Units of work (UOW) and units of time (UOT) are generic units. In the computer system world they can express, e.g., transactions per minute or tpm (UOW=transaction, UOT=minute), IOPS (UOW=IO operation, UOT=second), and similar. But the idea is generally applicable to other areas: public service desks (UOW=a certain customer request, UOT=minute), highway tolls (UOW=cars passing & paying, UOT=minute), and so on. The only limit is our imagination.