Demystifying Performance: The Effect of IO Improvement: Bandwidth

Recently I’ve had the opportunity to deal with database systems that claim to have a very smart storage subsystem that makes them superior every time, everywhere. I’m going to analyze this claim from the performance point of view: what is the actual effect of this smartness?

A visit to the doctor

Typically a visit to the doctor consists of two distinct stages: the interaction with the nurse (checking on arrival that you are expected, some admin stuff...) and the interaction with the doctor himself. Let’s say you spend 2 minutes with the nurse and 10 minutes with the doctor. That means the nurse is capable of attending 30 patients per hour, and the doctor 6 patients per hour. Technically, it can be said that the nurse bandwidth is 30 pph (patients per hour) and the doctor bandwidth is 6 pph.

Those are the performance numbers when the stages are considered individually. But what is the overall bandwidth of the medical center (nurse + doctor) system? That of the nurse? That of the doctor? Somewhere in between? If you think about it a little, the conclusion should be crystal clear: the doctor’s, simply because his bandwidth, his capability to service patients, is the lesser of the two. You may try to push more than 6 pph into the medical center, but patients will proceed through the system at a maximum rate of 6 pph, the medical center bandwidth. The doctor is the bottleneck stage of the medical center.
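As a minimal sketch of the reasoning above, the bandwidth of each stage can be derived from the time spent per patient, and the system bandwidth taken as the smallest stage bandwidth. The function names `stage_bandwidth_pph` and `system_bandwidth` are made up for this illustration, and the model assumes every patient goes through both stages in sequence.

```python
def stage_bandwidth_pph(minutes_per_patient: float) -> float:
    """Patients per hour a single stage can service."""
    return 60.0 / minutes_per_patient


def system_bandwidth(stage_bandwidths: dict) -> tuple:
    """Return (bottleneck stage name, system bandwidth).

    The system can go no faster than its slowest stage.
    """
    bottleneck = min(stage_bandwidths, key=stage_bandwidths.get)
    return bottleneck, stage_bandwidths[bottleneck]


# Figures from the text: 2 minutes with the nurse, 10 minutes with the doctor.
stages = {
    "nurse": stage_bandwidth_pph(2),
    "doctor": stage_bandwidth_pph(10),
}
bottleneck, bandwidth = system_bandwidth(stages)
print(f"bottleneck: {bottleneck}, medical center bandwidth: {bandwidth} pph")
```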

If you were the managing director and decided it was necessary to increase the “performance” of the hospital, what would you do? Employ smarter or more nurses, or employ smarter or more doctors?

If you replace the current nurse with a new one who is twice as efficient, that is, one who reduces the time spent with each patient from 2 min to 1 min, you will have increased the bandwidth of the nurse stage from 30 pph to 60 pph. But the doctor will continue being the bottleneck. The system bandwidth will remain the same! Conclusion: the nurse stage improvement is not effective.

If you replace the current doctor with a new one who is twice as efficient, that is, one who reduces the time spent with each patient from 10 min to 5 min, you will have increased the bandwidth of the doctor stage from 6 pph to 12 pph. The doctor will continue being the bottleneck, but at an increased rate of 12 pph. Perfect! This is the right way to increase the medical center bandwidth, which has gone up from the previous 6 pph to the new 12 pph. The doctor (bottleneck) improvement is fully effective.

A very important “rule” you should write down in gold letters: the only effective way to improve the bandwidth of a composite (several stages) system is to improve the bandwidth of the bottleneck stage.
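The two hiring options can be compared side by side in a few lines. This is only a sketch under the same simple model (system bandwidth equals the smallest stage bandwidth); the helper `center_bandwidth_pph` is a hypothetical name introduced here.

```python
def center_bandwidth_pph(nurse_minutes: float, doctor_minutes: float) -> float:
    """Medical center bandwidth: the lesser of the two stage bandwidths."""
    return min(60.0 / nurse_minutes, 60.0 / doctor_minutes)


baseline = center_bandwidth_pph(nurse_minutes=2, doctor_minutes=10)
faster_nurse = center_bandwidth_pph(nurse_minutes=1, doctor_minutes=10)
faster_doctor = center_bandwidth_pph(nurse_minutes=2, doctor_minutes=5)

print(f"baseline:      {baseline} pph")
print(f"faster nurse:  {faster_nurse} pph")   # unchanged: not the bottleneck
print(f"faster doctor: {faster_doctor} pph")  # doubled: the bottleneck improved
```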

It is worth pointing out that which of the stages is the bottleneck also depends on the workload: the doctor is the bottleneck for the above workload, a normal patient visit. But imagine a new workload that consists of heavy administrative work that must be processed by the nurse, while the doctor only has to stamp his signature. What do you think would be the bottleneck then?

Note: why doctors and medical centers? No particular reason beyond the fact that I’ve been at one of them today, and it occurred to me that they are a very simple scenario everyone has experienced.

A visit to a database machine

For the purpose of this simple analysis the database machine is also a two-stage (or two-subsystem) system: the database node and the storage node. One user request, or visit, to the system places demands for service on both the database and the storage.

For a particular workload the database and storage nodes have a certain processing capacity. The (maximum) processing capacity is called bandwidth, and is measured in user requests per unit time, typically requests per second (req/s) or per minute (req/min).

The stage with the lesser of the two bandwidths (for this particular workload) is the bottleneck, and the system bandwidth will be equal to the bottleneck bandwidth, no more, no less.

Consider a certain heavy IO workload, one with a high degree of IO content, for which the storage is the bottleneck. Any improvement in the IO demand, whether achieved by reducing the number of IOs through smart storage or by using more or faster storage nodes, will increase the bandwidth of the storage stage and, consequently, the bandwidth of the overall system, the database machine.

Let’s consider these particular values for the heavy IO workload: the bandwidth of the database node (B_DB) is B_DB = 10 req/s (or 600 req/min), and the bandwidth of the storage node (B_ST) is B_ST = 5 req/s (or 300 req/min). The system bandwidth (B) is equal to that of the storage (the bottleneck), that is, B = B_ST = 5 req/s. If the IO reduction technique halves the IO demand, the improved storage bandwidth will be B’_ST = 10 req/s (or 600 req/min). The database machine bandwidth will increase from 5 req/s to 10 req/s. That is, B’ = B’_ST = 10 req/s.

Let’s explore the other extreme: a light IO workload, for which the database is the bottleneck (and the storage is not). Any reduction of the demand placed on the storage subsystem will have null, or very limited, effect, as it is not improving the bottleneck. In this case having a smarter storage or not doesn’t matter at all. You had better not pay for it.

Let’s use these particular values for the light IO workload: B_DB = 5 req/s (or 300 req/min) and B_ST = 10 req/s (or 600 req/min). The system bandwidth (B) is equal to that of the database (the bottleneck), that is, B = B_DB = 5 req/s. If the IO reduction technique halves the IO demand, the improved storage bandwidth will be B’_ST = 20 req/s (or 1200 req/min). But the overall database machine bandwidth will remain the same: B’ = B_DB = 5 req/s.
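The heavy and light IO cases can be checked with a small sketch. It assumes, as the examples in the text do, that dividing the IO demand per request by some factor multiplies the storage bandwidth by the same factor. The function `machine_bandwidth` and its `io_reduction` parameter are names invented for this illustration.

```python
def machine_bandwidth(b_db: float, b_st: float, io_reduction: float = 1.0) -> float:
    """Database machine bandwidth in req/s.

    io_reduction is the factor by which the IO demand per request is
    divided (2.0 means the IO demand is halved). Assumption: the storage
    bandwidth grows by that same factor.
    """
    improved_b_st = b_st * io_reduction
    return min(b_db, improved_b_st)


# Heavy IO workload: B_DB = 10 req/s, B_ST = 5 req/s (storage is the bottleneck).
heavy_before = machine_bandwidth(10, 5)
heavy_after = machine_bandwidth(10, 5, io_reduction=2.0)

# Light IO workload: B_DB = 5 req/s, B_ST = 10 req/s (database is the bottleneck).
light_before = machine_bandwidth(5, 10)
light_after = machine_bandwidth(5, 10, io_reduction=2.0)

print(f"heavy IO: {heavy_before} -> {heavy_after} req/s")
print(f"light IO: {light_before} -> {light_after} req/s")
```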

This light IO scenario is exactly what happens in benchmarks that have a very low IO content, like the SAP SD benchmark that measures the SAPS of a system. It doesn’t stress the IO subsystem, and B is limited by B_DB, no matter how extraordinarily capable or smart the storage you use.

Let’s point out here that the bandwidth of the overall system will never be greater than the bandwidth of the database node. In particular, this means that the SAPS of the database machine will always be less than or equal to the SAPS of the database node.
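That upper limit shows up when the IO reduction is swept over several values. The sketch below reuses the heavy IO figures (B_DB = 10 req/s, B_ST = 5 req/s) and again assumes the storage bandwidth scales with the IO reduction factor; the factors themselves are arbitrary illustrative values.

```python
B_DB = 10.0  # req/s, database node bandwidth (heavy IO workload)
B_ST = 5.0   # req/s, storage node bandwidth before any IO improvement

# Factor at which the bottleneck moves from the storage to the database node.
crossover = B_DB / B_ST

sweep = []
for factor in (1.0, 1.5, 2.0, 3.0, 4.0):
    improved_b_st = B_ST * factor      # assumed scaling of storage bandwidth
    b = min(B_DB, improved_b_st)       # system bandwidth = bottleneck bandwidth
    bottleneck = "storage" if improved_b_st < B_DB else "database"
    sweep.append((factor, b, bottleneck))
    print(f"IO demand / {factor}: B' = {b} req/s (bottleneck: {bottleneck})")

print(f"beyond a factor of {crossover}, further IO improvement gains nothing")
```

Up to the crossover factor every IO improvement is fully effective; past it, the database node caps the system bandwidth.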

In the real world

Back in the real world there are more shades than merely black or white. But knowing the extremes should be of help to us.

Any IO improvement, for example with smarter storage, has its maximum effect when the storage is the bottleneck. It has no effect at all when the storage is not the bottleneck (and the database node is). Consequently, from the bandwidth point of view, a smarter storage does not bring an absolute, every time, everywhere improvement. It will depend on where the bottleneck is.

But what seems clear is that this smartness will cost you a positive absolute number.

Wednesday, 15 October 2014
