Recently I’ve had the opportunity to deal with database
systems that claim to have a very smart storage subsystem that makes them
superior every time, everywhere. I’m going to analyze this claim from a
performance point of view: what is the actual effect of this smartness?
A visit to the doctor
Typically a visit to the doctor consists of two
distinct stages: the interaction with the nurse (checking on arrival that
you’re expected, some admin stuff...), and the interaction with the doctor
himself. Let’s say you spend 2 minutes
with the nurse, and 10 minutes with the doctor. That means the nurse is capable
of attending 30 patients per hour, and the doctor 6 patients per hour.
Technically it can be said that the nurse’s bandwidth is 30 pph (patients per
hour) and the doctor’s bandwidth is 6 pph.
Those are the performance numbers when the stages are considered
individually. But what is the overall bandwidth of the medical center (nurse +
doctor) as a system? The nurse’s? The doctor’s? Somewhere in
between? If you think about it a little, the conclusion is crystal clear: the
doctor’s, simply because his bandwidth, his capacity to service patients,
is the lesser of the two. You may try to push more than 6 pph into the medical
center, but they will move through the system at a maximum rate of 6 pph, the
medical center bandwidth. The doctor is the bottleneck stage of the medical
center.
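The reasoning above can be sketched in a few lines of Python. The function names `stage_bandwidth_pph` and `system_bandwidth` are made up for this illustration; the only inputs are the 2 and 10 minutes per patient from the example.

```python
def stage_bandwidth_pph(minutes_per_patient):
    """Patients per hour one stage can serve, given its time per patient."""
    return 60 / minutes_per_patient


def system_bandwidth(stage_bandwidths):
    """Return (bottleneck_name, bandwidth): the slowest stage limits the system."""
    bottleneck = min(stage_bandwidths, key=stage_bandwidths.get)
    return bottleneck, stage_bandwidths[bottleneck]


stages = {
    "nurse": stage_bandwidth_pph(2),    # 2 min per patient -> 30 pph
    "doctor": stage_bandwidth_pph(10),  # 10 min per patient -> 6 pph
}
bottleneck, bandwidth = system_bandwidth(stages)
print(f"bottleneck: {bottleneck}, system bandwidth: {bandwidth:g} pph")
```

The system bandwidth is simply the minimum over the stages, and the stage that achieves that minimum is the bottleneck.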
If you were the managing director and decided it was
necessary to increase the “performance” of the hospital, what would you do? Employ
smarter or more nurses, or employ smarter or more doctors?
If you replace the current nurse with a new one that is
twice as efficient, that is, one who cuts the time spent with each patient
from 2 min to 1 min, you will have increased the bandwidth of the nurse stage from 30
pph to 60 pph. But the doctor will continue to be the bottleneck. The system
bandwidth will remain the same! Conclusion: the nurse stage improvement is not
effective.
If you replace the current doctor with a new one that
is twice as efficient, that is, one who cuts the time spent with each patient
from 10 min to 5 min, you will have increased the bandwidth of the doctor stage from
6 pph to 12 pph. The doctor will continue to be the bottleneck, but at an
increased rate of 12 pph. Perfect! This is the right way to increase the medical
center bandwidth, which has gone up from the previous 6 pph to the new 12 pph. The
doctor (bottleneck) improvement is fully effective.
A very important “rule” you should write down in gold letters: the only effective way to improve the bandwidth of
a composite (multi-stage) system is to improve the bandwidth of the
bottleneck stage.
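As a quick check of the rule, here is a minimal Python sketch comparing the two hiring decisions. The helper `center_bandwidth` is a hypothetical name; the service times are the ones used in the example.

```python
def center_bandwidth(nurse_minutes, doctor_minutes):
    """System bandwidth in patients per hour: the slower stage sets the pace."""
    return min(60 / nurse_minutes, 60 / doctor_minutes)


baseline = center_bandwidth(nurse_minutes=2, doctor_minutes=10)
faster_nurse = center_bandwidth(nurse_minutes=1, doctor_minutes=10)
faster_doctor = center_bandwidth(nurse_minutes=2, doctor_minutes=5)

print(f"baseline:      {baseline:g} pph")
print(f"faster nurse:  {faster_nurse:g} pph")
print(f"faster doctor: {faster_doctor:g} pph")
```

Only the change that touches the bottleneck stage moves the result.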
It is worth pointing out that which of the stages is the
bottleneck also depends on the workload: the doctor is the bottleneck for the
above workload, a normal patient visit. But imagine a new workload that
consists of heavy administrative work that must be processed by the nurse, while
the doctor only has to stamp his signature. What do you think would be the
bottleneck then?
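To make the dependence on the workload concrete, here is a small Python sketch. The "admin_heavy" service times (20 minutes with the nurse, 1 minute with the doctor) are invented purely for illustration; only the normal visit figures come from the example above.

```python
# Minutes each stage spends per patient, per workload.
# The "admin_heavy" figures are invented for illustration only.
workloads = {
    "normal_visit": {"nurse": 2, "doctor": 10},
    "admin_heavy": {"nurse": 20, "doctor": 1},
}


def bottleneck_of(minutes_per_stage):
    """The stage with the longest service time has the lowest bandwidth."""
    return max(minutes_per_stage, key=minutes_per_stage.get)


bottlenecks = {name: bottleneck_of(m) for name, m in workloads.items()}
for name, stage in bottlenecks.items():
    pph = 60 / workloads[name][stage]
    print(f"{name}: bottleneck is the {stage}, {pph:g} pph")
```

With these assumed numbers the same two people form a system whose bottleneck moves from the doctor to the nurse when the workload changes.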
Note: why doctors and medical centers? No particular reason
beyond the fact that I was at one today and it occurred to me that they are a very
simple scenario everyone has experienced.
A visit to a database machine
For the purpose of this simple analysis the database machine
is also a two-stage (or two-subsystem) system: the database node and the storage
node. One user request, or visit, to the system places demands for service on
both the database and the storage.
For a particular workload the database and storage nodes
have a certain processing capacity. The (maximum) processing capacity is called
bandwidth, and is measured in user requests per unit time, typically requests
per second (req/s) or per minute (req/min).
The stage with the lesser of the two bandwidths (for this
particular workload) is the bottleneck, and the system bandwidth will be equal
to the bottleneck bandwidth, no more, no less.
Consider a heavy
IO workload, one with a high proportion of IO, for which the storage is the
bottleneck. Any reduction of the IO demand, whether achieved by smart storage
that avoids IOs or by using more or faster storage nodes, will
increase the bandwidth of the storage stage and, consequently, the bandwidth
of the overall system, the database machine.
Let’s consider these particular values for the heavy IO
workload: the bandwidth of the database node (BDB) is BDB = 10
req/s (or 600 req/min), and the bandwidth of the storage node (BST)
is BST = 5 req/s (or 300 req/min). The system bandwidth (B) is equal
to that of the storage (the bottleneck), that is, B = BST = 5 req/s. If
the IO reduction technique halves the IO demand, the improved storage bandwidth
will be B’ST = 10 req/s (or 600 req/min). The database machine
bandwidth will increase from 5 req/s to 10 req/s. That is, B’ = B’ST = 10
req/s.
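The arithmetic above can be checked with a short Python sketch. The function names `machine_bandwidth` and `storage_after_io_reduction` are made up for this illustration, and the sketch assumes, as the example does, that dividing the IO demand per request by some factor multiplies the storage bandwidth by the same factor.

```python
def machine_bandwidth(b_db, b_st):
    """Bandwidth of the database machine: the lesser of its two stages (req/s)."""
    return min(b_db, b_st)


def storage_after_io_reduction(b_st, factor):
    """Assumption: IO demand divided by `factor` means storage bandwidth times `factor`."""
    return b_st * factor


b_db, b_st = 10, 5                              # heavy IO workload, in req/s
b_before = machine_bandwidth(b_db, b_st)        # storage is the bottleneck
b_st_new = storage_after_io_reduction(b_st, 2)  # IO demand halved
b_after = machine_bandwidth(b_db, b_st_new)

print(f"B  = {b_before} req/s ({b_before * 60} req/min)")
print(f"B' = {b_after} req/s ({b_after * 60} req/min)")
```

After the improvement both stages sit at 10 req/s, so in this simple model a further IO reduction on its own would bring no extra gain.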
Let’s explore the other extreme: a light IO workload, for which the database is the bottleneck (and the
storage is not). Any reduction of the demand placed on the storage subsystem
will have no effect, or a very limited one, as it does not improve the bottleneck.
In this case having a smarter storage or not doesn’t matter at all. You would be
better off not paying for it.
Let’s use these particular values for the light IO workload:
BDB = 5 req/s (or 300 req/min) and BST = 10 req/s (or 600
req/min). The system bandwidth (B) is equal to that of the database (the bottleneck),
that is, B = BDB = 5 req/s. If the IO reduction technique halves the IO
demand, the improved storage bandwidth will be B’ST = 20 req/s (or 1200
req/min). But the overall database machine bandwidth will remain the same: B’ = BDB = 5
req/s.
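The same check for the light IO case, as a minimal Python sketch. `machine_bandwidth` is a hypothetical helper name, and doubling the storage bandwidth stands in for halving the IO demand.

```python
def machine_bandwidth(b_db, b_st):
    """Bandwidth of the database machine: the lesser of its two stages (req/s)."""
    return min(b_db, b_st)


b_db, b_st = 5, 10                               # light IO workload, in req/s
b_before = machine_bandwidth(b_db, b_st)         # database is the bottleneck
b_after = machine_bandwidth(b_db, b_st * 2)      # storage improved to 20 req/s
speedup = b_after / b_before

print(f"B = {b_before} req/s, B' = {b_after} req/s, speedup = {speedup:g}x")
```

The storage stage became twice as fast, yet the speedup seen by the user is 1x.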
This light IO scenario is exactly what happens in benchmarks
that have a very low IO content, like the SAP SD benchmark that measures the SAPS
of a system. It doesn’t stress the IO subsystem, and B is limited by BDB
no matter how extraordinarily capable or smart the storage you use is.
Let’s point out here that the bandwidth of the overall
system will never be greater than the bandwidth of the database node. In
particular, this means that the SAPS of the database machine will always be
less than or equal to the SAPS of the database node.
In the real world
Back in the real world there are more shades than merely
black or white. But knowing the extremes should help us.
Any IO improvement, for example with smarter storage, has
its maximum effect when the storage is the bottleneck. It has no effect at all
when the storage is not the bottleneck (and the database node is). Consequently,
from the bandwidth point of view, a smarter storage is not an absolute
improvement that pays off every time, everywhere. It depends on
where the bottleneck is.
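The shades between the two extremes can be explored with a small Python sweep. All values here are hypothetical: the database node is fixed at 10 req/s, the storage bandwidth varies, and the smart storage is assumed to halve the IO demand, which in this simple model doubles the storage bandwidth.

```python
B_DB = 10       # database node bandwidth in req/s (hypothetical)
IO_FACTOR = 2   # assumed effect of the smart storage: IO demand halved


def gain_from_smart_storage(b_db, b_st, factor):
    """Ratio of system bandwidth after and before the storage improvement."""
    before = min(b_db, b_st)
    after = min(b_db, b_st * factor)
    return after / before


gains = {
    b_st: gain_from_smart_storage(B_DB, b_st, IO_FACTOR)
    for b_st in (2, 5, 8, 10, 20)
}
for b_st, gain in gains.items():
    print(f"BST = {b_st:>2} req/s -> system gain {gain:.2f}x")
```

Under these assumptions the gain is the full 2x only while the improved storage is still no faster than the database node, shrinks as the storage bandwidth approaches that of the database node, and is 1x once the database node is the bottleneck.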
What does seem clear, though, is that this smartness will always cost you
a positive amount of money.