Monday, November 14, 2016

Data Tiering: a simple performance analysis


A certain transaction uses 1 ms of CPU time and 20 ms of IO time. With a think time of 10 s the system supports 550 users with a response time R below 1 s.

Now the same system is equipped with a faster storage where part of the data is placed. When this data placement is managed automatically, the faster storage acts as a cache. If the transaction finds its data in the cache, the IO time is reduced to 5 ms. Otherwise the IO time remains 20 ms.

With a hit ratio of 10%, that is, one in ten IO accesses are served from the cache, the performance of the system improves in the following ways:
  • With the same customer population (550 users) now the average response time drops from 1 s to 0.20 s!
  • The user population may increase from 550 to 611 users to reach again the limit of 1 s of average response time.


Let us build a reasonably simplified performance model for the caching in data tiering, a model suitable for a simple but illustrative performance analysis. A transaction "visits" the CPU, using C units of time (1 ms in the scenario), and "visits" the storage, using S units of time (20 ms in the scenario).





With a cache in place the same transaction may find its data in the cache (this is called a cache hit), and in such a fortunate case it doesn't have to visit the slow storage. This fast IO takes F units of time (5 ms in the scenario).


Two parameters to describe the cache effect:
  • α - efficiency  (0 <= α <= 1), defined as α=(S-F)/S or, equivalently  F=S(1-α).
  • h - hit ratio (0 <= h <= 1), the fraction of transactions that get their required data from the cache, typically proportional to the cache capacity.

KPI #1: Best Latency

Without the cache the best latency (response time) a transaction may achieve is Rmin=C+S units of time. With the cache in place, the best achievable latency reduces to R'min=C+F, corresponding to the transactions that get their IO from the cache.

In our scenario, the best latency with cache miss is 21 ms, and with cache hit is 6 ms.


KPI             No cache    With cache
Best Latency    C+S         hit:  C+S(1-α)
                            miss: C+S

KPI #2: Bandwidth

Without the cache the bandwidth (best throughput) the system delivers is B=1/S, limited by the bottleneck device, the slow storage. With the cache in place the system is able to sustain a peak throughput of B'=B/(1-h) for h <= 1/(2-α) and B'=B/[h·(1-α)] for h > 1/(2-α).

In our scenario, the bandwidth without cache is 1/(20 ms), that is, 50 transactions per second. With the cache the bandwidth increases by 11% for a hit ratio h=10%, by 25% for h=20%, and by 100% for h=50%.


KPI             No cache    With cache
Bandwidth       B = 1/S     B' = B/(1-h)        if h <= 1/(2-α)
                            B' = B/[h·(1-α)]    if h > 1/(2-α)

Here is a plot of the bandwidth gain with cache versus the hit ratio. The hit ratio that achieves the maximum bandwidth gain is h=1/(2-α), and the corresponding maximum gain is B'/B=(2-α)/(1-α).
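To make these formulas concrete, here is a minimal Python sketch (the function and variable names are my own, not from the post) that computes the relative bandwidth B'/B as 1/max(1-h, h·(1-α)), treating the slow storage and the cache as the only candidate bottlenecks and ignoring the much smaller CPU demand, as the post does:

def bandwidth_gain(h, alpha):
    """Relative bandwidth B'/B of the cached system versus the uncached one.

    The slow storage serves a fraction (1 - h) of the IOs at cost S each,
    so its demand per transaction is (1 - h) * S; the cache serves the
    remaining fraction h at cost S * (1 - alpha).  The peak throughput is
    limited by the larger of the two demands, hence (with B = 1/S):
        B'/B = 1 / max(1 - h, h * (1 - alpha))
    """
    return 1.0 / max(1.0 - h, h * (1.0 - alpha))

alpha = 0.75                       # scenario efficiency: F = 5 ms = 20 ms * (1 - 0.75)

for h in (0.10, 0.20, 0.50):
    print(f"h = {h:.0%}: gain = {bandwidth_gain(h, alpha) - 1:+.0%}")
# h = 10%: gain = +11%
# h = 20%: gain = +25%
# h = 50%: gain = +100%

h_best = 1.0 / (2.0 - alpha)       # hit ratio giving the maximum bandwidth gain
print(round(h_best, 3), round(bandwidth_gain(h_best, alpha), 3))   # 0.8  5.0  ( = (2-α)/(1-α) )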


Graph: bandwidth gain with cache (α=75%) versus the hit ratio

KPI #3: Response Time

The impact of the cache on the response time signature of the system, that is, on the graph of response time versus the number of users interacting with the system, is twofold:
  • a displacement to the right due to the increase of the saturation population, the point that marks the change of phase, and
  • a decrease in the high load slope.

Graph: response time versus the number of users


The consequence of these is that the cache reduces the response time, a reduction that is stronger in the high load zone. We must speak of averages here, as there are two response times now: one for transactions whose IO is served from the cache, and another for transactions whose IO is served from the slow storage.

In our scenario with a hit ratio of 10% the response time of the system drops from 1 s to 0.20 s with the same customer population (550 users).


KPI #4: Supported Users


Looking at the response time signature from another point of view, it is clear that additional users may work in the system while keeping the same level of performance (response time limit).

In our scenario the user population may increase from 550 to 611 users to reach again the limit of 1 s of average response time (see above graph).
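As a rough cross-check of these two figures, the sketch below (the helper name supported_users is mine) uses the heavy-load asymptote of a closed model, N ≈ (Rlimit + Z)/Dmax, where Dmax is the largest per-transaction demand among the CPU, the slow storage and the cache; with the scenario numbers it recovers 550 users without the cache and about 611 with it:

def supported_users(r_limit, z, demands):
    """Users sustainable at a given response-time limit, using the heavy-load
    asymptote of a closed queueing model: N ~ (R_limit + Z) / D_max, where
    D_max is the largest per-transaction demand among the devices."""
    return (r_limit + z) / max(demands)

C, S, Z = 0.001, 0.020, 10.0     # CPU time, slow-storage IO time, think time (seconds)
h, alpha = 0.10, 0.75            # hit ratio and cache efficiency of the scenario

no_cache = supported_users(1.0, Z, [C, S])
with_cache = supported_users(1.0, Z, [C, (1 - h) * S, h * S * (1 - alpha)])
print(round(no_cache), round(with_cache))   # 550 611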


Side effects

These performance benefits of caching don't come for free. Apart from the evident monetary cost, think about the following:
  • The CPU usage increases: the cache alleviates the load on the storage but increases the load on the CPU, because if the CPU gets its data faster, it has more work to do per unit of time. This may displace other CPU-intensive work competing for the CPU.
  • The housekeeping burden: cache housekeeping tasks (data loading, data discarding...) may consume CPU cycles, unless these tasks are offloaded to the storage.



Friday, November 11, 2016

The scalability of the software


The developers team in the ABC company has just built a five-star transaction/program. The program code has a critical region. Basic performance tests with a few users result in a total execution time of 1 s, with a residence time in the critical region of 0.05 s. These numbers are considered satisfactory by the management, so the deployment for general availability is scheduled for next weekend.

You, a performance analyst's apprentice, ask for the expected concurrency level, that is, the number of simultaneous executions of the transaction/program. This concurrency turns out to be 100.

What do you think about this?

A suitable performance model

A very simple model to analyze and predict the performance of the system is a closed loop, with two stages and fixed / deterministic time in each stage, as depicted here:



The total execution time of the program is divided into:
  • the time in the parallel region, where concurrency is allowed.
  • the time in the serial (critical) region, where simultaneity is not allowed.

With only one user (one copy of the program/transaction in execution) the elapsed time is P + S, that is, 1 s ( = 0.95 + 0.05 ).

But what happens when the concurrency level is N? In particular, what happens when N=100?

And the model predicts...

Calculating as explained in "The Phases of the Response Time", the model predicts the saturation point at N*=20 (=1+0.95/0.05) users. This is the software scalability limit. More than 20 users or simultaneous executions will queue at the entry point of the critical region. The higher the concurrency level, the bigger the queue and the longer the waiting time. You can easily calculate that with the target concurrency level of 100 users, the idyllic 1 s time measured by the developers team (with a few users) increases to an unacceptable 5 s. This means that the elapsed time of any program/transaction execution will be 5 s, distributed in the following way:
  • 0.95 s in the parallel region,
  • 4 s waiting to enter the critical (serial) region, and
  • 0.05 s in the critical region.
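A minimal sketch of this two-stage model (the function name elapsed_time and the variable names are mine): the parallel region behaves like a think time, the critical region is a single-server stage, and above N* = 1 + P/S each execution takes about N·S in total:

def elapsed_time(n, p, s):
    """Elapsed execution time per run of the program under concurrency n.

    p: time in the parallel region (it plays the role of the think time)
    s: time in the serial critical region (single server, m = 1)
    Below the saturation point N* = 1 + p/s nobody queues and the elapsed
    time is p + s; above it the critical region is saturated, the throughput
    is 1/s, and by Little's law each cycle takes n * s.
    """
    n_star = 1 + p / s
    return p + s if n <= n_star else n * s

print(elapsed_time(1,   0.95, 0.05))   # 1.0 s  (the test with a few users)
print(elapsed_time(100, 0.95, 0.05))   # 5.0 s  (the production surprise)
print(elapsed_time(100, 0.95, 0.02))   # 2.0 s  (after shrinking the critical region)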



Elapsed execution time for N=1 and N=100 concurrency level


The graph of the execution time against the number of concurrent users is the following:
Elapsed execution time against the concurrency level


And indeed, when the program is released, the unacceptable response time shows up!

Corrective measures

The crisis committee holds an urgent meeting, and these are the different points of view:
  • Developers Team: the problem is caused by insufficient HW capacity. Please grow (assign more cores to) the VM supporting the application and the problem will disappear.
  • Infrastructure Team: hardware undersized? No way. The CPU usage is barely 25%! We don't know what is happening.
  • Performance Analyst Team (featuring YOU): more cores won't solve the problem, as the hardware is not the bottleneck!

Additional cores were assigned but, as you rightly predicted, things remained the same. The bottleneck here is not the hardware capacity but the program itself. The right approach to improve the performance numbers is to reduce the residence time in the non-parallelizable critical region. So the developers team should review the program code in a performance-aware manner.

You go a step further and expose more predictions: if the time in the critical region were reduced from the current 0.05 s to 0.02 s, the new response time for a degree of simultaneity of 100 would be 2 s, and the new response time graph would be this one (blue 0.05 s, red 0.02 s):

Elapsed execution time against the concurrency level for S=0.05 s (blue) and S=0.02 s (red).

Lessons learnt

  • Refrain from blaming the hardware capacity by default. There are times, more than you think, in which the hardware capacity is not the limiting factor, but an innocent bystander that gets pointed to as the culprit.
  • Plan and execute true performance tests in the development phase, and especially a high-load one, because with a few users you will probably not hit the performance bottleneck.
  • Definitely welcome the skills provided by a performance analyst. Have one in your team. You won't regret it.




Thursday, September 1, 2016

Phases of the SAPS Benchmark

The SAPS benchmark fits very closely into the model analyzed in the post "Phases of the Response Time". In essence, the benchmark is performed by progressively increasing the customer population and monitoring the response time. When the response time reaches 1 second, the measured throughput, expressed in dialog steps per minute, is the SAPS value. The think time remains constant at 10 s.




SAPS is a performance metric measuring throughput: 1 SAPS is 1 dialog step (a unit of work defined by SAP) per minute. This basically means 1 customer service per minute.


Model Calibration

Let us consider, without loss of generality, the benchmark number 2014034, corresponding to an IBM POWER E870.




According to the benchmark certificate these are the measured values:
  • m = 640, the number of concurrent HW threads,
  • X = 26166000 ds/h = 436100 ds/min (SAPS).
  • N = 79750, the number of users.
  • R = 0.97 s, the average response time.


The service time is derived from those values:
  • S = m/X = 0.088 s/ds.
As R>>S, the system must be well above the saturation point. The saturation point is N*=m(1+Z/S)=73323, and N, as stated in the benchmark certification, is 79750. Clearly N>N*, confirming the hypothesis that the system is operating well above the saturation point.


At this point we have our simple model well calibrated. To double-check it, we can see that the response time predicted by our simple model for a population of N=79750, the benchmark population, is 0.969 s, while the certified response time is 0.97 s: a very close agreement!
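For reference, here is a minimal sketch of this calibration (the variable names are mine), using S = m/X and, above saturation, R(N) = N·S/m - Z:

# Calibration of the simple closed model from the certified benchmark figures.
m = 640                      # concurrent HW threads (servers)
X = 26_166_000 / 3600        # throughput in dialog steps per second
N = 79_750                   # benchmark user population
Z = 10.0                     # think time (s)

S = m / X                    # service time per dialog step
n_star = m * (1 + Z / S)     # saturation population
R_pred = N * S / m - Z       # predicted response time above saturation

print(f"S = {S*1000:.1f} ms/ds, N* = {n_star:.0f}, R(79750) = {R_pred:.2f} s")
# S = 88.1 ms/ds, N* = 73323, R(79750) = 0.97 s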

Prognosis (Prediction)



The key benefit of having a simple model for studying the performance behavior of more complex systems is the prediction ability it gives us. We ask and the model responds.

Question .- If the response time limit in the SAP benchmark were 2 s instead of 1 s, what would the system throughput (SAPS) be?
Answer.- Contrary to what some may think... it will remain almost the same! Being above the saturation point, the model predicts the same throughput, equal to the maximum throughput (the system bandwidth). The delta in SAPS from this change would be very slight and not significant.

Question.- If response time limit in the SAP benchmark were 0.5 s instead of 1 s, what would the system throughput (SAPS) be?
Answer.- The answer is the same as before: almost the same. A response time limit of 0.5 s is also above the saturation point, and this implies that the system throughput will be the same. In the real world there would be a small, not significant (negative) delta.


We've varied the response time limit over a wide margin, from -50% (here) to +100% (the previous case), and the SAPS remain almost the same... I bet you wouldn't have said so before reading this article.

Question.- But...what changes then between current 1 s and new 0.5 s?
Answer.- The number of benchmark users needed to reach the response time limit: fewer in the 0.5 s case than in the 1 s case.


Question.- With the current benchmark definition and 90000 users, what would the expected response time be?

Answer.- Looking at the (number of users, response time) graph above, you can conclude the response time will be around 2.4 s.
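The same calibrated numbers answer the three questions above; here is a small sketch of those predictions (the helper names users_at_limit and response_time are mine):

m, Z = 640, 10.0
S = 640 * 3600 / 26_166_000          # about 0.088 s/ds, from the calibration above

def users_at_limit(r_limit):
    """Population that drives the average response time up to r_limit (above saturation)."""
    return m * (r_limit + Z) / S

def response_time(n):
    """Average response time above saturation: R = N*S/m - Z."""
    return n * S / m - Z

for r in (0.5, 1.0, 2.0):
    print(f"R limit of {r} s -> about {users_at_limit(r):,.0f} users")
# roughly 76,300 / 80,000 / 87,200 users; the peak throughput (SAPS) is the same in all three cases

print(round(response_time(90_000), 1))   # 2.4 -> about 2.4 s with 90000 users, as read off the graph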

Thursday, August 4, 2016

Phases of the Response Time

Let us consider a fixed (closed) population of users interacting with a service center, as shown in the picture below:
  1. At t=0, a user arrives at the service center.
  2. From t=0 to t=W the user waits in the queue if all the servers (or workers) are busy at t=0. W is the wait time.
  3. At t=W a server becomes free and takes the user in for service.
  4. From t=W to t=W+S the server gives him the requested service. S is the service time.
  5. At t=W+S the user exits the service center after having been serviced.
  6. From t=W+S to t=W+S+Z the user remains out of the service center. Z is called the think time.
  7. At t=W+S+Z the user goes back to step #1, that is, he arrives again at the service center, and the cycle repeats.




The think time (Z) is a measure of the user activity. It is typically much greater than the service time (Z>>S). For example, in SAPS benchmarks Z=10 s and S is of the order of tens of milliseconds. In the context of SAP system sizing, high-activity users are assigned Z=10 s, medium-activity users Z=30 s, and low-activity users Z=300 s.

A quantitative approach

The fundamental question we would like to answer is: what is the response time (R) when the user population is N? That is, what is the relationship R=R(N) between the response time and the number of users?


To focus on principles, and to keep things simple, let's derive R(N) under the following simplifying assumptions:
  • S is constant,
  • Z is constant,
  • m=1 (the analysis is easily generalizable to an arbitrary number of workers).

Light Load Phase

Let us suppose there is only one user, that is, the total population is one (N=1). The lone user arrives at the service center, never queues, receives service, departs, stays away, and comes back to the service center. This loop repeats periodically, with period Z+S (the time it takes the user to cycle once round the "loop"). The lone user never meets anyone and never queues. In this case Q=0, W=0 and R=S. The system throughput is 1/(Z+S).


Let's add another user to the system (N=2). It could happen that both users arrive at the service center at the same time, and in such a case one is served immediately and the other waits until the first has finished being served (remember that m=1). But at the next loop neither of them has to wait, because the second user has been delayed by S the first time through. So after the initial transitory interval nobody is ever queued: Q=0, W=0 and R=S. The system throughput is 2/(Z+S).


The same will happen with an increasing population, N=3, 4... but up to how many users?

The Phase Transition

How many users can the system support without queueing (except in the initial transitory phase)? We call this specific value the saturation population, or saturation point, and we denote it by N*. It's easy to calculate that


N*=1+Z/S


At the saturation point the throughput is N*/(Z+S), and since N*=(S+Z)/S this equals 1/S.


Heavy Load Phase

Above the saturation population (N>N*) the light-load equilibrium radically changes: we cannot add additional users and keep the number in the queue (Q) equal to zero.


Add the (N*+1)-th user. Let's suppose, without loss of generality, that this additional user (CN*+1) arrives at the service center at some point within the interval where user 1 (C1) is receiving service. As the server is busy with C1, CN*+1 has to wait. When C1 exits, CN*+1 is taken by the server for service. But at this very moment user C2 comes back from the think state and, surprise!, he finds the server busy and has to wait (until now, when coming back, he didn't have to wait). C2 has to wait exactly S units of time, the service time of user CN*+1.


When the server releases user CN*+1, it takes C2 from the queue, and C3 comes back seeking service. C3 finds "his" time slot occupied by C2 and has to wait S units of time in the queue (W=S). This pattern repeats for every other user: all users have to wait S units of time in the queue (W=S)! For a population of N*+1 users there is always one user queued (Q=1), the wait time W equals S (W=S), and the response time equals 2S (R=W+S=2S). The throughput remains the same, 1/S; that is, adding one user does not increase the throughput.


The same analysis may be performed with another additional user, CN*+2, to find that in this case there are always two queued users (Q=2), the wait time for all users is W=2S, and the response time for all users is R=2S+S=3S. The throughput remains 1/S.


Applying the same reasoning, it is straightforward to conclude that when the user population is N (N>N*), the queue size is Q(N)=N-N*, the wait time in the queue is W(N)=S·(N-N*), and the response time is R(N)=S·(1+(N-N*)).


The situation changes radically when going above the saturation point: from an idyllic, constant response time below it, to a response time that grows with each additional user above it.


The Complete Relationship R(N)

The following table summarizes our findings so far:



              Light Load (N<=N*)    Heavy Load (N>N*)
N*            N* = 1 + Z/S
Q(N)          0                     N - N*
W(N)          0                     S·Q(N)
R(N)          S                     S + W(N)
X(N)          N/(Z+S)               1/S
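For completeness, here is a compact coding of the table (a sketch; the function name model and the example values S = 1 s, Z = 9 s are mine, chosen only so that N* = 10):

def model(n, s, z):
    """Closed single-server model with deterministic S and Z (m = 1).
    Returns (Q, W, R, X) for a population of n users, as in the table."""
    n_star = 1 + z / s                 # saturation population
    if n <= n_star:                    # light load: nobody queues
        q, w = 0.0, 0.0
    else:                              # heavy load: one more queued user per extra user
        q = n - n_star
        w = s * q
    r = s + w                          # response time
    x = n / (z + s) if n <= n_star else 1.0 / s   # throughput
    return q, w, r, x

for n in (1, 5, 10, 11, 20):
    q, w, r, x = model(n, 1.0, 9.0)
    print(n, round(r, 2), round(x, 2))
# R stays at S = 1 s up to N* = 10, then grows by S per extra user;
# X grows linearly up to N* and then stays pinned at 1/S = 1 per second.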

Here are figures showing these relationships. In particular, stop and look carefully at the response time one. Fully understanding that figure is essential and, in my experience, this behavior is not widely known.


In the Real World

In the real world the R(N) relationship is not so clean and straight, but the lines that shape it are still present as tendencies.


Light load phase: there are few users, the server is idle most of the time, and an arriving user is very likely to find it idle, so he does not have to wait.


When N⟶0,  then  R⟶S


Transition: as the user population increases, the probability of finding the server busy, and hence of queueing, also increases


When N grows,  then  W grows  (R grows)


Heavy load phase: in a heavily used service center the server is busy most of the time. An arriving user is very unlikely to find it idle, so users will have to wait, and they will wait longer as more users are in the system.


When N⟶∞,  then  R⟶∞  with ∆R/∆N⟶S


Coming Next

Applying all this to a SAPS benchmark.





Blog URL:   http://www.ibm.com/blogs/performance
Mirror blog:  http://demystperf.blogspot.com