
Monday, April 10, 2017

Climbing the peak





Scenario

In your usual sizing efforts you need to know the peak usage% for a certain workload and server(s). What is the right averaging time to capture this peak? Let's look at the possible choices.

Too long averaging time


Averaging Time = 500 min. Peak usage% = 50%.
All the finer detail is lost. Does this really mean that the usage% stayed constant for the whole 500 min? You, like me, don't believe that!

Too short averaging time


Averaging Time =  10 s. Peak usage% = 100%.

Too much detail. This may be good for performance analysis, but it is confusing for sizing. Spikes go up to usage% = 100%, meaning that for 10 s in a row (the averaging time) the server is 100% busy. But I wouldn't take that to mean the server usage% is 100% for sizing purposes. If you do, you are most probably oversizing, a common (and safe) strategy, by the way.

The right averaging time?


Averaging Time =  10 min. Peak usage% = 68%.

Eureka! This is the right amount of detail. From my point of view, the right averaging time for OLTP lies somewhere between 10 min and 1 hour, depending on the workload, the available data, and the degree of tolerance to a high usage%.
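
To make this concrete, here is a minimal Python sketch. The synthetic workload, the 1-second 0/1 sampling and the variable names are illustrative assumptions (they are not the data behind the graphs above); the point is only that the very same samples yield very different peak usage% values depending on the averaging window.

```python
import random

random.seed(1)
# Illustrative synthetic workload: 500 minutes of 1-second 0/1 usage samples,
# roughly 40% busy with a short near-100% burst at the start of every 10 minutes.
samples = [1 if random.random() < (0.95 if t % 600 < 20 else 0.40) else 0
           for t in range(500 * 60)]

def peak_usage(samples, window_s):
    """Peak usage% when the raw samples are averaged over windows of window_s seconds."""
    averages = [sum(samples[i:i + window_s]) / window_s * 100
                for i in range(0, len(samples) - window_s + 1, window_s)]
    return max(averages)

for window in (10, 10 * 60, 500 * 60):      # the three averaging times discussed above
    print(f"averaging time {window:6d} s -> peak usage% = {peak_usage(samples, window):5.1f}")
```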

Monday, April 3, 2017

The Degraded Operations Pitfall




Let's consider an information system supporting a highly critical online activity. Critical means it cannot fail, and if it does fail there must be a contingency infrastructure that allows operations to continue "reasonably" well, with zero or tolerable performance impact.

Someone trying to reduce the acquisition cost decides to give the contingency infrastructure only half of the processing capacity. Should you, as an expert sizer, feel comfortable with this decision or shouldn't you?
To illustrate the problem, let us consider that the workload is the SAP SD benchmark (see the "Phases of the SAPS Benchmark" entry in this blog). The simplified response time curve is in Figure 1, and it can be seen that the system supports 80000 users with a response time of 1 s.
Figure 1: The response time versus the number of users graph for the normal mode server (blue).

If we put this workload on the degraded infrastructure, that is, the same population and the same activity but with 50% less capacity, what happens? Look closely at Figure 2.

Figure 2: The response time graph for the normal mode server (blue) and for the contingency one (red) with 50% performance capacity.

With 50% of the capacity, the response time for 80000 users (1 s on the normal mode server) would be around 12 s! Would anyone consider this a usable system?
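
As a back-of-the-envelope check, here is a minimal Python sketch using the asymptotic response time of the simple model (R = S below the saturation point, R = N*S/m - Z above it) with SAP SD-like parameters assumed from the "Phases of the SAPS Benchmark" entry (S ≈ 0.088 s per dialog step, Z = 10 s, m = 640 threads at full capacity); it is a model sketch, not a measurement.

```python
def response_time(n_users, m, service=0.088, think=10.0):
    """Simple closed-model response time: R = S below the saturation point
    N* = m*(1 + Z/S), and R = N*S/m - Z above it."""
    n_star = m * (1 + think / service)
    return service if n_users <= n_star else n_users * service / m - think

full = 640          # processing threads in normal mode (assumed, per the SAPS post)
half = full // 2    # contingency infrastructure with 50% of the capacity

for m, label in ((full, "normal mode "), (half, "degraded 50%")):
    print(f"{label}: R(80000 users) = {response_time(80000, m):4.1f} s")
# normal mode : R(80000 users) =  1.0 s
# degraded 50%: R(80000 users) = 12.0 s
```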

How can the above situation be successfully solved? Two lines of action are possible:
  • On the workload side: reduce the number of users, that is, propose a significant restriction on the number of users that can use the system in degraded mode.
  • On the capacity side: increase the contingency capacity, that is, increase the capacity of the contingency server, ideally to 100% of the normal-mode capacity.

Summarizing: when sizing degraded-mode infrastructures you have to pay close attention to the response time, and not only to the bandwidth (maximum throughput).

Monday, March 27, 2017

The upgrade sizing pitfall




The art of sizing is not exempt from pitfalls, and you must be aware of them if you want your sizing to be accurate and adequate. Let us talk about a typical scenario: a server upgrade.

All the metrics in sizing are measures of throughput, and this has an implication you must take into account: the service center (here, the server) with the higher throughput capacity is not always the better one. What are you saying, man?

Let's consider two servers, the base one and its intended upgrade:
  1. Base (server A): single-core server with a capacity (maximum throughput = bandwidth) of 1 tps (transaction per second). Therefore the transaction service time is 1 second.
  2. Upgrade (server B): four-core server with a capacity of 2 tps. Therefore the transaction service time is 2 seconds (4 cores / 2 tps).

If you exclusively look at the throughput, B (2 tps) is better than A (1 tps). Period.

But from the response time perspective this superiority must be reconsidered. Let us graph the response time versus the number of users:

Figure: Best response time (average)  versus the number of users, with a transaction think time of 30 seconds. Server A (base) in blue. Server B (upgrade) in red.

In the light load zone, that is, when there are no or few queued transactions, A is better than B. This is a consequence of A's better (lower) service time. In the high load zone B is better than A, a consequence of its better (higher) capacity (throughput). If the workload stays in the light load zone, such an upgrade would be a bad idea.
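
A minimal sketch of that comparison, assuming the best-case (no variability) formula from "Phases of the Response Time" (R = S below the saturation point N* = m*(1 + Z/S), R = N*S/m - Z above it) and the 30 s think time of the figure; the printed crossover is illustrative, not a measurement:

```python
def best_response_time(n_users, m, service, think=30.0):
    """Best-case (no variability) response time for the closed model with m servers:
    R = S below the saturation point N* = m*(1 + Z/S), R = N*S/m - Z above it."""
    n_star = m * (1 + think / service)
    return service if n_users <= n_star else n_users * service / m - think

server_a = dict(m=1, service=1.0)   # base: 1 core, 1 tps bandwidth, S = 1 s
server_b = dict(m=4, service=2.0)   # upgrade: 4 cores, 2 tps bandwidth, S = 2 s

for n in (10, 31, 64, 100, 200):
    ra = best_response_time(n, **server_a)
    rb = best_response_time(n, **server_b)
    print(f"N={n:3d}  A: {ra:6.1f} s  B: {rb:6.1f} s  better: {'A' if ra < rb else 'B'}")
```

Under these deterministic bounds, A wins up to roughly N ≈ 32 users and B wins beyond that; variability softens the crossover, but the two zones remain.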

So when you perform a sizing you must know which point of view is relevant to your sizing exercise: the throughput or the response time. Don't fall into the trap: a higher-capacity (higher-throughput) server is not unconditionally better. For an upgrade server to be unconditionally better, its capacity (throughput) must be higher and its service time lower.

Monday, March 20, 2017

What does Usage mean?




My objective today is to clarify the meaning of one of the most used metrics in sizing and performance analysis: the usage.

First things first

The usage describes the state of the service center. It is a binary quantity:
  • usage=0 when idle (not working)
  • usage=1 when busy (working)
Any service center at any point in time is idle (usage=0) or busy (usage=1).

Usage Percentage (Usage%)

This is the "usage" metric we are familiarized with. It is the average of the usage over a certain time interval, called the averaging time. Typically is expressed as percentage.

For example, usage% = 20% (for a certain time interval) means that during that interval the service center has been:
  • 20% of the time busy
  • 80% of the time idle

Averaging Time

The averaging time used to calculate the usage% is of capital importance. Saying 20% is not enough; saying 20% over an hour is right.

The theory is very simple, but in practice the averaging time is frequently omitted. Always ask for it (but don't expect a crystal clear answer).

To stress its importance, look at these graphs representing the time evolution of the usage% metric for the same workload and different averaging times.

Averaging Time = 10 s


Averaging Time = 10 min

Averaging Time = 500 min


I insist: in the three graphs the workload is the same (and the total amount of work done, the area under the usage% curve, remains the same). The short averaging time is best suited for performance analysis, the medium one for sizing, and the long one for trend analysis.
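
A minimal sketch of the calculation behind these graphs, assuming 1-second 0/1 usage samples and a made-up on/off workload; it shows that the area (the total work) is identical for every averaging time while the observed peak is not:

```python
def usage_percent(samples, window_s):
    """usage% series for a given averaging time, from per-second 0/1 usage samples."""
    return [sum(samples[i:i + window_s]) / window_s * 100
            for i in range(0, len(samples) - window_s + 1, window_s)]

# Illustrative workload: 1 hour of samples, busy during the first 6 minutes of every 30.
samples = [1 if (t // 60) % 30 < 6 else 0 for t in range(3600)]

for window in (10, 600, 3600):                  # 10 s, 10 min, 1 h
    series = usage_percent(samples, window)
    area = sum(series) * window                 # area under the usage% curve
    print(f"window {window:4d} s: peak = {max(series):.0f}%  area = {area:.0f} %*s")
# window   10 s: peak = 100%  area = 72000 %*s
# window  600 s: peak = 60%  area = 72000 %*s
# window 3600 s: peak = 20%  area = 72000 %*s
```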

Monday, March 13, 2017

Phases of the response time with variability




Let's return to the response time versus number of users signature (graph below) of the simple system described in "Phases of the Response Time". We made two simplifying assumptions: the service time is constant, and the interarrival time is constant (the think time is constant). These assumptions allowed us to distill the essence of the response time dependencies, providing very useful insights into the response time behaviour: the primary parameters it depends on, how it depends on them, and what its trends and limits are.

Figure 1: The response time versus the number of users for the simple model.


In the real world... variability!

Now let's go a step further and introduce variability. Customers seldom arrive at a service center at uniform intervals (arrival-side variability). Customers seldom demand the same service time (service-side variability). Both magnitudes are essentially variable.

The analysis of the response time with variability is usually done with a powerful mathematical tool: probability. Every time you open a queueing theory textbook or article, you'll find probabilities. Our constant values are transformed into statements like these: "a service time of 2 s is twice as probable as one of 1 s", "there's an 80% probability that the service time lies in the interval (1 s, 2 s)", "the average response time is 2 s", and so on. We enter the probabilistic world, where the raw material is random variables, about which we can only state probabilistic facts.
But, for now, I'm interested in highlighting the main consequences of variability, not in performing an in-depth analytical study.

Variability Effects

Look at the figures 2 and 3.

Figure 2: No variability case. Uniform arrivals + Uniform service --> No waits --> Best and uniform response time.

This "no variability" figure corresponds to our simple all-constant case (Figure 1). Uniform arrivals and uniform service time result in no waits and the best and uniform response time.
Figure 3: With variability case (arrivals side). Non-uniform arrivals + Uniform service --> Waits --> Worse and variable response time

The "with variability" diagram illustrates the arrivals side variability, in particular batch arrival of users. Non-uniform arrivals result in waits, and the response time seen by users is worse and variable (or volatile).

Ideas to take away --> The effects of variability are:

  • The response time is variable: it varies from user to user and between successive visits by the same customer.
  • The average response time is worse: waits show up due to the lack of uniformity, increasing the response time.

An analysis of the same simple model, but allowing random (exponentially distributed) variation of the service time and the think time, results in the graph in Figure 4.


Figure 4: Average response time for the all-constant (blue) and the random (red) cases.

We can see, for example, that when the user population reaches 80% of the saturation value, the average response time for the random model is 4 times that of the all-constant model, and at 100% of the saturation population the ratio increases to 8 times!
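
For readers who want to play with this, here is a minimal event-driven simulation sketch of the closed loop with a single server. The parameters (S = 1 s, Z = 30 s, hence N* = 31), the seed and the run length are illustrative assumptions; the sketch only aims to show the trend of Figure 4 (the exponential case degrading well before the all-constant one), not to reproduce its exact ratios.

```python
import heapq
import random

def avg_response_time(n_users, service, think, randomize, horizon=200_000.0, seed=7):
    """Average response time of a closed loop with one server: each of the n_users
    thinks, then queues for service. With randomize=False all times are constant;
    with randomize=True they are exponentially distributed with the same means."""
    rng = random.Random(seed)
    draw = (lambda mean: rng.expovariate(1.0 / mean)) if randomize else (lambda mean: mean)

    # Event heap entries: (time, kind, user) with kind 0 = arrival, 1 = service end.
    events = [(draw(think), 0, u) for u in range(n_users)]
    heapq.heapify(events)
    waiting, busy = [], False
    arrived = {}                        # time each user joined the queue
    total, done = 0.0, 0

    while events:
        t, kind, u = heapq.heappop(events)
        if t > horizon:
            break
        if kind == 0:                   # user u arrives at the server
            arrived[u] = t
            if busy:
                waiting.append(u)
            else:
                busy = True
                heapq.heappush(events, (t + draw(service), 1, u))
        else:                           # user u completes service and goes back to think
            total += t - arrived[u]
            done += 1
            heapq.heappush(events, (t + draw(think), 0, u))
            if waiting:
                heapq.heappush(events, (t + draw(service), 1, waiting.pop(0)))
            else:
                busy = False
    return total / done

# Illustrative parameters (assumed): S = 1 s, Z = 30 s, so the saturation point is N* = 31.
for n in (10, 25, 31, 40):
    det = avg_response_time(n, 1.0, 30.0, randomize=False)
    rnd = avg_response_time(n, 1.0, 30.0, randomize=True)
    print(f"N={n:3d}   constant: R = {det:6.2f} s   exponential: R = {rnd:6.2f} s")
```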

By the way, don't underestimate the effects of variability, as it is one of the causes of unacceptable performance from the customer's point of view: in general a customer is more willing to accept higher but uniform response times than lower but highly variable ones.

Friday, November 11, 2016

The scalability of the software


The development team at the ABC company has just built a five-star transaction/program. The program code has a critical region. Basic performance tests with a few users result in a total execution time of 1 s, with a residence time in the critical region of 0.05 s. Management considers these numbers satisfactory, so the deployment for general availability is scheduled for next weekend.

You, a performance analyst's apprentice, ask for the expected concurrency level, that is, the number of simultaneous executions of the transaction/program. This concurrency turns out to be 100.

What do you think about this?

A suitable performance model

A very simple model to analyze and predict the performance of the system is a closed loop with two stages and a fixed (deterministic) time in each stage, as depicted here:



The total execution time of the program is divided into:
  • the time in the parallel region, where concurrency is allowed, and
  • the time in the serial (critical) region, where simultaneity is not allowed.

With only one user (one copy of the program/transaction in execution) the elapsed time is P + S, that is, 1 s ( = 0.95 + 0.05 ).

But what happens when the concurrency level is N? In particular, what happens when N=100?

And the model predicts...

Calculating as explained in "Phases of the Response Time", the model predicts the saturation point at N* = 20 (= 1 + 0.95/0.05) users. This is the software scalability limit. With more than 20 simultaneous executions, executions queue at the entry point of the critical region: the higher the concurrency level, the bigger the queue and the longer the waiting time. Above saturation the critical region is busy 100% of the time, so the cycle time is N × S, and you can easily calculate that with the target concurrency level of 100 the idyllic 1 s measured by the development team (with few users) increases to an unacceptable 5 s (= 100 × 0.05 s). This means that the elapsed time of any program/transaction execution will be 5 s, distributed in the following way (see the sketch after this list):
  • 0.95 s in the parallel region,
  • 4 s waiting to enter the critical (serial) region, and
  • 0.05 s in the critical region.
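
As mentioned above, here is a minimal sketch of that calculation (the function name is mine; the rule is simply elapsed = P + S below N* and elapsed = N × S above it):

```python
def elapsed_time(n, parallel=0.95, serial=0.05):
    """Elapsed time per execution in the two-stage closed-loop model: below the
    saturation point N* = 1 + parallel/serial there is no queueing; above it the
    critical region is 100% busy and each cycle takes n * serial seconds."""
    n_star = 1 + parallel / serial          # 20 users for the default times
    return (parallel + serial) if n <= n_star else n * serial

for n in (1, 10, 20, 50, 100):
    print(f"N={n:3d}  elapsed = {elapsed_time(n):4.2f} s")
# Prints 1.00 s up to N=20, then 2.50 s at N=50 and 5.00 s at N=100; the wait at
# N=100 is 5 - (0.95 + 0.05) = 4 s, as in the list above. With the improvement
# discussed later, elapsed_time(100, serial=0.02) gives about 2.0 s.
```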



Elapsed execution time for N=1 and N=100 concurrency level


The graph of the execution time against the number of concurrent users is the following:
Elapsed execution time against the concurrency level


And, indeed, when the program is released the unacceptable response time shows up!

Corrective measures

The crisis committee holds an urgent meeting, and these are the different points of view:
  • Development Team: the problem is caused by insufficient HW capacity. Please grow (assign more cores to) the VM supporting the application and the problem will disappear.
  • Infrastructure Team: hardware undersized? No way. The CPU usage is barely 25%! We don't know what is happening.
  • Performance Analyst Team (featuring YOU): more cores won't solve the problem, as the hardware is not the bottleneck!

Additional cores were assigned but, as you rightly predicted, things remained the same. The bottleneck here is not the hardware capacity but the program itself. The right approach to improving the performance numbers is to reduce the residence time in the non-parallelizable critical region. So the development team should review the program code in a performance-aware manner.

You go a step further and offer more predictions: if the time in the critical region were reduced from the current 0.05 s to 0.02 s, the new response time for a concurrency level of 100 would be 2 s (= 100 × 0.02 s), and the new response time graph would be this one (blue 0.05 s, red 0.02 s):

Elapsed execution time against the concurrency level for S = 0.05 s (blue) and S = 0.02 s (red).

Lessons learnt

  • Refrain from blaming the hardware capacity by default. There are times, more than you think, when the hardware capacity is not the limiting factor but an innocent bystander that gets blamed as the culprit.
  • Plan and execute true performance tests in the development phase, especially a high-load one, because with few users you will probably not hit the performance bottleneck.
  • Definitely welcome the skills provided by a performance analyst. Have one on your team. You won't regret it.




Thursday, September 1, 2016

Phases of the SAPS Benchmark

The SAPS benchmark fits very closely the model analyzed in the post "Phases of the Response Time". In essence, the benchmark is performed by progressively increasing the customer population and monitoring the response time. When the response time reaches 1 second, the measured throughput, expressed in dialog steps per minute, is the SAPS value. The think time remains constant at 10 s.




SAPS is a performance metric measuring throughput: 1 SAPS is 1 dialog step (a unit of work defined by SAP) per minute. This basically means 1 customer service per minute.


Model Calibration

Let us consider, without loss of generality, the benchmark number 2014034, corresponding to an IBM POWER E870.




According to the benchmark certificate these are the measured values:
  • m = 640, the number of concurrent HW threads,
  • X = 26166000 ds/h = 436100 ds/min (SAPS),
  • N = 79750, the number of users,
  • R = 0.97 s, the average response time.


The service time is derived from these values:
  • S = m/B ≈ 0.088 s/ds, where B is the bandwidth (the maximum throughput, here the measured X, since the system is saturated).
As R >> S, the system must be well above the saturation point. The saturation point is N* = m(1 + Z/S) = 73323 (with the think time Z = 10 s), and N, as stated in the benchmark certification, is 79750. Clearly N > N*, confirming the hypothesis that the system is operating well above the saturation point.


At this point we have our simple model well calibrated. To double-check it, note that the response time predicted by the model for a population of N = 79750, the benchmark population, is 0.969 s, while the certified response time is 0.97 s: a very close agreement!
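
A minimal sketch of this calibration in Python (the variable names are mine; the formulas are the simple-model ones used above):

```python
m = 640                        # concurrent HW threads
X = 26_166_000 / 3600          # maximum throughput, in dialog steps per second
Z = 10.0                       # think time (s)
N = 79_750                     # benchmark users

S = m / X                      # service time per dialog step
n_star = m * (1 + Z / S)       # saturation point

def response_time(n):
    """Simple-model response time: R = S below N*, R = n*S/m - Z above it."""
    return S if n <= n_star else n * S / m - Z

print(f"S = {S:.3f} s/ds,  N* = {n_star:,.0f} users,  R({N:,}) = {response_time(N):.2f} s")
# S = 0.088 s/ds,  N* = 73,323 users,  R(79,750) = 0.97 s
```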

Prognosis (Prediction)



The key benefit of having a simple model for studying the performance behavior of more complex systems is the predictive ability it gives us. We ask and the model answers.

Question.- If the response time limit in the SAP benchmark were 2 s instead of 1 s, what would the system throughput (SAPS) be?
Answer.- Contrary to what some may think... it would remain almost the same! Being above the saturation point, the model predicts the same throughput, equal to the maximum throughput (the system bandwidth). The delta in SAPS for this change would be very slight and not significant.

Question.- If the response time limit in the SAP benchmark were 0.5 s instead of 1 s, what would the system throughput (SAPS) be?
Answer.- The answer is the same as before: almost the same. A response time of 0.5 s is also above the saturation point, and this implies that the system throughput would be the same. In the real world there would be a non-significant (negative) delta.


We've varied the response time limit by a wide margin, from -50% (here) to +100% (previous case) and the SAPS remain almost the same... I bet you wouldn't have said so before reading this article.

Question.- But... what changes, then, between the current 1 s and the new 0.5 s?
Answer.- The number of benchmark users needed to reach the response time limit: fewer in the 0.5 s case than in the 1 s case.


Question.- With the current benchmark definition and 90000 users, what would the expected response time be?
Answer.- Looking at the (number of users, response time) graph above, you can conclude that the response time would be around 2.4 s.
Answer.- Looking at the (number of users, response time) graph above, you can conclude the response time will be around 2.4 s.