
Monday, March 27, 2017

The upgrade sizing pitfall



The art of sizing is not exempt from pitfalls, and you must be aware of them if you want your sizing to be accurate and adequate. Let us talk about a typical scenario: a server upgrade.

All the metrics in sizing are measures of throughput, and this has an implication you must take into account: the service center (the server, here) with the higher throughput capacity is not always the better one. What are you saying, man?

Let's consider two servers, the base one (A) and its intended upgrade (B):
  1. Base (A): a single-core server with a capacity (maximum throughput = bandwidth) of 1 tps (transaction per second). Therefore the transaction service time is 1 second.
  2. Upgrade (B): a four-core server with a capacity of 2 tps. Therefore the transaction service time is 2 seconds (4 cores / 2 tps).

If you exclusively look at the throughput, B (2 tps) is better than A (1 tps). Period.

But from the response time perspective this superiority must be reconsidered. Let us graph the response time versus the number of users:

Figure: Best response time (average) versus the number of users, with a transaction think time of 30 seconds. Server A (base) in blue, server B (upgrade) in red.

In the light load zone, that is, when there are no or few queued transactions, A is better than B. This is a consequence of the better (lower) service time of server A. In the high load zone B is better than A, a consequence of its better (higher) capacity (throughput). If the workload wanders in the light load zone, such an upgrade would be a bad idea.

So when you perform a sizing you must know which point of view is relevant to your sizing exercise: the throughput or the response time. Don't fall into the trap. A higher capacity (throughput) server is not unconditionally better. For an upgrade server to be unconditionally better, its capacity (throughput) must be higher and its service time lower.
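To see where the crossover lies, you can sketch both curves yourself. Below is a minimal Python sketch (my own illustration using the standard asymptotic bounds of operational analysis, not necessarily the exact model behind the figure above): under light load the response time approaches the service time S, and under heavy load it approaches N/C - Z, where C is the capacity and Z the think time.

```python
# Asymptotic response time bounds for a closed system:
#   light load: R ~= S (the service time)
#   heavy load: R ~= N / C - Z (C = capacity, Z = think time)

def response_time_bound(n_users, service_time, capacity, think_time):
    """Lower bound on the average response time, in seconds."""
    return max(service_time, n_users / capacity - think_time)

Z = 30.0  # think time in seconds, as in the figure

for n in (10, 20, 31, 33, 60, 100):
    r_a = response_time_bound(n, service_time=1.0, capacity=1.0, think_time=Z)  # base
    r_b = response_time_bound(n, service_time=2.0, capacity=2.0, think_time=Z)  # upgrade
    print(f"N={n:3d}  R_A={r_a:5.1f} s  R_B={r_b:5.1f} s  better: {'A' if r_a < r_b else 'B'}")
```

With these numbers the two bound curves cross at about N = 32 users: below that point the base server wins, above it the upgrade wins, which is exactly the pitfall described above.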

Friday, November 11, 2016

The scalability of the software


The developers team at the ABC company has just built a five-star transaction/program. The program code has a critical region. Basic performance tests with a few users result in a total execution time of 1 s, with a residence time in the critical region of 0.05 s. Management considers these numbers satisfactory, so the deployment for general availability is scheduled for next weekend.

You, a performance analyst's apprentice, ask for the expected concurrency level, that is, the number of simultaneous executions of the transaction/program. This concurrency turns out to be 100.

What do you think about this?

A suitable performance model

A very simple model to analyze and predict the performance of the system is a closed loop, with two stages and a fixed (deterministic) time in each stage, as depicted here:

[image]

The total execution time of the program is divided into:
  • the time in the parallel region, where concurrency is allowed, and
  • the time in the serial (critical) region, where simultaneity is not allowed.

With only one user (one copy of the program/transaction in execution) the elapsed time is P + S, that is, 1 s (= 0.95 + 0.05).

But what happens when the concurrency level is N? In particular, what happens when N=100?

And the model predicts...

Calculating as explained in "The Phases of the Response Time", the model predicts the saturation point at N* = 20 (= 1 + 0.95/0.05) users. This is the software scalability limit. More than 20 users or simultaneous executions will queue at the entry point of the critical region. The higher the concurrency level, the bigger the queue and the longer the waiting time. You can easily calculate that with the target concurrency level of 100 users, the idyllic 1 s time measured by the developers team (with few users) increases to an unacceptable 5 s. This means that the elapsed time of any program/transaction execution will be 5 s, distributed in the following way (a small sketch of the calculation follows the list):
  • 0.95 s in the parallel region,
  • 4 s waiting to enter the critical (serial) region, and
  • 0.05 s in the critical region.
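Here is a minimal Python sketch of this calculation (my own rendering of the deterministic model above, where the elapsed time is simply the larger of the one-user time P + S and N × S):

```python
def elapsed_time(n_users, p_parallel, s_serial):
    """Elapsed time per execution in the deterministic two-stage loop.

    Below the saturation point N* = 1 + P/S there is no queueing and the
    elapsed time stays at P + S; above it the serial (critical) region
    dominates and the elapsed time grows as N * S.
    """
    return max(p_parallel + s_serial, n_users * s_serial)

P, S = 0.95, 0.05
print(1 + P / S)                # saturation point N* = 20 users
print(elapsed_time(1, P, S))    # 1.0 s for a single user
print(elapsed_time(100, P, S))  # 5.0 s at N=100, of which 4.0 s is queueing
```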



Figure: Elapsed execution time for N=1 and N=100 concurrency levels


The graph of the execution time against the number of concurrent users is the following:
Figure: Elapsed execution time against the concurrency level


And, in effect, when the program is released, the unacceptable response time shows up!

Corrective measures

The crisis committee holds an urgent meeting, and these are the different points of view:
  • Developers Team: the problem is caused by insufficient HW capacity. Please grow the VM supporting the application (assign more cores to it) and the problem will disappear.
  • Infrastructure Team: hardware undersized? No way. The CPU usage is barely 25%! We don't know what is happening.
  • Performance Analyst Team (featuring YOU): more cores won't solve the problem, as the hardware is not the bottleneck!

Additional cores were assigned but, as you rightly predicted, things remained the same. The bottleneck here is not the hardware capacity but the program itself. The right approach to improve the performance numbers is to reduce the residence time in the non-parallelizable critical region. So the developers team should review the program code in a performance-aware manner.

You go a step further and offer more predictions: if the time in the critical region were reduced from the current 0.05 s to 0.02 s, the saturation point would rise to N* = 48.5 (= 1 + 0.95/0.02) and the new response time for a concurrency level of 100 would be 2 s (= 100 × 0.02). The new response time graph would be this one (blue 0.05 s, red 0.02 s):

Figure: Elapsed execution time against the concurrency level for S = 0.05 s (blue) and S = 0.02 s (red).
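To check this prediction you can reuse the elapsed_time() function from the sketch above (again my own illustration, assuming the parallel region stays at P = 0.95 s):

```python
P = 0.95
for S in (0.05, 0.02):
    n_star = 1 + P / S  # saturation point: 20 users vs 48.5 users
    print(f"S={S} s  N*={n_star}  R(100)={elapsed_time(100, P, S)} s")  # 5.0 s vs 2.0 s
```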

Lessons learnt

  • Refrain from blaming the hardware capacity by default. There are times, more than you think, when the hardware capacity is not the limiting factor but an innocent bystander that gets blamed as the culprit.
  • Plan and execute true performance tests in the development phase, especially a high load one, because with few users you will probably not hit the performance bottleneck.
  • Definitely welcome the skills provided by a performance analyst. Have one in your team. You won't regret it.




Wednesday, June 22, 2016

More or Faster?

The More or Faster dilemma

What would you choose: MORE but slower workers, or fewer but FASTER workers?

Let's explore a little bit to find an answer.

Hairdresser’s

You have the opportunity to choose between two hairdresser's shops that have advertised, truthfully, that they are capable of 4 haircuts per hour:
  • The Faster, with only 1 worker (m=1) and a service time of 15 minutes (S=15 min),
  • The More, with 4 hairdressers (m=4) and a service time of 1 hour (S=1 h).

If just before entering you see a queue of 5 (Q=5) in both shops, which one should you choose?

Response:
  • The More: the response time, the total time you'll spend there, is between 2 h and 3 h,
  • The Faster: the response time, the total time you'll spend there, is between 1 h 30 min and 1 h 45 min.
It's better to choose "The Faster".
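The reasoning behind these ranges fits in a few lines of Python. Here is a minimal sketch (my own back-of-the-envelope formulation, assuming deterministic service times and all m workers busy when you arrive): a newcomer who sees Q customers waiting starts service in departure wave w = ceil((Q + 1) / m), so the total time spent falls in the interval (w × S, (w + 1) × S].

```python
import math

def response_time_range(q_waiting, m_servers, service_time):
    """Bounds (low, high] on a newcomer's total time in the shop."""
    wave = math.ceil((q_waiting + 1) / m_servers)  # departure wave you start in
    return wave * service_time, (wave + 1) * service_time

# Hairdresser's, times in minutes:
print(response_time_range(5, 1, 15))  # The Faster: (90, 105] -> 1 h 30 min to 1 h 45 min
print(response_time_range(5, 4, 60))  # The More:   (120, 180] -> 2 h to 3 h
```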

Company Department

The scheduling manager has to decide which department to send incoming jobs to. She has to choose between two departments delivering the same maximum performance (2 jobs per day):
  • The Faster department, with 4 experienced and very efficient professionals (m=4) spending 2 days per job (S=2 d).
  • The More department, with 8 less experienced people (m=8) who spend 4 days per job (S=4 d).
If the queue is 10 jobs long (Q=10) in both departments, which one should she choose?

Response (worked out with the same wave reasoning as above; see the sketch below):
  • The More: the next incoming job would take between 8 and 12 days to be completed.
  • The Faster: the next incoming job would take between 6 and 8 days to be completed.
It's better to choose "The Faster".
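The same sketch, fed with the department numbers (times in days), reproduces these ranges:

```python
# Company departments, times in days:
print(response_time_range(10, 4, 2))  # The Faster: (6, 8] days
print(response_time_range(10, 8, 4))  # The More:   (8, 12] days
```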

Below is the plot of the response time against the number in the queue (Q) for "The Faster" (red) and "The More" (yellow) alternatives in the hairdresser's case. It cleanly illustrates that "The Faster" is better, because the delivered response time is smaller.

[image]

Learning point: given the same bandwidth (maximum throughput), always choose the service center with the faster workers.





New blog URL:   http://www.ibm.com/blogs/performance
Mirror blog:  http://demystperf.blogspot.com

Friday, December 5, 2014

SAPS Olympics: 10 charts for fun and profit

I've prepared 10 charts that I think are interesting enough. You may use them for fun or for your professional presentations. Of course, the data source is the official SAPS (SD 2-Tier) results posted at http://global.sap.com/solutions/benchmark/sd2tier.epx. Please feel free to ask for additional charts not included here, as I plan to produce new ones.
The list here:
  1. Count of benchmarked systems by benchmark version.
  2. All the SAPS results.
  3. All the SAPS results (logarithmic scale).
  4. The latest SAPS results (logarithmic scale).
  5. Top 1 SAPS evolution.
  6. SAPS evolution (50 benchmarks moving average).
  7. SAPS evolution (50 benchmarks moving average) (logarithmic scale).
  8. All the SAPS per core results.
  9. SAPS per core evolution (50 benchmarks moving average).
  10. Top 1 SAPS per core evolution.

Chart #1: Benchmarked systems by benchmark version

[image]

Chart #2: All the SAPS results

[image]

Chart #3: All the SAPS results (logarithmic scale)

[image]

Chart #4: The latest SAPS results (logarithmic scale)

[image]

Chart #5: Top 1 SAPS evolution

[image]

Chart #6: SAPS evolution (50 benchmarks moving average)

[image]

Chart #7: SAPS evolution (50 benchmarks moving average) (logarithmic scale)

[image]

Chart #8: All the SAPS per core results

[image]

Chart #9: SAPS per core evolution (50 benchmarks moving average)

[image]

Chart #10: Top 1 SAPS per core evolution

[image]

Thursday, November 27, 2014

SAPS Olympics: 10 years ago

Let's go back around ten years: what happened in the SAPS arena back then? In this analysis I have considered the 130 systems that were measured with the SAP R/3 Enterprise 4.70 benchmark specification. The first one was published on April 4th, 2003, a Mitsubishi Apricot with certification number 2003032, and the last one on July 4th, 2005, an Egenera pBlade 950-000084 with certification number 2005037. That is, a time span of more than two years. All the numbers and calculations are based on the official SAPS (SD 2-Tier) results posted at http://global.sap.com/solutions/benchmark/sd2tier.epx.


Remember that you have to be careful when comparing two SAPS values that correspond to two different benchmark specs: you have to take into account software release effects and other changes in the benchmark specs. This is like the need to take inflation into account when comparing year 2008 dollars to year 2014 dollars. Current SAPS are heavier than past ones.

SAP Technology Partners

These are the SAP Technology Partners (an SAP concept) that were actively benchmarking SAPS back then. Fujitsu sometimes appears alone and sometimes with Siemens, but I've grouped both in the count.

[image]



By CPU family

The Intel Xeon was the dominant family, and this is a constant in the history of the SAPS Olympics. AMD Opteron had a strong presence. Intel Itanium was alive, and those were also the days of IBM POWER5, UltraSPARC IV, SPARC64 V, and PA-RISC.





By Operating System

Operating systems seen: Windows Server 2000 and 2003, IBM AIX 5, Solaris 9, Linux SLES 8, and HP-UX 11.



[image]

By Relational Database Management System

Relational Database Management Systems seen: Microsoft SQL Server 2000, IBM DB2 UDB 8 and 9.5, Oracle 9i, and SAP DB. All of them were transitioning from 32-bit to 64-bit flavors.




Absolute Number of SAPS (SAPS per system)

Gold -> Fujitsu PRIMEPOWER 2500 with 128 SPARC64 V @2080 MHz processors: 105820 SAPS.
Silver-> IBM eServer p5 Model 595 with 64 POWER5 @1900 MHz: 100700 SAPS.
Bronze -> Sun Fire Model E25k with 72 UltraSPARC IV @1200 MHz: 51070 SAPS.
The last ->  Fujitsu Siemens Computers PRIMERGY Model BX300 with 1 Intel Pentium M @1800 MHz: 830 SAPS.



[image]

SAPS per core / per thread

In those days the terms processor and core were managed by marketing and, consequently, blurred and ill-defined. The problem is that sometimes processors equal cores, and sometimes they do not. In the official SAPS table the cores (and threads) columns are zero, and only the processors column is filled. Thus, I cannot offer a meaningful analysis unless I spend a lot of time analyzing system by system (and that is not in my near-term scope).