divendres, 11 de novembre del 2016

The scalability of the software


The developers team in the ABC company has just built a five star transaction / program. The program code has a critical region. Basic performance tests with a few users result in a the total execution time of 1 s, with a residence time in the critical region of 0.05 s. These numbers are considered satisfactory by the management, so the deployment for general availability is scheduled for next weekend.

You, a performance analyst's apprentice, ask for the expected concurrency level, that is, the number of simultaneous executions of the  transaction / program. This concurrency results to be 100.

What do you think about this?

A suitable performance model

A very simple model to analyze and predict the performance of the system is a closed loop, with two stages and fixed / deterministic time in each stage, as depicted here:



The total execution time of the program is divided into:
  • the time in the parallel region, where concurrency is allowed.
  • the time in the serial (critical) region, where simultaneity is not allowed..

With only one user (one copy of the program/transaction in execution) the elapsed time is P + S, that is, 1 s ( = 0.95 + 0.05 ).

But what happens when the concurrency level is N? In particular, what happens when N=100?

And the model predicts...

Calculating as explained in "The Phases of the Response Time" the model predicts the saturation point at N*=20 (=1+0.95/0.05) users. This is the software scalability limit. More than 20 users or simultaneous executions will queue at the entry point of the critical region. The higher the concurrency level, the bigger the queue, and the more the waiting time. You can easily calculate that with the target concurrency level of 100 users, the idyllic 1 s time measured by the developers team (with few users) will increase to an unacceptable 5 s level. This means that the elapsed time of any program/transaction execution will be 5 s, distributed in the following way:
  • 0.95 s in the parallel region,
  • 4 s waiting to enter the critical (serial) region, and
  • 0.05 s in the critical region.



Elapsed execution time for N=1 and N=100 concurrency level


The graph of the execution time against the number of concurrent users is the following:
Elapsed execution time against the concurrency level


And, in effect, when the program is released  the unacceptable response time shows up!

Corrective measures

The crisis committee hold an urgent meeting, and these are the different points of views:
  • Developers Team: the problem is caused by a HW capacity insufficiency. Please, growth (assign more cores to) the VM supporting the application and the problem will disappear.
  • Infrastructure Team: hardware undersized? No point. The CPU usage is barely 25%! We don't know what is happening.
  • Performance Analyst Team (featuring YOU): more cores won't solve the problem as the hardware is not the bottleneck!  

Additional cores were assigned but, as you rightly predicted, things remained the same. The bottleneck here is not the hardware capacity. but the program itself. The right approach to improve the performance numbers is by reducing the residence time in the non parallelizable critical region. So the developers team should review the program code in a  performance aware manner.

You go a step further and expose more predictions: if the time in the critical region were reduced from the current 0.05 s to 0,02 s the new response time for a degree of simultaneity of 100 will be 1.5 s, and the new response time graph will be this one (blue 0.05 s, red 0.02 s):

Elapsed execution time against the concurrency level for S=0.05 ms (blue) and S=0.02 ms (red).

Lessons learnt

  • Refrain to blame the hardware capacity by default. There are times, more than you think, in which the hardware capacity is not the limiting factor, but an innocent bystander that gets pointed as the culprit.
  • Plan and execute true performance tests in the development phase, and specially a high load one, because with few users you probably will not hit the performance bottleneck.
  • Definitively welcome the skills provided by a performance analyst. Have one in your team. You won't regret.




Cap comentari:

Publica un comentari a l'entrada