Tuesday, 17 December 2013

Better 1x10 than 10x1 (part II)



This is mirrored from ibm.biz/demystperf.

Here are two additional graphs illustrating the “the bigger, the better” fact.

Here we are in the same scenario considered in the previous “Better 1x10 than 10x1” blog post, introducing a new XXXL server, 10 times bigger than the XL one.

The first chart illustrates the usage of the servers when loaded with a customer population that fills each server to an average of 50% of its capacity, that is, 10 users for the 10x-sized (M) server, 100 users for the 100x-sized (XL) server and 1000 users for the 1000x-sized (XXXL) server.
The green zone is where the usage resides 50% of the time. The yellow zone is visited 40% of the time. And the server spends only 10% of the time in the remaining white zone. It is visually clear that the more users there are, the closer to the average the usage wanders (the relative variability reduction).


  



The second chart shows the result of a much better sizing: shrink the almost deserted white zone so that it begins at 90% usage. How many users do the servers support with this sizing target? In other words: how many users push the 10%-of-the-time zone up to the 90%-100% usage range? You agree this is clearly a better sizing, don’t you? Here are the results:





Answers:

  • You can load the 10x-sized server with 13 users, driving the average usage up to 65%.
  • You can load the 100x-sized server with 164 users, driving the average usage up to 82%.
  • You can load the 1000x-sized server with 1746 users, driving the average usage up to 87%.

A technical remark: I’ve obtained the above values with the binomial distribution. When the number of users is “big enough”, the binomial approaches the well-known normal distribution. Let’s stop here for the moment; perhaps I’ll come back to this in a later post.
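For readers who want to play with the numbers, here is a minimal Python sketch of one way to reproduce figures of this kind (my own reconstruction, with helper names of my own choosing, not necessarily the exact calculation behind the post): size each server so that the demand of N fair-coin users exceeds 90% of the capacity only about 10% of the time, using the normal approximation of the binomial, then cross-check with the exact binomial tail. Depending on rounding and on whether you use the exact binomial or its normal limit, the user counts can shift by one or two.

```python
import math

Z90 = 1.2816  # 90th percentile of the standard normal distribution

def users_for_target(capacity, fill=0.9, z=Z90):
    """Approximate N such that P(demand >= fill * capacity) is about 10%.

    Demand is Binomial(N, 1/2): mean N/2, std dev sqrt(N)/2.  The normal
    approximation turns fill * capacity = N/2 + z * sqrt(N)/2 into a
    quadratic in sqrt(N).
    """
    target = fill * capacity
    s = (-z + math.sqrt(z * z + 8 * target)) / 2.0  # s = sqrt(N)
    return round(s * s)

def binomial_tail(n, k):
    """Exact P(X >= k) for X ~ Binomial(n, 1/2), as a cross-check."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

for capacity in (10, 100, 1000):
    n = users_for_target(capacity)
    tail = binomial_tail(n, math.ceil(0.9 * capacity))
    print(f"capacity {capacity:4d} UOW/T: ~{n} users, "
          f"average usage ~{100 * 0.5 * n / capacity:.0f}%, "
          f"exact P(demand >= 90% of capacity) = {tail:.3f}")
```

With these conventions the script lands on 13, 164 and 1746 users, and the exact tail probabilities come out between roughly 10% and 13%, reflecting that the normal approximation is a bit optimistic for the smallest server.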

Wednesday, 11 December 2013

Better 1x10 than 10x1

Here is one of the slides from "The Sizer Advisor, part II", at ibm.biz/sizadv02. Let us discuss it by means of a thought experiment, a cheap and always convenient way to improve our knowledge.
 
image
 
Let us consider three servers, sized XS, M and XL. The XS capacity is 1 unit of work (UOW) per unit time (UOW/T).   M is 10 times bigger than XS, and XL is 10 times bigger than M.  So the M capacity is 10 UOW/T, and the XL capacity is 100 UOW/T.  

Let's assign workloads scaled according to the capacities: XS will support 1 user, M 10 users, and XL 100 users. To simplify, the type of user is the same for the three servers. A single user is modeled in the following way, to account for individual random behaviour: in each time slot, she/he tosses a fair coin to decide what to do (a small simulation sketch follows the list below):
  • If H(ead), the user requests one UOW from the server;
  • If T(ail), the user remains idle.
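To make the model concrete, here is a minimal simulation sketch (my own illustration of the coin-flip model just described; the function name and parameters are hypothetical): each user flips a fair coin per time slot, so the total demand in a slot is simply the number of heads.

```python
import random

def run_server(users, slots=200, seed=1):
    """Demand per time slot (UOW/T) generated by `users` coin-flipping users."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(users)) for _ in range(slots)]

for name, capacity in (("XS", 1), ("M", 10), ("XL", 100)):
    demand = run_server(users=capacity)  # sized workload: one user per UOW/T of capacity
    avg = sum(demand) / len(demand)
    print(f"{name}: average demand {avg:5.1f} UOW/T "
          f"({100 * avg / capacity:.0f}% of capacity), peak {max(demand)} UOW/T")
```

Running it should show averages near 50% of each capacity, with the XS demand jumping between 0 and 1 while the XL demand stays comparatively close to 50 UOW/T.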
Here is the graph resulting from running our experiment once for each server, with the sized workload (1 user for XS, 10 users for M, and 100 users for XL):
 
image
 
If we analyze the above samples we see the following:
  • The average demand is very close to 0.5 UOW/T in the XS server, 5 UOW/T in the M server and 50 UOW/T in the XL server. Those numbers represent 50% of the server capacity in all cases. This is as expected, because the probabilities of Heads and Tails are the same and equal to 1/2.
  • The absolute variability is the biggest in the XL. This is also expected, because its workload corresponds to the sum of 100 individual decisions! In the XL server the range of possible values goes from 0 UOW/T to 100 UOW/T, in the M server from 0 UOW/T to 10 UOW/T, and in the XS it varies from 0 UOW/T to 1 UOW/T.
  • On the contrary, the relative variability, meaning the variation relative to the server capacity (or to the average demand), is smallest in the bigger server. You can easily see this in the following graphs, featuring the same samples as above but scaled to the capacities of the servers.
 
image
image
image
 
What is the reason for this reduced relative variability in the bigger servers? The reason is statistical/probabilistic in nature: when more users add their "random" behaviour, it becomes less probable to reach "extreme" values. In particular, we have:
  • XS server: the probability of 1 Head is 1/2.
  • M server: the probability of 10 simultaneous Heads is 1/(2^10).
  • XL server: the probability of 100 simultaneous Heads is 1/(2^100).
That naturally implies (a quick numeric check follows this list):
  • The XS server will reach 100% usage about 50% of the time (1 of every 2 time slots),
  • The M server will reach 100% usage about 0.098% of the time (1 of every 1024 = 2^10 time slots),
  • The XL server will reach 100% usage about 7.9*10^(-29)% of the time (approx 1 of every 1000000000000000000000000000000 time slots). That is practically never!
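A quick numeric check of these three figures (assuming, as before, one fair coin per user and per time slot):

```python
for users in (1, 10, 100):
    p_full = 0.5 ** users  # all users flip heads at once, so demand equals capacity
    print(f"{users:3d} users: P(100% usage) = {p_full:.3g}, "
          f"i.e. about 1 of every {2 ** users:,} time slots")
```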
Extremely simple, but very illustrative facts! Can you see the necessary next conclusion by yourself? Think! It's inevitable. Since the bigger server is much less likely to reach 100% usage, we can increase the number of users it serves without risking reaching saturation too frequently! Use the sentence you like best to express it:
  • The big server can be filled more (relative to its capacity) than the smaller ones.
  • The big server can be more utilized than the smaller ones.
  • The bigger the server, the higher the usage it may run at.
In our particular experiment, maths and/or further experimentation allow us to conclude that, to reach 100% usage only 10% of the time:
  • The M (x10) capacity server can be loaded with 15 users, that is, 5 more than the sized value of 10. This means that it may run at an average usage of 75%.
  • The XL (x100) capacity server can be loaded with 182 users, that is, 82 more than the sized value of 100. This means that it may run at an average usage of 91%.
Let's go back from our thought experiment to the real world. We meet all of the above in two different ways.
First, and more important, is the fact that bigger servers usually run at higher utilization than smaller ones. How many times have you heard that a small server's average utilization is around 5%, a medium server's around 20%, and a big server's around 80%? I deliberately avoid putting names to the servers, because I'd like to stress that I'm not comparing different server architectures or brands. I'm simply comparing a big XYZ server with a small XYZ server, under the same workload type. The relative variability of the workload is one of the reasons that explain these differences. But it is not the sole reason.
Second is that you have probably already seen the clear differences in relative variability. Look at the two real workloads below, the first one corresponding to a small workload, and the second one to a big one. The relative variability, seen as the size of the "noise", is smaller in the big workload, so its graph is smoother.

image
image
By now the meaning of the title I've chosen for this blog entry should be clear. Better 1x10 than 10x1 simply means that 1 server of 10x size can support more users than 10 servers of 1x size. And, at least regarding the supported workload, this makes it better.

Of course, pricing is another matter.

One final word: I think that existing sizing guidelines generally don't reflect the big servers' ability to run at an increased usage.
 
I hope you enjoy this post.
Jorge L. Navarro
 
PS: I have decided not to use maths expressions here, but we could express all the above in a more rigorous mathematical way, using the binomial probability distribution or its limiting normal distribution when the number of users is large. It can easily be shown that when the number of users increases from N to N', the workload average value increases by the factor N'/N, but its standard deviation (a measure of the absolute variability) grows only with sqrt(N'/N), much more slowly; hence the relative variability (standard deviation divided by the average) shrinks.
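As a small numeric illustration of that last point (my own sketch, again assuming fair-coin users with p = 1/2):

```python
import math

for users in (1, 10, 100, 1000):
    mean = users * 0.5                 # average demand: N * p
    sd = math.sqrt(users * 0.25)       # standard deviation: sqrt(N * p * (1 - p))
    print(f"{users:5d} users: mean {mean:6.1f} UOW/T, std dev {sd:6.2f} UOW/T, "
          f"relative variability (std dev / mean) {sd / mean:.3f}")
```

Multiplying the number of users by 100 multiplies the average by 100 but the standard deviation only by 10, so the relative variability drops by a factor of 10.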

Demystifying Virtual Capacity, Part I

This is another contribution. Here is the link: http://ibm.biz/demvirtcap01

Abstract: When we enter the realm of virtualization, performance-related concepts and terms must be revisited: we move from the fixed capacity of physical machines to the variable capacity of virtual machines. What are the factors this virtual capacity depends on? How does it depend on those factors?

Currently I'm working on Part II. I hope it will be ready soon, very soon.

I hope you like it and, please, give me feedback!
Jorge L. Navarro

The 65% Rule Revisited

Another document I authored:  http://ibm.biz/65-rule.

Abstract: This article deals with the 65% load limit warning that everyone in the sizing business is (or should be) aware of: don't go beyond 65% of the service center capacity! But in the virtual world a service center's capacity is no longer a fixed value, so what happens then? How should this 65% rule be reinterpreted?

Hope you like it!
Jorge L. Navarro

The Sizer Advisor

Here are the links to two documents I authored on sizing, and a brief description:
  • The Sizer Advisor, part I ( http://ibm.biz/sizadv01): This presentation offers ten elementary pieces of advice for sizers, the people dealing with sizing, sizing guidelines and related topics. The advice is generic, not tied to any particular sizing exercise. Beginners should find it interesting, and experts may extract and reuse any slide for teaching their own trainees.
  • The Sizer Advisor, part II ( http://ibm.biz/sizadv02): This presentation is the second part of The Sizer Advisor series. It offers ten additional elementary pieces of advice for sizers, the people dealing with sizing, sizing guidelines and related topics. Again, the advice is generic, not tied to any particular sizing exercise.
I hope you like them!
Jorge L. Navarro

Welcome Message

During my long career I've always felt attracted to computer system performance themes: bandwidth, throughput, response time, sizing, bottlenecks, usage, capacity, saturation degree, and so on. Unfortunately, I have not always found clear, understandable and well-grounded explanations of these subjects. The arrival of the virtualization paradigm introduces even more complexity: terms that used to represent constant values, like the machine capacity, get blurred. What can be said about a (virtual) machine's capacity if it is no longer a fixed value and may change from one hour to the next? With this blog and the related documents, I'd like to fulfill an educational purpose: to illuminate -with a very dim light, I agree- someone else's way into this huge field, by documenting and commenting on the subjects I would have liked to know about when I was a beginner. And I still am! Jorge L. Navarro