Thursday, 28 January 2016

More on ESX vCPU versus PowerVM VP

Let’s explore further singularities of virtual CPUs (virtual CPUs in ESX parlance, virtual processors in PowerVM parlance). In particular, we will try to determine the relationship between the throughput delivered by a single, solitary Virtual Machine and the number of assigned vCPUs/VPs.

Once more we will run a thought-experiment SAPS benchmark on our already iconic systems, summarized in the following table.

                  Dell PowerEdge R730           IBM POWER S824
Physical System   2s/36c/72t Intel Xeon         2s/24c/192t POWER8
                  E5-2699 v3 @2.30 GHz          @3.52 GHz
Cores             36                            24
Threads           72                            192
SAPS              90120                         115870


Just a reminder (see the “Don’t put in the same bag Xeon and POWER virtual CPUs” post in this blog):
  • ESX maps a vCPU to a HW thread, and there can be one or two running threads per core.
  • PowerVM maps a VP to a core.  
We are not taking into account capacity reductions due to virtualization or CPU overcommitment (there is no CPU overcommit here).

PowerVM VP

The graph of throughput versus number of VPs for the PowerVM/POWER system is perfectly linear: an additional VP enables an additional running core, and thus contributes the same number of SAPS (roughly 4800 SAPS per VP).

ESX vCPU

The graph of throughput versus number of vCPUs for the ESX/Xeon case is very different, as you see below.

Why this shape? Because there are two types of ESX vCPUs:
  • high capacity: vCPUs running on solitary threads (only one thread active in a core), estimated to deliver 2015 SAPS
  • low capacity: vCPUs running on threads from cores with two active threads, estimated to deliver 1250 SAPS (2500 SAPS/core * 1 core/2 threads).

From 1 to 36 vCPUs there are only high-capacity vCPUs, assuming the hypervisor always tries to obtain the best capacity, so the slope is 2015 SAPS for each new vCPU. Alternatively, the hypervisor might dispatch units of work to both threads of the same core before activating the next core, but in that case the throughput would always be smaller for a given number of vCPUs.

From 37 to 72 vCPUs each additional vCPU, being mapped to the second active thread of a core, runs at low capacity and forces the companion thread in the same core down to low capacity as well, thus adding only a net 485 SAPS (= 2500 - 2015) per additional vCPU.
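To make the two shapes concrete, here is a minimal sketch of both throughput models in Python (the per-VP, per-thread and per-core SAPS constants are the estimates quoted above, not measured values):

def powervm_saps(vps, saps_per_vp=4828):
    """PowerVM: each VP maps to a core, so throughput grows linearly."""
    return vps * saps_per_vp

def esx_saps(vcpus, saps_solo_thread=2015, saps_per_core_smt2=2500, cores=36):
    """ESX: the first `cores` vCPUs land on solitary threads (high
    capacity); each further vCPU shares a core with a companion thread,
    dropping that whole core to the two-thread throughput."""
    shared = max(0, vcpus - cores)      # cores now running two threads
    solo = min(vcpus, cores) - shared   # cores still running one thread
    return solo * saps_solo_thread + shared * saps_per_core_smt2

# The kink at 36 vCPUs: +2015 SAPS per vCPU before, +485 net after.
print(esx_saps(36))  # 72540
print(esx_saps(37))  # 73025  (only 485 more than at 36)
print(esx_saps(72))  # 90000  (close to the 90120 benchmark figure)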

Summarizing

The (performance) capacity of the PowerVM VP is greater than that of the ESX vCPU; this is well established and known. And the capacity delivered by each PowerVM VP is uniform, no matter how many of them there are.



But the same cannot be said of the ESX vCPU: when you have an ESX vCPU, which kind is it, high or low capacity? How much capacity will you obtain when adding an additional ESX vCPU, 2015 or 485 SAPS?

A little bit disturbing, don’t you think so?



Mirror: 
https://www-304.ibm.com/connections/blogs/performance/entry/more_on_esx_vcpu_versus_powervm_vp

Monday, 11 January 2016

Don’t put in the same bag Xeon and POWER virtual CPUs

Just a reminder: this blog is a mirror of the "main" site https://www-304.ibm.com/connections/blogs/performance (http://ibm.biz/demystperf)



In the pervasive virtual world the standard unit of performance capacity happens to be the virtual CPU (vCPU). “This virtual machine (VM) has 6 vCPUs”, “you will have to provide 8 vCPUs for that VM”, and the like are common sentences. It could be a reasonable metric of performance if the underlying physical CPUs and the hypervisor layer were all the same. But this is seldom the case: Intel Xeon processors combined with VMware ESX virtualization, and IBM POWER processors combined with POWERVM virtualization are very different beasts.


If you need to size or convert capacity between these dissimilar systems, need a solid comparison base, or would like to unmask the tricks and pitfalls that plague virtual-world sizing, continue reading.

IBM POWERVM



The POWERVM term for vCPU is Virtual Processor (VP). The VM has, or sees, VPs. And these VPs are scheduled, in a time-shared manner, on POWER cores. Yes, read it again: one VP is scheduled on one core. I point this out because in the ESX / Intel world this is different, as you will see later.


Given this VP-to-core mapping, the VP capacity ranges between two values:
  • In the best case, the capacity of one VP is the capacity that one core can deliver, that is, 1 VP is 1 core
  • In the worst case, 1 VP is one tenth (1/10) of a core.


The actual VP capacity depends on the following factors (revisit  “Why is the Virtual Capacity so Important?” and “The Playground of the Virtual Capacity”  in this blog for a detailed explanation):
  • configuration parameters of the VM the VP belongs to (entitlement, capped/uncapped attribute, uncapped weight)
  • configuration parameters of all the other VMs sharing the same physical machine (PM)
  • actual usage of capacity from all the other VMs sharing the same physical machine


Given this highly variable value, the opposite of what a unit of measure must be, how have VPs been promoted to be a “standard” measure of capacity? Amazing, don’t you think so?


VMWARE ESX



The VM has, or sees, vCPUs. And those vCPUs are scheduled on Intel processor threads, in a time-shared manner. The mapping is vCPU-to-thread, which is different from the POWER / POWERVM case (VP-to-core).


Given this vCPU-to-thread mapping, the vCPU capacity ranges between two values:
  • In the best case, the capacity of one vCPU is the capacity that one thread can deliver, that is, 1 vCPU is 1 thread
  • In the worst case, one vCPU is very small (I’m not aware of a lower limit).


The actual vCPU capacity depends on the same factors described in the POWERVM case:
  • configuration parameters of the VM the vCPU belongs to
  • configuration parameters of all the other VMs sharing the same PM
  • actual usage of capacity from all the other VMs sharing the same PM.

Benchmarking vCPUs



The reputation of the vCPU as a stable unit of capacity has been destroyed. A vCPU’s capacity can range from a full core (or a full thread) down to a small fraction, and it even depends on alien factors (from other VMs)!


Is there a way to put some sense into this nihilism?


Yes, there is: take a practical approach and use the best-case values. You know that the actual performance will always be equal to or worse than that, but we have to live with this.


To evaluate the best case, let’s consider the two systems we analyzed in the “SAPS Olympics: single thread performance” post in this blog.


                  Dell PowerEdge R730           IBM POWER S824
Physical System   2s/36c/72t Intel Xeon         2s/24c/192t POWER8
                  E5-2699 v3 @2.30 GHz          @3.52 GHz
Cores             36                            24
Threads           72                            192


The best-performance VM setup running on these systems is a single VM with all the processors assigned, that is:
  • IBM POWER S824 2s/24c/192t POWER8 @3.52GHz with 24 VPs  (= 1 VP/core x 24 cores).
  • Dell PowerEdge R730 2s/36c/72t Intel Xeon  E5-2699 v3 @2.30 GHz with 72 vCPUs (= 1 vCPU/thread x 2 thread/core x 36 cores).


And the final results, without taking into account the reduction of capacity due to virtualization, would be:


                  Dell PowerEdge R730           IBM POWER S824
Physical System   2s/36c/72t Intel Xeon         2s/24c/192t POWER8
                  E5-2699 v3 @2.30 GHz          @3.52 GHz
Cores             36                            24
Threads           72                            192
VM                1                             1
vCPU / VP         72                            24
SAPS              90120                         115870
SAPS per vCPU/VP  1250                          4828
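The per-unit figures in the last row are simply the benchmark SAPS divided by the number of vCPUs or VPs:

# Derivation of the per-unit values in the table above.
print(90120 / 72)    # ≈ 1252 SAPS per ESX vCPU (rounded to 1250)
print(115870 / 24)   # ≈ 4828 SAPS per PowerVM VP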

The dramatic difference, 4828 SAPS/VP vs 1250 SAPS/vCPU, would be even greater taking into account virtualization effects, as it is widely known that POWERVM is more efficient than ESX. We may consider a 3-5% reduction for POWERVM and a 10-15% reduction for ESX.


If Intel Xeon Hyper-Threading is switched off (HT=Off), which seldom happens in a virtualized environment, the capacity numbers for Intel vCPUs would improve to 2019 SAPS per vCPU (= 2019 SAPS/core x 1 core/thread x 1 thread/vCPU), again without taking into account virtualization overheads.

Summarizing



Which capacity should be assigned to vCPUs? I would take the above estimated values, representing best cases in benchmark conditions, reduced by 3-5% for POWERVM and by 10-15% for ESX. This results in the following approximate and simple relationship:

1 POWER8 VP ≈ 4 Xeon Haswell-EP vCPUs (HT=On)
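One way to arrive at that factor of 4, using the midpoints of the overhead ranges above (4% for POWERVM, 12.5% for ESX; the midpoints are my choice, not a measured figure):

# Rough arithmetic behind the "1 VP ≈ 4 vCPU" rule of thumb.
saps_per_vp   = 4828 * (1 - 0.04)    # ≈ 4635 SAPS after POWERVM overhead
saps_per_vcpu = 1250 * (1 - 0.125)   # ≈ 1094 SAPS after ESX overhead
print(saps_per_vp / saps_per_vcpu)   # ≈ 4.2, i.e. roughly 4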

Test Your Performance Skills 2nd part (Responses)

Ten haircuts per hour is a measure of...
  • Response Time
  • Velocity
  • Throughput
  • Utilization / usage.

The throughput of a service center measures...
  • Units of work processed per unit of time
  • The elapsed time to process one unit of work
  • The elapsed time to process many units of work

The bandwidth is...
  • The average response time of many requests
  • The maximum achievable throughput

The response time can grow to infinity (I’ve seen graphs in which response time goes to infinity!)...
  • In practical cases
  • In theoretical models
  • Never
  • Always

Is it possible to be at 100% CPU usage and, at the same time, have an acceptable response time?
  • Yes
  • No

If we move a certain workload to a new server with half the performance capacity of the original one, what will happen to the response time?
  • Increase slightly
  • Increase heavily
  • Simply increase (but cannot predict how much)
  • Remain stable
  • Decrease

To get serviced, a customer has to visit two desks, A and B, in strict sequence. The customer spends 5 min at desk A and 10 min at desk B. What is the total residence time?
  • 5 min
  • 10 min
  • 15 min
  • 7.5 min (the average of 5 and 10 min)

What is the bandwidth of the previous service center?
  • 9 customers per hour
  • 4 customers per hour
  • 12 customers per hour
  • 6 customers per hour

You, as a great performance analyst, have been asked to place an additional clerk (to work in parallel) at one of the desks, A or B. Where would you place him/her?
  • At the entrance (for marketing purposes)
  • Desk A
  • Desk B
  • At the exit (for customer satisfaction feedback)

Which of the following sentences truly quantifies the improvement of your previous recommendation, if any?
  • There is hardly a performance gain
  • The new residence time is 10 min
  • The new bandwidth is 12 customers per hour
  • The new throughput is 12 customers per hour
  • The new bandwidth is 6 customers per hour
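
Since this is the responses post, here is a minimal sketch of the arithmetic behind the two-desk questions (my own worked numbers, following the usual bottleneck analysis):

# Two desks visited in strict sequence: residence times add up, while
# bandwidth (maximum achievable throughput) is set by the slowest desk.
service_min = {"A": 5, "B": 10}

residence = sum(service_min.values())        # 5 + 10 = 15 min
bandwidth = 60 / max(service_min.values())   # 60/10 = 6 customers/hour

# A second clerk working in parallel at desk B halves its effective
# service time, so B stops being the bottleneck.
service_after = {"A": 5, "B": 10 / 2}
bandwidth_after = 60 / max(service_after.values())  # 60/5 = 12 customers/hour

print(residence, bandwidth, bandwidth_after)  # 15 6.0 12.0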