Untitled

For scalability, you must distinguish between scale-out and scale-up. Scale-out is a cluster, just add a new node and you have increased scalability. Clusters are used for HPC number crunching workloads where you run a tight for loop on some independent data (ideally each node fits everything in the cpu cache). Some examples of Scale-out servers (clusters) are all servers on the Top-500 supercomputer list. Other examples are SGI Altix / UV2000 servers or the ScaleMP server, they have 10,000s of cores and 64 TB RAM or more, i.e. cluster. Sure, they run a single unified Linux kernel image - but they are still clusters. If you read a bit, you will see that the SGI servers are using MPI. And MPI are used on clusters for HPC number crunching.

Scale-up servers, are one single fat huge server. They might have 16 or 32 sockets, some even have 64 sockets! They weigh 1000 kg and costs many many millions. For instance the old IBM P595 Unix server for the old TPC-C record, has 32 sockets and costs $35 million (no typo). One single server with 32 cpus, costs $35 million. You will never ever see this prices on clusters. If you buy a SGI server with 100s of sockets, you will essentially pay the same price as buying individual nodes with the same nr of sockets. But scale-up servers, need heavy redesign and innovative scalability tech, and that is the reason a 16 or 32 socket server costs many many many more times than a SGI cluster having 100s of sockets. They are not in the same arena. These scale-up servers are typically used for SMP workloads, not HPC workloads. SMP workloads are typically large databases or Enterprise ERP workloads. This code is heavy branch intensive, so you can not fit into a cpu cache. It branches everywhere, and clusters can not run these Enterprise workloads because the performance would be very bad. If you need to run Enterprise workloads (where the big margin and big money is) you need to go to 32 socket servers. And they are all RISC or Mainframe servers. Examples are IBM P795, Oracle M6-32, Fujitisu M10-4S, HP Superdome/Integrity. They all run AIX, Solaris, HP-UX and they all have up to 32 sockets or 64 sockets. Some attempts have been made to compile Linux to these huge servers, but the results have been bad because Linux has problems scale above 8-sockets. The reason is the Linux kernel devs does not have access to 32 socket SMP server, because they dont exist, so how can Linux kernel be optimized for 32 sockets? Ted Tso, the famous Linux kernel developer writes:
http://thunk.org/tytso/blog/2010/11/01/i-have-the-...
"...Ext4 was always designed for the “common case Linux workloads/hardware”, and for a long time, 48 cores and large RAID arrays were in the category of “exotic, expensive hardware”, and indeed, for much of the ext2/3 development time, most of the ext2/3 developers didn’t even have access to such hardware...."
Ted Tso considers servers with 48 cores in total, to be huge and out of reach for Linux developers. He is not talking about 48 socket servers, but 48 cores which is chicken shit in the mature Enterprise arena.

For instance the Big Tux HP server, compiled Linux to 64 socket HP integrity server with catastrophic results, the cpu utilization was ~40%, which means every other cpu idles under full load. Google on Big Tux and read it yourself.

There is a reason the huge Linux servers such as SGI UV2000 with 1000s of cores are so cheap in comparison to 16 socket or 32 socket Unix servers, and why the Linux servers are exclusively used for HPC number crunching workloads, and never SMP workloads:

SGI servers are only used for HPC clustered workloads, and never for SMP enterprise workloads:
http://www.realworldtech.com/sgi-interview/6/
"Typically, much of the work in HPC scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time."

Same with the ScaleMP Linux server with 1000s of cores, is never used for SMP workloads:
http://www.theregister.co.uk/2011/09/20/scalemp_su...
"The vSMP hypervisor that glues systems together is not for every workload, but on workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit."

The difficult thing is to scale well above 8 sockets. You can release one single strong cpu, which does not scale. To scale above 8-sockets are very difficult, ask Intel. Thus, this Intel Xeon E7 cpu are only used up to 8-sockets servers. For more oomph, you need 32 socket or even 64 sockets - Unix or Mainframes. SGI Linux servers can not replace these large Unix servers. And that is the reason Linux never will venture into the lucrative Enterprise arena, and never replace large Unix servers. The largest Linux servers capable of Enterprise SMP workloads are 8 sockets. The Linux clusters dont count.

Another reason why this Intel Xeon E7 can not touch the high end server market (beyond scalability limitations) is that the RAS is not good enough. RAS is very very expensive. For isntance, IBM Mainframes and high end SPARC cpus, can replay an instruction if it were an error. x86 can not do this. Some Mainframes have three cpus and compare every computation, and if there is an error, the failing cpu will shut down. This is very very expensive to create this tailor made hardware. It is easy to get good performance, just turn up the GHz up to unstability point. But can you rely on that hardware? No. Enterprise need reliability above else. You must trust your hardware. It is much better to have one slower reliable server, than a super fast cranked up GHz where some computations are false. No downtime! x86 can not do this. The RAS is lacking severly behind and will take decades before Intel can catch up on Unix or Mainframe servers. And at that point - the x86 cpus will be as expensive!

Thus:
-Intel Xeon E7 does not scale above 8-sockets. Unix does. So you will never challenge the high end market where you need extreme performance. Besides, the largest Unix servers (Oracle) have 32TB RAM. Intel Xeon E7 has only 6TB RAM - which is nothing. So x86 does not scale cpu wise, nor RAM wise.
-Intel Xeon E7 has no sufficient RAS, and the servers are unreliable, besides the x86 architecture which is inherently buggy and bad (some sysadmins would not touch a x86 server with a ten feet pole, and only use OpenVMS/Unix or Mainframe):
http://www.anandtech.com/show/3593
-Oracle is much much much much cheaper than IBM POWER systems. The Oracle SPARC servers pricing is X for each cpu. So if you buy the largest M6-32 server with 32TB of RAM you pay 32 times X. Whereas IBM POWER systems costs more and more the more sockets you buy. If you buy 32 sockets, you pay much much much more than for 8 sockets.

Oracle will release a 96-socket SPARC server with up to 96TB RAM. It will be targeted for database work (not surprisingly as Oracle is mainly interested in Databases) and other SMP workloads. Intel x86 will never be able to replace such a huge monster. (Sure, there are clustered databases running on HPC servers, but they can not replace SMP databases). Look at the bottom pic, to see how all sockets are connected to each other in 32 socket configuration. There are only 2-3 hops to reach each node, which is very good. For HPC clusters, the worst case requires many many hops, which makes them unusable for SMP workloads
http://www.theregister.co.uk/2013/08/28/oracle_spa...