Intel Xeon E5-2697 v2 and Xeon E5-2687W v2 Review: 12 and 8 Cores
by Ian Cutress on March 17, 2014 11:59 AM EST
Posted in: CPUs, Intel, Xeon, Enterprise
Real World CPU Benchmarks
Rendering – Adobe After Effects CS6: link
Published by Adobe, After Effects is a digital motion graphics, visual effects and compositing software package used in the post-production process of filmmaking and television production. For our benchmark we downloaded a common scene in use on the AE forums for benchmarks and placed it under our own circumstances for a repeatable benchmark. We generate 152 frames of the scene and present the time to do so based purely on CPU calculations.
Because After Effects CS6 is well optimized for multithreading, more cores and threads matter more than higher clock speeds in our test.
Compression – WinRAR 5.0.1: link
Our WinRAR test from 2013 is updated to the latest version of WinRAR at the start of 2014. We compress a set of 2867 files across 320 folders totaling 1.52 GB in size – 95% of these files are small typical website files, and the rest (90% of the size) are small 30 second 720p videos.
Due to the variable nature of the WinRAR test, our Xeons come out on top but it is hard to choose between them.
Image Manipulation – FastStone Image Viewer 4.9: link
As with WinRAR, the FastStone test is updated for 2014 to the latest version. FastStone is the program I use to perform quick or bulk actions on images, such as resizing, adjusting for color and cropping. In our test we take a series of 170 images in various sizes and formats and convert them all into 640x480 .gif files, maintaining the aspect ratio. FastStone does not use multithreading for this test, and thus single threaded performance is often the winner.
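To make "640x480, maintaining the aspect ratio" concrete, here is a minimal sketch of the usual fit-inside-a-box calculation. It is an illustration only: the function name `fit_within` is hypothetical, and it is an assumption that FastStone computes output sizes this way (for example, whether it upscales images already smaller than the target is not stated here).

```python
def fit_within(width, height, max_width=640, max_height=480):
    """Scale (width, height) to fit inside the target box, keeping aspect ratio.

    Hypothetical helper for illustration; not FastStone's actual code.
    """
    # The tighter of the two constraints decides the scale factor.
    scale = min(max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    for size in [(1920, 1080), (3000, 4000), (640, 480)]:
        print(size, "->", fit_within(*size))
```

A 16:9 source ends up width-limited and a portrait source ends up height-limited, so most outputs are smaller than 640x480 in one dimension.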
FastStone is a single threaded application where IPC and MHz matter. As a result, the newest architectures and platforms do better here than the Ivy Bridge-E based Xeons.
Video Conversion – Xilisoft Video Converter 7: link
The XVC test I normally do is updated to the full version of the software, and this time the test itself is different as well. Here we take two different videos: a double UHD (3840x4320) clip of 10 minutes and a 640x266 DVD rip of a 2h20 film, and convert both to iPod suitable formats. The reasoning here is simple – when frames are small enough to fit into memory, the algorithm has more chance to apply work between threads and process the video quicker. Results shown are the time taken to encode, in seconds.
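For a sense of how different the two source clips are, the sketch below compares pixels per frame using only the resolutions quoted above. Frame rates, bit depth and codec are not given here, so this says nothing about total data volume; the helper name is hypothetical.

```python
def pixels_per_frame(width, height):
    """Number of pixels in a single frame of the given resolution."""
    return width * height


# Resolutions as quoted in the test description.
clips = {
    "double UHD": (3840, 4320),
    "DVD rip": (640, 266),
}

if __name__ == "__main__":
    counts = {name: pixels_per_frame(w, h) for name, (w, h) in clips.items()}
    for name, count in counts.items():
        print(f"{name}: {count:,} pixels per frame")
    ratio = counts["double UHD"] / counts["DVD rip"]
    print(f"ratio: roughly {ratio:.0f}x")
```

Each double UHD frame carries close to a hundred times as many pixels as a frame of the DVD rip, which is why the two clips stress thread dispatch so differently.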
When going through lots of small frames, our XVC test working on one file prefers cores and threads over MHz.
When the workload has some room to grow with larger frames, segments of each frame can be dispatched to cores more appropriately and the 12-core Xeon comes out on top.
Video Conversion – Handbrake v0.9.9: link
Handbrake is a media conversion tool that was initially designed to help convert DVD ISOs and Video CDs into more common video formats. The principle today is still the same, primarily as an output for H.264 + AAC/MP3 audio within an MKV container. In our test we use the same videos as in the Xilisoft test, and results are given in frames per second.
Similar to the XVC test, when the frames are small the software has to fight against thread dispatch of smaller pieces that get in the way of opening up the throttle.
Move to larger frames again and the Xeons can use their full force. Cores over MHz wins here.
Rendering – PovRay 3.7: link
The Persistence of Vision RayTracer, or PovRay, is a freeware package for, as the name suggests, ray tracing. It is a pure renderer, rather than modeling software, but the latest beta version contains a handy benchmark for stressing all processing threads on a platform. We have been using this test in motherboard reviews to test memory stability at various CPU speeds to good effect – if it passes the test, the IMC in the CPU is stable for a given CPU speed. As a CPU test, it runs for approximately 2-3 minutes on high end platforms.
PovRay becomes an embarrassingly parallel benchmark where cores x frequency come out on top. This pattern of results is a common sight in our synthetic testing.
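As a rough illustration of the "cores x frequency" heuristic, the sketch below multiplies core count by clock speed for the two Xeons in this review. The base clocks (2.7 GHz and 3.4 GHz) are assumptions taken from Intel's published specifications, not from this page, and the model ignores turbo, Hyper-Threading and memory effects, so treat it as a back-of-the-envelope ordering rather than a prediction of scores.

```python
def aggregate_ghz(cores, base_ghz):
    """Crude throughput proxy for embarrassingly parallel work: cores x GHz."""
    return cores * base_ghz


# (cores, base clock in GHz); clocks are assumed from Intel's spec sheets.
chips = {
    "Xeon E5-2697 v2": (12, 2.7),
    "Xeon E5-2687W v2": (8, 3.4),
}

if __name__ == "__main__":
    ranked = sorted(chips.items(), key=lambda kv: aggregate_ghz(*kv[1]), reverse=True)
    for name, (cores, ghz) in ranked:
        print(f"{name}: {cores} cores x {ghz} GHz = {aggregate_ghz(cores, ghz):.1f} core-GHz")
```

Under these assumptions the 12-core part has about 19% more aggregate core-GHz than the 8-core part despite its lower clock.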
71 Comments
vLsL2VnDmWjoTByaVLxb - Monday, March 17, 2014 - link
> TrueCrypt is an off the shelf open source encoding tool for files and folders.
Encoding?
Brutalizer - Monday, March 17, 2014 - link
I would not say these cpus are for high end market. High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets! These expensive Unix RISC servers or IBM Mainframes, have extremely good RAS. For instance, some Mainframes do every calculation in three cpus, and if one fails it will automatically shut down. Some SPARC cpus can replay instructions if something went wrong. Hotswap cpus, and hotswap RAM. etc etc. These low end Xeon cpus have nothing of that.
PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, costed $35 million. No typo. One single huge 32 socket server, costed $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category.
In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sports up to 262.000 cores and 100s of TB. These are the characteristica of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node. Many small computers in a cluster.
Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:
http://www.realworldtech.com/sgi-interview/6/
"The success of Altix systems in the high performance computing market are a very positive sign for both Linux and Itanium. Clearly, the popularity of large processor count Altix systems dispels any notions of whether Linux is a scalable OS for scientific applications. Linux is quite popular for HPC and will continue to remain so in the future,...However, scientific applications (HPC) have very different operating characteristics from commercial applications (SMP). Typically, much of the work in scientific code is done inside loops, whereas commercial applications, such as database or ERP software are far more branch intensive. This makes the memory hierarchy more important, particularly the latency to main memory. Whether Linux can scale well with a SMP workload is an open question. However, there is no doubt that with each passing month, the scalability in such environments will improve. Unfortunately, SGI has no plans to move into this SMP market, at this point in time."
Kevin G - Tuesday, March 18, 2014 - link
@Brutalizer
And here we go again. ( http://anandtech.com/comments/7757/quad-ivy-brigde... )
“These low end Xeon cpus have nothing of that.”
This is actually accurate as the E5 is Intel’s midrange Xeon series. Intel has the E7 line for those who want more RAS or scalability to 8 sockets. Features like memory hot swap or lock step mirroring can be found in select high end Xeon systems. If you want ultra high end RAS, you can find it if you need it, as long as you pay the premium price for it.
“In contrast to this, every server larger than 32/64 sockets, is a cluster. For instance the SGI Altix or UV2000 servers, which sports up to 262.000 cores and 100s of TB. These are the characteristica of supercomputer clusters. These huge clusters are dirt cheap, and you pay essentially the hardware cost. Buy 100 nodes, and you pay 100 x $one node.”
Incorrect on several points but they’ve already been pointed out to you. The UV2000 is fully cache coherent (with up to 64 TB of memory) with a global address space that operates as one uniform, logical system that only a single OS/Hypervisor is necessary to boot and run.
Secondly, the price of the UV2000 does not scale linearly. There are NUMALink switches that bridge the coherency domains that have to be purchased to scale to higher node counts. This is expected of how the architecture scales and is similar to other large scale systems from IBM and Oracle.
“Clusters are only used for HPC number crunching.”
Incorrect. Clustering is standard in what you define as SMP applications (big business ERP). It is utilized to increase RAS and prevent downtime. This is standard procedure in this market.
“SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems,”
Why? As long as the underlying architecture is the same, they can run. You may not get the same RAS or scale as high in a single logical system but they’ll work. Performance is where you’d expect it on these boxes: a dual socket HPC system will perform at roughly one quarter the speed of the same chips occupying an 8 socket system.
“as SGI explains in this link:
http://www.realworldtech.com/sgi-interview/6/”
As pointed out numerous times before, the link you cite is a decade old. SGI has moved into the SMP space with the Altix UV series. Continuing to use this link as relevant is plain disingenuous and deceptive.
As for an example of a big ERP application running on such an architecture, the US Post Office runs Oracle Data Warehousing software on a UV1000. ( https://www.fbo.gov/index?s=opportunity&mode=f... )
Brutalizer - Tuesday, March 18, 2014 - link
Do you really think that UV (which is the successor to Altix) is that different? Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later. Windows will not be superior to Unix after some development. You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?
Altix is only for HPC number crunching, says SGI in my link. Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research.
In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu. Do you really think that is feasible? Does it sound reasonable to you? IBM and Oracle and HP has had great problems connecting 32 sockets to each other, just look at the connections on the last picture at the bottom, do you see all connections? Now imagine half a billion of them in a server!
http://www.theregister.co.uk/2013/08/28/oracle_spa...
But on the other hand, if you keep the number of connections down to islands, and then connect the islands to each other, you don't need half a billion. This solution would be feasible. And then you are not in SMP territory anymore: SGI say like this on page 4 about the UV2000 cluster:
www.sgi.com/pdfs/4395.pdf
"...SMP is based on intra-node communication using memory shared by all cores. A cluster is made up of SMP compute nodes but each node cannot communicate with each other so scaling is limited to a single compute node...."
Don't you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why do the UV line of servers focus on OpenMPI accelerators? OpenMPI is never used in SMP workloads, only in HPC.
Do you have any benchmarks where one 32.768 cpu SGI UV2000 demolishes 50-100 of the largest Oracle SPARC M6-32 in business systems? And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000 of sockets is very very cheap?
Kevin G - Tuesday, March 18, 2014 - link
Wow, I think the script you're copy/pasting from needs better revision.
"Do you really think that UV (which is the successor to Altix) is that different?"
Yes. SGI changed the core architecture to add cache coherent links across the entire system. Clusters tend to have an API on top of a networking software stack to abstract the independent systems so they may act as one. The UV line does not need to do this. For one processor to use memory and perform calculations on data residing off of a CPU on the other end, a memory read operation is all that is needed on the UV. It is really that simple.
"Windows is Windows, and it will not magically challenge Unix or OpenVMS, in some iterations later."
The UV can run any OS that runs on modern x86 hardware today. Windows, Linux, Solaris (Unix) and perhaps at some point NonStop (HP's mainframe OS http://h17007.www1.hp.com/us/en/enterprise/servers... ). The x86 platform has plenty of choices to choose from.
"You think that HPC- Altix will after some development, be superior to Oracle and IBM's huge investments of decades research in billions of USD? Do you think Oracle and IBM has stopped developing their largest servers?"
What I see SGI offering is another tool alongside IBM and Oracle systems. Also you mention decades of research, then it is also fair to put SGI into that category as that link you love to spam IS A DECADE OLD. Clearly SGI didn't have this technology back in 2004 when that interview was written.
"Today the UV line of servers, has up to 262.000 cores and 100s of TB of RAM. Whereas the largest Unix and IBM Mainframes have 64 sockets and couple of TB RAM, after decades of research."
Actually this is a bit incorrect. IBM can scale to 131,072 cores on POWER7 if the coherency requirement is forgiven. Oh, and this system can run either AIX or Linux when maxed out. Source: http://www.theregister.co.uk/Print/2009/11/27/ibm_...
"In a SMP server, all cpus will have to be connected to each other, for this SGI UV2000 with 32.768 cpus, you would need (n²) 540 million (half a billion) threads connecting each cpu.
http://www.theregister.co.uk/2013/08/28/oracle_spa..."
Wow, do you not read your own sources? Not only is your math horribly, horribly wrong, but the correct methodology for calculating the number of links as things scale is found in the link you provided. To quote that link: "The Bixby interconnect does not establish everything-to-everything links at a socket level, so as you build progressively larger machines, it can take multiple hops to get from one node to another in the system. (This is no different than the NUMAlink 6 interconnect from Silicon Graphics, which implements a shared memory space using Xeon E5 chips...)"
The full implication here is that if the UV 2000 is not an SMP machine, then neither is Oracle's soon-to-be-released 96 socket device. The topology to scale is the same in both cases per your very own source.
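For reference, the link count being argued over can be checked directly. A fully connected (everything-to-everything) topology of n endpoints needs n(n-1)/2 point-to-point links, which is about half of n². The sketch below is illustrative only: the function name is hypothetical, and the endpoint counts are the figures mentioned in this thread, not a description of how any of these machines is actually wired.

```python
def full_mesh_links(n):
    """Point-to-point links needed to connect n endpoints all-to-all."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # 32 and 256 are socket counts and 32,768 is the "cpus" figure quoted above.
    for n in (32, 256, 32768):
        print(f"{n:>6} endpoints: {full_mesh_links(n):,} links (n^2 = {n * n:,})")
```

The growth is quadratic either way, which is why the interconnects quoted above avoid a full mesh and accept multiple hops between nodes instead.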
"SGI say like this on page 4 about the UV2000 cluster:
www.sgi.com/pdfs/4395.pdf"
Fundamentally false. If you were to actually *read* the source material for that quote, it is not describing the UV2000. Rather, it is speaking generically about the differences between a cluster and a large SMP box on page 4. If you go to page 19, it further describes the UV 2000 as a single system image unlike that of a cluster as defined on page 4.
"Don't you think that a 262.000 core server and 100s of TB of RAM sounds more like a cluster, than a single fat SMP server? And why do the UV line of servers focus on OpenMPI accelerators? OpenMPI is never used in SMP workloads, only in HPC."
All I'd say about a 262,000 core server is that it wouldn't fit into a single box. Then again IBM, Oracle and HP are spreading their large servers across multiple chassis so this doesn't bother me at all. The important part is how all these boxes are connected. SGI uses NUMAlink6 which provides cache coherency and a global address space for a single system image. OpenMPI can be used inside of a cache coherent NUMA system as it provides a means to guarantee memory locality when data is used for execution. It is a means of increasing efficiency for applications that use it. However, OpenMPI libraries do not need to be installed for software to scale across all 256 sockets on the UV2000. It is purely an option for programmers to take advantage of.
"And why is the UV2000 much much cheaper than a 16/32 socket Unix server? Why does a single 32 socket Unix server cost $35 million, whereas a very large SGI cluster with 1000 of sockets is very very cheap?"
First, to maintain coherency, the UV2000 only scales to 256 sockets/64 TB of memory. Second, the cost of a decked out P795 from IBM in terms of processors (8 sockets, 256 cores) and memory (2 TB) but only basic storage to boot the system is only $6.7 million wholesale. Still expensive but far less than what you're quoting. It'll require some math and reading comprehension to get to that figure but here is the source: http://www-01.ibm.com/common/ssi/ShowDoc.wss?docUR...
I couldn't find pricing for the UV2000 as a complete system but purchasing the Intel processors and memory separately to get to a 256 socket/64 TB system would be just under $2 million. Note that that figure is just processor + memory, no blade chassis, racks or interconnect to glue everything together. That would also be several million. So yes, the UV2000 does come out to be cheaper but not drastically. That IBM pricing document does highlight why their high end systems cost so much, mainly capacity on demand. The p795 is getting a mainframe like pricing structure where you purchase the hardware and then you have to activate it as an additional cost. Not so on the UV2000.
psyq321 - Tuesday, March 18, 2014 - link
Xeon 2697 v2 is not a "low end" Xeon.
It is part of "expandable server" platform (EP), being able to scale up to 24 cores.
That is far from "low end", at least in 2014.
alpha754293 - Wednesday, March 19, 2014 - link
"High end market are huge servers, with as many as 32 sockets, some monster servers even have 64 sockets!"
Partially true. The entire cabinet might have that many sockets/processors, but on a per-system, per-"box" level, most max out between two and four. You get a few oddballs here and there that would have a daughter board for a true 8-socket system, but those are EXTREMELY rare in actuality. (Tyan, I think had one for the AMD Opterons, and they said that less than 5% of the orders were for the full-fledged 8-socket systems).
"PS. Remember that I distinguish between a SMP server (which is a single huge server) which might have 32/64 sockets and are hugely expensive. For instance, the IBM P595 32-socket POWER6 server used for the old TPC-C record, costed $35 million. No typo. One single huge 32 socket server, costed $35 million. Other examples are IBM P795, Oracle M5-32 - both have 32 sockets. Oracle have 32TB RAM which is the largest RAM server on the market. IBM Mainframes also belong to this category."
Again, only partially true. The costs and such are correct, but the assumptions behind what you're writing are incorrect. SMP is symmetric multiprocessing. BY DEFINITION, that means that it "involves a multiprocessor computer hardware and software architecture where two or more identical processors connect to a single, shared main memory, have full access to all I/O devices, and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes." (source: wiki) That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI).
Furthermore, the old TPC-C that you mention, they do NOT process as one monolithic sequential series of events in parallel (so think of like how PATA works...), but rather more like a JBOD SATA (i.e. the processing of the next transaction does NOT depend on ALL of the current block of transactions to be completed, UNLESS there is an inherent dependency issue, which I don't think would be very common in TPC-C). Like bank accounts, they're all treated as discrete and separate, independent entities, which means you can send all 150,000 accounts against the 32-socket or 64-socket system and it'll just pick up the next account when the current one is done, regardless.
The other failure in your statement or assumption is that's why there's something called HA - high availability. Which means that they can dynamically hotswap an entire node if there's a CPU failure, so that the node can be downed and yanked out for service/repair while another one is hotswapped in. So it will failover to a spare hotswap node, work on it, and then either fail back over to the newly replaced node or it would rotate the new one into the hotswap failover pool. (There are MANY different ways of doing that and MANY different topologies).
The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)
"In contrast to this, every server larger than 32/64 sockets, is a cluster."
Again, not entirely true. You can actually get 4 socket systems that are comprised of two dual-socket nodes and THAT is enough to meet the requirements of a cluster. Heck, if you pair up two single-socket consumer-grade systems, that TOO is a cluster. That's kinda how Beowulf clusters got started - cuz it was an inexpensive way (compared to the aforementioned RISC UNIX based systems) to gain computing power without having to spend a lot of money.
"These huge clusters are dirt cheap"
Sure...if you consider IBM's $100 million contract award "cheap".
"Clusters are only used for HPC number crunching. SMP servers (one single huge server, extremely expensive because it is very difficult to scale beyond 16 sockets) are used for ERP business systems. An HPC cluster can not run business systems, as SGI explains in this link:"
So there are two problems with this - 1) it's SGI - so of course they're going to promote what they ARE capable of vs. what they don't WANT to be capable of. 2) Given the SGI-biased statements, this, again, isn't EXACTLY ENTIRELY true either.
HPCs CAN run ERP systems.
"HPC vendors are increasingly targeting commercial markets, whereas commercial vendors, such as Oracle, SAP and SAS, are seeing HPC requirements." (Source: http://www.information-age.com/it-management/strat... )
But that also depends on the specific implementation of the ERP system given that SAP is NOT the ONLY ERP system that's available out there, but it's probably one of the most popular ones, if not THE most popular one. (There's a whole thing about distributed relational databases so that the database can reside in smaller chunks across multiple nodes, in-memory, which are then accessed via a high speed interconnect like Myrinet or Infiniband or something along those lines.)
Furthermore, the fact that ERP runs across large mainframes (it grows as the needs grow), is an indication of HPC's place in ERP. Alternatively, perhaps rather than using it for the backend, HPC can be used on the front end by supporting many, many, many virtualized front-end clients.
Like I said, most of the numbers that you wrote are true, but the assumptions behind them aren't exactly all entirely true.
See also: http://csserver.evansville.edu/~mr56/Publications/...
Kevin G - Wednesday, March 19, 2014 - link
"That means that it is a monolithic system, again, of which, few are TRULY such systems. If you've ever ACTUALLY witnessed the startup/bootup sequence of an ACTUAL IBM mainframe, the rest of the "nodes" are actually booted up typically by PXE or something very similar to that, and then the "node" is enumerated into the resource pool. But, for all other intents and purposes, they are semi-independent, standalone systems, because SMP systems do NOT have the capability to pass messages and/or memory calls (reads/writes/requests) without some kind of a transport layer (for example MPI)."
Not exactly. IBM's recent boxes don't boot themselves. Each box has a service processor that initializes the main CPUs and determines if there are any additional boxes connected via external GX links. If it finds external boxes, some negotiation is done to join them into one large coherent system before an attempt to load an OS is made. This is all done in hardware/firmware. Adding/removing these boxes can be done but there are rules to follow to prevent data loss.
It'll be interesting to see what IBM does with their next generation of hardware as the GX bus is disappearing.
"The statement you made about having 32TB of RAM is again, partially true. But NONE of the single OS instances EVER have full control of all 32TB at once, which again, by DEFINITION, means that it is NOT truly SMP. (Course, if you ever get a screenshot which shows that, I'd LOVE to see it. I'd LOVE to get corrected on that.)"
Actually on some of these larger systems, a single OS can see the entire memory pool and span across all sockets. The SGI UV2000 and SPARC M6 are fully cache coherent across a global memory address space.
As for a screenshot, I didn't find one. I did find a video going over some of the UV 2000 features displaying all of this though. It is only a 64 socket, 512 core, 1024 thread, 2 TB of RAM configuration running a single instance of Linux. :)
https://www.youtube.com/watch?v=YUmBu6A2ykY
IBM's topology is weird in that while a global memory address space is shared across nodes, it is not cache coherent. IBM's POWER7 and their recent BlueGene systems can be configured like this. I wouldn't call these setups clusters as there is no software overhead to read/write to remote memory addresses but it isn't fully SMP either due to multiple coherency domains.
silverblue - Monday, March 17, 2014 - link
The A10-7850K is a 2M/4T CPU.
Ian Cutress - Monday, March 17, 2014 - link
Thanks for the correction, small brain fart on my part when generating the graphs.