Intel Core has in a very short time taken center stage in Intel’s processor lineup, and the reason is simple: Core is an extremely promising architecture. We’ve taken a closer look at the Core 2 Duo family to see what it has to offer.
It’s not often that pioneering changes happen on the CPU market. The times we’ve seen large performance improvements have been when the CPU manufacturers moved to new, more efficient manufacturing processes, with smaller steps in between in the shape of updated revisions and steppings. Historically, a substantial revolution happens roughly every three to four years, with the launch of a completely new CPU architecture. The Pentium M made a quiet entrance on the portable market about a year and a half ago, and by then Intel had started to realize that there might be a way out of the rampant problems of the Pentium 4 CPUs. Then nothing, until 2006, when Intel showed preliminary performance figures for its upcoming architecture. Enthusiast websites and forums were having a fit – the results presented were just too good to be true. By mid-May we were invited for some quality time to test the new CPUs. The benchmarks were still the same, but the setups were still configured by Intel. Even though we were by now generally convinced, we would be more confident running tests on setups of our own, which is what we will show you today.
Two weeks ago Intel introduced its new series of CPUs for desktop computers, which early in development went by the acronym NGMA, Next Generation Micro Architecture, and later became known under the more famous code name Conroe. To further mark the start of a new era, the, to say the least, well-known name Pentium has been retired in favor of the new ”Core 2 Duo”. Today we will take a closer look at the architecture behind the name Conroe, the different flavors it comes in, performance comparisons with its competitors and, last but certainly not least, overclocking.
We start by having a look at how things were a couple of years back.
People who are new to the computer scene of today might think it is obvious that Intel has gone for high frequencies while AMD has focused on more performance per clock cycle, with the different sacrifices these approaches bring. It hasn’t always been like this, actually far from it, and to get a better understanding of how the companies reasoned when they created their architectures, we will do a quick historical flashback.
From P6 to NetBurst
About six years ago, at the beginning of the year 2000, information started to surface about Intel’s successor to the PIII processor, the P4. The PIII operated at around 800 MHz at the time and was based on Intel’s P6 architecture, which had made its entrance at 150 MHz. The engineers at both AMD and Intel worked frantically to refine their manufacturing processes to be able to increase clock frequencies. Clock frequency was of the utmost importance, as at that time it gave a fairly direct indication of a processor’s performance relative to the competition. In marketing terms, 1 GHz was a clear milestone, and the hysteria grew as both manufacturers launched products closer and closer to the 1 GHz mark. AMD was declared the victor, which we assume Intel wasn’t very happy about. Intel was meanwhile working on its new architecture, dubbed NetBurst, which would greatly increase clock frequencies.
With NetBurst, Intel had taken the frequency initiative very seriously; raising clock speed was the focus when the core was constructed. To make this increase in frequency possible, some areas had to be compromised, such as the length of the so-called pipeline. In layman’s terms, the pipeline can be thought of as a queue: every step in this queue prepares data and instructions for final execution. The P6 architecture used a pipeline of 10 steps, which Intel doubled for NetBurst at its launch. Thanks to refined manufacturing processes, Intel had succeeded in taking a P6 processor from 150 MHz to 1 GHz (and later 1.4 GHz with the Tualatin core). With this in mind, and with the changes made for NetBurst, Intel simply aimed to increase the frequency tenfold, to 10 GHz. NetBurst started its era at 1.4 GHz under the code name Willamette. In spite of the increased clock frequency, its performance was at best equivalent to a PIII at 1 GHz, and in several cases slower because of the longer pipeline.
The successor, Northwood, increased the pipeline by one step, to a total of 21 steps. Thanks to a finer manufacturing process (130nm), it could be introduced at 1.6 GHz and eventually scaled to 3.4 GHz. Everything seemed fine, and the architecture was starting to get pretty fast. At the time, Intel thought only a few more refinements to the manufacturing process stood between it and those really high frequencies. This proved easier said than done, and the problems surfaced already at the 90nm node. With the Prescott core, Intel hit serious physical issues in the shape of increased leakage. This leakage made the processor substantially warmer than its predecessor at the same clock frequency – the direct opposite of what previous process refinements had delivered. The NetBurst era had started to fall apart, and the planned Tejas core that was to take over at 4 GHz was simply scrapped.
From NetBurst and back to P6
While the NetBurst development team was working frantically to get the new manufacturing process under control, another team within Intel was given the mission to create a mobile processor with one requirement: extremely low power consumption. Adjustments and optimizations were made to the old P6 design to save energy, increase performance and raise frequencies. The code name of this Pentium M processor, Banias, is far from well known. Its further improved successor is better known, not least among enthusiasts, as Dothan. With the NetBurst architecture exhausted and constantly inferior to competing products, it was now time to create a new architecture, one that would inherit only minimal features from NetBurst. It was simply time to learn from the development of the mobile processors and change the direction and basic philosophy of the processors. Thus, Intel’s next-generation microarchitecture was born, officially dubbed Core.
Before we go deeper with the details of the architecture, we’ll take a look at the products being released by Intel based on the Core architecture.
Formerly, Intel had different CPU architectures for server, desktop and portable computers. In some cases the differences were marginal, while in the case of the Pentium M, the name Pentium was basically all it had in common with the Pentium 4. What Intel does now is start from a common CPU and then vary the amount of cache memory and the frequency between the market segments.
|Code name||Woodcrest (server)||Conroe (desktop)||Merom (mobile)|
|# of cores||2||2||2|
|Frequency range||1.6 – 3.0GHz||1.8 – 2.93GHz||1.66 – 2.33GHz|
Here it clearly shows how Intel is planning to attack the different market segments. For starters, we see that not much differs between the series: all CPUs have dual cores and 4 MB L2 cache, with a few exceptions among the smallest desktop and mobile flavors. Furthermore, Woodcrest has a few models with extra low heat output for where that is needed. Conroe will initially top out at a 1066FSB system bus, but a 1333FSB version of the Extreme Edition CPUs may follow. Merom, for the mobile platform, doesn’t have the same performance needs as its siblings and will make do with a 667FSB and lower clock frequencies, an obvious gain as far as power efficiency goes. In this article we will focus on the desktop CPUs, so let’s have a look.
As we can see, the slower models carry another code name, Allendale. The only thing separating them from Conroe is the amount of L2 cache, as the table shows. The high initial FSB, and thus low multipliers, will bring some challenges when it comes to overclocking these processors. Further ahead we will also see 800FSB versions of Allendale in the E4000 series.
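To see why the low multipliers make overclocking harder, consider the bus clock needed to reach a given target core clock (the 9x multiplier below is the E6600’s; the 3.6 GHz target is just an example):

```python
def required_fsb(target_mhz: float, multiplier: int) -> float:
    """FSB clock needed to hit a target core clock with a locked multiplier."""
    return target_mhz / multiplier

# Stock E6600: 9 x 266 MHz ~ 2.4 GHz. Reaching 3.6 GHz demands a 400 MHz
# bus clock (1600 MHz quad-pumped), a big ask of motherboard and chipset.
assert round(required_fsb(2400, 9)) == 267
assert required_fsb(3600, 9) == 400.0
```

The higher the stock FSB and the lower the multiplier, the more the motherboard has to deliver before the CPU itself becomes the limit.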
Now it is time to check what is beneath the heat plate.
More performance per clock cycle
It was now time to completely change the way of thinking. The earlier attitude was to simply build the processor for increased frequency, which, as we’ve pointed out, wasn’t an especially wise strategy. Luckily for us, there is more than one factor in the theoretical performance function: besides frequency scaling there is the number of Instructions Per Clock cycle, also known as IPC. Theoretical performance is simply the product of the two.
Not altogether illogically, doubling the number of instructions per clock cycle is theoretically equivalent to doubling the frequency. The problem is that it’s much more difficult to increase the IPC factor than the frequency factor, because you have to redesign the whole core, which is very costly when manufacturing silicon. Therefore the goal is to find a good compromise between IPC and the expected frequency scaling of the architecture.
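The relation can be sketched in a few lines of code; the figures below are purely illustrative, not Intel’s actual values:

```python
def theoretical_performance(frequency_ghz: float, ipc: float) -> float:
    """Theoretical throughput: performance = frequency x IPC
    (billions of instructions per second with GHz input)."""
    return frequency_ghz * ipc

# Hypothetical baseline: a 2.0 GHz core retiring 3 instructions per cycle.
baseline = theoretical_performance(2.0, 3.0)

# Doubling IPC is theoretically equivalent to doubling the frequency...
assert theoretical_performance(2.0, 6.0) == theoretical_performance(4.0, 3.0)
# ...and either one doubles the baseline performance.
assert theoretical_performance(2.0, 6.0) == 2 * baseline
```

The catch, as noted above, is that the IPC factor requires a core redesign, while the frequency factor "only" requires a better manufacturing process.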
Less power consumption
We also mentioned that Intel wants to stick to a unified processor architecture for three quite different platforms, each with different demands on the product. To be able to use the architecture in mobile platforms, Intel had to design it with power consumption in mind. The formula for power consumption in a microprocessor is given by three factors: frequency, core voltage squared and dynamic capacitance.
And again we see that the frequency factor comes into play, which is not surprising. A factor with even more influence is the voltage you feed the processor, V. Power consumption depends on the core voltage squared, so it’s important that it’s kept low. C describes the processor’s dynamic capacitance and is far from simple to quantify: it depends on how many transistors switch state each clock cycle, which varies with the kind of calculation being done and how the processor is currently being used. One thing is certain: this term has a close relation to the IPC factor in the performance expression. That tells us that we can’t get past the power consumption problem by increasing the IPC and lowering the frequency, which would have been a nice scenario.
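As a sketch of how the three factors interact (the numbers are made up for illustration, not measured values):

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    """Dynamic power: P = C x V^2 x f (arbitrary illustrative units)."""
    return capacitance * voltage ** 2 * frequency

base = dynamic_power(1.0, 1.3, 2.0)

# Because voltage enters squared, a 10% voltage reduction alone
# cuts dynamic power by ~19% (0.9^2 = 0.81).
lowered = dynamic_power(1.0, 1.3 * 0.9, 2.0)
assert abs(lowered / base - 0.81) < 1e-9
```

This is why keeping the core voltage low pays off disproportionately compared to the other factors.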
Performance / Watt
So Intel has set a new course: more performance relative to power consumption. When we look at the equation for performance/watt we see that the frequency factor is missing, which may look confusing at first. The reason is that both performance and power consumption depend linearly on frequency: if the frequency is increased, performance and power consumption rise in the same proportion, and the reverse is also true, so frequency cancels out of the ratio.
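The cancellation is easy to verify numerically; all values below are invented for illustration:

```python
def perf_per_watt(frequency: float, ipc: float,
                  capacitance: float, voltage: float) -> float:
    performance = frequency * ipc                    # performance = f x IPC
    power = capacitance * voltage ** 2 * frequency   # power = C x V^2 x f
    return performance / power                       # f cancels: IPC / (C x V^2)

# Same IPC, capacitance and voltage at two different frequencies
# give the same performance/watt.
a = perf_per_watt(2.0, 4.0, 1.0, 1.3)
b = perf_per_watt(3.0, 4.0, 1.0, 1.3)
assert abs(a - b) < 1e-12
```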
Since we’ve already concluded that Intel has given up on the frequency hysteria to achieve performance we see that the main challenge to develop the Core has been to increase the IPC value in the architecture. And how has this been done? Continue on!
We start out with a schematic of the previously mentioned architectures: P6, NetBurst and Core. It is basically only the P6 architecture that Intel has released full-scale information about; there is a lot of information available about NetBurst, and for understandable reasons very little detail about Core. The schematic of Core should be read as well-founded assessments by experts on the processor industry, combined with Intel’s own information about the alterations it has made relative to previous architectures.
You don’t have to be a microprocessor expert to guess the ancestor of Core, namely the old P6, which was the foundation of Intel’s processors up until the PIII models. The entire flow of data, from the Instruction Fetch to the data being prepared for calculation, constitutes the processor’s pipeline. Even though the picture may give the impression that the architectures have about the same number of steps, there is in fact a varying number of steps inside the blocks, which all in all adds up to 31 with NetBurst and only 14 with Core. The length of the pipeline is a very important part of a processor’s total performance, and here Intel has chosen to start from the much shorter pipeline of the P6, with only 10 steps. These have been expanded to 14 to allow further frequency scaling. Furthermore, Core has been designed as a dual-core processor from the ground up, and for a complete schematic of the structure this picture might be more suitable.
We’ve already realized that Core makes a considerable leap when it comes to the performance, but also takes a great step down when it comes to the power consumption thanks to the shorter pipeline. We move on and take a look at some of the big changes since the previous architecture.
Wide Dynamic Execution
Wide Dynamic Execution is a flashy description for the overall widening Intel has done with the architecture. Although there are a lot of differences between the previous architectures from both AMD and Intel, they have all been limited to handling 3 instructions at the same time. With Core, Intel has added another X86 decoder and thus expanded this capacity to 4. Theoretically this alone means 33% more performance.
To further improve performance, Intel has brought along something it introduced with the Pentium M processors: Micro-OP Fusion. This technology means that the processor can, when the opportunity arises, process two micro-instructions at the same time in the execution unit. To top it off, Core also brings Macro-OP Fusion. Macro-OP Fusion is applied before the decoders and can in some cases fuse two X86 instructions into one to further speed things up. Thanks to Macro-OP Fusion, Core is under ideal circumstances capable of decoding 5 X86 instructions per cycle, which in theory results in a great increase in performance.
Advanced Digital Media Boost
Advanced Digital Media Boost (ADMB) is perhaps the vaguest name among the innovations of the new architecture. Once again it is a matter of widening the execution unit so it doesn’t have to split instructions, thus speeding up calculations. ADMB is the name for a set of additions Intel has made in the form of new SSE instructions (SSSE3) and adjustments to the core for faster execution of these, and of previous generations of SSE. The connection to the name ADMB comes from the fact that several of these instructions are frequently used when encoding and decoding video and music. To improve their execution rate, Intel has made it possible for Core to execute a full 128-bit instruction per cycle, where previous generations had to split it into two 64-bit operations, which of course required twice the number of cycles. In the execution unit Intel has also added 2 SSE units, for a total of 3, compared to only 2 in AMD’s K8 architecture and only 1 in earlier generations of Intel processors.
To sum up what these two changes mean for performance, we can say that in general we will see an overall theoretical boost of 33%, and in the best-case scenario, during heavy SSE calculations, a six-fold increase compared to earlier generations. We move on and take a look at the structure of the cache.
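The 64-bit versus 128-bit datapath argument can be sketched as a simple cycle count (a simplification that ignores the extra SSE units and everything else in the pipeline):

```python
def issue_cycles(num_128bit_ops: int, datapath_bits: int) -> int:
    """Cycles needed to issue 128-bit SSE operations on a given datapath."""
    passes_per_op = 128 // datapath_bits  # pieces each op is split into
    return num_128bit_ops * passes_per_op

# A 64-bit datapath splits each 128-bit op in two, doubling the cycle count.
assert issue_cycles(1000, 64) == 2 * issue_cycles(1000, 128)
```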
Advanced Smart Cache
Core was designed as a dual-core processor from the start, which opened the door to a completely new implementation of the L2 cache. Previous dual-core processors from Intel have simply been two regular processor cores placed side by side. This is a very simple solution, but it comes with a set of drawbacks. It’s not uncommon for the cores to work with shared data, which in this case means the data has to be available in both processors’ L2 caches. With Intel’s older design, that data had to be copied and updated via the comparatively slow system bus. AMD uses a dedicated bus through which the two cores can communicate, but the problem is far from solved there either: data still has to be updated in each cache, and a substantial amount of traffic arises just to check which data has changed and which has not. Intel’s solution is called Advanced Smart Cache (ASC), which means the two cores share one large cache. This has several obvious advantages: data doesn’t have to be duplicated, so a larger portion of the cache is available to the cores, and no checking is needed to see whether data is up to date, since both cores have direct access to fresh data. When a single-threaded application is executed, the program sees a processor with the total amount of cache available – with Conroe, that is 4MB.
Smart Memory Access
There are not many areas where we see specific technologies taken from NetBurst and implemented in Core, but early in the pipeline Intel has learned from it. NetBurst was extremely dependent on good planning of calculations at an early stage, and Intel spent a considerable amount of time developing what it calls Prefetchers. The technology does what the name implies: it predicts what data the processor will request next. Together with the processor’s Branch Prediction Unit, a unit that statistically predicts jumps in the program code, it analyzes the streams of data in order to move the right data both from the system memory to the shared L2 cache and from the L2 cache to each core’s L1 cache. This means the system bus is used more efficiently and the processor gets faster access to the correct data.
The things we’ve gone through so far have meant improved performance, yet still with power consumption in mind. Intel has also made several improvements at the logic level to further lower the power consumption, which we will discuss on the next page.
Intelligent Power Capability
We have so far only been discussing the performance-improving elements of Core, but as we said before, Intel also had power consumption in mind when designing it. As we concluded earlier, it is of the utmost importance that the voltage is kept low, and to achieve this Intel has mainly relied on its well-calibrated 65nm manufacturing process and its more efficient transistors. Thanks to the wider architecture, Intel has also been able to lower the initial clock frequencies, which in turn leads to lower power consumption, and to work with a lower multiplier. Intel has also taken into consideration that a processor is designed in the form of blocks, as you can see in the picture below. Since it’s rare that all the logic in a processor is working constantly, Intel has given the processor the ability to turn off inactive parts to lower power consumption. Further improvements have also been made to EIST (Enhanced Intel SpeedStep Technology), so that during low workload the processor automatically switches to a 6x multiplier and thus reduces its frequency.
Intel has not shared any official information about how much each of the main alterations we’ve presented has contributed to the final performance, but has implied they contribute about a quarter each. For performance it is clearly the wider execution unit that is the main source, but to supply it with data you need smarter memory access, and for scalable performance with dual cores you need a shared L2 cache. Thus, all the changes depend on each other.
It is not often you see as thorough theoretical improvements as the ones Intel has made with the Core architecture. Intel claims an overall increase in performance of 40% for the desktop series’ processors, while it claims that power consumption has been reduced by the same amount, 40%, compared to previous generations of processors. These are some big claims, and it’s going to be interesting to see if the processors can really live up to them.
We start the testing phase by presenting our test system.
|Motherboard||Intel D975XBX ”BadAxe”||Asus M2N32-SLI Deluxe|
|Processor||Intel Core 2 Extreme X6800, Intel Core 2 Duo E6700 (emulated), Intel Core 2 Duo E6600 (emulated), Intel Pentium Extreme Edition 955||AMD Athlon64 FX-62 (AM2)|
|Memory||Corsair XMS2 6400 (2x1024MB)|
|Graphics card||nVidia GeForce 7950GX2|
|Power Supply Unit||OCZ PowerStream 520W|
|Operating system||Windows XP (SP2)|
|Chipset drivers||Intel Chipset Driver 188.8.131.523||nVidia nForce 9.35 (x16)|
|Graphics card drivers||nVidia ForceWare 91.31|
|Benchmark programs||SiSoft Sandra 2005 SR3, VirtualDub 1.6.10 + XviD 1.0.3, Need For Speed: Most Wanted|
Thanks to the unlocked multiplier of the X6800 processor, we have the possibility to simulate the performance of the E6700 and E6600 models, as apart from the frequency they share the same specifications. As a reference system we’ve used an FX-62, AMD’s fastest dual-core processor today, together with its AM2 platform. To make the comparison as fair as possible, both systems ran their memory at 400MHz (DDR2-800) with latencies of 4-4-4-12.
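Since the core clock is simply the bus clock times the multiplier, dialing the X6800’s unlocked multiplier down reproduces the lower models’ frequencies (values rounded; the actual bus clock is 266.67 MHz):

```python
def core_clock_mhz(fsb_mhz: int, multiplier: int) -> int:
    """Core clock = FSB clock x multiplier."""
    return fsb_mhz * multiplier

x6800 = core_clock_mhz(266, 11)  # ~2.93 GHz, stock
e6700 = core_clock_mhz(266, 10)  # ~2.66 GHz, emulated
e6600 = core_clock_mhz(266, 9)   # ~2.40 GHz, emulated
assert (x6800, e6700, e6600) == (2926, 2660, 2394)
```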
Let’s start with the synthetic tests.
Theoretical figures say rather little about how a processor performs in everyday use, but they still give a hint of the available performance. Apart from a victory for the 955XE processor in one of the tests, the X6800 and E6700 have no problem taking this round. AMD’s flagship has to work hard to keep up with the E6600 model, despite a 400MHz frequency advantage.
We move on to the more practical tests.
WinRAR has been gifted with multi-threading capabilities and Core is the undisputed champion, followed by the 955XE, which thanks to HyperThreading manages to defeat the FX-62. Cinebench goes just as smoothly for the X6800 and E6700. The FX-62 manages to place itself right after the E6600 model. The reason for the incredible increase in performance going from single-thread to multi-thread with the 955XE is spelled HyperThreading.
We move on and take a look at some regular media conversions.
The results really speak for themselves, and Core takes home the victory in all tests. We’ve used the same scales in the single- and multi-threaded diagrams to show how much of a difference the dual cores make, also in common non-synthetic applications.
Next are the not so unknown 3DMark and AquaMark.
Even though 3DMark, especially 3DMark2001, can almost be considered phased out, it still offers a hint of how the computing power of modern PCs has developed – and what a development it has been. Over 46000 points with a system running at default frequencies, without any software optimizations, is impressive to say the least and was completely unthinkable a few years back. Core is in a class of its own, where the FX-62 can only measure up to the E6600 in two of the tests.
Let’s take a look at the more specific details of the processor performance in 3DMark and how the processors perform with PCMark05.
The results are still heavily in favor of Core, and the E6600 continues to cover its own and its faster siblings’ backs most of the time.
We move on and test if the performance is maintained when playing games.
All game tests were performed with maximal image quality at a resolution of 1280×1024 pixels, thus a likely user scenario. Core continues to dominate and has no problem taking home all the victories. In Prey we get the same result on all systems, which most likely means that our graphics card, a 7950GX2, is the limiting factor. When we lowered the resolution to 1024×768, it was only the 955XE processor that lost ground by not being able to feed the graphics card. An important thing to point out is that if we were to increase the resolution even further and activate heavy image-quality-improving technologies such as antialiasing and anisotropic filtering, the graphics card would quickly become a bottleneck even though we’re using the fastest card on the market.
The performance is alright, but what about the heat that has always been the great Achilles’ heel of Intel’s previous generations?
We measured the system’s total power consumption with an ampere meter connected to the PSU; thus it is the entire system’s power draw that is measured. To load the processor during the load tests we used multiple instances of Prime95. The reason we haven’t published any figures for the FX-62 system is that we received abnormally high figures with the hardware we used, and as there are a lot of differences between the motherboards we chose to omit this figure. The 955XE processor was used with the exact same motherboard as the Core 2 Duo processors, with the exact same specifications on the system.
The 955XE is known as a very hot processor, and we can see that Core has a clear advantage here. The total system power draw is 20% higher with the 955XE processor in the exact same system, and we can see that the difference in practice is actually bigger than what the specifications claim. Whether it is the 955XE processor that exceeds its rated figure or the X6800 that actually consumes less is hard to say. If we had the possibility to measure the processor’s power consumption in isolation, the relative difference would be even larger.
We will conclude our experiences of Core on the next page.
Faithful readers know that we’ve already overclocked this processor quite a lot, which you can read about here. We will publish a separate article on overclocking Core 2 processors soon. In that article we will make it very clear what overclocking margin you can expect from the stock cooler, water cooling and more powerful solutions.
Pretty much everyone doubted the figures Intel presented six months ago, simply because they were too good to be true. With the results in hand we can see that they were in fact not. With its new Core architecture, Intel has taken a huge leap ahead by taking a step back, to one of its older and most successful architectures. The Core 2 Extreme X6800 and Core 2 Duo E6700 win all tests without a problem when compared to AMD’s FX-62, and the Core 2 Duo E6600 also performs very well in many tests. We can also see that the lead is evenly distributed over a large portion of the tests; the competing processor series from Intel and AMD display more consistent relative performance than against NetBurst, which on some occasions could really shine, while in others it was completely hopeless. When it comes to performance, Core 2 Extreme and Core 2 Duo are unchallenged.
With the basic idea of creating a single overall architecture for the three market segments – server, desktop and mobile – Intel had to create a power-efficient architecture from the start. A lot of this work began with the development of the Pentium M processors, and with Core Intel continued to develop these technologies. Compared to the previous desktop models, the processors have been granted more aggressive power-saving functions where the processor is downclocked during low load. At the logic level, the core has been given the ability to dynamically adjust itself to the workload and turn off parts that are not in use. Intel says it has lowered the power consumption by as much as 40% compared to similar models of its previous desktop architecture, which we have no reason to doubt after performing these tests. Between the new and previous top models we measured a difference of 50W with identical systems; with official figures of 80W and 130W for the respective top models, this seems to match our results quite well.
With improved performance and at the same time reduced power consumption, the gap grows even wider when we compare the competing processors. Not only is Intel’s top model 25% faster at, for example, compressing an AVI movie to DivX, it also consumes about 20% less power while doing so. In other words, the performance/power ratio of Intel’s processor is over 50% better here (1.25 / 0.80 ≈ 1.56). This isn’t limited to this test but can be seen in most of the tests we’ve performed.
Performance/Price and availability
The availability of the new processors is clearly limited, and there are reasons to believe that the large system builders will absorb a considerable amount of the processors Intel manages to produce. AMD has made some serious price cuts just recently to keep its processors attractive to consumers. This makes it very hard to evaluate how much performance you get per buck from the two manufacturers.
A hot topic during the spring has been which platforms Core 2 will work with. Theoretically, a large portion of Intel’s latest chipsets would be able to support Core 2, but the problem is not with the chipsets. Intel has chosen to update the specifications for the voltage regulation circuitry of the motherboards, which has made pretty much all current motherboards incompatible. The latest chipset, P965, has genuine support for Core 2, while older boards based on the i975X require physical updates to work, which in practice means new motherboards. We hope the manufacturers will be clear about which boards support the processors. Looking ahead, we’ve been informed that Intel’s coming processors with four cores, expected to arrive sometime during the fourth quarter, will work without a problem on motherboards that support Core 2.
What do you get with a heavily improved P6 architecture with speed optimizations from NetBurst? The answer is a 100% monster.
Intel Core 2 Extreme and Core 2 Duo
+ Best performing no matter the environment
+ Low power consumption and good power saving functions
+ Support for future Quad-core processors
– A bit uncertain motherboard support
Last, we would like to thank Intel for supplying the processor and motherboard.