Intel Penryn architecture

27 January, 2008

We’ve put Intel’s first 45nm processor in the test bench to see what the new Yorkfield core brings in terms of performance and overclocking potential. Extreme is the word…

Time really flies. Two years ago Intel launched its first 65nm desktop processor. 3.46GHz was the frequency and the power consumption was astonishingly high. Six months later, the Core architecture was uncovered, also made with 65nm technology. Core made a sharp U-turn when it came to power consumption and Intel showed the sceptics that it had complete control of the 65nm process. Now it had managed to gain an advantage both when it comes to power consumption and performance. This is the evolution of the products so far and now it’s time for the next step on the processor market, namely down-scaling of the process technology to 45nm. It’s basically the same Core architecture, but smaller and with a couple of adjustments and improvements. These changes have been gathered under the code-name Penryn.

One of the most significant hardware changes is the larger cache. Our previous reviews of the 65nm Core family have shown that cache size has a significant effect on performance. The logic has been expanded with a number of new instructions known as SSE4, most of which are meant to hardware-accelerate the compression of audio and video. Thanks to the new 45nm process, power leakage has been severely reduced, giving us a much cooler and more efficient processor. But what does refining the process really mean? How much performance does the larger cache really bring? In which scenarios do the extra instructions help the end user? And what do Penryn, Wolfdale, Yorkfield and the rest of the code names mean? These questions, and a lot more, will be answered in this technical article about Intel’s latest revision of the Core architecture.

We start from the top and investigate a number of code-names.

It can seem a bit confusing that Intel uses Penryn both as the name
of the overall development of the updated core and as the name of the
successor to the mobile CPU Merom. To sort things out before the rest of the
article, we have arranged all the names in the following table.

Intel’s 65nm and 45nm processors
	65nm		45nm
	Code name	Product name	Code name	Product name
Server	Clovertown	E5300	Harpertown	E5400
Server	Woodcrest	E5100	Wolfdale	E5200
Desktop	Kentsfield	Q6000	Yorkfield	Q9000
Desktop	Conroe	E6000 E4000	Wolfdale	E8000
Mobile	Merom	T7000 T5000	Penryn	T9000 T8000
Only a general summary; several models are missing and names may change

To the left we see Intel’s present 65nm models with their code names and
product names. To the right we see the equivalent models and names with the
updated core and the new 45nm technology. The names of these new models are not
100% definite yet and may change before the release. We take a closer look
at the 45nm processors, specifically the desktop models.

45nm desktop processors
Model	Codename	Cores	L2 cache	Frequency	Multiplier	FSB
QX9650	Yorkfield	4	2 x 6MB	3000MHz	9	333MHz
Q9550	Yorkfield	4	2 x 6MB	2833MHz	8.5	333MHz
Q9450	Yorkfield	4	2 x 6MB	2666MHz	8	333MHz
Q9300	Yorkfield	4	2 x 3MB	2500MHz	7.5	333MHz
E8500	Wolfdale	2	6MB	3166MHz	9.5	333MHz
E8400	Wolfdale	2	6MB	3000MHz	9	333MHz
E8300	Wolfdale	2	6MB	2833MHz	8.5	333MHz
E8200	Wolfdale	2	6MB	2666MHz	8	333MHz

First of all we can see that all models now run at 333MHz FSB, like
the latest models of the Core 2 Duo series. We also see that Intel has
introduced half multipliers to fit more models into the frequency range. The new
core has more L2 cache, up to 6MB from the previous maximum of 4MB. The amount
will most likely vary as new models are introduced. The entry-level quad-core
model has been cut to 2 x 3MB. Later in the article we
will discuss Intel’s strategy regarding the amount of cache memory
and its relevance to manufacturing cost. A closer look at the model
names reveals a few interesting points. The QX9650 is the fastest quad-core
processor, which leaves room for at least three faster models before Intel runs out
of model names (QX9750, QX9850, QX9950). There are QX9770 processors in circulation, which confirms our
suspicion that Yorkfield has frequency headroom. The same analysis of
the dual-core models also reveals a big margin for future models.
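The relationship between multiplier, FSB and core frequency in the table can be checked with a few lines of Python. This is only a sketch: it assumes that the nominal "333MHz" FSB is exactly 1000/3 MHz and that the table truncates frequencies to whole megahertz. The function and variable names are ours.

```python
import math

# Multipliers taken from the 45nm desktop table above.
MULTIPLIERS = {
    "QX9650": 9, "Q9550": 8.5, "Q9450": 8, "Q9300": 7.5,
    "E8500": 9.5, "E8400": 9, "E8300": 8.5, "E8200": 8,
}

def core_frequency_mhz(multiplier):
    """Core clock = multiplier x FSB, assuming an FSB of 1000/3 MHz."""
    return multiplier * 1000 / 3

for model, multiplier in MULTIPLIERS.items():
    # Truncate to whole MHz, as the table appears to do.
    print(model, math.floor(core_frequency_mhz(multiplier)), "MHz")
```

The half multipliers are what make the 2833MHz and 3166MHz steps possible without changing the FSB.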

We continue with a look at what the transition in process technology
means and discuss its hardware aspects.

There are a number of reasons for a manufacturer of integrated circuits
to migrate to a smaller process technology. They can mainly be summarized as
less silicon used, higher frequencies and lower power
consumption. First we should explain what 65nm and 45nm stand for. 45nm
means 45 nanometers, which is 0.000045mm, or roughly 1/22,000 of a
millimeter. This distance defines the shortest length of a gate in a transistor.
With this in mind, we’ll discuss the different benefits of reducing this
distance.

Silicon consumption

A direct and easy-to-understand
effect of shrinking the process technology is that every transistor
needs less space on the chip. Theoretically, a 45nm transistor is about 30% narrower
than a corresponding 65nm transistor. A transistor shrinks both in length
and in width, which means that a 45nm transistor needs less than half the
area of a 65nm transistor. One could wonder why a few nanometers more or less
makes any difference. The answer is that it makes a significant
difference. Once a processor is designed it will be made out of
silicon, and it’s not unrealistic that a certain type of processor will be made
in quantities of several hundred million. If the cost of silicon can be
halved, the manufacturing cost is more or less halved.
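The ideal scaling described above is simple arithmetic and can be sketched as follows. The sketch assumes a perfect shrink in which every dimension scales with the node name, which real processes do not achieve. The function names are ours.

```python
OLD_NODE_NM = 65
NEW_NODE_NM = 45

def linear_scale(old_nm, new_nm):
    """Ratio of new to old feature length under an ideal shrink."""
    return new_nm / old_nm

def area_scale(old_nm, new_nm):
    """Area shrinks in both length and width, so the ratio is squared."""
    return linear_scale(old_nm, new_nm) ** 2

linear = linear_scale(OLD_NODE_NM, NEW_NODE_NM)
area = area_scale(OLD_NODE_NM, NEW_NODE_NM)
print(f"Linear shrink: {1 - linear:.0%}")
print(f"Area per transistor: {area:.0%} of the 65nm size")
```

Under these assumptions the linear shrink comes out at about 31% and the area at about 48% of the original, which is where "less than half the area" comes from.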

Sadly, physical limitations make ideal scaling
impossible. Exact numbers are not official, but a good guess is a size
reduction of somewhere around 40%. Another important
aspect is what is called yield: how many working circuits you get from a silicon wafer. Since the wafer is round and a processor die is rectangular, you can’t use all the available space. The smaller the processor, the more processors fit inside the rim of the circular wafer.

The pictures above are sketches of a silicon wafer and how the rectangular processor cores are placed on them. The images are out of proportion but illustrate the manufacturing profit you can make with a new technology. The left picture shows only 8 whole processor cores while the right shows 30 whole cores. The rectangles are about 36% smaller, which means almost 4 times as many whole cores.
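The edge effect the sketches illustrate can be reproduced with a small counting routine. This is a simplified sketch: it assumes the die grid is aligned with the wafer center and ignores scribe lines and edge exclusion. The wafer radius and die sizes in the example are made-up illustration values, not Intel’s figures.

```python
import math

def count_whole_dies(wafer_radius, die_width, die_height):
    """Count rectangular dies that lie entirely inside a circular wafer.

    The grid is aligned with the wafer center (a simplifying assumption).
    """
    cols = math.ceil(wafer_radius / die_width)
    rows = math.ceil(wafer_radius / die_height)
    count = 0
    for i in range(-cols, cols):
        for j in range(-rows, rows):
            # The die spans [i, i+1] x [j, j+1] in grid units; it fits only
            # if its corner farthest from the center is inside the wafer.
            x = max(abs(i), abs(i + 1)) * die_width
            y = max(abs(j), abs(j + 1)) * die_height
            if x * x + y * y <= wafer_radius ** 2:
                count += 1
    return count

# Hypothetical example: 150mm wafer radius, two made-up square die sizes.
print(count_whole_dies(150, 20.0, 20.0))
print(count_whole_dies(150, 12.8, 12.8))
```

Smaller dies win twice: more of them fit per unit of area, and less area is wasted along the curved rim.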

To make it more realistic we introduce some defective spots. In both cases we have three defective cores, which means that there are only 5 fully working cores in the left picture, while in the right we get 27. These numbers are exaggerated to show the difference a smaller manufacturing technology makes, and to show what high-quality silicon and the size of the wafers mean for these companies.
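Expressed as yield, the same example looks like this. The sketch uses only the counts from the text (8 and 30 whole cores, three defects each) and assumes each defect ruins exactly one core. The function name is ours.

```python
def yield_fraction(whole_dies, defective_dies):
    """Share of whole dies that still work, assuming one defect kills one die."""
    return (whole_dies - defective_dies) / whole_dies

large_die_yield = yield_fraction(8, 3)   # left picture
small_die_yield = yield_fraction(30, 3)  # right picture
print(f"Large cores: {large_die_yield:.1%} working")
print(f"Small cores: {small_die_yield:.1%} working")
```

With the same number of defects, the wafer with small cores loses a much smaller share of its output.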

For the performance cores, Intel has chosen to use some of the size advantage to add more memory. To the left you see a highly magnified picture of the new Penryn core with 6MB L2 cache, and to the right the Conroe die with 4MB L2 cache. The L2 cache is the large area to the left on each die. Today it takes up a big portion of the processor’s total silicon surface. Even with 50% more cache, Penryn is still smaller than Conroe. Here is a comparison of different models and how much of the whole circuit the L2 cache occupies.

Silicon consumption
		Transistors [million]
Core	Process	Cache	Logic	Total	Normalized silicon surface
Conroe 4MB	65nm	133	160	293	100%
Allendale 2MB	65nm	66	160	226	77%
Allendale 1MB	65nm	33	160	193	66%
Wolfdale 6MB	45nm	200	210	410	67%
Wolfdale 3MB	45nm	100	210	310	57%
Wolfdale 1MB	45nm	33	210	243	44%

The silicon surface is normalized to Conroe, and the calculations assume that Penryn uses half of its area for cache. This gives an estimated transistor count for the logic, which we assume is the same for all 65nm circuits and, likewise, the same for all 45nm circuits. The reasoning assumes that three different models are made of every processor. In some cases it can be more economical to make fewer models and deactivate different amounts of cache afterwards. The fact is that Wolfdale occupies only 67% of Conroe’s surface, even with the larger cache. If we speculate further, a possible future 1MB model of Wolfdale would be less than half the size of Conroe.

These numbers will obviously affect
the price of the processors, but there are more benefits to a finer process technology.

Higher clock frequencies

A number of factors affect how fast a transistor can switch between on and off, which sets the actual limit for how high a clock frequency a processor can run at. Because the transistor gets physically smaller, it also requires less current to switch. Another important aspect is interference in the wires inside the processor. Smaller wires expose a smaller area to their surroundings, which in turn reduces the capacitive load. Reducing this interference also lessens the stress on the transistors, which leads to faster switching. This discussion assumes that the voltage stays the same between manufacturing processes, which is rarely the case. The voltage has a large impact on how high a clock frequency the processor can reach, but it also has an impact on power consumption. Manufacturers therefore often choose a compromise: a bit faster and a bit less power hungry.

Lower power consumption

As mentioned above, a finer manufacturing technology makes it possible to lower the processor’s voltage. Dynamic power consumption follows the relation P = C × V² × f, which connects the frequency (f), the voltage (V) and the dynamic capacitance (C) to the power consumption (P) of a processor. The dynamic capacitance depends partly on the manufacturing process and partly on the processor’s architecture, and it cannot be changed once the processor has been manufactured. The voltage enters the formula squared, which makes it by far the most important factor when it comes to reducing heat dissipation. If we cut the voltage in half, the power consumption drops by three quarters, while doubling the voltage quadruples it. Power consumption scales linearly with frequency, so doubling the frequency doubles the power consumption.
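The scaling rules in this paragraph can be tried out with a short sketch. It assumes the usual dynamic power relation P = C × V² × f, which matches the description in the text, and it works with ratios relative to a baseline rather than absolute watts. Leakage power is ignored. The function name is ours.

```python
def relative_dynamic_power(voltage_ratio=1.0, frequency_ratio=1.0,
                           capacitance_ratio=1.0):
    """Dynamic power relative to a baseline, assuming P = C * V^2 * f."""
    return capacitance_ratio * voltage_ratio ** 2 * frequency_ratio

print(relative_dynamic_power(voltage_ratio=0.5))    # half the voltage
print(relative_dynamic_power(voltage_ratio=2.0))    # double the voltage
print(relative_dynamic_power(frequency_ratio=2.0))  # double the frequency
```

Because the voltage term is squared, a small voltage reduction pays off more than the same relative reduction in frequency.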

This compromise between clock frequency and voltage has worked satisfactorily in earlier generations. If higher clock frequencies were needed, the voltage was increased, and if a low-power processor was needed, it was lowered. Nowadays the gap between performance processors and low-power processors is becoming very big, and Intel admits that it is looking more closely at varying the manufacturing process for different types of models, in order to tailor processors more specifically to their intended usage.

A great deal of the reduction in power consumption can be attributed to the new technologies used to manufacture the transistor’s gate. As a crash course in how a CMOS transistor works, you could say that the gate controls whether current passes from Drain (D) to Source (S). In other words, the gate is like a switch. What Intel has done is move from silicon in various doped forms to a combination of metal and dielectric. For competitive reasons it doesn’t reveal what type of metal is used, and nothing more than that the dielectric is based on the element hafnium. We will take a closer look at the power consumption of the processor later in the article and see if these changes have made any difference.

We move on and look at what has been added to the architecture.

Not much separates Penryn from the first generation of Core 2 architecture-wise, but Intel has taken the opportunity to make a few improvements. We present some of the most notable changes below.

Fast Radix-16 divider and Super Shuffle Engine

Fast Radix-16 divider is an improvement of the general division instruction. Earlier techniques worked on two bits at a time, while the processor can now work on 4 bits at a time, which results in twice the performance when dividing. A prominent advantage of improving basic instructions is that existing software which uses division benefits automatically, as opposed to special instructions, which require the program to be recompiled with an updated compiler or updated manually. Super Shuffle Engine is also a function that works at a lower level and brings overall performance increases. The Super Shuffle unit’s task is to rearrange bits in SSE instructions, which are typically used in image and video editing applications. This unit has now become wider to match the 128-bit SSE instructions. How these improvements affect performance can’t be answered in general terms, since it depends heavily on what types of calculations are being performed and what data is being processed.
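Why four bits per step halves the work can be shown with a digit-by-digit long division. This is a conceptual sketch only, not Intel’s divider: real hardware selects each quotient digit with dedicated logic, whereas the sketch simply lets Python’s integer division stand in for that step. The function name and the 64-bit width are our assumptions.

```python
def divide_radix(dividend, divisor, bits_per_step, width=64):
    """Long division that retires `bits_per_step` quotient bits per iteration.

    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0 or dividend < 0 or dividend >= 1 << width:
        raise ValueError("expects 0 <= dividend < 2**width and divisor > 0")
    if width % bits_per_step:
        raise ValueError("width must be a multiple of bits_per_step")
    mask = (1 << bits_per_step) - 1
    quotient = remainder = iterations = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & mask)
        # Stand-in for the hardware's quotient digit selection.
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        iterations += 1
    return quotient, remainder, iterations

print(divide_radix(1_000_003, 97, bits_per_step=2))  # radix-4: 2 bits per step
print(divide_radix(1_000_003, 97, bits_per_step=4))  # radix-16: 4 bits per step
```

Both calls return the same quotient and remainder, but the radix-16 version needs half as many iterations for the same operand width.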

Advanced Digital Media Boost

Advanced Digital Media Boost is really nothing more than another name for SSE, a set of instructions that perform advanced functions and calculate special algorithms. With the release of Penryn, new instructions have been added, which together form SSE4.1. Several of these are intended for processing media such as pictures, audio and video, hence the name Digital Media Boost. As mentioned above, programs must be manually optimized or recompiled to make use of these new instructions. In other words, older programs that don’t use them will not see any performance increase that can be attributed to SSE4.

We present the test system before we move on to the different benchmarks.


Test system
Hardware
Motherboard	Asus P5K-E
Processors	Intel Core 2 Extreme QX9650 (2x6MB), Intel Core 2 Extreme QX6850 (2x4MB)
Memory	Corsair 8500C5D Dominator (2048MB)
Graphics card	nVidia GeForce 8800GTX
Power supply	Silverstone Zeus 850W
Software
Operating system	Windows XP (SP2)
Drivers	Intel Chipset Driver 8.3.1.1009, nVidia Forceware 158.22
Benchmarks	EVEREST Ultimate Edition 4.20.1170, SuperPi 1.5, wPrime 1.52, Cinebench 9.5, Cinebench 10, Lame 3.97, WinRAR 3.70, 3DMark2001 3.3.0, 3DMark03 3.6.0, 3DMark05 1.2.0, 3DMark06 1.0.2, PCMark05 1.1.0, FarCry 1.33, Doom 3, Quake 4

We start by comparing the power consumption of the 65nm and 45nm chips.

To get a real-life view of the power consumption we performed some power tests. The measurements were done with an ammeter connected in series with the motherboard’s +12V rail, which supplies the processor with power. The voltage on the rail measured 12.04V. Note that the resulting figures show how much power the motherboard draws from this connector. That includes the heat dissipation of the processor as well as the losses in the voltage regulators caused by the voltage conversion. In addition, other components such as the northbridge may draw power from the same connector, which would also affect the results. All measurements were performed with high-quality equipment. As mentioned above, the supply voltage has a great effect on the power consumption of the processor. We set the CPU voltage to AUTO in the BIOS, which resulted in 1.280V for Kentsfield and 1.177V for Yorkfield. These values were measured under load after the processor temperatures had stabilized.

Power consumption – Idle
Processor	Voltage	Current	Power
Yorkfield	12.04V	1.305A	15.7W
Kentsfield	12.04V	3.156A	38.0W

Intel has been boasting about the low current leakage of the 45nm process. After double-checking our measurements we can only conclude that it has succeeded. At idle, Yorkfield consumes almost 60% less power than Kentsfield. About 15 percentage points of that come from the lower voltage.

Power consumption – Load
Processor	Voltage	Current	Power
Yorkfield	12.04V	5.16A	62.1W
Kentsfield	12.04V	8.61A	103.7W

We used wPrime and its 1024M calculation to create full load. Data was gathered when voltage, current and temperature had stabilized. Also during load, Yorkfield was significantly better than its predecessor. Once again this is mainly due to the more efficient manufacturing process, although part of it is due to the lower voltage.
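The figures in the two tables can be reproduced from the measured rail voltage and currents. The sketch below uses only the numbers given above. The estimate of how much the lower CPU voltage contributes assumes that power scales with the square of the voltage and that nothing else changes, which is a simplification. The names are ours.

```python
RAIL_VOLTAGE = 12.04  # measured on the +12V rail

# Measured currents in amperes, from the idle and load tables.
CURRENTS = {
    "idle": {"Yorkfield": 1.305, "Kentsfield": 3.156},
    "load": {"Yorkfield": 5.16, "Kentsfield": 8.61},
}

def power_watts(current):
    """Electrical power drawn from the rail: P = V * I."""
    return RAIL_VOLTAGE * current

def saving(new, old):
    """Fraction by which `new` is lower than `old`."""
    return 1 - new / old

for state, amps in CURRENTS.items():
    yorkfield = power_watts(amps["Yorkfield"])
    kentsfield = power_watts(amps["Kentsfield"])
    print(f"{state}: {yorkfield:.1f}W vs {kentsfield:.1f}W, "
          f"{saving(yorkfield, kentsfield):.0%} lower")

# Share explained by the CPU voltage alone (1.177V vs 1.280V), assuming P ~ V^2.
voltage_share = 1 - (1.177 / 1.280) ** 2
print(f"Lower voltage alone: {voltage_share:.0%}")
```

The results come out at roughly 59% lower at idle and 40% lower under load, with about 15% attributable to the voltage difference under the stated assumption.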

Time to see how Penryn performs in some regular benchmarks.

The new model has no problems getting ahead of the competition.

We move on to SuperPi and wPrime.

We know from previous experiences that SuperPi is very dependent on the size of the cache, and these results come as no surprise. wPrime is more reluctant to show a difference related to the cache or FSB, but the fine tuning of the architecture has given results.

Next we have Cinebench.

We have used the same scale for each program in order to show how much faster a quad-core CPU is compared to a single-core CPU. Regardless of whether the benchmark runs on one core or four, the QX9650 performs better than the QX6850.

Next, we test Lame and WinRAR.

Lame is a benchmark that’s picky when it comes to cache and FSB. The graphs clearly show that the QX9650 is the better performer in these kinds of benchmarks thanks to the improvements to the architecture. WinRAR, however, tends to be very sensitive to changes in cache and memory performance. The bigger cache has an undeniable effect, and the QX9650 performs almost 10% better than the QX6850.

Let’s take a closer look at Futuremark’s test suites.

3DMark 2001 reflects, to a great extent, an increased system performance and it is also in these series of tests that Yorkfield distances itself from Kentsfield. The other versions of 3DMark are more dependent on the graphics performance of the system and thus don’t show the same difference between the two CPUs.

We look more in detail at 3DMark06 and its CPU test and PCMark05.

Yorkfield performs about 5% better in the 3DMark06 CPU test. The overall difference is a lot bigger in PCMark05, where the gain is about the same in the CPU test and a few percent in the memory test.

Finally, we run a few game tests.

We have seen earlier that the amount of cache can have a big influence in games, and the QX9650 is definite proof of this. In our game tests we get a performance increase of between 4% and 7%, which to a large extent can be attributed to the larger cache.

Time to overclock!

The QX9650 will most likely not be the heart of some office PC used for Word, Excel and so forth. The people who are going to buy it, or already have, will overclock it. We’ve previously published articles where we’ve used exotic kinds of cooling, but here we have the whole spectrum of cooling for you. We start with some regular air tests.

We chose 3DMark06 and its CPU test for stress testing the processor during the benchmarks. With a not-so-healthy voltage of 1.700V we managed to reach a quite respectable 4.5GHz with air cooling, with all four cores fully loaded. We used a Thermalright Ultra 120 eXtreme together with a 120x38mm fan from Sunon at an ambient temperature of 18°C. With a more sensible voltage of 1.35V we looped Orthos and 3DMark06 for a whole day at 3.8GHz. This test was run with a much quieter 120x25mm fan, which was still capable of keeping the temperatures at bay. Below is a temperature log from the first two loops of 3DMark06. Orthos is the reason for the higher temperatures of cores 1 and 2.

When we were visiting Intel in Stockholm we decided to stick with 3DMark06; back home we decided to run some more tests with extreme cold. Here are the SuperPi 1M, 2M, 4M and 8M and wPrime 32M and 1024M results from when the processor was running at 5350MHz.

As we told you in our article about the Stockholm adventure, we ran through 3DMark06 stable as high as 5450MHz. The reason we haven’t used the same high frequency here is that the motherboard didn’t allow multipliers higher than 11, and the maximum voltage after some voltage modifications was 1.87V. There will be more about overclocking and Yorkfield in the future. We’ve also seen several Yorkfield processors 3D-stable at frequencies way beyond 5.8GHz.

Let’s sum up our experience with Penryn.

Performance

It’s always nice to see hardware that improves the performance of regular applications and games, not just synthetic benchmarks, which is definitely the case with Yorkfield. In all of the benchmarks the QX9650 outperforms the QX6850, and so far we haven’t found any scenario where Penryn is worse than Kentsfield. Considering the overclocking potential, there is room for more if Intel wants. As we mentioned earlier, there are concrete plans to launch even faster models soon, as we’ve already seen with the server processors.

Overclocking

The overclocking potential of the QX9650 is more than impressive, and judging by what we’ve been able to accomplish, plenty of people should be able to reach stable round-the-clock frequencies over 4GHz using nothing but air cooling. The processor scales well with additional voltage, and it’s easy to forget that you’re dealing with a more delicate manufacturing process that might be more sensitive to higher voltages in the long run. When the frequencies are closing in on 5GHz using nothing but air cooling, you expect similar gains from more exotic cooling, which doesn’t seem to be the case today. The overclocking potential and the temperature tolerance vary quite a lot, which isn’t very surprising considering the new manufacturing process. Intel has announced that a new stepping is on the way, which should stabilize things and improve overclocking even further.

Product value

The enthusiast models have a tendency to be rather expensive. The QX9650 is no exception, and the price/performance ratio is not an argument you will be hearing from any (intelligent) salesperson. At this point the QX9650 is listed at $1,100. As we mentioned in the beginning, there will be more affordable quad-core models in a few months.

Conclusion

The Intel Core 2 Extreme QX9650 delivers a welcome performance improvement in all of our benchmarks and, above all, in practical applications. Thanks to Intel’s new 45nm transistor technology, the power consumption in our measurements has been cut by roughly 40% under load and almost 60% at idle, compared to an equivalent model made with 65nm technology. The overclocking potential is impressive and enthusiasts will push it past 4.5GHz without much fuss, and with a new stepping 5GHz should be doable – with air cooling!

Intel Core 2 Extreme QX9650

Pros

+ Improved performance all around

+ Reduced power consumption
+ Overclocking potential

Cons

– Expensive

We want to thank Intel for sending us the sample.
