I purchased a DRAM kit from G.Skill, the F4-3600C14Q-32GTZN. It uses Samsung B-die chips and has the lowest timings (14-15-15-35) of any DDR4-3600 kit, at 1.4 V. G.Skill has also released the F4-3800C14Q-32GTZN, which has a lower actual latency, but it requires 1.5 V, so there is little headroom left. I therefore think the F4-3600C14Q-32GTZN is the best memory for Zen2 at this time.
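The "actual latency" comparison between the two kits can be checked with simple arithmetic. The sketch below is illustrative: since DDR memory transfers data twice per clock, one clock cycle lasts 2000 / (transfer rate in MT/s) nanoseconds, and the true CAS latency in nanoseconds is CL multiplied by that cycle time.

```python
# True CAS latency (ns) from a DDR4 transfer rate (MT/s) and a CL value.
# DDR transfers twice per clock, so one clock cycle = 2000 / rate_mts ns.
def cas_latency_ns(rate_mts: int, cl: int) -> float:
    return 2000 * cl / rate_mts

# F4-3600C14 (DDR4-3600, CL14) vs. F4-3800C14 (DDR4-3800, CL14):
print(round(cas_latency_ns(3600, 14), 2))  # 7.78 ns
print(round(cas_latency_ns(3800, 14), 2))  # 7.37 ns
```

At the same CL, the DDR4-3800 kit is indeed faster in absolute time, which is why it can claim a lower actual latency despite identical timings.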
In this article, I will write about HBM2, which has been used in high-end GPUs, notably AMD's Radeon Vega series and NVIDIA's TITAN V. HBM (High Bandwidth Memory) is a stacked memory that achieves a very wide memory bus.
Two Ways to Increase Memory Bandwidth
There are two ways to increase memory bandwidth. One is raising the memory clock, as seen in the evolution of GDDR. The other is widening the bus, which is what stacked memories such as HBM aim for.
The memory clock and the memory bus width are related as shown below. Increasing memory bandwidth can be thought of as delivering more cargo to a destination. If the cargo is carried by trucks, raising the memory clock corresponds to making the trucks faster, while widening the bus corresponds to adding lanes to the road. Both methods therefore contribute to higher memory bandwidth.
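The truck analogy reduces to a simple product: peak bandwidth is the effective transfer rate times the bus width. The figures below are representative values chosen for illustration, not specifications of any particular card.

```python
# Peak memory bandwidth (GB/s) = transfer rate (GT/s) * bus width (bytes).
def bandwidth_gbs(rate_gts: float, bus_width_bits: int) -> float:
    return rate_gts * bus_width_bits / 8

# GDDR6-style: narrow bus, fast clock (256-bit at 14 GT/s):
print(bandwidth_gbs(14.0, 256))   # 448.0 GB/s
# HBM2-style: wide bus, slow clock (2 stacks x 1024-bit at 2 GT/s):
print(bandwidth_gbs(2.0, 2048))   # 512.0 GB/s
```

The two approaches reach similar totals by opposite routes, which is exactly the trade-off the article describes.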
The Structure of HBM
HBM has two main features. One is that multiple memory dies are stacked and connected by TSVs (Through-Silicon Vias); the other is that a sub-board called a silicon interposer is placed between the processor and the memory.
Advantages of HBM
With conventional packaging, trying to achieve a wide bus increases the physical distance between memory and processor, and as a result the operating voltage and power consumption rise. Stacked memory, on the other hand, saves mounting area (see below) and avoids these problems.
Also, a TSV spans a short connection distance, so it has lower resistance and is less susceptible to noise. Power consumption can therefore be reduced, waveform deterioration and signal delay can be suppressed, and high-speed operation becomes possible.
The silicon interposer is a substrate made of silicon, and it can reduce the operating voltage and power consumption thanks to silicon's electrical properties. In addition, silicon allows a large amount of wiring in a tight space, so a very wide bus can be routed directly between memory and processor (without bundling signals). It also reduces mounting area compared to wiring on an ordinary substrate.
The Serious Disadvantage of HBM
However, HBM has a fatal disadvantage: high cost. This is unavoidable as long as TSVs and the silicon interposer are used. Because of this problem, and because of the progress of GDDR, HBM2 is no longer used in consumer GPUs (AMD adopted HBM2 in the Vega series, but switched to GDDR6 in the following Navi series).
The best, and probably the last, consumer GPU equipped with HBM2 will be the Radeon VII (with 16 GB of VRAM). HBM2's very wide bus is attractive, but will it end up a relic?
This article introduces the types of memory layouts on motherboards, i.e., how the memory slots are wired to the memory controller in the CPU. The layout affects how far the memory clock can be raised, so actual memory behavior depends not only on the CPU and the memory but also on the motherboard (and their combination). Memory tuning is especially important when using Ryzen ( https://ocod.home.blog/2019/12/08/why-tune-ddr4-is-important-for-ryzen/ ), so if you attempt to overclock memory while keeping latency low, for example, you should consider the memory layout of the motherboard you will use.
Two Types of Memory Layouts
On ATX motherboards, it is currently standard to provide four memory slots and support dual-channel operation. Two slots are therefore connected to each memory controller channel, and there are two patterns for wiring those two slots.
One is called daisy chain (or fly-by) and the other is T-topology. The outline of each is shown below.
[Diagrams: daisy-chain (fly-by) layout and T-topology layout]
The most important difference between the two is the length of the transmission lines connecting the two memory slots of a channel to the memory controller. In a daisy chain, the two slots sit at different distances from the memory controller; in a T-topology, the distances are the same.
The Distance of Transmission Lines and the Memory Clock
How does the distance between slot and memory controller affect the memory clock? If the transmission-line lengths differ, as in a daisy chain, the characteristic impedance differs between the two slots, making it hard to suppress noise through impedance matching. In a T-topology, by contrast, the transmission lines to the two slots of a channel have the same length, so impedance matching is easy and noise and signal degradation occur less often. Consequently, if you populate both slots of a channel in a daisy chain (that is, use four modules in total), noise is more likely to exceed tolerable levels than in a T-topology, and the faster the memory clock, the more frequently noise occurs. In this sense, a T-topology is more likely to achieve a fast clock when four modules are used.
However, a T-topology has a disadvantage. Its transmission lines from the slots to the memory controller are physically longer than in a daisy chain, and long lines are unsuitable for high clock speeds. Unused transmission lines also become a source of noise. In this sense, the longer lines of a T-topology are a handicap when pursuing the fastest clocks.
Conclusion
In summary, for memory overclocking, a daisy chain has the advantage when running two modules, and a T-topology when running four. As the number of memory chips increases, it becomes harder to run at a fast clock. When overclocking memory, it is common to use only two single-rank modules, one per channel, so if you aim for the fastest clocks you should choose a motherboard with a daisy-chain memory layout.
Appendix: ITX Motherboards Are Best for Pursuing the Fastest Memory Clocks
ITX motherboards, which have only two memory slots, are best for chasing the highest memory clocks. This is because the transmission lines between the slots and the memory controller are short, as shown below, and there are no extra transmission lines to cause noise.
If you want to understand why DRAM matters on Ryzen, you should first understand the basics of the Zen architecture. The latest product uses the Zen2 architecture, so I will use Zen2 as the example.
The greatest feature of the Zen architecture is that it is modular, built by combining CCXs (Core Complexes) of four cores each. This makes it far easier than before to build processors with overwhelmingly many cores, and to prepare a wide product lineup, including enterprise products, according to the number of CCXs combined. CCXs are connected by a fabric called Infinity Fabric (consisting of the Scalable Data Fabric and the Scalable Control Fabric). This way of connecting multiple cores is very different from the ring bus that Intel has used so far.
Two Types of Dies in Zen2 Architecture
The Zen2 architecture announced at Computex 2019 has a distinctive structure in which the compute die and the I/O die are separated, each manufactured on a different process.
The compute die is a CCD (Core Complex Die) in which two CCXs manufactured on a 7 nm process are connected by Infinity Fabric, while the I/O die is manufactured on a 12 nm process. The 12 nm process was chosen for the I/O die because the component size of its analog parts cannot shrink, so a smaller process brings little benefit; because it is driven at various voltages (a smaller process is unsuitable for high voltages); and because manufacturing on the 7 nm process is expensive.
The connection between the compute die and the I/O die is shown below. Note that Infinity Fabric is used to connect the CCXs and the CCDs.
Why Is Tuning DDR4 Important for Ryzen?
Let’s move on to the main subject: why is tuning DDR4 important for Ryzen processors? The reason is, interestingly, that the operating clock of Infinity Fabric (fClk) is synchronized with the memory controller clock (uClk), and uClk follows the memory clock (mClk), so the higher the memory clock, the faster Infinity Fabric runs. This design, in which the processor's internal interconnect is clocked according to an external device, is not found in Intel's ring bus (so when using an Intel CPU, the memory clock matters relatively less).
However, the operating clock of Infinity Fabric is limited, so a higher memory clock does not always mean higher performance. AMD recommends setting mClk to 3733 (so uClk and fClk run at 1867), and uClk is halved if the memory clock is set higher (with a fortunate CPU, fClk 1900, i.e., mClk 3800, can be achieved). You can indeed choose a higher memory clock even if Infinity Fabric runs slower, but since an advantage of Ryzen is high performance from many cores, it is more reasonable to raise the interconnect speed between cores.
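The clock relationship described above can be sketched as a small function. This is a simplified model based on the behavior the article describes (1:1 operation up to the DDR4-3733 the article cites, with the memory controller clock halving above that); exact thresholds and fClk behavior vary by CPU and BIOS.

```python
# Simplified Zen2 clock model:
#   mClk = transfer rate / 2 (DDR transfers twice per clock)
#   uClk = mClk in 1:1 mode, mClk / 2 when the controller falls back to 2:1
#   fClk is assumed synchronized to uClk for best performance
def zen2_clocks(rate_mts: int, threshold_mts: int = 3733):
    mclk = rate_mts / 2
    uclk = mclk if rate_mts <= threshold_mts else mclk / 2
    fclk = uclk
    return mclk, uclk, fclk

print(zen2_clocks(3600))  # (1800.0, 1800.0, 1800.0) -- 1:1 mode
print(zen2_clocks(4400))  # (2200.0, 1100.0, 1100.0) -- 2:1 fallback
```

The model makes the trade-off concrete: DDR4-4400 raises mClk, but the fabric and controller end up slower than at DDR4-3600.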
As things stand, Ryzen has less CPU-overclocking headroom than the Intel Core i series. Therefore, memory tuning is indispensable for pursuing higher performance on Ryzen. Since the appropriate memory clock is already known, you need to reduce memory latency to get further improvement. Memory tuning, including latency reduction, is more complex than CPU overclocking (though the risk of damaging hardware is low), so you should proceed by trial and error while building a basic understanding of memory.
I couldn’t wait any longer, so I installed a Ryzen 9 3950X into my existing PC, partly because I also wanted to check whether mine had any defect. For the time being it can reach fClk 1900, but it is very unstable. I will work toward higher performance step by step while improving the environment.