Zen - Microarchitectures - AMD - WikiChip


04/08/2019 Zen - Microarchitectures - AMD - WikiChip
https://en.wikichip.org/wiki/amd/microarchitectures/zen#All_Zen_Chips 1/30
From WikiChip
Zen µarch
General Info
  Arch Type: CPU
  Designer: AMD
  Manufacturer: GlobalFoundries
  Introduction: March 2, 2017
  Process: 14 nm
  Core Configs: 4, 6, 8, 12, 16, 24, 32
Pipeline
  Type: Superscalar
  OoOE: Yes
  Speculative: Yes
  Reg Renaming: Yes
  Stages: 19
  Decode: 4-way
Instructions
  ISA: x86-64
  Extensions: MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, RDRND, F16C, BMI, BMI2, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SHA, CLZERO
Cache
  L1I Cache: 64 KiB/core, 4-way set associative
  L1D Cache: 32 KiB/core, 8-way set associative
  L2 Cache: 512 KiB/core, 8-way set associative
  L3 Cache: 2 MiB/core, 16-way set associative
Cores
  Core Names: Naples, Whitehaven, Summit Ridge, Raven Ridge, Snowy Owl, Great Horned Owl, Banded Kestrel
Succession
  Predecessors: Excavator, Puma
  Successor: Zen+
Zen (family 17h) is the microarchitecture developed by AMD as a successor to both Excavator and Puma. Zen is an entirely new design, built from the ground up for
an optimal balance of performance and power, and capable of covering the entire computing spectrum from fanless notebooks to high-performance desktop computers.
Zen was officially launched on March 2, 2017. Zen is set to be gradually replaced by Zen+.
For performance desktop and mobile computing, Zen is branded as Ryzen 3, Ryzen 5, Ryzen 7 and Ryzen Threadripper processors. For servers, Zen is branded as
EPYC.
Contents
1 Etymology
2 Codenames
3 Brands
3.1 Identification
4 Release Dates
5 Process Technology
6 Compatibility
7 Compiler support
7.1 CPUID
8 Architecture
8.1 Key changes from Excavator
8.1.1 New instructions
8.2 Block Diagram
8.2.1 Client Configuration
8.2.1.1 Entire SoC Overview
8.2.1.2 Individual Core
8.2.2 Single/Multi-chip Packages
8.2.2.1 Single-die
8.2.2.2 2-die MCP
8.2.2.3 4-die MCP
8.2.2.3.1 4-die CCX configs
8.3 Memory Hierarchy
9 Core
9.1 Pipeline
9.1.1 Broad Overview
9.1.2 Front End
9.1.2.1 Fetching
9.1.2.2 µOP cache & x86 tax
9.1.2.3 Decode
9.1.2.3.1 MSROM
9.1.2.4 Optimizations
9.1.2.4.1 Stack Engine
9.1.2.4.2 µOP-Fusion
9.1.3 Execution Engine
9.1.3.1 Move elimination
9.1.3.2 Integer
9.1.3.3 Floating Point
9.1.4 Memory Subsystem
10 Infinity Fabric
11 Clock domains
12 Security
12.1 Secure Memory Encryption (SME)
12.2 Secure Encrypted Virtualization (SEV)
13 Power
13.1 System Management Unit
14 Features
14.1 Simultaneous MultiThreading (SMT)
14.2 SenseMI Technology
15 Scalability
15.1 CPU Complex (CCX)
15.2 Multiprocessors
15.2.1 Die-die memory latencies
15.3 Modules (Zeppelin)
16 Memory Modes
17 Accelerated Processing Units
17.1 Power
17.1.1 Enhanced power gating
18 Die
18.1 Core
18.2 CCX
18.3 Memory Controller
18.4 Zeppelin
18.5 APU
19 Sockets/Platform
20 All Zen Chips
21 References
22 Documents
22.1 Manuals
23 See also
Etymology
Zen was picked by Michael Clark, AMD's senior fellow and lead architect. Zen was picked to represent the balance needed between the various competing aspects of a
microprocessor - transistor allocation/die size, clock/frequency restriction, power limitations, and new instructions to implement.
Ryzen brand logo
Codenames
Core              C/T          Target
Naples            Up to 32/64  High-end server multiprocessors
Whitehaven        Up to 16/32  Enthusiasts market processors
Summit Ridge      Up to 8/16   Mainstream to high-end desktops
Raven Ridge       Up to 4/8    Mobile processors with Vega GPU
Snowy Owl         Up to 16/32  Embedded edge processors
Great Horned Owl  Up to 4/8    Embedded processors with Vega GPU
Banded Kestrel    Up to 2/4    Low-power/cost-sensitive embedded processors with Vega GPU
Brands
AMD Zen-based processor brands
Family              General Description               Cores  Differentiating features (Unlocked, AVX2, SMT, XFR, IGP, ECC, MP)
Mainstream
Ryzen 3             Entry level Performance           Quad   ✔ ✔ ✘ ✔ ✔/✘ ✔ ✘
Ryzen 5             Mid-range Performance             Quad   ✔ ✔ ✔ ✔ ✔/✘ ✔ ✘
Ryzen 5             Mid-range Performance             Hexa   ✔ ✔ ✔ ✔ ✘ ✔ ✘
Ryzen 7             High-end Performance              Octa   ✔ ✔ ✔ ✔ ✘ ✔ ✘
Enthusiasts / Workstations
Ryzen Threadripper  Enthusiasts                       8-16   ✔ ✔ ✔ ✔ ✘ ✔ ✘
Servers
EPYC                High-performance Server Processor 8-32   ✘ ✔ ✔ ✘ ✔ ✔
Embedded / Edge
EPYC Embedded       Embedded / Edge Server Processor  8-16   ✘ ✔ ✔/✘ ✘ ✔ ✘
Ryzen Embedded      Embedded APUs                     4      ✔ ✔/✘ ✔ ✔ ✘
Note: While a model has an unlocked multiplier, not all chipsets support overclocking. (see §Sockets)
Note: 'X' models will enjoy "Full XFR" providing an additional +100 MHz (200 for 1500X and Threadripper line) when sufficient thermo/electric requirements
are met. Non-X models are limited to just +50 MHz.
Identification
Example model names, split into their fields:
  Ryzen 7 1 7 00 X
  Ryzen 5 3 5 50 H
Reading each name from right to left, the fields are:
Power Segment
  (none): Standard Desktop
  U: Standard Mobile
  X: High Performance, with XFR
  WX: High Core Count Workstation
  G: Desktop + IGP
  E: Low-power Desktop
  GE: Low-power Desktop + IGP
  M: Low-power Mobile
  H: High-performance Mobile
Model Number
  Speed bump and/or differentiator for high core count chips (8+ cores).
Performance Level
  9: Extreme (Ryzen Threadripper)
  8: Highest (Ryzen 7)
  6-7: High (Ryzen 5 & 7)
  4-5: Mid (Ryzen 5)
  1-3: Low (Ryzen 3)
Generation
  1: First generation Zen (2017)
  2: First generation Zen (2017) for Mobile APUs; first generation Zen with enhanced node (2018)
  3: First generation Zen with enhanced node (2018) for Mobile APUs; second generation Zen (Zen 2) (2019)
Market Segment
  3: Low-end performance
  5: Mid-range performance
  7: Enthusiast / High-end performance
  Threadripper: High-end performance / Workstation
Brand Name
  Ryzen
[Figure: First 16-core HEDT market CPU]
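The naming scheme above is regular enough to parse mechanically. The following is a minimal Python sketch of such a parser, not an AMD tool: the function name `parse_ryzen_name`, the `RyzenModel` type, and the regular expression are all illustrative assumptions, and only the suffixes listed in the Power Segment legend are recognized.

```python
import re
from typing import NamedTuple


class RyzenModel(NamedTuple):
    """Fields of a Ryzen model name (hypothetical helper type)."""
    brand: str
    market_segment: str
    generation: int
    performance_level: int
    model_number: str
    power_segment: str


# Suffix meanings copied from the Power Segment legend in the text.
POWER_SEGMENTS = {
    "": "Standard Desktop",
    "U": "Standard Mobile",
    "X": "High Performance, with XFR",
    "WX": "High Core Count Workstation",
    "G": "Desktop + IGP",
    "E": "Low-power Desktop",
    "GE": "Low-power Desktop + IGP",
    "M": "Low-power Mobile",
    "H": "High-performance Mobile",
}

# Assumed layout: brand, market segment, then a four-digit number
# (generation, performance level, two-digit model number) and a suffix.
_NAME = re.compile(
    r"^(Ryzen)\s*(3|5|7|Threadripper)\s+(\d)(\d)(\d{2})([A-Z]{0,2})$"
)


def parse_ryzen_name(name: str) -> RyzenModel:
    """Split a name such as 'Ryzen 7 1700X' into its documented fields."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    brand, segment, generation, level, number, suffix = match.groups()
    if suffix not in POWER_SEGMENTS:
        raise ValueError(f"unknown power segment suffix: {suffix!r}")
    return RyzenModel(
        brand, segment, int(generation), int(level), number,
        POWER_SEGMENTS[suffix],
    )
```

For instance, `parse_ryzen_name("Ryzen 5 3550H")` yields generation 3, performance level 5, model number "50", and the "High-performance Mobile" power segment, matching the second example above.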
Release Dates
The first set of processors, part of the Ryzen 7 family, was introduced at an AMD event on February 22, 2017,
before the Game Developers Conference (GDC). However, the initial models did not ship until March 2. Ryzen 5
hexa-core and quad-core variants were released on April 11, 2017. Server processors were set to be released by the
end of Q2 2017. In October 2017, AMD launched mobile Zen-based processors featuring Vega GPUs.
Process Technology
See also: 14 nm process
Zen is manufactured on GlobalFoundries' 14 nm Low Power Plus (14LPP) process. AMD's previous microarchitectures were based on 32 and 28 nanometer processes.
The jump to 14 nm was part of AMD's attempt to remain competitive against Intel (both Skylake and Kaby Lake are also manufactured on 14 nm). The move to 14 nm
brings the usual benefits of a smaller node, such as reduced heat, reduced power consumption, and higher density for identical designs.
Compatibility
Linux added initial support for Zen starting with Linux Kernel 4.10. Microsoft will only support Windows 10 for Zen.
Vendor     OS       Version      Notes
Microsoft  Windows  Windows 7    No Support
Microsoft  Windows  Windows 8    No Support
Microsoft  Windows  Windows 10   Support
Linux      Linux    Kernel 4.10  Initial Support
Linux      Linux    Kernel 4.15  Full Support
Compiler support
With the release of Ryzen, AMD introduced their own compiler: AMD Optimizing C/C++ Compiler (AOCC). AOCC is an LLVM port especially modified to generate
optimized x86 code for the Zen microarchitecture.
Compiler Arch-Specific Arch-Favorable
AOCC -march=znver1 -mtune=znver1
GCC -march=znver1 -mtune=znver1
LLVM -march=znver1 -mtune=znver1
Visual Studio /arch:AVX2 ?
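The two columns differ in portability. In GCC and Clang, -march=znver1 lets the compiler emit instructions that only Zen-class processors are guaranteed to support, whereas -mtune=znver1 only tunes scheduling and leaves the instruction set at its default. A small Python sketch of choosing between them follows; the table data is copied from above, while the function name `zen_flags` and its `portable` parameter are hypothetical.

```python
from typing import Dict, List, Optional

# Flags copied from the compiler table above. The table lists "?" for
# Visual Studio's arch-favorable flag, so it is recorded here as None.
ZEN_FLAGS: Dict[str, Dict[str, Optional[str]]] = {
    "AOCC": {"specific": "-march=znver1", "favorable": "-mtune=znver1"},
    "GCC": {"specific": "-march=znver1", "favorable": "-mtune=znver1"},
    "LLVM": {"specific": "-march=znver1", "favorable": "-mtune=znver1"},
    "Visual Studio": {"specific": "/arch:AVX2", "favorable": None},
}


def zen_flags(compiler: str, portable: bool) -> List[str]:
    """Return the flag to pass for a Zen build.

    portable=True means the binary must still run on other x86-64
    processors, so only the arch-favorable (tuning) flag is used.
    """
    entry = ZEN_FLAGS[compiler]
    if portable:
        return [entry["favorable"]] if entry["favorable"] else []
    return [entry["specific"]]
```

A build script could then append `zen_flags("GCC", portable=False)` to its compiler arguments when targeting Zen machines only.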
CPUID
See also: AMD CPUIDs
Core                              Extended Family  Family  Extended Model  Model  Result
Naples, Whitehaven, Summit Ridge  0x8              0xF     0x0             0x1    Family 23 Model 1
Raven Ridge                       0x8              0xF     0x1             0x1    Family 23 Model 17
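The "Family 23" and "Model 17" figures follow from how AMD combines the base and extended CPUID fields: when the base family is 0xF, the extended family is added to it, and the extended model supplies the upper four bits of the model. The Python sketch below reproduces the table's values. The function names are illustrative, and the EAX values used with `decode_eax` are constructed from the table fields with the stepping assumed to be 0, not read from real hardware.

```python
from typing import Tuple


def display_family(base_family: int, extended_family: int) -> int:
    """Combine the CPUID family fields (AMD convention)."""
    if base_family == 0xF:
        return base_family + extended_family
    return base_family


def display_model(base_family: int, base_model: int,
                  extended_model: int) -> int:
    """Combine the CPUID model fields (AMD convention)."""
    if base_family == 0xF:
        return (extended_model << 4) | base_model
    return base_model


def decode_eax(eax: int) -> Tuple[int, int]:
    """Decode (family, model) from CPUID function 1 EAX.

    Field layout: stepping [3:0], base model [7:4], base family [11:8],
    extended model [19:16], extended family [27:20].
    """
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    extended_model = (eax >> 16) & 0xF
    extended_family = (eax >> 20) & 0xFF
    return (
        display_family(base_family, extended_family),
        display_model(base_family, base_model, extended_model),
    )
```

With the table's fields, 0xF + 0x8 gives family 23, and (0x1 << 4) | 0x1 gives model 17 for Raven Ridge.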
Architecture
AMD Zen is an entirely new design from the ground up which introduces a considerable number of improvements and design changes over Excavator. Mainstream
Zen-based microprocessors utilize AMD's Socket AM4 unified platform along with the Promontory chipset.
Key changes from Excavator
Zen was designed to succeed both Excavator (high-performance) and Puma (low-power), covering the entire range with one architecture
  Covers the entire spectrum from fanless notebooks to high-performance desktops
  More aggressive clock gating with multi-level regions
  Power focus from design, employs low-power design methodologies
    >15% switching capacitance (CAC) improvement
Utilizes 14 nm process (from 28 nm)
52% improvement in IPC per core for a single thread (AMD claim)
  From Piledriver to Zen
  Based on the industry-standardized SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4 GHz
Up to 3.7× performance/watt improvement
Return to conventional high-performance x86 design
  Traditional design for cores without shared blocks (e.g. shared SIMD units)
  Large beefier core design
Core engine
  Simultaneous Multithreading (SMT) support, 2 threads/core (see § Simultaneous MultiThreading for details)
  Branch Predictor
    Reduced branch mispredictions
      Better branch predictions with 2 branches per BTB entry
      Lower miss latency penalty
    BP is now decoupled from fetch stage
  Large μop cache (2K instructions)
  Wider μop dispatch (6, up from 4)
  Larger instruction scheduler
    Integer (84, up from 48)
    Floating Point (96, up from 60)
  Larger retire throughput (8, up from 4)
  Larger Retire Queue (192, up from 128)
    duplicated for each thread
  Larger Load Queue (72, up from 44)
  Larger Store Queue (44, up from 32)
    duplicated for each thread
  Quad-issue FPU (up from 3-issue)
  Faster Load to FPU (down to 7, from 9 cycles)
Cache system
  L1
    64 KiB (double the previous capacity of 32 KiB)
    Write-back L1 cache eviction policy (from write-through)
    2× the bandwidth
  L2
    2× the bandwidth
  Faster L2 cache
  Faster L3 cache
  Better L1$ and L2$ data prefetcher
  5× L3 bandwidth
  Move elimination block added
  Page Table Entry (PTE) Coalescing
New instructions
Zen introduced a number of new x86 instructions:
ADX - Multi-Precision Add-Carry Instruction extension
RDSEED - Hardware-based RNG
SMAP - Supervisor Mode Access Prevention
SHA - SHA extensions
CLFLUSHOPT - Flush Cache Line
XSAVE - Privileged Save/Restore
CLZERO - Zero-out Cache Line (AMD exclusive)
While not new, Zen also supports AVX, AVX2, FMA3, BMI1, BMI2, AES, RdRand, SMEP. Note that with Zen, AMD dropped support for XOP, TBM, and LWP.
Note: WikiChip's testing shows FMA4 still works despite not being officially supported and not even being reported by CPUID. This has also been confirmed by Agner here
(http://agner.org/optimize/blog/read.php?i=838). Those tests were not exhaustive; never use FMA4 instructions in production.
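Software should check for these extensions at run time rather than assume them. As a rough illustration, the Python sketch below scans text in the format of Linux's /proc/cpuinfo for the extensions Zen added. The mapping from extension names to Linux flag spellings (for example `sha_ni` for SHA) is an assumption based on common kernel naming, and the function names are hypothetical.

```python
from typing import List, Set

# Extension name (as listed above) -> assumed Linux /proc/cpuinfo flag.
ZEN_NEW_FLAGS = {
    "ADX": "adx",
    "RDSEED": "rdseed",
    "SMAP": "smap",
    "SHA": "sha_ni",
    "CLFLUSHOPT": "clflushopt",
    "XSAVE": "xsave",
    "CLZERO": "clzero",
}


def parse_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flag set of the first 'flags' line found."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_zen_extensions(cpuinfo_text: str) -> List[str]:
    """List the extensions from ZEN_NEW_FLAGS that are not reported."""
    flags = parse_flags(cpuinfo_text)
    return sorted(
        name for name, flag in ZEN_NEW_FLAGS.items() if flag not in flags
    )
```

On a Linux system the input would typically come from reading /proc/cpuinfo; an empty result means every listed extension is reported. FMA4 is deliberately absent from the mapping, since the text notes it is not reported by CPUID on Zen.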
Block Diagram
Client Configuration
Entire SoC Overview
Individual Core
Single/Multi-chip Packages
Single-die
[Figure: single-die package, as used in Summit Ridge]
2-die MCP
[Figure: 2-die MCP, as used for Threadripper]
4-die MCP
[Figure: 4-die MCP, as used for EPYC]
4-die CCX configs
[Figures: 32-core, 24-core, 16-core, and 8-core configurations]
Memory Hierarchy
Cache
  L0 µOP cache:
    2,048 µOPs, 8-way set associative
      32 sets, 8-µOP line size
    Parity protected
  L1I Cache:
    64 KiB, 4-way set associative
      256 sets, 64 B line size
      Shared by the two threads, per core
    Parity protected
  L1D Cache:
    32 KiB, 8-way set associative
      64 sets, 64 B line size
    Write-back policy
    4-5 cycles latency for Int
    7-8 cycles latency for FP
    SEC-DED ECC
  L2 Cache:
    512 KiB, 8-way set associative
      1,024 sets, 64 B line size
    Write-back policy
    Inclusive of L1
    Latency:
      17 cycles latency (ONLY Summit Ridge)
      12 cycles latency (all others)
    DEC-TED ECC
  L3 Cache:
    Victim cache
    Summit Ridge, Naples: 8 MiB/CCX, shared across all cores
    Raven Ridge: 4 MiB/CCX, shared across all cores
    16-way set associative
      8,192 sets, 64 B line size
    40 cycles latency
    DEC-TED ECC
System DRAM:
  2 channels per die
  Summit Ridge: up to PC4-21300U (DDR4-2666 UDIMM)
  Raven Ridge: up to PC4-23466U (DDR4-2933 UDIMM)
  Naples: up to PC4-21300L (DDR4-2666 RDIMM/LRDIMM)
  ECC support: x4 DRAM device failure correction (Chipkill), x8 SEC-DED ECC, Patrol and Demand scrubbing, Data poisoning
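The set counts and module names above can be cross-checked with two short formulas: the number of sets is the capacity divided by (ways × line size), and a DDR4 module's PC4 number is approximately its peak bandwidth in MB/s, that is, the transfer rate times the width of the data bus. The Python sketch below works through both. The function names are illustrative, and the 8-byte (64-bit) data bus per channel is an assumption about standard DDR4 channels rather than something stated in the text.

```python
KIB = 1024
MIB = 1024 * KIB


def cache_sets(capacity: int, ways: int, line_size: int = 64) -> int:
    """Number of sets = capacity / (ways * line size).

    Units only need to be consistent: bytes for the data caches,
    or µOPs for the µOP cache.
    """
    sets, remainder = divmod(capacity, ways * line_size)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


def ddr4_peak_mb_per_s(mega_transfers_per_s: float,
                       bus_bytes: int = 8) -> float:
    """Peak bandwidth of one channel in MB/s (assumes a 64-bit bus)."""
    return mega_transfers_per_s * bus_bytes


# Figures from the list above.
l1i_sets = cache_sets(64 * KIB, ways=4)                 # L1I
l1d_sets = cache_sets(32 * KIB, ways=8)                 # L1D
l2_sets = cache_sets(512 * KIB, ways=8)                 # L2
l3_sets = cache_sets(8 * MIB, ways=16)                  # L3, 8 MiB/CCX
uop_sets = cache_sets(2048, ways=8, line_size=8)        # µOP cache, in µOPs
```

The results agree with the listed 256, 64, 1,024, 8,192, and 32 sets. For memory, DDR4-2666 nominally runs at 8000/3 MT/s, so `ddr4_peak_mb_per_s(8000 / 3)` is about 21,333 MB/s, which the PC4-21300 name rounds down.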
Zen has a dedicated level-one TLB for the instruction cache and another for the data cache.
TLBs
  ITLB
    8-entry L0 TLB, all page sizes
    64-entry L1 TLB, all page sizes
    512-entry L2 TLB, no 1G pages
    Parity protected
  DTLB
    64-entry L1 TLB, all page sizes
    1,536-entry L2 TLB, no 1G pages
    Parity protected
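One way to read these entry counts is as "TLB reach": the amount of memory a TLB level can map at once, which is simply entries × page size. The Python sketch below computes it for the standard x86-64 page sizes (4 KiB, 2 MiB, 1 GiB). The helper name `tlb_reach` is hypothetical, and the calculation ignores any entry sharing or coalescing the hardware may perform.

```python
# Standard x86-64 page sizes in bytes.
PAGE_SIZES = {
    "4K": 4 * 1024,
    "2M": 2 * 1024 ** 2,
    "1G": 1024 ** 3,
}


def tlb_reach(entries: int, page_size: str) -> int:
    """Bytes of address space covered when every entry maps one page."""
    return entries * PAGE_SIZES[page_size]


# A 64-entry L1 TLB with 4 KiB pages covers 256 KiB ...
l1_reach_4k = tlb_reach(64, "4K")
# ... and 128 MiB when the same entries hold 2 MiB pages.
l1_reach_2m = tlb_reach(64, "2M")
# The 512-entry L2 ITLB covers 2 MiB with 4 KiB pages.
l2_itlb_reach_4k = tlb_reach(512, "4K")
```

The jump from 256 KiB to 128 MiB for the same 64 entries illustrates why large pages help workloads with big working sets.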
Core
Pipeline
Zen presents a major design departure from the previous couple of microarchitectures. To remain competitive against Intel, AMD took an approach similar to
Intel's: a large, beefier core with an SoC design that can scale from extremely low TDP (fanless devices) to supercomputers utilizing dozens of cores. As such,
Zen is aimed at replacing both Excavator (AMD's previous performance microarchitecture) and Puma (AMD's previous ultra-low power microarchitecture). In addition
to covering the entire computing spectrum through power efficiency and core scalability, another major design goal was a 40% uplift in single-thread performance
(i.e. a 40% IPC increase) over Excavator. The large increase in performance is the result of major redesigns across the core (the front end, the execution
engine, and the memory subsystem) as well as Zen's new modular SoC design built around the CCX (CPU Complex). The core itself is wider and bigger all around
(nearly every component had its capacity substantially increased). The improvement in power efficiency is the result of the 14 nm process as well as many
low-power design methodologies that were applied early in the design process (Excavator was manufactured on GF's 28 nm process). AMD also introduced various
components (such as its new prediction flow and forwarding mechanisms) that eliminate the need for operations to go through the high-power ALUs and decoders,
increasing overall power efficiency and throughput.
Broad Overview
While Zen is an entirely new design, AMD maintained its traditional design philosophy, which shows in design choices such as a split scheduler and separate FP
and integer/memory execution units. At a very broad view, Zen shares many similarities with its predecessor but introduces new elements and major changes. Each
core is composed of a front end (the in-order area) that fetches instructions, decodes them, generates µOPs and fused µOPs, and sends them to the Execution
Engine (the out-of-order section). Instructions are either fetched from the L1I$ or, on subsequent fetches, come from the µOP cache, which eliminates the
decoding stage altogether. Zen decodes 4 instructions/cycle into the µOP Queue. The µOP Queue dispatches separate µOPs to the Integer side and the FP side
(dispatching to both at the same time when possible).
The biggest departure from the previous generation is Zen's return to traditional core partitioning: every core is an independent core with its own floating-point/SIMD
units and an L2 cache. Previously, those units were shared between two cores; they are now once again completely private.
Unlike many of Intel's recent microarchitectures (such as Skylake and Kaby Lake), which make use of a unified scheduler, AMD continues to use a split pipeline design.
µOPs are decoupled at the µOP Queue and are sent through two distinct pipelines to either the Integer side or the FP side. The two sections are completely separate,
each featuring separate schedulers, queues, and execution units. The Integer side splits up the µOPs via a set of individual schedulers that feed the various ALUs.
On the floating point side, there is a different scheduler to handle the 128-bit FP operations. Zen supports all modern x86 extensions including AVX/AVX2,
BMI1/BMI2, and AES. Zen also supports SHA, the secure hash implementation instructions that are currently only found in Intel's ultra-low power microarchitectures
(e.g. Goldmont) but not in their mainstream processors.
From the memory subsystem's point of view, data is fed into the execution units from the L1D$ via the load
and store queues (both of which were almost doubled in capacity) using the two Address Generation Units
(AGUs), at a rate of up to 2 loads and 1 store per cycle. Each core also has a 512 KiB level 2 cache. The L2
feeds both the level 1 data and level 1 instruction caches at 32B per cycle (the bus is bidirectional, so 32B can
be sent in either direction each cycle). The L2 is connected to the L3 cache, which is shared across all cores.
As with the L1-to-L2 transfers, the L2 exchanges data with the L3 at 32B per cycle (32B in either
direction each cycle).
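To put the per-cycle widths above into absolute terms, peak bandwidth is bytes per cycle multiplied by the clock. The following is a minimal sketch; the 3.0 GHz core clock and the function name are assumptions chosen for illustration and are not figures from AMD.

```python
def peak_bandwidth_gb_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) of a link moving bytes_per_cycle every cycle."""
    return bytes_per_cycle * clock_ghz


# Assumed core clock for illustration only; the text gives widths, not clocks.
ASSUMED_CCLK_GHZ = 3.0

# 32B per cycle in one direction between L2 and L1 (and between L2 and L3).
l2_link_gb_s = peak_bandwidth_gb_s(32, ASSUMED_CCLK_GHZ)
print(f"L2 link, one direction: {l2_link_gb_s:.0f} GB/s at {ASSUMED_CCLK_GHZ} GHz")
```

These are upper bounds per direction; sustained bandwidth depends on the access pattern and on queue occupancy.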
Front End
The Front End of the Zen core deals with the in-order operations such as instruction fetch and instruction decode.
Instruction fetch is composed of two paths: a traditional decode path where instructions come from the
instruction cache, and a µOPs cache path. Which path is used is determined by the branch prediction (BP) unit. The
instruction stream and the branch prediction unit track instructions in 64B windows. Zen is AMD's first design to
feature a µOPs cache, a unit that not only improves performance but also saves power (the µOPs cache was first
introduced by Intel in their Sandy Bridge microarchitecture).
The branch prediction unit is decoupled and can start working as soon as it receives a desired operation such as a
redirect, ahead of traditional instruction fetches. AMD still uses a hashed perceptron system similar to the one used
in Jaguar and Bobcat, albeit likely much more finely tuned. AMD stated it's also larger than in previous architectures
but did not disclose actual sizes. Once the BP detects an indirect target operation, the branch is moved to the Indirect
Target Array (ITA), which is 512 entries deep. The BP includes a 32-entry return stack.
In Zen, AMD moved the instruction TLB into the BP (much earlier in the pipeline than in previous architectures). This
was done to allow for more aggressive prefetching by allowing the physical address to be retrieved at an earlier
stage. The BP is capable of storing 2 branches per BTB (Branch Target Buffer) entry, reducing the number of BTB reads necessary. The ITLB is composed of:
8-entry L0 TLB, all page sizes
64-entry L1 TLB, all page sizes
512-entry L2 TLB, no 1G pages
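The practical meaning of these entry counts is easier to see as "TLB reach", the amount of memory that can be mapped without a miss at each level. The sketch below assumes standard 4 KiB x86 pages for every entry; the helper name is hypothetical, and larger page sizes would scale the reach accordingly.

```python
KIB = 1024


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_size_bytes


# Entry counts from the list above; 4 KiB pages are an assumption for the example.
ITLB_ENTRIES = {"L0": 8, "L1": 64, "L2": 512}
PAGE_4K = 4 * KIB

reach_4k = {level: tlb_reach_bytes(n, PAGE_4K) for level, n in ITLB_ENTRIES.items()}
for level, reach in reach_4k.items():
    print(f"{level} ITLB reach with 4 KiB pages: {reach // KIB} KiB")
```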
Fetching
Instructions are fetched from the L2 cache at the rate of 32B/cycle. Zen features an asymmetric level 1 cache with a 64 KiB instruction cache, double the size of the L1
data cache. Depending on the branch prediction decision, instructions may be fetched from the instruction cache or from the µOPs cache, in which case the costly
instruction decoding is skipped.
On the traditional decode path, instructions are fetched from the L1I$ in 32B-aligned blocks each cycle and go to the
instruction byte buffer and through the pick stage to the decoders. Actual tests show the effective throughput is
generally much lower (around 16-20 bytes per cycle). This is slightly higher than the 16-byte fetch window in Intel's
Skylake. The size of the instruction byte buffer was not given by AMD but it's expected to be
larger than the 16-entry structure found in their previous architecture.
µOP cache & x86 tax
Decoding is the biggest weakness of x86, with decoders being one of the most expensive and complicated parts of
the entire microarchitecture. Instructions can vary from a single byte up to fifteen, and determining instruction
boundaries is a complex task in itself. The best way to avoid the x86 decoding tax is to not decode instructions at all.
Ideally, most instructions get a hit from the BP and acquire a µOP tag, so they are retrieved directly from the
µOP cache and sent to the µOP Queue. This bypasses most of the expensive fetching and decoding that
would otherwise be needed. This caching mechanism is also a considerable power-saving feature.
The µOP cache used in Zen is not a trace cache and much more closely resembles the one used by Intel in their
microarchitectures since Sandy Bridge. The µOP cache is an independent unit, not part of the L1I$, and is not
necessarily a subset of the L1I$ either; i.e., there are instances where there could be a hit in the µOP cache but a miss in the L1I$. This happens when an
instruction that was stored in the µOP cache gets evicted from the L1I$. During the fetch stage, probing must be done on both paths. Zen has a specific unit called
'Micro-Tags' which does the probing and determines whether the instruction should be accessed from the µOP cache or from the L1I$. The µOP cache itself has
dedicated $tags for accessing those µOPs.
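The path selection described above can be illustrated with a toy model. This is only a sketch of the idea that the two structures are probed independently and that the µOP cache need not be a subset of the L1I$; the set-based "tags", the function name, and the addresses are all hypothetical and do not reflect how the Micro-Tags unit is actually built.

```python
def select_fetch_path(fetch_addr: int, uop_cache_tags: set, l1i_tags: set) -> str:
    """Toy model: pick where the instructions at fetch_addr come from."""
    if fetch_addr in uop_cache_tags:
        return "uop_cache"      # already decoded: skip fetch and decode
    if fetch_addr in l1i_tags:
        return "l1i_decode"     # traditional path through the decoders
    return "l2_fill"            # miss in both: bring the line in from L2 first


# Hypothetical contents: 0x1000 was evicted from the L1I$ but is still in the µOP cache.
uop_tags = {0x1000, 0x2000}
l1i_tags = {0x2000, 0x3000}

for addr in (0x1000, 0x3000, 0x4000):
    print(hex(addr), select_fetch_path(addr, uop_tags, l1i_tags))
```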
Decode
Because Zen executes x86, some instructions actually comprise multiple operations. Some of those operations cannot be realized efficiently in an OoOE design and
therefore must be converted into simpler operations. In the front end, complex x86 instructions are broken down into simpler fixed-length operations called
macro-operations or MOPs (sometimes also called complex OPs or COPs). Those are often mistaken for being "RISCish" in nature but they retain their CISC characteristics.
MOPs can perform both an arithmetic operation and a memory operation (e.g. you can read, modify, and write in a single MOP). MOPs can be further cracked into
smaller, simpler fixed-length operations called micro-operations (µOPs). A µOP is a fixed-length operation that performs just a single operation (i.e., only a single
load, store, or arithmetic operation). Traditionally AMD used to distinguish between the two kinds of ops; with Zen, however, AMD simply refers to everything as µOPs,
although internally they are still two separate concepts.
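The MOP/µOP distinction can be made concrete with a small model. The sketch below is illustrative only: the `MacroOp` fields and the `crack` function are hypothetical names, and the real internal encodings are not public. It shows a read-modify-write MOP (as produced by something like `add [mem], reg`) being cracked into single-purpose µOPs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MacroOp:
    """Toy MOP: may combine a load, one ALU operation, and a store."""
    alu: Optional[str]   # e.g. "add", or None for a pure memory op
    loads: bool
    stores: bool


def crack(mop: MacroOp) -> List[str]:
    """Crack a MOP into µOPs that each do exactly one thing."""
    uops = []
    if mop.loads:
        uops.append("load")
    if mop.alu is not None:
        uops.append(mop.alu)
    if mop.stores:
        uops.append("store")
    return uops


rmw = MacroOp(alu="add", loads=True, stores=True)   # read, modify, write
reg_only = MacroOp(alu="add", loads=False, stores=False)
print(crack(rmw), crack(reg_only))
```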
Decoding is done by the 4 Zen decoders. The decode stage allows for four x86 instructions to
be decoded per cycle, which are in turn sent to the µOP Queue. Previously, in the
Bulldozer/Jaguar-based designs, AMD had two paths: a FastPath Single which emitted a
single MOP and a FastPath Double which emitted two MOPs, which are in turn sent down the
pipe to the schedulers. Michael Clark (Zen's lead architect) noted that Zen has significantly
denser MOPs, meaning almost all instructions will be a FastPath Single (i.e., one-to-one
transformations). What would normally get broken down into two MOPs in Bulldozer is now
translated into a single dense MOP. It's for those reasons that while up to 8 MOPs/cycle can be
emitted, usually only 4 MOPs/cycle are emitted from the decoders.
Dispatch is capable of sending up to 6 µOPs to the Integer EX and up to 4 µOPs to the Floating
Point (FP) EX. Zen can dispatch to both at the same time (i.e., for a maximum of 10 µOPs per
cycle); however, since the retire control unit (RCU) can only handle up to 6 MOPs/cycle, the
effective number of dispatched µOPs is likely lower.
MSROM
A third path that may occasionally be reached is the Microcode Sequencer (MS) ROM.
Instructions that end up emitting more than two macro-ops will be redirected to microcode
ROM. When this happens the OP Queue is stalled (possibly along with the decoders) and the
MSROM gets to emit its MOPs.
Optimizations
A number of optimization opportunities are exploited at this stage.
Stack Engine
At the decode stage Zen incorporates the Stack Engine Memfile (SEM). Note that while AMD refers to SEM as a new unit, they have had a Stack Engine in their
designs since K10. The Memfile sits between the queue and dispatch, monitoring the MOP traffic. The Memfile is capable of performing store-to-load forwarding right
at dispatch for loads that trail behind known stores with physical addresses. Other things such as eliminating stack PUSH/POP operations are also done at this stage, so
they are effectively zero-latency instructions; subsequent instructions that rely on the stack pointer are not delayed. This is a fairly effective low-power solution that
off-loads some of the work that would otherwise be done by the AGUs.
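The general idea behind a stack engine can be sketched as follows: the front end keeps a running offset from the stack pointer, so PUSH and POP turn into plain stores and loads at a known displacement and no separate stack-pointer update has to execute. This is a generic illustration, not AMD's implementation; the function name and the tuple format are hypothetical, and 8-byte slots assume 64-bit mode.

```python
def rewrite_stack_ops(instrs):
    """Rewrite push/pop into displacement-addressed stores/loads.

    instrs: list of (mnemonic, register) tuples.
    Returns (rewritten ops, net RSP delta tracked by the front end).
    """
    delta = 0          # offset from the architectural RSP value at the start
    out = []
    for op, reg in instrs:
        if op == "push":
            delta -= 8                                   # slot below current top
            out.append(("store", f"[rsp{delta:+d}]", reg))
        elif op == "pop":
            out.append(("load", reg, f"[rsp{delta:+d}]"))
            delta += 8                                   # slot is released
        else:
            out.append((op, reg))                        # other ops pass through
    return out, delta


ops, net_delta = rewrite_stack_ops([("push", "rax"), ("push", "rbx"), ("pop", "rcx")])
print(ops, net_delta)
```

Note that in this example the load for `pop rcx` reads the same slot that `push rbx` just wrote, which is the kind of pair that store-to-load forwarding at dispatch can satisfy without going through the cache.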
µOP-Fusion
At this stage of the pipeline, Zen performs additional optimizations such as micro-op fusion or branch fusion, an operation where a comparison and a branch op get
combined into a single µOP (resulting in a single schedule and a single execute). An almost identical optimization is also performed by Intel's competing microarchitectures.
In Zen, CMP or TEST (no other ALU instructions qualify) immediately followed by a conditional jump can be fused into a single µOP. Note that non-RIP-relative
memory will not be fused. Up to two fused branch µOPs can be executed each cycle when not taken. When taken, only a single fused branch µOP can be executed each
cycle.
It's interesting to reiterate the fact that branch fusion is actually done by the dispatch stage instead of decode. This is a bit unusual because you'd normally perform
that operation in decode in order to reduce the number of internal instructions. In Zen, the decoders can still end up emitting two ops just to be fused together in the
dispatch stage. This change can likely be attributed to the various optimizations that came along with the introduction of the µOPs cache (which sits parallel to the
decoders in the pipeline). It also implies that the decoders are of a simple design intended to be further translated later on in the pipe, thereby being limited to a
number of key transformations such as instruction boundary detection (i.e., x86 instruction length and rearrangement).
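The pairing rule for branch fusion (CMP or TEST immediately followed by a conditional jump) can be sketched as a single pass over a stream of mnemonics. This is a simplified illustration under stated assumptions: it looks only at mnemonics, ignores the operand-form restrictions mentioned above, and treats every `j*` mnemonic other than `jmp` as a conditional jump. The function names are hypothetical.

```python
FUSIBLE_FIRST = {"cmp", "test"}   # per the text, no other ALU ops qualify


def is_conditional_jump(mnemonic: str) -> bool:
    """Simplification: any jcc-style mnemonic except the unconditional jmp."""
    return mnemonic.startswith("j") and mnemonic != "jmp"


def fuse_branches(ops):
    """Fuse each cmp/test that is immediately followed by a conditional jump."""
    fused = []
    i = 0
    while i < len(ops):
        if (i + 1 < len(ops)
                and ops[i] in FUSIBLE_FIRST
                and is_conditional_jump(ops[i + 1])):
            fused.append(f"{ops[i]}+{ops[i + 1]}")   # one µOP for the pair
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused


stream = ["add", "cmp", "jne", "test", "jz", "sub", "jnz", "cmp", "jmp"]
print(fuse_branches(stream))
```

In the example, `sub` followed by `jnz` stays unfused because only CMP and TEST qualify, and `cmp` followed by `jmp` stays unfused because the jump is unconditional.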
Execution Engine
As mentioned earlier, Zen returns to a fully partitioned core design with a private L2 cache and private
FP/SIMD units. Previously those units were shared resources spanning two cores. Zen's Execution Engine
(Back-End) is split into two major sections: integer & memory operations and floating point operations. The two
sections are decoupled with independent renaming, schedulers, queues, and execution units. Both the Integer and
FP sections have access to the Retire Queue, which is 192 entries deep and can retire 8 instructions per cycle
(independent of either Integer or FP). The wider-than-dispatch retire allows Zen to catch up and free
resources much quicker (previous architectures saw a bottleneck at this point in situations where an older op was
stalling, causing a reduction in performance due to retire needing to catch up to the front of the machine).
Because the two regions are entirely divided, a penalty of one cycle of latency is incurred for operands that
cross boundaries; for example, if an operand of an integer arithmetic µOP depends on the result of a
floating point µOP. This applies both ways. This is similar to the inter-Common Data Bus
exchanges in Intel's designs (e.g., Skylake), which incur a delay of 1 to 2 cycles when dependent operands
cross domains.
Move elimination
Move elimination is possible in both the Integer and FP domains; register moves are done internally by modifying the register mapping rather than through the execution of
a µOP. No execution unit resources are used in the process and such µOPs have zero latency. In WikiChip's tests, almost all move eliminations succeed, including
chained moves. An elimination will never occur for a move from a register to itself. Move elimination applies to both 32-bit and 64-bit integer registers as well as all 128-bit and
256-bit vector registers, but not to partial registers (e.g. 16-bit and 8-bit registers).
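A minimal sketch of the renaming trick behind move elimination: a register-to-register move just points the destination's architectural name at the source's physical register, so nothing has to execute. The class and method names below are hypothetical, and the bookkeeping real hardware needs (for example, tracking when a shared physical register can be freed) is deliberately omitted.

```python
class RenameTable:
    """Toy architectural-to-physical register map."""

    def __init__(self, arch_regs):
        self.mapping = {reg: idx for idx, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)

    def write(self, dst):
        """A result-producing µOP gets a fresh physical register."""
        self.mapping[dst] = self.next_phys
        self.next_phys += 1
        return self.mapping[dst]

    def move(self, dst, src):
        """Try to eliminate `mov dst, src`; returns True if no µOP must execute."""
        if dst == src:
            return False                  # per the text, self-moves are not eliminated
        self.mapping[dst] = self.mapping[src]   # alias only, zero latency
        return True


rt = RenameTable(["rax", "rbx", "rcx"])
rt.write("rax")                          # rax now holds a new value
first = rt.move("rbx", "rax")            # eliminated
chained = rt.move("rcx", "rbx")          # chained move, also eliminated
print(first, chained, rt.mapping)
```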
Integer
The Integer Execute can receive up to 6 µOPs/cycle from Dispatch, where they are mapped from logical registers to physical registers. Zen has a 168-entry physical 64-bit
integer register file, identical in size to that of Broadwell. Instead of a large unified scheduler, Zen has 6 distributed scheduling queues, each 14 entries deep (4xALU,
2xAGU). Zen includes a number of enhancements such as differential checkpoints for tracking branch instructions and eliminating redundant values, as well as move
elimination (register moves are done internally by modifying the register mapping rather than through the execution of a µOP). While AMD stated that the
ALUs are largely symmetric except for a number of exceptions, it's still unknown which operations are reserved to which units.
Generally, the four ALUs will execute four integer instructions per cycle. Simple operations can be done by any of the ALUs whereas the more expensive
multiplication and division operations can only be done by their respective ALUs (there is one of each). Additionally, two of Zen's ALUs are capable of performing a branch,
therefore Zen can peak at 2 branches per cycle. This only occurs if they are not taken. The two branch units can simultaneously execute two branch instructions from the
same thread or from two separate threads. If the branch is taken, Zen is restricted to only 1 branch per cycle. A similar restriction is found in Intel's
architectures such as Haswell. In Haswell, port 0 can only execute predicted "not-taken" branches whereas port 6 can perform both "taken" and "not taken". AMD's
motivation for adding a second branch unit is entirely different from Intel's, which had done the same in Haswell. The second branch unit in Haswell was
added largely in an effort to mitigate port contention. Prior to that change, code involving tight loops that performed SSE operations ended up fighting over the same
port, as both the SSE operation and the actual branch ended up being scheduled on the same port. Zen doesn't actually have this issue. The addition of a second branch
unit in its case serves purely to boost the performance of branch-heavy code.
The 2 AGUs can be used in conjunction with the ALUs. µOPs involving a memory operand make use of both an AGU and an ALU at the same time and are not split up.
Zen is capable of a read+write or a read+read operation in one cycle (see § Memory Subsystem).
Floating Point
The Floating Point side can receive up to 4 µOPs/cycle from Dispatch, where they are mapped from logical registers to physical registers. Zen has a 160-entry physical
128-bit floating point register file, just 8 entries shy of the size used in Intel's Skylake/Kaby Lake architectures. The register file can perform direct transfers to the Integer
register file as needed.
Before ops go to the scheduling queue, they go through the Non-Scheduling Queue (NSQ) first which is
essentially a wait buffer. Because FP instructions typically have higher latency, they can create a back-up at
Dispatch. The non-scheduling queue attempts to reduce this by queuing more FP instructions which lets
Dispatch continue on as much as possible on the Integer side. Additionally, the NSQ can go ahead and start
working on the memory components of the FP instructions so that they can be ready once they go through the
Scheduling Queue. From the schedulers, the instructions are sent to be executed. The FP scheduler has four
pipes (1 more than that of Excavator) with execution units that operate on 128-bit floating point.
The FP deals with all vector operations. The simple integer vector operations (e.g. shift, add) can all be done
in one cycle, half the latency of AMD's previous architecture. Basic floating point math, including multiplication,
has a latency of three cycles (one additional cycle for double precision). Fused multiply-add operations take five
cycles.
The FP has a single pipe for 128-bit load operations. In fact, the entire FP side is optimized for 128-bit
operations. Zen supports all the latest instructions such as SSE and AVX1/2. 256-bit AVX was designed so that its operations can be carried out as two
independent 128-bit operations. Zen takes advantage of that by operating on those instructions as two operations; i.e., Zen splits up 256-bit operations into two µOPs, so
they have effectively half the throughput of their 128-bit counterparts. Likewise, stores are also done in 128-bit chunks, giving 256-bit stores an
effective throughput of one store every two cycles. The pipes are fairly well balanced, therefore most operations will have at least two pipes to be scheduled on,
retaining a throughput of at least one such instruction each cycle. As this implies, 256-bit operations use up twice the resources to complete (i.e., 2x register,
scheduler, and ports). This is a compromise AMD has made which helps conserve die space and power. By contrast, Intel's competing design, Skylake, does have
dedicated 256-bit circuitry. It's also worth noting that Intel's contemporary server-class models have extended this further to incorporate dedicated 512-bit circuitry
supporting AVX-512, with the highest performance models having a whole second dedicated AVX-512 unit.
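The splitting of a 256-bit operation into two 128-bit µOPs can be illustrated with a lane-wise integer add. This is a functional sketch only, with hypothetical function names; it models eight 32-bit lanes as Python lists, covers only operations whose two 128-bit halves are independent, and says nothing about how Zen schedules the two halves.

```python
MASK32 = 0xFFFFFFFF


def add_128(a, b):
    """One 128-bit µOP: four independent, wrapping 32-bit lane additions."""
    assert len(a) == len(b) == 4
    return [(x + y) & MASK32 for x, y in zip(a, b)]


def add_256_as_two_uops(a, b):
    """A 256-bit add over eight 32-bit lanes executed as two 128-bit µOPs."""
    assert len(a) == len(b) == 8
    low = add_128(a[:4], b[:4])     # first µOP: lower 128 bits
    high = add_128(a[4:], b[4:])    # second µOP: upper 128 bits
    return low + high


def uops_needed(vector_bits: int, datapath_bits: int = 128) -> int:
    """Number of datapath-wide µOPs needed for one vector operation."""
    return -(-vector_bits // datapath_bits)   # ceiling division


result = add_256_as_two_uops(list(range(8)), [10] * 8)
print(result, uops_needed(256), uops_needed(128))
```

Because each 256-bit instruction occupies two µOP slots, a pipe that sustains one 128-bit operation per cycle sustains one 256-bit operation every two cycles, which is the halved throughput described above.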
Additionally, Zen supports SHA and AES, with 2 AES units implemented in an attempt to improve encryption performance. Those units can be found on pipes 0
and 1 of the floating point scheduler.
Memory Subsystem
Loads and stores are conducted via the two AGUs, which can operate simultaneously. Zen has a much larger load
queue capable of supporting 72 out-of-order loads (same as Intel's Skylake). There is also a 44-entry Store Queue.
Zen employs a split TLB-data pipe design which allows TLB tag access to take place while the data cache is being
accessed, in order to determine if the data is available and to send the address to the L2 to start prefetching early on. Zen is
capable of up to two loads per cycle (2x16B) and up to one store per cycle (1x16B). The L1 TLB has 64 entries for
all page sizes and the L2 TLB has 1536 entries with no 1 GiB pages.
Zen incorporates a 64 KiB 4-way set associative L1 instruction cache and a 32 KiB 8-way set associative L1 data
cache. Both the instruction cache and the data cache can fetch from the L2 cache at 32 bytes per cycle. The L2 cache
is a 512 KiB 8-way set associative unified cache, inclusive, and private to the core. The L2 cache can fetch from and write
to the 8 MiB L3 cache at 32B/cycle (32B in either direction each cycle, i.e., a bidirectional bus).
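The cache organizations above translate into set counts through the usual relation sets = size / (ways x line size). The sketch below assumes a 64-byte cache line, which is not stated in this section; the function name is hypothetical.

```python
KIB = 1024
LINE_BYTES = 64   # assumed line size; not given in the text above


def num_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Number of sets in a set-associative cache."""
    assert size_bytes % (ways * line_bytes) == 0, "geometry must divide evenly"
    return size_bytes // (ways * line_bytes)


# (size, associativity) pairs taken from the paragraph above.
caches = {
    "L1I": (64 * KIB, 4),
    "L1D": (32 * KIB, 8),
    "L2": (512 * KIB, 8),
}
sets = {name: num_sets(size, ways) for name, (size, ways) in caches.items()}
print(sets)
```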
Infinity Fabric
Main article: AMD's Infinity Fabric
The Infinity Fabric (IF) is a system of transmissions and controls that underpins the entire Zen microarchitecture, any graphics microarchitecture (e.g. Vega), and any
other accelerators AMD might add in the future. Consisting of two separate fabrics, one for control signals and a second for data transmission, the Infinity
Fabric is the primary means by which data flows from one core to another, across CCXs and chips, to any graphics unit, and from any I/O (e.g. USB).
Clock domains
Zen is divided into a number of clock domains, each operating at a certain frequency:
UClk - UMC Clock - The frequency at which the Unified Memory Controller (UMC) operates. This frequency is identical to MemClk.
LClk - Link Clock - The clock at which the I/O Hub Controller communicates with the chip.
FClk - Fabric Clock - The clock at which the data fabric operates. This frequency is identical to MemClk.
MemClk - Memory Clock - Internal and external memory clock.
CClk - Core Clock - The frequency at which the CPU core and the caches operate (i.e., the advertised frequency).
For example, a stock Ryzen 7 1700 with 2400 MT/s DRAM will have a CClk = 3000 MHz, MemClk = FClk = UClk = 1200 MHz.
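The relationship in this example can be written out directly: DDR memory performs two transfers per clock, so MemClk is half the MT/s rating, and FClk and UClk follow MemClk as listed above. The function name below is hypothetical, and LClk is left out because the text gives no value for it.

```python
def zen_clock_domains(dram_mt_s: float, cclk_mhz: float) -> dict:
    """Derive the Zen clock domains (in MHz) tied to the DRAM speed."""
    memclk = dram_mt_s / 2          # double data rate: 2 transfers per clock
    return {
        "CClk": cclk_mhz,           # core and caches
        "MemClk": memclk,
        "FClk": memclk,             # data fabric runs at MemClk
        "UClk": memclk,             # memory controller runs at MemClk
    }


# Values from the Ryzen 7 1700 example above.
clocks = zen_clock_domains(dram_mt_s=2400, cclk_mhz=3000)
print(clocks)
```

One consequence of this coupling is that faster DRAM also raises the fabric clock, since FClk is tied to MemClk.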
Security
AMD incorporated a number of new security technologies into their server-class Zen processors (e.g., EPYC). The various security
features are offered via a new dedicated security subsystem which integrates an ARM Cortex-A5 core. The dedicated secure processor runs
a secure kernel, with the firmware stored externally (e.g., on an SPI ROM). The secure processor is responsible for
cryptographic functionality, including secure key generation and management, as well as hardware-validated boot.
                     SME                       SEV
Protection Per       Whole Machine             Individual VMs
Type of Protection   Physical Memory Attack    Physical Memory Attack; Vulnerable VM
Encryption Per       Native page table         Guest page table
Key Management       Key/Machine               Key/VM
Requires Driver      No                        Yes
Secure Memory Encryption (SME)
Main article: Secure Memory Encryption
Secure Memory Encryption (SME) is a new feature which offers full hardware memory encryption against physical memory attacks. A single key is used for the
encryption. An AES-128 encryption engine sits on the integrated memory controller, thereby offering real-time encryption on a per-page-table-entry basis. This works across
execution cores, network, storage, graphics, and any other I/O access that goes through DMA. SME incurs an additional latency tax only for encrypted pages.
AMD also supports Transparent SME (TSME) on their workstation-class PRO (Performance, Reliability, Opportunity) processors in addition to the server models.
TSME is a subset of SME limited to base encryption without OS/HV involvement, allowing for legacy OS/HV software support. In this mode, all memory is encrypted
regardless of the value of the C-bit on any particular page. When this mode is enabled, SME and SEV are not available.
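The per-page control works through the C-bit mentioned above: a single bit in each page table entry marks the page as encrypted. The sketch below only shows the bit manipulation. The bit position is implementation-specific and is reported by the processor through CPUID; bit 47 is used here purely as an assumed example value, and the helper names are hypothetical.

```python
ASSUMED_C_BIT = 47   # illustrative position only; real software must query CPUID


def mark_encrypted(pte: int, c_bit: int = ASSUMED_C_BIT) -> int:
    """Return the page table entry with the C-bit set (page is encrypted)."""
    return pte | (1 << c_bit)


def mark_unencrypted(pte: int, c_bit: int = ASSUMED_C_BIT) -> int:
    """Return the page table entry with the C-bit cleared."""
    return pte & ~(1 << c_bit)


def is_encrypted(pte: int, c_bit: int = ASSUMED_C_BIT) -> bool:
    """True if the C-bit is set in this page table entry."""
    return bool((pte >> c_bit) & 1)


pte = 0x0000_0000_1234_5003            # hypothetical entry: frame address plus flag bits
enc_pte = mark_encrypted(pte)
print(hex(enc_pte), is_encrypted(enc_pte), is_encrypted(pte))
```

Under TSME, as described above, this bit is ignored and every page is encrypted.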
Secure Encrypted Virtualization (SEV)
Main article: Secure Encrypted Virtualization
Secure Encrypted Virtualization (SEV) is a more specialized version of SME whereby individual keys can be used per hypervisor and per
VM, a cluster of VMs, or a container. This allows the hypervisor memory to be encrypted and cryptographically isolated from the guest
machines. Additionally, SEV-protected VMs can run alongside unencrypted VMs under the same hypervisor. All this functionality is integrated and works
with existing AMD-V technology.
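
The differences between TSME, SME, and SEV described above can be summarized in a small toy model. This is a minimal sketch and not a description of how the hardware or firmware is implemented: the `Mode` enum, the `key_for_access` function, and the key labels are hypothetical names introduced purely for illustration.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    TSME = "tsme"  # transparent: everything encrypted, no OS/HV involvement
    SME = "sme"    # one key per machine, selected per page via the C-bit
    SEV = "sev"    # one key per VM (plus the host key for the hypervisor)


HOST_KEY = "host-key"  # hypothetical label for the single machine-wide key


def key_for_access(mode: Mode, c_bit: bool, vm_id: Optional[int] = None) -> Optional[str]:
    """Return the label of the key protecting a page, or None if it is plaintext."""
    if mode is Mode.TSME:
        # All memory is encrypted regardless of the C-bit value.
        return HOST_KEY
    if not c_bit:
        # Under SME/SEV, a page whose C-bit is clear is left unencrypted.
        return None
    if mode is Mode.SEV and vm_id is not None:
        # Each guest gets its own key, isolating it from other guests and the host.
        return f"vm-key-{vm_id}"
    return HOST_KEY
```

For instance, `key_for_access(Mode.SEV, True, vm_id=3)` yields a per-VM label, while the same page under `Mode.SME` maps to the single host key.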
Power
Figure legend (power delivery):
RDL - Redistribution layer
LDOs - Regulate RVDD to create VDD per core
RVDD - Ungated supply
VDD - Gated core supply
VDDM - L2/L3 SRAM supply
Zen presented AMD with a number of new challenges in the area of power, largely due to their decision to cover
the entire spectrum of systems from ultra-low power to high performance. Previously AMD handled this by
designing two independent architectures (i.e., Excavator and Puma). In Zen, SoC voltage coming from the
Voltage Regulator Module (VRM) is fed to RVDD, a package metal plane that distributes the highest VID
request from all cores. Each core has a digital low-dropout (LDO) regulator and a digital frequency
synthesizer (DFS) to vary frequency and voltage across power states on an individual core basis. The LDO
regulates RVDD for each power domain and creates an optimal VDD per core using a system of sensors
embedded across the entire chip; this is in addition to other properties such as countermeasures against droop.
This is in contrast to some alternative solutions such as Intel's, which integrated the voltage regulator
(FIVR) on die in Haswell (and subsequently removed it in Skylake due to a number of thermal restrictions it
created). Zen's new voltage control is an attempt at much finer power tuning on a per-core level, based on the
information it collects about that core and the overall chip.
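
The relationship between the shared RVDD plane and the per-core LDOs can be sketched with a few lines of arithmetic. This is an illustrative model only: the function name `distribute_voltage`, the millivolt figures in the usage note, and the assumption that each LDO drops RVDD exactly to the requested level are hypothetical and not taken from AMD documentation.

```python
from typing import Dict, Tuple


def distribute_voltage(vid_requests_mv: Dict[int, int]) -> Tuple[int, Dict[int, int]]:
    """Model RVDD and per-core LDO drops from per-core voltage requests (in mV).

    RVDD carries the highest request from all cores; each core's LDO then
    regulates RVDD down to that core's own requested VDD.
    """
    if not vid_requests_mv:
        raise ValueError("at least one core request is required")
    rvdd_mv = max(vid_requests_mv.values())
    # The drop each LDO must absorb is the gap between RVDD and the core's VDD.
    ldo_drop_mv = {core: rvdd_mv - vdd for core, vdd in vid_requests_mv.items()}
    return rvdd_mv, ldo_drop_mv
```

With made-up requests such as `{0: 1100, 1: 900, 2: 1250, 3: 1000}`, the plane sits at the level of the most demanding core (core 2), whose LDO drops nothing, while the other LDOs absorb the difference.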
 
AMD uses a Metal-Insulator-Metal Capacitor (MIMCap) layer between the two upper-level metal layers for fast current injection in order to mitigate voltage droop.
AMD stated that it covers roughly 45% of the core and a slightly smaller portion of the L3. Alongside the LDO circuit integrated for each core, there is a low-latency
power supply droop detector that can trigger the digital LDOs to turn on more drivers to counter droops.
A large number of sensors across the entire die are used to measure many of the CPU states including frequency, voltage, power, and temperature. The data is in turn
used for workload characterization, adaptive voltage, frequency tuning, and dynamic clocking. Adaptive voltage and frequency scaling (AVFS) is an on-die closed-loop
system that adjusts the voltage in real time based on the sensor data collected. This is part of AMD's "Precision Boost" technology,
which offers fine-grained 25 MHz clock increments.
Zen implements over 1300 sensors to monitor the state of the die over all critical paths including the CCX and external components such as the memory fabric.
Additionally, the CCX incorporates 48 high-speed power supply monitors, 20 thermal diodes, and 9 high-speed droop detectors.
System Management Unit
This section is empty.
Features
AMD introduced a series of new features in their new Zen microarchitecture:
Simultaneous MultiThreading (SMT)
Perhaps the single biggest enhancement to Zen is the addition of full-fledged simultaneous multithreading (SMT) support (a technology similar to Hyper-Threading
found in Intel processors). This is a departure from AMD's previous lightweight (and largely ineffective and to some degree misleading) Clustered Multithreading
(CMT). Zen is a properly simultaneous multi-threaded machine capable of handling two threads of execution throughout the entire machine. Below is a breakdown of
how the various core components work under SMT:
 - Competitively shared structures
 - Competitively shared and SMT tagged
 - Competitively shared with Algorithmic Priority
 - Statically Partitioned
The basics behind SMT are always the same: high utilization of resources through multiple threads of
execution. When a single thread is running, all structures become fully available to that thread as
needed. When a second thread is running, Zen attempts to share as many of the
resources as possible in an attempt to balance throughput and deliver the appropriate
structures to each thread as the software requires. The various structures can dynamically shift their
resources depending on the kind of workload being executed. Structures that are competitively shared
by the two threads (shaded in red in the diagram) include the execution units, schedulers, register file,
the decode, and cache (including the µOP cache). The load queue, ITLB, and DTLB (shaded in dark
cyan) are also competitively shared but require SMT tagging: resources (i.e., entry capacity) are
shared between the threads, but actual entry values (e.g., addresses) can only be accessed by the
owning thread.
The branch predictor and the two register renaming/allocation units (shaded in blue) are
competitively shared with algorithmic priority. Zen provides additional logic to give a certain thread
temporary priority in resource allocation over the other thread. One such occasion is when the BP
encounters a flush on one of the threads. Temporary priority is given to that thread in order to help it
fetch as many instructions as it can so it can get going again. Additionally, similar logic can be found
at dispatch to ensure good throughput by both threads and high utilization of the execution units.
The µOP Queue, Retire Queue, and Store Queue (shaded in green on the diagram) are statically partitioned, i.e., those units have duplicate logic to handle each thread
independently. They were duplicated rather than shared simply because of the high complexity that sharing them would involve.
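
The "competitively shared and SMT tagged" policy can be illustrated with a toy structure in which both threads compete for the same entry capacity while every entry stays visible only to the thread that installed it. This is a minimal sketch under stated assumptions: the class name `SmtTaggedTlb`, its methods, and the first-in-first-out replacement are hypothetical choices for illustration and say nothing about Zen's actual TLB organization or replacement policy.

```python
from typing import Dict, Optional, Tuple


class SmtTaggedTlb:
    """Toy TLB: capacity is shared by all threads, entries are tagged by owner."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Key is (thread_id, virtual_page); dicts keep insertion order,
        # which gives a simple FIFO replacement (an assumption, see lead-in).
        self.entries: Dict[Tuple[int, int], int] = {}

    def insert(self, thread_id: int, vpage: int, ppage: int) -> None:
        key = (thread_id, vpage)
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the oldest entry regardless of which thread owns it:
            # this is what makes the capacity competitively shared.
            oldest = next(iter(self.entries))
            del self.entries[oldest]
        self.entries[key] = ppage

    def lookup(self, thread_id: int, vpage: int) -> Optional[int]:
        # The thread tag is part of the match, so a thread never hits on
        # an entry installed by the other thread.
        return self.entries.get((thread_id, vpage))
```

A statically partitioned structure would instead give each thread its own fixed share of the entries, so one busy thread could never evict the other's entries.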
SenseMI Technology
SenseMI Technology (pronounced Sense-Em-Eye) is an umbrella term for a number of features AMD added to Zen microprocessors, designed to increase performance
through self-tuning driven by a network of sensors:
Neural Net Prediction - This appears to be largely a marketing term for Zen's much beefier and more finely tuned branch prediction unit. Zen uses a hashed
perceptron system to intelligently anticipate future code flows, allowing warming up of cold blocks in order to avoid possible waits. Most of that
functionality is already found on every modern high-end microprocessor (including AMD's own previous microarchitectures). Because AMD has not
disclosed any more specific information about the BP, it can only be speculated that no new groundbreaking logic was introduced in Zen.
Smart Prefetch - As with the Prediction Unit, this too appears to be a marketing term for the number of changes AMD introduced in the fetch stage, where
the branch predictor can get a hit on the next µOP and retrieve it via the µOP cache directly to the µOP Queue, eliminating the costly decode pipeline
stages. Additionally, Zen can detect various data patterns in the program's execution and predict future data requests, allowing for prefetching ahead of time and
reducing latency.
Pure Power - A feature in Zen that allows for dynamic voltage and frequency scaling (DVFS), similar to
AMD's PowerTune technology or Cool'n'Quiet, along with a number of other enhancements that extend
beyond the core to the Infinity Fabric (AMD's new proprietary interconnect). Pure Power monitors the
state of the processor (e.g., workload), which in turn allows it to downclock when not under load in
order to save power. Zen incorporates a network of sensors across the entire chip to aid Pure Power in its
monitoring.
Precision Boost - A feature that provides the ability to adjust the frequency of the processor on-the-fly given sufficient headroom (e.g., thermal limits based
on the data collected by a network of sensors across the chip), i.e., "Turbo Frequency". Precision Boost adjusts in 25 MHz increments. With Zen-based
APUs, AMD introduced Precision Boost 2 - an enhancement of the original PB feature that uses a new algorithm that controls the boost frequency on
a per-thread basis depending on the headroom.
Extended Frequency Range (XFR) - This is a fully automated solution that attempts to raise the
upper limit on the maximum frequency based on the cooling technique used (e.g., air, water, LN2).
Whenever the chip senses that conditions are suitable for a given frequency, it will attempt to increase that
limit further. XFR is partially enabled on all models, providing an extra +50 MHz frequency boost
whenever possible. For 'X' models, full XFR is enabled, providing twice the headroom at up to +100 MHz. With Zen-based
APUs, AMD introduced Mobile XFR (mXFR), which offers mobile devices with premium cooling a
higher boost frequency that can be sustained for a longer period of time.
The accompanying AMD presentation slide depicts a normal use case for the Ryzen 7 1800X. Under a normal workload, the processor operates at around its base
frequency of 3.6 GHz. Under a heavier workload, Precision Boost kicks in and increments the frequency as necessary up to its maximum of 4 GHz. With
adequate cooling, XFR will bump it up an additional 100 MHz. This boost is sustainable for the first two active cores, at which point the boost frequency will drop to
the "all core" frequency. Under a light workload, the processor reduces its frequency. As Pure Power senses the workload and CPU state, it can also
drastically downclock the CPU when appropriate (such as in the graph during mostly idle scenarios).
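
The arithmetic of the 1800X example can be written out as a short sketch. It only encodes the figures quoted above (3.6 GHz base, 4 GHz maximum boost, +100 MHz XFR, 25 MHz steps); the function names and the idea of a single "requested" frequency are hypothetical simplifications, and the model ignores the all-core limit and Pure Power downclocking.

```python
BASE_MHZ = 3600        # Ryzen 7 1800X base frequency
MAX_BOOST_MHZ = 4000   # maximum Precision Boost frequency
XFR_BONUS_MHZ = 100    # extra headroom on 'X' models with adequate cooling
STEP_MHZ = 25          # Precision Boost granularity


def frequency_ceiling(adequate_cooling: bool) -> int:
    """Highest frequency reachable, with or without the XFR bonus."""
    return MAX_BOOST_MHZ + (XFR_BONUS_MHZ if adequate_cooling else 0)


def granted_frequency(requested_mhz: int, adequate_cooling: bool) -> int:
    """Cap a requested frequency at the ceiling and snap it down to a 25 MHz step."""
    capped = min(requested_mhz, frequency_ceiling(adequate_cooling))
    return (capped // STEP_MHZ) * STEP_MHZ


# Number of 25 MHz increments between base and maximum boost.
BOOST_STEPS = (MAX_BOOST_MHZ - BASE_MHZ) // STEP_MHZ
```

Between the base and maximum boost frequency there are therefore 16 possible increments, and XFR adds four more on top when cooling allows.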
Scalability
CPU Complex (CCX)
AMD organized Zen in groups of cores called a CPU Complex (CCX). Each CCX consists of four cores connected to an L3 cache.
The L3 cache is an 8 MiB 16-way set associative victim cache and is mostly exclusive of the L2. The L3 cache is made of four slices
(providing a 2 MiB L3 slice per core) interleaved by low-order address bits. Every core can access every L3 cache slice with the
same average latency. When a certain core starts working on a chunk of memory it will fill up the L2, and as it continues to execute
and fetch new data, any spillover will find its way into the L3.
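
Low-order address interleaving across the four L3 slices can be pictured as follows. This is a sketch under an explicit assumption: the text only states that slices are interleaved by low-order address, so selecting the slice from the two bits directly above a 64-byte line offset is a guess made for illustration, and `l3_slice` is a hypothetical helper name.

```python
LINE_BYTES = 64   # cache line size
NUM_SLICES = 4    # one 2 MiB slice per core in a CCX


def l3_slice(physical_address: int) -> int:
    """Map an address to a slice using the bits just above the line offset.

    ASSUMPTION: the exact bits used by the hardware are not given in the
    text; consecutive cache lines rotating through slices is illustrative.
    """
    return (physical_address // LINE_BYTES) % NUM_SLICES
```

Under this assumed mapping, a core streaming through contiguous memory spreads its lines evenly over all four slices instead of concentrating on the slice nearest to it, which fits with every core seeing the same average latency to every slice.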
Depending on the exact processor model, there may be one or more CCXs joined together. For example, all mainstream
Ryzen 3/5/7 models have two CCXs with up to 8 cores (with an equal number of cores disabled on each CCX when the chips are
down-binned to 4/6 cores). It's important to note that the L3 in Zen is not a true last level cache (LLC), as the 16 MiB L3$ consists of
two separate 8 MiB caches and not one unified L3. The separate CPU complexes can communicate with each other via the Infinity Fabric,
which connects the CCXs along with the memory controller and I/O. While the CCXs operate at core frequency (CClk), the fabric
itself operates at MemClk (see § Clock domains). This design choice allows for scaling up to large high-performance multi-core
systems (i.e., high scalability, particularly in the server segment, through high core count and large bandwidth), but it does mean that
systems making use of Zen processors have to treat every CPU Complex as a processor of its own, i.e., schedule tasks using
cache-coherent non-uniform memory access (ccNUMA-aware) scheduling. This is important to ensure that threads are not moved from one CCX to the other, as doing so will
likely incur unnecessary performance penalties (cache data would need to be transferred over the fabric from one CCX to the next, which has additional
latency and lower bandwidth).
While specific worst-case scenario performance tests have shown that rapid inter-CCX data movement incurs a substantial performance penalty, real-world tests have
shown the penalty is rather small in practice, as the operating system (e.g., Windows) generally keeps threads on the same CCX. Additionally, performance can be improved with
faster memory kits, which in turn increase the frequency of the fabric as well (see § Clock domains).
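
What CCX-aware scheduling has to reason about can be sketched with a few helper functions. This is a minimal illustration that assumes physical cores are numbered consecutively within each CCX (cores 0-3 on the first CCX, 4-7 on the second); real logical CPU numbering, especially with SMT enabled, depends on the operating system and should be checked rather than assumed. All function names are hypothetical.

```python
from typing import Set

CORES_PER_CCX = 4  # each CCX consists of four cores


def ccx_of(core_id: int) -> int:
    """CCX index of a core, assuming consecutive numbering within a CCX."""
    return core_id // CORES_PER_CCX


def cores_in_ccx(ccx_id: int, total_cores: int = 8) -> Set[int]:
    """All core ids that share an L3 with each other on the given CCX."""
    return {core for core in range(total_cores) if ccx_of(core) == ccx_id}


def crosses_ccx(src_core: int, dst_core: int) -> bool:
    """True if migrating a thread would leave its current L3 behind."""
    return ccx_of(src_core) != ccx_of(dst_core)
```

On Linux, a set such as `cores_in_ccx(0)` could be passed to `os.sched_setaffinity(0, ...)` to keep a process on one CCX, provided the numbering assumption above actually holds on the machine in question.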
Multiprocessors
See also: Naples Core and AMD EPYC
As part of the Zen microarchitecture, AMD also developed a series of dual-socket multiprocessors. The new server
processors are branded under a new family called EPYC, which effectively succeeds the older Opteron family. All
EPYC processors consist of four Zeppelin dies stitched together. Since each Zeppelin is a complete system on chip
with the northbridge and southbridge integrated on-die, the combination of four of them allows AMD to offer a
sufficient number of I/O signals that a chipset can be entirely eliminated. Communication between the individual dies
is done via AMD's Infinity Fabric protocol over a set of GMI (Global Memory Interconnect) links.
Each Zeppelin provides 32 PCIe Gen 3.0 lanes for a total of 128 lanes. In a single-socket configuration, all 128 lanes
may be used for general purpose I/O - for example, 6 GPUs over x16 and x8 more lanes for additional storage. This is
considerably more than any comparable contemporary Intel model (either Broadwell EP or Skylake SP). Naples-based
processors scale all the way up to 32 cores with 64 threads (for up to 64 cores and 128 threads per complete
system). The caveat is that in 2-way MP mode, half of the lanes are lost: 64 of the 128 PCIe lanes get allocated for interchip communication via AMD's
Infinity Fabric protocol, with the remaining 64 lanes left for the system. 64 PCIe lanes for socket-to-socket communication provides a maximum
