Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts used by the EDAC subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc...

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!). In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick. These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits + 8 bits of ECC data.
For example, a 72-bit wide rank needs nine x8 devices or eighteen x4
devices.

* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel. In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. It is most often called
a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group
of DIMMs. Each channel has its own independent control (command) and
data bus, and can be used independently or grouped with other channels.

* Branch

Typically the highest level of the hierarchy on a Fully-Buffered DIMM
memory controller. It usually contains two channels. Two channels on the
same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cache line is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. On the other hand, it is
capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64-bits wide, it flows to the CPU using one
64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data size accessed by the memory controller is interleaved between
two DIMMs, which are accessed at the same time. E.g. if the DIMM is
64-bits wide (72 bits with ECC), the data flows to the CPU using a
128-bit parallel access.

* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, and
for dual channel 128 bits. It may not be visible to the memory
controller, as some DIMM types have a memory buffer that can hide direct
access to it from the Memory Controller.

* Single-Ranked stick

A Single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows; the other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices. The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access,
or all of the memory sticks spanned by a chip-select row. A single
socket set has two chip-select rows, and if double-sided sticks are
used, these will occupy those chip-select rows.

* Bank

This term is avoided because it is ambiguous when one needs to
distinguish between chip-select rows and socket sets.

* High Bandwidth Memory (HBM)

HBM is a new memory type with low power consumption and ultra-wide
communication lanes. It uses vertically stacked memory chips (DRAM dies)
interconnected by microscopic wires called "through-silicon vias," or
TSVs.

Several stacks of HBM chips connect to the CPU or GPU through an
ultra-fast interconnect called the "interposer". Therefore, HBM's
characteristics are nearly indistinguishable from on-chip integrated
RAM.

Memory Controllers
------------------

Most of the EDAC core is focused on Memory Controller error detection.
Drivers allocate a memory controller descriptor with
:c:func:`edac_mc_alloc`, which returns a struct ``mem_ctl_info``. Apart
from the descriptive fields that a driver fills in before registration,
this struct is opaque to the EDAC drivers: only the EDAC core is allowed
to touch its internals.
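
As a minimal sketch of the allocation and registration flow (the driver
name, the 4 x 2 layer geometry and the capability flags below are
illustrative assumptions, not taken from any real driver):

.. code-block:: c

   /* Hypothetical driver living in drivers/edac/ */
   #include "edac_module.h"

   static struct mem_ctl_info *my_mc_create(struct device *dev)
   {
           struct edac_mc_layer layers[2];
           struct mem_ctl_info *mci;

           /* Illustrative geometry: 4 chip-select rows x 2 channels */
           layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
           layers[0].size = 4;
           layers[0].is_virt_csrow = true;
           layers[1].type = EDAC_MC_LAYER_CHANNEL;
           layers[1].size = 2;
           layers[1].is_virt_csrow = false;

           /* Last argument is the size of a driver-private area */
           mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, 0);
           if (!mci)
                   return NULL;

           mci->pdev = dev;                     /* device owning this MC */
           mci->mtype_cap = MEM_FLAG_DDR4;      /* assumed memory type */
           mci->edac_ctl_cap = EDAC_FLAG_SECDED;
           mci->mod_name = "my_edac";
           mci->ctl_name = "my_memory_controller";

           /* Register with the EDAC core; this creates the sysfs nodes */
           if (edac_mc_add_mc(mci)) {
                   edac_mc_free(mci);
                   return NULL;
           }
           return mci;
   }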

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by
calling :c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.
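
A minimal sketch of that flow, with made-up names and neither private
state nor a check routine wired up:

.. code-block:: c

   /* Hypothetical driver living in drivers/edac/ */
   #include "edac_module.h"

   static struct edac_pci_ctl_info *my_pci_ctl;

   static int my_edac_pci_register(struct device *dev)
   {
           /* No private state in this sketch, hence sz_pvt == 0 */
           my_pci_ctl = edac_pci_alloc_ctl_info(0, "my_pci_err");
           if (!my_pci_ctl)
                   return -ENOMEM;

           my_pci_ctl->dev = dev;
           my_pci_ctl->mod_name = "my_edac";
           my_pci_ctl->ctl_name = "my_pci_controller";

           /* Register instance 0 with the EDAC core */
           if (edac_pci_add_device(my_pci_ctl, 0)) {
                   edac_pci_free_ctl_info(my_pci_ctl);
                   return -ENODEV;
           }
           return 0;
   }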

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract
'edac_device' representation at sysfs.

This set of structures, and the code that implements their APIs,
provides for registering EDAC-type devices which are NOT standard memory
or PCI, such as:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

        /sys/devices/system/edac/..

        pci/            <existing pci directory (if available)>
        mc/             <existing memory device directory>
        cpu/cpu0/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                          /ue_count
                /L2-cache/ce_count
                          /ue_count
        cpu/cpu1/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                          /ue_count
                /L2-cache/ce_count
                          /ue_count
        ...

The L1 and L2 directories would be ``edac_device_block``'s.
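
Tying this to the cache example above, the following is a hedged sketch
of how such blocks could be allocated and registered. Note that the
argument list of :c:func:`edac_device_alloc_ctl_info` has changed across
kernel versions; this follows the long-standing variant that takes
optional per-block sysfs attributes, and all names and counts are
illustrative:

.. code-block:: c

   /* Hypothetical driver living in drivers/edac/ */
   #include "edac_module.h"

   static struct edac_device_ctl_info *cache_edac;

   /* One instance per core ("cpu0".."cpuN"), two blocks per instance
    * ("cache0" for L1, "cache1" for L2).
    */
   static int my_cache_edac_register(struct device *dev, unsigned int nr_cores)
   {
           cache_edac = edac_device_alloc_ctl_info(0,   /* no private data */
                                                   "cpu", nr_cores,
                                                   "cache", 2,
                                                   0,   /* block index offset */
                                                   NULL, 0, /* no custom attrs */
                                                   edac_device_alloc_index());
           if (!cache_edac)
                   return -ENOMEM;

           cache_edac->dev = dev;
           cache_edac->mod_name = "my_cache_edac";
           cache_edac->ctl_name = "cpu_caches";
           cache_edac->dev_name = dev_name(dev);

           if (edac_device_add_device(cache_edac)) {
                   edac_device_free_ctl_info(cache_edac);
                   return -ENODEV;
           }
           return 0;
   }

Errors are then counted against a given instance and block: e.g. a
corrected error in core 1's L2 cache could be reported with
``edac_device_handle_ce(cache_edac, 1, 1, "L2 data ECC error")``, which
bumps the corresponding ``ce_count`` in sysfs.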

.. kernel-doc:: drivers/edac/edac_device.h


Heterogeneous system support
----------------------------

An AMD heterogeneous system is built by connecting the data fabrics of
both CPUs and GPUs via custom xGMI links. Thus, the data fabric on the
GPU nodes can be accessed the same way as the data fabric on CPU nodes.

The MI200 accelerators are data center GPUs. They have 2 data fabrics,
and each GPU data fabric contains four Unified Memory Controllers (UMCs).
Each UMC contains eight channels, and each UMC channel controls one
128-bit HBM2e (2GB) channel (equivalent to 8 X 2GB ranks). This creates
a total of 4096 bits of DRAM data bus per data fabric.

While each UMC interfaces with a 16GB (8-high X 2GB DRAM) HBM stack,
each UMC channel interfaces with 2GB of DRAM (represented as a rank).

Memory controllers on AMD GPU nodes can be represented in EDAC as
follows::

        GPU DF / GPU Node -> EDAC MC
        GPU UMC           -> EDAC CSROW
        GPU UMC channel   -> EDAC CHANNEL

Consider, for example, a heterogeneous system where 1 AMD CPU is
connected to 4 MI200 (Aldebaran) GPUs using xGMI.

Some more heterogeneous hardware details:

- The CPU UMC (Unified Memory Controller) is mostly the same as the GPU
  UMC. They have chip selects (csrows) and channels. However, the
  layouts are different for performance, physical layout, or other
  reasons.
- CPU UMCs use 1 channel, so in this case UMC = EDAC channel. This
  follows the marketing speak: the CPU has X memory channels, etc.
- CPU UMCs use up to 4 chip selects, so UMC chip select = EDAC CSROW.
- GPU UMCs use 1 chip select, so UMC = EDAC CSROW.
- GPU UMCs use 8 channels, so UMC channel = EDAC channel.
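
Putting the GPU side of that mapping into code, a minimal sketch of the
layer description for one GPU node could look as follows (the real code
lives in the ``amd64_edac`` driver and differs in detail;
``gpu_node_id`` is a hypothetical variable holding the node's MC index):

.. code-block:: c

   /* One MI200 GPU node (data fabric): 4 UMCs mapped to csrows,
    * 8 channels per UMC mapped to EDAC channels.
    */
   struct edac_mc_layer gpu_layers[] = {
           {
                   .type = EDAC_MC_LAYER_CHIP_SELECT,
                   .size = 4,              /* 4 UMCs per node */
                   .is_virt_csrow = true,
           },
           {
                   .type = EDAC_MC_LAYER_CHANNEL,
                   .size = 8,              /* 8 channels per UMC */
                   .is_virt_csrow = false,
           },
   };

   struct mem_ctl_info *mci = edac_mc_alloc(gpu_node_id,
                                            ARRAY_SIZE(gpu_layers),
                                            gpu_layers, 0);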

The EDAC subsystem provides a mechanism to handle AMD heterogeneous
systems by calling system-specific ops for both CPUs and GPUs.

AMD GPU nodes are enumerated in sequential order based on the PCI
hierarchy, and the first GPU node is assumed to have a Node ID value
following those of the CPU nodes, after the latter are fully
populated::

        $ ls /sys/devices/system/edac/mc/
              mc0 - CPU MC node 0
              mc1  |
              mc2  |- GPU card[0] => node 0(mc1), node 1(mc2)
              mc3  |
              mc4  |- GPU card[1] => node 0(mc3), node 1(mc4)
              mc5  |
              mc6  |- GPU card[2] => node 0(mc5), node 1(mc6)
              mc7  |
              mc8  |- GPU card[3] => node 0(mc7), node 1(mc8)

For example, a heterogeneous system with one AMD CPU is connected to
four MI200 (Aldebaran) GPUs using xGMI. This topology can be represented
via the following sysfs entries::

        /sys/devices/system/edac/mc/..

        CPU                       # CPU node
        ├── mc 0

        GPU Nodes are enumerated sequentially after CPU nodes have been populated
        GPU card 1                # Each MI200 GPU has 2 nodes/mcs
        ├── mc 1                  # GPU node 0 == mc1, Each MC node has 4 UMCs/CSROWs
        │   ├── csrow 0           # UMC 0
        │   │   ├── channel 0     # Each UMC has 8 channels
        │   │   ├── channel 1     # size of each channel is 2 GB, so each UMC has 16 GB
        │   │   ├── channel 2
        │   │   ├── channel 3
        │   │   ├── channel 4
        │   │   ├── channel 5
        │   │   ├── channel 6
        │   │   ├── channel 7
        │   ├── csrow 1           # UMC 1
        │   │   ├── channel 0
        │   │   ├── ..
        │   │   ├── channel 7
        │   ├── ..                ..
        │   ├── csrow 3           # UMC 3
        │   │   ├── channel 0
        │   │   ├── ..
        │   │   ├── channel 7
        │   ├── rank 0
        │   ├── ..                ..
        │   ├── rank 31           # total 32 ranks/dimms from 4 UMCs
        ├
        ├── mc 2                  # GPU node 1 == mc2
        │   ├── ..                # each GPU node has 64 GB in total

        GPU card 2
        ├── mc 3
        │   ├── ..
        ├── mc 4
        │   ├── ..

        GPU card 3
        ├── mc 5
        │   ├── ..
        ├── mc 6
        │   ├── ..

        GPU card 4
        ├── mc 7
        │   ├── ..
        ├── mc 8
        │   ├── ..