| perf-arm-spe(1) |
| ================ |
| |
| NAME |
| ---- |
| perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools |
| |
| SYNOPSIS |
| -------- |
| [verse] |
| 'perf record' -e arm_spe// |
| |
| DESCRIPTION |
| ----------- |
| |
| The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and |
| events down to individual instructions. Rather than being interrupt-driven, it picks an |
| instruction to sample and then captures data for it during execution. Data includes execution time |
| in cycles. For loads and stores it also includes data address, cache miss events, and data origin. |
| |
| The sampling has 5 stages: |
| |
| 1. Choose an operation |
| 2. Collect data about the operation |
| 3. Optionally discard the record based on a filter |
| 4. Write the record to memory |
| 5. Interrupt when the buffer is full |
| |
| Choose an operation |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice of |
| all architectural instructions or all micro-ops. Sampling happens at a programmable interval. The |
| architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should |
| sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random |
| perturbation is also added to the sampling interval by default. |
| |
| Collect data about the operation |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Program counter, PMU events, timings and data addresses related to the operation are recorded. |
| Sampling ensures that only one sampled operation is in flight at a time. |
| |
| Optionally discard the record based on a filter |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Based on programmable criteria, choose whether to keep the record or discard it. If the record is |
| discarded then the flow stops here for this sample. |
| |
| Write the record to memory |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| The record is appended to a memory buffer. |
| |
| Interrupt when the buffer is full |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records. |
| Perf saves the raw data in the perf.data file. |
| |
| Opening the file |
| ---------------- |
| |
| Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the |
| recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding |
| the data, Perf generates "synthetic samples" as if these were generated at the time of the |
| recording. These samples are the same as if normal sampling was done by Perf without using SPE, |
| although they may have more attributes associated with them. For example a normal sample may have |
| just the instruction pointer, but an SPE sample can have data addresses and latency attributes. |
| |
| Why Sampling? |
| ------------- |
| |
| - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for |
| hardware. Only one sampled operation is in flight at a time. |
| |
| - Allows precise attribution of data, including the full PC of the instruction and the data |
| virtual and physical addresses. |
| |
| - Allows correlation between an instruction and events, such as TLB and cache miss. (Data source |
| indicates which particular cache was hit, but the meaning is implementation defined because |
| different implementations can have different cache configurations.) |
| |
| However, SPE does not provide any call-graph information, and relies on statistical methods. |
| |
| Collisions |
| ---------- |
| |
| When an operation is sampled while a previous sampled operation has not finished, a collision |
| occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate |
| should be set to avoid collisions. |
| |
| The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this |
| count is based on collisions _before_ filtering occurs, so it cannot be used as an exact number of |
| dropped samples that would have made it through the filter. It can still serve as a rough |
| guide. |
| |
| The effect of microarchitectural sampling |
| ----------------------------------------- |
| |
| If an implementation samples micro-operations instead of instructions, the results of sampling must |
| be weighted accordingly. |
| |
| For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it |
| becomes twice as likely to appear in the sample population. |
| |
| The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be |
| estimated from the 'sample_pop' and 'inst_retired' PMU events. |
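| |
| As a rough illustration of the weighting, the sketch below divides a raw sample count by the |
| number of micro-operations an instruction is assumed to decode into. The helper name |
| weighted_samples and the counts are hypothetical, and the micro-op count per instruction has to |
| come from knowledge of the specific CPU. |
| |
```shell
# Hypothetical helper: scale a raw SPE sample count by the number of
# micro-operations the instruction is assumed to be converted into.
weighted_samples() {
    samples=$1
    uops=$2
    echo $(( samples / uops ))
}

# Instruction A decodes into 2 micro-ops, instruction B into 1 (assumed).
weighted_samples 200 2   # A: 200 raw samples -> 100
weighted_samples 100 1   # B: 100 raw samples -> 100
```
| |
| With these assumed numbers A and B executed equally often, even though A has twice as many raw |
| samples. |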
| |
| Kernel Requirements |
| ------------------- |
| |
| The ARM_SPE_PMU config option must be enabled, either as a module or built in. |
| |
| Depending on CPU model, the kernel may need to be booted with page table isolation disabled |
| (kpti=off). If KPTI needs to be disabled but is not, the SPE driver will fail with the console |
| message "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line". |
| |
| For the full criteria that determine whether KPTI needs to be forced off or not, see function |
| unmap_kernel_at_el0() in the kernel sources. Common cases where it's not required |
| are on the CPUs in kpti_safe_list, or on Armv8.5+ where FEAT_E0PD is mandatory. |
| |
| The SPE interrupt must also be described by the firmware. If the module is loaded and KPTI is |
| disabled (or isn't required to be disabled) but the SPE PMU still doesn't show in |
| /sys/bus/event_source/devices/, then it's possible that the SPE interrupt isn't described by |
| ACPI or DT. In this case no warning will be printed by the driver. |
| |
| Capturing SPE with perf command-line tools |
| ------------------------------------------ |
| |
| You can record a session with SPE samples: |
| |
| perf record -e arm_spe// -- ./mybench |
| |
| The sample period is set from the -c option, and because the minimum interval is used by default |
| it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL. |
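| |
| To get a feel for how the interval affects data volume, the following sketch estimates the number |
| of samples taken before filtering and collisions as the sample population divided by the |
| interval. The helper name expected_samples and the operation counts are hypothetical, and jitter |
| is ignored. |
| |
```shell
# Hypothetical helper: rough number of SPE samples before filtering and
# collisions, given the size of the sample population and the -c interval.
expected_samples() {
    population=$1
    interval=$2
    echo $(( population / interval ))
}

# One billion operations (assumed) at two different intervals.
expected_samples 1000000000 1024     # prints 976562
expected_samples 1000000000 100000   # prints 10000
```
| |
| A larger -c value therefore reduces both the perf.data size and the chance of collisions, at the |
| cost of fewer samples. |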
| |
| Config parameters |
| ~~~~~~~~~~~~~~~~~ |
| |
| These are placed between the // in the event and are comma separated. For example: '-e |
| arm_spe/load_filter=1,min_latency=10/'. |
| |
| event_filter=<mask> - logical AND filter on specific events (PMSEVFR) - see bitfield description below |
| inv_event_filter=<mask> - logical OR to filter out specific events (PMSNEVFR, FEAT_SPEv1p2) - see bitfield description below |
| jitter=1 - use jitter to avoid resonance when sampling (PMSIRR.RND) |
| min_latency=<n> - collect only samples with this latency or higher* (PMSLATFR) |
| pa_enable=1 - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege |
| pct_enable=1 - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege |
| ts_enable=1 - enable timestamping with value of generic timer (PMSCR.TS) |
| discard=1 - enable SPE PMU events but don't collect sample data - see 'Discard mode' (PMBLIMITR.FM = DISCARD) |
| inv_data_src_filter=<mask> - mask to filter from 0-63 possible data sources (PMSDSFR, FEAT_SPE_FDS) - See 'Data source filtering' |
| |
| +++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather |
| than only the execution latency. |
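| |
| When scripting recordings it can help to assemble the event string from individual parameters. |
| This is a minimal sketch. The helper name spe_event is hypothetical and it does no validation of |
| parameter names or values. |
| |
```shell
# Hypothetical helper: join key=value config parameters with commas and
# place them between the slashes of the arm_spe event.
# The function body runs in a subshell so the IFS change does not leak.
spe_event() (
    IFS=,
    printf 'arm_spe/%s/\n' "$*"
)

spe_event load_filter=1 min_latency=10   # prints arm_spe/load_filter=1,min_latency=10/
spe_event                                # prints arm_spe//

# Possible use (not run here):
#   perf record -e "$(spe_event load_filter=1 min_latency=10)" -- ./mybench
```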
| |
| Only some events can be filtered on using 'event_filter' bits. The overall |
| filter is the logical AND of these bits, for example if bits 3 and 5 are set |
| only samples that have both 'L1D refill' AND 'TLB refill' are recorded. When |
| FEAT_SPEv1p2 is implemented 'inv_event_filter' can also be used to exclude |
| samples that have any (OR) of the filter's bits set. For example setting bits 3 |
| and 5 in 'inv_event_filter' will exclude any samples that are either L1D |
| refill OR TLB refill. If the same bit is set in both filters it's UNPREDICTABLE |
| whether the sample is included or excluded. Filter bits for both event_filter |
| and inv_event_filter are: |
| |
| bit 1 - Instruction retired (i.e. omit speculative instructions) |
| bit 2 - L1D access (FEAT_SPEv1p4) |
| bit 3 - L1D refill |
| bit 4 - TLB access (FEAT_SPEv1p4) |
| bit 5 - TLB refill |
| bit 6 - Not taken event (FEAT_SPEv1p2) |
| bit 7 - Mispredict |
| bit 8 - Last level cache access (FEAT_SPEv1p4) |
| bit 9 - Last level cache miss (FEAT_SPEv1p4) |
| bit 10 - Remote access (FEAT_SPEv1p4) |
| bit 11 - Misaligned access (FEAT_SPEv1p1) |
| bit 12-15 - IMPLEMENTATION DEFINED events (when implemented) |
| bit 17 - Partial or empty SME or SVE predicate (FEAT_SPEv1p1) |
| bit 18 - Empty SME or SVE predicate (FEAT_SPEv1p1) |
| bit 19 - L2D access (FEAT_SPEv1p4) |
| bit 20 - L2D miss (FEAT_SPEv1p4) |
| bit 21 - Cache data modified (FEAT_SPEv1p4) |
| bit 22 - Recently fetched (FEAT_SPEv1p4) |
| bit 23 - Data snooped (FEAT_SPEv1p4) |
| bit 24 - Streaming SVE mode event (when FEAT_SPE_SME is implemented), or |
| IMPLEMENTATION DEFINED event 24 (when implemented, only versions |
| less than FEAT_SPEv1p4) |
| bit 25 - SMCU or external coprocessor operation event when FEAT_SPE_SME is |
| implemented, or IMPLEMENTATION DEFINED event 25 (when implemented, |
| only versions less than FEAT_SPEv1p4) |
| bit 26-31 - IMPLEMENTATION DEFINED events (only versions less than FEAT_SPEv1p4) |
| bit 48-63 - IMPLEMENTATION DEFINED events (when implemented) |
| |
| For IMPLEMENTATION DEFINED bits, refer to the CPU TRM if these bits are |
| implemented. |
| |
| The driver will reject events if requested filter bits require unimplemented SPE |
| versions, but will not reject filter bits for unimplemented IMPDEF bits or when |
| their related feature is not present (e.g. SME). For example, if FEAT_SPEv1p2 is |
| not implemented, filtering on "Not taken event" (bit 6) will be rejected. |
| |
| So to sample just retired instructions: |
| |
| perf record -e arm_spe/event_filter=2/ -- ./mybench |
| |
| or just mispredicted branches: |
| |
| perf record -e arm_spe/event_filter=0x80/ -- ./mybench |
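| |
| The mask values above come from setting the listed bit numbers. As a sketch, a small helper can |
| compute the mask for any combination of bits. The name spe_mask is hypothetical and the helper |
| does not check whether the bits are implemented. |
| |
```shell
# Hypothetical helper: OR together the given event bit numbers and print
# the result as a hex mask suitable for event_filter or inv_event_filter.
spe_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_mask 1      # instruction retired        -> 0x2
spe_mask 7      # mispredict                 -> 0x80
spe_mask 3 5    # L1D refill AND TLB refill  -> 0x28

# Possible use (not run here):
#   perf record -e arm_spe/event_filter="$(spe_mask 3 5)"/ -- ./mybench
```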
| |
| When set, the following filters can be used to select samples that match any of |
| the operation types (OR filtering). If only one is set then only samples of that |
| type are collected: |
| |
| branch_filter=1 - Collect branches (PMSFCR.B) |
| load_filter=1 - Collect loads (PMSFCR.LD) |
| store_filter=1 - Collect stores (PMSFCR.ST) |
| |
| When extended filtering is supported (FEAT_SPE_EFT), SIMD and floating |
| point operations can also be selected: |
| |
| simd_filter=1 - Collect SIMD loads, stores and operations (PMSFCR.SIMD) |
| float_filter=1 - Collect floating point loads, stores and operations (PMSFCR.FP) |
| |
| When extended filtering is supported (FEAT_SPE_EFT), operation type filters can |
| be changed to AND using _mask fields. For example samples could be selected if |
| they are store AND SIMD by setting 'store_filter=1,simd_filter=1, |
| store_filter_mask=1,simd_filter_mask=1'. The new masks are as follows: |
| |
| branch_filter_mask=1 - Change branch filter behavior from OR to AND (PMSFCR.Bm) |
| load_filter_mask=1 - Change load filter behavior from OR to AND (PMSFCR.LDm) |
| store_filter_mask=1 - Change store filter behavior from OR to AND (PMSFCR.STm) |
| simd_filter_mask=1 - Change SIMD filter behavior from OR to AND (PMSFCR.SIMDm) |
| float_filter_mask=1 - Change floating point filter behavior from OR to AND (PMSFCR.FPm) |
| |
| Viewing the data |
| ~~~~~~~~~~~~~~~~ |
| |
| By default perf report and perf script will assign samples to separate groups depending on the |
| attributes/events of the SPE record. Because instructions can have multiple events associated with |
| them, the samples in these groups are not necessarily unique. For example perf report shows these |
| groups: |
| |
| Available samples |
| 0 arm_spe// |
| 0 dummy:u |
| 21 l1d-miss |
| 897 l1d-access |
| 5 llc-miss |
| 7 llc-access |
| 2 tlb-miss |
| 1K tlb-access |
| 36 branch |
| 0 remote-access |
| 900 memory |
| 1800 instructions |
| |
| The arm_spe// and dummy:u events are implementation details and are expected to be empty. |
| |
| The instructions group contains the full list of unique samples that are not |
| sorted into other groups. To generate only this group use --itrace=i1i. |
| |
| 1i (1 instruction interval) signifies no further downsampling. Rather than an |
| instruction interval, this generates a sample every n SPE samples. For example |
| to generate the default set of events for every 100 SPE samples: |
| |
| perf report --itrace=bxofmtMai100i |
| |
| Other period types, for example nanoseconds (ns), are not currently supported. |
| |
| Memory access details are also stored on the samples and this can be viewed with: |
| |
| perf report --mem-mode |
| |
| The latency value from the SPE sample is stored in the 'weight' field of the |
| Perf samples and can be displayed in Perf script and report outputs by enabling |
| its display from the command line. |
| |
| Common errors |
| ~~~~~~~~~~~~~ |
| |
| - "Cannot find PMU `arm_spe'. Missing kernel support?" |
| |
| Module not built or loaded, KPTI not disabled, interrupt not described by firmware, |
| or running on a VM. See 'Kernel Requirements' above. |
| |
| - "Arm SPE CONTEXT packets not found in the traces." |
| |
| Root privilege is required to collect context packets. However, these only improve the accuracy of |
| assigning PIDs to kernel samples. For userspace sampling this can be ignored. |
| |
| - Excessively large perf.data file size |
| |
| Increase the sampling interval using the -c option (see above). |
| |
| PMU events |
| ~~~~~~~~~~ |
| |
| SPE has events that can be counted on core PMUs. These are prefixed with |
| SAMPLE_, for example SAMPLE_POP, SAMPLE_FEED, SAMPLE_COLLISION and |
| SAMPLE_FEED_BR. |
| |
| These events will only count when an SPE event is running on the same core that |
| the PMU event is opened on, otherwise they read as 0. There are various ways to |
| ensure that the PMU event and SPE event are scheduled together depending on the |
| way the event is opened. One option is to open both events as per-process events |
| on the same process, but it's not guaranteed that the PMU event is enabled |
| first when context switching. For that reason it may be better to open the PMU |
| event as a systemwide event and then open SPE on the process of interest. |
| |
| Discard mode |
| ~~~~~~~~~~~~ |
| |
| SPE related (SAMPLE_* etc) core PMU events can be used without the overhead of |
| collecting sample data if discard mode is supported (optional from Armv8.6). |
| First run a system wide SPE session (or on the core of interest) using options |
| to minimize output. Then run perf stat: |
| |
| perf record -e arm_spe/discard/ -a -N -B --no-bpf-event -o - > /dev/null & |
| perf stat -e SAMPLE_FEED_LD |
| |
| Data source filtering |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| When FEAT_SPE_FDS is present, 'inv_data_src_filter' can be used as a mask to |
| filter on a subset (0 - 63) of possible data source IDs. The full range of data |
| sources is 0 - 65535 although these are unlikely to be used in practice. Data |
| sources are IMPDEF so refer to the TRM for the mappings. Each bit N of the |
| filter maps to data source N. The filter is an OR of all the bits, and the value |
| provided in inv_data_src_filter is inverted before being written to PMSDSFR_EL1 so |
| that set bits exclude that data source and cleared bits include that data source. |
| Therefore the default value of 0 is equivalent to no filtering (all data sources |
| included). |
| |
| For example, to include only data sources 0 and 3, clear bits 0 and 3 |
| (0xFFFFFFFFFFFFFFF6). |
| |
| When 'inv_data_src_filter' is set to 0xFFFFFFFFFFFFFFFF, any samples with any |
| data source set are excluded. |
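| |
| Because the filter is inverted, it can be easier to start from the data sources to include and |
| derive the value. The sketch below does this for sources 0 - 63. The helper name |
| inv_data_src_filter_for is hypothetical. It prints the 64-bit value as two 32-bit halves to avoid |
| relying on how the shell formats negative numbers. |
| |
```shell
# Hypothetical helper: given the data source IDs (0-63) to INCLUDE, print
# the inv_data_src_filter value, which has those bits cleared and all
# other bits set.
inv_data_src_filter_for() {
    include=0
    for src in "$@"; do
        include=$(( include | (1 << src) ))
    done
    inverted=$(( ~include ))
    hi=$(( (inverted >> 32) & 0xFFFFFFFF ))
    lo=$(( inverted & 0xFFFFFFFF ))
    printf '0x%08X%08X\n' "$hi" "$lo"
}

inv_data_src_filter_for 0 3   # prints 0xFFFFFFFFFFFFFFF6

# Possible use (not run here):
#   perf record -e arm_spe/inv_data_src_filter="$(inv_data_src_filter_for 0 3)"/ -- ./mybench
```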
| |
| SEE ALSO |
| -------- |
| |
| linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1], |
| linkperf:perf-inject[1] |