Your program is slow. But where? Which function? Which line?

perf answers these questions. It’s the standard profiler on Linux, built into the kernel, and more powerful than most developers realize.

This guide covers practical perf usage: from basic CPU profiling to hardware counter analysis and flame graph generation.

What is perf?

perf (also called perf_events or PCL - Performance Counters for Linux) is a profiling tool that uses CPU hardware counters and kernel tracepoints to measure program performance.

It can:

  • Count CPU cycles, cache misses, branch mispredictions
  • Sample which functions are using CPU time
  • Trace system calls, context switches, page faults
  • Generate flame graphs for visualization

perf workflow

Unlike tools that instrument your code (adding overhead), perf uses hardware sampling with minimal impact on performance.

Installation

# Ubuntu/Debian
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Fedora/RHEL
sudo dnf install perf

# Arch
sudo pacman -S perf

# Verify installation
perf --version

For full functionality, you may need to adjust permissions:

# Allow non-root users to use perf (temporary; lowers kernel security,
# so prefer this on development machines only)
sudo sysctl -w kernel.perf_event_paranoid=-1

# Or make it permanent in /etc/sysctl.conf (applied at boot, or via sysctl -p)
echo 'kernel.perf_event_paranoid=-1' | sudo tee -a /etc/sysctl.conf

The Basics: perf stat

perf stat runs a command and shows CPU statistics when it completes.

$ perf stat ./my_program

 Performance counter stats for './my_program':

          1,542.31 msec task-clock                #    0.998 CPUs utilized
                42      context-switches          #   27.232 /sec
                 3      cpu-migrations            #    1.946 /sec
            12,847      page-faults               #    8.330 K/sec
     5,124,891,234      cycles                    #    3.324 GHz
     8,847,291,847      instructions              #    1.73  insn per cycle
     1,284,729,184      branches                  #  833.004 M/sec
        12,847,291      branch-misses             #    1.00% of all branches

       1.545892341 seconds time elapsed
       1.532847000 seconds user
       0.012000000 seconds sys

Key metrics:

  • Instructions per cycle (IPC): Higher is better. On modern x86, >2 is good; <1 usually means the CPU is stalling on memory or branches
  • Branch misses: >5% suggests unpredictable control flow
  • Context switches: High numbers may indicate I/O blocking

Specific Events

# Cache analysis
perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./program

# Branch prediction
perf stat -e branches,branch-misses ./program

# Memory
perf stat -e dTLB-loads,dTLB-load-misses,page-faults ./program

# Everything useful
perf stat -e cycles,instructions,cache-references,cache-misses,branches,branch-misses ./program

Example: Detecting Cache Problems

$ perf stat -e cache-references,cache-misses ./random_access

     892,847,291      cache-references
     284,729,184      cache-misses              #   31.89% of all cache refs

$ perf stat -e cache-references,cache-misses ./sequential_access

     892,847,291      cache-references
       8,472,918      cache-misses              #    0.95% of all cache refs

A 31% cache miss rate vs 0.95% - that’s your bottleneck.

CPU Profiling: perf record + perf report

perf record samples your program’s execution. perf report shows where time was spent.

# Record with call graphs (stack traces)
perf record -g ./my_program

# View results
perf report

The interactive perf report UI:

Samples: 42K of event 'cycles', Event count (approx.): 28472918472
Overhead  Command      Shared Object        Symbol
  24.32%  my_program   my_program           [.] process_data
  18.47%  my_program   my_program           [.] hash_lookup
  12.83%  my_program   libc.so.6            [.] malloc
   8.29%  my_program   my_program           [.] parse_input
   7.14%  my_program   libc.so.6            [.] memcpy

Press Enter on a function to see its call graph (who calls it, what it calls).

Recording Options

# Sample at 99 Hz (default is 4000 Hz, 99 avoids lockstep with timers)
perf record -F 99 -g ./program

# Record system-wide for 10 seconds
perf record -F 99 -a -g -- sleep 10

# Record specific PID
perf record -F 99 -g -p 1234

# With DWARF unwinding (better stack traces for optimized code)
perf record --call-graph dwarf ./program

# Specific CPU cores
perf record -C 0,1 ./program

Understanding the Output

-   24.32%    24.32%  my_program  my_program      [.] process_data
   - 24.32% process_data
      - 18.47% called_from_main
         + 12.83% main
      + 5.85% called_from_thread

  • Children (first column): time in this function plus everything it calls
  • Self (second column): time in the function itself, excluding callees
  • -: Expandable call tree
  • +: Collapsed call tree

Useful perf report Options

# Sort by self time (not children)
perf report --no-children

# Text (non-interactive) output, good for scripts
perf report --stdio

# Sort by symbol
perf report -s symbol

# Show specific columns
perf report --fields=overhead,symbol

Real-Time Profiling: perf top

Like top, but for functions:

# System-wide
sudo perf top

# Specific process
perf top -p $(pgrep my_program)

# With call graphs
perf top -g

Output:

Samples: 82K of event 'cycles', 4000 Hz, Event count: 41847291847
Overhead  Shared Object        Symbol
  12.34%  [kernel]             [k] _raw_spin_lock
   8.47%  my_program           [.] hot_function
   6.29%  libc.so.6            [.] __memmove_avx
   4.18%  [kernel]             [k] copy_user_generic

Flame Graphs

Flame graphs are the best way to visualize profiling data. They show:

  • X-axis: Proportion of CPU time (wider = more samples; left-to-right order is alphabetical, not chronological)
  • Y-axis: Call stack depth (bottom = entry point)

Here’s what a CPU flame graph looks like in practice; this one profiles a bash workload:

[Figure: CPU flame graph generated from a perf profile of bash]

Each box is a function. The wider it is, the more CPU samples it appeared in. You read from bottom (entry point) up (leaf functions). Towers of boxes show deep call chains; plateaus show where time is actually spent.

Generating Flame Graphs

# 1. Record with stack traces
perf record -F 99 -g ./my_program

# 2. Convert to text
perf script > out.perf

# 3. Generate flame graph (need FlameGraph tools)
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flamegraph.svg

# Open in browser
firefox flamegraph.svg

Or use the flame graph support built into recent perf versions (roughly 5.8 and newer):

perf record -F 99 -g ./my_program
perf script report flamegraph

Hardware Events Deep Dive

perf can measure thousands of hardware events. List them:

perf list

# Hardware events
  cpu-cycles OR cycles
  instructions
  cache-references
  cache-misses
  branch-instructions OR branches
  branch-misses
  bus-cycles
  stalled-cycles-frontend
  stalled-cycles-backend

# Hardware cache events
  L1-dcache-loads
  L1-dcache-load-misses
  L1-icache-load-misses
  LLC-loads
  LLC-load-misses
  dTLB-loads
  dTLB-load-misses

Cache Analysis

# L1 data cache
perf stat -e L1-dcache-loads,L1-dcache-load-misses ./program

# Last-level cache (L3)
perf stat -e LLC-loads,LLC-load-misses ./program

# TLB misses (memory-mapped workloads)
perf stat -e dTLB-loads,dTLB-load-misses ./program

Sample on Cache Misses

Instead of sampling CPU cycles, sample when cache misses happen:

# Record where cache misses occur
perf record -e cache-misses -g ./program
perf report

This shows which functions cause the most cache misses.

Branch Prediction

perf stat -e branches,branch-misses ./program

# Sample on branch misses
perf record -e branch-misses -g ./program

Tracing

perf can also trace kernel and user events.

System Calls

# Trace all syscalls
perf trace ./program

# Specific syscalls
perf trace -e read,write ./program

Context Switches

# Count context switches
perf stat -e context-switches ./program

# Trace each context switch
perf record -e sched:sched_switch -a -- sleep 5

Page Faults

# Count page faults
perf stat -e page-faults,minor-faults,major-faults ./program

# Trace where page faults happen
perf record -e page-faults -g ./program

Practical Examples

Example 1: Finding a Hot Function

$ perf record -g ./slow_program
$ perf report --no-children

Overhead  Symbol
  45.23%  slow_function
  12.84%  helper_function
   8.47%  malloc

slow_function uses 45% of CPU time. Look there first.

Example 2: Cache Miss Analysis

Compile with debug info:

g++ -O2 -g program.cpp -o program

Profile cache misses:

$ perf stat -e cache-references,cache-misses ./program

    847,291,847      cache-references
    284,729,184      cache-misses    # 33.61%

Find where they happen:

$ perf record -e cache-misses -g ./program
$ perf report

Overhead  Symbol
  62.34%  random_access_loop    # This is the problem
  18.47%  hash_table_lookup

Annotate to see which lines:

$ perf annotate random_access_loop

for (int i = 0; i < N; i++) {
 62.34 │      sum += arr[indices[i]];  // <-- Cache misses here
}

Example 3: Comparing Two Implementations

# Version A
$ perf stat -e cycles,instructions,cache-misses ./version_a
     5,847,291,847      cycles
     8,472,918,472      instructions    # 1.45 IPC
       284,729,184      cache-misses

# Version B
$ perf stat -e cycles,instructions,cache-misses ./version_b
     2,847,291,847      cycles
    12,472,918,472      instructions    # 4.38 IPC
         8,472,918      cache-misses

Version B: 2x fewer cycles, 3x more IPC, 33x fewer cache misses. Clear winner.

Example 4: Finding Memory Allocation Overhead

$ perf record -g ./program
$ perf report

Overhead  Symbol
  28.47%  malloc
  12.34%  free
   8.47%  operator new

40%+ in allocation? Consider object pooling or arena allocation.

One-Liners Reference

# CPU Statistics
perf stat command                    # Basic CPU stats
perf stat -d command                 # Detailed stats
perf stat -e EVENT1,EVENT2 command   # Specific events

# Profiling
perf record command                  # Sample CPU usage
perf record -g command               # With call graphs
perf record -F 99 -g command         # At 99 Hz
perf record --call-graph dwarf cmd   # DWARF unwinding

# Analysis
perf report                          # Interactive report
perf report --stdio                  # Text report
perf annotate function               # Source annotation

# Real-time
perf top                             # Live CPU profile
perf top -p PID                      # Specific process

# Tracing
perf trace command                   # Trace syscalls
perf trace -e syscall command        # Specific syscall

# System-wide
perf record -a -g -- sleep 10        # Record all CPUs for 10s
perf top -a                          # Live all CPUs

Cheatsheet

┌─────────────────────────────┬──────────────────────────────────────────┐
│ Task                        │ Command                                  │
├─────────────────────────────┼──────────────────────────────────────────┤
│ Quick CPU stats             │ perf stat ./program                      │
│ Where is CPU time spent?    │ perf record -g ./program; perf report    │
│ Cache miss analysis         │ perf stat -e cache-misses ./program      │
│ Where are cache misses?     │ perf record -e cache-misses -g ./program │
│ Branch prediction issues    │ perf stat -e branch-misses ./program     │
│ Generate flame graph        │ perf script | stackcollapse-perf.pl |    │
│                             │   flamegraph.pl > out.svg                │
│ Real-time monitoring        │ perf top -p PID                          │
│ Trace syscalls              │ perf trace ./program                     │
└─────────────────────────────┴──────────────────────────────────────────┘

References