Kernel Space and eBPF: The Observability Revolution
"Want to monitor your production systems? Just add some logging libraries!" — Someone who's never dealt with production overhead at scale
Every log line costs CPU, memory, and network bandwidth. The really interesting stuff is invisible to your application code. What's happening at the kernel level, where network packets flow and system calls execute, stays hidden.
eBPF changes that. It lets you safely run code inside the Linux kernel and observe what's happening at the lowest level with near-zero overhead. To understand why that matters, you need to understand how Linux divides the world between user space and kernel space.
The Two Worlds (And why you can't just touch hardware)
Every computer runs two separate realities simultaneously. Think of it like a building:
User Space vs Kernel Space (interactive diagram): system calls travel between user space applications, which run with restricted permissions, and kernel space, which controls hardware with full system privileges.
User space is where all your programs live. Chrome, Python scripts, Docker containers, everything. These programs can use the CPU for calculations and access their own memory. They cannot touch hardware directly. They cannot see other programs' memory. They cannot send network packets themselves.
Kernel space is the privileged core. Only the kernel can control hardware, manage memory for everyone, handle network packets, and enforce security rules. It's like building management with the master key.
This separation exists because we can't trust programs. If any random app could directly control your network card, a buggy JavaScript library could corrupt your entire system. The kernel is the gatekeeper with exclusive access to hardware.
The CPU enforces this separation using privilege levels called "rings." User space runs in Ring 3 (least privileged). Kernel space runs in Ring 0 (most privileged). When your program makes a system call, the CPU switches from Ring 3 to Ring 0, executes the kernel code, then switches back.
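The boundary crossing itself is just one instruction away from any program. Here's a minimal sketch that invokes write() by its syscall number; the libc wrappers you normally call do the same thing under the hood:

// A minimal sketch: crossing from Ring 3 to Ring 0 explicitly.
// Each syscall() below is one user → kernel → user round trip.
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    // write(1, ...) is itself a syscall; invoking it by number just
    // makes the kernel transition explicit.
    const char msg[] = "hello from user space\n";
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}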
The Reality: What Actually Happens When Chrome Saves a File
Here's what you think happens:
// You write this in your app
const fs = require('fs');

fs.writeFile('data.txt', 'Hello World', (err) => {
  console.log('File saved!');
});
Here's what actually happens:
11:45:23.001234 - Chrome: "I want to save this data"
↓
11:45:23.001237 - Chrome calls: write() system call
↓
11:45:23.001240 - CPU switches from Ring 3 to Ring 0
(User mode → Kernel mode)
↓
11:45:23.001255 - Kernel: "Let me check permissions..."
"Does Chrome own this file? Yes"
"Is the disk full? No"
"OK, I'll allow it"
↓
11:45:23.001289 - Kernel writes to disk driver
↓
11:45:23.002105 - Disk acknowledges write
↓
11:45:23.002110 - Kernel: "Done! Wrote 11 bytes"
↓
11:45:23.002115 - CPU switches back to Ring 3
↓
11:45:23.002120 - Chrome: "Great, file saved!"
That's roughly 886 microseconds end to end, all invisible to your JavaScript code. Each switch between Ring 3 and Ring 0 (the CPU saving and restoring state as it crosses the boundary) costs about 100-500 nanoseconds, and the permission checks add a little more. The disk I/O is the expensive part, at roughly 816 microseconds.
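If you want to feel the cheap part of that cost yourself (the Ring 3 ↔ Ring 0 round trip, with no disk in the way), a rough sketch like this times a trivial syscall in a loop:

// Rough sketch: measure the average cost of a trivial syscall
// (user → kernel → user round trip, no I/O). Numbers vary by CPU and
// mitigations, but expect something in the hundreds of nanoseconds.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long iterations = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);   // invoke by number to force a real syscall
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("avg syscall round trip: %.0f ns\n", ns / iterations);
    return 0;
}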
System Calls: The Only Door Between Worlds (And the kernel is checking IDs)
Your program can't just stroll into kernel space. The only way to talk to the kernel is through system calls (syscalls). These are predefined entry points where your program says "pretty please" and the kernel decides whether to help.
Everything in Linux is a file. Network connections are file descriptors. Your GPU is /dev/nvidia0. Reading from the network uses the same read() syscall as reading from disk. This is elegant and sometimes frustrating.
Common system calls and what they actually do (a short program exercising a few of them follows the table):
// What you write                 // What the kernel does
read(fd, buf, 1024)          →    Checks permissions, fetches from disk
sendto(sock, data, len)      →    Routes through network stack, sends packet
mmap(NULL, size, PROT)       →    Finds free memory, maps to process space
fork()                       →    Duplicates entire process, assigns new PID
socket(AF_INET, SOCK)        →    Creates socket, assigns file descriptor
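Here's the short program promised above. It exercises a few of these entry points directly; every higher-level API (fopen, Python's socket module, your ORM) eventually funnels into calls like these:

// A few of the syscalls above, called directly (error handling trimmed).
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/socket.h>

int main(void) {
    char buf[1024];

    int fd = open("/etc/hostname", O_RDONLY);        // open() syscall
    read(fd, buf, sizeof(buf));                       // read() syscall
    close(fd);

    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  // mmap() syscall
    munmap(mem, 4096);

    int sock = socket(AF_INET, SOCK_DGRAM, 0);        // socket() syscall
    close(sock);
    return 0;
}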
The Hidden Cost of Living in User Space
Here's a scenario that plays out millions of times per second in production:
Your Python app makes a database query:
↓
socket.send() → syscall overhead (~100-500ns)
↓
Context switch to kernel mode
↓
Kernel processes TCP stack
↓
Packet goes to network card
↓
Context switch back to user mode
↓
Wait for response... (millions of nanoseconds)
↓
Packet arrives at network card
↓
Interrupt fires → kernel reads packet
↓
Context switch to kernel mode
↓
TCP processing
↓
Data copied to socket buffer
↓
Context switch back to user mode
↓
socket.recv() returns your data
Every transition between user space and kernel space has a cost. Now imagine you want to monitor every one of these operations. Where do you put your observability code?
The Observability Dilemma (Or: How I learned to stop worrying and crash the kernel)
Before eBPF, you had two bad options for deep system observability:
Option 1: Kernel Module (One bug away from a kernel panic)
Write C code that runs directly in kernel space:
// One mistake here crashes the entire system
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/skbuff.h>

int packet_handler(struct sk_buff *skb) {
    char *data = skb->data;

    // Count packets by source IP
    // Forgot to check if data is valid
    unsigned int src_ip = *(unsigned int *)(data + 12);

    // NULL pointer dereference = kernel panic
    count_packets(src_ip);
    return 0;
}
Problems:
- One bug crashes the entire system
- Must recompile for each kernel version
- Requires deep kernel expertise
- Not something you deploy to production lightly
Contrast this with eBPF: once your eBPF program passes the verifier for a kernel version, you can deploy it with confidence. The verifier mathematically proves it cannot crash your kernel, access invalid memory, or loop infinitely. It's a fundamental safety guarantee that kernel modules simply don't have.
Option 2: Userspace Monitoring (Copy everything, miss half of it)
Monitor from user space with tools like tcpdump:
# Slow and loses data under load
while True:
    packet = pcap.capture()   # Copy from kernel to userspace
    analyze_packet(packet)
    update_metrics(packet)
Problems:
- High overhead (every packet copied from kernel to userspace)
- Drops packets under load (buffer overflows are common)
- Can't see kernel internals (syscalls, memory allocation, etc.)
- Limited visibility into what's actually happening
The Scenario: When Logging Libraries Attack
Tuesday, 3:47 PM: Your e-commerce site is humming along nicely. 10,000 requests per second. Life is good.
Tuesday, 3:48 PM: Someone deploys a "helpful" logging library that logs every SQL query with full stack traces.
Tuesday, 3:49 PM:
- CPU usage: 45% → 89%
- Memory usage: 12GB → 28GB (logs buffering in memory)
- Network egress: 100Mbps → 800Mbps (shipping logs to your logging service)
- Response times: 50ms → 450ms
Tuesday, 3:51 PM: Your site is down. The logging library that was supposed to help you debug issues just became the issue.
Traditional observability forces you to choose between visibility and performance. You can have one, but not both.
eBPF: JavaScript for Your Kernel (But actually safe)
eBPF (Extended Berkeley Packet Filter) is like having a JavaScript runtime inside your Linux kernel. But instead of eval()-ing random code and hoping for the best, there's a verifier that mathematically proves your code is safe before it runs.
Note on terminology: The technical implementation is often called "bpf" (lowercase), while "eBPF" is used in user-facing contexts and documentation. They refer to the same modern technology. The original "BPF" (Berkeley Packet Filter) from the 1990s was limited to packet filtering—"eBPF" is the extended version that can do much more.
Here's the breakthrough:
Traditional Observability:            eBPF:
═══════════════════════════           ═══════════════════════════
Every observation =                   Observe at the source
- Copy data to userspace              - Run code IN the kernel
- Process in your app                 - Zero copies needed
- Ship to monitoring system           - Filter/aggregate in place
                                      - Send only what matters

Cost per packet: 5-10µs               Cost per packet: 5-50ns
Can handle: 100K packets/sec          Can handle: 20M packets/sec
eBPF is 100-1000x faster than traditional userspace monitoring.
How eBPF Actually Works
Think of eBPF like a carefully controlled door into the kernel. The verifier stands at that door and uses static analysis to prove your code is safe before it's loaded: loops must have bounded iterations, every pointer must be validated before use, the stack must stay under 512 bytes, only approved helper functions can be called, and the program must provably terminate in finite time. One bug in kernel space can crash the entire system, so if the verifier can't prove your code is safe, it refuses to load it.

The workflow looks like this:
- You write a small program (usually in C) that says "when X happens, do Y"
- Compile to eBPF bytecode using Clang/LLVM (like Java bytecode, but for the kernel)
- The Verifier examines every instruction:
- No infinite loops? ✓
- No invalid memory access? ✓
- No dangerous operations? ✓
- Terminates in finite time? ✓
- JIT (Just-In-Time) compile verified bytecode to native machine code for your CPU architecture (x86, ARM, etc.)
- Load into kernel where it runs at true native speed
- Done
The verifier is strict. It rejects your code if there's even a theoretical possibility of problems. This is a feature, not a bug.
Development Frameworks and Tools
Before diving into code, you should know what tools are available for writing eBPF programs:
libbpf — The modern, low-level approach
A C library that provides the standard way to write eBPF programs. Your eBPF code compiles to bytecode ahead of time, then loads at runtime. No runtime compilation needed in production.
- Best for: Production deployments, performance-critical code
- Language: C
- Deployment: Ships pre-compiled bytecode (no compiler needed on target)
- Learning curve: Steep (you're writing kernel code)
BCC (BPF Compiler Collection) — Python framework with runtime compilation
Write eBPF in C, control logic in Python. Compiles your eBPF code at runtime on the target machine.
- Best for: Rapid prototyping, debugging, development
- Language: Python + C
- Deployment: Requires LLVM/Clang on target machine
- Learning curve: Easier to start, but runtime dependencies
bpftrace — High-level tracing language
Like awk for eBPF. One-liners for common tracing tasks.
# Trace all open() syscalls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s opened %s\n", comm, str(args->filename)); }'
- Best for: Quick debugging, one-off analysis
- Language: Custom DSL (awk-like syntax)
- Learning curve: Easiest to start
Aya — Rust library for eBPF
Write eBPF programs in Rust with memory safety guarantees.
- Best for: Rust developers, projects prioritizing safety
- Language: Rust
- Deployment: Pre-compiled bytecode
- Learning curve: Rust knowledge required
The code examples in this article use libbpf-style code, which is the modern production approach. But if you're just getting started, bpftrace or BCC might be easier for learning.
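To make the libbpf workflow concrete, here's a sketch of the userspace loader side: it opens pre-compiled bytecode, lets the kernel's verifier check it at load time, and attaches it to an interface. The object file name, program name, and interface are placeholders (the program name matches the packet counter shown in the next section):

// Userspace loader sketch using libbpf. The eBPF side is compiled
// separately with clang (target bpf); "counter.bpf.o" and "eth0" are
// placeholders for your own build and interface.
#include <stdio.h>
#include <unistd.h>
#include <net/if.h>
#include <bpf/libbpf.h>

int main(void) {
    struct bpf_object *obj = bpf_object__open_file("counter.bpf.o", NULL);
    if (!obj)
        return 1;

    if (bpf_object__load(obj)) {           // the verifier runs here
        fprintf(stderr, "verifier rejected the program\n");
        return 1;
    }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "count_packets");
    int ifindex = if_nametoindex("eth0");

    // Attach the XDP program to the network interface
    struct bpf_link *link = bpf_program__attach_xdp(prog, ifindex);
    if (!link)
        return 1;

    printf("attached; press Ctrl-C to exit\n");
    pause();                                // keep the link (and program) alive
    return 0;
}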
Your First eBPF Program
Here's actual eBPF code that counts network packets by source IP. This program attaches at the XDP (eXpress Data Path) level, which runs at the network driver—the earliest possible point to see packets before they enter the network stack.
Don't worry about the "map" data structure you'll see in the code. Think of it as a hash table that lives in kernel memory. We'll explain maps in detail shortly.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC(), map macros, helper declarations

// Define a map: IP address → packet count
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10000);
    __type(key, __u32);    // IP address
    __type(value, __u64);  // Packet count
} packet_counts SEC(".maps");

// This runs for EVERY packet (20M+ per second)
SEC("xdp")
int count_packets(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;   // Packet too small, skip

    // Parse IP header
    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;   // Not enough data, skip

    // Look up counter for this source IP
    __u32 src_ip = ip->saddr;
    __u64 *count = bpf_map_lookup_elem(&packet_counts, &src_ip);
    if (count) {
        // Increment existing counter
        __sync_fetch_and_add(count, 1);
    } else {
        // First packet from this IP
        __u64 init_val = 1;
        bpf_map_update_elem(&packet_counts, &src_ip, &init_val, BPF_ANY);
    }

    return XDP_PASS;   // Let packet continue normally
}
This code runs inside the kernel at the earliest point in the network stack, processing millions of packets per second with minimal overhead.
Compare this to userspace monitoring:
# Userspace packet counter - for comparison
import pcap

packet_counts = {}

while True:
    # This is slow:
    # 1. Kernel captures packet
    # 2. Kernel copies packet to userspace buffer
    # 3. Your program wakes up (context switch)
    # 4. You process one packet
    # 5. Repeat...
    packet = pcap.next()
    src_ip = extract_src_ip(packet)
    packet_counts[src_ip] = packet_counts.get(src_ip, 0) + 1

# At 1M packets/sec, this will:
# - Max out CPU
# - Drop most packets
The eBPF version runs at the source, processing packets before they've entered the Linux network stack. No copies. No context switches.
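The other half of that example is reading the counters back out from userspace. Here's a sketch using libbpf's map API, assuming the loader pinned the map to /sys/fs/bpf/packet_counts:

// Userspace sketch: walk the packet_counts hash map and print per-IP totals.
// Assumes the map was pinned to /sys/fs/bpf/packet_counts by the loader.
#include <stdio.h>
#include <arpa/inet.h>
#include <linux/types.h>
#include <bpf/bpf.h>

int main(void) {
    int map_fd = bpf_obj_get("/sys/fs/bpf/packet_counts");
    if (map_fd < 0)
        return 1;

    __u32 *prev = NULL, key, next_key;
    __u64 count;

    // Iterate every key currently in the kernel-side hash map
    while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
        if (bpf_map_lookup_elem(map_fd, &next_key, &count) == 0) {
            struct in_addr ip = { .s_addr = next_key };
            printf("%-15s %llu packets\n", inet_ntoa(ip),
                   (unsigned long long)count);
        }
        key = next_key;
        prev = &key;
    }
    return 0;
}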
eBPF Components: The Building Blocks (Maps, hooks, and ring buffers)
eBPF isn't just about running code in the kernel. It's a complete system with several key components.
1. eBPF Maps (Shared memory that doesn't explode)
Maps are data structures that live in kernel memory and can be accessed by both eBPF programs and userspace programs. Your eBPF code in the kernel writes to them, your application in userspace reads from them, and both can use the same map simultaneously: it's shared memory between the two worlds.
Think of maps as a database table in kernel space:
┌──────────────────────────────────┐
│ Map: "packet_counts" │
│ ┌──────────┬─────────┐ │
│ │ Key │ Value │ │
│ ├──────────┼─────────┤ │
│ │ 10.0.0.1 │ 5,234 │ │
│ │ 10.0.0.2 │ 2,891 │ │
│ │ 10.0.0.3 │ 7,663 │ │
│ └──────────┴─────────┘ │
└──────────────────────────────────┘
↑ ↑
eBPF writes Userspace reads
Common map types and when to use them (declaration sketches follow this list):
Hash Map: Key-value pairs (like a Python dict)
- Use for: IP → packet count, process ID → metrics
- Lookup: O(1)
Array: Fixed-size, indexed by integers
- Use for: Per-CPU stats, fixed configuration
- Lookup: O(1)
Ring Buffer: Stream of events (new in Linux 5.8+)
- Use for: Event logging, packet captures
- Lock-free, handles millions of events per second
Per-CPU Array: One array per CPU core
- Use for: High-frequency counters without locks
- Avoids cache line bouncing between CPUs
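Here's roughly how each of those map types is declared in a libbpf-style program; the names and sizes are illustrative, not taken from any particular project:

// Example declarations for the map types above (libbpf BTF-style).
// Names and sizes are illustrative.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);          // IP → packet count
    __uint(max_entries, 10000);
    __type(key, __u32);
    __type(value, __u64);
} ip_counts SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);         // Fixed-size config/stats
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u64);
} config SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);       // Event stream to userspace
    __uint(max_entries, 256 * 1024);          // Size in bytes
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);  // One slot per CPU, no locking
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} per_cpu_counter SEC(".maps");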
2. Ring Buffers (Fast event streaming)
Ring buffers are circular queues for streaming events from kernel to userspace:
// In your eBPF program (kernel space)
struct tcp_event {
    __u32 pid;
    __u64 timestamp;
    __u64 bytes_sent;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);   // Buffer size in bytes
} tcp_events SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int trace_tcp_send(struct pt_regs *ctx) {
    struct tcp_event event = {
        .pid = bpf_get_current_pid_tgid() >> 32,
        .timestamp = bpf_ktime_get_ns(),
        .bytes_sent = PT_REGS_PARM3(ctx)   // tcp_sendmsg()'s size argument
    };

    // Write to ring buffer (non-blocking, super fast)
    bpf_ringbuf_output(&tcp_events, &event, sizeof(event), 0);
    return 0;
}

// In your Rust program (userspace), using libbpf-rs.
// `tcp_events_map` is the Map handle for "tcp_events" from the loaded object,
// and TcpEvent is a #[repr(C)] struct matching the eBPF side.
use std::time::Duration;
use libbpf_rs::RingBufferBuilder;

let mut builder = RingBufferBuilder::new();
builder.add(&tcp_events_map, |data: &[u8]| {
    let event: TcpEvent =
        unsafe { std::ptr::read_unaligned(data.as_ptr() as *const TcpEvent) };
    println!("Process {} sent {} bytes at {}",
             event.pid, event.bytes_sent, event.timestamp);
    0
})?;
let ring_buf = builder.build()?;

// Poll for events (each call blocks for up to 100ms)
loop {
    ring_buf.poll(Duration::from_millis(100))?;
}
The ring buffer is lock-free and can handle millions of events per second. When it fills up because the consumer can't keep up, new events are dropped rather than blocking the kernel, so it never stalls your eBPF program.
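One refinement worth knowing about: besides bpf_ringbuf_output(), which copies the event into the buffer, there's a reserve/submit API that writes the event directly into ring buffer memory and skips even that copy. A sketch of the same kprobe using it:

// Sketch of the reserve/submit variant: build the event in place,
// no intermediate copy from the eBPF stack.
SEC("kprobe/tcp_sendmsg")
int trace_tcp_send_reserve(struct pt_regs *ctx) {
    struct tcp_event *event;

    // Reserve space directly in the ring buffer (fails if it's full)
    event = bpf_ringbuf_reserve(&tcp_events, sizeof(*event), 0);
    if (!event)
        return 0;   // Buffer full: drop this event

    event->pid = bpf_get_current_pid_tgid() >> 32;
    event->timestamp = bpf_ktime_get_ns();
    event->bytes_sent = PT_REGS_PARM3(ctx);

    bpf_ringbuf_submit(event, 0);   // Make it visible to the consumer
    return 0;
}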
3. Helper Functions (Your eBPF program's API)
eBPF programs can't just call any kernel function (that would be unsafe). Instead, they use ~200 pre-approved "helper functions":
// Some useful helpers:
bpf_ktime_get_ns()           // Get current timestamp
bpf_get_current_pid_tgid()   // Get process/thread ID
bpf_probe_read()             // Safely read kernel memory
bpf_map_lookup_elem()        // Look up value in map
bpf_map_update_elem()        // Update value in map
bpf_trace_printk()           // Debug print (slow, don't use in production)
bpf_get_current_comm()       // Get process name
bpf_skb_load_bytes()         // Read packet data
These helpers are verified safe. They won't crash your kernel. If you try to call anything else, the verifier rejects your program.
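Here's a small sketch (an assumed kprobe on vfs_read, not tied to anything above) that leans on several of these helpers at once:

// A small sketch combining several helpers: record which process
// called vfs_read, and when.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/vfs_read")
int trace_read(void *ctx) {
    char comm[16];
    __u32 pid = bpf_get_current_pid_tgid() >> 32;   // helper: process ID
    __u64 ts  = bpf_ktime_get_ns();                  // helper: timestamp

    bpf_get_current_comm(&comm, sizeof(comm));       // helper: process name

    // helper: debug print (development only; shows up in trace_pipe)
    bpf_printk("pid=%d comm=%s ts=%llu", pid, comm, ts);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";   // required for the printk helper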
The Network Stack Journey (Where XDP, TC, and friends live)
To understand where eBPF programs can attach, you need to understand the Linux network stack. Packets travel through multiple layers, and the general rule is that the earlier a hook runs, the less work the kernel has already done: XDP attaches at the network driver, before any kernel processing, which is why it's roughly 2-10x faster than TC or iptables for jobs like dropping unwanted packets.
Incoming Packet Journey:
════════════════════════
1. Network Card
"Physical packet arrives"
↓
2. Network Driver
"Convert to memory"
↓
3. ★ XDP HOOK ★ (eXpress Data Path)
EARLIEST POINT - raw packet bytes
• Can DROP (DDoS protection)
• Can PASS (let continue)
• Can REDIRECT (send elsewhere)
• Can TX (bounce back)
Performance: 20M+ packets/second
↓
4. ★ TC INGRESS ★ (Traffic Control)
After XDP, before routing
• Can modify packets
• Can enforce policies
• Has more context than XDP
Performance: 10M packets/second
↓
5. Netfilter/iptables
"Traditional firewall rules"
↓
6. Routing Decision
"Which interface/socket?"
↓
7. ★ KPROBE: tcp_v4_rcv() ★
Can hook any kernel function
↓
8. TCP/UDP Processing
↓
9. Socket Buffer
↓
10. Application receives data
Hook Types and When to Use Them
XDP (eXpress Data Path)
Runs at the network driver level—the earliest possible point to see packets.
- When to use: Need maximum performance with minimal context
- Use cases: DDoS protection, load balancing, basic packet filtering
- Example: Drop all packets from banned IPs before they consume any CPU
- Performance: 20M+ packets/second
TC (Traffic Control)
Runs at the qdisc layer with more packet context than XDP.
- When to use: Need to modify packets or enforce bandwidth limits
- Use cases: Network shaping, container networking, advanced routing
- Example: Limit egress bandwidth per pod in Kubernetes
- Performance: 10M packets/second
kprobe
Attaches to any kernel function entry point.
- When to use: Need to trace specific kernel functions
- Use cases: Performance debugging, security monitoring
- Example: Track which processes make the most syscalls
- Warning: Function names can change between kernel versions
Tracepoint
Attaches to stable kernel trace points that don't change between versions.
- When to use: Production monitoring that needs to be stable
- Use cases: Syscall auditing, scheduler analysis
- Example: Monitor file access patterns across all processes
- Advantage: Stable API across kernel versions
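For example, here's a minimal sketch of a tracepoint program that counts execve() calls per process; the map and program names are illustrative:

// Minimal sketch: count execve() calls system-wide via a stable tracepoint.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);     // PID
    __type(value, __u64);   // exec count
} exec_counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execs(void *ctx) {
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 one = 1;

    __u64 *count = bpf_map_lookup_elem(&exec_counts, &pid);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&exec_counts, &pid, &one, BPF_ANY);
    return 0;
}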
uprobe
Attaches to userspace function entry points in libraries or applications.
- When to use: Need to monitor libraries or application code
- Use cases: GPU monitoring, SSL/TLS inspection, malloc tracking
- Example: Track CUDA memory allocations in ML workloads
- Note: Higher overhead than kernel hooks
Performance Comparison: Real Numbers
Here's the overhead of different approaches for monitoring network traffic, using the numbers quoted earlier:

Userspace capture (pcap/tcpdump):   ~5-10µs per packet, ~100K packets/sec
eBPF at TC:                          ~10M packets/sec
eBPF at XDP:                         ~5-50ns per packet, 20M+ packets/sec
These numbers come from production systems. Companies like Facebook, Netflix, and Cloudflare use eBPF for this reason. When you handle billions of packets per day, a 100-1000x performance difference matters.
Kubernetes + eBPF (Observability without the overhead)
Every Pod you deploy in Kubernetes (a Pod is Kubernetes' unit of deployment—basically a group of one or more containers) is just a Linux process with fancy isolation. That's it.
What's Really Happening When You Deploy a Pod
# What you write:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
What actually happens on the Linux node (a short C sketch of the same isolation primitives follows the list):
1. kubelet receives Pod spec
↓
2. Create Linux namespaces:
• Network namespace (isolated network stack)
• PID namespace (isolated process tree)
• Mount namespace (isolated filesystem view)
• User namespace (isolated users/groups)
↓
3. Pull nginx image (just a tarball of files)
↓
4. Extract to filesystem
↓
5. Start nginx process with:
• Restricted CPU (cgroups)
• Restricted memory (cgroups)
• Isolated network (can't see host)
• Isolated filesystem (can't see host files)
↓
6. THIS IS YOUR "CONTAINER"
(It's just a carefully isolated Linux process)
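You can reproduce the heart of step 2 with a couple of syscalls. This sketch is a toy version of what a container runtime does, minus cgroups, networking setup, and the image filesystem:

// Rough sketch of the isolation primitives a container runtime uses.
// Run as root. Real runtimes also set up cgroups, mounts, and networking.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    // New PID, network, mount, and UTS namespaces for this process tree
    if (unshare(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }

    pid_t child = fork();   // First child becomes PID 1 in the new PID namespace
    if (child == 0) {
        printf("inside: my pid is %d\n", getpid());   // prints 1
        execlp("sh", "sh", (char *)NULL);             // your "container" process
        return 1;
    }
    waitpid(child, NULL, 0);
    return 0;
}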
Because containers are just processes, eBPF can observe everything they do at the kernel level:
Pod makes HTTP request:
↓
1. Python calls requests.get()
↓
2. Syscall: socket(), connect(), send()
↓
3. ★ eBPF tracepoint triggers! ★
Records: PID, syscall type, bytes
↓
4. Kernel processes TCP
↓
5. ★ eBPF XDP hook sees packet! ★
Records: Source IP, dest IP, packet size
↓
6. Packet goes to network
↓
7. Response comes back
↓
8. ★ eBPF XDP sees response! ★
↓
9. ★ eBPF kprobe on tcp_cleanup_rbuf() ★
Records: Bytes received
↓
10. Application gets response
eBPF collected: Complete network profile
- Syscalls made
- Bytes sent/received
- Latency
- Source/dest IPs
Traditional monitoring collected: Nothing
(Unless you added logging, which adds 10-50ms overhead per request)
Real-World Use Cases
Network monitoring per pod:
SEC("xdp")
int monitor_pod_network(struct xdp_md *ctx) {
__u32 src_ip = get_src_ip(ctx);
// Map IP to pod name (populated from Kubernetes API)
char *pod_name = ip_to_pod_map[src_ip];
// Track bandwidth per pod
pod_bandwidth[pod_name] += ctx->data_end - ctx->data;
return XDP_PASS;
}
GPU monitoring for ML workloads:
SEC("uprobe/libcudart:cudaMalloc")
int trace_cuda_alloc(struct pt_regs *ctx) {
size_t size = PT_REGS_PARM2(ctx); // Second argument
__u32 pid = bpf_get_current_pid_tgid() >> 32;
// Track GPU memory per process
gpu_allocations[pid] += size;
// Alert if allocation > 1GB
if (size > 1024*1024*1024) {
struct alloc_event evt = {
.pid = pid,
.size = size,
.timestamp = bpf_ktime_get_ns()
};
bpf_ringbuf_output(&alerts, &evt, sizeof(evt), 0);
}
return 0;
}
Security monitoring:
SEC("tracepoint/syscalls/sys_enter_open")
int trace_file_access(struct trace_event_raw_sys_enter* ctx) {
char filename[256];
bpf_probe_read_str(&filename, sizeof(filename),
(void*)ctx->args[0]);
// Alert on suspicious file access
if (strstr(filename, "/etc/shadow") ||
strstr(filename, "/.ssh/id_rsa")) {
struct security_event evt = {
.pid = bpf_get_current_pid_tgid() >> 32,
.timestamp = bpf_ktime_get_ns()
};
bpf_probe_read_str(&evt.filename, sizeof(evt.filename),
filename);
bpf_ringbuf_output(&security_events, &evt, sizeof(evt), 0);
}
return 0;
}
You can monitor network, GPU, filesystem, syscalls—everything happening in your Kubernetes cluster—without modifying your applications or adding instrumentation libraries. The overhead is minimal.
The Gotchas (Because nothing is ever perfect)
eBPF is powerful, but it has limitations:
1. The Verifier is Strict (Very strict)
// This gets rejected:
SEC("xdp")
int my_program(struct xdp_md *ctx) {
    for (int i = 0; i < 1000; i++) {   // ❌ Rejected unless the verifier can prove
        do_something();                //    the loop terminates (no loops at all
    }                                  //    before kernel 5.3)
    return XDP_PASS;
}

// This works:
SEC("xdp")
int my_program(struct xdp_md *ctx) {
#pragma unroll
    for (int i = 0; i < 16; i++) {     // ✓ Unrolled at compile time: no loop
        do_something();                //   left for the verifier to reason about
    }
    return XDP_PASS;
}
2. Stack Size Limit (512 bytes. That's it.)
SEC("xdp")
int my_program(struct xdp_md *ctx) {
char big_buffer[1024]; // ❌ Exceeds 512 byte stack limit
// Verifier says: "No."
return XDP_PASS;
}
// Solution: Use eBPF maps for large data
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1);
__type(key, int);
__type(value, char[1024]);
} temp_storage SEC(".maps");
SEC("xdp")
int my_program(struct xdp_md *ctx) {
int key = 0;
char *buffer = bpf_map_lookup_elem(&temp_storage, &key);
if (!buffer) return XDP_ABORTED;
// Use buffer...
return XDP_PASS;
}
3. Kernel Version Hell (Older kernels = fewer features)
- XDP: Requires Linux 4.8+ (2016)
- Ring buffers: Requires Linux 5.8+ (2020)
- Sleepable eBPF: Requires Linux 5.10+ (2020)
- Many helpers: Kernel 5.x only
CO-RE: The Modern Solution (Compile Once, Run Everywhere)
The traditional problem with kernel development was that code compiled for one kernel version wouldn't work on another because internal kernel structures could have different memory layouts. You'd need to recompile your eBPF program for each target kernel version.
CO-RE (Compile Once, Run Everywhere) solves this:
Traditional approach:              CO-RE approach:
═══════════════════                ═══════════════════
Compile on kernel 5.4  →           Compile once with BTF info
Run on 5.4  ✓                                ↓
Run on 5.10 ✗ (breaks!)            Deploy to ANY kernel version
Recompile for 5.10  →                • Kernel 5.4  ✓
Run on 5.10 ✓                        • Kernel 5.10 ✓
                                     • Kernel 5.15 ✓
                                     • Kernel 6.x  ✓
How it works:
- During compilation, CO-RE includes BTF (BPF Type Format) information describing kernel data structures
- At load time, the eBPF loader reads your target kernel's BTF info
- It automatically adjusts field offsets and sizes to match the running kernel
- Your program adapts to different kernel struct layouts without recompilation
This means:
- Compile once on your development machine
- Deploy to any kernel version (with BTF support, typically 5.2+)
- No need to ship compiler toolchains to production
- Automatic handling of kernel struct layout differences
This is why modern eBPF tools like Cilium and Falco can run across different Linux distributions and kernel versions without modification.
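In code, CO-RE mostly shows up as relocatable field reads. Here's a sketch using libbpf's BPF_CORE_READ macro, which records the field offset as a relocation for the loader to patch instead of hard-coding it; the hook point and fields are just examples:

// CO-RE sketch: read kernel struct fields without hard-coding offsets.
// vmlinux.h is generated from the running kernel's BTF (via bpftool).
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_v4_connect")
int trace_connect(void *ctx) {
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    // BPF_CORE_READ emits a relocation; at load time libbpf patches in
    // the real offsets of real_parent->tgid for whatever kernel this runs on.
    int parent_pid = BPF_CORE_READ(task, real_parent, tgid);

    bpf_printk("connect() by process with parent %d", parent_pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";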
4. Debugging is Hard (Printf debugging from the kernel)
// Use for development only (slow):
bpf_printk("Got packet from %d\n", src_ip);

// Read output:
$ cat /sys/kernel/debug/tracing/trace_pipe
Use bpf_printk for development, ring buffers for production.
The Bottom Line (What you need to know)
What eBPF changes:
Traditional Monitoring:
- Add instrumentation libraries → Overhead: 5-50%
- Ship logs to aggregators → Cost: $$$$
- Miss kernel-level events → Visibility: 20% of reality
- Modify application code → Deployment risk: High
eBPF Monitoring:
- Run code in kernel → Overhead: 0.1-1%
- Filter/aggregate at source → Cost: $
- See everything → Visibility: 100% of reality
- Zero application changes → Deployment risk: None
When to use eBPF:
- ✓ High-performance network monitoring
- ✓ Security monitoring (file access, process execution)
- ✓ Performance profiling without overhead
- ✓ Container/Kubernetes observability
- ✓ GPU/hardware monitoring
When NOT to use eBPF:
- ✗ Simple application-level metrics (just use Prometheus)
- ✗ Business logic monitoring (belongs in your app)
- ✗ One-off debugging (strace is fine)
- ✗ Systems without Linux 4.8+ (upgrade first)
The key insight is simple. You can see everything without paying the usual performance price. That's why every major cloud provider uses eBPF in production.
Tools built on eBPF:
- Cilium: Kubernetes networking and security
- Falco: Runtime security monitoring
- Pixie: Kubernetes observability
- Parca: Continuous profiling
- Katran: Load balancing (Facebook)
- bpftrace: Dynamic tracing
eBPF is still relatively young. The modern version is from 2014. New use cases keep appearing. Your kernel is ready to tell you everything if you know how to ask.
Related: How Production Are You Really?