Kernel Space and eBPF: The Observability Revolution
"Want to monitor your production systems? Just add some logging libraries!" — Someone who's never dealt with production overhead at scale
Every log line costs CPU, memory, and network bandwidth. The really interesting stuff is invisible to your application code. What's happening at the kernel level, where network packets flow and system calls execute, stays hidden.
eBPF changes that. It lets you safely run code inside the Linux kernel and observe what's happening at the lowest level with near-zero overhead. To understand why that matters, you need to understand how Linux divides the world between user space and kernel space.
The Two Worlds (And why you can't just touch hardware)
Every computer runs two separate realities simultaneously. Think of it like a building:
User Space vs Kernel Space (interactive diagram): system calls travel between user space applications, which run with restricted permissions, and kernel space, which controls hardware with full system privileges.
User space is where all your programs live. Chrome, Python scripts, Docker containers, everything. These programs can use the CPU for calculations and access their own memory. They cannot touch hardware directly. They cannot see other programs' memory. They cannot send network packets themselves.
Kernel space is the privileged core. Only the kernel can control hardware, manage memory for everyone, handle network packets, and enforce security rules. It's like building management with the master key.
This separation exists because we can't trust programs. If any random app could directly control your network card, a buggy JavaScript library could corrupt your entire system. The kernel is the gatekeeper with exclusive access to hardware.
The CPU enforces this separation using privilege levels called "rings." User space runs in Ring 3 (least privileged). Kernel space runs in Ring 0 (most privileged). When your program makes a system call, the CPU switches from Ring 3 to Ring 0, executes the kernel code, then switches back.
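The boundary crossing itself is just one instruction away from any program. Here's a minimal sketch that invokes write() by its syscall number; the libc wrappers you normally call do the same thing under the hood:

// A minimal sketch: crossing from Ring 3 to Ring 0 explicitly.
// Each syscall() below is one user → kernel → user round trip.
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    // write(1, ...) is itself a syscall; invoking it by number just
    // makes the kernel transition explicit.
    const char msg[] = "hello from user space\n";
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}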
The Reality: What Actually Happens When Chrome Saves a File
Here's what you think happens:
// You write this in your app
const fs = require('fs');

fs.writeFile('data.txt', 'Hello World', (err) => {
  console.log('File saved!');
});
Here's what actually happens:
11:45:23.001234 - Chrome: "I want to save this data"
↓
11:45:23.001237 - Chrome calls: write() system call
↓
11:45:23.001240 - CPU switches from Ring 3 to Ring 0
(User mode → Kernel mode)
↓
11:45:23.001255 - Kernel: "Let me check permissions..."
"Does Chrome own this file? Yes"
"Is the disk full? No"
"OK, I'll allow it"
↓
11:45:23.001289 - Kernel writes to disk driver
↓
11:45:23.002105 - Disk acknowledges write
↓
11:45:23.002110 - Kernel: "Done! Wrote 11 bytes"
↓
11:45:23.002115 - CPU switches back to Ring 3
↓
11:45:23.002120 - Chrome: "Great, file saved!"
That's roughly 886 microseconds end to end, all invisible to your JavaScript code. Each switch between Ring 3 and Ring 0 (the CPU saving and restoring state as it crosses the boundary) costs about 100-500 nanoseconds, and the permission checks add a little more. The disk I/O is the expensive part, at roughly 816 microseconds.
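If you want to feel the cheap part of that cost yourself (the Ring 3 ↔ Ring 0 round trip, with no disk in the way), a rough sketch like this times a trivial syscall in a loop:

// Rough sketch: measure the average cost of a trivial syscall
// (user → kernel → user round trip, no I/O). Numbers vary by CPU and
// mitigations, but expect something in the hundreds of nanoseconds.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long iterations = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);   // invoke by number to force a real syscall
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("avg syscall round trip: %.0f ns\n", ns / iterations);
    return 0;
}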
System Calls: The Only Door Between Worlds (And the kernel is checking IDs)
Your program can't just stroll into kernel space. The only way to talk to the kernel is through system calls (syscalls). These are predefined entry points where your program says "pretty please" and the kernel decides whether to help.
Everything in Linux is a file. Network connections are file descriptors. Your GPU is /dev/nvidia0. Reading from the network uses the same read() syscall as reading from disk. This is elegant and sometimes frustrating.
Common system calls and what they actually do (a short program exercising a few of them follows the table):
// What you write                 // What the kernel does
read(fd, buf, 1024)          →    Checks permissions, fetches from disk
sendto(sock, data, len)      →    Routes through network stack, sends packet
mmap(NULL, size, PROT)       →    Finds free memory, maps to process space
fork()                       →    Duplicates entire process, assigns new PID
socket(AF_INET, SOCK)        →    Creates socket, assigns file descriptor
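Here's the short program promised above. It exercises a few of these entry points directly; every higher-level API (fopen, Python's socket module, your ORM) eventually funnels into calls like these:

// A few of the syscalls above, called directly (error handling trimmed).
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/socket.h>

int main(void) {
    char buf[1024];

    int fd = open("/etc/hostname", O_RDONLY);        // open() syscall
    read(fd, buf, sizeof(buf));                       // read() syscall
    close(fd);

    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  // mmap() syscall
    munmap(mem, 4096);

    int sock = socket(AF_INET, SOCK_DGRAM, 0);        // socket() syscall
    close(sock);
    return 0;
}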
The Hidden Cost of Living in User Space
Here's a scenario that plays out millions of times per second in production:
Your Python app makes a database query:
↓
socket.send() → syscall overhead (~100-500ns)
↓
Context switch to kernel mode
↓
Kernel processes TCP stack
↓
Packet goes to network card
↓
Context switch back to user mode
↓
Wait for response... (millions of nanoseconds)
↓
Packet arrives at network card
↓
Interrupt fires → kernel reads packet
↓
Context switch to kernel mode
↓
TCP processing
↓
Data copied to socket buffer
↓
Context switch back to user mode
↓
socket.recv() returns your data
Every transition between user space and kernel space has a cost. Now imagine you want to monitor every one of these operations. Where do you put your observability code?
The Observability Dilemma (Or: How I learned to stop worrying and crash the kernel)
Before eBPF, you had two bad options for deep system observability:
Option 1: Kernel Module (One bug away from a kernel panic)
Write C code that runs directly in kernel space:
// One mistake here crashes the entire system
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/skbuff.h>

int packet_handler(struct sk_buff *skb) {
    char *data = skb->data;

    // Count packets by source IP
    // Forgot to check if data is valid
    unsigned int src_ip = *(unsigned int *)(data + 12);

    // NULL pointer dereference = kernel panic
    count_packets(src_ip);
    return 0;
}
Problems:
- One bug crashes the entire system
- Must recompile for each kernel version
- Requires deep kernel expertise
- Not something you deploy to production lightly
Contrast this with eBPF: once your eBPF program passes the verifier for a kernel version, you can deploy it with confidence. The verifier mathematically proves it cannot crash your kernel, access invalid memory, or loop infinitely. It's a fundamental safety guarantee that kernel modules simply don't have.
Option 2: Userspace Monitoring (Copy everything, miss half of it)
Monitor from user space with tools like tcpdump:
# Slow and loses data under load
while True:
    packet = pcap.capture()   # Copy from kernel to userspace
    analyze_packet(packet)
    update_metrics(packet)
Problems:
- High overhead (every packet copied from kernel to userspace)
- Drops packets under load (buffer overflows are common)
- Can't see kernel internals (syscalls, memory allocation, etc.)
- Limited visibility into what's actually happening
The Scenario: When Logging Libraries Attack
Tuesday, 3:47 PM: Your e-commerce site is humming along nicely. 10,000 requests per second. Life is good.
Tuesday, 3:48 PM: Someone deploys a "helpful" logging library that logs every SQL query with full stack traces.
Tuesday, 3:49 PM:
- CPU usage: 45% → 89%
- Memory usage: 12GB → 28GB (logs buffering in memory)
- Network egress: 100Mbps → 800Mbps (shipping logs to your logging service)
- Response times: 50ms → 450ms
Tuesday, 3:51 PM: Your site is down. The logging library that was supposed to help you debug issues just became the issue.
Traditional observability forces you to choose between visibility and performance. You can have one, but not both.
eBPF: JavaScript for Your Kernel (But actually safe)
eBPF (Extended Berkeley Packet Filter) is like having a JavaScript runtime inside your Linux kernel. But instead of eval()-ing random code and hoping for the best, there's a verifier that mathematically proves your code is safe before it runs.
Note on terminology: The technical implementation is often called "bpf" (lowercase), while "eBPF" is used in user-facing contexts and documentation. They refer to the same modern technology. The original "BPF" (Berkeley Packet Filter) from the 1990s was limited to packet filtering—"eBPF" is the extended version that can do much more.
Here's the breakthrough:
Traditional Observability:            eBPF:
═══════════════════════════           ═══════════════════════════
Every observation =                   Observe at the source
- Copy data to userspace              - Run code IN the kernel
- Process in your app                 - Zero copies needed
- Ship to monitoring system           - Filter/aggregate in place
                                      - Send only what matters

Cost per packet: 5-10µs               Cost per packet: 5-50ns
Can handle: 100K packets/sec          Can handle: 20M packets/sec
eBPF is 100-1000x faster than traditional userspace monitoring.
How eBPF Actually Works
Think of eBPF like a carefully controlled door into the kernel. The verifier stands at that door and uses static analysis to prove your code is safe before it's loaded: loops must have bounded iterations, every pointer must be validated before use, the stack must stay under 512 bytes, only approved helper functions can be called, and the program must provably terminate in finite time. One bug in kernel space can crash the entire system, so if the verifier can't prove your code is safe, it refuses to load it.

The workflow looks like this:
- You write a small program (usually in C) that says "when X happens, do Y"
- Compile to eBPF bytecode using Clang/LLVM (like Java bytecode, but for the kernel)
- The Verifier examines every instruction:
- No infinite loops? ✓
- No invalid memory access? ✓
- No dangerous operations? ✓
- Terminates in finite time? ✓
- JIT (Just-In-Time) compile verified bytecode to native machine code for your CPU architecture (x86, ARM, etc.)
- Load into kernel where it runs at true native speed
- Done
The verifier is strict. It rejects your code if there's even a theoretical possibility of problems. This is a feature, not a bug.
Development Frameworks and Tools
Before diving into code, you should know what tools are available for writing eBPF programs:
libbpf — The modern, low-level approach
A C library that provides the standard way to write eBPF programs. Your eBPF code compiles to bytecode ahead of time, then loads at runtime. No runtime compilation needed in production.
- Best for: Production deployments, performance-critical code
- Language: C
- Deployment: Ships pre-compiled bytecode (no compiler needed on target)
- Learning curve: Steep (you're writing kernel code)
BCC (BPF Compiler Collection) — Python framework with runtime compilation
Write eBPF in C, control logic in Python. Compiles your eBPF code at runtime on the target machine.
- Best for: Rapid prototyping, debugging, development
- Language: Python + C
- Deployment: Requires LLVM/Clang on target machine
- Learning curve: Easier to start, but runtime dependencies
bpftrace — High-level tracing language
Like awk for eBPF. One-liners for common tracing tasks.
# Trace all open() syscalls with filename
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s opened %s\n", comm, str(args->filename)); }'
- Best for: Quick debugging, one-off analysis
- Language: Custom DSL (awk-like syntax)
- Learning curve: Easiest to start
Aya — Rust library for eBPF
Write eBPF programs in Rust with memory safety guarantees.
- Best for: Rust developers, projects prioritizing safety
- Language: Rust
- Deployment: Pre-compiled bytecode
- Learning curve: Rust knowledge required
The code examples in this article use libbpf-style code, which is the modern production approach. But if you're just getting started, bpftrace or BCC might be easier for learning.
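To make the libbpf workflow concrete, here's a sketch of the userspace loader side: it opens pre-compiled bytecode, lets the kernel's verifier check it at load time, and attaches it to an interface. The object file name, program name, and interface are placeholders (the program name matches the packet counter shown in the next section):

// Userspace loader sketch using libbpf. The eBPF side is compiled
// separately with clang (target bpf); "counter.bpf.o" and "eth0" are
// placeholders for your own build and interface.
#include <stdio.h>
#include <unistd.h>
#include <net/if.h>
#include <bpf/libbpf.h>

int main(void) {
    struct bpf_object *obj = bpf_object__open_file("counter.bpf.o", NULL);
    if (!obj)
        return 1;

    if (bpf_object__load(obj)) {           // the verifier runs here
        fprintf(stderr, "verifier rejected the program\n");
        return 1;
    }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "count_packets");
    int ifindex = if_nametoindex("eth0");

    // Attach the XDP program to the network interface
    struct bpf_link *link = bpf_program__attach_xdp(prog, ifindex);
    if (!link)
        return 1;

    printf("attached; press Ctrl-C to exit\n");
    pause();                                // keep the link (and program) alive
    return 0;
}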
Your First eBPF Program
Here's actual eBPF code that counts network packets by source IP. This program attaches at the XDP (eXpress Data Path) level, which runs at the network driver—the earliest possible point to see packets before they enter the network stack.
Don't worry about the "map" data structure you'll see in the code. Think of it as a hash table that lives in kernel memory. We'll explain maps in detail shortly.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>   // SEC(), map macros, helper declarations

// Define a map: IP address → packet count
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10000);
    __type(key, __u32);    // IP address
    __type(value, __u64);  // Packet count
} packet_counts SEC(".maps");

// This runs for EVERY packet (20M+ per second)
SEC("xdp")
int count_packets(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Parse Ethernet header
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;   // Packet too small, skip

    // Parse IP header
    struct iphdr *ip = data + sizeof(*eth);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;   // Not enough data, skip

    // Look up counter for this source IP
    __u32 src_ip = ip->saddr;
    __u64 *count = bpf_map_lookup_elem(&packet_counts, &src_ip);
    if (count) {
        // Increment existing counter
        __sync_fetch_and_add(count, 1);
    } else {
        // First packet from this IP
        __u64 init_val = 1;
        bpf_map_update_elem(&packet_counts, &src_ip, &init_val, BPF_ANY);
    }

    return XDP_PASS;   // Let packet continue normally
}
This code runs inside the kernel at the earliest point in the network stack, processing millions of packets per second with minimal overhead.
Compare this to userspace monitoring:
# Userspace packet counter - for comparison
import pcap

packet_counts = {}

while True:
    # This is slow:
    # 1. Kernel captures packet
    # 2. Kernel copies packet to userspace buffer
    # 3. Your program wakes up (context switch)
    # 4. You process one packet
    # 5. Repeat...
    packet = pcap.next()
    src_ip = extract_src_ip(packet)
    packet_counts[src_ip] = packet_counts.get(src_ip, 0) + 1

# At 1M packets/sec, this will:
# - Max out CPU
# - Drop most packets
The eBPF version runs at the source, processing packets before they've entered the Linux network stack. No copies. No context switches.
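The other half of that example is reading the counters back out from userspace. Here's a sketch using libbpf's map API, assuming the loader pinned the map to /sys/fs/bpf/packet_counts:

// Userspace sketch: walk the packet_counts hash map and print per-IP totals.
// Assumes the map was pinned to /sys/fs/bpf/packet_counts by the loader.
#include <stdio.h>
#include <arpa/inet.h>
#include <linux/types.h>
#include <bpf/bpf.h>

int main(void) {
    int map_fd = bpf_obj_get("/sys/fs/bpf/packet_counts");
    if (map_fd < 0)
        return 1;

    __u32 *prev = NULL, key, next_key;
    __u64 count;

    // Iterate every key currently in the kernel-side hash map
    while (bpf_map_get_next_key(map_fd, prev, &next_key) == 0) {
        if (bpf_map_lookup_elem(map_fd, &next_key, &count) == 0) {
            struct in_addr ip = { .s_addr = next_key };
            printf("%-15s %llu packets\n", inet_ntoa(ip),
                   (unsigned long long)count);
        }
        key = next_key;
        prev = &key;
    }
    return 0;
}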
eBPF Components: The Building Blocks (Maps, hooks, and ring buffers)
eBPF isn't just about running code in the kernel. It's a complete system with several key components.
1. eBPF Maps (Shared memory that doesn't explode)
Maps are data structures that live in kernel memory and can be accessed by both eBPF programs and userspace programs. Your eBPF code in the kernel writes to them, your application in userspace reads from them, and both can use the same map simultaneously: it's shared memory between the two worlds.
Think of maps as a database table in kernel space:
┌──────────────────────────────────┐
│ Map: "packet_counts" │
│ ┌──────────┬─────────┐ │
│ │ Key │ Value │ │
│ ├──────────┼─────────┤ │
│ │ 10.0.0.1 │ 5,234 │ │
│ │ 10.0.0.2 │ 2,891 │ │
│ │ 10.0.0.3 │ 7,663 │ │
│ └──────────┴─────────┘ │
└──────────────────────────────────┘
↑ ↑
eBPF writes Userspace reads
Common map types and when to use them (declaration sketches follow this list):
Hash Map: Key-value pairs (like a Python dict)
- Use for: IP → packet count, process ID → metrics
- Lookup: O(1)
Array: Fixed-size, indexed by integers
- Use for: Per-CPU stats, fixed configuration
- Lookup: O(1)
Ring Buffer: Stream of events (new in Linux 5.8+)
- Use for: Event logging, packet captures
- Lock-free, handles millions of events per second
Per-CPU Array: One array per CPU core
- Use for: High-frequency counters without locks
- Avoids cache line bouncing between CPUs
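Here's roughly how each of those map types is declared in a libbpf-style program; the names and sizes are illustrative, not taken from any particular project:

// Example declarations for the map types above (libbpf BTF-style).
// Names and sizes are illustrative.

struct {
    __uint(type, BPF_MAP_TYPE_HASH);          // IP → packet count
    __uint(max_entries, 10000);
    __type(key, __u32);
    __type(value, __u64);
} ip_counts SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);         // Fixed-size config/stats
    __uint(max_entries, 64);
    __type(key, __u32);
    __type(value, __u64);
} config SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);       // Event stream to userspace
    __uint(max_entries, 256 * 1024);          // Size in bytes
} events SEC(".maps");

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);  // One slot per CPU, no locking
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} per_cpu_counter SEC(".maps");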
2. Ring Buffers (Fast event streaming)
Ring buffers are circular queues for streaming events from kernel to userspace:
// In your eBPF program (kernel space)
struct tcp_event {
    __u32 pid;
    __u64 timestamp;
    __u64 bytes_sent;
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);   // Buffer size in bytes
} tcp_events SEC(".maps");

SEC("kprobe/tcp_sendmsg")
int trace_tcp_send(struct pt_regs *ctx) {
    struct tcp_event event = {
        .pid = bpf_get_current_pid_tgid() >> 32,
        .timestamp = bpf_ktime_get_ns(),
        .bytes_sent = PT_REGS_PARM3(ctx)   // tcp_sendmsg()'s size argument
    };

    // Write to ring buffer (non-blocking, super fast)
    bpf_ringbuf_output(&tcp_events, &event, sizeof(event), 0);
    return 0;
}

// In your Rust program (userspace), using libbpf-rs.
// `tcp_events_map` is the Map handle for "tcp_events" from the loaded object,
// and TcpEvent is a #[repr(C)] struct matching the eBPF side.
use std::time::Duration;
use libbpf_rs::RingBufferBuilder;

let mut builder = RingBufferBuilder::new();
builder.add(&tcp_events_map, |data: &[u8]| {
    let event: TcpEvent =
        unsafe { std::ptr::read_unaligned(data.as_ptr() as *const TcpEvent) };
    println!("Process {} sent {} bytes at {}",
             event.pid, event.bytes_sent, event.timestamp);
    0
})?;
let ring_buf = builder.build()?;

// Poll for events (each call blocks for up to 100ms)
loop {
    ring_buf.poll(Duration::from_millis(100))?;
}
The ring buffer is lock-free and can handle millions of events per second. When it fills up because the consumer can't keep up, new events are dropped rather than blocking the kernel, so it never stalls your eBPF program.
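One refinement worth knowing about: besides bpf_ringbuf_output(), which copies the event into the buffer, there's a reserve/submit API that writes the event directly into ring buffer memory and skips even that copy. A sketch of the same kprobe using it:

// Sketch of the reserve/submit variant: build the event in place,
// no intermediate copy from the eBPF stack.
SEC("kprobe/tcp_sendmsg")
int trace_tcp_send_reserve(struct pt_regs *ctx) {
    struct tcp_event *event;

    // Reserve space directly in the ring buffer (fails if it's full)
    event = bpf_ringbuf_reserve(&tcp_events, sizeof(*event), 0);
    if (!event)
        return 0;   // Buffer full: drop this event

    event->pid = bpf_get_current_pid_tgid() >> 32;
    event->timestamp = bpf_ktime_get_ns();
    event->bytes_sent = PT_REGS_PARM3(ctx);

    bpf_ringbuf_submit(event, 0);   // Make it visible to the consumer
    return 0;
}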
3. Helper Functions (Your eBPF program's API)
eBPF programs can't just call any kernel function (that would be unsafe). Instead, they use ~200 pre-approved "helper functions":
// Some useful helpers:
bpf_ktime_get_ns()           // Get current timestamp
bpf_get_current_pid_tgid()   // Get process/thread ID
bpf_probe_read()             // Safely read kernel memory
bpf_map_lookup_elem()        // Look up value in map
bpf_map_update_elem()        // Update value in map
bpf_trace_printk()           // Debug print (slow, don't use in production)
bpf_get_current_comm()       // Get process name
bpf_skb_load_bytes()         // Read packet data
These helpers are verified safe. They won't crash your kernel. If you try to call anything else, the verifier rejects your program.
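Here's a small sketch (an assumed kprobe on vfs_read, not tied to anything above) that leans on several of these helpers at once:

// A small sketch combining several helpers: record which process
// called vfs_read, and when.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/vfs_read")
int trace_read(void *ctx) {
    char comm[16];
    __u32 pid = bpf_get_current_pid_tgid() >> 32;   // helper: process ID
    __u64 ts  = bpf_ktime_get_ns();                  // helper: timestamp

    bpf_get_current_comm(&comm, sizeof(comm));       // helper: process name

    // helper: debug print (development only; shows up in trace_pipe)
    bpf_printk("pid=%d comm=%s ts=%llu", pid, comm, ts);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";   // required for the printk helper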
The Network Stack Journey (Where XDP, TC, and friends live)
To understand where eBPF programs can attach, you need to understand the Linux network stack. Packets travel through multiple layers, and the general rule is that the earlier a hook runs, the less work the kernel has already done: XDP attaches at the network driver, before any kernel processing, which is why it's roughly 2-10x faster than TC or iptables for jobs like dropping unwanted packets.
Incoming Packet Journey:
════════════════════════
1. Network Card
"Physical packet arrives"
↓
2. Network Driver
"Convert to memory"
↓
3. ★ XDP HOOK ★ (eXpress Data Path)
EARLIEST POINT - raw packet bytes
• Can DROP (DDoS protection)
• Can PASS (let continue)
• Can REDIRECT (send elsewhere)
• Can TX (bounce back)
Performance: 20M+ packets/second
↓
4. ★ TC INGRESS ★ (Traffic Control)
After XDP, before routing
• Can modify packets
• Can enforce policies
• Has more context than XDP
Performance: 10M packets/second
↓
5. Netfilter/iptables
"Traditional firewall rules"
↓
6. Routing Decision
"Which interface/socket?"
↓
7. ★ KPROBE: tcp_v4_rcv() ★
Can hook any kernel function
↓
8. TCP/UDP Processing
↓
9. Socket Buffer
↓
10. Application receives data
Hook Types and When to Use Them
XDP (eXpress Data Path)
Runs at the network driver level—the earliest possible point to see packets.
- When to use: Need maximum performance with minimal context
- Use cases: DDoS protection, load balancing, basic packet filtering
- Example: Drop all packets from banned IPs before they consume any CPU
- Performance: 20M+ packets/second
TC (Traffic Control)
Runs at the qdisc layer with more packet context than XDP.
- When to use: Need to modify packets or enforce bandwidth limits
- Use cases: Network shaping, container networking, advanced routing
- Example: Limit egress bandwidth per pod in Kubernetes
- Performance: 10M packets/second
kprobe
Attaches to any kernel function entry point.
- When to use: Need to trace specific kernel functions
- Use cases: Performance debugging, security monitoring
- Example: Track which processes make the most syscalls
- Warning: Function names can change between kernel versions
Tracepoint
Attaches to stable kernel trace points that don't change between versions.
- When to use: Production monitoring that needs to be stable
- Use cases: Syscall auditing, scheduler analysis
- Example: Monitor file access patterns across all processes
- Advantage: Stable API across kernel versions
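For example, here's a minimal sketch of a tracepoint program that counts execve() calls per process; the map and program names are illustrative:

// Minimal sketch: count execve() calls system-wide via a stable tracepoint.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);     // PID
    __type(value, __u64);   // exec count
} exec_counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execs(void *ctx) {
    __u32 pid = bpf_get_current_pid_tgid() >> 32;
    __u64 one = 1;

    __u64 *count = bpf_map_lookup_elem(&exec_counts, &pid);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&exec_counts, &pid, &one, BPF_ANY);
    return 0;
}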
uprobe
Attaches to userspace function entry points in libraries or applications.
- When to use: Need to monitor libraries or application code
- Use cases: GPU monitoring, SSL/TLS inspection, malloc tracking
- Example: Track CUDA memory allocations in ML workloads
- Note: Higher overhead than kernel hooks
Performance Comparison: Real Numbers
Here's the overhead of different approaches for monitoring network traffic, using the numbers quoted earlier:

Userspace capture (pcap/tcpdump):   ~5-10µs per packet, ~100K packets/sec
eBPF at TC:                          ~10M packets/sec
eBPF at XDP:                         ~5-50ns per packet, 20M+ packets/sec
These numbers come from production systems. Companies like Facebook, Netflix, and Cloudflare use eBPF for this reason. When you handle billions of packets per day, a 100-1000x performance difference matters.
Kubernetes + eBPF (Observability without the overhead)
Every Pod you deploy in Kubernetes (a Pod is Kubernetes' unit of deployment—basically a group of one or more containers) is just a Linux process with fancy isolation. That's it.
What's Really Happening When You Deploy a Pod
# What you write:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
What actually happens on the Linux node (a short C sketch of the same isolation primitives follows the list):
1. kubelet receives Pod spec
↓
2. Create Linux namespaces:
• Network namespace (isolated network stack)
• PID namespace (isolated process tree)
• Mount namespace (isolated filesystem view)
• User namespace (isolated users/groups)
↓
3. Pull nginx image (just a tarball of files)
↓
4. Extract to filesystem
↓
5. Start nginx process with:
• Restricted CPU (cgroups)
• Restricted memory (cgroups)
• Isolated network (can't see host)
• Isolated filesystem (can't see host files)
↓
6. THIS IS YOUR "CONTAINER"
(It's just a carefully isolated Linux process)
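You can reproduce the heart of step 2 with a couple of syscalls. This sketch is a toy version of what a container runtime does, minus cgroups, networking setup, and the image filesystem:

// Rough sketch of the isolation primitives a container runtime uses.
// Run as root. Real runtimes also set up cgroups, mounts, and networking.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    // New PID, network, mount, and UTS namespaces for this process tree
    if (unshare(CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS) != 0) {
        perror("unshare");
        return 1;
    }

    pid_t child = fork();   // First child becomes PID 1 in the new PID namespace
    if (child == 0) {
        printf("inside: my pid is %d\n", getpid());   // prints 1
        execlp("sh", "sh", (char *)NULL);             // your "container" process
        return 1;
    }
    waitpid(child, NULL, 0);
    return 0;
}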
Because containers are just processes, eBPF can observe everything they do at the kernel level:
Pod makes HTTP request:
↓
1. Python calls requests.get()
↓
2. Syscall: socket(), connect(), send()
↓
3. ★ eBPF tracepoint triggers! ★
Records: PID, syscall type, bytes
↓
4. Kernel processes TCP
↓
5. ★ eBPF XDP hook sees packet! ★
Records: Source IP, dest IP, packet size
↓
6. Packet goes to network
↓
7. Response comes back
↓
8. ★ eBPF XDP sees response! ★
↓
9. ★ eBPF kprobe on tcp_cleanup_rbuf() ★
Records: Bytes received
↓
10. Application gets response
eBPF collected: Complete network profile
- Syscalls made
- Bytes sent/received
- Latency
- Source/dest IPs
Traditional monitoring collected: Nothing
(Unless you added logging, which adds 10-50ms overhead per request)
Real-World Use Cases
Network monitoring per pod:
SEC("xdp")
int monitor_pod_network(struct xdp_md *ctx) {
__u32 src_ip = get_src_ip(ctx);
// Map IP to pod name (populated from Kubernetes API)
char *pod_name = ip_to_pod_map[src_ip];
// Track bandwidth per pod
pod_bandwidth[pod_name] += ctx->data_end - ctx->data;
return XDP_PASS;
}
GPU monitoring for ML workloads:
SEC("uprobe/libcudart:cudaMalloc")
int trace_cuda_alloc(struct pt_regs *ctx) {
size_t size = PT_REGS_PARM2(ctx); // Second argument
__u32 pid = bpf_get_current_pid_tgid() >> 32;
// Track GPU memory per process
gpu_allocations[pid] += size;
// Alert if allocation > 1GB
if (size > 1024*1024*1024) {
struct alloc_event evt = {
.pid = pid,
.size = size,
.timestamp = bpf_ktime_get_ns()
};
bpf_ringbuf_output(&alerts, &evt, sizeof(evt), 0);
}
return 0;
}
Security monitoring:
SEC("tracepoint/syscalls/sys_enter_open")
int trace_file_access(struct trace_event_raw_sys_enter* ctx) {
char filename[256];
bpf_probe_read_str(&filename, sizeof(filename),
(void*)ctx->args[0]);
// Alert on suspicious file access
if (strstr(filename, "/etc/shadow") ||
strstr(filename, "/.ssh/id_rsa")) {
struct security_event evt = {
.pid = bpf_get_current_pid_tgid() >> 32,
.timestamp = bpf_ktime_get_ns()
};
bpf_probe_read_str(&evt.filename, sizeof(evt.filename),
filename);
bpf_ringbuf_output(&security_events, &evt, sizeof(evt), 0);
}
return 0;
}
You can monitor network, GPU, filesystem, syscalls—everything happening in your Kubernetes cluster—without modifying your applications or adding instrumentation libraries. The overhead is minimal.
The Gotchas (Because nothing is ever perfect)
eBPF is powerful, but it has limitations:
1. The Verifier is Strict (Very strict)
// This gets rejected:
SEC("xdp")
int my_program(struct xdp_md *ctx) {
    for (int i = 0; i < 1000; i++) {   // ❌ Rejected unless the verifier can prove
        do_something();                //    the loop terminates (no loops at all
    }                                  //    before kernel 5.3)
    return XDP_PASS;
}

// This works:
SEC("xdp")
int my_program(struct xdp_md *ctx) {
#pragma unroll
    for (int i = 0; i < 16; i++) {     // ✓ Unrolled at compile time: no loop
        do_something();                //   left for the verifier to reason about
    }
    return XDP_PASS;
}
2. Stack Size Limit (512 bytes. That's it.)
SEC("xdp")
int my_program(struct xdp_md *ctx) {
char big_buffer[1024]; // ❌ Exceeds 512 byte stack limit
// Verifier says: "No."
return XDP_PASS;
}
// Solution: Use eBPF maps for large data
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
__uint(max_entries, 1);
__type(key, int);
__type(value, char[1024]);
} temp_storage SEC(".maps");
SEC("xdp")
int my_program(struct xdp_md *ctx) {
int key = 0;
char *buffer = bpf_map_lookup_elem(&temp_storage, &key);
if (!buffer) return XDP_ABORTED;
// Use buffer...
return XDP_PASS;
}
3. Kernel Version Hell (Older kernels = fewer features)
- XDP: Requires Linux 4.8+ (2016)
- Ring buffers: Requires Linux 5.8+ (2020)
- Sleepable eBPF: Requires Linux 5.10+ (2020)
- Many helpers: Kernel 5.x only
CO-RE: The Modern Solution (Compile Once, Run Everywhere)
The traditional problem with kernel development was that code compiled for one kernel version wouldn't work on another because internal kernel structures could have different memory layouts. You'd need to recompile your eBPF program for each target kernel version.
CO-RE (Compile Once, Run Everywhere) solves this:
Traditional approach:              CO-RE approach:
═══════════════════                ═══════════════════
Compile on kernel 5.4  →           Compile once with BTF info
Run on 5.4  ✓                                ↓
Run on 5.10 ✗ (breaks!)            Deploy to ANY kernel version
Recompile for 5.10  →                • Kernel 5.4  ✓
Run on 5.10 ✓                        • Kernel 5.10 ✓
                                     • Kernel 5.15 ✓
                                     • Kernel 6.x  ✓
How it works:
- During compilation, CO-RE includes BTF (BPF Type Format) information describing kernel data structures
- At load time, the eBPF loader reads your target kernel's BTF info
- It automatically adjusts field offsets and sizes to match the running kernel
- Your program adapts to different kernel struct layouts without recompilation
This means:
- Compile once on your development machine
- Deploy to any kernel version (with BTF support, typically 5.2+)
- No need to ship compiler toolchains to production
- Automatic handling of kernel struct layout differences
This is why modern eBPF tools like Cilium and Falco can run across different Linux distributions and kernel versions without modification.
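In code, CO-RE mostly shows up as relocatable field reads. Here's a sketch using libbpf's BPF_CORE_READ macro, which records the field offset as a relocation for the loader to patch instead of hard-coding it; the hook point and fields are just examples:

// CO-RE sketch: read kernel struct fields without hard-coding offsets.
// vmlinux.h is generated from the running kernel's BTF (via bpftool).
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/tcp_v4_connect")
int trace_connect(void *ctx) {
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();

    // BPF_CORE_READ emits a relocation; at load time libbpf patches in
    // the real offsets of real_parent->tgid for whatever kernel this runs on.
    int parent_pid = BPF_CORE_READ(task, real_parent, tgid);

    bpf_printk("connect() by process with parent %d", parent_pid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";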
4. Debugging is Hard (Printf debugging from the kernel)
// Use for development only (slow):
bpf_printk("Got packet from %d\n", src_ip);

// Read output:
$ cat /sys/kernel/debug/tracing/trace_pipe
Use bpf_printk for development, ring buffers for production.
The Bottom Line (What you need to know)
What eBPF changes:
Traditional Monitoring:
- Add instrumentation libraries → Overhead: 5-50%
- Ship logs to aggregators → Cost: $$$$
- Miss kernel-level events → Visibility: 20% of reality
- Modify application code → Deployment risk: High
eBPF Monitoring:
- Run code in kernel → Overhead: 0.1-1%
- Filter/aggregate at source → Cost: $
- See everything → Visibility: 100% of reality
- Zero application changes → Deployment risk: None
When to use eBPF:
- ✓ High-performance network monitoring
- ✓ Security monitoring (file access, process execution)
- ✓ Performance profiling without overhead
- ✓ Container/Kubernetes observability
- ✓ GPU/hardware monitoring
When NOT to use eBPF:
- ✗ Simple application-level metrics (just use Prometheus)
- ✗ Business logic monitoring (belongs in your app)
- ✗ One-off debugging (strace is fine)
- ✗ Systems without Linux 4.8+ (upgrade first)
The key insight is simple. You can see everything without paying the usual performance price. That's why every major cloud provider uses eBPF in production.
Tools built on eBPF:
- Cilium: Kubernetes networking and security
- Falco: Runtime security monitoring
- Pixie: Kubernetes observability
- Parca: Continuous profiling
- Katran: Load balancing (Facebook)
- bpftrace: Dynamic tracing
eBPF is still relatively young. The modern version is from 2014. New use cases keep appearing. Your kernel is ready to tell you everything if you know how to ask.
Related: How Production Are You Really?