Linux Networking and Container Networking: What I Learned Building an eBPF Observability Tool
I was staring at a dashboard showing every network flow attributed to the same entity: the node itself. Two pods were clearly communicating — I could see the traffic in tcpdump — but my eBPF-based observability agent was labeling every single flow with the node's IP instead of the actual pod names.
The agent's approach was straightforward: attach TC (Traffic Control) classifiers to network interfaces, extract source and destination IPs from packet headers, look up those IPs in a cache built from the Kubernetes API, and map them to pod names. Simple. Elegant. Completely wrong in this case.
Both pods were running with hostNetwork: true. They shared the node's IP address. My entire enrichment model — "look up the source IP, find the pod" — collapsed because multiple pods had the same IP.
To understand why that broke everything, I had to learn how Linux networking actually works. Not the textbook version with OSI layers and protocol headers. The version you need when you're attaching probes to network interfaces and trying to figure out which packets belong to which containers.
This is that knowledge.
Prerequisite: This post assumes you know what eBPF is. If terms like "TC classifier" or "ring buffer" are unfamiliar, start with Kernel Space and eBPF: The Observability Revolution.
The Packet's Journey (What actually happens when data leaves your application)
Before diving into containers, you need to understand how a packet moves through a regular Linux machine. Every networking concept that follows — namespaces, veth pairs, bridges — is a variation on this same fundamental pipeline.
When your application calls send() on a socket, the data doesn't just teleport to the network. It passes through a series of kernel subsystems, each with a specific job:
Application calls send("Hello")
↓
Socket layer: attach transport headers (TCP/UDP)
↓
IP layer: attach IP header, make routing decision
↓
Traffic Control (TC): apply bandwidth limits, run classifiers
↓
Network driver: hand to hardware
↓
NIC: electrical signals on the wire (or virtual wire)
The routing decision is the critical step. The kernel consults its routing table to determine which network interface to send the packet out of, and what the next hop should be. This is where things get interesting on a machine with multiple interfaces.
# A Kubernetes node's routing table
$ ip route show
default via 172.18.0.1 dev eth0 # Default: send out eth0
10.244.0.0/24 dev cni0 proto kernel scope link # Pod subnet: send to bridge
10.244.1.0/24 via 10.244.1.1 dev eth0 # Other node's pods: send out eth0
Reading this table: if the destination IP is in the 10.244.0.0/24 range (local pods), the kernel sends the packet to the cni0 interface — a bridge that connects to pods on this node. If the destination is 10.244.1.0/24 (pods on another node), it goes out eth0 toward the physical network. Everything else hits the default route and goes out eth0 as well.
The routing table is the kernel's decision tree for "where does this packet go next?" Every packet traverses it. When you're building network observability, understanding which interface a packet will traverse tells you where you need to be watching.
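That decision tree is a longest-prefix match: among all routes whose prefix contains the destination, the most specific one wins, and the default route is just a /0 that matches everything. A toy sketch in Rust (the prefixes and interface names mirror the table above; this is an illustration, not the kernel's actual trie-based implementation):

```rust
use std::net::Ipv4Addr;

// A route: destination prefix, prefix length, and the egress interface.
struct Route {
    prefix: Ipv4Addr,
    prefix_len: u32,
    iface: &'static str,
}

// Longest-prefix match: the most specific route containing `dst` wins.
fn lookup(routes: &[Route], dst: Ipv4Addr) -> &'static str {
    routes
        .iter()
        .filter(|r| {
            // Build the netmask; a /0 (default route) matches everything.
            let mask = if r.prefix_len == 0 { 0 } else { u32::MAX << (32 - r.prefix_len) };
            u32::from(dst) & mask == u32::from(r.prefix) & mask
        })
        .max_by_key(|r| r.prefix_len)
        .map(|r| r.iface)
        .unwrap_or("unreachable")
}

fn main() {
    let table = [
        Route { prefix: Ipv4Addr::new(0, 0, 0, 0), prefix_len: 0, iface: "eth0" },     // default
        Route { prefix: Ipv4Addr::new(10, 244, 0, 0), prefix_len: 24, iface: "cni0" }, // local pods
        Route { prefix: Ipv4Addr::new(10, 244, 1, 0), prefix_len: 24, iface: "eth0" }, // remote pods
    ];
    println!("{}", lookup(&table, Ipv4Addr::new(10, 244, 0, 7))); // local pod -> cni0
    println!("{}", lookup(&table, Ipv4Addr::new(8, 8, 8, 8)));    // internet -> eth0
}
```

A destination in 10.244.0.0/24 matches both the default route and the cni0 route, and the /24 wins because it is more specific.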
Network Interfaces (Not just hardware anymore)
A network interface is a named endpoint where the kernel sends and receives packets. On a physical machine, eth0 is your Ethernet port — the actual hardware that converts electrical signals to data. On a cloud VM, eth0 is a virtual NIC provided by the hypervisor, but it behaves identically from the kernel's perspective.
The important thing: Linux doesn't limit you to physical interfaces. It can create purely virtual interfaces that exist only in software. These virtual interfaces are the building blocks of container networking.
Here's what a typical Kubernetes node looks like:
$ ip link show
1: lo: <LOOPBACK,UP> # Loopback - packets to 127.0.0.1 (never leaves the machine)
2: eth0: <BROADCAST,UP> # Physical/virtual NIC (connects to the network)
3: cni0: <BROADCAST,UP> # Bridge (virtual switch connecting pods)
5: veth9a3f@if4: <BROADCAST,UP> # One end of a virtual pipe (connects to a pod)
7: vethb7c2@if6: <BROADCAST,UP> # Another pipe to another pod
9: vethe1d8@if8: <BROADCAST,UP> # Another pipe to yet another pod
Four types of interfaces on one machine:
- lo (loopback): Traffic to 127.0.0.1. Never leaves the machine.
- eth0: The machine's connection to the outside world. All external traffic flows through here.
- cni0: A bridge — a virtual network switch. We'll cover this in detail shortly.
- veth*: Virtual Ethernet pairs — pipes connecting container namespaces to the bridge.
Each interface carries different traffic. This is the first critical insight for observability: if you attach a probe to the wrong interface, you'll miss traffic entirely. Attach to eth0 and you'll see all external traffic but miss pod-to-pod communication on the same node. Attach to cni0 and you'll see pod traffic but miss the host's own network activity.
Network Namespaces (How containers get their own network stack)
A network namespace is a complete, isolated copy of the Linux network stack. That means its own set of interfaces, its own routing table, its own firewall rules, and its own socket table. Processes inside one namespace cannot see or interact with interfaces in another namespace.
This is the core isolation mechanism behind container networking. It's not optional complexity — it's the reason containers can each have their own eth0 with their own IP address without conflicting with each other or the host.
Network Namespace Isolation
Each network namespace gets its own isolated set of interfaces, routing table, and sockets. A new namespace starts nearly empty — it cannot see the host's network resources.
Host Network Namespace (PID 1)
Interfaces
| Name | Type | State | IP Address | MAC |
|---|---|---|---|---|
| lo | loopback | UP | 127.0.0.1/8 | 00:00:00:00:00:00 |
| eth0 | physical | UP | 10.0.0.2/24 | 02:42:0a:00:00:02 |
| cni0 | bridge | UP | 10.244.0.1/24 | 6a:3e:1f:a2:b8:01 |
| veth8a3f1c2 | veth | UP | - | be:4c:7d:e1:22:f3 |
| vethd90e4b7 | veth | UP | - | ae:12:5f:c3:44:a1 |
Routing Table
| Destination | Gateway | Interface |
|---|---|---|
| default | 10.0.0.1 | eth0 |
| 10.0.0.0/24 | - | eth0 |
| 10.244.0.0/24 | - | cni0 |
| 10.244.1.0/24 | 10.0.0.3 | eth0 |
Socket Table
| Proto | Local Address | Remote Address | State |
|---|---|---|---|
| tcp | 0.0.0.0:22 | 0.0.0.0:* | LISTEN |
| tcp | 0.0.0.0:6443 | 0.0.0.0:* | LISTEN |
| tcp | 10.0.0.2:6443 | 10.0.0.5:49312 | ESTABLISHED |
| udp | 0.0.0.0:8472 | 0.0.0.0:* | - |
Container Network Namespace (container: nginx-7d4f8)
Interfaces
| Name | Type | State | IP Address | MAC |
|---|---|---|---|---|
| lo | loopback | DOWN | 127.0.0.1/8 | 00:00:00:00:00:00 |
Routing Table
(empty -- no routes configured)
Socket Table
(no active sockets)
The host machine runs in the root namespace — the default namespace that exists when the system boots. When a container runtime (like containerd or Docker) creates a container, it creates a new network namespace for that container. Inside this new namespace, the network stack starts empty: only a loopback interface exists, and it's down.
# Create a network namespace (this is what container runtimes do)
$ ip netns add my-container
# Look inside — it's a blank slate
$ ip netns exec my-container ip link show
1: lo: <LOOPBACK> mtu 65536 state DOWN
# Only loopback. No eth0. No connectivity. Total isolation.
# The host still has all its interfaces
$ ip link show
1: lo 2: eth0 3: cni0 5: veth9a3f@if4 ...
The ip netns exec command runs a command inside a specific namespace. From the container's perspective, the host's eth0, cni0, and all those veth interfaces don't exist. It's as if the container is on a completely separate machine with no network card installed.
This raises an obvious question: if the namespace is completely isolated, how does the container communicate with anything?
Veth Pairs (Virtual pipes connecting isolated worlds)
A veth pair is two virtual network interfaces connected by an invisible pipe. Whatever goes into one end comes out the other. They always come in pairs — you can't create one without the other.
Think of it as a wormhole: two endpoints, potentially in different namespaces, with a direct connection between them. You put a packet in one end and it instantly appears at the other end.
Virtual Ethernet (veth) Pair
A veth pair acts as a tunnel between two network namespaces. Whatever enters one end exits the other.
Here's how container runtimes use them:
# Create a veth pair — two interfaces, connected
$ ip link add veth-host type veth peer name veth-pod
# Right now, both ends are in the host namespace.
# Move one end into the container's namespace:
$ ip link set veth-pod netns my-container
# Now:
# veth-host lives in the HOST namespace
# veth-pod lives in the CONTAINER namespace
# They're connected. Packets in one end → out the other.
# Give the container end an IP address and bring it up
$ ip netns exec my-container ip addr add 10.244.0.5/24 dev veth-pod
$ ip netns exec my-container ip link set veth-pod up
$ ip link set veth-host up
After this setup, the container has a network interface (veth-pod) with IP 10.244.0.5. Inside the container, this interface looks and behaves exactly like a regular eth0 — the container has no idea it's virtual. The container runtime typically renames the container-side interface to eth0 so applications don't need to know about the plumbing underneath.
The host-side interface (veth-host) appears in the host namespace as one of those vethXXXX entries you saw in the ip link show output earlier. Each running container has one veth pair, so a node running 30 pods has 30 veth interfaces on the host side.
Now the container can send packets — they emerge from the host-side veth. But where do they go from there? That's where bridges come in.
Bridges (cni0 is just a virtual network switch)
A bridge is a virtual Layer 2 network switch. If you've seen a physical network switch — the box with many Ethernet ports that connects devices on a local network — a Linux bridge is exactly that, in software.
The bridge cni0 (or docker0 if you're using Docker's default networking) connects all the host-side veth endpoints together. When pod A sends a packet to pod B on the same node, the packet travels: pod A's veth → bridge → pod B's veth. The bridge learns which MAC address lives on which port (just like a physical switch) and forwards frames accordingly.
Linux Bridge: Virtual Layer 2 Switch
A bridge (cni0) connects pod veth pairs on the same node.
Same-Node: Pod A to Pod B
Traffic stays on the bridge — never touches eth0. The bridge performs MAC-based forwarding between the two veth pairs entirely in kernel space.
Pod A → veth-A → cni0 → veth-B → Pod B
# See what's connected to the bridge
$ bridge link show
5: veth9a3f@cni0: <BROADCAST,UP> master cni0 # Pod A's connection
7: vethb7c2@cni0: <BROADCAST,UP> master cni0 # Pod B's connection
9: vethe1d8@cni0: <BROADCAST,UP> master cni0 # Pod C's connection
# The bridge maintains a forwarding table (MAC address → port)
$ bridge fdb show br cni0
aa:bb:cc:00:00:01 dev veth9a3f master cni0 # Pod A's MAC → port 5
aa:bb:cc:00:00:02 dev vethb7c2 master cni0 # Pod B's MAC → port 7
aa:bb:cc:00:00:03 dev vethe1d8 master cni0 # Pod C's MAC → port 9
The critical insight: same-node pod-to-pod traffic never touches eth0. The packet goes from one veth port on the bridge to another veth port on the bridge. It stays entirely within the bridge, just like traffic between two devices plugged into the same physical switch never hits the router.
Cross-node traffic is different. If pod A sends a packet to a pod on another node, the bridge doesn't have a port for that destination. The packet gets forwarded to eth0 (the bridge's uplink, essentially), which sends it out to the physical network toward the other node.
This distinction matters enormously for observability. If your probes are only on eth0, you're blind to all same-node pod communication. On a busy node running dozens of pods, that can be the majority of the traffic.
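The bridge's learn-and-forward behavior is easy to model: a map from MAC address to port, populated by watching source addresses, consulted for destinations. A toy sketch (not the kernel's implementation; port names are illustrative):

```rust
use std::collections::HashMap;

type Mac = [u8; 6];

// Toy Layer 2 switch: learns which port each source MAC lives on,
// forwards to the learned port, floods when the destination is unknown.
struct Bridge {
    fdb: HashMap<Mac, String>, // forwarding database: MAC -> port name
    ports: Vec<String>,
}

impl Bridge {
    // Returns the ports the frame is sent out of.
    fn handle_frame(&mut self, src: Mac, dst: Mac, in_port: &str) -> Vec<String> {
        // Learn: the source MAC is reachable via the ingress port.
        self.fdb.insert(src, in_port.to_string());
        match self.fdb.get(&dst) {
            Some(port) => vec![port.clone()],
            // Unknown destination: flood to every port except the ingress one.
            None => self.ports.iter().filter(|p| *p != in_port).cloned().collect(),
        }
    }
}

fn main() {
    let mut br = Bridge {
        fdb: HashMap::new(),
        ports: vec!["veth-A".into(), "veth-B".into(), "veth-C".into()],
    };
    let (mac_a, mac_b) = ([0xaa, 0, 0, 0, 0, 1], [0xaa, 0, 0, 0, 0, 2]);
    // First frame A -> B: destination unknown, flooded to B and C.
    println!("{:?}", br.handle_frame(mac_a, mac_b, "veth-A"));
    // Reply B -> A: A's MAC was learned, forwarded only to veth-A.
    println!("{:?}", br.handle_frame(mac_b, mac_a, "veth-B"));
}
```

This is exactly the `bridge fdb show` table above, built incrementally from observed traffic.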
Container Networking End-to-End (What actually happens when kubectl apply runs)
Now you have all the pieces. Let's trace what happens when Kubernetes creates a pod, and then what happens when that pod sends a packet to another pod.
Pod Creation: The Network Plumbing
The Container Networking Interface (CNI) is a specification — not a tool. It defines a contract: "given a container's network namespace, set up its networking." Different CNI plugins (Flannel, Calico, Cilium) implement this contract differently, but they all perform the same fundamental steps.
CNI Sequence: How a Pod Gets Its Network
Step-by-step process from pod creation to a fully networked container.
1. kubelet receives pod spec from API server
↓
2. kubelet tells containerd: "create this container"
↓
3. containerd creates a new network namespace
↓
4. kubelet calls the CNI plugin:
"Here's the namespace, set up networking"
↓
5. CNI plugin creates a veth pair
↓
6. CNI plugin moves one end into the pod's namespace
↓
7. CNI plugin assigns an IP address (e.g., 10.244.0.5)
↓
8. CNI plugin connects the host end to the bridge (cni0)
↓
9. CNI plugin configures routes inside the pod namespace
↓
10. Pod is network-ready
Different CNI plugins differ in how they handle cross-node traffic. Flannel wraps packets in VXLAN tunnels. Calico uses BGP to distribute routes. Cilium uses eBPF to replace kube-proxy and implement routing. But the local plumbing — namespace, veth, bridge — is nearly universal.
The Packet Path: Pod A → Pod B (Same Node)
Pod A (10.244.0.5) makes an HTTP request to Pod B (10.244.0.6). Both are on the same node.
Pod A's application calls connect() + send()
↓
Pod A's network namespace:
Routing table says: "10.244.0.0/24 → dev eth0"
→ Send packet out eth0 (which is the veth-podA end)
↓
Packet crosses the veth pair:
veth-podA → veth-hostA
Packet appears in the host namespace on the bridge port
↓
cni0 bridge:
Looks up destination MAC in forwarding table
Finds it on port veth-hostB
Forwards the frame to that port
↓
Packet crosses the second veth pair:
veth-hostB → veth-podB
Packet appears in Pod B's namespace
↓
Pod B's network namespace:
TCP/IP stack processes the packet
Delivers to the listening socket
↓
Pod B's application receives the HTTP request
The packet never touched eth0. It never left the node. The bridge handled everything.
The Packet Path: Pod A → Pod C (Different Node)
Pod A (10.244.0.5) makes a request to Pod C (10.244.1.8) on Node 2.
Pod A's application calls connect() + send()
↓
Pod A's namespace routing table:
"10.244.1.0/24" is NOT local → default route → eth0
↓
Packet crosses veth to host namespace
↓
Host namespace routing table:
"10.244.1.0/24 via 10.244.1.1 dev eth0"
→ Send out eth0 toward the other node
↓
[Physical/virtual network between nodes]
↓
Node 2's eth0 receives the packet
↓
Node 2's routing table:
"10.244.1.0/24 dev cni0" → forward to bridge
↓
Node 2's cni0 bridge forwards to Pod C's veth
↓
Pod C receives the packet
This time the packet did cross eth0 on both nodes. It's visible to probes on eth0. But on Node 1, the packet's source IP is still 10.244.0.5 (Pod A's real IP), and the destination is 10.244.1.8 (Pod C's real IP). No translation happened. This is a core Kubernetes networking guarantee.
Kubernetes Networking (Every pod gets an IP, and then things get complicated)
Kubernetes enforces three networking rules:
- Every pod gets a unique IP address — no sharing, no conflicts
- Pods can communicate with each other without NAT — the source IP a pod sees is the actual sender's IP
- The IP a pod sees for itself is the same IP others see — no address translation at the pod boundary
These rules make IP-based observability possible. If you capture a packet with source IP 10.244.0.5, you can look that up and know it came from a specific pod. The IP is a reliable identifier.
But then Kubernetes introduces Services, and things get more complicated.
Services and DNAT
A Kubernetes Service provides a stable virtual IP (called a ClusterIP) that load-balances across a set of backend pods. The ClusterIP doesn't belong to any interface — it exists only as a set of firewall rules.
When a pod sends traffic to a ClusterIP, kube-proxy (or its eBPF replacement in Cilium) intercepts the packet and performs DNAT (Destination Network Address Translation): it rewrites the destination IP from the ClusterIP to one of the backend pod's real IPs.
Service DNAT: How kube-proxy Rewrites Packet Headers
IP packet header, pre-DNAT: src=10.244.0.5 (Pod A), dst=10.96.0.10 (ClusterIP). The ClusterIP is a virtual IP — no pod has this address.
Packet flow: Pod A (10.244.0.5) → Service my-svc (10.96.0.10) → kube-proxy DNAT (iptables/ipvs) → Pod B (10.244.0.6)
Observation Point Matters
Probe captures before DNAT: destination is 10.96.0.10 (useless for pod lookup)
Probe captures after DNAT: destination is 10.244.0.6 (maps to Pod B)
# kube-proxy creates iptables rules like this:
$ iptables -t nat -L KUBE-SERVICES -n
# Service "my-svc" has ClusterIP 10.96.0.10, port 80
# Backend pods: 10.244.0.6 and 10.244.1.3
DNAT tcp -- 0.0.0.0/0 10.96.0.10 dpt:80 to:10.244.0.6:80 (50% probability)
DNAT tcp -- 0.0.0.0/0 10.96.0.10 dpt:80 to:10.244.1.3:80 (50% probability)
The packet transformation:
Before DNAT: src=10.244.0.5 dst=10.96.0.10 (ClusterIP — virtual)
After DNAT: src=10.244.0.5 dst=10.244.0.6 (Real pod IP)
DNAT (Destination NAT) rewrites only the destination IP address on the packet. The source IP stays intact. NAT (Network Address Translation) is the general term for rewriting IP addresses in packet headers — DNAT specifically rewrites the destination. There's also SNAT (Source NAT) which rewrites the source, but Kubernetes tries to avoid that for pod-to-pod traffic because it breaks the "pods see real IPs" guarantee.
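Mechanically, DNAT is a keyed rewrite of the destination fields of the 5-tuple, with a backend chosen from the Service's endpoints. A minimal sketch (the `FiveTuple` struct and the modulo backend selection are illustrative; real kube-proxy selects backends randomly via iptables statistic rules or ipvs schedulers):

```rust
use std::net::Ipv4Addr;

#[derive(Clone, Copy, Debug)]
struct FiveTuple {
    src_ip: Ipv4Addr,
    dst_ip: Ipv4Addr,
    src_port: u16,
    dst_port: u16,
    proto: u8, // 6 = TCP
}

// DNAT: if the packet targets the ClusterIP:port, rewrite the destination
// to one of the backends. The source fields are left untouched.
fn dnat(pkt: FiveTuple, cluster_ip: Ipv4Addr, svc_port: u16,
        backends: &[(Ipv4Addr, u16)]) -> FiveTuple {
    if pkt.dst_ip == cluster_ip && pkt.dst_port == svc_port {
        // Illustrative backend choice; kube-proxy uses random selection.
        let (ip, port) = backends[(pkt.src_port as usize) % backends.len()];
        FiveTuple { dst_ip: ip, dst_port: port, ..pkt }
    } else {
        pkt
    }
}

fn main() {
    let pkt = FiveTuple {
        src_ip: Ipv4Addr::new(10, 244, 0, 5), dst_ip: Ipv4Addr::new(10, 96, 0, 10),
        src_port: 49312, dst_port: 80, proto: 6,
    };
    let backends = [(Ipv4Addr::new(10, 244, 0, 6), 80), (Ipv4Addr::new(10, 244, 1, 3), 80)];
    let out = dnat(pkt, Ipv4Addr::new(10, 96, 0, 10), 80, &backends);
    println!("src={} dst={}", out.src_ip, out.dst_ip); // source preserved, dst rewritten
}
```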
This matters for observability because where DNAT happens determines what IP addresses you see in captured packets:
- Before DNAT (on the sending pod's veth, or early in the bridge path): dst=10.96.0.10 — a ClusterIP that doesn't map to any pod in your cache
- After DNAT (later in the path, on the receiving pod's veth): dst=10.244.0.6 — a real pod IP you can look up
Where the DNAT rules execute depends on the CNI and kube-proxy mode. With iptables-based kube-proxy, DNAT happens in the PREROUTING chain, which runs before the routing decision. This means traffic captured on the bridge is typically post-DNAT. But this is an implementation detail that varies across setups — the key point is that observation point determines what you see.
hostNetwork: true (The identity crisis that broke my enrichment pipeline)
Now we get to the thing that burned me.
Normally, a pod gets its own network namespace. It gets a unique IP, its own routing table, its own veth pair — complete isolation. hostNetwork: true skips all of that. The pod runs directly in the host's network namespace.
Normal Pod vs hostNetwork Pod
How network namespace isolation changes when a pod uses hostNetwork: true.
Normal pod (nginx-7d4f8b): own namespace with its own eth0, connected through a veth pair to the cni0 bridge. IP: 10.244.0.5 (unique to this pod).
hostNetwork pod (kube-proxy-x9k2): runs directly in the host namespace with direct access to the node's eth0. IP: 172.18.0.3 (same as the node).
IP lookup collision: multiple hostNetwork pods share the node IP, making source attribution ambiguous for IP-based identity systems.
# Normal pod — gets its own network identity
apiVersion: v1
kind: Pod
spec:
containers:
- name: my-app
image: nginx
# Result: Pod IP = 10.244.0.5 (unique, own namespace)
# Has: own eth0, own routing table, own veth pair
---
# hostNetwork pod — borrows the node's identity
apiVersion: v1
kind: Pod
spec:
hostNetwork: true
containers:
- name: my-agent
image: orb8-agent
# Result: Pod IP = 172.18.0.3 (same as the node)
# Has: node's eth0, node's routing table, no veth pair
Why would you want this? DaemonSets running system-level tools need it. An eBPF observability agent needs access to the host's network interfaces to attach probes. A CNI plugin needs to configure the host's network stack. kube-proxy needs to install iptables rules in the host namespace. These tools can't operate from inside an isolated pod namespace — they need to see and modify the host's networking directly.
The cost: hostNetwork pods share the node's IP address. If three hostNetwork pods run on a node with IP 172.18.0.3, all three have the IP 172.18.0.3.
This is what broke my enrichment pipeline. The pod watcher builds a map of IP → pod name. When it processes hostNetwork pods, it inserts the node's IP as the key. But only one entry can exist per key, so the last hostNetwork pod processed "wins." Every flow to or from that IP gets attributed to whatever pod happened to be indexed last.
Pod watcher processes events:
kube-proxy (hostNetwork) → IP 172.18.0.3 → cache: 172.18.0.3 = kube-proxy
orb8-agent (hostNetwork) → IP 172.18.0.3 → cache: 172.18.0.3 = orb8-agent (overwrites!)
Flow arrives: src=172.18.0.3 dst=10.96.0.1
Cache lookup: 172.18.0.3 → "orb8-agent"
But the actual sender was kube-proxy.
Or kubelet (a host process, not even a pod).
Or sshd.
We can't tell. They all share 172.18.0.3.
The IP → pod mapping is one-to-many for hostNetwork pods, but a hashmap is one-to-one. Information is lost.
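The overwrite is easy to reproduce. A map keyed by IP keeps only the last writer; the pod names and node IP here mirror the trace above:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

fn main() {
    let node_ip = Ipv4Addr::new(172, 18, 0, 3);
    let mut cache: HashMap<Ipv4Addr, &str> = HashMap::new();

    // Two hostNetwork pods report the same node IP.
    cache.insert(node_ip, "kube-proxy");
    cache.insert(node_ip, "orb8-agent"); // silently overwrites kube-proxy

    // Any flow from 172.18.0.3 now resolves to whichever pod was indexed last,
    // regardless of which process actually sent the packet.
    assert_eq!(cache.get(&node_ip), Some(&"orb8-agent"));
    assert_eq!(cache.len(), 1); // one key; the second identity is gone
    println!("172.18.0.3 -> {}", cache[&node_ip]);
}
```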
For regular pods, IP-based enrichment works reliably because Kubernetes guarantees unique IPs. For hostNetwork pods, you need additional signals — port numbers, process IDs from tracepoint probes, or simply accepting that hostNetwork traffic gets labeled as "host" rather than attributed to a specific pod.
Where You Observe Determines What You See (The interface you attach to changes everything)
This is the practical synthesis of everything above. When building an eBPF observability tool, the decision of which network interface to attach TC classifiers to determines your visibility. Different interfaces see fundamentally different traffic.
Here's what each attachment point gives you:
Probes on eth0 (the host's physical/virtual NIC):
- Cross-node pod traffic (packets leaving or entering the node)
- Host process traffic (kubelet, sshd, containerd)
- hostNetwork pod traffic
- Missing: same-node pod-to-pod traffic (stays on the bridge, never touches eth0)
Probes on cni0 (the bridge):
- All pod traffic on the node, including same-node communication
- Missing: hostNetwork pod traffic (these pods bypass the bridge — they're in the host namespace)
- Missing: host process traffic
Probes on individual veth interfaces:
- Traffic for one specific pod only
- Most granular, but requires managing probes as pods come and go
- Veths are created and destroyed with pods — the probe lifecycle gets complicated
The practical approach for broad visibility: attach to both eth0 and the bridge interface. This covers cross-node traffic, same-node pod traffic, and host-level traffic. You'll get some duplicate events for cross-node pod traffic (it crosses both the bridge and eth0), but deduplication is easier than missing data.
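One naive way to suppress those duplicates, sketched under assumptions (the `Deduper`, its flow key, and the time window are all illustrative; a real agent might instead aggregate flows or key on richer per-event identifiers):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Flow key: a simplified 5-tuple (src ip, dst ip, src port, dst port, proto).
type FlowKey = (u32, u32, u16, u16, u8);

// Suppress duplicate events for the same flow seen on multiple interfaces
// (e.g. the bridge and eth0) within a short window.
struct Deduper {
    seen: HashMap<FlowKey, Instant>,
    window: Duration,
}

impl Deduper {
    // Returns true if this event should be kept (first sighting in the window).
    fn keep(&mut self, key: FlowKey, now: Instant) -> bool {
        match self.seen.get(&key) {
            Some(&t) if now.duration_since(t) < self.window => false, // duplicate
            _ => {
                self.seen.insert(key, now); // first sighting (or window expired)
                true
            }
        }
    }
}

fn main() {
    let mut d = Deduper { seen: HashMap::new(), window: Duration::from_millis(100) };
    let key = (0x0AF4_0005, 0x0AF4_0108, 49312, 80, 6);
    let t0 = Instant::now();
    println!("{}", d.keep(key, t0));                             // first sighting: kept
    println!("{}", d.keep(key, t0 + Duration::from_millis(10))); // same flow on eth0: dropped
}
```

The tradeoff is deliberate: a too-short window lets duplicates through, a too-long one can swallow legitimate repeat events, but either way no traffic is invisible.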
// Discover and attach to multiple interfaces for maximum visibility
fn discover_interfaces() -> Vec<String> {
let mut interfaces = Vec::new();
// 1. Find the default route interface (eth0 or equivalent)
// Parse /proc/net/route — the entry with destination 00000000
// is the default route
let routes = std::fs::read_to_string("/proc/net/route")
.expect("failed to read route table");
for line in routes.lines().skip(1) {
let fields: Vec<&str> = line.split_whitespace().collect();
if fields[1] == "00000000" {
interfaces.push(fields[0].to_string()); // e.g., "eth0"
break;
}
}
    // 2. Find bridge interfaces
    // Bridges expose a /sys/class/net/<name>/bridge directory in sysfs
    for entry in std::fs::read_dir("/sys/class/net").unwrap() {
        let name = entry.unwrap().file_name().to_string_lossy().to_string();
        let is_bridge =
            std::path::Path::new(&format!("/sys/class/net/{name}/bridge")).exists();
        // Well-known CNI/Docker names as a fallback for unusual environments
        if is_bridge
            || matches!(name.as_str(), "cni0" | "docker0" | "cbr0")
            || name.starts_with("br-")
        {
            interfaces.push(name);
        }
    }
interfaces
}
// Attach TC classifiers to each discovered interface
fn attach_probes(interfaces: &[String], bpf: &mut Bpf) {
for iface in interfaces {
// Ingress: packets arriving at this interface
TcBuilder::new(bpf.program_mut("classifier_ingress"))
.ifname(iface)
.direction(TcAttachType::Ingress)
.build().unwrap()
.attach().unwrap();
// Egress: packets leaving this interface
TcBuilder::new(bpf.program_mut("classifier_egress"))
.ifname(iface)
.direction(TcAttachType::Egress)
.build().unwrap()
.attach().unwrap();
}
}
Enriching Flows with Pod Names (The IP lookup that works 95% of the time)
TC classifiers operate at the network layer. They see raw packet headers: source IP, destination IP, source port, destination port, and protocol. That's it. A 5-tuple. There's no process name, no container ID, no pod label.
This is a fundamental constraint of the TC hook point. TC classifiers run in the kernel's network stack during softirq processing (the kernel's way of handling interrupts from hardware, like a network card signaling "packet arrived"). There's no process context available because the packet processing isn't running on behalf of any specific process — it's the kernel doing work triggered by a hardware interrupt. The eBPF helper bpf_get_current_cgroup_id(), which could identify the container, returns 0 in this context.
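What the classifier can extract is just fixed-offset header fields. A userspace sketch of that same parsing, using the IPv4 and TCP header byte layout (offsets per RFC 791/793; no kernel context, just bytes):

```rust
use std::net::Ipv4Addr;

// Parse src/dst IPs and ports from a raw IPv4 packet with a TCP/UDP payload.
// This mirrors what a TC classifier reads at fixed offsets: it has no access
// to process or container identity, only these header bytes.
fn parse_5tuple(pkt: &[u8]) -> Option<(Ipv4Addr, Ipv4Addr, u16, u16, u8)> {
    if pkt.len() < 20 || pkt[0] >> 4 != 4 {
        return None; // not IPv4
    }
    let ihl = ((pkt[0] & 0x0f) as usize) * 4; // IP header length in bytes
    if pkt.len() < ihl + 4 {
        return None; // too short to contain L4 ports
    }
    let proto = pkt[9]; // 6 = TCP, 17 = UDP
    let src = Ipv4Addr::new(pkt[12], pkt[13], pkt[14], pkt[15]);
    let dst = Ipv4Addr::new(pkt[16], pkt[17], pkt[18], pkt[19]);
    // Ports are the first four bytes after the IP header, big-endian.
    let src_port = u16::from_be_bytes([pkt[ihl], pkt[ihl + 1]]);
    let dst_port = u16::from_be_bytes([pkt[ihl + 2], pkt[ihl + 3]]);
    Some((src, dst, src_port, dst_port, proto))
}

fn main() {
    // Minimal synthetic packet: IPv4/TCP, 10.244.0.5:49312 -> 10.244.0.6:80
    let mut pkt = [0u8; 24];
    pkt[0] = 0x45; // version 4, IHL 5 (20-byte header)
    pkt[9] = 6;    // TCP
    pkt[12..16].copy_from_slice(&[10, 244, 0, 5]);
    pkt[16..20].copy_from_slice(&[10, 244, 0, 6]);
    pkt[20..22].copy_from_slice(&49312u16.to_be_bytes());
    pkt[22..24].copy_from_slice(&80u16.to_be_bytes());
    println!("{:?}", parse_5tuple(&pkt));
}
```

Everything in that tuple is addressing, not identity, which is why the enrichment step below exists at all.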
So you need another way to map IPs to pods. The Kubernetes API provides it.
IP Enrichment Pipeline
How raw packet 5-tuples get enriched with Kubernetes pod names via IP lookup.
Pipeline stages: raw event (5-tuple) → IP lookup → enriched flow.
- Pod network (unique IPs): Each pod gets its own IP from the CNI. Lookup is 1:1 — unambiguous attribution.
- hostNetwork (shared node IP): Pods share the node IP. Multiple pods map to the same address, breaking 1:1 lookup.
The agent runs a background task that watches the Kubernetes API for pod events. Every time a pod is created, updated, or deleted, the watcher updates an in-memory cache mapping IP addresses to pod metadata.
// Background task: watch Kubernetes API for pod changes
async fn watch_pods(cache: Arc<PodCache>, client: kube::Client) {
    let pods: Api<Pod> = Api::all(client);
    // watcher() returns a Stream; pin it and pull events via futures' TryStreamExt
    let mut events = watcher(pods, watcher::Config::default()).boxed();
    while let Some(event) = events.try_next().await.unwrap() {
match event {
            // InitApply delivers pods during the initial list; treat like Apply
            Event::Apply(pod) | Event::InitApply(pod) => {
// Pod created or updated — extract IP and metadata
let ip = pod.status.as_ref()
.and_then(|s| s.pod_ip.as_ref());
if let Some(ip) = ip {
cache.insert_by_ip(ip.parse().unwrap(), PodInfo {
namespace: pod.metadata.namespace.unwrap_or_default(),
name: pod.metadata.name.unwrap_or_default(),
});
}
}
Event::Delete(pod) => {
// Pod deleted — remove from cache
if let Some(ip) = pod.status.as_ref()
.and_then(|s| s.pod_ip.as_ref())
{
cache.remove_by_ip(&ip.parse().unwrap());
}
}
            Event::Init | Event::InitDone => {
                // Initial list boundaries — cache is warm after InitDone
            }
}
}
}
When the main event loop reads a flow event from the eBPF ring buffer, it looks up both IPs in the cache:
// Main loop: poll ring buffer, enrich, aggregate
fn process_event(event: &NetworkFlowEvent, cache: &PodCache) -> EnrichedFlow {
let src_pod = cache.get_by_ip(&event.src_ip);
let dst_pod = cache.get_by_ip(&event.dst_ip);
// Direction-aware attribution:
// Ingress = packet arriving → destination is the local pod
// Egress = packet leaving → source is the local pod
let (namespace, pod_name) = if event.direction == INGRESS {
dst_pod.or(src_pod)
} else {
src_pod.or(dst_pod)
}.map(|p| (p.namespace.clone(), p.name.clone()))
.unwrap_or(("external".into(), "unknown".into()));
EnrichedFlow { namespace, pod_name, /* ... */ }
}
Where this works:
- Regular pods: Each has a unique IP assigned by the CNI. Lookup is unambiguous.
- Cross-node traffic: Source and destination are real pod IPs. Both resolve correctly.
Where this breaks:
- hostNetwork pods: Multiple pods share the node IP. Lookup returns whichever was cached last.
- External traffic: IPs from outside the cluster aren't in the pod cache. Labeled as external/unknown.
- Service ClusterIPs: If a packet is captured before DNAT, the destination is a virtual IP not in the pod cache.
For the 95% case — regular pods communicating directly — this approach is reliable and efficient. The cache is updated in real time via the Kubernetes watch API, so new pods are resolvable within seconds of creation.
Interface Discovery (Reading /proc to find where to attach)
You can't hardcode interface names. Different Kubernetes distributions, CNI plugins, and cloud providers use different interface names and bridge configurations. The agent needs to discover them at startup.
The primary interface is found by parsing /proc/net/route, a pseudo-file the kernel exposes:
$ cat /proc/net/route
Iface Destination Gateway Flags RefCnt Use Metric Mask MTU Window IRTT
eth0 00000000 0101A8C0 0003 0 0 100 00000000 0 0 0
cni0 0000F40A 00000000 0001 0 0 0 00FFFFFF 0 0 0
The entry with destination 00000000 (which is 0.0.0.0 — the default route) tells you the primary interface. In this case, eth0. The gateway 0101A8C0 is 192.168.1.1 in little-endian hex.
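Decoding those fields is a small byte-order exercise: parse the hex as a u32, then read its bytes in little-endian order. A sketch (the function name is mine):

```rust
use std::net::Ipv4Addr;

// /proc/net/route stores IPv4 addresses as little-endian hex:
// "0101A8C0" is the byte sequence C0 A8 01 01, i.e. 192.168.1.1.
fn parse_route_hex(hex: &str) -> Option<Ipv4Addr> {
    let raw = u32::from_str_radix(hex, 16).ok()?;
    // to_le_bytes() reverses the on-disk byte order back into address order.
    Some(Ipv4Addr::from(raw.to_le_bytes()))
}

fn main() {
    println!("{:?}", parse_route_hex("0101A8C0")); // the gateway above
    println!("{:?}", parse_route_hex("0000F40A")); // the cni0 destination
}
```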
Bridges are found by checking which interfaces have a /sys/class/net/<name>/bridge directory — a convention the kernel uses to expose bridge metadata. Some environments use well-known names (cni0, docker0, cbr0); others use generated names (br-a1b2c3d4 in Docker networks, or kind's bridge names).
The interface landscape is also dynamic. Veth interfaces are created and destroyed as pods come and go. Bridges can be reconfigured by CNI plugins. A robust agent runs discovery at startup and periodically re-scans, or watches for netlink events (the kernel's notification mechanism for network configuration changes) to react to interface changes in real time.
The Full Picture (What I wish I knew before building this)
Container networking is layers of abstraction, each solving one problem:
- Network namespaces give containers isolated network stacks
- Veth pairs connect those isolated stacks back to the host
- Bridges let containers on the same node talk to each other
- Routing tables direct traffic to the right interface
- CNI plugins orchestrate all of the above when pods are created
- Kubernetes Services and DNAT provide stable endpoints and load balancing
- hostNetwork: true bypasses all isolation for system-level tools
For observability, the mental model is: packets flow through specific interfaces, and each interface carries specific traffic. Where you observe determines what you see. IP-based enrichment works because Kubernetes guarantees pod IP uniqueness — except when pods opt out of their own namespace via hostNetwork.
The opening mystery resolves cleanly: hostNetwork pods don't get their own IP, so the IP → pod mapping is ambiguous. The fix isn't a smarter data structure — it's accepting that hostNetwork traffic belongs to the "host" and needs different attribution strategies (port-based disambiguation, or simply labeling it as host traffic).
Building an eBPF-powered observability tool taught me that the Linux network stack is not a monolith. It's a pipeline of well-defined stages, observable at each point, with different tradeoffs at each stage. Understanding that pipeline — not just memorizing it, but knowing why each piece exists and what traffic flows where — is the difference between observability that works and observability that lies to you.
Related: Kernel Space and eBPF: The Observability Revolution