Debugging docker routing behavior
Let's zoom in a bit on why this happens.
In this example we have two docker networks called my_default and my_monitoring (aside: they are named like that because they belong to the my docker-compose project):
# docker network ls
NETWORK ID NAME DRIVER SCOPE
2ad61e302639 host host local
d383fea61ebd my_default bridge local
629af1b7e10d my_monitoring bridge local
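As an aside to the aside: compose records the project and network names as labels on the network, so the naming can be confirmed directly instead of eyeballed (the label keys below are, to my knowledge, what recent compose versions set):
# docker network inspect my_monitoring --format '{{index .Labels "com.docker.compose.project"}}/{{index .Labels "com.docker.compose.network"}}'
which should print something like my/monitoring.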
Let's now inspect the bridge devices:
# ip link show type bridge
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:24:3d:80:be brd ff:ff:ff:ff:ff:ff
4: br-629af1b7e10d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:24:e7:96:7d brd ff:ff:ff:ff:ff:ff
5: br-d383fea61ebd: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:00:08:62:04 brd ff:ff:ff:ff:ff:ff
We can see that each of our two networks has a corresponding bridge device (recognizable by the network ID appearing after br-).
In this case the docker0 bridge is down, probably because we don't have any containers on the default docker network. We'll disregard it, but it would presumably behave just like any other docker-created bridge.
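To back up the "no containers attached" theory, we can ask which interfaces are enslaved to docker0 (a bridge with no ports has no carrier, hence the DOWN/NO-CARRIER state above):
# ip link show master docker0
which, if the theory holds, prints nothing.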
Let's focus on one of these bridges, let's say the monitoring one 629af1b7e10d.
# ip link show master br-629af1b7e10d
9: veth6f9b96e@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-629af1b7e10d state UP mode DEFAULT group default
link/ether 66:d2:79:4f:f8:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
11: veth101dbe8@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-629af1b7e10d state UP mode DEFAULT group default
link/ether a6:c4:b9:34:ab:31 brd ff:ff:ff:ff:ff:ff link-netnsid 2
90: vethf92d1d7@if89: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-629af1b7e10d state UP mode DEFAULT group default
link/ether 6a:3b:30:70:e3:8a brd ff:ff:ff:ff:ff:ff link-netnsid 36
We can see that the bridge has three virtual ethernet (veth) devices bound to it; these correspond to the three docker containers that expose some port, regardless of whether they also bind those ports on the host (exposing stays within the docker network, while port binding via ports happens on the host, or on a specific IP of it).
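The same attachment and expose/publish information is also visible from the docker side; for example (output shapes are from memory, container names will obviously differ):
# docker network inspect my_monitoring --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}'
# docker ps --format 'table {{.Names}}\t{{.Ports}}'
The first lists who is attached to this network; the second shows exposed-only ports (like 9100/tcp) next to published ones (like 0.0.0.0:443->9001/tcp).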
Detour: What's a bridge anyway? https://wiki.archlinux.org/title/network_bridge gives good insight. From the host's perspective, the bridge looks like an ethernet device that acts as a gateway towards (here) the containers attached to it. That explains why the bridge interfaces have an IP address attached (see https://unix.stackexchange.com/a/319984/20146):
# ip addr show type bridge
...
4: br-629af1b7e10d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:24:e7:96:7d brd ff:ff:ff:ff:ff:ff
inet 172.19.0.1/16 brd 172.19.255.255 scope global br-629af1b7e10d
valid_lft forever preferred_lft forever
...
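Detour within the detour: there is nothing docker-specific about such a bridge. One can be conjured up by hand, host-side gateway address and all (the names and subnet below are made up, and attaching ports would then be a matter of ip link set <some-veth> master br-test):
# ip link add name br-test type bridge
# ip addr add 192.0.2.1/24 dev br-test
# ip link set br-test up
# ip link del br-test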
If we peek into one of the attached containers, we see that its device has an IP address from that subnet, 172.19.0.0/16:
...
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:13:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.19.0.3/16 brd 172.19.255.255 scope global eth0
valid_lft forever preferred_lft forever
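Peeking here simply means something along the lines of the following, assuming the image ships iproute2 (the container name is a placeholder):
# docker exec -it <container> ip addr show eth0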
Detour: apart from exec-ing into the container, there is another way to execute network commands as if we were in the container's network namespace. This is useful if the container doesn't have the network tools installed. See https://stackoverflow.com/a/52287652/180258 on how to link /var/run/netns, then
# get the network namespace of a container on our monitoring docker net, first 12 chars only for some reason
docker inspect aa7bc3710d32 --format '{{.NetworkSettings.SandboxID}}' | cut -b-12
# ip netns exec 1aff47345e7f ip link
...
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:13:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
or equally # nsenter --net=/var/run/netns/1aff47345e7f ip addr, which without the command arguments would give you a shell inside that namespace (and can be used to enter other namespace types at once too), but I'll stop musing here.
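For completeness, my recollection of what the linked answer boils down to (treat this as a sketch, paths may differ per setup): docker keeps its sandbox netns files under /var/run/docker/netns, while ip netns looks in /var/run/netns, so either symlink the directory (if /var/run/netns doesn't exist yet) or look up the full path per container:
# ln -s /var/run/docker/netns /var/run/netns
# docker inspect aa7bc3710d32 --format '{{.NetworkSettings.SandboxKey}}'
The second command prints the full /var/run/docker/netns/... path, whose basename is that 12-character name from above.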
Once we are able to enter a container's network namespace, why not quickly check how it binds its EXPOSE-ed port (here node-exporter service exposing 9100 to the monitoring net)?
After nsenter we run ss -nl4p and find... mostly nothing, or at least not what we expect (note: ss is the modern replacement for netstat). ss -nlp though, without the IPv4 filter, lists the *:9100 port as expected. Weird. Let's try ss -nl6p and we see [::ffff:172.19.0.2]:9100, which is... a very weird address. Reading up, it is an "IPv4-mapped IPv6 address", i.e. an IPv4 address embedded in IPv6 space. Now, I don't want to know more about that; let's just pretend it is listening on a regular IPv4 address (which it is not, but...).
How would traffic be routed out from this network namespace?
# ip route
default via 172.19.0.1 dev eth0
172.19.0.0/16 dev eth0 proto kernel scope link src 172.19.0.2
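We can also ask the routing code for a verdict on a concrete destination (any external address will do, 1.1.1.1 is just an example):
# ip route get 1.1.1.1
which should answer with something like 1.1.1.1 via 172.19.0.1 dev eth0 src 172.19.0.2, i.e. out through the bridge acting as the gateway.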
Who are we communicating with?
# ip neigh
172.19.0.4 dev eth0 lladdr 02:42:ac:13:00:04 REACHABLE
This shows another container (actually a prometheus instance that scrapes the node-exporter). All good from this direction.
Now, back to our original question: how does host traffic (or traffic from outside the host) get routed to the containers? More specifically, to the ports-bound published ports (expose-d ports are only visible within the docker network).
Let's find a container with a published port. An nginx container has ports: 443:9001, so its locally exposed port 9001 should be reachable on all addresses of the host on port 443. After nsenter-ing into it, we find a listener bound to 0.0.0.0:9001 (this time genuinely bound to IPv4). But where's the 443 part?
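Docker's own view of that mapping can be checked too (the container name is a placeholder):
# docker port <nginx-container>
which should report something like 9001/tcp -> 0.0.0.0:443.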
Leaving the network namespace, and doing ss -nlp | grep 443 on the host reveals that docker-proxy is listening on 0.0.0.0:443. Indeed, if we check:
# systemctl status docker
CGroup: /system.slice/docker.service
├─65555 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
...
└─68015 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.18.0.23 -container-port 9001
we find it is a docker-managed proxy targeting our container. Okay, we are satisfied with that part for our purposes (don't ask how the docker network driver works). Note: from what follows, it might turn out this docker-proxy is a red herring... or not?
So now we know we have to ask how docker arranges the iptables rules that target port 443.
Doing iptables-save | grep 443 reveals
*nat
...
:DOCKER - [0:0]
...
-A DOCKER ! -i br-d383fea61ebd -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.23:9001
so there's this DOCKER chain rule in the nat table, where traffic not originating within the my_default docker network and targeting port 443 gets DNAT-ed straight to the container's address. Neat. (It also means the docker-proxy likely doesn't come into play on this path, though maybe it does for connections from inside that network.)
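Grepping for 443 only shows the one rule; the whole chain, with packet counters that come in handy for checking whether a rule actually fires, can be listed with:
# iptables -t nat -L DOCKER -n -v --line-numbers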
Okay, how do packets reach this chain and rule?
*nat
...
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
Uh-oh, so this only triggers on packets destined for LOCAL addresses. So this looks like a dead end. Let's look further.
Back to our original suspect, docker-proxy. But we are a bit tired of having no hard evidence at hand, so with the help of https://serverfault.com/a/126079/47611 and https://serverfault.com/a/1113788/47611, let's get hold of some iptables rule traces:
iptables -t raw -A PREROUTING -p tcp -d <hostip> --dport 443 -j TRACE
xtables-monitor --trace
and we see
PACKET: 2 036c1093 IN=eno1 OUT=br-d383fea61ebd
which is m'kay, but why exactly? Maybe conntrack is disturbing things; let's start with a clean slate: stop the docker container, start the trace, restart the container.
Now we get, as we were hoping for, a different trace. First, the trace from while the container was stopped:
PACKET: 2 842d387a IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=56110DF SPORT=17230 DPORT=443 SYN
TRACE: 2 842d387a nat:PREROUTING:rule:0x28:JUMP:DOCKER -4 -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
TRACE: 2 842d387a nat:DOCKER:return:
TRACE: 2 842d387a nat:PREROUTING:return:
TRACE: 2 842d387a nat:PREROUTING:policy:ACCEPT
PACKET: 2 842d387a IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=56110DF SPORT=17230 DPORT=443 SYN
TRACE: 2 842d387a filter:INPUT:rule:0x14:JUMP:ufw-before-logging-input -4 -t filter -A INPUT -j ufw-before-logging-input
... more ufw ...
What is immediately interesting is that we see the packet jumping to the DOCKER chain in the nat table, which we assumed it wouldn't. Wrongly.
Then it is not surprising that when the container is back, the packet gets routed to it:
PACKET: 2 f92d752e IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=31830DF SPORT=57694 DPORT=443 SYN
TRACE: 2 f92d752e raw:PREROUTING:rule:0xd:CONTINUE -4 -t raw -A PREROUTING -d <hostip>/32 -p tcp -m tcp --dport 443 -j TRACE
TRACE: 2 f92d752e raw:PREROUTING:return:
TRACE: 2 f92d752e raw:PREROUTING:policy:ACCEPT
PACKET: 2 f92d752e IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=31830DF SPORT=57694 DPORT=443 SYN
TRACE: 2 f92d752e nat:PREROUTING:rule:0x28:JUMP:DOCKER -4 -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
TRACE: 2 f92d752e nat:DOCKER:rule:0x3b:ACCEPT -4 -t nat -A DOCKER ! -i br-d383fea61ebd -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.23:9001
PACKET: 2 f92d752e IN=eno1 OUT=br-d383fea61ebd MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=172.18.0.23 LEN=60 TOS=0x0 TTL=55 ID=31830DF SPORT=57694 DPORT=9001 SYN
TRACE: 2 f92d752e filter:FORWARD:rule:0xea:JUMP:DOCKER-USER -4 -t filter -A FORWARD -j DOCKER-USER
TRACE: 2 f92d752e filter:DOCKER-USER:return:
TRACE: 2 f92d752e filter:FORWARD:rule:0xe7:JUMP:DOCKER-ISOLATION-STAGE-1 -4 -t filter -A FORWARD -j DOCKER-ISOLATION-STAGE-1
TRACE: 2 f92d752e filter:DOCKER-ISOLATION-STAGE-1:return:
TRACE: 2 f92d752e filter:FORWARD:rule:0xa8:JUMP:DOCKER -4 -t filter -A FORWARD -o br-d383fea61ebd -j DOCKER
TRACE: 2 f92d752e filter:DOCKER:rule:0xf1:ACCEPT -4 -t filter -A DOCKER -d 172.18.0.23/32 ! -i br-d383fea61ebd -o br-d383fea61ebd -p tcp -m tcp --dport 9001 -j ACCEPT
TRACE: 2 f92d752e nat:POSTROUTING:return:
TRACE: 2 f92d752e nat:POSTROUTING:policy:ACCEPT
Home exercise: check the case of a non-SYN packet, and how it behaves slightly differently thanks to conntrack.
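For that exercise, the conntrack entry created by the SYN above is the interesting piece; it can be inspected with the conntrack tool (from conntrack-tools, assuming it is installed), and don't forget to remove the noisy trace rule when done:
# conntrack -L -p tcp --dport 443
# iptables -t raw -D PREROUTING -p tcp -d <hostip> --dport 443 -j TRACE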
But back to the question then: what is this -m addrtype --dst-type LOCAL matcher? Based on our experience, and the comment on https://unix.stackexchange.com/q/130807/20146, we conclude that LOCAL is any address assigned to this host, including its public address. Yay.
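One way to see that set without guessing: the LOCAL address type is what lives in the kernel's local routing table, which can be listed with:
# ip route show table local type local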
So, to recap: the nat-table PREROUTING rules DNAT the packet to the container, so after the routing decision the packet enters the FORWARD chain rather than INPUT (its destination is no longer a local address, so it gets forwarded out of another interface instead of being delivered locally). There it is directed to the DOCKER chain, where it gets accepted.
The UFW routed rules (for example ufw route deny out on br-d383fea61ebd) are also consulted from the FORWARD chain, but only after the docker rules. Too late.
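The ordering is easy to verify on the host:
# iptables -L FORWARD -n --line-numbers
where the jumps to DOCKER-USER, DOCKER-ISOLATION-STAGE-1 and DOCKER should show up above the ufw-* chains, matching the order we saw in the trace.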
https://docs.docker.com/network/packet-filtering-firewalls/ advises using the DOCKER-USER chain for FORWARD rules that should be evaluated before the docker rules. If one could get UFW to dump its rules into that chain instead of the FORWARD chain, that would be nice. https://github.com/docker/for-linux/issues/690#issuecomment-499132578 hints that other people have had this idea already. Let's not go down that rabbit hole now.
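For reference, a hypothetical DOCKER-USER rule of the kind the docs describe, blocking the published port for traffic arriving on the external interface (eno1, as in the traces above). Note that by the time FORWARD runs, the packet has already been DNAT-ed to port 9001, so the pre-DNAT port has to be matched through conntrack:
# iptables -I DOCKER-USER -i eno1 -p tcp -m conntrack --ctorigdstport 443 --ctdir ORIGINAL -j DROP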
In the meantime, what works: if you don't want a published port to be reachable from outside the host, explicitly prefix it with a bind address, like ports: - "127.0.0.1:9001:9001". Or, well, don't publish it at all.