Debugging docker routing behavior
Let's zoom in a bit on why this happens.
In this example we have two docker networks called my_default and my_monitoring (aside: they are named like that because they belong to the my docker-compose project):
# docker network ls
NETWORK ID NAME DRIVER SCOPE
2ad61e302639 host host local
d383fea61ebd my_default bridge local
629af1b7e10d my_monitoring bridge local
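As an aside to the aside: compose records the project and network names as labels on the network, so the naming can be confirmed directly instead of eyeballed (the label keys below are, to my knowledge, what recent compose versions set):
# docker network inspect my_monitoring --format '{{index .Labels "com.docker.compose.project"}}/{{index .Labels "com.docker.compose.network"}}'
which should print something like my/monitoring.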
Let's now inspect the bridge devices:
# ip link show type bridge
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:24:3d:80:be brd ff:ff:ff:ff:ff:ff
4: br-629af1b7e10d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:24:e7:96:7d brd ff:ff:ff:ff:ff:ff
5: br-d383fea61ebd: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:00:08:62:04 brd ff:ff:ff:ff:ff:ff
We can see that each of our two networks has a corresponding bridge device (recognizable by the network ID appearing after br-).
In this case the docker0 bridge is down, probably because we don't have any containers on the default docker network. We'll disregard it, but it would presumably behave just like any other docker-created bridge.
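To back up the "no containers attached" theory, we can ask which interfaces are enslaved to docker0 (a bridge with no ports has no carrier, hence the DOWN/NO-CARRIER state above):
# ip link show master docker0
which, if the theory holds, prints nothing.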
Let's focus on one of these bridges, let's say the monitoring one 629af1b7e10d.
# ip link show master br-629af1b7e10d
9: veth6f9b96e@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-629af1b7e10d state UP mode DEFAULT group default
link/ether 66:d2:79:4f:f8:a4 brd ff:ff:ff:ff:ff:ff link-netnsid 0
11: veth101dbe8@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-629af1b7e10d state UP mode DEFAULT group default
link/ether a6:c4:b9:34:ab:31 brd ff:ff:ff:ff:ff:ff link-netnsid 2
90: vethf92d1d7@if89: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-629af1b7e10d state UP mode DEFAULT group default
link/ether 6a:3b:30:70:e3:8a brd ff:ff:ff:ff:ff:ff link-netnsid 36
We can see that the bridge has three virtual ethernet (veth) devices bound to it; these correspond to the three docker containers that expose some port, regardless of whether they also bind those ports on the host (exposing stays within the docker network, while port binding via ports happens on the host, or on a specific IP of it).
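The same attachment and expose/publish information is also visible from the docker side; for example (output shapes are from memory, container names will obviously differ):
# docker network inspect my_monitoring --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{println}}{{end}}'
# docker ps --format 'table {{.Names}}\t{{.Ports}}'
The first lists who is attached to this network; the second shows exposed-only ports (like 9100/tcp) next to published ones (like 0.0.0.0:443->9001/tcp).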
Detour: What's a bridge anyway? https://wiki.archlinux.org/title/network_bridge gives good insight. From the host's perspective, the bridge looks like an ethernet device that acts as a gateway towards (here) the containers attached to it. That explains why the bridge interfaces have an IP address attached (see https://unix.stackexchange.com/a/319984/20146):
# ip addr show type bridge
...
4: br-629af1b7e10d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:24:e7:96:7d brd ff:ff:ff:ff:ff:ff
inet 172.19.0.1/16 brd 172.19.255.255 scope global br-629af1b7e10d
valid_lft forever preferred_lft forever
...
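Detour within the detour: there is nothing docker-specific about such a bridge. One can be conjured up by hand, host-side gateway address and all (the names and subnet below are made up, and attaching ports would then be a matter of ip link set <some-veth> master br-test):
# ip link add name br-test type bridge
# ip addr add 192.0.2.1/24 dev br-test
# ip link set br-test up
# ip link del br-test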
If we peek into one of the attached containers, we see that its device has an IP address from that subnet, 172.19.0.0/16:
...
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:13:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.19.0.3/16 brd 172.19.255.255 scope global eth0
valid_lft forever preferred_lft forever
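Peeking here simply means something along the lines of the following, assuming the image ships iproute2 (the container name is a placeholder):
# docker exec -it <container> ip addr show eth0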
Detour: apart from exec-ing into the container, there is another way to execute network commands as if we were in the container's network namespace. This is useful if the container doesn't have the network tools installed. See https://stackoverflow.com/a/52287652/180258 on how to link /var/run/netns, then
# get the network namespace of a container on our monitoring docker net, first 12 chars only for some reason
docker inspect aa7bc3710d32 --format '{{.NetworkSettings.SandboxID}}' | cut -b-12
# ip netns exec 1aff47345e7f ip link
...
10: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:13:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
or equally # nsenter --net=/var/run/netns/1aff47345e7f ip addr, which without the command arguments would give you a shell inside that namespace (and can be used to enter other namespace types at once too), but I'll stop musing here.
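For completeness, my recollection of what the linked answer boils down to (treat this as a sketch, paths may differ per setup): docker keeps its sandbox netns files under /var/run/docker/netns, while ip netns looks in /var/run/netns, so either symlink the directory (if /var/run/netns doesn't exist yet) or look up the full path per container:
# ln -s /var/run/docker/netns /var/run/netns
# docker inspect aa7bc3710d32 --format '{{.NetworkSettings.SandboxKey}}'
The second command prints the full /var/run/docker/netns/... path, whose basename is that 12-character name from above.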
Once we are able to enter a container's network namespace, why not quickly check how it binds its EXPOSE-ed port (here node-exporter service exposing 9100 to the monitoring net)?
After nsenter we run ss -nl4p and find... mostly nothing, or at least not what we expect (note: ss is the modern replacement for netstat). ss -nlp though, without the IPv4 filter, lists the *:9100 port as expected. Weird. Let's try ss -nl6p and we see [::ffff:172.19.0.2]:9100, which is... a very weird address. Reading up, it is an "IPv4-mapped IPv6 address", i.e. an IPv4 address embedded in IPv6 space. Now, I don't want to know more about that; let's just pretend it is listening on a regular IPv4 address (which it is not, but...).
How would traffic be routed out from this network namespace?
# ip route
default via 172.19.0.1 dev eth0
172.19.0.0/16 dev eth0 proto kernel scope link src 172.19.0.2
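We can also ask the routing code for a verdict on a concrete destination (any external address will do, 1.1.1.1 is just an example):
# ip route get 1.1.1.1
which should answer with something like 1.1.1.1 via 172.19.0.1 dev eth0 src 172.19.0.2, i.e. out through the bridge acting as the gateway.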
Who are we communicating with?
# ip neigh
172.19.0.4 dev eth0 lladdr 02:42:ac:13:00:04 REACHABLE
This shows another container (actually a prometheus instance that scrapes the node-exporter). All good from this direction.
Now, back to our original question: how does host traffic (or traffic from outside the host) get routed to the containers? More specifically, to the ports-bound published ports (expose-d ports are only visible within the docker network).
Let's find a container with a published port. An nginx container has ports: 443:9001, so its locally exposed port 9001 should be reachable on all addresses of the host on port 443. After nsenter-ing into it, we find a listener bound to 0.0.0.0:9001 (this time genuinely bound to IPv4). But where's the 443 part?
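Docker's own view of that mapping can be checked too (the container name is a placeholder):
# docker port <nginx-container>
which should report something like 9001/tcp -> 0.0.0.0:443.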
Leaving the network namespace, and doing ss -nlp | grep 443 on the host reveals that docker-proxy is listening on 0.0.0.0:443. Indeed, if we check:
# systemctl status docker
CGroup: /system.slice/docker.service
├─65555 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
...
└─68015 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 443 -container-ip 172.18.0.23 -container-port 9001
we find it is a docker-managed proxy targeting our container. Okay, we are satisfied with that part for our purposes (don't ask how the docker network driver works). Note: from what follows, it might turn out this docker-proxy is a red herring... or not?
So now we know we have to ask how docker arranges the iptables rules that target port 443.
Doing iptables-save | grep 443 reveals
*nat
...
:DOCKER - [0:0]
...
-A DOCKER ! -i br-d383fea61ebd -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.23:9001
so there's this DOCKER chain rule in the nat table, where traffic not originating within the my_default docker network and targeting port 443 gets DNAT-ed straight to the container's address. Neat. (It also means the docker-proxy likely doesn't come into play on this path, though maybe it does for connections from inside that network.)
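Grepping for 443 only shows the one rule; the whole chain, with packet counters that come in handy for checking whether a rule actually fires, can be listed with:
# iptables -t nat -L DOCKER -n -v --line-numbers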
Okay, how do packets reach this chain and rule?
*nat
...
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
Uh-oh, so this only triggers on packets destined for LOCAL addresses. So this looks like a dead end. Let's look further.
Back to our original suspect, docker-proxy. But we are a bit tired of having no hard evidence at hand, so with the help of https://serverfault.com/a/126079/47611 and https://serverfault.com/a/1113788/47611, let's get hold of some iptables rule traces:
iptables -t raw -A PREROUTING -p tcp -d <hostip> --dport 443 -j TRACE
xtables-monitor --trace
and we see
PACKET: 2 036c1093 IN=eno1 OUT=br-d383fea61ebd
which is m'kay, but why exactly? Maybe conntrack is disturbing things; let's start with a clean slate: stop the docker container, start the trace, restart the container.
Now we get, as we were hoping for, a different trace. First, the trace from while the container was stopped:
PACKET: 2 842d387a IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=56110DF SPORT=17230 DPORT=443 SYN
TRACE: 2 842d387a nat:PREROUTING:rule:0x28:JUMP:DOCKER -4 -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
TRACE: 2 842d387a nat:DOCKER:return:
TRACE: 2 842d387a nat:PREROUTING:return:
TRACE: 2 842d387a nat:PREROUTING:policy:ACCEPT
PACKET: 2 842d387a IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=56110DF SPORT=17230 DPORT=443 SYN
TRACE: 2 842d387a filter:INPUT:rule:0x14:JUMP:ufw-before-logging-input -4 -t filter -A INPUT -j ufw-before-logging-input
... more ufw ...
What is immediately interesting is that we see the packet jumping to the DOCKER chain in the nat table, which we assumed it wouldn't. Wrongly.
Then it is not surprising that when the container is back, the packet gets routed to it:
PACKET: 2 f92d752e IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=31830DF SPORT=57694 DPORT=443 SYN
TRACE: 2 f92d752e raw:PREROUTING:rule:0xd:CONTINUE -4 -t raw -A PREROUTING -d <hostip>/32 -p tcp -m tcp --dport 443 -j TRACE
TRACE: 2 f92d752e raw:PREROUTING:return:
TRACE: 2 f92d752e raw:PREROUTING:policy:ACCEPT
PACKET: 2 f92d752e IN=eno1 MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=<hostip> LEN=60 TOS=0x0 TTL=56 ID=31830DF SPORT=57694 DPORT=443 SYN
TRACE: 2 f92d752e nat:PREROUTING:rule:0x28:JUMP:DOCKER -4 -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
TRACE: 2 f92d752e nat:DOCKER:rule:0x3b:ACCEPT -4 -t nat -A DOCKER ! -i br-d383fea61ebd -p tcp -m tcp --dport 443 -j DNAT --to-destination 172.18.0.23:9001
PACKET: 2 f92d752e IN=eno1 OUT=br-d383fea61ebd MACSRC=94:f7:ad:4f:81:84 MACDST=b4:2e:99:83:77:46 MACPROTO=0800 SRC=<srcip> DST=172.18.0.23 LEN=60 TOS=0x0 TTL=55 ID=31830DF SPORT=57694 DPORT=9001 SYN
TRACE: 2 f92d752e filter:FORWARD:rule:0xea:JUMP:DOCKER-USER -4 -t filter -A FORWARD -j DOCKER-USER
TRACE: 2 f92d752e filter:DOCKER-USER:return:
TRACE: 2 f92d752e filter:FORWARD:rule:0xe7:JUMP:DOCKER-ISOLATION-STAGE-1 -4 -t filter -A FORWARD -j DOCKER-ISOLATION-STAGE-1
TRACE: 2 f92d752e filter:DOCKER-ISOLATION-STAGE-1:return:
TRACE: 2 f92d752e filter:FORWARD:rule:0xa8:JUMP:DOCKER -4 -t filter -A FORWARD -o br-d383fea61ebd -j DOCKER
TRACE: 2 f92d752e filter:DOCKER:rule:0xf1:ACCEPT -4 -t filter -A DOCKER -d 172.18.0.23/32 ! -i br-d383fea61ebd -o br-d383fea61ebd -p tcp -m tcp --dport 9001 -j ACCEPT
TRACE: 2 f92d752e nat:POSTROUTING:return:
TRACE: 2 f92d752e nat:POSTROUTING:policy:ACCEPT
Home exercise: check the case of a non-SYN packet, and how it behaves slightly differently thanks to conntrack.
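For that exercise, the conntrack entry created by the SYN above is the interesting piece; it can be inspected with the conntrack tool (from conntrack-tools, assuming it is installed), and don't forget to remove the noisy trace rule when done:
# conntrack -L -p tcp --dport 443
# iptables -t raw -D PREROUTING -p tcp -d <hostip> --dport 443 -j TRACE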
But back to the question then: what is this -m addrtype --dst-type LOCAL matcher? Based on our experience, and the comment on https://unix.stackexchange.com/q/130807/20146, we conclude that LOCAL is any address assigned to this host, including its public address. Yay.
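One way to see that set without guessing: the LOCAL address type is what lives in the kernel's local routing table, which can be listed with:
# ip route show table local type local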
So, to recap: the nat-table PREROUTING rules DNAT the packet to the container, so after the routing decision the packet enters the FORWARD chain rather than INPUT (its destination is no longer a local address, so it gets forwarded out of another interface instead of being delivered locally). There it is directed to the DOCKER chain, where it gets accepted.
The UFW routed rules (for example ufw route deny out on br-d383fea61ebd) are also consulted from the FORWARD chain, but only after the docker rules. Too late.
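The ordering is easy to verify on the host:
# iptables -L FORWARD -n --line-numbers
where the jumps to DOCKER-USER, DOCKER-ISOLATION-STAGE-1 and DOCKER should show up above the ufw-* chains, matching the order we saw in the trace.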
https://docs.docker.com/network/packet-filtering-firewalls/ advises using the DOCKER-USER chain for FORWARD rules that should be evaluated before the docker rules. If one could get UFW to dump its rules into that chain instead of the FORWARD chain, that would be nice. https://github.com/docker/for-linux/issues/690#issuecomment-499132578 hints that other people have had this idea already. Let's not go down that rabbit hole now.
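For reference, a hypothetical DOCKER-USER rule of the kind the docs describe, blocking the published port for traffic arriving on the external interface (eno1, as in the traces above). Note that by the time FORWARD runs, the packet has already been DNAT-ed to port 9001, so the pre-DNAT port has to be matched through conntrack:
# iptables -I DOCKER-USER -i eno1 -p tcp -m conntrack --ctorigdstport 443 --ctdir ORIGINAL -j DROP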
In the meantime, what works: if you don't want a published port to be reachable from outside the host, explicitly prefix it with a bind address, like ports: - "127.0.0.1:9001:9001". Or, well, don't publish it at all.