Or perhaps you're already a Docker enthusiast and your super savvy microservice architecture orchestrates dozens of applications among a pile of process containers.
Either way, the massive multiplication of containers everywhere introduces an interesting networking problem:
"How do thousands of containers interact with thousands of other containers efficiently over a network? What if every one of those containers could just route to one another?"
Canonical is pleased to introduce today an innovative solution that addresses this problem in perhaps the most elegant and efficient manner to date! We call it "The Fan" -- an extension of the network tunnel driver in the Linux kernel. The fan was conceived by Mark Shuttleworth and John Meinel, and implemented by Jay Vosburgh and Andy Whitcroft.
A Basic Overview
Each container host has a "fan bridge" that enables all of its containers to deterministically map network traffic to any other container on the fan network. I say "deterministically", in that there are no distributed databases, no consensus protocols, and no more overhead than IP-IP tunneling. [A more detailed technical description can be found here.] Quite simply, a /16 network gets mapped onto an unused /8 network, and container traffic is routed by the host via an IP tunnel.
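To make that mapping concrete, here is a minimal sketch of the arithmetic, assuming a /8 overlay and a /16 underlay as used in the rest of this post. The fan_subnet helper is hypothetical (it is not part of fanctl); it only illustrates how the last two octets of a host's address become the middle two octets of that host's /24 on the fan.

```shell
# Sketch only: derive a host's fan subnet from its underlay address.
# Assumes a /8 overlay and a /16 underlay; fan_subnet is a made-up helper.
fan_subnet() (
    overlay_octet=$1    # first octet of the overlay /8, e.g. 250
    host_ip=$2          # the host's underlay address, e.g. 172.30.0.28
    # Drop the first two octets; what remains identifies the host in the /16.
    low16=${host_ip#*.*.}
    echo "${overlay_octet}.${low16}.0/24"
)

fan_subnet 250 172.30.0.28
fan_subnet 250 172.30.0.27
```

With the two demo hosts used below, this prints 250.0.28.0/24 and 250.0.27.0/24, which matches the addresses the fan bridges and containers end up with.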
A Demo
Interested yet? Let's take it for a test drive in AWS...
First, launch two instances in EC2 (or your favorite cloud) in the same VPC. Ben Howard has created special test images for AWS and GCE, which include a modified Linux kernel, a modified iproute2 package, a new fanctl package, and Docker installed by default. You can find the right AMIs here.
Build and Publish report for trusty 20150621.1228.
-----------------------------------
BUILD INFO:
VERSION=14.04-LTS
STREAM=testing
BUILD_DATE=
BUG_NUMBER=1466602
STREAM="testing"
CLOUD=CustomAWS
SERIAL=20150621.1228
-----------------------------------
PUBLICATION REPORT:
NAME=ubuntu-14.04-LTS-testing-20150621.1228
SUITE=trusty
ARCH=amd64
BUILD=core
REPLICATE=1
IMAGE_FILE=/var/lib/jenkins/jobs/CloudImages-Small-CustomAWS/workspace/ARCH/amd64/trusty-server-cloudimg-CUSTOM-AWS-amd64-disk1.img
VERSION=14.04-LTS-testing-20150621.1228
INSTANCE_BUCKET=ubuntu-images-sandbox
INSTANCE_eu-central-1=ami-1aac9407
INSTANCE_sa-east-1=ami-59a22044
INSTANCE_ap-northeast-1=ami-3ae2453a
INSTANCE_eu-west-1=ami-d76623a0
INSTANCE_us-west-1=ami-238d7a67
INSTANCE_us-west-2=ami-53898c63
INSTANCE_ap-southeast-2=ami-ab95ef91
INSTANCE_ap-southeast-1=ami-98e9edca
INSTANCE_us-east-1=ami-b1a658da
EBS_BUCKET=ubuntu-images-sandbox
VOL_ID=vol-678e2c29
SNAP_ID=snap-efaa288b
EBS_eu-central-1=ami-b4ac94a9
EBS_sa-east-1=ami-e9a220f4
EBS_ap-northeast-1=ami-1aee491a
EBS_eu-west-1=ami-07602570
EBS_us-west-1=ami-318c7b75
EBS_us-west-2=ami-858b8eb5
EBS_ap-southeast-2=ami-558bf16f
EBS_ap-southeast-1=ami-faeaeea8
EBS_us-east-1=ami-afa25cc4
----
6cbd6751-6dae-4da7-acf3-6ace80c01acc
Next, ensure that those two instances can talk to one another. Here, I tested that in both directions, using both ping and nc.
ubuntu@ip-172-30-0-28:~$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 0a:0a:8f:f8:cc:21
          inet addr:172.30.0.28  Bcast:172.30.0.255  Mask:255.255.255.0
          inet6 addr: fe80::80a:8fff:fef8:cc21/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:2904565 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9919258 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:13999605561 (13.9 GB)  TX bytes:14530234506 (14.5 GB)
ubuntu@ip-172-30-0-28:~$ ping -c 3 172.30.0.27
PING 172.30.0.27 (172.30.0.27) 56(84) bytes of data.
64 bytes from 172.30.0.27: icmp_seq=1 ttl=64 time=0.289 ms
64 bytes from 172.30.0.27: icmp_seq=2 ttl=64 time=0.201 ms
64 bytes from 172.30.0.27: icmp_seq=3 ttl=64 time=0.192 ms
--- 172.30.0.27 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.192/0.227/0.289/0.045 ms
ubuntu@ip-172-30-0-28:~$ nc -l 1234
hi mom
─────────────────────────────────────────────────────────────────────
ubuntu@ip-172-30-0-27:~$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 0a:26:25:9a:77:df
          inet addr:172.30.0.27  Bcast:172.30.0.255  Mask:255.255.255.0
          inet6 addr: fe80::826:25ff:fe9a:77df/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:11157399 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1671239 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:16519319463 (16.5 GB)  TX bytes:12019363671 (12.0 GB)
ubuntu@ip-172-30-0-27:~$ ping -c 3 172.30.0.28
PING 172.30.0.28 (172.30.0.28) 56(84) bytes of data.
64 bytes from 172.30.0.28: icmp_seq=1 ttl=64 time=0.245 ms
64 bytes from 172.30.0.28: icmp_seq=2 ttl=64 time=0.185 ms
64 bytes from 172.30.0.28: icmp_seq=3 ttl=64 time=0.186 ms
--- 172.30.0.28 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.185/0.205/0.245/0.030 ms
ubuntu@ip-172-30-0-27:~$ echo "hi mom" | nc 172.30.0.28 1234
Now, import the Ubuntu image in Docker in both instances.
$ sudo docker pull ubuntu:latest
Pulling repository ubuntu
...
e9938c931006: Download complete
9802b3b654ec: Download complete
14975cc0f2bc: Download complete
8d07608668f6: Download complete
Now, let's create a fan bridge on each of those two instances. We can create it on the command line using the new fanctl command, or we can put it in /etc/network/interfaces.d/eth0.cfg.
We'll do the latter, so that the configuration is persistent across boots.
$ cat /etc/network/interfaces.d/eth0.cfg
# The primary network interface
auto eth0
iface eth0 inet dhcp
    up fanctl up 250.0.0.0/8 eth0/16 dhcp
    down fanctl down 250.0.0.0/8 eth0/16
$ sudo ifup --force eth0
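If you'd rather try the command-line route first, the same two fanctl invocations from that config can be run by hand. Here is a small sketch that wraps them; fan_up, fan_down and the DRY_RUN variable are names I've made up for illustration, and the fanctl arguments are copied from the config as-is. Note that a bridge created this way is not persistent across boots.

```shell
# Hypothetical wrappers around the fanctl invocations from the config.
# Set DRY_RUN=1 to print each command instead of running it.
fan_up() {
    # $1 = overlay network, $2 = underlay spec (interface/prefix length)
    ${DRY_RUN:+echo} sudo fanctl up "$1" "$2" dhcp
}

fan_down() {
    ${DRY_RUN:+echo} sudo fanctl down "$1" "$2"
}

# Dry run: shows what would be executed without touching the network.
DRY_RUN=1 fan_up 250.0.0.0/8 eth0/16
DRY_RUN=1 fan_down 250.0.0.0/8 eth0/16
```

Drop the DRY_RUN=1 prefix on a host with the fanctl package installed to actually bring the bridge up or down.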
Now, let's look at our ifconfig...
$ ifconfig
docker0   Link encap:Ethernet  HWaddr 56:84:7a:fe:97:99
          inet addr:172.17.42.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth0      Link encap:Ethernet  HWaddr 0a:0a:8f:f8:cc:21
          inet addr:172.30.0.28  Bcast:172.30.0.255  Mask:255.255.255.0
          inet6 addr: fe80::80a:8fff:fef8:cc21/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:2905229 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9919652 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:13999655286 (13.9 GB)  TX bytes:14530269365 (14.5 GB)

fan-250-0-28 Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          inet addr:250.0.28.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::8032:4dff:fe3b:a108/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1480  Metric:1
          RX packets:304246 errors:0 dropped:0 overruns:0 frame:0
          TX packets:245532 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:13697461502 (13.6 GB)  TX bytes:37375505 (37.3 MB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:1622 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1622 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:198717 (198.7 KB)  TX bytes:198717 (198.7 KB)

lxcbr0    Link encap:Ethernet  HWaddr 3a:6b:3c:9b:80:45
          inet addr:10.0.3.1  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::386b:3cff:fe9b:8045/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:648 (648.0 B)

tunl0     Link encap:IPIP Tunnel  HWaddr
          UP RUNNING NOARP  MTU:1480  Metric:1
          RX packets:242799 errors:0 dropped:0 overruns:0 frame:0
          TX packets:302666 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12793620 (12.7 MB)  TX bytes:13697374375 (13.6 GB)

Pay special attention to the new fan-250-0-28 device! I've only shown this on one of my instances, but you should check both.
Now, let's tell Docker to use that device as its default bridge.
$ fandev=$(ifconfig | grep ^fan- | awk '{print $1}')
$ echo $fandev
fan-250-0-28
$ echo "DOCKER_OPTS='-d -b $fandev --mtu=1480 --iptables=false'" | \
    sudo tee -a /etc/default/docker*
Make sure you restart the docker.io service. Note that it might be called docker.
$ sudo service docker.io restart || sudo service docker restart
Now we can launch a Docker container in each of our two EC2 instances...
$ sudo docker run -it ubuntu
root@261ae39d90db:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr e2:f4:fd:f7:b7:f5
          inet addr:250.0.28.3  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::e0f4:fdff:fef7:b7f5/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1480  Metric:1
          RX packets:7 errors:0 dropped:2 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:558 (558.0 B)  TX bytes:648 (648.0 B)
And here's a second one, on my other instance...
$ sudo docker run -it ubuntu
root@ddd943163843:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 66:fa:41:e7:ad:44
          inet addr:250.0.27.3  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::64fa:41ff:fee7:ad44/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1480  Metric:1
          RX packets:12 errors:0 dropped:2 overruns:0 frame:0
          TX packets:13 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:936 (936.0 B)  TX bytes:1026 (1.0 KB)
Now, let's send some traffic back and forth! Again, we can use ping and nc.
root@261ae39d90db:/# ping -c 3 250.0.27.3
PING 250.0.27.3 (250.0.27.3) 56(84) bytes of data.
64 bytes from 250.0.27.3: icmp_seq=1 ttl=62 time=0.563 ms
64 bytes from 250.0.27.3: icmp_seq=2 ttl=62 time=0.278 ms
64 bytes from 250.0.27.3: icmp_seq=3 ttl=62 time=0.260 ms
--- 250.0.27.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.260/0.367/0.563/0.138 ms
root@261ae39d90db:/# echo "here come the bits" | nc 250.0.27.3 9876
root@261ae39d90db:/#
─────────────────────────────────────────────────────────────────────
root@ddd943163843:/# ping -c 3 250.0.28.3
PING 250.0.28.3 (250.0.28.3) 56(84) bytes of data.
64 bytes from 250.0.28.3: icmp_seq=1 ttl=62 time=0.434 ms
64 bytes from 250.0.28.3: icmp_seq=2 ttl=62 time=0.258 ms
64 bytes from 250.0.28.3: icmp_seq=3 ttl=62 time=0.269 ms
--- 250.0.28.3 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.258/0.320/0.434/0.081 ms
root@ddd943163843:/# nc -l 9876
here come the bits
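Why does that work without any lookup service? The destination host can be computed from the container address alone. Here's a rough sketch of that reverse mapping, again assuming a /8 overlay on a /16 underlay; fan_host_for is a hypothetical helper for illustration, not something the kernel or fanctl exposes.

```shell
# Sketch only: recover the underlay host that owns a fan (overlay) address.
# Assumes a /8 overlay mapped onto a /16 underlay; fan_host_for is made up.
fan_host_for() (
    underlay_prefix=$1   # first two octets of the underlay /16, e.g. 172.30
    container_ip=$2      # overlay address of a container, e.g. 250.0.27.3
    rest=${container_ip#*.}   # strip the overlay's first octet
    host_part=${rest%.*}      # strip the container's own last octet
    echo "${underlay_prefix}.${host_part}"
)

fan_host_for 172.30 250.0.27.3
fan_host_for 172.30 250.0.28.3
```

For the two containers above, this yields 172.30.0.27 and 172.30.0.28, the addresses of the two EC2 instances, so the sending host knows where to tunnel the packet just by looking at the destination.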
Alright, so now let's really bake your noodle...
That 250.0.0.0/8 network can actually be any /8 network. It could be a 10.* network or any other /8 that you choose. I've chosen to use something in the reserved Class E range (240.* through 255.*), so as not to conflict with any other routable network.
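If you want a quick sanity check on a candidate overlay, something like the following sketch can tell you whether it falls in that Class E range. The is_class_e function is a hypothetical helper of mine, and it only looks at the first octet.

```shell
# Sketch only: succeed if a network's first octet is in the Class E range
# (240 through 255). is_class_e is a made-up helper for illustration.
is_class_e() {
    first_octet=${1%%.*}    # everything before the first dot
    [ "$first_octet" -ge 240 ] && [ "$first_octet" -le 255 ]
}

for net in 250.0.0.0/8 10.0.0.0/8; do
    if is_class_e "$net"; then
        echo "$net: Class E"
    else
        echo "$net: not Class E"
    fi
done
```

A 10.* overlay would work just as well for the fan itself; the check only matters if you care about avoiding overlap with addresses you might otherwise route to.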
Finally, let's test the performance a bit using iperf and Amazon's 10 Gbps instances!
So I fired up two c4.8xlarge instances, and configured the fan bridge there.
$ fanctl show
Bridge           Overlay              Underlay             Flags
fan-250-0-28     250.0.0.0/8          172.30.0.28/16       dhcp host-reserve 1
And
$ fanctl show
Bridge           Overlay              Underlay             Flags
fan-250-0-27     250.0.0.0/8          172.30.0.27/16       dhcp host-reserve 1
Would you believe 5.46 Gigabits per second, between two Docker containers on separate hosts, directly addressed over a network? Witness...
Server 1...
root@84364bf2bb8b:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 92:73:32:ac:9c:fe
          inet addr:250.0.27.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::9073:32ff:feac:9cfe/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1480  Metric:1
          RX packets:173770 errors:0 dropped:2 overruns:0 frame:0
          TX packets:107628 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6871890397 (6.8 GB)  TX bytes:7190603 (7.1 MB)
root@84364bf2bb8b:/# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 250.0.27.2 port 5001 connected with 250.0.28.2 port 35165
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  6.36 GBytes  5.46 Gbits/sec
And Server 2...
root@04fb9317c269:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr c2:6a:26:13:c5:95
          inet addr:250.0.28.2  Bcast:0.0.0.0  Mask:255.255.255.0
          inet6 addr: fe80::c06a:26ff:fe13:c595/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1480  Metric:1
          RX packets:109230 errors:0 dropped:2 overruns:0 frame:0
          TX packets:150164 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28293821 (28.2 MB)  TX bytes:6849336379 (6.8 GB)
root@04fb9317c269:/# iperf -c 250.0.27.2
multicast ttl failed: Invalid argument
------------------------------------------------------------
Client connecting to 250.0.27.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 250.0.28.2 port 35165 connected with 250.0.27.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  6.36 GBytes  5.47 Gbits/sec
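If you want to double-check how the Transfer and Bandwidth columns relate, here's a quick back-of-the-envelope sketch. It assumes iperf counts a GByte as 2^30 bytes and a Gbit as 10^9 bits, which is my reading of the numbers above rather than anything stated in the output; gbits_per_sec is a hypothetical helper.

```shell
# Sketch only: convert an iperf-style transfer (GBytes) and interval (seconds)
# into Gbits/sec. Assumes GBytes are binary (2^30) and Gbits are decimal (10^9).
gbits_per_sec() {
    awk -v gbytes="$1" -v secs="$2" \
        'BEGIN { printf "%.2f\n", gbytes * 1024 * 1024 * 1024 * 8 / secs / 1e9 }'
}

# Server-side figures from the run above: 6.36 GBytes in 10.0 seconds.
gbits_per_sec 6.36 10.0
```

That prints 5.46, in line with the server's report. The client's 5.47 is computed by iperf from its own unrounded byte count and timing, so a difference in the last digit is to be expected.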
Multiple containers, on separate hosts, directly addressable to one another with nothing more than a single network device on each host. Deterministic routes. Blazing fast speeds. No distributed databases. No consensus protocols. Not an SDN. This is just amazing!
RFC
Give it a try and let us know what you think! We'd love to get your feedback and use cases as we work the kernel and userspace changes upstream.
Over the next few weeks, you'll see the fan patches landing in Wily, and backported to Trusty and Vivid. We are also drafting an RFC, as we think that other operating systems and the container world and the Internet at large would benefit from Fan Networking.
I'm already a fan!
Dustin