
Quick NAT: High Performance NAT System on Commodity Platforms

Li Junfeng (李峻峰)
January 15, 2019, 15:14 | Source: People's Daily Online Research Institute (人民网研究院)

Abstract. The NAT gateway is an important network system in today's IPv4 networks, translating private IPv4 addresses to public addresses. However, traditional NAT systems based on Linux Netfilter cannot achieve the high throughput demanded by modern environments such as data centers. To address this challenge, we improve the performance of the NAT system in three ways. First, we leverage DPDK to enable polling and zero-copy delivery, reducing the cost of interrupts and packet copies. Second, we let multiple CPU cores process packets in parallel and use a lock-free hash table to minimize contention between cores. Third, we use hash search instead of sequential search when looking up the NAT rule table. Evaluation shows that our Quick NAT system obtains high scalability and line-rate throughput on a commodity server, an improvement of more than 860% over Linux Netfilter.

1 Introduction

The scale of the Internet is ever increasing. Today more than 1 billion hosts are connected to the Internet, and the total number of Internet users exceeds 3.5 billion, nearly half of the world's population, according to ITU [1], The World Bank [2] and Statista [3]. Since IPv4 addresses were exhausted in 2011 [4], the move to IPv6 seems inevitable. However, the transition from IPv4 to IPv6 requires updating not only the Internet infrastructure but also a large number of applications, which faces many obstacles in practice. As a result, IPv4 networks and IPv4 users still dominate today's Internet. In IPv4 networks, the key technology for dealing with address insufficiency is NAT (Network Address Translation). With NAT, many IPv4 devices/users with private IPv4 addresses share the same public IPv4 address, reducing the total number of public IPv4 addresses required. The NAT gateway translates a private IPv4 address to a public one and vice versa. Since the NAT gateway must rewrite every packet, its throughput/delay performance significantly affects the end-to-end performance of the flows crossing it, particularly for large campuses/companies such as data centers with high-volume traffic. Although there are prior works focusing on improving NAT performance, in this paper our interest lies in systematically solving the problem with emerging technologies such as multi-core platforms and user-space network processing. This kind of technology is important not only for NAT gateways but also for similar systems such as NAT-PT, which will matter in a future where IPv4 and IPv6 networks co-exist in the Internet [5].

In the past, most NAT systems were deployed on the Linux platform leveraging the Netfilter framework [6]. Although this may work in small-scale networks, its performance faces significant challenges under high traffic volume. Specifically, for small packets the throughput of a NAT system on commodity servers can hardly exceed 1 Gbps, which leaves a big gap between the system performance and the capability of hardware with 10G/100G NICs and multiple CPU cores. In this work we improve the performance of the NAT system with the following approaches. First, we leverage the Data Plane Development Kit (DPDK) to build the NAT system in user space instead of in the Linux kernel, polling the NIC to read packets directly into user space and thereby eliminating the high overhead caused by packet copies and interrupts; we also manipulate packets through pointers to achieve zero-copy throughout the NAT process. Second, to exploit the multi-core capability of modern commodity servers, we enable Receive-side Scaling (RSS) to let multiple cores process packets in parallel, while minimizing the sharing cost between CPU cores. Third, we find that the algorithms used in today's NAT systems can also be improved: in particular, we use hash-based search instead of sequential search when looking up the NAT rule table, which also considerably improves performance.

Based on the improvements above, we implement a NAT system called Quick NAT. Our experiments show that Quick NAT can obtain line-rate throughput of 10 Gbps for 64-byte packets, an improvement of more than 860% over Linux Netfilter. In fact, it can achieve even higher throughput on servers with more CPU cores and faster network interface cards (NICs).

The rest of the paper is organized as follows. Section 2 provides an overview of the background and related work. Section 3 describes the architecture of Quick NAT and elaborates the methods we use to build it. Section 4 shows experiment results, and we conclude the paper in Section 5.

2 Background and Related Work

2.1 Background

For commodity servers, the NAT function is commonly implemented with Netfilter in the Linux kernel. Netfilter is a packet-filtering framework composed of a set of hooks, with which NAT registers built-in chains to modify packet headers. Figure 1 below shows the NAT chains in Netfilter.

The PREROUTING chain is responsible for DNAT (Destination NAT) on packets that have just arrived at the network interface. The routing decision is made after a packet has passed the PREROUTING chain. For packets destined to the local host, SNAT (Source NAT) is done in the INPUT chain. Forwarded packets pass through the POSTROUTING chain for SNAT and then leave the server. Locally generated packets differ slightly: they pass through the OUTPUT chain for DNAT and then move on to the POSTROUTING chain.

NAT rules are organized within these separate chains in Netfilter. When a packet arrives at a chain, each rule is examined in linear order, and this linear search is costly with a large number of rules. After a rule is found, the packet header is modified according to the rule, and new connection records are added to the connection tracking table [7]. In this way, only the first packet of a flow needs to search the NAT rules; all subsequent packets in either direction can apply the NAT decision made for the first packet without an additional lookup. However, the connection tracking table is shared among multiple cores, incurring high lock overhead.

2.2 Related Work

Since it is well known that Netfilter does not scale well with a large number of small packets, several works attempt to improve Netfilter's performance. Mingming Yang et al. [8] proposed an algorithm to dynamically change the order of rules and of hook functions in Netfilter. Feng Liu et al. [9] built hash tables for the NAT mapping table. However, the performance improvement in these works is limited because they did not replace the linear search over the NAT rule table. Accardi et al. [10] prototyped a hybrid firewall, using a Network Processor (NP) to accelerate Linux Netfilter on a general-purpose server. Nonetheless, they made no modification to Netfilter itself, and the performance gain mainly resulted from special-purpose hardware.

Apart from modifying the Linux kernel, a number of networking frameworks bypass the kernel's protocol stack entirely. PFQ [11], PF_RING [12], PF_RING ZC (Zero Copy) [13], Intel DPDK [14] and Netmap [15] directly map NIC buffers into user-space memory to accelerate packet processing. DPDK has been successfully used to accelerate virtual switching, as in DPDK vSwitch [16]. There is a growing demand for high-speed NAT in data centers with high-volume traffic, but there is no prior work on a high-performance NAT system built on these kernel-bypass frameworks.

3 System Design

To improve the performance of NAT on commodity platforms, we design the Quick NAT system on top of DPDK. The Quick NAT Search (QNS) algorithm is designed to look up NAT rules with O(1) complexity. In addition, Quick NAT leverages a lock-free hash table to reduce locking overhead when sharing NAT mapping records among CPU cores on multicore commodity servers. Moreover, Quick NAT achieves full zero-copy during NAT processing to cut down the overhead of copying.

3.1 System Overview

The architecture of the Quick NAT system is shown in Figure 2. Quick NAT utilizes DPDK to bypass the kernel and run in user space, accessing packets directly from the NIC. It is composed of four components: Connection Tracer, Rule Finder, Tuple Rewriter and IP/Port Pool. Upon receiving a packet, the Connection Tracer first searches for a connection record using hash search. If a connection record is found, the packet header is modified by the Tuple Rewriter according to that record. Otherwise, the Rule Finder uses the QNS algorithm to look up the NAT rule tables with O(1) complexity, and an IP/port pair is picked from the IP/Port Pool according to the matched NAT rule. Quick NAT then rewrites the packet header and installs two connection records: one for the following packets of this flow and one for the reply flow.
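This fast-path/slow-path flow can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: all names, the dictionary-based connection table, and the list-based port pool are hypothetical simplifications of the underlying DPDK data structures.

```python
def process_packet(pkt, conn_table, find_rule, port_pool):
    """One pass through the Quick NAT pipeline (hypothetical names)."""
    # Connection Tracer: a single hash lookup on the flow tuple.
    key = (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])
    rec = conn_table.get(key)
    if rec is None:                          # slow path: first packet of the flow
        rule = find_rule(pkt)                # Rule Finder (QNS, O(1))
        new_ip, new_port = port_pool.pop()   # IP/Port Pool allocation
        rec = {"src": rule.get("to_ip", new_ip), "sport": new_port}
        conn_table[key] = rec                # original-direction record
        # Pre-install the reply-direction record for the return flow.
        conn_table[(pkt["dst"], pkt["dport"], rec["src"], rec["sport"])] = {
            "dst": pkt["src"], "dport": pkt["sport"],
        }
    pkt.update(rec)                          # Tuple Rewriter: modify the header
    return pkt
```

The key point is that only the first packet of a flow pays for the rule lookup and pool allocation; every later packet, in either direction, is resolved by one connection-table hit.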

In the Quick NAT system, we make three major contributions to improve NAT performance. First, we design the QNS algorithm to look up NAT rules with O(1) complexity; it uses hash search instead of sequential search, reducing the time spent searching the NAT rule tables. Second, we use Receive-side Scaling (RSS) to distribute flows across multiple processor cores on multi-core servers; to reduce locking overhead, Quick NAT uses a lock-free hash table to share connection records among CPU cores. Third, Quick NAT enables zero-copy delivery and polling to eliminate the overhead of copies and interrupts.

3.2 QNS Algorithm

Traditional NAT systems leverage the Netfilter framework in the Linux kernel on commodity servers. As NIC speeds increase, the interval between packet arrivals shrinks. For a 10 Gbps NIC, the arrival rate is at most 14.8M packets per second, leaving only 67.2 ns to process each packet. Linux Netfilter cannot handle this network-intensive workload: examining its source code, we found that Netfilter uses a sequential search over NAT rules stored in linear order. In other words, Netfilter does not scale well with a large number of rules.
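The per-packet time budget quoted above follows from simple arithmetic, reproduced below; a minimum-size 64 B Ethernet frame also occupies an 8 B preamble and a 12 B inter-frame gap on the wire.

```python
LINK_BPS = 10_000_000_000           # 10 GbE line rate in bits per second
WIRE_BYTES = 64 + 8 + 12            # frame + preamble + inter-frame gap = 84 B

pps = LINK_BPS / (WIRE_BYTES * 8)   # ~14.88 million packets per second
ns_per_pkt = 1e9 / pps              # ~67.2 ns budget to process each packet
```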

To reduce the overhead of looking up NAT rules, we design the QNS algorithm to search for NAT rules at high speed. It uses hash search instead of sequential search to look up the NAT rule tables with O(1) complexity, reducing the time of NAT rule lookup.

The QNS algorithm plays a key role in the Quick NAT system, and the following steps show how Quick NAT works:

1) Initialize NAT rule tables

During the initialization of the Quick NAT system, we store the NAT rules into small rule tables. That is, Quick NAT sets up small rule tables (32 DNAT rule tables and 32 SNAT rule tables) instead of one big NAT rule table. NAT rules are stored into different small tables according to their subnet mask length and NAT type. Moreover, each small rule table has a one-bit flag indicating whether it contains any rules. Figure 3 illustrates how NAT rules are placed into the different small rule tables.

In Figure 3, the NAT rules are shown on the left and the small NAT rule tables of Quick NAT are on the right. We use one SNAT rule as an example to illustrate how rules are rearranged. The hash function calculates a hash value from the rule's IP address/mask:port and protocol type. Consequently this rule is put into the SNAT rule table for 24-bit subnet masks, and the bucket that stores it is determined by the hash value (654). Each bucket contains a singly-linked list to handle hash collisions. Because this is the first rule in the 24-bit table, we change the table's flag bit from 0 to 1. In this way, all NAT rules are placed into their respective rule tables.
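A minimal sketch of this initialization step, with hypothetical names and an assumed bucket count; as described above, the real system hashes on IP/mask:port and protocol and chains colliding rules within a bucket.

```python
NUM_BUCKETS = 1024  # bucket count is an assumption, not from the paper

def make_rule_tables():
    # One sub-table per subnet-mask length; index 0 is unused, and
    # indices 1..32 correspond to the paper's 32 SNAT (or DNAT) tables.
    return [{"flag": 0, "buckets": [[] for _ in range(NUM_BUCKETS)]}
            for _ in range(33)]

def insert_rule(tables, rule):
    t = tables[rule["masklen"]]          # pick the sub-table by mask length
    h = hash((rule["ip"], rule["port"], rule["proto"])) % NUM_BUCKETS
    t["buckets"][h].append(rule)         # chained list resolves collisions
    t["flag"] = 1                        # mark the sub-table non-empty
```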

2) Search for connection record

In Figure 4, a user sets up a flow to visit a website, and the first packet of this flow arrives at the Quick NAT system running on the commodity server.

Upon receiving a packet, Quick NAT searches the connection records, which are stored in a separate hash table and keep the NAT mappings of flows. Since this packet is the first packet of its flow, no connection record is found.

3) QNS search for NAT rule

Since no connection record is found, Quick NAT uses the QNS algorithm to search the DNAT rule tables first and then the SNAT rule tables. The method of searching the DNAT rule tables is similar to that for the SNAT rule tables, so we only illustrate the SNAT lookup.

For the SNAT rule tables, QNS searches the sub-tables one by one in order of decreasing subnet mask length, because a rule with a longer subnet mask is more specific than one with a shorter mask. As Figure 5 shows, QNS starts with the SNAT rule table for 32-bit subnet masks. For this table, QNS computes a hash value from the packet's IP address/mask and exact port. The hash value is 1288, but no SNAT rule with the exact port and a 32-bit subnet mask is found. QNS then computes a hash value from the IP address/mask and a zero port to search for SNAT rules with a wildcard port; in this case it still finds no matching rule. In all, QNS does not find a matching rule in this sub-table.

In Figure 6, QNS turns to the SNAT rule table for 31-bit subnet masks. Since this table's flag bit is zero, it contains no rules and QNS skips it. For the same reason, it skips the next few sub-tables.

Figure 7 shows that QNS then reaches the rule table for 24-bit subnet masks. Given this table's mask length, QNS applies the mask 255.255.255.0 to the IP address, turning 192.168.88.32 into 192.168.88.0, and then calculates the hash value. Ultimately, it finds an SNAT rule based on the masked IP and the wildcard port. Once a NAT rule is found, QNS stops searching.
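The steps of Figures 5 through 7 amount to the lookup loop below. This is an illustrative Python sketch with hypothetical names: IPs are 32-bit integers, port 0 stands for the wildcard port, and the bucket count is an assumption.

```python
NUM_BUCKETS = 1024  # assumed bucket count

def make_rule_tables():
    # Sub-table index = subnet-mask length (1..32); index 0 unused.
    return [{"flag": 0, "buckets": [[] for _ in range(NUM_BUCKETS)]}
            for _ in range(33)]

def insert_rule(tables, ip, masklen, port, proto, action):
    t = tables[masklen]
    h = hash((ip, port, proto)) % NUM_BUCKETS
    t["buckets"][h].append(
        {"ip": ip, "port": port, "proto": proto, "action": action})
    t["flag"] = 1

def qns_lookup(tables, src_ip, src_port, proto):
    for masklen in range(32, 0, -1):          # most specific mask first
        t = tables[masklen]
        if not t["flag"]:
            continue                           # flag bit 0: skip empty sub-table
        mask = (0xFFFFFFFF << (32 - masklen)) & 0xFFFFFFFF
        masked_ip = src_ip & mask              # e.g. /24 turns .88.32 into .88.0
        for port in (src_port, 0):             # exact port first, then wildcard
            h = hash((masked_ip, port, proto)) % NUM_BUCKETS
            for rule in t["buckets"][h]:
                if (rule["ip"], rule["port"], rule["proto"]) == (masked_ip, port, proto):
                    return rule                # first hit wins; stop searching
    return None
```

Each sub-table costs at most two hash probes, so the whole lookup is bounded by a small constant (64 probes) regardless of the number of installed rules.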

4) Modify the tuple and send out

In Figure 8, the matched rule rewrites the source IP from 192.168.88.32 (matched via 192.168.88.0/24) to 166.111.130.166 and chooses a port from the port pool. After modifying the packet's tuple, Quick NAT sends the packet out to the website server.

Fig. 8. Change the tuple and send out.

5) Install two connection records

Quick NAT installs two new connection records for this flow and its reply flow. As Figure 9 shows, one is the original connection record and the other is the reply connection record, which is pre-installed to handle the reply flow. Each connection record contains the initial tuple and the revised tuple to record the NAT mapping of the flow. In addition, each record has a timestamp field that is refreshed whenever a packet matches the record. A dedicated core periodically deletes timed-out records and reclaims their IP/port pairs.
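The timeout handling mentioned above can be sketched as follows. The structure is hypothetical: each record carries a last-match timestamp `ts`, original-direction records own the allocated IP/port pair under `alloc`, and the 300 s idle timeout is an assumption (the paper does not state a value).

```python
import time

IDLE_TIMEOUT = 300.0  # assumed idle timeout in seconds

def sweep_connections(conn_table, port_pool, now=None):
    """Run periodically on a dedicated core: drop idle records, reclaim ports."""
    now = time.monotonic() if now is None else now
    expired = [k for k, rec in conn_table.items()
               if now - rec["ts"] > IDLE_TIMEOUT]
    for key in expired:
        rec = conn_table.pop(key)
        if "alloc" in rec:                   # only one record of the pair owns it
            port_pool.append(rec["alloc"])   # return the IP/port pair to the pool
```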

6) Deal with the reply packet

In Figure 10, the website server sends back its response after receiving the request from the user. Thanks to the pre-installed reply connection record, Quick NAT recognizes this response packet and revises its destination IP. After revising the tuple, it sends the modified packet out.

3.3 Lock-free Sharing Among CPU Cores

As it becomes more difficult to increase the clock speed of a single core, multi-core processors have been pursued to improve overall processing performance. To utilize multi-core processors effectively, we leverage RSS to distribute flows across the CPU cores of a multi-core platform. However, this alone is not enough to improve throughput, because the gain from a multi-core processor depends heavily on the software algorithms and their implementation. For Netfilter, one source of high overhead is that the lock on the connection record table is contended by multiple threads.

In Quick NAT, the connection record table is built as a lock-free hash table shared among CPU cores. Based on the compare-and-swap (CAS) atomic instruction, the lock-free hash table eliminates lock overhead and makes sharing connection records on a multicore server more efficient and scalable.
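The CAS-based insertion can be illustrated with the classic lock-free list push. This is only a sketch: Python has no true atomic CAS, so `Cell.cas` merely emulates the semantics, whereas a real implementation would issue a hardware compare-and-swap on a bucket's head pointer.

```python
class Cell:
    """Emulated atomic cell; real code uses a CPU CAS instruction."""
    def __init__(self, value=None):
        self.value = value

    def cas(self, expected, new):
        if self.value is expected:   # atomically: swap only if unchanged
            self.value = new
            return True
        return False                 # another core won the race; caller retries

class Node:
    def __init__(self, record, nxt=None):
        self.record = record
        self.next = nxt

def lockfree_push(head, record):
    node = Node(record)
    while True:                      # CAS retry loop instead of a lock
        old = head.value
        node.next = old
        if head.cas(old, node):      # fails only if a concurrent push slipped in
            return
```

Because a failed CAS simply retries, no core ever blocks while holding a lock, which removes the contention Netfilter suffers on its shared connection tracking table.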

3.4 Polling and Zero-copy Packet Delivery

Traditionally, interrupts are used to notify the operating system when a packet is ready for processing. However, interrupt handling is expensive: when the packet reception rate rises, the throughput of commodity servers may drop considerably in such systems [17]. To overcome this problem, we take advantage of Intel's DPDK to poll the NIC for data. The poll mode driver provided by DPDK allows user-space applications to access the RX and TX descriptors directly without interrupts, eliminating the overheads of interrupt-driven packet processing.

Besides using polling instead of interrupts, Quick NAT achieves full zero-copy when processing packets in user space. It manipulates packets through pointers, with no copy operations during NAT processing, increasing packet-processing throughput.
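In-place header rewriting is what makes zero-copy possible: the translator mutates the packet buffer through a pointer rather than building a new packet. The sketch below shows the idea on a raw IPv4 header held in a `bytearray`; this is a simplification, since Quick NAT operates on DPDK mbufs, and the full-header checksum recomputation here could be replaced by the incremental update of RFC 1624.

```python
def ipv4_checksum(hdr: bytearray) -> int:
    """One's-complement sum over the 20-byte header, skipping the checksum field."""
    s = 0
    for i in range(0, 20, 2):
        if i != 10:                          # bytes 10-11 hold the checksum
            s += (hdr[i] << 8) | hdr[i + 1]
    while s >> 16:
        s = (s & 0xFFFF) + (s >> 16)         # fold the carries back in
    return ~s & 0xFFFF

def rewrite_src_in_place(hdr: bytearray, new_src: bytes) -> None:
    """Mutate the buffer directly; no copy of the packet is made."""
    hdr[12:16] = new_src                     # source address sits at offset 12
    c = ipv4_checksum(hdr)
    hdr[10], hdr[11] = c >> 8, c & 0xFF      # store the refreshed checksum
```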

4 Evaluation

Quick NAT achieves high-speed NAT rule lookup with the QNS algorithm and shares connection records among CPU cores efficiently. In this section, we evaluate Quick NAT with the following goals:

• Show the performance of Quick NAT compared with that of Linux Netfilter,

• Analyze the scalability of Quick NAT on the multicore commodity platform,

• Show the throughput of Quick NAT under different packet sizes, and

• Demonstrate Quick NAT's ability to search for NAT rules at high speed.

4.1 Experimental Set-up

In our experiments, we use three servers with Intel Core i7-5930K CPUs @ 3.5 GHz (6 cores): one hosts the Quick NAT system under test, and the other two act as traffic generator and receiver. Each server has an Intel 82599ES 10G dual-port NIC and 16 GB of memory. Ubuntu 16.04 (kernel 4.4.0) and DPDK 16.11 are used. We reserve 8 GB for huge pages, following Intel's performance reports. DPDK-based traffic generators are available, such as Wind River's DPDK-pktgen [18], which we use to generate traffic at line rate.

4.2 Experiment Topology

Figure 11 shows the experiment topology. There are three commodity servers: the Quick NAT system is installed on server B, while servers A and C act as the packet sending/receiving engines.

4.3 Results

We perform a number of measurements to understand the throughput of Quick NAT and the performance of the QNS algorithm.

1) Throughput

In this experiment, we first add 1k rules to the Quick NAT system, then set up 8k flows and vary the packet size to measure the performance of Quick NAT and Netfilter under different packet sizes.

The result is shown in Figure 12. Quick NAT reaches close to line rate even for minimum-sized packets, while Netfilter achieves only a fraction of line rate for small packets and at most 9.3 Gb/s for packets larger than 800 B.

2) Scalability

We add 1k rules and set up 8k flows with 64B packets. For this experiment, we change the number of CPU cores used by Quick NAT and Netfilter and record the throughput.

Figure 13 shows that the throughput of Netfilter increases only slowly with the number of CPU cores. In contrast, the performance of Quick NAT scales linearly with the number of cores and achieves an improvement of more than 860% over Linux Netfilter. The reason is that Quick NAT uses RSS to distribute flows efficiently across CPU cores and leverages the lock-free hash table to eliminate the overhead of sharing connection records among cores.

3) QNS Performance

We conduct an experiment to measure the lookup time of the QNS algorithm in Quick NAT and of the linear search algorithm that Netfilter uses. To start with, we add 100 rules to the rule hash tables and then use the QNS algorithm to look up different NAT rules 100 times, computing the mean search time. Then we create a linked list storing the same rules and use linear search in the same way. In addition, we scale the number of rules up to 10k and repeat the experiment many times.

Table I shows that it takes about 43 ns for the QNS algorithm to find a rule, and that the number of rules has no effect on QNS performance because QNS is based on hash search with O(1) complexity. In contrast, linear search over the rules in Netfilter is time-consuming, especially when the number of rules is large.

5 Conclusion

This paper presents Quick NAT, a high-performance NAT system that runs at line rate on commodity hardware. First, the QNS algorithm is designed to search for NAT rules with O(1) complexity; it is based on hash search instead of sequential search, which greatly reduces the time to look up the NAT rule tables. Furthermore, we use Receive-side Scaling (RSS) to distribute flows across multiple CPU cores; to reduce lock overhead, Quick NAT leverages a lock-free hash table to share connection records efficiently between cores. Last but not least, Quick NAT takes advantage of the Data Plane Development Kit (DPDK) to run in user space, enabling zero-copy delivery and polling to cut down the overhead of copies and interrupts.

We implement Quick NAT on commodity servers. The experiments show that Quick NAT obtains line-rate throughput for 64 B packets, an improvement of more than 860% over Linux Netfilter. Moreover, the performance of Quick NAT increases linearly with the number of cores, so Quick NAT provides high scalability.

Although Quick NAT achieves line rate, there is still work to do. First, we plan to implement Quick NAT in virtual machines to support cloud computing environments and gain more flexibility and manageability. In addition, we will jointly redesign the IP/port allocation algorithm and the RSS hash algorithm to achieve full localization of flows, avoiding the sharing of connection records among CPU cores and achieving even higher performance and scalability on multicore platforms.
