Jia Yu's Research

NePSim: A Network Processor Simulator

A network processor (NP) is a new breed of microprocessor that integrates a parallel processing design on a single chip to handle complex algorithms, deep packet inspection, traffic management, and packet forwarding at wire speed. It offers both high performance and programmability, a combination that neither ASICs nor general-purpose CPUs can achieve. Typical NPs employ multithreaded parallel processing elements (PEs) to keep up with rapidly growing Internet packet processing demands. The PEs can be programmed in a parallel or pipelined fashion, depending on the nature of the processing tasks.

There is increasing interest in NP architecture design for better performance and energy efficiency. However, there has been no open-source simulation infrastructure that makes the performance/power tradeoffs in NPs clearly visible to computer architects. NePSim [1] is the first open-source integrated infrastructure for analyzing and quantifying NP performance and power dissipation at the architecture level. NePSim contains a cycle-accurate simulator for a typical NP architecture (Intel’s IXP series), an automatic verification framework for testing and validation, and a power estimation model for measuring the power consumption of the simulated NP.

NePSim 1.0 complies with Intel’s IXP1200 specification, since the IXP1200 is widely adopted in academia as a representative model for NP research. The simulator infrastructure is illustrated in Figure 1. The simulator implements most of the functionality of the six multithreaded PEs (i.e., Microengines), the memory hierarchy (SRAM, DRAM, etc.), a full-fledged command bus arbiter, and the device interfaces (input/output ports, MAC, etc.). The “DLite” module serves as a debugger: users can set breakpoints, print pipeline status, display register values, and dump memory contents. The simulator also incorporates a power estimator that estimates the power consumption of the whole chip. NePSim 1.0 is validated against the IXP1200 architecture by a backend tool called IVERI [2][3]. IVERI processes the standard Intel SDK traces and the NePSim traces, and checks whether a pre-defined property is violated. We observed an average error of 1% in throughput and 6% in average processing time across a set of network benchmarks. These error rates suggest that the simulator produces reasonably dependable results.

NePSim 2.0 complies with Intel’s second-generation network processor family, which includes the IXP2400 and IXP2800 (IXP2xxx for short). The IXP2xxx incorporates Hyper Task Chaining technology, an approach that allows a single-stream packet/cell processing problem to be decomposed into multiple sequential tasks that can be easily linked together. In addition, the hardware and instruction set are significantly enhanced. For example, the number of PEs on chip is increased from 6 to 16, the PEs have more flexible mechanisms for communication among processes (e.g., Next Neighbor registers and Content Addressable Memory), and the memory hierarchy is augmented with on-chip local memory and off-chip RDRAM. NePSim 2.0 is currently under development.

Through a performance/power study, we observed that an NP’s power consumption increases faster than its performance, so low-power techniques will be critical for future NP designs. We proposed two schemes to reduce power dissipation: dynamic voltage scaling (DVS) and clock gating [1][4]. DVS exploits the variance in PE utilization, reducing voltage and frequency when the processor has low activity and increasing them when peak processor performance is required. DVS can save up to 17% of power consumption with less than 6% performance loss. Clock gating turns off a subset of PEs when the packet processing requirement is low and turns them back on when the requirement is high. Clock gating saves power at a coarse granularity and is particularly useful when the network traffic volume has high variance. With real-world network traces, our experiment (Figure 2) showed that the clock gating scheme can reduce power consumption by up to 30% with no packet loss and little impact on overall throughput.
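
The two schemes can be illustrated with a minimal Python sketch. It is not the policy implemented in NePSim or evaluated in [1][4]: the operating points, the utilization thresholds, the per-PE capacity, and the function names are all hypothetical, and the power estimate relies only on the standard approximation that dynamic power scales with V^2 * f.

```python
import math

# Hypothetical (voltage in V, frequency in MHz) operating points, lowest first.
# Illustrative values only; they are not taken from any IXP datasheet.
LEVELS = [(1.0, 400), (1.1, 500), (1.3, 600)]


def next_level(level, utilization, low=0.5, high=0.9):
    """DVS sketch: pick the next operating point from PE utilization.

    Steps down one level when utilization over the last window is below
    `low`, steps up one level when it is above `high`, otherwise holds.
    The thresholds are assumptions chosen for illustration.
    """
    if utilization < low and level > 0:
        return level - 1
    if utilization > high and level < len(LEVELS) - 1:
        return level + 1
    return level


def relative_dynamic_power(level):
    """Dynamic power relative to the highest level, using P ~ V^2 * f."""
    v, f = LEVELS[level]
    v_max, f_max = LEVELS[-1]
    return (v * v * f) / (v_max * v_max * f_max)


def active_pes(arrival_rate, per_pe_capacity, total_pes=6):
    """Clock gating sketch: fewest PEs able to absorb the arrival rate.

    Both rates use the same (hypothetical) unit, e.g. packets per second.
    At least one PE always stays on; the rest are gated off.
    """
    needed = math.ceil(arrival_rate / per_pe_capacity)
    return max(1, min(total_pes, needed))
```

For instance, under these assumed numbers, `active_pes(250, 100)` keeps three of the six PEs clocked, and `next_level(2, 0.3)` steps the voltage/frequency pair down one level. A real controller would also need to account for the latency and energy cost of each transition, which this sketch ignores.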

While cycle-accurate simulation tools have been widely used to measure chip performance and power, this approach is increasingly hindered by the growing simulation complexity of multi-core, multithreaded architectures. Because of the specialized nature of NP applications, existing simulation acceleration methods cannot be applied to NP simulation without modification. We proposed a new scheme [5] that uses stratified random sampling to choose a reduced, representative input trace for NP simulation. Our experiments showed that this approach reduces simulation time by an order of magnitude for seven NP benchmarks, with the error bounded within 3% at 95% confidence.
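
The general idea of stratified random sampling over a packet trace can be sketched as follows. This is only an illustration of the technique, not the scheme from [5]: the stratification criterion (packet size class), the proportional allocation, and all names here are assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(packets, stratum_of, fraction, seed=0):
    """Return a reduced trace sampled proportionally from each stratum.

    packets    -- the full input trace, as a list
    stratum_of -- hypothetical function mapping a packet to a stratum key
    fraction   -- share of each stratum to keep (0 < fraction <= 1)
    """
    rng = random.Random(seed)

    # Group packet positions by stratum.
    strata = defaultdict(list)
    for i, packet in enumerate(packets):
        strata[stratum_of(packet)].append(i)

    # Draw a simple random sample inside every stratum, keeping at least
    # one packet so that no stratum disappears from the reduced trace.
    chosen = []
    for key in sorted(strata):
        indices = strata[key]
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))

    # Restore the original arrival order of the selected packets.
    return [packets[i] for i in sorted(chosen)]


# Usage: a toy trace with 80 small and 20 large packets, reduced to 10%.
trace = [{"size": 64}] * 80 + [{"size": 1500}] * 20
size_class = lambda p: "small" if p["size"] < 512 else "large"
reduced = stratified_sample(trace, size_class, fraction=0.1)
```

Sampling within strata, rather than from the whole trace at once, keeps the mix of packet classes in the reduced trace close to that of the original, which is what makes the shorter simulation representative.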

Figure 1. NePSim software structure.

Figure 2. Power saving vs. packet arrival rate using the clock gating low-power technique on an NP.

References:

[1] Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, “NePSim: A Network Processor Simulator with Power Evaluation Framework,” IEEE Micro, Sept/Oct 2004

[2] Xi Chen, Yan Luo, Harry Hsieh, Laxmi Bhuyan, F. Balarin, “Utilizing Formal Assertions for System Design of Network Processor,” Design Automation and Test in Europe (DATE), 2004

[3] Jia Yu, Wei Wu, Xi Chen, Harry Hsieh, Jun Yang, F. Balarin, “Assertion-Based Automatic Design Exploration of DVS in Network Processor Architectures,” Design Automation and Test in Europe (DATE), 2005

[4] Yan Luo, Jia Yu, Jun Yang, Laxmi Bhuyan, “Low Power Network Processor Design Using Clock Gating,” 42nd Design Automation Conference (DAC), 2005

[5] Jia Yu, Jun Yang, Shaojie Chen, Yan Luo, Laxmi Bhuyan, “Enhancing Network Processor Simulation Speed with Statistical Input Sampling,” International Conference on High Performance Embedded Architectures & Compilers (HiPEAC), 2005