Power-Energy Simulation for Multi-Core Processors in Benchmarking

ARTICLE INFO
Article history: Received: 21 December, 2016 Accepted: 19 January, 2017 Online: 28 January 2017

ABSTRACT
At the microarchitectural level, a multi-core processor, as a complex System on Chip, has sophisticated on-chip components including cores, shared caches, interconnects and system controllers such as memory and Ethernet controllers. At the technological level, architects should consider the device types forecast in the International Technology Roadmap for Semiconductors (ITRS). Energy simulation enables architects to study two important metrics simultaneously: timing and power. Timing is a key element of CPU performance that imposes constraints on the CPU target clock frequency. Power and the resulting heat impose more severe design constraints, such as core clustering, while the semiconductor industry provides more transistors per die area in pace with Moore’s law. Energy simulators offer a solution to this challenge. Energy is modelled either by combining a performance benchmarking tool with a power simulator or by an integrated framework of both a performance simulator and a power profiling system. This article presents and assesses trade-offs between different architectures using four-core battery-powered mobile systems running a custom-made and a standard benchmark tool. The experimental results confirm the Energy/Frequency convexity rule over a range of frequency settings with different numbers of enabled cores. The reported results show that increasing the number of cores substantially increases the power consumption. However, the minimum energy dissipation then occurs at a lower frequency, which reduces the power consumption. In addition, increasing the number of cores also increases the effective cores value, which reflects better processor performance.


Introduction
Microprocessor performance has driven its industry for four decades. Reducing power consumption has become a stringent design principle, especially for battery-driven devices. Because low-power constraints and energy-efficiency requirements limit further increases in CPU clock frequency, improving microprocessor performance in the next generation has become a real challenge. Other aspects, such as the microprocessor architecture (instruction set) and compiler optimizations, therefore have to be considered in order to optimize the offered workload. In addition, other factors in the microprocessor hardware implementation, such as using many cores, must be taken into account in order to reduce the execution time of this workload. In this paper, we make the case for exploring the trade-off between low power and energy efficiency over a wide range of clock frequencies. The experiments in [1] were carried out on different battery-powered Laptops and Smartphones using a single core. We face two problems: the choice of power measurement tools and the choice of performance benchmark tools. An accurate and reliable power measurement tool has to be selected so that it runs on the Linux platform for Laptop devices, like Powerstat (Power consumption calculator for Ubuntu Linux. Available: http://www.hecticgeek.com), or on the Android platform for Smartphones, like Powertutor [2].

* Mona A. Abou-Of, mona.abouof@pua.edu.eg
MA. Abou-Of et al. / Advances in Science, Technology and Engineering Systems Journal Vol. 2, No. 1, 255-262 (2017)
To capture the transitions between power states, two different finite state machine (FSM) based power modeling schemes [3] are implemented. The standard CoreMark benchmark (Industry-standard benchmarks for embedded systems. Available: http://www.eembc.org/coremark), executed on Linux OS, represents a disk with tail power state model that writes the running power to a disk file and stays in the high power state for a period after the active I/O activity. The custom-made Fibonacci benchmark, written in Java on Android, represents a free model that returns to the base state without an inactivity period.
In summary, this paper makes the following contributions:
• We carry out laboratory experiments to explore the relationship between processor performance, power consumption and energy efficiency over a range of clock frequencies with different numbers of enabled cores.
• We present the experimental setup used to obtain reliable results.
• We present a detailed implementation on different Laptop and Smartphone operating systems.
• The plotted results confirm that, even with different workloads, a minimum energy dissipation is always reached at a certain clock frequency, albeit with limited performance, lower power consumption and no optimization applied.
• We confirm experimentally the Energy/Frequency convexity rule [4] on multi-core (instead of single-core) processors.
• Such observations can be fed into an intelligent DVFS scheduler in the power management module of an operating system on multi-core processors, which can achieve energy and power savings without impacting the performance.
• We show that increasing the number of cores substantially increases the power consumption. However, the minimum energy dissipation then occurs at a lower frequency, which reduces the power consumption. In addition, increasing the number of cores also increases the effective cores value, which reflects better processor performance.
The rest of this paper is organized as follows: Section 2 presents the existing energy modeling approaches. Section 3 formulates the problem with some equations. In Section 4 the experimental results are evaluated and analyzed. Finally, Section 5 concludes the paper.
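The two power-state models mentioned in this introduction (the tail model and the free model) can be illustrated with a small sketch. This is not the FSM scheme of [3]; it is a minimal two-level model under stated assumptions: time advances in fixed ticks, the tail state draws the same power as the active state, and the function name `power_trace` and its parameters are hypothetical.

```python
def power_trace(activity, base_w, active_w, tail_ticks):
    """Per-tick power for a two-level FSM with an optional tail.

    activity   : iterable of booleans, True while the workload is active
    base_w     : power in the base state (watts, illustrative value)
    active_w   : power in the high power state (watts, illustrative value)
    tail_ticks : ticks spent at the high power level after activity ends
                 (0 gives the "free" model, >0 gives the "tail" model)
    """
    trace = []
    remaining_tail = 0
    for busy in activity:
        if busy:
            remaining_tail = tail_ticks  # re-arm the tail on every active tick
            trace.append(active_w)
        elif remaining_tail > 0:
            remaining_tail -= 1          # inactive, but still in the tail state
            trace.append(active_w)
        else:
            trace.append(base_w)         # back in the base state
    return trace


if __name__ == "__main__":
    activity = [False, True, True, False, False, False]
    print("free model:", power_trace(activity, 7.5, 9.0, tail_ticks=0))
    print("tail model:", power_trace(activity, 7.5, 9.0, tail_ticks=2))
```

With `tail_ticks=0` the trace drops to the base level as soon as the activity stops, which corresponds to the free model; a positive `tail_ticks` keeps the high level for that many extra ticks, which corresponds to the tail model.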

Related Work
Most existing system energy modeling approaches combine power profiling systems with performance benchmark tools. SPEC has developed SPECpower_ssj2008 (S.P.E. Corporation.
McPAT [5] is a fully-integrated power, area and timing modeling framework. It models all types of power dissipation and provides an integrated solution for multithreaded and multi-core processors. McPAT power modeling is combined with Sniper performance simulation in [6].

Power Profiling Systems
Existing power measurement methods are limited in two ways.
First, some systems [3,7,8] and the Monsoon power monitor (Available: http://www.msoon.com/LabEquipment/PowerMonitor) generate their models by using external hardware lab equipment such as sensors, meters, and data acquisition devices. Second, other systems, such as Powerstat and [2,9,10], are self-modeling: they construct their models without external circuitry, using built-in battery sensors or the smart battery interface fuel gauge IC, or reading system files available on mobile systems. Integrated sensors are provided on CPUs [11] such as Intel processors [12] and AMD processors [13], on GPU cards [14], or on motherboards equipped with a Baseboard Management Controller (BMC) monitoring chip [15]. Some of these systems are event-based, as in [3,5], or provide per-component power measurements in addition to the total power, as in [5,7,9,16,17]. Others model power measurements per application, as in [2]. Industry simulators are typically cycle-accurate and run at a speed of 1 to 10 kHz. Academic simulators, such as [18,19], are not truly cycle-accurate compared to real hardware and are therefore faster, with simulation speeds in the range of tens to hundreds of KIPS (kilo simulated instructions per second). They do not scale well to large multi-core systems.

Performance Benchmark Tools
The SPECpower_ssj2008 benchmark and the Apache benchmarking tool (ab - Apache benchmarking tool. Available: http://httpd.apache.org/docs/2.2/programs/ab.html) are used for HTTP server traffic. SPECpower_ssj2008 is the first industry-standard SPEC benchmark that evaluates the power and performance characteristics of volume server class and multi-node class computers. A benchmark widely used in industry and academia is SPEC CPU2006 [20]. EEMBC has benchmarks for general-purpose performance analysis including CoreMark, MultiBench (multicore), and FPMark (floating-point).

Problem Formulation
The basic relationships among computer performance, power consumption and energy efficiency are expressed as follows:

As shown in [21], the power consumed by a processor is directly proportional to the clock frequency ($f$). In order to study the impact of clock speed on the processor performance without DVFS scheduling, the CPU execution time ($t_x$) is computed as:

$$t_x = \text{Instruction Count} \times CPI \times T_{cycle} \qquad (4)$$

where $T_{cycle}$ equals $1/f$ and $CPI$ is the average number of cycles per instruction, i.e. $t_x$ is a function of $1/f$, and improving the performance requires decreasing $t_x$ by speeding up the CPU frequency. Equivalently, in the case of a single core,

$$t_x = \frac{\text{Number of clock cycles}}{f} \qquad (5)$$

and in the case of multi-core,

$$t_x = \frac{\text{Number of clock cycles}}{c_e \, f} \qquad (6)$$

where $c_e$ is the effective cores parameter, which reflects the degree of parallelization achieved during execution.

Equation (3) shows that, in order to minimize the energy, power should be reduced. This can be achieved by using a low clock frequency. On the other hand, reducing $t_x$ requires a high clock frequency. This trade-off between lower power and better performance leads to the existence of an optimum point of minimal energy usage, with a limited performance improvement, at a specific CPU frequency ($f_m$). The goal of the experiments presented in this paper is to search for this minimal energy as the CPU frequency is varied and to find the optimum frequency $f_m$ for a varied number of cores.
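To make the trade-off concrete, the following is a sketch of why an interior energy minimum can exist; it is not the paper's own derivation. It assumes that energy is the product of average power and execution time, and it uses a hypothetical power model with a frequency-independent term $P_0$ plus a dynamic term per enabled core that grows as $f^{\alpha}$ with $\alpha > 1$ (for example because the supply voltage is scaled together with the frequency). The symbols $P_0$, $k$, $\alpha$, $n$ (number of enabled cores) and $N$ (number of clock cycles) are illustrative and do not appear in the original text.

```latex
\begin{align*}
E(f) &= P(f)\, t_x(f), \qquad
P(f) = P_0 + n\,k\,f^{\alpha}, \qquad
t_x(f) = \frac{N}{c_e\, f} \\
E(f) &= \frac{N}{c_e}\left(\frac{P_0}{f} + n\,k\,f^{\alpha-1}\right) \\
\frac{dE}{df} &= \frac{N}{c_e}\left(-\frac{P_0}{f^{2}} + (\alpha-1)\,n\,k\,f^{\alpha-2}\right) = 0
\;\Longrightarrow\;
f_m = \left(\frac{P_0}{(\alpha-1)\,n\,k}\right)^{1/\alpha} \\
\frac{d^{2}E}{df^{2}} &= \frac{N}{c_e}\left(\frac{2P_0}{f^{3}} + (\alpha-1)(\alpha-2)\,n\,k\,f^{\alpha-3}\right) > 0
\quad \text{for } \alpha \ge 2,\ f > 0
\end{align*}
```

Under these assumptions the energy curve is convex in $f$, and $f_m$ falls as $n$ grows, while $c_e$ scales the whole curve without moving its minimum.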

Experimental Setup
The presented experiments measure the power and the execution time while running different workloads on specific Dynamic Voltage and Frequency Scaling (DVFS) mobile system settings over a 0.6 GHz to 1.7 GHz range of CPU frequencies.
Varying the CPU frequency settings requires the CPU frequency information of the mobile device used. These settings require resetting the power management policy, disabling some cores, and setting each enabled core to one of its frequency values together with its upper frequency limit. The experiments are implemented on three different battery-powered mobile systems shown in Table 1: two Intel Laptops (Acer and Dell) running Ubuntu and one ARM Smartphone (Samsung A5) running Android. They were carried out in [1] and are extended to multi-core on the two Laptops only. The offered workloads were CoreMark, the standard benchmark tool, for the Laptops and a custom-made Fibonacci benchmark for the Smartphone and also for the Acer and Dell Laptops. The Fibonacci benchmark is implemented iteratively in Java for 2E8 iterations. The execution time is measured by these performance benchmark tools. The power consumed by the benchmark tools is measured by different power profiling systems: Powerstat on Linux and Powertutor [2] on Android. Both systems use the built-in smart battery interface to measure power at a rate of 1 Hz while the battery is discharging. Powerstat measures the total power, while Powertutor also measures the individual power per application. Both power profiling systems have to run for at least one minute before the performance benchmark tools are started, giving the power a chance to stabilize.
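On a Linux laptop, the settings described above (policy reset, core disabling, fixed frequency with a matching upper limit) are commonly applied through the kernel's cpufreq and CPU hotplug files under sysfs. The sketch below only builds the list of writes for one experiment setting; it is an illustration, not the procedure used by the authors. It assumes a cpufreq driver that offers the `userspace` governor (some drivers do not), that core 0 cannot be taken offline, and that the requested frequency is not below the current `scaling_min_freq`. The function names `dvfs_write_plan` and `apply_plan` are hypothetical.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")  # standard Linux sysfs location


def dvfs_write_plan(enabled_cores, total_cores, freq_khz, governor="userspace"):
    """Return the ordered (path, value) sysfs writes for one experiment setting.

    Assumptions (not from the paper): the cpufreq driver accepts the chosen
    governor, and cpu0 stays online because it usually has no 'online' file.
    """
    if not 1 <= enabled_cores <= total_cores:
        raise ValueError("enabled_cores must be between 1 and total_cores")
    plan = []
    # 1) Bring the wanted cores online and take the others offline.
    for cpu in range(1, total_cores):
        state = "1" if cpu < enabled_cores else "0"
        plan.append((CPU_ROOT / f"cpu{cpu}" / "online", state))
    # 2) Pin every enabled core: policy, upper limit, then target speed.
    for cpu in range(enabled_cores):
        freq_dir = CPU_ROOT / f"cpu{cpu}" / "cpufreq"
        plan.append((freq_dir / "scaling_governor", governor))
        plan.append((freq_dir / "scaling_max_freq", str(freq_khz)))
        plan.append((freq_dir / "scaling_setspeed", str(freq_khz)))
    return plan


def apply_plan(plan):
    """Write each value to its sysfs file (needs root on a real system)."""
    for path, value in plan:
        path.write_text(value)


if __name__ == "__main__":
    # Dry run: show what would be written for 2 of 4 cores at 1.6 GHz.
    for path, value in dvfs_write_plan(enabled_cores=2, total_cores=4, freq_khz=1_600_000):
        print(f"{value:>10} -> {path}")
```

Separating the plan from its application lets the sequence be inspected without root privileges; `apply_plan` would then be run once per frequency and core-count combination before starting the power profiler.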

How to measure power?
For Laptops with Linux platforms, Powerstat is used to measure the power consumed by the running CoreMark. Two factors are considered: Powerstat measures the total power of the Laptop, and CoreMark is a disk with tail power-state model. Steps to measure the power consumed by CoreMark ($P_c$):
10. Compute $P_c = P_t - P_s$.
11. Repeat steps 5 to 10 in order to get 10 batches and take the average $P_c$.
12. Repeat steps 5 to 11 with all available CPU frequencies.
13. Repeat steps 3 to 12 for each number of enabled cores $i$ (1 to 4).

A sample output of the power measured by Powerstat with DVFS scheduling, and another with a fixed 1.6 GHz CPU frequency setting, are shown by the Instantaneous Power Profiles in Fig. 1. The resulting power profile shows that the power with DVFS scheduling returns to the base state (7.5 watts) 30 seconds earlier than the one with the fixed 1.6 GHz CPU frequency setting and also drops by about 0.7 watts. DVFS scheduling therefore saves about 30 s × 0.7 W, or 21 joules.

For Smartphones with Android platforms, Powertutor is used for power measurement. Referring to the steps described above for measuring the power consumed by CoreMark, apply the first 7 steps, replacing Powerstat with Powertutor and CoreMark with the Fibonacci Java code. There is no need to compute the average consumed power $P_c$ for the benchmark, since Powertutor measures the power of each individual application separately and registers it in its log file. Then, repeat steps 5 to 7 with all available frequencies and cores.
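The arithmetic in the surviving steps and in the energy-saving estimate above can be sketched as follows. Because the steps that define the two readings are not available in the text, the sketch assumes that $P_t$ is the total power read while the benchmark runs and $P_s$ is the power read without it; the function names and the batch values in the example are hypothetical and are not measurements from the paper.

```python
from statistics import mean


def benchmark_power(total_w, base_w):
    """P_c = P_t - P_s for one batch (all values in watts)."""
    return total_w - base_w


def average_benchmark_power(batches):
    """Average P_c over an iterable of (P_t, P_s) pairs, one pair per batch."""
    return mean(benchmark_power(p_t, p_s) for p_t, p_s in batches)


def energy_joules(power_w, seconds):
    """Energy in joules for a constant power held over a time interval."""
    return power_w * seconds


if __name__ == "__main__":
    # Hypothetical readings for two batches; a real run would use 10 batches.
    batches = [(10.0, 7.5), (11.0, 7.5)]
    print("average P_c (W):", average_benchmark_power(batches))
    # The saving quoted in the text: about 0.7 W for 30 s.
    print("energy saved (J):", energy_joules(0.7, 30))
```

Averaging over batches smooths the 1 Hz sampling noise of the battery interface, which is the stated reason for repeating the measurement.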

Experimental Results and Analysis
This section illustrates the relationship between the CPU execution time, the power consumption, and the dissipated energy over a 0.6 GHz to 1.7 GHz range of CPU frequencies. The results of the sixteen experiments confirm that the processor power is proportional to the CPU frequency [21]. In addition, increasing the number of enabled cores also increases the power. Figures 2, 3, 4 and 5 all illustrate that increasing the frequency decreases the execution time while increasing the power consumed by the processor. They also show that increasing the number of cores substantially increases the power consumption. So design factors other than clock speed have to be considered to achieve low power. In the case of multi-core processors, increasing the number of enabled cores shifts $f_m$ to a lower frequency and reduces the power, but increases the $c_e$ value, which reflects better performance.

Figure 5: Running Fibonacci Benchmark on Dell Laptop.

Although increasing the number of cores increases the power consumption, there is always an optimal frequency for minimum energy.
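The search for the optimum frequency described in this section can be sketched as a sweep that computes energy as power times execution time at each frequency and picks the minimum. The data below are generated from a hypothetical model, not from the paper's measurements: a base power of 7.5 W (the base state quoted for the power profile), an assumed cubic dynamic term per enabled core, and an assumed effective-cores value that grows with the number of cores. All parameter names and default values are illustrative.

```python
def energy_sweep(freqs_ghz, cores, p_base_w=7.5, k_w_per_ghz3=1.0,
                 cycles_g=100.0, parallel_eff=0.9):
    """Return [(f, power_W, time_s, energy_J)] for a hypothetical model.

    power = p_base + cores * k * f**3        (assumed cubic dynamic term)
    c_e   = 1 + parallel_eff * (cores - 1)   (assumed effective cores)
    time  = cycles / (c_e * f)               (giga-cycles over GHz gives seconds)
    """
    c_e = 1 + parallel_eff * (cores - 1)
    rows = []
    for f in freqs_ghz:
        power = p_base_w + cores * k_w_per_ghz3 * f ** 3
        time_s = cycles_g / (c_e * f)
        rows.append((f, power, time_s, power * time_s))
    return rows


def optimum_frequency(rows):
    """Frequency of the row with the smallest energy (f_m in the text)."""
    return min(rows, key=lambda row: row[3])[0]


# Frequency grid covering the 0.6 GHz to 1.7 GHz range in 0.1 GHz steps.
FREQS = [round(0.6 + 0.1 * i, 1) for i in range(12)]

if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        f_m = optimum_frequency(energy_sweep(FREQS, n))
        print(f"{n} core(s): f_m = {f_m} GHz")
```

With measured data, `energy_sweep` would be replaced by the recorded (frequency, power, time) tuples for each core count, and `optimum_frequency` could be applied unchanged.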

Conclusion
Energy efficiency improvement cannot be achieved by exploring the hardware implementation of the microprocessor design only. Referring to (4), the CPU performance is also improved by a good design of the instruction set architecture (ISA). ISA optimization decreases the program Instruction Count and the CPI. Such optimization directly reduces the offered workload and consequently reduces the power by decreasing the CPU utilization. Improving processor performance through hardware implementation, for example by raising the CPU frequency, has a greater side effect on the power. Other factors such as the CPI have to be considered. A high level of parallelism, including superscalar implementation based on instruction-level parallelism or a multi-processing architecture where many cores (MTC) are integrated, can achieve a better CPI. Using a multi-core processor, as observed in the experiments, reduces the execution time without extra power while enhancing the energy efficiency.
The demonstrated experiments confirm the trade-off between optimizing the energy efficiency and improving the processor performance. Both always affect the power consumption as the CPU frequency changes. Furthermore, we have shown that increasing the number of cores substantially increases the power consumption. However, the minimum energy dissipation then occurs at a lower frequency, which reduces the power consumption. In addition, increasing the number of cores also increases the effective cores value, which reflects better processor performance.

Conflict of Interest
No conflict of interest.