http://sparc.nfu.edu.tw/~ijeti/download/V2-72-84.pdf
International Journal of Engineering and Technology Innovation, vol. 2, no. 1, 2012, pp. 72-84

Golden-Finger and Back-Door: Two HW/SW Mechanisms for Accelerating Multicore Computer Systems

Slo-Li Chu*, Chih-Chieh Hsiao
Department of Information and Computer Engineering, Chung-Yuan Christian University
200, Chung Pei Rd., Chung Li, 32023 Taiwan

Received 25 October 2011; received in revised form 27 November 2011; accepted 22 December 2011
Abstract

The continuous demand for high-performance computing drives computer systems to adopt more processors in order to improve parallelism and throughput. However, even when multiple processing cores are implemented in a computer system, a complicated hardware communication mechanism between the processors can decrease overall system performance. Moreover, an unsuitable process scheduling mechanism in a conventional operating system cannot fully utilize the computation power of the additional processors. Accordingly, this paper provides two mechanisms, one in software and one in hardware, to overcome these challenges. On the software side, we propose a tool, called Golden-Finger, that dynamically adjusts the policy of the Linux process scheduler; it improves the performance of a specified process by letting that process occupy a processor exclusively. On the hardware side, we design an effective mechanism, called Back-Door, to connect two independent processors that otherwise cannot operate together, such as the dual PowerPC 405 cores in the Xilinx ML310 system. The experimental results reveal that both mechanisms obtain significant performance enhancements.

Keywords: multicore, Xilinx ML310, hardware interprocessor communication
1. Introduction

The continuous demands of multimedia and streaming processing consume ever more computing power in modern computer systems. Heavy workloads such as MPEG-4 encoding, music playback, movie playback, and 3D gaming are usually executed simultaneously. State-of-the-art computer architectures therefore integrate multiple processing cores into a single chip to improve overall throughput. However, inefficient hardware inter-processor communication mechanisms enlarge communication latency and diminish the performance gains of the additional cores. Furthermore, the general process scheduling mechanism of a multicore operating system such as Linux cannot satisfy the sudden performance requirements of a mission-critical task, because of its fair, round-robin-based scheduling policy.
The scheduling mechanisms of modern multicore operating systems generally arrange and schedule processes in round-robin fashion according to their priorities. For example, the scheduler of Linux kernel 2.6.11 distributes processes across all processors with a load-balance mechanism that migrates processes between processors when the processors' workloads are unbalanced. Such a scheduling mechanism can postpone the execution of mission-critical applications and delay their response times.

* Corresponding author. E-mail address: slchu@cycu.edu.tw; Tel.: +886-3-2654721; Fax: +886-3-2654799
Fig. 1 Architecture of a conventional dual-core processor
Moreover, in conventional multi-core architectures such as the Core 2 Duo (Fig. 1) [8] and the Athlon64 X2, there is no direct connection between processors. The only way to connect the processors is through the processor local bus, so communication latency increases accordingly: much time is wasted sharing the memory and system bus while processing streaming data. Conventional operating systems such as Linux also support embedded multicore systems poorly and often spend too much time on scheduling. These problems limit the overall performance of a multicore system.
In this paper, we propose hardware and software mechanisms to solve the real-time process scheduling and hardware interprocessor communication problems mentioned above. On the software side, we propose a user-adjustable dynamic process scheduling mechanism, called Golden-Finger, that lets a specified mission-critical process occupy a processor exclusively so that it achieves its maximum performance. The non-critical processes originally executing on that processor are migrated to the other processors. This software mechanism is implemented in the Linux operating system.
On the hardware side, we propose an effective communication mechanism called Back-Door for the two PowerPC 405 processors of the Xilinx ML310 FPGA development platform (Fig. 2). Conventional multicore FPGA systems, such as the Xilinx Virtex-II Pro, lack a communication mechanism between the two cores; this important function is not implemented in the original Virtex-II Pro FPGA, so it is very difficult to keep both cores alive. The proposed Back-Door mechanism connects the dual PowerPC cores in the Virtex-II Pro FPGA, overcoming the restriction that only one processor of the dual-core FPGA can be enabled for execution. Neither the vendor design tool, Xilinx EDK, nor the corresponding Linux version can enable the dual PowerPC 405 cores concurrently. The proposed hardware mechanism therefore improves the performance of the Xilinx ML310 system dramatically by fully utilizing both PowerPC 405 cores.
The organization of this paper is as follows. Section 2 reviews related work on scheduling mechanisms under Linux and related projects on the Xilinx ML310. Section 3 presents the implementations of both the software and hardware flows. Section 4 demonstrates the experimental results and the performance enhancements obtained with the proposed mechanisms. Finally, Section 5 concludes the paper.
Fig. 2 Xilinx ML310 development platform and detailed system architecture of the FPGA
2. Related Works

2.1. Introduction to the scheduler of the Linux kernel
The scheduling mechanism of the process scheduler in Linux 2.6.x [1][2][4][5][6][7] is discussed below. It can find the most suitable task to execute in most running situations. When a program is executed and enters the scheduling stage, the task is put into the run queue corresponding to its priority. When the preset scheduling period expires, the system timer interrupts the processor and triggers the scheduling function, schedule(), to check the status of the current tasks and their remaining running time.

Tasks whose time slices have expired are swapped out of the run queue, and schedule() finds candidate tasks in the run queue. In a normal queue, a task that runs out of its time slice is moved from the head of the queue to its tail. To reduce the time complexity of scanning the run queue, the scheduler instead maintains two priority arrays per processor: active and expired.

The active array keeps the tasks whose time slices have not yet expired, while the expired array holds the tasks whose time slices have run out. When a task exhausts its time slice, its new time slice is recalculated before the task is moved to the expired array. The active and expired arrays are exchanged once all of the tasks in the active array have run out of their time slices. Fig. 3 shows the mechanism of the process scheduler in the Linux operating system.

First, schedule() calls sched_find_first_bit() to find the first set bit in the active array; this bit corresponds to the highest-priority executable task, which the processor then executes. The execution time of these steps does not depend on the number of tasks in the system, i.e., the time complexity is O(1). When the Linux kernel manages a shared-memory multiprocessor system, every processor has its own run queue. In addition, at a fixed interval the kernel checks whether the number of tasks running on each processor is balanced; if not, load_balance() moves tasks between processors to maintain a balanced workload on each processor.
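To make the O(1) claim concrete, the following simplified C sketch (our illustration; the structure and function names are simplifications, not the actual kernel 2.6 source) shows how a bitmap plus per-priority queues yields a constant-time pick, and how the active/expired swap is a pointer exchange rather than a rescan:

#include <stddef.h>

#define NUM_PRIO      140   /* priorities 0..139, as in the 2.6 scheduler */
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct task { struct task *next; int prio; int time_slice; };

/* One bitmap bit and one FIFO list per priority level. */
struct prio_array {
    unsigned long bitmap[(NUM_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
    struct task *queue[NUM_PRIO];
    int nr_active;                      /* number of runnable tasks */
};

/* Per-CPU run queue holding the active and expired arrays. */
struct runqueue {
    struct prio_array arrays[2];
    struct prio_array *active, *expired;
};

/* Find the lowest set bit of the priority bitmap. The real kernel uses
 * a hardware find-first-bit instruction, so the cost is a small
 * constant regardless of how many tasks exist. */
static int find_first_prio(const struct prio_array *a)
{
    for (size_t p = 0; p < NUM_PRIO; p++)
        if (a->bitmap[p / BITS_PER_LONG] & (1UL << (p % BITS_PER_LONG)))
            return (int)p;
    return -1;                          /* no runnable task */
}

/* Simplified schedule(): when every task in the active array has used
 * its time slice, swap active and expired with a pointer exchange,
 * then run the head task of the highest-priority non-empty queue. */
struct task *pick_next_task(struct runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        struct prio_array *tmp = rq->active;
        rq->active  = rq->expired;
        rq->expired = tmp;
    }
    int prio = find_first_prio(rq->active);
    return prio < 0 ? NULL : rq->active->queue[prio];
}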
Fig. 3 The scheduling mechanism of the conventional Linux kernel 2.6 (priority bitmap over priorities 000-139, with occupied entries marked)
2.2. ARTiS

ARTiS is a real-time extension of Linux that targets shared-memory multiprocessor systems. The goal of the ARTiS (Asymmetric Real-Time Scheduling) project [3] is to improve the response time of real-time tasks. RT0 denotes a hard real-time task that must be completed as soon as possible, while RTn denotes soft real-time tasks. When ARTiS boots, all processors are partitioned into two groups, RT and NRT: the processors in the RT group are assigned to execute real-time tasks, and the processors in the NRT group are specialized to execute non-real-time tasks.

ARTiS first arranges all RT tasks onto RT processors and moves NRT tasks to NRT processors. When there are free RT processors, NRT tasks can be moved to those RT processors. If there are more RT tasks than available RT processors, the un-assigned RT tasks are placed on NRT processors by the ARTiS load balancer. To implement these capabilities, ARTiS adopts a task FIFO to hold the tasks being moved between NRT and RT processors, instead of locking both run queues, which diminishes the latency of task migration. Whenever an RT processor becomes available, RT tasks are migrated to it through the task FIFO, so the two run queues never have to wait on a spin lock. Although this study proposes an asymmetric task scheduling mechanism, extended from the Linux kernel, that can pin a specific task on an assigned processor, ARTiS cannot execute on the Xilinx Virtex-II Pro FPGA, so its scheduling mechanism cannot be applied to the Xilinx ML310 system.
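The migration FIFO that lets ARTiS avoid holding both run-queue locks can be illustrated with a single-producer/single-consumer ring buffer. The C11 sketch below is our own rendering of that general technique, not ARTiS source code, and is safe only when exactly one CPU pushes and one CPU pops:

#include <stdatomic.h>
#include <stddef.h>

#define FIFO_SIZE 64u                   /* illustrative capacity, power of two */

struct task;                            /* opaque task handle */

/* Lock-free SPSC FIFO: the NRT CPU pushes migrating RT tasks and the
 * RT CPU pops them, so neither run queue's spin lock is involved. */
struct migration_fifo {
    struct task *slot[FIFO_SIZE];
    _Atomic size_t head;                /* advanced only by the consumer */
    _Atomic size_t tail;                /* advanced only by the producer */
};

/* Producer side: returns 0 on success, -1 if the FIFO is full. */
int fifo_push(struct migration_fifo *f, struct task *t)
{
    size_t tail = atomic_load_explicit(&f->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&f->head, memory_order_acquire);
    if (tail - head == FIFO_SIZE)
        return -1;                      /* full */
    f->slot[tail % FIFO_SIZE] = t;
    /* release: the stored pointer becomes visible before the new tail */
    atomic_store_explicit(&f->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns the oldest task, or NULL if the FIFO is empty. */
struct task *fifo_pop(struct migration_fifo *f)
{
    size_t head = atomic_load_explicit(&f->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&f->tail, memory_order_acquire);
    if (head == tail)
        return NULL;                    /* empty */
    struct task *t = f->slot[head % FIFO_SIZE];
    atomic_store_explicit(&f->head, head + 1, memory_order_release);
    return t;
}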
2.3. ATLAS

ATLAS [10] is the first implementation of the transactional memory coherence and consistency (TCC) architecture, built as a scalable platform for transactional parallel systems. ATLAS is an FPGA-based system that primarily serves as a rapid software development platform for the transactional memory model. ATLAS uses the two PowerPC hard cores and attaches the data port of each PPC, via the PLB (Processor Local Bus of the IBM CoreConnect architecture), to a TCC cache whose capacity is configurable as 8, 16, or 32 kB. The internal PPC data caches are bypassed and disabled to prevent interference with TCC.

Instruction fetches go directly to the DDR controller. The internal 16-kB, 2-way set-associative instruction-side caches of the PPCs remain active, since instruction fetches bypass the TCC caches. Finally, BRAM (Block RAM, Xilinx on-chip SRAM cells) connects directly to each PPC through the OCM (On-Chip Memory) bus for transactional checkpoint storage. ATLAS thus provides a TCC architecture for both PPC processors. However, it still requires a complicated cache coherence design and an additional software programming model, and its operating system support still needs improvement.
3. The Designs of the Software Scheduling and Hardware Communication Mechanisms

In modern multicore computer systems, the process scheduling capabilities and the interprocessor communication efficiency dominate computation performance. In this section, we propose two mechanisms, one on the software side and one on the hardware side, to address these challenges. The former is a novel process scheduling mechanism, Golden-Finger, which rearranges the execution order of the processes and the scheduling queues of the processors, clears one processor, and executes the specified mission-critical process on that processor exclusively. The latter is an efficient hardware communication mechanism, Back-Door, for interprocessor communication in a multicore system. The two mechanisms are described in detail below.
3.1. The Golden-Finger Scheduling Mechanism

The process scheduling mechanism of the conventional Linux operating system focuses on fair round-robin assignment, which is suitable for symmetric multiprocessor systems. However, when the computer system consists of asymmetric processors, as in the IBM Cell processor and TI OMAP, the imbalanced computation power of the processors makes the fair scheduling policy inefficient. The fair policy also causes problems when the user wants to execute an urgent process in real time. Accordingly, we propose the Golden-Finger mechanism, which modifies the scheduling mechanism of the operating system to improve its real-time capabilities by allowing the user to assign a program to occupy a particular processor. The detailed scheduling states of the proposed Golden-Finger mechanism are illustrated in Fig. 4.
The processing of Golden-Finger is divided into five states. First, Golden-Finger is activated in State 0: the user assigns a target CPU and the application, a mission-critical task that must respond as soon as possible. Once Golden-Finger validates this information, the mechanism proceeds to the next state. The second stage, State 1, checks the run queue of the target CPU and determines the candidate processes within it that must be moved to empty that run queue; the total workload of these candidate processes is evaluated for the following scheduling states. The third stage, State 2, retrieves the status of the other alive CPUs by scanning their run queues to determine their workloads; the mechanism then identifies the most lightly loaded CPU to execute the processes migrated from the target CPU. The fourth stage, State 3, actually migrates the candidate processes away from the target CPU. To implement this special system call, we modified the Linux kernel and the real-time patch for Linux from the ARTiS [3] system, so Golden-Finger can move the required processes from one CPU to another. The final stage, State 4, forks the mission-critical process assigned by the user; Golden-Finger can then return to State 0 for the next scheduling cycle. A simple scheduling example of the Golden-Finger mechanism is demonstrated in Fig. 5 and Fig. 6, and a rough userspace sketch of the idea follows below.
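The paper realizes States 1-4 inside a modified kernel with the ARTiS real-time patch; the userspace C sketch below only approximates the same behavior with the standard sched_setaffinity() interface (our assumption for illustration, not the authors' implementation). It evacuates the target CPU by re-pinning every other process, then pins and executes the mission-critical program:

#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Userspace approximation of Golden-Finger: move every process off
 * `target_cpu` (roughly States 1-3), then exec the mission-critical
 * program pinned to the now-idle CPU (State 4). Needs root. */
int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <target_cpu> <program> [args...]\n", argv[0]);
        return 1;
    }
    int target_cpu = atoi(argv[1]);
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

    /* Affinity mask containing every CPU except the target. */
    cpu_set_t others;
    CPU_ZERO(&others);
    for (int c = 0; c < ncpus; c++)
        if (c != target_cpu)
            CPU_SET(c, &others);

    /* Walk /proc and re-pin all other processes; the kernel's load
     * balancer then spreads them over the remaining CPUs. */
    DIR *proc = opendir("/proc");
    struct dirent *e;
    while (proc && (e = readdir(proc)) != NULL) {
        pid_t pid = (pid_t)atoi(e->d_name);   /* non-numeric entries give 0 */
        if (pid > 0 && pid != getpid())
            sched_setaffinity(pid, sizeof(others), &others); /* may fail for kernel threads */
    }
    if (proc)
        closedir(proc);

    /* Run the specified process alone on the emptied target CPU. */
    cpu_set_t only;
    CPU_ZERO(&only);
    CPU_SET(target_cpu, &only);
    sched_setaffinity(0, sizeof(only), &only);
    execvp(argv[2], argv + 2);
    perror("execvp");
    return 1;
}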
Fig. 4 The call graph of tasks' movements (State 0: Golden-Finger activated, the user specifies the target CPU and the specific process; State 1: check the target CPU's run queue and find the candidate processes to migrate; State 2: find a lightly loaded CPU to execute the processes swapped out of the target CPU; State 3: migrate the candidate processes so that the target CPU becomes empty; State 4: execute the specified process solely on the target CPU)
Fig. 5 illustrates a snapshot of scheduling results under the conventional Linux process scheduler: 11 processes are scheduled in the run queues of CPU 0, CPU 1, CPU 2, and CPU 3, respectively, under the fair round-robin policy. Based on the situation of Fig. 5, the scheduling result of the Golden-Finger mechanism is illustrated in Fig. 6. Assume the user assigns CPU 3 as the target CPU and wants to execute the mission-critical process, "Process G". After scheduling by the Golden-Finger mechanism of Fig. 4, Process 4 and Process 6 are moved to CPU 0, which has the most lightly loaded run queue. Finally, the assigned "Process G" executes on CPU 3 alone and can achieve its best performance to accomplish its real-time critical mission.

Fig. 5 Snapshot of process execution status on a conventional Linux system
Fig. 6 The scheduling result of the Golden-Finger mechanism
3.2. The Back-Door Interprocessor Communication Mechanism

The Back-Door interprocessor communication mechanism is built on the Xilinx ML310 development platform, shown in Fig. 7, which consists of a main FPGA chip, a Xilinx Virtex-II Pro (XC2VP30), 256 MB of DDR memory, and Ethernet, USB, and PCI physical chips and connectors. The integrated System ACE CF controller performs board bring-up and loads applications from a 512 MB CompactFlash card. The Xilinx Virtex-II Pro FPGA chip contains two PowerPC 405 cores, 30,000 logic cells, and 2,400 Kbits of BlockRAM (BRAM). The whole hardware system is developed with Xilinx ISE and EDK, which can generate the whole predefined system together with user-defined hardware modules.

Because of unsupported features in Xilinx EDK, when both PowerPC 405 cores are connected to the same PLB, EDK does not support multicore operation and allows only a single PowerPC 405 to use the bus and boot at a time. We therefore needed to modify the whole system and find a suitable solution. The vendor-suggested operating system for the Xilinx ML310 is MontaVista Linux; because it does not support multicore operation and does not provide kernel source code, we modified the open-source Linux kernel for the PowerPC 405 to implement the program loader of the Back-Door hardware mechanism.
Fig. 7 The original architecture of the Xilinx ML310 with a sole PowerPC 405 core
The proposed Back-Door hardware communication mechanism is implemented as follows. First, the sole-core configuration of the Xilinx ML310 is generated with Xilinx EDK, as shown in Fig. 7, to construct the fundamental system architecture. This system consists of one PowerPC 405 (PPC405 0), a processor local bus (PPC0 PLB Bus), a DDR DRAM controller (DDR Controller), a low-speed OPB bus (OPB Bus), inter-bus bridges (OPB2PLB and PLB2OPB Bridges), and the required low-speed peripherals (Interrupt Controller, SysACE, UART, SMBus, SPI, GPIO, PCI Bridge), all integrated into the Xilinx Virtex-II Pro FPGA.

Since Xilinx EDK does not support a dual-PowerPC configuration, the second PowerPC 405 can only be attached to an individual PLB bus. The second PowerPC 405 (PPC405 1) and its corresponding PLB bus (PPC1 PLB Bus) are therefore constructed manually, as shown in Fig. 8. However, this still lacks the capability of booting both PowerPC 405 cores, so the proposed Back-Door mechanism is constructed to connect the two unconnected PLB buses and solve the problem that PPC405 1 can access only its own BRAM (BRAM for PPC 1) but not the DDR memory.
The proposed Back-Door mechanism is attached to the PPC0 PLB Bus and the PPC1 PLB Bus simultaneously. With the Back-Door mechanism in place, PPC405 0 can communicate with PPC405 1 through Back-Door using conventional memory-mapped I/O, while PPC405 1 communicates with PPC405 0 through Back-Door using direct memory access, as shown in Fig. 9.
Fig. 8 The proposed Back-Door mechanism with dual PowerPC 405 cores
Accordingly, the Linux operating system can be ported onto this new Xilinx ML310 platform, which consists of the two PowerPC 405 processors and the proposed Back-Door mechanism. Due to the limitations of Xilinx EDK and the Linux kernel, the operating system has to boot on PPC405 0, which controls all of the peripherals; the Back-Door is recognized as a specialized MTD device. PPC405 1 executes a loader program that is responsible for loading, via the Back-Door mechanism, the application assigned by PPC405 0, and then executing it. After the assigned program finishes, the results are sent back through the Back-Door mechanism, and PPC405 0 receives them from PPC405 1 when notified via Back-Door. An illustrative sketch of this handshake appears below.
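The C sketch below illustrates the two sides of this handshake. The base address, register layout, command values, and function names are hypothetical, chosen only for illustration rather than taken from the actual ML310 design, and a real loader would additionally handle caching and relocation:

#include <stdint.h>

/* Hypothetical layout of the Back-Door window; the base address and
 * offsets are illustrative assumptions, not the real ML310 memory map. */
#define BACKDOOR_BASE 0x80000000u
#define CMD_OFFSET    0x0u              /* command/status word */
#define LEN_OFFSET    0x4u              /* payload length in bytes */
#define BUF_OFFSET    0x8u              /* program image / result buffer */

#define CMD_IDLE 0u
#define CMD_RUN  1u                     /* PPC405 0 -> 1: image loaded, go */
#define CMD_DONE 2u                     /* PPC405 1 -> 0: results ready */

static volatile uint32_t *const bd_cmd =
        (volatile uint32_t *)(uintptr_t)(BACKDOOR_BASE + CMD_OFFSET);
static volatile uint32_t *const bd_len =
        (volatile uint32_t *)(uintptr_t)(BACKDOOR_BASE + LEN_OFFSET);
static volatile uint8_t *const bd_buf =
        (volatile uint8_t *)(uintptr_t)(BACKDOOR_BASE + BUF_OFFSET);

/* PPC405 0 side (under Linux): push a program image through the
 * Back-Door window, signal PPC405 1, and poll for completion. */
void ppc0_send_program(const uint8_t *image, uint32_t len)
{
    for (uint32_t i = 0; i < len; i++)
        bd_buf[i] = image[i];           /* memory-mapped I/O writes */
    *bd_len = len;
    *bd_cmd = CMD_RUN;
    while (*bd_cmd != CMD_DONE)
        ;                               /* results now wait in bd_buf */
}

/* PPC405 1 side (standalone loader): wait for a program, execute it,
 * then publish the results and signal completion. */
void ppc1_loader_loop(void)
{
    for (;;) {
        while (*bd_cmd != CMD_RUN)
            ;                           /* wait for work */
        uint32_t len = *bd_len;
        (void)len;                      /* a real loader would copy/relocate */
        void (*entry)(void) = (void (*)(void))(uintptr_t)bd_buf;
        entry();                        /* run the loaded program */
        *bd_cmd = CMD_DONE;             /* results written back via bd_buf */
    }
}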
Fig. 9 The communication mechanism between PowerPC 405 1 and PowerPC 405 0
4. Experimental Results

The proposed Golden-Finger software mechanism and Back-Door hardware mechanism have been implemented on a dual-core Intel PC and on the Xilinx ML310 platform, respectively. The experimental results of these mechanisms are discussed in the following subsections.
4.1. Experimental Results of the Proposed Golden-Finger Software Mechanism

The target machine for the Golden-Finger mechanism is an Intel x86 dual-core PC consisting of an Intel Pentium D 2.8 GHz, 1 GB of DDR SDRAM, the Linux kernel 2.6.11 operating system, X Windows (GNOME 3.x), and the GCC 3.4 compiler.

The benchmarks adopted in this experiment include the LAME encoder, an MPEG player, an MPEG decoder, and an MPEG encoder. The experimental results are shown in Fig. 10 and Fig. 11.
The prefix "Heavy" denotes that a LAME MP3 encoder runs in the background during the experiment to simulate a heavily loaded system. The prefix "Medium" denotes that MPlayer plays an MPEG-1 video and an MPEG-4 video simultaneously to simulate a medium system load. Finally, the prefix "Light" denotes that MPlayer plays a single MPEG-1 video while the Golden-Finger experiments are taken.

The programs assigned as the specified applications of the Golden-Finger mechanism are the LAME MP3 encoder and the MPEG-4 encoder. "MP3 Enc (40MB)" and "MP3 Enc (100MB)" denote that the application scheduled by the Golden-Finger mechanism runs the LAME MP3 encoder to convert a 40 MB or a 100 MB wave file, respectively, to MP3 format. Similarly, "MPEG4 Enc (50MB)" and "MPEG4 Enc (350MB)" denote that the scheduled application runs the MPEG-4 encoder to convert a 50 MB or a 350 MB MPEG-1 file, respectively, to MPEG-4 format.

The execution time and speedup comparisons of conventional Linux and the proposed Golden-Finger mechanism, both evaluated under the three system loads above (Heavy, Medium, Light) and the four assigned applications, are shown in Fig. 10 and Fig. 11, respectively.
Fig. 10 The execution time comparison (in seconds) of conventional Linux and the proposed Golden-Finger mechanism across the twelve Heavy/Medium/Light benchmark configurations
Fig. 11 The speedup comparison of conventional Linux and the proposed Golden-Finger mechanism for the same twelve benchmark configurations
Under all three system loads, the proposed Golden-Finger mechanism obtains dramatic execution-time reductions, especially in the heavy-load cases, where the speedup reaches 2.1X by fully utilizing the assigned target CPU to accomplish the assigned process. In contrast to the heavy-load configuration, the lightly loaded system obtains only a small speedup, because in the conventional Linux cases the assigned applications are hardly delayed by the background programs.
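For reference, the speedup plotted in Fig. 11 is, as we read the figures, the usual ratio of execution times:

$$ \text{Speedup} = \frac{T_{\text{original Linux}}}{T_{\text{Golden-Finger}}} $$

so an encoding run that hypothetically took 420 s under the stock scheduler and 200 s under Golden-Finger would score 420 / 200 = 2.1X.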
4.2. Experimental Results of the Proposed Back-Door Hardware Mechanism

The evaluation platform for the Back-Door hardware mechanism is the Xilinx ML310, as shown in Fig. 2. This experiment adopts four benchmarks, selection sort (Selection Sort), Dhrystone benchmark 2.1 (Dhrystone), fast Fourier transform (FFT), and wavelet transform (Wavelet), to evaluate the performance improvement of the proposed Back-Door mechanism in terms of execution time (Fig. 12) and speedup (Fig. 13).

Due to the limitations of the Xilinx EDK environment, the operating system and compiler are Linux kernel 2.4.25 and GCC 2.95.3, matching the predefined board support package (BSP) and device drivers. The Virtex-II Pro FPGA is configured at 300 MHz, so both PowerPC 405 processors work at that speed.
The execution time and speedup comparisons of the original Xilinx ML310 and the proposed Back-Door mechanism over the four benchmarks are shown in Fig. 12 and Fig. 13, respectively. The speedup reaches 1.6X in the case of Selection Sort, since it contains more exploitable parallelism and does not require much interprocessor data transfer or DDR memory access. In contrast, the limited parallelism of FFT allows it to obtain only a 1.25X speedup with the Back-Door mechanism.
Note that on the current Xilinx ML310 platform, even with the Back-Door mechanism, the second processor (PPC405 1) cannot be recognized by Linux as a normal processor to schedule; it can serve only as a specialized hardware accelerator executing manually modified applications, which conventional parallelizing compilers cannot parallelize automatically. Nevertheless, the dual PowerPC 405 cores still achieve speedups from 1.25X to 1.6X. This evaluation demonstrates the capabilities of the proposed Back-Door mechanism.
Fig. 12 The execution time comparison (in seconds) of the original Xilinx ML310 (sole PPC405) and the proposed Back-Door mechanism (dual PPC405) on the FFT, Selection Sort, Wavelet, and Dhrystone benchmarks
Fig. 13 The speedup comparison of the original Xilinx ML310 and the proposed Back-Door mechanism on the same four benchmarks
5. Conclusions

The continuous demand for high-performance computing drives computer systems to adopt more processors in order to improve parallelism and throughput. This paper proposed two mechanisms, Golden-Finger and Back-Door, on the software and hardware sides respectively, to fully utilize the processor capabilities of modern multicore systems. The proposed Golden-Finger software mechanism dynamically adjusts the scheduling policy to improve the performance of a specified process by letting it occupy a processor exclusively. The Back-Door hardware mechanism enables communication between two independent processors that otherwise cannot operate together, such as the dual PowerPC 405 cores in the Xilinx ML310 system. The experimental results reveal that the two mechanisms, on their respective dual-core computer systems, obtain speedups of up to 2.1X and 1.6X over various benchmarks, demonstrating the capabilities of the two mechanisms on multicore computer systems.
Acknowledgement

This work is supported in part by the National Science Council of the Republic of China, Taiwan, under Grant NSC 100-2221-E-033-043.
References

[1] J. Aas, Understanding the Linux 2.6.8.1 CPU Scheduler, Silicon Graphics, Inc., 2005.
[2] R. Love, Linux Kernel Development, SAMS, Developer Library Series, 2003.
[3] E. Piel, P. Marquet, J. Soula, and J.-L. Dekeyser, "Asymmetric real-time scheduler on multi-processor architecture," Proc. 20th International Parallel and Distributed Processing Symposium, Apr. 2006, pp. 25-29.
[4] G. E. Allen and B. L. Evans, "Real-time sonar beamforming on workstations using process networks and POSIX threads," IEEE Transactions on Signal Processing, pp. 921-926, Mar. 2000.
[5] K. Morgan, "Preemptible Linux: a reality check," MontaVista Software, Inc., 2001.
[6] J. D. Valois, "Implementing lock-free queues," Proc. Seventh International Conference on Parallel and Distributed Computing Systems, Oct. 1994.
[7] I-Tao Liao, Koan-Sin Tan, Shau-Yin Tseng, and Wen-Feng Chen, "Interprocessor communication for PAC," ITRI SoC Technical Journal, no. 002.
[8] Intel Corp., Intel Core Microarchitecture, http://www.intel.com/technology/architecture/coremicro/index.htm
[9] IBM Corp., The Cell Architecture, http://www.research.ibm.com/cell
[10] N. Njoroge, S. Wee, J. Casper, J. Burdick, Y. Teslyar, C. Kozyrakis, and K. Olukotun, "Building and using the ATLAS transactional memory system," Proc. 12th International Symposium on High-Performance Computer Architecture (HPCA), 2006.