A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU
  • Authors: Yulong Yu; Xubin He; He Guo; Yuxin Wang; Xin Chen
  • Keywords: GPGPU; CTA scheduler; Credit-based load-balance-aware scheduling scheme; Load balance
  • Journal: International Journal of Parallel Programming
  • Year: 2016
  • Publication date: February 2016
  • Volume: 44
  • Issue: 1
  • Pages: 109-129
  • Full-text size: 1,323 KB
  • Author affiliations: Yulong Yu (1) (2)
    Xubin He (2)
    He Guo (1)
    Yuxin Wang (3)
    Xin Chen (1)

    1. School of Software Technology, Dalian University of Technology, Dalian, China
    2. Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, VA, USA
    3. School of Computer Science and Technology, Dalian University of Technology, Dalian, China
  • Journal category: Computer Science
  • Journal subjects: Theory of Computation
    Processor Architectures
    Software Engineering, Programming and Operating Systems
  • Publisher: Springer Netherlands
  • ISSN: 1573-7640
Abstract
GPGPUs improve computing performance through massive parallelism. The cooperative-thread-array (CTA) schedulers employed by current GPGPUs greedily issue CTAs to GPU cores as soon as resources become available, in pursuit of higher thread-level parallelism. Because of locality considerations in the memory controller, CTA execution time varies across cores, which leads to a load imbalance in CTA issuance among the cores. This load imbalance leaves computing resources under-utilized and thus presents an opportunity for further performance improvement, yet existing warp and CTA scheduling policies do not take load balance into account. We propose CLASO, a credit-based load-balance-aware CTA scheduling optimization scheme that piggybacks on a standard GPGPU scheduling system. CLASO uses credits to limit the number of CTAs issued on each core, avoiding both greedy issuance to faster-executing cores and starvation of the leftover cores. In addition, CLASO employs global credits and two tuning parameters, active levels and loose levels, to enhance load balance and robustness. Rather than being a standalone scheduling policy, CLASO is compatible with existing CTA and warp schedulers. Experiments conducted on several paradigmatic benchmarks show that CLASO effectively improves load balance, reducing idle cycles by 52.4 % on average, and achieves up to a 26.6 % speedup over the GPGPU baseline scheduling policy.
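The core idea described in the abstract can be sketched in a few lines: per-core credits cap how many CTAs each core may hold in flight, and a global credit pool bounds the total, so a fast core cannot greedily absorb the issuance stream while slower cores starve. The following minimal Python sketch is illustrative only; the class name, parameters, and methods are assumptions for exposition, and the paper's active/loose level tuning is not modeled here.

```python
class CreditScheduler:
    """Illustrative sketch of credit-based CTA issuance (not the
    authors' actual CLASO implementation).

    Each core starts with `per_core_credits`; a CTA may be issued to
    a core only while it holds a credit and a global credit remains,
    and both credits are returned when the CTA finishes.
    """

    def __init__(self, num_cores, per_core_credits, global_credits):
        self.credits = [per_core_credits] * num_cores
        self.global_credits = global_credits

    def try_issue(self, core):
        """Issue one CTA to `core` if credits allow; return success."""
        if self.credits[core] > 0 and self.global_credits > 0:
            self.credits[core] -= 1
            self.global_credits -= 1
            return True
        return False

    def complete(self, core):
        """A CTA finished on `core`: return its per-core and global credit."""
        self.credits[core] += 1
        self.global_credits += 1


sched = CreditScheduler(num_cores=2, per_core_credits=2, global_credits=3)
# Greedy issuance to core 0 is capped at its per-core credit limit...
assert sched.try_issue(0) and sched.try_issue(0)
assert not sched.try_issue(0)   # core 0 is out of credits
# ...so the slower core 1 still receives a CTA instead of being starved.
assert sched.try_issue(1)
assert not sched.try_issue(1)   # global credit pool exhausted
sched.complete(0)               # a CTA retires and its credits return
assert sched.try_issue(1)       # issuance can resume on core 1
```

The usage at the bottom shows the two failure modes the scheme targets: the per-core cap stops a fast core from monopolizing issuance, and returning credits on completion naturally steers new CTAs toward cores that drain their work more slowly.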
