Parallel Implementation of TOUGHREACT Based on GPU

Original title (Chinese): 基于GPU的TOUGHREACT并行化实现
Author: Zhu Tong (朱彤)
Thesis level: Master's
Discipline: Network and Information Security
Keywords: GPGPU ; Multiphase Flow Simulation ; Parallel Computing ; CUDA ; Bi-Conjugate Gradient
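Degree year: 2012
Advisors: Wei Xiaohui (魏晓辉) ; Zhang Meng (张猛)
Discipline code: 081001
Degree-granting institution: Jilin University (吉林大学)
Submission date: 2012-04-01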

Abstract

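In recent years, high-performance parallel computing has developed rapidly. Using new multi-core, many-core, and GPU platforms to efficiently simulate numerical models of physical and chemical states under complex geological conditions has become a topic of growing interest to geoscientists. With the emergence and rapid development of general-purpose GPU computing, more and more researchers are using GPUs to accelerate subsurface multiphase flow simulation software in order to meet the demands of large-scale, high-accuracy applications. TOUGHREACT, developed at Lawrence Berkeley Laboratory, is currently the most widely used simulator for the coupled processes and mechanisms of subsurface multiphase fluid flow and reactive geochemical transport. At present, TOUGHREACT runs inefficiently when simulating complex geological problems that demand large scale and high accuracy, such as geological storage of carbon dioxide. Accelerating TOUGHREACT simulations with GPU parallel computing therefore has significant engineering and research value. To this end, this thesis parallelizes TOUGHREACT on a heterogeneous CPU-GPU platform.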
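     First, the basic simulation workflow of the software is outlined on the basis of the relevant domain knowledge, and its modular structure is analyzed in detail with reference to existing research. By comparing how the multiphase flow module and the reactive geochemical transport module are solved, and weighing the size of the linear systems together with the concurrency of the iterative solve within each time step, the multiphase flow simulation is identified as the part better suited to parallel implementation on the GPU.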
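     Numerical solutions of many practical problems in the natural and social sciences rely on partial differential equations as models of mass and energy conservation, and solving sparse linear systems is one of the main computational steps when those equations are discretized. For some large, site-scale problems in particular, solving sparse linear systems accounts for more than 80% of the run time. This thesis therefore compares the execution time of the individual TOUGHREACT modules and focuses the parallelization effort on the linear system solver.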
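     Because the coefficient matrices arising in multiphase flow problems are non-symmetric and not positive definite, this thesis solves the systems with several bi-conjugate gradient methods from the family of Krylov subspace methods. To avoid sacrificing solver efficiency, the preconditioning step is not ported to the GPU; the CUDA implementation instead targets the two most time-consuming operations in the solve, sparse matrix-vector multiplication (SpMV) and the vector inner product. Once the mapping of each kernel function is fixed, writing the CUDA program is not difficult, but a few essential optimizations can improve its performance significantly. The work in this thesis comprises: choosing a suitable sparse matrix storage format to reduce memory use and host-device transfer overhead; optimizing memory access by using shared memory and page-locked memory and by merging sequentially executed kernels to reduce global memory accesses; optimizing the instruction stream by avoiding unnecessary synchronization and unrolling loops; and implementing multiple kernel versions together with a thread-scale decision tree that selects a thread organization appropriate to the problem size, so that the GPU's processors are fully used and the load is balanced.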
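As a minimal illustration of the SpMV operation discussed in this paragraph, the sketch below stores a small non-symmetric matrix in compressed sparse row (CSR) form and computes y = Ax one row at a time. The abstract does not say which storage format the thesis selected, so CSR is an assumption here, and the function names (`csr_from_dense`, `spmv_csr_row`, `spmv_csr`) are hypothetical. The sketch is plain Python rather than CUDA so that it runs anywhere; the per-row function stands in for the work that a single GPU thread might do in a simple one-thread-per-row kernel.

```python
# Sketch only: CSR storage and y = A @ x, one "thread" per matrix row.
# CSR and all function names are illustrative assumptions, not the thesis code.

def csr_from_dense(dense):
    """Convert a dense row-major matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # row i occupies values[row_ptr[i]:row_ptr[i+1]]
    return values, col_idx, row_ptr


def spmv_csr_row(i, values, col_idx, row_ptr, x):
    """Work of one thread in a one-thread-per-row kernel: dot product of row i with x."""
    acc = 0.0
    for k in range(row_ptr[i], row_ptr[i + 1]):
        acc += values[k] * x[col_idx[k]]
    return acc


def spmv_csr(values, col_idx, row_ptr, x):
    """Serial stand-in for a kernel launch: rows are independent of each other,
    so every iteration of this loop could run as its own thread."""
    n_rows = len(row_ptr) - 1
    return [spmv_csr_row(i, values, col_idx, row_ptr, x) for i in range(n_rows)]


# Small non-symmetric example matrix.
A = [[4.0, 1.0, 0.0],
     [0.0, 3.0, 2.0],
     [1.0, 0.0, 5.0]]
values, col_idx, row_ptr = csr_from_dense(A)
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)  # [6.0, 12.0, 16.0]
```

Only the nonzero values and two index arrays are stored, which is the kind of saving in memory use and host-device transfer that motivates choosing a compact format. Rows with very different numbers of nonzeros would give the threads unequal work, which is one reason a real implementation might keep several kernel versions.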
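     Finally, the parallel preconditioned conjugate gradient solver is integrated into TOUGHREACT. Real problems of different sizes are simulated on the CPU-GPU platform to measure the performance of the parallel BiCG and BiCGSTAB implementations. The experiments show a speedup of up to 3.4 times for the parallel linear solver over the serial CPU program and up to 2.8 times for the multiphase flow simulation as a whole. These results support the parallelization strategy adopted here, lay a good foundation for porting the reactive geochemical transport module to the GPU, and provide valuable experience for that work.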
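To show where sparse matrix-vector products and inner products enter a bi-conjugate gradient solve, the following sketch implements the textbook unpreconditioned BiCGSTAB iteration in plain Python. It is not the thesis solver: that solver is preconditioned and runs its kernels in CUDA, whereas this version omits preconditioning and breakdown handling for brevity. The CSR layout, the function names (`spmv`, `dot`, `bicgstab`), and the test system are assumptions chosen for illustration.

```python
import math

# Sketch only: textbook unpreconditioned BiCGSTAB on a CSR matrix.
# Names, tolerances, and the example system are illustrative assumptions.

def spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def dot(u, v):
    """Vector inner product."""
    return sum(a * b for a, b in zip(u, v))


def bicgstab(values, col_idx, row_ptr, b, tol=1e-10, max_iter=100):
    """Solve A x = b starting from x = 0; returns (x, iterations used)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)            # initial residual b - A*0
    r_hat = list(r)        # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [r[i] + beta * (p[i] - omega * v[i]) for i in range(n)]
        v = spmv(values, col_idx, row_ptr, p)        # first SpMV of the iteration
        alpha = rho_new / dot(r_hat, v)
        s = [r[i] - alpha * v[i] for i in range(n)]
        if math.sqrt(dot(s, s)) <= tol * b_norm:     # converged after the half step
            return [x[i] + alpha * p[i] for i in range(n)], it
        t = spmv(values, col_idx, row_ptr, s)        # second SpMV of the iteration
        omega = dot(t, s) / dot(t, t)
        x = [x[i] + alpha * p[i] + omega * s[i] for i in range(n)]
        r = [s[i] - omega * t[i] for i in range(n)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        rho = rho_new
    raise RuntimeError("BiCGSTAB did not converge")


# CSR form of the non-symmetric matrix [[4, 1, 0], [0, 3, 2], [1, 0, 5]].
values = [4.0, 1.0, 3.0, 2.0, 1.0, 5.0]
col_idx = [0, 1, 1, 2, 0, 2]
row_ptr = [0, 2, 4, 6]
b = [6.0, 12.0, 16.0]      # chosen so that the exact solution is [1, 2, 3]
x, iters = bicgstab(values, col_idx, row_ptr, b)
print([round(xi, 6) for xi in x], iters)  # solution should be close to [1, 2, 3]
```

Each iteration performs two SpMV calls and several inner products, with only cheap vector updates in between, which is consistent with the choice to move those two operations to the GPU first.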
With the rapid development of high-performance parallel computing, it is increasingly important for geologists to use new multi-core and GPU computing platforms to simulate physical and chemical states under complicated geological conditions. General-purpose GPU (GPGPU) computing, combined with numerical simulation of subsurface multiphase flow, may be an effective tool for a variety of hydrogeology and environmental geology problems. TOUGHREACT, developed by Lawrence Berkeley Laboratory, is currently the most widely used program for simulating the coupled processes and mechanisms of subsurface multiphase fluid flow and reactive chemical transport. At present, TOUGHREACT is inefficient on large-scale, high-precision, highly complex simulation problems such as underground disposal of nuclear waste and underground storage of CO2. Using GPU parallel computing to accelerate TOUGHREACT simulations therefore has important engineering significance and research value. To this end, this thesis parallelizes the simulation software on a CPU-GPU platform.
     First, the relevant domain knowledge is studied to gain a basic understanding of the program's simulation process, and the modular structure of the software is analyzed in detail with reference to existing research. A comparison of the multiphase flow module and the geochemical reactive transport module shows that the multiphase flow part is better suited to parallel execution on a GPU, given the size of its linear systems and the concurrency of its iterative solution process.
     Partial differential equations are often used as mathematical models, and sparse linear systems play a very important role in solving them. Based on a comparison of the execution time of the program's modules, this thesis parallelizes the linear equation solver, which accounts for more than 80% of the simulation time.
     Because the coefficient matrices encountered in multiphase flow problems are non-symmetric and not positive definite, this thesis uses several bi-conjugate gradient methods, which belong to the Krylov subspace methods, to solve the systems. The preconditioned conjugate gradient methods used in the solver are analyzed. So as not to sacrifice solver efficiency, the preconditioning step is not accelerated on the GPU. The CUDA implementation focuses on sparse matrix-vector multiplication (SpMV) and the vector inner product, the two most time-consuming parts of the solve. Developing a CUDA-based parallel program is not difficult; the main task is optimizing it. The optimizations in this thesis include: selecting a sparse matrix storage format; reducing data traffic between host and device; splitting the basic matrix-vector multiplication algorithm into two methods so that the computation is more efficient; reorganizing the kernels to reduce kernel switching overhead; using shared memory and page-locked memory to optimize memory access; designing multiple versions of the thread organization for each kernel; and building a thread-scale tree that matches the thread organization to the problem size, making full use of the processor resources on the GPU. These measures greatly improve the usability of the program.
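The vector inner product is typically parallelized as a reduction. The sketch below simulates, in plain Python, one common two-stage scheme: each block of threads multiplies its slice element-wise into a scratch buffer (playing the role of shared memory), halves the number of active lanes each round until one partial sum remains, and the per-block partial sums are then added together. Whether the thesis uses this exact scheme is not stated in the abstract; the function name `block_reduce_dot` and the block size are hypothetical.

```python
# Sketch only: a two-stage tree reduction for the inner product, simulated serially.
# The scheme, the function name, and the block size are illustrative assumptions.

def block_reduce_dot(x, y, block_size=4):
    """Inner product of x and y computed block by block with a tree reduction."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if block_size <= 0 or block_size & (block_size - 1) != 0:
        raise ValueError("block_size must be a power of two")

    partial_sums = []
    for start in range(0, len(x), block_size):
        # Scratch buffer for one block; lanes past the end of the vector hold 0.
        shared = [x[i] * y[i] if i < len(x) else 0.0
                  for i in range(start, start + block_size)]
        stride = block_size // 2
        while stride > 0:
            # On a GPU the lanes of this inner loop would run concurrently,
            # with a barrier between successive rounds.
            for tid in range(stride):
                shared[tid] += shared[tid + stride]
            stride //= 2
        partial_sums.append(shared[0])

    # Second stage: combine the per-block results.
    return sum(partial_sums)


result = block_reduce_dot([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2.0] * 6)
print(result)  # 42.0
```

With a block size of 4, the six products are reduced in two blocks (partial sums 20 and 22) and then combined. The number of blocks and the block size are the kind of parameters that a thread-scale tree could select from the problem size.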
     Finally, the parallel preconditioned conjugate gradient solver package is integrated into TOUGHREACT. To test its performance, the parallel BiCG and BiCGSTAB algorithms are used to simulate practical problems of different sizes on the CPU-GPU computing platform. The experiments show a speedup of up to 3.4 times for the solution of the linear equations in double precision. With the main part of the program parallelized, the overall simulation is also accelerated, by up to 2.8 times. This result confirms the soundness of the parallelization strategy used in this thesis, lays a good foundation for further parallelization of the geochemical reactive transport module on the GPU, and provides valuable experience.
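The relation between the two reported speedups can be read through Amdahl's law. The short derivation below is an illustration rather than a result from the thesis: it assumes that both maximum speedups were measured on the same problem and that only the linear solver was accelerated. Here $p$ denotes the fraction of serial run time spent in the solver and $s$ the solver speedup; both symbols are introduced for this note.

```latex
% Amdahl's law: overall speedup when a fraction p of the run time is sped up by s
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}}

% Solving for p with S_overall = 2.8 and s = 3.4:
p = \frac{1 - 1/S_{\text{overall}}}{1 - 1/s}
  = \frac{1 - 1/2.8}{1 - 1/3.4}
  \approx \frac{0.643}{0.706}
  \approx 0.91

% For comparison, a solver share of exactly p = 0.8 with s = 3.4 would give
S_{\text{overall}} = \frac{1}{0.2 + 0.8/3.4} \approx 2.3
```

Under these assumptions the figures would correspond to a solver share of roughly 91% of the serial run time, which is consistent with the statement that the solver accounts for more than 80% of the simulation time.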
