RNGAVXLIB: Program library for random number generation, AVX realization

Authors: M.S. Guskova (a, b); L.Yu. Barash (a, c, d; corresponding author, barash@itp.ac.ru); L.N. Shchur (a, b, c)
Keywords: Statistical methods; Monte Carlo; Random number generation; Advanced Vector Extensions (AVX)
Journal: Computer Physics Communications
Year: 2016
Publication date: March 2016
Volume: 200
Issue: Complete
Pages: 402-405
Full text size: 438 K

Abstract

We present the random number generator (RNG) library RNGAVXLIB, which contains fast AVX realizations of a number of modern random number generators, together with the ability to jump ahead inside an RNG sequence and to initialize up to 10¹⁹ independent random number streams with the block splitting method. The fast AVX implementations produce exactly the same output sequences as the original algorithms. AVX vectorization substantially improves the performance of the generators. The new realizations are up to 2 times faster than the SSE realizations implemented in the previous version of the library (Barash and Shchur, 2013), and up to 40 times faster than the original algorithms written in ANSI C.

New version program summary

Program title: RNGAVXLIB

Catalogue identifier: AEIT_v3_0

Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIT_v3_0.html

Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland

Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html

No. of lines in distributed program, including test data, etc.: 21061

No. of bytes in distributed program, including test data, etc.: 1763798

Distribution format: tar.gz

Programming language: C, Fortran.

Computer: PC, laptop, workstation, or server with Intel or AMD processor.

Operating system: Unix, Windows.

RAM: 4 Mbytes

Catalogue identifier of previous version: AEIT_v2_0

Journal reference of previous version: Comput. Phys. Comm. 184(2013)2367

Classification: 4.13.

Does the new version supersede the previous version?: Yes

Nature of problem: Any calculation requiring a uniform pseudorandom number generator, in particular Monte Carlo calculations. Any calculation requiring parallel streams of uniform pseudorandom numbers.

Solution method: The library contains realizations of the following modern and reliable generators: MT19937 [2], MRG32K3A [3], LFSR113 [4], GM19, GM31, GM61 [5, 6], and GM29, GM55, GQ58.1, GQ58.3, GQ58.4 [7, 8]. For each generator the library provides a realization written in ANSI C, a realization based on the SSE instruction set, and a realization based on the AVX instruction set. The use of vectorization substantially improves the performance of all the generators. For each of the RNGs, the library can also jump ahead inside the RNG sequence and initialize independent random number streams with the block splitting method. C and Fortran are supported.
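Jumping ahead and block splitting can be illustrated on a much simpler generator than the ones in the library. The following is a minimal sketch, assuming a 64-bit linear congruential generator as a stand-in. The names lcg_state, lcg_next, lcg_skipahead and lcg_init_sequence are hypothetical and are not part of RNGAVXLIB, whose own jump-ahead algorithms are described in [9, 10].

```c
#include <stdint.h>

/* Stand-in generator for illustration only: x -> A*x + C (mod 2^64).
   It is NOT one of the RNGAVXLIB generators; A and C are Knuth's MMIX
   constants. */
#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

typedef struct { uint64_t x; } lcg_state;   /* hypothetical state type */

/* Advance by one step and return the new state value. */
static uint64_t lcg_next(lcg_state *s) {
    s->x = LCG_A * s->x + LCG_C;            /* unsigned wrap = mod 2^64 */
    return s->x;
}

/* Jump ahead n steps in O(log n) operations. One step is the affine map
   x -> mul*x + add; the map for 2^k steps is obtained by repeated
   squaring and composed into the result for every set bit of n. */
static void lcg_skipahead(lcg_state *s, uint64_t n) {
    uint64_t acc_mul = 1, acc_add = 0;              /* identity map */
    uint64_t cur_mul = LCG_A, cur_add = LCG_C;      /* map for 2^k steps */
    while (n > 0) {
        if (n & 1) {                                /* acc = cur o acc */
            acc_mul = acc_mul * cur_mul;
            acc_add = acc_add * cur_mul + cur_add;
        }
        cur_add = (cur_mul + 1) * cur_add;          /* cur = cur o cur */
        cur_mul = cur_mul * cur_mul;
        n >>= 1;
    }
    s->x = acc_mul * s->x + acc_add;
}

/* Block splitting: stream number k starts k * block_length steps into
   the sequence. The caller must keep k * block_length below 2^64. */
static void lcg_init_sequence(lcg_state *s, uint64_t seed,
                              uint64_t sequence_number,
                              uint64_t block_length) {
    s->x = seed;
    lcg_skipahead(s, sequence_number * block_length);
}
```

With this scheme the streams do not overlap as long as each stream consumes at most block_length numbers.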

Reasons for new version: Modern CPUs support vectorization better than the CPUs available two years ago, when the previous version of the library was prepared. In particular, Advanced Vector Extensions 2 (AVX2) are now supported by CPUs manufactured by Intel and AMD. AVX2 has been supported by Intel CPUs since the Haswell microarchitecture was released in June 2013, and has been supported by AMD CPUs since the Steamroller Family 15h microarchitecture was released in January 2014. An important new feature of this version is the ability to employ the AVX2 instruction set of a CPU in order to speed up the calculations. As a result, the new RNG realizations employing AVX2 are up to 2 times faster than the realizations implemented in the previous version of the library.

Summary of revisions:

1. We added fast AVX realizations for the generators, which are up to 2 times faster than the SSE realizations implemented in the previous version of the library [1], and up to 40 times faster than the original algorithms written in ANSI C.
2. The function call interface has been simplified compared to previous versions.
3. We added automatic detection, at the compilation stage, of whether the CPU supports SSE and/or AVX vectorization, together with functions that employ SSE and AVX vectorization only if the CPU supports them.
4. We added support for simultaneous generation of two independent output sequences for the LFSR113 generator using AVX vectorization.

Restrictions: The AVX realizations of the generators require an Intel or AMD CPU supporting the AVX2 instruction set. The SSE realizations require an Intel or AMD CPU supporting the SSE2 instruction set. The SSE realization of the lfsr113 generator additionally requires a CPU supporting the SSE4.1 instruction set.

Additional comments: The function call interface has been simplified compared to the previous versions. For each of the generators, RNGAVXLIB supports the following functions, where rng should be replaced by the name of a particular generator:

void rng_init_(rng_state* state);

void rng_init_sequence_(rng_state* state, unsigned long long SequenceNumber);

void rng_skipahead_(rng_state* state, unsigned long long N);

unsigned int rng_generate_(rng_state* state);

float rng_generate_uniform_float_(rng_state* state);

unsigned int rng_ansi_generate_(rng_state* state);

float rng_ansi_generate_uniform_float_(rng_state* state);

unsigned int rng_sse_generate_(rng_state* state);

float rng_sse_generate_uniform_float_(rng_state* state);

unsigned int rng_avx_generate_(rng_state* state);

float rng_avx_generate_uniform_float_(rng_state* state);

void rng_print_state_(rng_state* state);
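How a 32-bit integer output can be turned into a single-precision uniform value may be sketched as follows. This is a common conversion, shown under the assumption that 24 random bits per float are sufficient. The helper name uniform_float_from_u32 is hypothetical, and the library's own rng_generate_uniform_float_ routines may use a different mapping.

```c
#include <stdint.h>

/* Map a 32-bit integer to a float in [0, 1). Only the top 24 bits are
   kept because a float significand holds 24 bits, so the integer-to-float
   conversion is exact and the result can never round up to 1.0f.
   Hypothetical helper, not the library's own conversion. */
static float uniform_float_from_u32(uint32_t x) {
    return (float)(x >> 8) * (1.0f / 16777216.0f);   /* 16777216 = 2^24 */
}
```

The largest possible result is (2^24 - 1) / 2^24, which is strictly below 1.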

The function call interface for the rng_skipahead_ function, which jumps ahead $N$ output values inside an RNG sequence, can be slightly different for some of the RNGs. For example, the function

void mt19937_skipahead_(mt19937_state* state, unsigned long long a, unsigned b);

skips ahead $N = a \cdot 2^{b}$ numbers, where $N < 2^{512}$, and the function

void gm55_skipahead_(gm55_state* state, unsigned long long offset64, unsigned long long offset0);

skips ahead $N = 2^{64} \cdot \mathrm{offset64} + \mathrm{offset0}$ numbers. The detailed function call interface can be found in the header files of the include directory. Examples of using the library can be found in the examples directory.
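A skip distance of the form 2^64 · offset64 + offset0 arises naturally when a stream index is multiplied by a block length and the product no longer fits into 64 bits. The following is a minimal sketch of that arithmetic in portable C. The type skip_offset and the function skip_offset_mul are hypothetical helpers for illustration and are not part of the library.

```c
#include <stdint.h>

/* Hypothetical helper type: a distance N = 2^64 * offset64 + offset0. */
typedef struct {
    uint64_t offset64;   /* high 64 bits of N */
    uint64_t offset0;    /* low 64 bits of N */
} skip_offset;

/* Full 128-bit product a * b, computed from four 32x32-bit partial
   products so that no compiler-specific 128-bit type is needed. */
static skip_offset skip_offset_mul(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits 64..127 */

    /* Sum of everything that lands in bits 32..63, including carries. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    skip_offset r;
    r.offset0  = (p0 & 0xFFFFFFFFu) | (mid << 32);
    r.offset64 = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

For a stream index k and a block length L, the two fields of skip_offset_mul(k, L) would then be passed as the offset64 and offset0 arguments of gm55_skipahead_.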

Some of the generators have several versions of the rng_init_sequence_ routine, for example, rng_init_short_sequence_, rng_init_medium_sequence_, rng_init_long_sequence_ (see details in [1, 10]). The maximum number of sequences and the maximum length of each sequence for pseudorandom streams are given in [1, 10]. The algorithms used to jump ahead in the RNG sequence and to initialize parallel streams of pseudorandom numbers are described in detail in [9, 10].

This version of the library automatically detects whether the CPU supports SSE and/or AVX vectorization at the compilation stage. During the compilation of the library, the -march=native compiler option is used, which allows the use of predefined macros such as __SSE2__ and __AVX2__ in the source code. This is supported by both GNU and Intel compilers. The functions rng_generate_ and rng_generate_uniform_float_ employ SSE and AVX vectorization only if the CPU supports them.
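The compile-time selection described above can be sketched with the predefined macros alone. This is a minimal illustration, assuming a compiler that defines __AVX2__ and __SSE2__ (as gcc does under -march=native on a capable machine). The macro RNG_BACKEND_NAME and the function rng_backend_name are hypothetical and do not appear in the library.

```c
/* Pick one implementation at compile time. The preprocessor keeps only
   the first branch whose macro the compiler has predefined, so code for
   unsupported instruction sets is never compiled in.
   RNG_BACKEND_NAME is a hypothetical name used for illustration. */
#if defined(__AVX2__)
#  define RNG_BACKEND_NAME "avx"
#elif defined(__SSE2__)
#  define RNG_BACKEND_NAME "sse"
#else
#  define RNG_BACKEND_NAME "ansi"
#endif

/* Report which branch was selected when this file was compiled. */
static const char *rng_backend_name(void) {
    return RNG_BACKEND_NAME;
}
```

Because the decision is made when the code is compiled, a binary built this way should be run on a CPU that supports at least the instruction sets of the build machine.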

Table 1: Speed of the realizations. CPU: Intel Xeon E5-2650v3 (2.3 GHz); Compiler: gcc; Optimization: -O3.

This version of the library also supports simultaneous generation of two independent output sequences for the LFSR113 generator using the AVX vectorization:

void lfsr113_avx_generate_two_(lfsr113_state* state, unsigned* out1, unsigned* out2);

This is the fastest way to generate LFSR113 random numbers on a CPU that supports the AVX2 instruction set. The function lfsr113_skipahead_ jumps ahead only in the first LFSR output sequence. Jumping ahead in the second output sequence can be performed with the separate lfsr113_skipahead2_ routine.
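A scalar analogue of generating two LFSR113 sequences side by side may look as follows. This is a sketch only: it follows the LFSR113 recurrence of [4] in plain C and keeps the two sequences in two separate state structures, which is an assumption made for clarity. The names lfsr113_demo_state, lfsr113_demo_next and lfsr113_demo_generate_two are hypothetical, and nothing here reflects the layout of the library's lfsr113_state or its AVX code.

```c
#include <stdint.h>

/* Hypothetical state: the four components of one LFSR113 sequence.
   Seeds must satisfy z1 > 1, z2 > 7, z3 > 15, z4 > 127. */
typedef struct { uint32_t z1, z2, z3, z4; } lfsr113_demo_state;

/* One step of the LFSR113 recurrence in plain C. */
static uint32_t lfsr113_demo_next(lfsr113_demo_state *s) {
    uint32_t b;
    b     = ((s->z1 << 6)  ^ s->z1) >> 13;
    s->z1 = ((s->z1 & 0xFFFFFFFEu) << 18) ^ b;
    b     = ((s->z2 << 2)  ^ s->z2) >> 27;
    s->z2 = ((s->z2 & 0xFFFFFFF8u) << 2)  ^ b;
    b     = ((s->z3 << 13) ^ s->z3) >> 21;
    s->z3 = ((s->z3 & 0xFFFFFFF0u) << 7)  ^ b;
    b     = ((s->z4 << 3)  ^ s->z4) >> 12;
    s->z4 = ((s->z4 & 0xFFFFFF80u) << 13) ^ b;
    return s->z1 ^ s->z2 ^ s->z3 ^ s->z4;
}

/* Advance two independent sequences in lockstep. A vectorized version
   can place both states in one wide register and do this in one pass. */
static void lfsr113_demo_generate_two(lfsr113_demo_state *s1,
                                      lfsr113_demo_state *s2,
                                      uint32_t *out1, uint32_t *out2) {
    *out1 = lfsr113_demo_next(s1);
    *out2 = lfsr113_demo_next(s2);
}
```

Each sequence evolves exactly as it would on its own, which is why a jump ahead has to be applied to each of the two sequences separately.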

GNU Fortran does not have compiler directives for data alignment to assist vectorization, although Intel Fortran has directives for that, such as !dir$ attributes align:32. By default, GNU Fortran aligns all variables to 16-byte boundaries, which is sufficient to efficiently use SSE, but is not sufficient for AVX. We find that applying an additional SAVE command to the generator state in Fortran results, in particular, in alignment of the data to 32-byte boundaries. This allows one to employ AVX realizations from Fortran (see the examples directory). We have tested this on workstations with various CPUs and various versions of Linux.
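On the C side, the 32-byte requirement can be stated and checked directly. The following is a minimal sketch, assuming a C11 compiler. The type demo_aligned_state and the function is_aligned_32 are hypothetical and only illustrate the alignment condition, not the library's own state types.

```c
#include <stdint.h>

/* Return nonzero when p lies on a 32-byte boundary, which is what
   aligned 256-bit AVX loads and stores expect. */
static int is_aligned_32(const void *p) {
    return ((uintptr_t)p % 32u) == 0;
}

/* Hypothetical state buffer. The C11 _Alignas specifier on the first
   member raises the alignment of the whole structure to 32 bytes. */
typedef struct {
    _Alignas(32) uint32_t words[8];
} demo_aligned_state;
```

A check such as is_aligned_32 can also be called on a state passed in from Fortran to confirm that the SAVE attribute had the intended effect on a given compiler.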

Development and optimization of the algorithms were supported by the Russian Science Foundation project No. 14-21-00158. Benchmark testing was partially supported by the Russian Foundation for Basic Research project No. 13-07-00570 and by the Supercomputing Center of Lomonosov Moscow State University [11].

Table 2: Speed of the realizations. CPU: Intel Core i7-4790K (4 GHz); Compiler: gcc; Optimization: -O3.

Running time: The running time is of the order of 20 s for generating 10⁹ pseudorandom numbers on a PC with an Intel Core i7-940 CPU. The speed of random number generation on CPUs widely used in modern servers and workstations is shown in Tables 1 and 2, respectively (see also [6, 7]).

References:

[1] L.Yu. Barash, L.N. Shchur, RNGSSELIB: Program library for random number generation. More generators, parallel streams of random numbers and Fortran compatibility, Computer Physics Communications, 184 (10), 2367–2369 (2013).
[2] M. Matsumoto, T. Nishimura, Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. on Mod. and Comp. Simul., 8 (1), 3–30 (1998).
[3] P. L’Ecuyer, Good Parameter Sets for Combined Multiple Recursive Random Number Generators, Oper. Res., 47 (1), 159–164 (1999).
[4] P. L’Ecuyer, Tables of Maximally-Equidistributed Combined LFSR Generators, Math. of Comp., 68 (255), 261–269 (1999).
[5] L. Barash, L.N. Shchur, Periodic orbits of the ensemble of Sinai-Arnold cat maps and pseudorandom number generation, Phys. Rev. E, 73, 036701 (2006).
[6] L.Yu. Barash, L.N. Shchur, RNGSSELIB: Program library for random number generation, SSE2 realization, Computer Physics Communications, 182 (7), 1518–1527 (2011).
[7] L.Yu. Barash, Applying dissipative dynamical systems to pseudorandom number generation: Equidistribution property and statistical independence of bits at distances up to logarithm of mesh size, Europhysics Letters (EPL), 95, 10003 (2011).
[8] L.Yu. Barash, Geometric and statistical properties of pseudorandom number generators based on multiple recursive transformations, Springer Proceedings in Mathematics and Statistics, Springer-Verlag, Berlin, Heidelberg, Vol. 23, 265–280 (2012).
[9] L.Yu. Barash, L.N. Shchur, On the generation of parallel streams of pseudorandom numbers, Programmnaya inzheneriya, 1, 24 (2013) (in Russian).
[10] L.Yu. Barash, L.N. Shchur, PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs, Computer Physics Communications, 185 (4), 1343–1353 (2014).
[11] Vl.V. Voevodin, S.A. Zhumatiy, S.I. Sobolev, A.S. Antonov, P.A. Bryzgalov, D.A. Nikitenko, K.S. Stefanov, Vad.V. Voevodin, Practice of “Lomonosov” Supercomputer, Open Systems J., Moscow: Open Systems Publ., no. 7 (2012) (in Russian).
