用户名: 密码: 验证码:
RNGAVXLIB: Program library for random number generation, AVX realization
详细信息    查看全文
文摘
We present the random number generator (RNG) library RNGAVXLIB, which contains fast AVX realizations of a number of modern random number generators, and also the abilities to jump ahead inside a RNG sequence and to initialize up to 1019 independent random number streams with block splitting method. Fast AVX implementations produce exactly the same output sequences as the original algorithms. Usage of AVX vectorization allows to substantially improve performance of the generators. The new realizations are up to 2 times faster than the SSE realizations implemented in the previous version of the library (Barash and Shchur, 2013), and up to 40 times faster compared to the original algorithms written in ANSI C.

New version program summary

Program title: RNGAVXLIB

Catalogue identifier: AEIT_v3_0

Program summary URL:lass="interref" data-locatorType="url" data-locatorKey="http://cpc.cs.qub.ac.uk/summaries/AEIT_v3_0.html">http://cpc.cs.qub.ac.uk/summaries/AEIT_v3_0.html

Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland

Licensing provisions: Standard CPC licence, lass="interref" data-locatorType="url" data-locatorKey="http://cpc.cs.qub.ac.uk/licence/licence.html">http://cpc.cs.qub.ac.uk/licence/licence.html

No. of lines in distributed program, including test data, etc.: 21061

No. of bytes in distributed program, including test data, etc.: 1763798

Distribution format: tar.gz

Programming language: C, Fortran.

Computer: PC, laptop, workstation, or server with Intel or AMD processor.

Operating system: Unix, Windows.

RAM: 4 Mbytes

Catalogue identifier of previous version: AEIT_v2_0

Journal reference of previous version: Comput. Phys. Comm. 184(2013)2367

Classification: 4.13.

Does the new version supersede the previous version?: Yes

Nature of problem: Any calculation requiring uniform pseudorandom number generator, in particular, Monte Carlo calculations. Any calculation requiring parallel streams of uniform pseudorandom numbers.

Solution method: The library contains realization of the following modern and reliable generators: lass="boldFont">MT19937 [2], lass="boldFont">MRG32K3A [3], lass="boldFont">LFSR113 [4], lass="boldFont">GM19, lass="boldFont">GM31, lass="boldFont">GM61 [5, 6], and lass="boldFont">GM29, lass="boldFont">GM55, lass="boldFont">GQ58.1, lass="boldFont">GQ58.3, lass="boldFont">GQ58.4 [7, 8]. The library contains realizations written in ANSI C, realizations based on SSE command set and realizations based on AVX command set. The use of vectorization allows substantial improvement in performance of all the generators. The library also contains the ability to jump ahead inside the RNG sequence and to initialize independent random number streams with block splitting method for each of the RNGs. C and Fortran are supported.

Reasons for new version: Modern CPUs better support vectorization compared to the CPUs available two years ago when the previous version of the library was prepared. In particular, Advanced Vector Instructions 2 (AVX2) are now supported by CPUs fabricated by Intel and AMD. AVX2 has been supported by Intel CPUs since the Haswell microarchitecture was released in June 2013, and has been supported by AMD CPUs since the Streamroller Family 15h microarchitecture was released in January 2014. An important new feature of this version is the ability to employ the AVX2 instruction set of a CPU in order to speed up the calculations. As a result, the new RNG realizations employing AVX2 are up to 2 times faster than the realizations implemented in the previous version of the library.

Summary of revisions:l class="listitem" id="list_l000005">

lass="label">1.

We added fast AVX realizations for the generators, which are up to 2 times faster than the SSE realizations implemented in the previous version of the library [1], and up to 40 times faster compared to the original algorithms written in ANSI C.

lass="label">2.

The function call interface has been simplified compared to previous versions.

lass="label">3.

We added automatic detection of whether the CPU supports SSE and/or AVX vectorization at the compilation stage and the functions which employ SSE and AVX vectorization only if the CPU supports them.

lass="label">4.

We added support for simultaneous generation of two independent output sequences for the LFSR113 generator using the AVX vectorization.

l>

Restrictions: For AVX realizations of the generators, Intel or AMD CPU supporting AVX2 command set is required. For SSE realizations of the generators, Intel or AMD CPU supporting SSE2 command set is required. In order to use the SSE realization for the lfsr113 generator, CPU must support SSE4.1 command set.

Additional comments: The function call interface has been simplified compared to the previous versions. For each of the generators, RNGAVXLIB supports the following functions, where lass="boldFont">rng should be replaced by name of a particular generator:

lass="boldFont">void rng_init_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">void rng_init_sequence_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state,unsigned long long SequenceNumber);

lass="boldFont">void rng_skipahead_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state, unsigned long long N);

lass="boldFont">unsigned int rng_generate_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">float rng_generate_uniform_float_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">unsigned int rng_ansi_generate_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">float rng_ansi_generate_uniform_float_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">unsigned int rng_sse_generate_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">float rng_sse_generate_uniform_float_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">unsigned int rng_avx_generate_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">float rng_avx_generate_uniform_floatlass="boldFont">_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

lass="boldFont">void rng_print_state_(rng_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state);

The function call interface for the lass="boldFont">rng_skipahead_ function, which jumps ahead lsi13" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si13.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=476a1a428471063f5ac3a8d3517715d2" title="Click to view the MathML source">Nlass="mathContainer hidden">lass="mathCode">ltimg="si13.gif" overflow="scroll">N output values inside an RNG sequence, can be slightly different for some of the RNGs. For example, the function

lass="boldFont">void mt19937_skipahead_(mt19937_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state, unsigned long long a, unsigned b);

skips ahead lsi15" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si15.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=9fce1f3321595a38ba6a179bd1427b92" title="Click to view the MathML source">N=a⋅2blass="mathContainer hidden">lass="mathCode">ltimg="si15.gif" overflow="scroll">N=a2b numbers, where lsi16" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si16.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=9af995f8f86f45b15f2cbce51ce77bb4" title="Click to view the MathML source">N<2512lass="mathContainer hidden">lass="mathCode">ltimg="si16.gif" overflow="scroll">N<2512, and the function

lass="boldFont">void gm55_skipahead_(gm55_statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state, unsigned long long offset64, unsigned long long offset0);

skips ahead lsi18" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si18.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=2b6873e018839ac2f70237a851a9a90d" title="Click to view the MathML source">N=264lass="mathContainer hidden">lass="mathCode">ltimg="si18.gif" overflow="scroll">N=264lass="boldFont">offset64lsi19" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si19.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=64c228ae31df62b90ae66899fd89ab9d" title="Click to view the MathML source">+lass="mathContainer hidden">lass="mathCode">ltimg="si19.gif" overflow="scroll">+lass="boldFont">offset0 numbers. The detailed function call interface can be found in the header files of the include directory. The examples of using the library can be found in the examples directory.

Some of the generators have several versions of the lass="boldFont">rng_init_sequence_ routine, for example, lass="boldFont">rng_init_short_sequence_, lass="boldFont">rng_init_medium_sequence_, lass="boldFont">rng_init_long_sequence_ (see details in [1, 10]). Maximal number of sequences and maximal length of each sequence for pseudorandom streams are indicated in [1, 10]. The algorithms used to jump ahead in the RNG sequence and to initialize parallel streams of pseudorandom numbers are described in detail in [9, 10].

This version of the library automatically detects whether the CPU supports SSE and/or AVX vectorization at the compilation stage. During the compilation of the library, the lsi20" class="mathmlsrc">le="View the MathML source" class="mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si20.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=1add3f875e08353dbb58d81bfffc16b2">lass="imgLazyJSB inlineImage" height="12" width="116" alt="View the MathML source" style="margin-top: -5px; vertical-align: middle" title="View the MathML source" src="/sd/grey_pxl.gif" data-inlimgeid="1-s2.0-S0010465515004117-si20.gif">lass="mathContainer hidden">lass="mathCode">ltimg="si20.gif" overflow="scroll">le mathvariant="bold">marchle>le mathvariant="bold">=le>le mathvariant="bold">nativele> compiler option is used, which allows the use of predefined macros such as lass="boldFont">__SSE2__ and lass="boldFont">__AVX2__ in the source code. This is supported by both GNU and Intel compilers. The functions lass="boldFont">rng_generate_ and lass="boldFont">rng_generate_uniform_float employ SSE and AVX vectorization only if the CPU supports them.

Table 1: Speed of the realizations. CPU: Intel Xeon E5-2650v3 (2.3 GHz); Compiler: gcc; Optimization: -O3.

lk000005">lass="imgLazyJSB inlineImage" height="264" width="416" alt="Full-size image (77 K)" title="Full-size image (77 K)" src="/sd/grey_pxl.gif" data-inlimgeid="1-s2.0-S0010465515004117-fx1.jpg">

This version of the library also supports simultaneous generation of two independent output sequences for the LFSR113 generator using the AVX vectorization:

lass="boldFont">void lfsr113_avx_generate_two_(lfsr113 _statelsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">state, unsignedlsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">out1, unsignedlsi1" class="mathmlsrc">lass="formulatext stixSupport mathImg" data-mathURL="/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0010465515004117&_mathId=si1.gif&_user=111111111&_pii=S0010465515004117&_rdoc=1&_issn=00104655&md5=d1fd9c2150af7da4a0e729386c9757ce" title="Click to view the MathML source">∗lass="mathContainer hidden">lass="mathCode">ltimg="si1.gif" overflow="scroll">∗lass="boldFont">out2);

This is the fastest possible way to generate LFSR113 random numbers using the CPU which supports the AVX2 instruction set. The function lass="boldFont">lfsr113_skipahead_ jumps ahead only in the first LFSR output sequence. Jumping ahead in the second output sequence can be performed with the separate lass="boldFont">lfsr113_skipahead2_ routine.

GNU Fortran does not have compiler directives for data alignment to assist vectorization, although Intel Fortran has directives for that, such as lass="boldFont">!dir$ lass="boldFont">attributes align:32. By default, GNU Fortran aligns all variables to 16-byte boundaries, which is sufficient to efficiently use SSE, but is not sufficient for AVX. We find that applying an additional SAVE command to the generator state in Fortran results, in particular, in alignment of the data to 32-byte boundaries. This allows one to employ AVX realizations from Fortran (see the examples directory). We have tested this on workstations with various CPUs and various versions of Linux.

Development and optimization of the algorithms were supported by the Russian Science Foundation project No. 14-21-00158. Benchmark testing was partially supported by Russian Foundation for Basic Research project No. 13-07-00570 and by the Supercomputing Center of Lomonosov Moscow State University [11].

Table 2: Speed of the realizations. CPU: Intel Core i7-4790K (4 GHz); Compiler: gcc; Optimization: -O3.

lk000010">lass="imgLazyJSB inlineImage" height="264" width="416" alt="Full-size image (78 K)" title="Full-size image (78 K)" src="/sd/grey_pxl.gif" data-inlimgeid="1-s2.0-S0010465515004117-fx2.jpg">

Running time: Running time is of the order of 20 sec for generating 109 pseudorandom numbers with a PC based on Intel Core i7-940 CPU. Speed of the random number generation on CPUs widely used in modern servers and workstations is shown in Tables 1 and 2 respectively (see also [6, 7]).

References:l class="listitem" id="list_l000010">

lass="label">[1]

L.Yu Barash, L.N. Shchur, RNGSSELIB: Program library for random number generation. More generators, parallel streams of random numbers and Fortran compatibility, Computer Physics Communications, 184(10), 2367–2369 (2013).

lass="label">[2]

M. Matsumoto and T. Tishimura, Mersenne Twister: A 623- dimensionally equidistributed uniform pseudorandom number generator, ACM Trans. on Mod. and Comp. Simul. 8 (1), 3–30 (1998).

lass="label">[3]

P L’Ecuyer, Good Parameter Sets for Combined Multiple Recursive Random Number Generators, Oper. Res. 47 (1), 159–164 (1999).

lass="label">[4]

P L’Ecuyer, Tables of Maximally-Equidistributed Combined LFSR Generators, Math. of Comp., 68 (255), 261–269 (1999).

lass="label">[5]

L. Barash, L.N. Shchur, Periodic orbits of the ensemble of Sinai- Arnold cat maps and pseudorandom number generation, Phys. Rev. E 73, 036701 (2006).

lass="label">[6]

L.Yu Barash, L.N. Shchur, RNGSSELIB: Program library for random number generation, SSE2 realization, Computer Physics Communications, 182 (7), 1518–1527 (2011).

lass="label">[7]

L.Yu. Barash, Applying dissipative dynamical systems to pseudorandom number generation: Equidistribution property and statistical independence of bits at distances up to logarithm of mesh size, Europhysics Letters (EPL) 95, 10003 (2011).

lass="label">[8]

L.Yu. Barash, Geometric and statistical properties of pseudorandom number generators based on multiple recursive transformations // Springer Proceedings in Mathematics and Statistics, Springer-Verlag, Berlin, Heidelberg, Vol. 23, 265–280 (2012).

lass="label">[9]

L.Yu. Barash, L.N. Shchur, On the generation of parallel streams of pseudorandom numbers, Programmnaya inzheneriya, 1 (2013) 24 (in Russian)

lass="label">[10]

L.Yu. Barash, L.N. Shchur, PRAND: GPU accelerated parallel random number generation library: Using most reliable algorithms and applying parallelism of modern GPUs and CPUs, Computer Physics Communications, 185(4), 1343–1353 (2014).

lass="label">[11]

Voevodin Vl.V., Zhumatiy S.A., Sobolev S.I., Antonov A.S., Bryzgalov P.A., Nikitenko D.A., Stefanov K.S., Voevodin Vad.V., Practice of “Lomonosov” Supercomputer // Open Systems J. - Moscow: Open Systems Publ., 2012, no.7. (In Russian)

l>

© 2004-2018 中国地质图书馆版权所有 京ICP备05064691号 京公网安备11010802017129号

地址:北京市海淀区学院路29号 邮编:100083

电话:办公室:(+86 10)66554848;文献借阅、咨询服务、科技查新:66554700