Keywords:
Advanced Encryption Standard New Instruction;counter mode;encryption speed;encryption speedup;encryption time;OpenMP;parallel encryption;performance
Abstract:
In the open environment of cloud computing, large amounts of user data must be encrypted and decrypted quickly to maintain confidentiality and provide a high quality of service. The Advanced Encryption Standard (AES), the standard encryption algorithm, offers better security and efficiency than competing algorithms, so it is widely used in cloud computing and other fields. However, software implementations of AES suffer from low efficiency, while hardware implementations require the purchase of special-purpose devices. Adopting a dedicated instruction set resolves both drawbacks. We therefore propose NIPAES, a fast parallel cryptographic algorithm based on the AES-NI (New Instructions) instruction set and multiple CPU cores. NIPAES exploits the block structure of AES and the inherent parallelism of Counter (CTR) mode, using OpenMP to distribute workloads evenly across threads, each of which executes AES-NI instructions to perform encryption/decryption. Compared with serial table-lookup AES on the CPU, parallel AES on the CPU, and serial AES based on AES-NI, NIPAES improves performance significantly: the experimental results show average speedups of 3197.78x, 196.12x, and 7.71x over these algorithms, respectively.
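The parallelism the abstract attributes to CTR mode comes from the fact that each keystream block depends only on the key, the nonce, and the block index, so blocks can be processed independently. A minimal sketch of that structure is shown below; `toy_encrypt_block` is a hypothetical XOR stand-in for the AES-NI rounds (the paper's actual cipher), used here only so the skeleton compiles on any machine. Function names and the counter layout are illustrative assumptions, not the authors' code.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 16  /* AES block size in bytes */

/* Hypothetical stand-in for AES-NI encryption of one 16-byte block.
   The real NIPAES implementation would run the AES rounds via
   AES-NI intrinsics (e.g. _mm_aesenc_si128) here instead. */
static void toy_encrypt_block(const uint8_t in[BLOCK], uint8_t out[BLOCK],
                              const uint8_t key[BLOCK]) {
    for (int i = 0; i < BLOCK; i++)
        out[i] = in[i] ^ key[i];          /* placeholder for AES rounds */
}

/* CTR mode: keystream block i = E_k(nonce || i). Because every block
   is independent, the loop iterations can be split evenly across
   threads, which is exactly what the OpenMP pragma does. In CTR mode
   encryption and decryption are the same operation. */
void ctr_crypt(uint8_t *data, size_t nblocks,
               const uint8_t key[BLOCK], uint64_t nonce) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)nblocks; i++) {
        uint8_t ctr[BLOCK] = {0}, ks[BLOCK];
        uint64_t c = (uint64_t)i;
        memcpy(ctr, &nonce, sizeof nonce);   /* nonce in the high half  */
        memcpy(ctr + 8, &c, sizeof c);       /* counter in the low half */
        toy_encrypt_block(ctr, ks, key);
        for (int b = 0; b < BLOCK; b++)
            data[i * BLOCK + b] ^= ks[b];    /* XOR keystream into data */
    }
}
```

Because CTR is its own inverse, calling `ctr_crypt` twice with the same key and nonce restores the original plaintext; with `schedule(static)` each thread receives an equal contiguous share of the blocks, matching the even workload distribution described above.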
Abstract:
Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra, a key topic in scientific computing and engineering practice. Much effort has been put into accelerating SpMV, and several parallel solutions have been proposed. This paper focuses on a special type of SpMV, sparse quasi-diagonal matrix-vector multiplication (SQDMV). Sparse quasi-diagonal matrices are key to solving many differential equations, yet very little research has been done in this field. This paper discusses data structures and algorithms for SQDMV that are efficiently implemented on the Compute Unified Device Architecture (CUDA) platform for the fine-grained parallel architecture of the graphics processing unit (GPU). We present HDC, a new diagonal storage format that hybridizes the diagonal format (DIA) and the compressed sparse row format (CSR); it overcomes the inefficiency of DIA in storing irregular matrices and the load imbalance of CSR in storing non-zero elements. Furthermore, HDC can adjust the storage bandwidth of the diagonal part to adapt to different degrees of scattering in the sparse matrix, so as to achieve a higher compression ratio than DIA and CSR and reduce the computational complexity. Our GPU implementation shows that HDC performs better than the other formats, especially for matrices with scattered points outside the main diagonal. In addition, we combine the different parts of HDC into a unified kernel to obtain a better compression ratio and a higher speedup on the GPU.
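The hybrid idea behind HDC can be sketched in a few lines: the dense band of a quasi-diagonal matrix is stored in DIA format, the scattered off-band entries in CSR, and y = Ax is the sum of the two partial products. The serial C sketch below, under that assumption, shows only the data layout and the arithmetic of the split; the paper's actual CUDA kernels parallelize the same computation across GPU threads.

```c
#include <stddef.h>

/* DIA part: ndiag stored diagonals; offsets[d] is the column offset
   of diagonal d, so diags[d*n + i] holds A[i][i + offsets[d]]
   (rows padded with zeros where the diagonal runs off the matrix). */
void spmv_dia(int n, int ndiag, const int *offsets,
              const double *diags, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int d = 0; d < ndiag; d++) {
            int j = i + offsets[d];
            if (j >= 0 && j < n)
                sum += diags[(size_t)d * n + i] * x[j];
        }
        y[i] = sum;
    }
}

/* CSR part for the scattered off-band entries, accumulated into the
   y already produced by the DIA part. */
void spmv_csr_add(int n, const int *rowptr, const int *colidx,
                  const double *vals, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            y[i] += vals[k] * x[colidx[k]];
}
```

Keeping one scattered element in CSR instead of widening the DIA band avoids storing an entire mostly-zero diagonal for it, which is the compression-ratio advantage the abstract claims for HDC over pure DIA.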