cjmcv / hpc

Learning and practice of high performance computing

Practice

cux -- An experimental framework for performance analysis and optimization of CUDA kernel functions.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/cux

tag: cuda / simd / openmp.

hcs -- A heterogeneous computing system for multi-task scheduling optimization.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/hcs

tag: std::thread / cuda.

vky -- A Vulkan-based computing framework.

https://github.com/cjmcv/hpc/tree/master/0-frameworks/vky

tag: vulkan.

Learning

Distributed computing

mpi/mpi4py

alg_matrix_multiply ： gemm: C = A * B.
base_broadcast_scatter_gather ： Record the basic usage of Bcast, Scatter, Gather and Allgather.
base_group ： Group communication.
base_hello_world ： Environment Management Routines.
base_reduce_alltoall_scan ： Record the basic usage of Reduce, Allreduce, Alltoall, Scan and Exscan.
base_send_recv ： Record the basic usage of MPI_Send/MPI_Recv and MPI_Isend/MPI_Irecv.
base_type_contiguous ： Send and receive custom types of data by using MPI_Type_contiguous.
base_type_struct ： Send and receive custom types of data by using MPI_Type_struct.
util_bandwidth_test ： Test bandwidth by point-to-point communications.
py_base_broadcast_scatter_gather ： Record the basic usage of Bcast, Scatter, Gather and Allgather.
py_base_reduce_scan ： Record the basic usage of Reduce and Scan.
py_base_send_recv ： Record the basic usage of Send and Recv.

Heterogeneous computing

cuda

cuda_util ： Utility functions.
alg_histogram ： Histogram. Mainly introduces atomicAdd.
alg_matrix_multiply ： gemm: C = A * B.
alg_vector_add ： Vector addition: C = A + B.
alg_vector_dot_product ： Vector dot product: h_result = SUM(A * B).
alg_vector_scan ： Scan. Prefix Sum.
base_aligned_memory_access ： An experiment on aligned memory access.
base_bank_conflict ： An experiment on Bank Conflict in Shared Memory.
base_coalesced_memory_access ： An experiment on coalesced memory access.
base_float2half ： Record the basic usage of float2half.
base_hyperQ ： Demonstrate how HyperQ allows supporting devices to avoid false dependencies between kernels in different streams.
base_kernel_layout ： Record the basic execution configuration of kernel.
base_occupancy ： Record the basic usage of cudaOccupancyMaxPotentialBlockSize.
base_texture ： Record the basic usage of Texture Memory.
base_unified_memory ： A simple task consumer using threads and streams with all data in Unified Memory.
base_zero_copy ： Record the basic usage of Zero Copy.
cub_block_reduce ： Simple demonstration of cub::BlockReduce.
cub_block_scan ： Simple demonstration of cub::BlockScan.
cub_device_reduce ： Simple demonstration of DeviceReduce::Sum.
cub_device_scan ： Simple demonstration of DeviceScan::ExclusiveSum.
cub_warp_reduce ： Simple demonstration of cub::WarpReduce.
cub_warp_scan ： Simple demonstration of cub::WarpScan.
cublas_gemm_float16 ： gemm: C = A * B. Use cublas with half-precision.
thrust_iterators ： Record the basic usage of Iterators in Thrust.
thrust_sort ： Sort arrays with Thrust.
thrust_transformations ： Some of the parallel vector operations in Thrust.
thrust_vector ： Record the basic usage of Vector in Thrust.
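
The scan entries above (alg_vector_scan, cub_block_scan, cub_device_scan) all compute prefix sums. As a point of reference, here is a minimal sequential CPU sketch of the two variants, not the CUDA implementation from this repository. The function names `InclusiveScan` and `ExclusiveScan` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> InclusiveScan(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    running += in[i];
    out[i] = running;
  }
  return out;
}

// Exclusive scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
// This is the result shape an exclusive sum is expected to produce.
std::vector<int> ExclusiveScan(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  int running = 0;
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = running;
    running += in[i];
  }
  return out;
}
```

A sequential version like this is mostly useful as a correctness check for a parallel kernel: run both on the same input and compare the outputs element by element.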

vulkan

vky

opencl

ocl_util ： Utility functions.
alg_dot_product ： Vector dot product, h_result = SUM(A * B).
alg_vector_add ： Vector addition: C = A + B.
base_platform_info ： Query OpenCL platform information.

Thread

std

alg_vector_dot_product： Vector dot product: h_result = SUM(A * B). Record the basic usage of std::thread and std::async.
base_async： Record the basic usage of std::async.
util_blocking_queue： Blocking queue. Mainly implemented by thread, queue and condition_variable.
util_internal_thread： Internal Thread. Mainly implemented by thread.
util_thread_pool： Thread Pool. Mainly implemented by thread, queue, future and condition_variable.
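
The util_blocking_queue entry above names its building blocks: thread, queue and condition_variable. The following is a minimal sketch of how those pieces typically fit together. It is an assumption about the general pattern, not the code in this repository; `BlockingQueue` and `SumThroughQueue` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// A queue whose Pop() blocks until an element is available.
template <typename T>
class BlockingQueue {
 public:
  void Push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    cond_.notify_one();  // Wake one waiting consumer.
  }

  T Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate guards against spurious wakeups.
    cond_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::queue<T> queue_;
};

// A producer thread pushes 1..count; the calling thread pops and sums them.
int SumThroughQueue(int count) {
  assert(count >= 0);
  BlockingQueue<int> q;
  std::thread producer([&q, count] {
    for (int i = 1; i <= count; ++i) q.Push(i);
  });
  int sum = 0;
  for (int i = 0; i < count; ++i) sum += q.Pop();
  producer.join();
  return sum;
}
```

Notifying after the lock is released avoids waking a consumer that would immediately block on the mutex. On some toolchains the program must be built with thread support enabled (for example `-pthread` with GCC or Clang).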

openmp

alg_matrix_multiply ： gemm: C = A * B.
alg_pi_calculate ： Calculate PI using parallel, for and reduction.
base_flush ： Record the basic usage of flush.
base_mutex ： Mutex operation in openmp, including critical, atomic, lock.
base_parallel_for ： Parallel and For.
base_schedule ： Record the basic usage of schedule.
base_sections_single ： Record the basic usage of Sections and Single.
base_synchronous ： Synchronous operation in openmp, including barrier, ordered and master.
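
The alg_pi_calculate entry above combines parallel, for and reduction. A common way to do that, sketched here as an assumption rather than as the repository's code, is to integrate 4 / (1 + x^2) over [0, 1] with the midpoint rule. `EstimatePi` is a hypothetical name.

```cpp
#include <cassert>

// Approximates pi as the integral of 4 / (1 + x^2) over [0, 1].
// With OpenMP enabled, each thread accumulates a private copy of `sum`,
// and the reduction clause adds the copies together at the end.
// Without OpenMP the pragma is ignored and the loop runs serially.
double EstimatePi(long num_steps) {
  assert(num_steps > 0);
  const double step = 1.0 / static_cast<double>(num_steps);
  double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
  for (long i = 0; i < num_steps; ++i) {
    const double x = (static_cast<double>(i) + 0.5) * step;  // Midpoint of slice i.
    sum += 4.0 / (1.0 + x * x);
  }
  return sum * step;
}
```

Build with the compiler's OpenMP flag (for example `-fopenmp` with GCC) to run the loop in parallel. Without the reduction clause, concurrent updates to `sum` would be a data race.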

tbb

base_allocator ： The basic use of allocator.
base_atomic ： The basic use of atomic.
base_concurrent_hash_map ： The basic use of concurrent_hash_map.
base_concurrent_queue ： The basic use of concurrent queue.
base_mutex ： The basic use of mutex in tbb.
base_parallel_for ： The basic use of parallel_for.
base_parallel_reduce ： The basic use of parallel_reduce.
base_parallel_scan ： The basic use of parallel_scan.
base_parallel_sort ： The basic use of parallel_sort.
base_task_scheduler ： The basic use of the task scheduler.
count_strings ： Count strings. Use the concurrent_hash_map.

Coroutines

libco

asyncio

base_future： Record the basic usage of future.
base_gather： Use gather to execute tasks in parallel.
base_hello_world： Hello world. Record the basic usage of async, await and loop.
base_loop_chain： Executes nested coroutines.

SIMD

sse/avx

x86_matrix_multiply ： Matrix Multiplication.
x86_vector_dot_product ： Vector dot product: result = SUM(A * B).
x86_vector_scan ： Scan. Prefix Sum.
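
For the x86_vector_dot_product entry above, the sketch below shows one way result = SUM(A * B) can be vectorized with SSE intrinsics: process four floats per iteration, then finish the remainder with a scalar loop. It is an illustrative assumption, not the repository's implementation, and `DotProduct` is a hypothetical name. On targets without SSE the code falls back to the scalar loop alone.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

float DotProduct(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  std::size_t i = 0;
  float result = 0.0f;
#if defined(__SSE__)
  // Accumulate four partial sums in one 128-bit register.
  __m128 acc = _mm_setzero_ps();
  for (; i + 4 <= n; i += 4) {
    const __m128 va = _mm_loadu_ps(a.data() + i);  // Unaligned loads.
    const __m128 vb = _mm_loadu_ps(b.data() + i);
    acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
  }
  // Horizontal sum of the four lanes.
  float lanes[4];
  _mm_storeu_ps(lanes, acc);
  result = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
  // Scalar tail (or the whole vector when SSE is unavailable).
  for (; i < n; ++i) result += a[i] * b[i];
  return result;
}
```

Because the vector version adds elements in a different order than a plain loop, results for general floating-point inputs can differ slightly from a scalar reference; compare with a tolerance rather than exact equality.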

neon
