Programming Model


This section draws on the CUDA programming model; understanding it helps us write high-performance parallel code.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

Host & Device

In CUDA, the Kernel is executed in parallel on the GPU side (Device), while the CPU side (Host) is responsible for serial tasks such as writing and reading data, specifying the thread-group size, and invoking the Kernel.

[Figure: host & device]

The two work in conjunction: for example, memory is allocated on the Host and then copied to the Device.

// Allocate device memory
cudaMalloc((void**) &d_in_V, V.size() * sizeof(int));
// Copy from host memory to device memory
cudaMemcpy(d_in_V, V.data(), V.size() * sizeof(int), cudaMemcpyHostToDevice);

Take the following CUDA program (matrix addition) as an example. The kernel function runs in parallel across the GPU's threads; each thread picks up part of the data according to its index, performs the operation, and writes the result back to the global array. In this addition, each thread is responsible for computing the element at the same position in the two matrices.

// Kernel definition
__global__ void MatAdd(
    float A[N][N], // Input array 1
    float B[N][N], // Input array 2
    float C[N][N]) // Result array
{
    // Built-in variables describing the thread's position; only available in kernel functions
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j]; // Perform the addition and write back
}

int main()
{
    // ... omitting the buffer creation process
    // Kernel invocation
    dim3 threadsPerBlock(16, 16); // Specify the thread-group size
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); // Launch the kernel
}

CUDA vs Compute Shader

"single source" is undoubtedly one of the highlights of CUDA, i.e. Host and Device code are written in C++, which definitely reduces the learning cost for users. This is definitely not possible with Compute Shader, which uses the rendering API. Device code must be written in the Shader language, which makes synchronization difficult due to RPC-like calls, and the Shader language has a lot of limitations (no recursion, restricted parameter types).

The following figure, from this PPT, compares the differences between CUDA and Compute Shader.

[Figure: CUDA vs compute shaders]

CUDA C++ lets developers write kernel functions in C++ and compile them into GPU-executable code with nvcc. If we want to do the same thing on the web, JavaScript does not extend that way; in other words, Device and Host code are hard to write together.

[Figure: CUDA vs compute shaders]

A simple approach is to write the Device code in a string and use the compute pipeline provided by the WebGPU API, as shown in the sketch after the following list. The next question is: which language should the Device code in that string be written in?

  • WGSL. Using WebGPU's shader language is the most straightforward choice, but it carries some learning cost for front-end developers (in practice this is manageable, since only the Compute Shader part needs to be learned), and you lose syntax highlighting when the code lives in a string.
  • TS. This is the idea behind the GWebGPU project, which restores syntax highlighting through an editor plugin.
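
Whichever language is chosen, the host-side flow is the same. Below is a minimal sketch (the doubling kernel and all variable names are illustrative, not part of any library) of feeding WGSL source held in a string into the compute pipeline provided by the WebGPU API:

// A minimal sketch: Device code lives in a JS string and enters the
// WebGPU compute pipeline; assumes a browser with WebGPU support.
const doubleShader = `
@group(0) @binding(0) var<storage, read_write> data : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // Each thread doubles the element matching its global index
  data[gid.x] = data[gid.x] * 2.0;
}
`;

async function runDouble(input: Float32Array): Promise<void> {
  // Host-side setup: adapter, device, and a storage buffer holding the data
  const adapter = await navigator.gpu.requestAdapter();
  const device = await adapter!.requestDevice();
  const buffer = device.createBuffer({
    size: input.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
  new Float32Array(buffer.getMappedRange()).set(input);
  buffer.unmap();

  // The Device code string is compiled into a shader module here
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
      module: device.createShaderModule({ code: doubleShader }),
      entryPoint: 'main',
    },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [{ binding: 0, resource: { buffer } }],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  // Dispatch enough 64-thread workgroups to cover the whole input
  pass.dispatchWorkgroups(Math.ceil(input.length / 64));
  pass.end();
  device.queue.submit([encoder.finish()]);
}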

Although the usage of CUDA and Compute Shader differs considerably, it is not difficult to port a CUDA implementation of an algorithm to a Compute Shader, as long as the Compute Shader is feature-rich enough.
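
For instance, the MatAdd kernel above could be ported to WGSL roughly as follows. This is our own illustrative sketch, not library code: the value of N, the bindings, and the flattened 2D indexing are assumptions, since WGSL storage buffers are one-dimensional.

// A sketch of porting the CUDA MatAdd kernel to WGSL, held in a JS string.
const matAddShader = `
const N : u32 = 256u;

@group(0) @binding(0) var<storage, read> A : array<f32>;
@group(0) @binding(1) var<storage, read> B : array<f32>;
@group(0) @binding(2) var<storage, read_write> C : array<f32>;

// 16x16 threads per workgroup, mirroring dim3 threadsPerBlock(16, 16)
@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // global_invocation_id plays the role of blockIdx * blockDim + threadIdx
  if (gid.x < N && gid.y < N) {
    // Storage buffers are 1D, so the 2D index is flattened manually
    C[gid.y * N + gid.x] = A[gid.y * N + gid.x] + B[gid.y * N + gid.x];
  }
}
`;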

Thread, Block and Grid

GPU threads are not quite the same as what we normally understand as threads: they all execute the same instructions, just on different data (SIMD). In the kernel function, each thread locates the data it is responsible for by its ID.

Logic View

The image below is from http://on-demand.gputechconf.com/gtc/2010/presentations/S12312-DirectCompute-Pre-Conference-Tutorial.pdf. It shows only the hierarchical relationship between grids and thread groups, and is not limited to DirectCompute.

The GPU programming model:

  • A 3-dimensional thread grid is assigned via dispatch(x, y, z)
  • The grid contains many thread groups (called Work Groups, Thread Groups, Thread Blocks, or local workgroups, depending on the API), each of which contains many threads. Thread groups are also 3-dimensional, and their size is generally specified in the Shader by numthreads(x, y, z)
  • Our Shader program eventually runs on every thread. Each thread can obtain its own 3-dimensional coordinates within its thread group, as well as its thread group's 3-dimensional coordinates within the whole grid, and use them to map to different parts of the data

[Figure: Grid, Block and Thread]

In CUDA, the number of Blocks and the number of threads per Block are specified as follows.

dim3 threadsPerBlock(16, 16); // Specify the thread-group size
dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C); // Launch the kernel

A Compute Shader uses the following syntax instead (see https://www.w3.org/TR/WGSL/#entry-point-attributes):

@compute @workgroup_size(8,4,1)
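
On the host side, the counterpart of CUDA's numBlocks is the workgroup count passed to dispatch. Below is a small sketch (the function and variable names are ours) pairing with the @workgroup_size(8, 4, 1) declaration above for an N x N problem:

// Our own sketch: compute how many 8x4x1 workgroups cover an N x N domain,
// mirroring the numBlocks calculation in CUDA.
function dispatchForDomain(pass: GPUComputePassEncoder, N: number): void {
  const workgroupSize = { x: 8, y: 4, z: 1 }; // must match @workgroup_size(8, 4, 1)
  // Round up so the grid fully covers the domain even when N is not a multiple
  const groupsX = Math.ceil(N / workgroupSize.x);
  const groupsY = Math.ceil(N / workgroupSize.y);
  pass.dispatchWorkgroups(groupsX, groupsY, 1);
}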

Hardware View

The correspondence between grids, thread groups and threads is also reflected in the hardware implementation of the GPU.

A GPU contains many SMs (Streaming Multiprocessors), and each SM contains many cores. The following diagram shows the correspondence in CUDA's implementation.

[Figure: software & hardware]

The image below is from http://www.adms-conf.org/2019-presentations/ADMS19_nvidia_keynote.pdf.

[Figure: GPU execution model]

Thread Variables

Now that we understand the hierarchy of grids, thread groups and threads: when executing the Shader program, each thread needs to know its own coordinates within its thread group, and its thread group's coordinates within the entire grid. The following figure, from https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/sm5-attributes-numthreads, shows the logic for calculating these coordinates.

[Figure: attributes numthreads]
parameter            | data type | remarks
---------------------|-----------|---------
numWorkGroups        | ivec3     | The number of thread workgroups passed to dispatch
workGroupSize        | ivec3     | The number of threads per thread group, declared by numthreads in the Shader
workGroupID          | ivec3     | The index of the current thread workgroup, ranging from (0, 0, 0) to (numWorkGroups.x - 1, numWorkGroups.y - 1, numWorkGroups.z - 1)
localInvocationID    | ivec3     | The index of the current thread within its own thread group, ranging from (0, 0, 0) to (workGroupSize.x - 1, workGroupSize.y - 1, workGroupSize.z - 1)
globalInvocationID   | ivec3     | The index of the current thread in the global thread grid, calculated as workGroupID * workGroupSize + localInvocationID
localInvocationIndex | int       | The flattened one-dimensional index of the current thread within its thread group, calculated as localInvocationID.z * workGroupSize.x * workGroupSize.y + localInvocationID.y * workGroupSize.x + localInvocationID.x
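
To make the last two formulas concrete, here is a small CPU-side sketch (the types and function names are ours) that computes them:

// Our own sketch of the two index formulas from the table above.
type IVec3 = { x: number; y: number; z: number };

// globalInvocationID = workGroupID * workGroupSize + localInvocationID
function globalInvocationID(workGroupID: IVec3, workGroupSize: IVec3, localInvocationID: IVec3): IVec3 {
  return {
    x: workGroupID.x * workGroupSize.x + localInvocationID.x,
    y: workGroupID.y * workGroupSize.y + localInvocationID.y,
    z: workGroupID.z * workGroupSize.z + localInvocationID.z,
  };
}

// The flattened 1D index of a thread within its own thread group
function localInvocationIndex(localInvocationID: IVec3, workGroupSize: IVec3): number {
  return (
    localInvocationID.z * workGroupSize.x * workGroupSize.y +
    localInvocationID.y * workGroupSize.x +
    localInvocationID.x
  );
}

// Example: with workGroupSize (8, 4, 1), the thread at local ID (3, 2, 0)
// has localInvocationIndex 2 * 8 + 3 = 19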

Shared Memory and Synchronization

In some computing tasks, each thread not only needs to process the part of the data it is responsible for, but may also need to read and modify data being processed by other threads. This requires shared memory and synchronization.

[Figure: shared memory]

Reference: https://zhuanlan.zhihu.com/p/128996252

When a variable is declared as shared, it is stored in a specific location that is visible to all compute shader invocations in the same local workgroup. When one invocation writes to a shared variable, the change eventually becomes visible to all invocations in that workgroup. Access to shared variables is usually much faster than access to images or shader storage buffers (i.e. main memory), because shaders treat shared memory as local and can keep copies of it on the device. Therefore, if a shader needs to access the same memory many times, prefer copying that memory into a shared variable and operating on it there.

Since shared memory is involved, synchronization points must be set up.

An execution barrier can be triggered with the barrier() function. When a compute shader invocation reaches the barrier, it stops and waits until all invocations in the same local workgroup have also reached it, and only then executes the code that follows.

For example, our implementation of the Reduce-sum example uses both of these:

  • shared memory
  • workgroupBarrier

// A workgroup-shared array, visible to all threads in the same workgroup
// ("shared" itself is a reserved word in WGSL, so another name is used)
var<workgroup> sharedData : array<f32, 128>;
// Wait until every thread in the workgroup reaches this point
workgroupBarrier();
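
Putting the pieces together, a complete workgroup-level Reduce-sum kernel might look like the sketch below. This is our own illustrative version, not the library's actual source; the bindings and names are assumptions, and the WGSL is held in a string as discussed earlier:

// An illustrative Reduce-sum Compute Shader held in a JS string.
const reduceShader = `
@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partialSums : array<f32>;

// Workgroup-shared storage for one workgroup's slice of the input
var<workgroup> sharedData : array<f32, 128>;

@compute @workgroup_size(128)
fn main(
  @builtin(global_invocation_id) gid : vec3<u32>,
  @builtin(local_invocation_id) lid : vec3<u32>,
  @builtin(workgroup_id) wid : vec3<u32>,
) {
  // Each thread copies one element from global memory into shared memory
  sharedData[lid.x] = input[gid.x];
  workgroupBarrier(); // All writes must land before anyone reads

  // Tree reduction: halve the number of active threads at each step
  for (var stride = 64u; stride > 0u; stride = stride / 2u) {
    if (lid.x < stride) {
      sharedData[lid.x] = sharedData[lid.x] + sharedData[lid.x + stride];
    }
    workgroupBarrier(); // Synchronize before the next step
  }

  // Thread 0 writes this workgroup's partial sum; a second pass (or the CPU)
  // then sums the partial results
  if (lid.x == 0u) {
    partialSums[wid.x] = sharedData[0];
  }
}
`;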