GPU Particles
GPU particles are the most direct application of compute shaders and the first door into GPU physics simulation. In fact, in my experience the harder part of GPU particles is the rendering rather than the simulation, because the simulation is just simple unconstrained Newtonian mechanics. For rendering you can use the PointSize mechanism supported by MSL and GLSL, or a billboard approach based on instancing; and when the particle count gets very large, you can even apply something similar to Forward+ and cull by frustum tiles. In any case, we first need to run a compute shader to simulate the particle positions and the related lifecycle variables. The particle system follows the same singleton architecture as ShadowManager and LightManager: each ParticleRenderer component adds itself to the manager when it is constructed. The manager class is responsible for updating the state of all particles, while the ParticleRenderer maintains the particle configuration and implements the rendering. The main logic is as follows:
void ParticleManager::draw(wgpu::CommandEncoder& commandEncoder) {
    auto passEncoder = commandEncoder.BeginComputePass();
    for (auto& particle : _particles) {
        /* Max number of particles able to be spawned. */
        uint32_t const num_dead_particles = ParticleRenderer::kMaxParticleCount - particle->numAliveParticles();
        /* Number of particles to be emitted. */
        uint32_t const emit_count = std::min(ParticleRenderer::kBatchEmitCount, num_dead_particles);
        _emission(emit_count, particle, passEncoder);
        _simulation(particle, passEncoder);
    }
    passEncoder.End();
}
Atomic Counter
A very special point of GPU particles is that particle initialization is also carried out on the GPU. There is no need to initialize particles on the CPU every frame and upload them, so the CPU and GPU never have to block on each other. To do this, you need the atomic counters described in the previous chapters: each thread atomically obtains its own unique ID and then stores the particle data it constructs:
// Emit particle id.
let id = atomicAdd(&u_readAtomicBuffer.counter, 1u);
var p = TParticle();
p.position = vec4<f32>(pos, 1.0);
p.velocity = vec4<f32>(vel, 0.0);
p.start_age = age;
p.age = age;
p.id = id;
u_readConsumeBuffer[id] = p;
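For context, here is a minimal sketch of how this snippet might sit inside the emission kernel. The u_emitCount uniform and the workgroup size are illustrative assumptions, not the engine's exact code; the CPU passes the emit_count computed in ParticleManager::draw so that extra threads do nothing:

@compute @workgroup_size(256)
fn emission(@builtin(global_invocation_id) gid: vec3<u32>) {
    // Guard: only the first emit_count threads spawn a particle this frame.
    if (gid.x < u_emitCount) {
        // ... the atomicAdd / construct / store code shown above ...
    }
}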
Note that u_readAtomicBuffer and u_readConsumeBuffer are used here, corresponding to u_writeAtomicBuffer and u_writeConsumeBuffer. Double buffering is a very common strategy in simulation: the read buffer holds the initial data of the current step, the write buffer receives the final simulated data of the current step, and the two are exchanged every frame. The exchange operation is very simple:
void ParticleRenderer::update(float deltaTime) {
    setTimeStep(deltaTime * ParticleManager::getSingleton().timeStepFactor());
    _write = 1 - _write;
    _read = 1 - _read;
    // todo
    _mesh->setInstanceCount(_numAliveParticles);
    _mesh->setVertexBufferBinding(*_appendConsumeBuffer[_read]);
    _generateRandomValues();
}
Since _read and _write are just 0 and 1, a simple subtraction performs the swap. It is precisely to support this kind of buffer exchange that support for function objects was added when ShaderData was extended earlier:
_atomicBuffer[0] = std::make_unique<Buffer>(device, sizeof(uint32_t),
                                            wgpu::BufferUsage::Storage | wgpu::BufferUsage::CopySrc);
_atomicBuffer[1] = std::make_unique<Buffer>(device, sizeof(uint32_t),
                                            wgpu::BufferUsage::Storage | wgpu::BufferUsage::CopySrc);
shaderData.setBufferFunctor(_readAtomicBufferProp, [this]() -> Buffer {
    return *_atomicBuffer[_read];
});
shaderData.setBufferFunctor(_writeAtomicBufferProp, [this]() -> Buffer {
    return *_atomicBuffer[_write];
});
GPU Simulation
During simulation, u_readAtomicBuffer is decremented by one and one particle's data is popped from u_readConsumeBuffer; then u_writeAtomicBuffer is incremented by one and the result is pushed to u_writeConsumeBuffer.
fn popParticle(index: u32) -> TParticle {
    atomicSub(&u_readAtomicBuffer.counter, 1u);
    return u_readConsumeBuffer[index];
}

fn pushParticle(p: TParticle) {
    let index = atomicAdd(&u_writeAtomicBuffer.counter, 1u);
    u_writeConsumeBuffer[index] = p;
}
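Putting the two helpers together, the per-particle update really is just unconstrained Newtonian mechanics. Below is a minimal sketch of what a simulation kernel can look like; u_simulationData with its aliveCount, gravity, and timeStep fields is an illustrative layout, not the engine's exact interface:

@compute @workgroup_size(256)
fn simulation(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x >= u_simulationData.aliveCount) {
        return;
    }
    var p = popParticle(gid.x);
    // Forward Euler integration: accumulate gravity, then advance the position.
    p.velocity = p.velocity + u_simulationData.gravity * u_simulationData.timeStep;
    p.position = p.position + p.velocity * u_simulationData.timeStep;
    p.age = p.age - u_simulationData.timeStep;
    // Dead particles are simply not pushed, so the write counter ends up
    // holding the number of alive particles for the next frame.
    if (p.age > 0.0) {
        pushParticle(p);
    }
}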
note
The specific simulation process will not be covered in detail here, because a particle system is ultimately an art-oriented system that must be extended and specialized according to artistic needs; this is why tools such as Unity and Houdini use visual scripting to iteratively tune particle effects. At present the engine implements only a very simple particle effect, whose purpose is to exercise the basic logic of GPU particle simulation; more details will be added later and extended in a way similar to visual scripting.
Render
PointSize (not supported by WGSL)
When I first tried this out in DigitalVox4, particles were rendered using the PointSize method supported by MSL. With this method you select PrimitiveTypePoint when rendering and set the point size in the vertex shader:
struct VertexOut {
    float4 position [[position]];
    float pointSize [[point_size]];
    float3 color;
    float decay;
};
The rendering pipeline then automatically rasterizes a quad of the given size for the fragment shader. However, this method is not supported in HLSL, so WGSL cannot support the feature either. For related discussion, see WebGPU Issue #1190.
Instancing
Since MSL has no geometry shaders, WGSL cannot render particles with a geometry shader either, so the only remaining option is GPU instancing. Instancing was already used in Forward+ to visualize light source positions for easy debugging; here we render the particles in the same way. The core of the rendering lies in generating the billboard matrix:
var<private> pos : array<vec2<f32>, 4> = array<vec2<f32>, 4>(
    vec2<f32>(-1.0, 1.0), vec2<f32>(-1.0, -1.0), vec2<f32>(1.0, 1.0), vec2<f32>(1.0, -1.0)
);

// Generate a billboarded model-view matrix.
var bbModelViewMatrix: mat4x4<f32> = mat4x4<f32>(1.0, 0.0, 0.0, 0.0,
                                                 0.0, 1.0, 0.0, 0.0,
                                                 0.0, 0.0, 1.0, 0.0,
                                                 0.0, 0.0, 0.0, 1.0);
bbModelViewMatrix[3] = vec4<f32>(in.position.xyz, 1.0);
bbModelViewMatrix = u_cameraData.u_viewMat * bbModelViewMatrix;
bbModelViewMatrix[0][0] = 1.0;
bbModelViewMatrix[0][1] = 0.0;
bbModelViewMatrix[0][2] = 0.0;
bbModelViewMatrix[1][0] = 0.0;
bbModelViewMatrix[1][1] = 1.0;
bbModelViewMatrix[1][2] = 0.0;
bbModelViewMatrix[2][0] = 0.0;
bbModelViewMatrix[2][1] = 0.0;
bbModelViewMatrix[2][2] = 1.0;
out.position = u_cameraData.u_projMat * bbModelViewMatrix * vec4<f32>(worldPos, 1.0);
To compute bbModelViewMatrix, the particle's coordinates are first transformed into view space, and then all rotation information is removed so that the particle always faces the camera. The matrix is then applied to the four vertices of the billboard to obtain the final projected coordinates.
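Note that worldPos in the last line of the snippet is not defined there: it comes from expanding the current corner of the unit quad. A sketch of that step, where the u_particleSize uniform and the in.vertexIndex input (fed by the vertex_index builtin) are illustrative names:

// Pick this vertex's corner of the unit quad and scale it to the particle size;
// the billboard matrix then orients the resulting quad toward the camera.
let corner = pos[in.vertexIndex];
let worldPos = vec3<f32>(corner * u_particleSize, 0.0);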
Here each particle is rendered as a small quad facing the camera, so 4 vertices need to be drawn; but since the particle count can reach the tens of thousands or even millions, instancing is used to draw them all in a single call. Because GPU particles are computed by compute shaders, wgpu::BufferUsage::Storage must be set. There are then two ways to feed the particle data to the vertex shader: wgpu::BufferUsage::Uniform and wgpu::BufferUsage::Vertex. The former is used like any uniform buffer, but the binding size cannot exceed 65536 bytes (the upper bound of uint16_t, which is WebGPU's default maxUniformBufferBindingSize), capping the particle count far below what we want. To support larger numbers, the better choice is wgpu::BufferUsage::Vertex, binding the data through RenderPassEncoder::SetVertexBuffer. For this we also need to describe the vertex buffer layout:
std::vector<wgpu::VertexAttribute> vertexAttributes(3);
vertexAttributes[0].format = wgpu::VertexFormat::Float32x4;
vertexAttributes[0].offset = 0;
vertexAttributes[0].shaderLocation = 0;
vertexAttributes[1].format = wgpu::VertexFormat::Float32x4;
vertexAttributes[1].offset = sizeof(Vector4F);
vertexAttributes[1].shaderLocation = 1;
vertexAttributes[2].format = wgpu::VertexFormat::Float32x4;
vertexAttributes[2].offset = 2 * sizeof(Vector4F);
vertexAttributes[2].shaderLocation = 2;
_mesh->setVertexLayouts(vertexAttributes, sizeof(TParticle), wgpu::VertexStepMode::Instance);
tip
wgpu::VertexStepMode::Instance is the key here: the vertex buffer is advanced once per instance. Generally speaking, wgpu::VertexStepMode::Vertex is the more common mode, which advances vertex by vertex.
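On the WGSL side, the three per-instance attributes can then be declared as ordinary vertex inputs. A sketch matching the layout above; the packing of the third attribute follows the TParticle fields, but the names are illustrative:

struct VertexIn {
    @builtin(vertex_index) vertexIndex : u32,  // selects one of the 4 quad corners
    @location(0) position : vec4<f32>,         // per-instance TParticle.position
    @location(1) velocity : vec4<f32>,         // per-instance TParticle.velocity
    @location(2) age_info : vec4<f32>,         // per-instance start_age, age, id, ...
}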
Advanced Topics
At present, the following advanced features have not been implemented in the engine. If you are interested, you can refer to the implementation in GPUParticles11.
Physical Collision
The biggest drawback of GPU particles compared to CPU particles is that they cannot easily interact with work done on the CPU, the most important of which is physics. To work around this, a depth map can be used: first render the scene's depth, then compare the particle's depth against the scene depth at the same pixel to perform collision detection.
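A sketch of the idea inside the simulation kernel, with all names (u_screenSize, u_sceneDepth, the camera uniforms) illustrative: project the particle into screen space, fetch the pre-rendered scene depth at that pixel, and react when the particle would pass behind the surface:

// Project the particle into NDC, then into pixel coordinates.
let clip = u_cameraData.u_projMat * u_cameraData.u_viewMat * p.position;
let ndc = clip.xyz / clip.w;
let pixel = vec2<i32>((ndc.xy * vec2<f32>(0.5, -0.5) + 0.5) * u_screenSize);
// Compare against the scene depth; a real implementation would reconstruct
// the surface normal from the depth map to reflect the velocity properly.
let sceneDepth = textureLoad(u_sceneDepth, pixel, 0);
if (ndc.z > sceneDepth) {
    p.velocity = -p.velocity * 0.5;  // crude damped bounce
}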
Particle Culling
When the particle count is too large, even instancing cannot directly render that many particles. Since particles stack on top of each other, the number actually visible is limited. Therefore, similar to the tile/cluster-based method used for light culling, we can record the number of particles associated with each screen tile and then render with an alpha derived from that count, as sketched below.
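A sketch of the counting half of this approach, reusing the pixel coordinate from the collision sketch above; the tile size and buffer names are illustrative, and u_tileCounters would be a storage buffer of atomic<u32>:

// Map the particle's pixel position to a screen tile and bump that tile's
// counter; the render pass later derives a per-tile alpha from these counts.
const TILE_SIZE = 16u;
let tile = vec2<u32>(pixel) / TILE_SIZE;
let tileIndex = tile.y * u_numTilesX + tile.x;
atomicAdd(&u_tileCounters[tileIndex], 1u);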