Particles on PS3
Posted in Programming, PS3 on July 5th, 2010 by Chris
I’ve recently started looking into programming for the Cell processor (or as everyone else knows it, the PS3). I installed Yellow Dog Linux on my PS3 back in April, which required me to not upgrade to firmware version 3.21. After that, I proceeded to distract myself with my other coding work and not touch YDL for a couple of months.
Last week, I decided to have a go at the Cell, but didn’t exactly know what to do. I eventually decided to write a particle system, since it would play to the strengths of the Cell processor. I started with the Linux/GLX port of NeHe Lesson 19 because I was not very familiar with setting up a GL application in X. In addition, the code already had a “Struct of Arrays” implementation, which was what I was going to end up writing anyway. In hindsight, the particle code in that port wasn’t exactly… working to begin with, which set me back a couple of times.
After some massaging, I eventually got the program to run on my PS3. The performance wasn’t exactly top-notch, but it worked. The code for the PPU-only implementation (with a couple premature SPU files) can be found here:
http://bejitt.com/proj/ps3/particles/cell_particles_ppu_only.zip
After the PPU-only implementation was done, I moved on to working SPUs into my program. One of the first issues I encountered was that I couldn’t embed the 64-bit SPU program inside my 32-bit PPU code, which has to be 32-bit to link with libGL. I looked into dumping the binary into a header file, as detailed in this dynamic code article from Insomniac, but that only seemed to provide a way to load SPU code after the SPU was already running (if I’m missing something there, please comment). I ended up keeping the SPU program as a separate binary and loading the image from the file system.
The next problem was how to efficiently juggle loading new particles while working on existing particles. I eventually came up with a double-buffered solution for both input and output (the IBM docs helped a tad on this). The process would go as follows:
Start transfer in of current batch
While we have more batches
    Start transfer in of next batch
    Wait on transfer in of current batch
    Work on current batch
    Start transfer out of current batch
    Make the next batch the current batch
Wait on transfer in of current batch
Work on current batch
Start transfer out of current batch
In code, this ended up looking like the following:
int batch_number = SPU_BATCH_COUNT;
uint32_t idx = 0;
uint32_t offset = base_offset;

// Runtime changes every frame
mfc_get(&runtime, (uint32_t)ppe_runtime, sizeof(Runtime), idx, 0, 0);
// don't wait here, the first wait on the loop will wait for this

// request current
DMAGetParticles(idx, offset, idx);

while(--batch_number) {
    // request next
    DMAGetParticles(idx^1, offset + SPU_BATCH_SIZE, idx^1);

    // wait for current
    DMAWaitAll(1<<idx);

    // run current
    RunParticles(idx);

    // push current back to ppu
    DMAPutParticles(idx, offset, idx);

    // switch to next
    idx ^= 1;
    offset += SPU_BATCH_SIZE;
}

// wait current
DMAWaitAll(1<<idx);

// run current
RunParticles(idx);

// push current back to ppu
DMAPutParticles(idx, offset, idx);
There’s one caveat with this implementation: each GET into a buffer has to be ordered after the PUT that is still sending that same buffer back out. To ensure all of the GET transfers would happen after the PUT transfers, I used mfc_getb (the barrier form of mfc_get) as the first DMA inside DMAGetParticles, so it and every later command in that tag group wait for the commands already queued on the tag. I’m not 100% certain that is the best way to approach it, but it makes sense and seems to work so far (*explosion in the distance*).
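The schedule above can be tried out away from the PS3. What follows is only a minimal host-side sketch, not the code from the project: memcpy stands in for the asynchronous DMA, so the wait is a no-op and the GET/PUT ordering problem cannot occur here. The names dma_get, dma_put, dma_wait, run_batch and process_all_batches, and the batch count of 6, are all hypothetical. It only shows that the loop structure visits every batch exactly once and never prefetches past the last one.

```c
/* Host-side sketch of the double-buffered schedule; memcpy stands in for DMA.
   All names and sizes here are hypothetical, not taken from the project. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BATCH_COUNT 6   /* assumed number of batches */
#define BATCH_SIZE  32  /* floats per batch */

static float main_memory[BATCH_COUNT * BATCH_SIZE]; /* stands in for PPU-side memory */
static float local_store[2][BATCH_SIZE];            /* the two SPU-side buffers */

/* On the SPU this would queue a GET and return immediately. */
static void dma_get(uint32_t buf, uint32_t offset) {
    assert(buf < 2 && offset + BATCH_SIZE <= BATCH_COUNT * BATCH_SIZE);
    memcpy(local_store[buf], &main_memory[offset], sizeof local_store[buf]);
}

/* On the SPU this would queue a PUT and return immediately. */
static void dma_put(uint32_t buf, uint32_t offset) {
    assert(buf < 2 && offset + BATCH_SIZE <= BATCH_COUNT * BATCH_SIZE);
    memcpy(&main_memory[offset], local_store[buf], sizeof local_store[buf]);
}

/* memcpy is synchronous, so there is nothing to wait for on the host. */
static void dma_wait(uint32_t tag_mask) { (void)tag_mask; }

/* Placeholder work: add 1 to every value so a skipped or repeated batch shows up. */
static void run_batch(uint32_t buf) {
    for (int i = 0; i < BATCH_SIZE; ++i) local_store[buf][i] += 1.0f;
}

static void process_all_batches(void) {
    int batch_number = BATCH_COUNT;
    uint32_t idx = 0;
    uint32_t offset = 0;

    dma_get(idx, offset);                       /* request current */
    while (--batch_number) {
        dma_get(idx ^ 1, offset + BATCH_SIZE);  /* request next */
        dma_wait(1u << idx);                    /* wait for current */
        run_batch(idx);                         /* run current */
        dma_put(idx, offset);                   /* push current back */
        idx ^= 1;                               /* switch to next */
        offset += BATCH_SIZE;
    }
    dma_wait(1u << idx);                        /* last batch has no successor */
    run_batch(idx);
    dma_put(idx, offset);
}
```

Filling main_memory with known values, calling process_all_batches() once and checking that every element grew by exactly 1 is a quick sanity check on the loop bounds.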
The last thing that I put my attention on was working through the elements in each batch that is transferred in. This was pretty straightforward to implement; the data was arranged such that I had an array of 32 floats for each component (x, y, z, speed, etc.). Using vector SIMD operations, which handle four floats at a time, the data could be acted on in just 8 iterations.
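To make the layout concrete, here is a hedged sketch of what one batch could look like in memory. The field names are the ones that appear in the post's code, but the struct name ParticleBatch, the macro names and the 16-byte alignment are assumptions, and the real struct may well carry more fields (colours, for instance). The alignment is there because SPU vector loads and the larger DMA transfers both work on 16-byte quantities.

```c
/* Sketch of a Struct-of-Arrays batch; names and alignment are assumptions. */
#include <assert.h>
#include <stddef.h>

#define BATCH_PARTICLES  32  /* particles per batch, as stated in the post */
#define SIMD_WIDTH       4   /* floats in one 128-bit vector register */
#define BATCH_ITERATIONS (BATCH_PARTICLES / SIMD_WIDTH)

typedef struct {
    _Alignas(16) float xPos[BATCH_PARTICLES];
    _Alignas(16) float yPos[BATCH_PARTICLES];
    _Alignas(16) float zPos[BATCH_PARTICLES];
    _Alignas(16) float xSpeed[BATCH_PARTICLES];
    _Alignas(16) float ySpeed[BATCH_PARTICLES];
    _Alignas(16) float zSpeed[BATCH_PARTICLES];
    _Alignas(16) float xGrav[BATCH_PARTICLES];
    _Alignas(16) float yGrav[BATCH_PARTICLES];
    _Alignas(16) float zGrav[BATCH_PARTICLES];
    _Alignas(16) float life[BATCH_PARTICLES];
    _Alignas(16) float fade[BATCH_PARTICLES];
} ParticleBatch;

/* 32 floats handled 4 at a time gives the 8 iterations mentioned above. */
static_assert(BATCH_ITERATIONS == 8, "expected 8 vector iterations per batch");

/* Keeping the size a multiple of 16 bytes lets a batch move as whole quadwords. */
static_assert(sizeof(ParticleBatch) % 16 == 0, "batch size must be a multiple of 16");
```

With these eleven components a batch comes to 11 x 32 x 4 = 1408 bytes, so two buffers of this shape take up very little of the SPU's local store.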
Here is the main chunk of code used for running the particles:
// load ptrs
vec_float4* restrict p_xPos = (vec_float4*) particles->xPos;
vec_float4* restrict p_yPos = (vec_float4*) particles->yPos;
vec_float4* restrict p_zPos = (vec_float4*) particles->zPos;
vec_float4* restrict p_xSpeed = (vec_float4*) particles->xSpeed;
vec_float4* restrict p_ySpeed = (vec_float4*) particles->ySpeed;
vec_float4* restrict p_zSpeed = (vec_float4*) particles->zSpeed;
vec_float4* restrict p_xGrav = (vec_float4*) particles->xGrav;
vec_float4* restrict p_yGrav = (vec_float4*) particles->yGrav;
vec_float4* restrict p_zGrav = (vec_float4*) particles->zGrav;
vec_float4* restrict p_life = (vec_float4*) particles->life;
vec_float4* restrict p_fade = (vec_float4*) particles->fade;

for(int iter=0; iter<SPU_BATCH_ITERATIONS; ++iter) {
    // load data
    const vec_float4 xPos = p_xPos[ iter ];
    const vec_float4 yPos = p_yPos[ iter ];
    const vec_float4 zPos = p_zPos[ iter ];
    const vec_float4 xSpeed = p_xSpeed[ iter ];
    const vec_float4 ySpeed = p_ySpeed[ iter ];
    const vec_float4 zSpeed = p_zSpeed[ iter ];
    const vec_float4 xGrav = p_xGrav[ iter ];
    const vec_float4 yGrav = p_yGrav[ iter ];
    const vec_float4 zGrav = p_zGrav[ iter ];
    const vec_float4 life = p_life[ iter ];
    const vec_float4 fade = p_fade[ iter ];

    // operate on data
    const vec_float4 n_xPos = spu_madd(xSpeed, slowdown, xPos);
    const vec_float4 n_yPos = spu_madd(ySpeed, slowdown, yPos);
    const vec_float4 n_zPos = spu_madd(zSpeed, slowdown, zPos);
    const vec_float4 n_xSpeed = spu_add(xGrav, xSpeed);
    const vec_float4 n_ySpeed = spu_add(yGrav, ySpeed);
    const vec_float4 n_zSpeed = spu_add(zGrav, zSpeed);
    const vec_float4 n_life = spu_sub(life, fade);

    // store data
    p_xPos[ iter ] = n_xPos;
    p_yPos[ iter ] = n_yPos;
    p_zPos[ iter ] = n_zPos;
    p_xSpeed[ iter ] = n_xSpeed;
    p_ySpeed[ iter ] = n_ySpeed;
    p_zSpeed[ iter ] = n_zSpeed;
    p_life[ iter ] = n_life;
}
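For checking the SPU output, a plain scalar version of the same update can be handy. The sketch below is an illustration rather than part of the project: update_scalar and its parameters are hypothetical names, it handles a single axis (call it once each for x, y and z), and it treats slowdown as one scalar multiplier. As in the vector code, the position is advanced with the speed from before gravity is applied.

```c
/* Scalar reference for one axis of the particle update; names are hypothetical. */
#include <assert.h>
#include <stddef.h>

static void update_scalar(float *pos, float *speed, const float *grav,
                          float *life, const float *fade,
                          float slowdown, size_t count)
{
    assert(pos && speed && grav && life && fade);
    for (size_t i = 0; i < count; ++i) {
        pos[i]   = speed[i] * slowdown + pos[i]; /* mirrors spu_madd(speed, slowdown, pos) */
        speed[i] = grav[i] + speed[i];           /* mirrors spu_add(grav, speed) */
        life[i]  = life[i] - fade[i];            /* mirrors spu_sub(life, fade) */
    }
}
```

When comparing against real SPU results, a small tolerance is safer than exact equality, since SPU single-precision arithmetic does not round the way a typical host CPU does.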
After working with the code for a while, I’m sticking with the assumption that the OpenGL rasterizing is being done on the PPU itself; the rendering crawls, even for ‘small’ numbers of triangles. Because it was so slow, I ended up hard-coding the number of drawn particles to 512 but kept the SPUs working on all 12288 particles. Perhaps not the best way to display the particles, but currently the best option I have.
I’ve uploaded the complete code to this location:
http://bejitt.com/proj/ps3/particles/cell_particles.zip
One downside to doing this development work on my personal PS3 is that I can’t connect to PSN, so I’ll be missing out on some good releases for a while. Also, the lack of a profile for trophies is rough.