Tricking the Compiler into Giving You Size Information

Posted in Programming on May 3rd, 2011 by Chris

Figured out an interesting way to get the compiler to tell you the size of a class/structure.. at compile time!

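#include <cstddef>

// The initializer's type deliberately mismatches, so the error names both sizes.
#define SHOW_SIZE(type) \
template<std::size_t _size> class DbgSize##type {}; \
struct t_b##type {}; \
DbgSize##type<sizeof(t_b##type)> b##type = DbgSize##type<sizeof(type)>();

struct SomeStruct { char data[55]; };

SHOW_SIZE(SomeStruct);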

I get the following in Visual Studio:
error C2440: 'initializing' : cannot convert from 'DbgSizeSomeStruct<_size>' to 'DbgSizeSomeStruct<_size>'

See what it did there? We “trick” the compiler into telling us the template arguments, which we have set to the size of the structure.
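To see why the initialization fails, here is a minimal sketch of the same mechanism without the token pasting. The names Empty and same_dbg_size are hypothetical helpers made up for this illustration, not part of the macro. Two instantiations of the template are the same type only when the two sizes match, so the conversion error is the compiler reporting that they differ.

```cpp
#include <cstddef>
#include <type_traits>

// Same shape as the class the macro generates, minus the ##type suffix.
template <std::size_t Size>
class DbgSize {};

struct SomeStruct { char data[55]; };
struct Empty {};  // hypothetical stand-in for the macro's t_b##type helper

// True when DbgSize<sizeof(A)> and DbgSize<sizeof(B)> are one and the same
// type, which is the only case where the macro's initialization compiles.
template <typename A, typename B>
constexpr bool same_dbg_size() {
    return std::is_same<DbgSize<sizeof(A)>, DbgSize<sizeof(B)>>::value;
}

// SomeStruct and the empty helper have different sizes, hence the error.
static_assert(!same_dbg_size<SomeStruct, Empty>(), "sizes differ");
```

One consequence of this design: a type that happens to be the same size as the empty helper struct would compile without complaint, so the trick would tell you nothing for it.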

As a note, this does work in SN, though the error is a bit.. cleaner?

Much credit should be given to my coworker, Daniel Silver, who came up with the initial idea for this.

Particles on PS3

Posted in Programming, PS3 on July 5th, 2010 by Chris

I’ve recently started looking into programming for the Cell processor (or as everyone else knows it, the PS3). I installed Yellowdog Linux on my PS3 back in April, which required me to not upgrade to firmware version 3.21. After that, I proceeded to distract myself with my other coding work and not touch YDL for a couple months.

Last week, I decided to have a go at the Cell, but didn’t exactly know what to do. I eventually decided to write a particle system, since it would play to the strengths of the Cell processor. I started with the Linux/GLX port of NeHe Lesson 19 since I was not very familiar with setting up a GL application in X. The code also already had a “Struct of Arrays” layout, which was what I was going to end up writing anyway. In hindsight, the particle code in that port wasn’t exactly… working to begin with, which set me back a couple of times.

After some massaging, I eventually got the program to run on my PS3. The performance wasn’t exactly top-notch, but it worked. The code for the PPU-only implementation (with a couple premature SPU files) can be found here:

After the PPU-only implementation was done, I moved on to working SPUs into my program. One of the first issues I encountered was that I couldn’t embed the 64-bit SPU program inside my 32-bit PPU code, which has to be 32-bit to link with libGL. I looked into dumping the binary into a header file as detailed in this dynamic code article from Insomniac, but that only seemed to provide a way to load SPU code after the SPU was already running (if I’m missing something there, please comment). I ended up keeping the SPU program as a binary and loading the image from the file system.

The next problem was how to efficiently juggle loading new particles while working on existing particles. I eventually came up with a double-buffered solution for both input and output (the IBM docs helped a tad on this). The process would go as follows:

Start transfer in of current batch
While we have more batches
    Start transfer in of next batch
    Wait on transfer in of current batch
    Work on current batch
    Start transfer out of current batch
    Make the next batch the current batch
Wait on transfer in of current batch
Work on current batch
Start transfer out of current batch

In code, this ended up looking like the following:

int batch_number = SPU_BATCH_COUNT;
uint32_t idx = 0;
uint32_t offset = base_offset;

// Runtime changes every frame
mfc_get(&runtime, (uint32_t)ppe_runtime, sizeof(Runtime), idx, 0, 0);
// don't wait here, the first wait on the loop will wait for this

// request current
DMAGetParticles(idx, offset, idx);

while(--batch_number) {
  // request next
  DMAGetParticles(idx^1, offset + SPU_BATCH_SIZE, idx^1);

  // wait for current
  DMAWaitAll(1<<idx);

  // run current
  RunParticles(idx);

  // push current back to ppu
  DMAPutParticles(idx, offset, idx);

  // switch to next
  idx ^= 1;
  offset += SPU_BATCH_SIZE;
}

// wait current
DMAWaitAll(1<<idx);

// run current
RunParticles(idx);

// push current back to ppu
DMAPutParticles(idx, offset, idx);
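If you want to poke at the control flow of the loop above without a PS3, here is a rough, portable sketch of it. Everything in it is an assumption for illustration: memcpy stands in for the DMA transfers (so there is nothing to wait on), kBatchSize is a made-up value, and get_batch, put_batch, run_batch and process_all are hypothetical names rather than the functions from my code.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kBatchSize = 32;  // hypothetical floats per batch

struct Batch { float value[kBatchSize]; };

// Stand-ins for the DMA get/put. memcpy finishes immediately, so the
// "wait" steps of the real loop have no counterpart here.
inline void get_batch(Batch& local, const float* main_mem, std::size_t offset) {
    std::memcpy(local.value, main_mem + offset, sizeof(local.value));
}
inline void put_batch(const Batch& local, float* main_mem, std::size_t offset) {
    std::memcpy(main_mem + offset, local.value, sizeof(local.value));
}

// Placeholder work: bump every value so each batch's single pass is visible.
inline void run_batch(Batch& local) {
    for (float& v : local.value) v += 1.0f;
}

// Walks batch_count contiguous batches using two local buffers.
inline void process_all(float* main_mem, std::size_t batch_count) {
    if (batch_count == 0) return;
    Batch buffers[2];
    unsigned idx = 0;
    std::size_t offset = 0;

    get_batch(buffers[idx], main_mem, offset);                       // request current
    while (--batch_count) {
        get_batch(buffers[idx ^ 1], main_mem, offset + kBatchSize);  // request next
        run_batch(buffers[idx]);                                     // run current
        put_batch(buffers[idx], main_mem, offset);                   // push current back
        idx ^= 1;                                                    // switch to next
        offset += kBatchSize;
    }
    run_batch(buffers[idx]);                                         // last batch
    put_batch(buffers[idx], main_mem, offset);
}
```

Filling a std::vector<float> with three batches of 1.0f and calling process_all(mem.data(), 3) should leave every element at 2.0f, which is a quick way to check that no batch is skipped or processed twice.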

There’s one little caveat with this implementation: each GET has to be ordered after the previous PUT from the same buffer, otherwise new particles could land in the buffer before the old ones have been written back. To ensure all of the GET transfers would happen after the PUT transfers, I used mfc_getb as the first DMA inside DMAGetParticles. I’m not 100% certain that is the best way to approach it, but it makes sense and seems to work so far (*explosion in the distance*).

The last thing that I put my attention on was working through the elements in each batch that is transferred in. This was pretty straightforward to implement; the data was arranged such that I had an array of 32 floats for each component (x,y,z,speed,etc). Using vector SIMD operations, the data could be acted on in just 8 iterations.

Here is the main chunk of code used for running the particles:

// load ptrs
vec_float4* restrict p_xPos = (vec_float4*) particles->xPos;
vec_float4* restrict p_yPos = (vec_float4*) particles->yPos;
vec_float4* restrict p_zPos = (vec_float4*) particles->zPos;
vec_float4* restrict p_xSpeed = (vec_float4*) particles->xSpeed;
vec_float4* restrict p_ySpeed = (vec_float4*) particles->ySpeed;
vec_float4* restrict p_zSpeed = (vec_float4*) particles->zSpeed;
vec_float4* restrict p_xGrav = (vec_float4*) particles->xGrav;
vec_float4* restrict p_yGrav = (vec_float4*) particles->yGrav;
vec_float4* restrict p_zGrav = (vec_float4*) particles->zGrav;
vec_float4* restrict p_life = (vec_float4*) particles->life;
vec_float4* restrict p_fade = (vec_float4*) particles->fade;

for(int iter=0; iter<SPU_BATCH_ITERATIONS; ++iter) {
  // load data
  const vec_float4 xPos = p_xPos[ iter ];
  const vec_float4 yPos = p_yPos[ iter ];
  const vec_float4 zPos = p_zPos[ iter ];
  const vec_float4 xSpeed = p_xSpeed[ iter ];
  const vec_float4 ySpeed = p_ySpeed[ iter ];
  const vec_float4 zSpeed = p_zSpeed[ iter ];
  const vec_float4 xGrav = p_xGrav[ iter ];
  const vec_float4 yGrav = p_yGrav[ iter ];
  const vec_float4 zGrav = p_zGrav[ iter ];
  const vec_float4 life = p_life[ iter ];
  const vec_float4 fade = p_fade[ iter ];

  // operate on data
  const vec_float4 n_xPos = spu_madd(xSpeed, slowdown, xPos);
  const vec_float4 n_yPos = spu_madd(ySpeed, slowdown, yPos);
  const vec_float4 n_zPos = spu_madd(zSpeed, slowdown, zPos);
  const vec_float4 n_xSpeed = spu_add(xGrav, xSpeed);
  const vec_float4 n_ySpeed = spu_add(yGrav, ySpeed);
  const vec_float4 n_zSpeed = spu_add(zGrav, zSpeed);
  const vec_float4 n_life = spu_sub(life, fade);

  // store data
  p_xPos[ iter ] = n_xPos;
  p_yPos[ iter ] = n_yPos;
  p_zPos[ iter ] = n_zPos;
  p_xSpeed[ iter ] = n_xSpeed;
  p_ySpeed[ iter ] = n_ySpeed;
  p_zSpeed[ iter ] = n_zSpeed;
  p_life[ iter ] = n_life;
}
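For comparison, here is roughly what the SIMD loop above does per particle, written as plain scalar code that runs anywhere. This is only a sketch under a few assumptions: the ParticleBatch struct is a guess at the layout based on the field names above, run_particles_scalar is a hypothetical name, and slowdown is treated as a single float that the SPU version would hold splatted across all four lanes.

```cpp
#include <cstddef>

constexpr std::size_t kParticlesPerBatch = 32;  // floats per component, as in the post
constexpr std::size_t kLanes = 4;               // floats in one vec_float4
static_assert(kParticlesPerBatch / kLanes == 8, "8 SIMD iterations per batch");

// Assumed "Struct of Arrays" layout: one array per component.
struct ParticleBatch {
    float xPos[kParticlesPerBatch], yPos[kParticlesPerBatch], zPos[kParticlesPerBatch];
    float xSpeed[kParticlesPerBatch], ySpeed[kParticlesPerBatch], zSpeed[kParticlesPerBatch];
    float xGrav[kParticlesPerBatch], yGrav[kParticlesPerBatch], zGrav[kParticlesPerBatch];
    float life[kParticlesPerBatch], fade[kParticlesPerBatch];
};

// Scalar reference: one particle per iteration instead of four.
inline void run_particles_scalar(ParticleBatch& p, float slowdown) {
    for (std::size_t i = 0; i < kParticlesPerBatch; ++i) {
        // spu_madd(a, b, c) computes a * b + c in each lane.
        // Positions use the old speed, so they are updated first.
        p.xPos[i] = p.xSpeed[i] * slowdown + p.xPos[i];
        p.yPos[i] = p.ySpeed[i] * slowdown + p.yPos[i];
        p.zPos[i] = p.zSpeed[i] * slowdown + p.zPos[i];
        // spu_add: gravity accumulates into speed.
        p.xSpeed[i] = p.xGrav[i] + p.xSpeed[i];
        p.ySpeed[i] = p.yGrav[i] + p.ySpeed[i];
        p.zSpeed[i] = p.zGrav[i] + p.zSpeed[i];
        // spu_sub: life drains by the fade amount.
        p.life[i] = p.life[i] - p.fade[i];
    }
}
```

The scalar loop takes 32 iterations per batch where the vector version takes 8, since each vec_float4 operation covers four particles at once.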

After working with the code for a while, I’m sticking to an assumption that the OpenGL rasterizing is being done on the PPU itself; the rendering crawls, even for ‘small’ numbers of triangles. Because it was so slow, I ended up hard coding the number of drawn particles to 512 but kept the SPUs working on the total 12288 particles. Perhaps not the best way to display all of the particles, but currently the best option I have.

I’ve uploaded the complete code to this location:

One downside to doing this development work on my personal PS3 is that I can’t connect to PSN, so I’ll be missing out on some good releases for a while. Also, the lack of profile for trophies is rough.

Old Personal Project ‘SeeD’

Posted in Programming on May 28th, 2010 by Chris

I’ve been looking through all of my old personal code so I can start filling up my “Projects” page, which has been an interesting trip down memory lane.

One thing in particular is sticking out: my massive “DO EVERYTHING” codebase, dubbed “SeeD”. This code grew from my experiences with Quake3Arena modding plus a lot of NeHe/Flipcode reading. In fact, this was more than likely the codebase that I used to teach myself ‘advanced’ graphics concepts; specifically: loading mesh files (md3), loading image files (tga, jpg), and tree-based spatial hierarchies (octree, kd-tree, quadtree).

Now, when I say “loading files”, I mean I looked up the file format specifications and I wrote the loaders from scratch. At the time, I was still pretty new to most everything, so this was a pretty big deal for me. At the same time, I wasn’t about to work through loading a JPG file from scratch, so I did cop out on that one.

As for the tree-based spatial hierarchies, I worked the concepts out from articles I read online (particularly from Flipcode). Thanks to those same sources, I worked these out in the best way any overachieving CS student would: with massive amounts of inheritance, templates and STL lists. While it runs well for what little it does, I can’t help but want to change it to use some ‘better’ approaches. I’ve been looking for a reason to try a ‘flat’ tree 😉

I’ve uploaded the source I’m playing with here, sans a couple libraries/assets:

If you really want to play with it, I’ll see about sending you the project in full.