# How to get a C program to use 100% of the CPU

## Problem

Let me set the atmosphere for you since this is a fascinating question. I work at The National Museum of Computing, and we recently got a Cray Y-MP EL supercomputer from 1992 up and running, and we’re excited to see how fast it can go!

We concluded that the simplest approach to do this would be to develop a simple C program that would calculate prime numbers and display the time it took, then run it on a fast current desktop PC and compare the results.

To count prime numbers, we immediately devised the following code:

``````#include <stdio.h>
#include <time.h>

void main() {
clock_t start, end;
double runTime;
start = clock();
int i, num = 1, primes = 0;

while (num <= 1000) {
i = 2;
while (i <= num) {
if(num % i == 0)
break;
i++;
}
if (i == num)
primes++;

system("clear");
printf("%d prime numbers calculated\n",primes);
num++;
}

end = clock();
runTime = (end - start) / (double) CLOCKS_PER_SEC;
printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
}
``````

Which on our dual core laptop running Ubuntu (The Cray runs UNICOS), worked perfectly, getting 100% CPU usage and taking about 10 minutes or so. When I got home I decided to try it on my hex-core modern gaming PC, and this is where we get our first issues.

I originally converted the code to operate on Windows because that was the operating system on the gaming PC, but I was disappointed to see that the process was only receiving approximately 15% of the CPU’s power. That had to be Windows being Windows, so I booted into an Ubuntu Live CD, hoping that Ubuntu would allow the process to operate to its full extent, as it did on my laptop earlier.

However, I only received 5% usage! So, how can I get the software to operate at 100% CPU utilization on my gaming system, whether it’s running Windows 7 or live Linux? Another plus, although not required, is if the final product can be packaged as a single.exe file that can be readily distributed and run on Windows machines.

Thanks a lot!

P.S. Of course, this software didn’t function with Crays 8 specialized processors, but that’s an another story… Give us a shout if you know anything about optimizing code for Cray supercomputers from the 1990s!

## Solution #1

If you want 100 percent CPU performance, you’ll need more than one core. Multiple threads are required to accomplish this.

Here’s an OpenMP-based parallel version:

On my machine, it takes more than a second.

``````#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
double start, end;
double runTime;
start = omp_get_wtime();
int num = 1,primes = 0;

int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
for (num = 1; num <= limit; num++) {
int i = 2;
while(i <= num) {
if(num % i == 0)
break;
i++;
}
if(i == num)
primes++;
//      printf("%d prime numbers calculated\n",primes);
}

end = omp_get_wtime();
runTime = end - start;
printf("This machine calculated all %d prime numbers under %d in %g seconds\n",primes,limit,runTime);

return 0;
}
``````

Output:

## Solution #2

On a multi-core machine, you’re running a single process, thus it only uses one core.

Because you’re only trying to peg the CPU, the solution is simple: if you have N cores, execute your program N times (in parallel, of course).

Here’s some code that parallelizes your program NUM OF CORES times. Because it’s POSIXy code (it uses fork), you should be able to execute it on Linux. If my understanding of the Cray is true, this code should be easier to port than the OpenMP code in the other solution.

``````#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define NUM_OF_CORES 8
#define MAX_PRIME 100000

void do_primes()
{
unsigned long i, num, primes = 0;
for (num = 1; num <= MAX_PRIME; ++num) {
for (i = 2; (i <= num) && (num % i != 0); ++i);
if (i == num)
++primes;
}
printf("Calculated %d primes.\n", primes);
}

int main(int argc, char ** argv)
{
time_t start, end;
time_t run_time;
unsigned long i;
pid_t pids[NUM_OF_CORES];

/* start of test */
start = time(NULL);
for (i = 0; i < NUM_OF_CORES; ++i) {
if (!(pids[i] = fork())) {
do_primes();
exit(0);
}
if (pids[i] < 0) {
perror("Fork");
exit(1);
}
}
for (i = 0; i < NUM_OF_CORES; ++i) {
waitpid(pids[i], NULL, 0);
}
end = time(NULL);
run_time = (end - start);
printf("This machine calculated all prime numbers under %d %d times "
"in %d seconds\n", MAX_PRIME, NUM_OF_CORES, run_time);
return 0;
}
``````
``````\$ ./primes
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
This machine calculated all prime numbers under 100000 8 times in 8 seconds
``````

## Solution #3

Your prime number generation algorithm is inefficient. On a Pentium II-350, primegen creates the 50847534 primes up to 1000000000 in less than 8 seconds.

You could tackle an embarrassingly parallel problem, such as computing the Mandelbrot set or using genetic programming to paint the Mona Lisa in numerous threads, to easily consume all CPUs (processes).

Another option is to transfer a benchmark application designed for the Cray supercomputer to a current PC.

## Solution #4

Your code consumes one core at 100%, which explains why you’re getting 15% on a hex core processor. 100/6 = 16.67%, which using a moving average with process scheduling (your process would be running under normal priority) could easily be reported as 15%.

As a result, to use 100% CPU, you’d have to employ all of your CPU’s cores – launch 6 simultaneous execution code lines for a hex core CPU, and have this scale up to whichever many processors your Cray machine has:)

## Solution #5

Also, pay attention to how you’re using the CPU. A CPU can perform a variety of tasks, and while many of them will be described as “loading the CPU 100%,” they may each consume 100% of the CPU in different ways. To put it another way, comparing the performance of two distinct CPUs, especially two different CPU architectures, is extremely difficult. When performing activity A, one CPU may be favored over another, whereas when performing task B, the opposite may be true (since the two CPUs may have different resources internally and may execute code very differently).

This is why software is equally as crucial as hardware in making computers work at their best. This is also true in the case of “supercomputers.”

Instructions per second could be one metric for CPU performance, however not all instructions are created equal on different CPU architectures. Another metric to consider is cache IO performance, although not all cache infrastructure is created equal. The number of instructions per watt used might then be used as a metric, as power delivery and dissipation is often a limiting factor when developing a cluster computer.

As a result, the first question you should ask yourself is: Which performance parameter is most essential to you? What exactly are you looking to quantify? If you want to check which system gets the most frames per second out of Quake 4, the answer is simple: your gaming rig, because the Cray can’t run that software at all.

Cheers, Steen