# How to get 100% CPU usage from a C program

## Problem

This is quite an interesting question so let me set the scene. I work at The National Museum of Computing, and we have just managed to get a Cray Y-MP EL super computer from 1992 running, and we really want to see how fast it can go!

We concluded that the simplest approach to do this would be to develop a simple C program that would calculate prime numbers and display the time it took, then run it on a fast current desktop PC and compare the results.

To count prime numbers, we immediately devised the following code:

``````#include <stdio.h>
#include <time.h>

void main() {
clock_t start, end;
double runTime;
start = clock();
int i, num = 1, primes = 0;

while (num <= 1000) {
i = 2;
while (i <= num) {
if(num % i == 0)
break;
i++;
}
if (i == num)
primes++;

system("clear");
printf("%d prime numbers calculated\n",primes);
num++;
}

end = clock();
runTime = (end - start) / (double) CLOCKS_PER_SEC;
printf("This machine calculated all %d prime numbers under 1000 in %g seconds\n", primes, runTime);
}
``````

Which, on our dual-core laptop running Ubuntu (the Cray runs UNICOS), performed well, consuming 100% of the CPU and taking around 10 minutes. I decided to check it out on my hex-core modern gaming PC when I got home, and this is where we run into our first problems.

I originally converted the code to operate on Windows because that was the operating system on the gaming PC, but I was disappointed to see that the process was only receiving approximately 15% of the CPU’s power. That had to be Windows being Windows, so I booted into an Ubuntu Live CD, hoping that Ubuntu would allow the process to operate to its full extent, as it did on my laptop earlier.

However, I only received 5% usage! So, how can I get the software to operate at 100% CPU utilization on my gaming system, whether it’s running Windows 7 or live Linux? Another plus, although not required, is if the final product can be packaged as a single.exe file that can be readily distributed and run on Windows machines.

Thanks a lot!

P.S. Of course, this program didn’t work with Crays 8 specialist processors, but that’s a different story… Give us a shout if you know anything about optimizing code for Cray supercomputers from the 1990s!

## Solution #1

If you want 100 percent CPU performance, you’ll need more than one core. Multiple threads are required to accomplish this.

Here’s a parallel version using OpenMP:

To make it take more than 1 second on my machine, I have to set the limit to 1000000.

``````#include <stdio.h>
#include <time.h>
#include <omp.h>

int main() {
double start, end;
double runTime;
start = omp_get_wtime();
int num = 1,primes = 0;

int limit = 1000000;

#pragma omp parallel for schedule(dynamic) reduction(+ : primes)
for (num = 1; num <= limit; num++) {
int i = 2;
while(i <= num) {
if(num % i == 0)
break;
i++;
}
if(i == num)
primes++;
//      printf("%d prime numbers calculated\n",primes);
}

end = omp_get_wtime();
runTime = end - start;
printf("This machine calculated all %d prime numbers under %d in %g seconds\n",primes,limit,runTime);

return 0;
}
``````

Output:

## Solution #2

On a multi-core machine, you’re running a single process, thus it only uses one core.

Because you’re only trying to peg the CPU, the solution is simple: if you have N cores, execute your program N times (in parallel, of course).

Here’s some code that parallelizes your program NUM OF CORES times. Because it’s POSIXy code (it uses fork), you should be able to execute it on Linux. If my understanding of the Cray is true, this code should be easier to port than the OpenMP code in the other solution.

``````#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

#define NUM_OF_CORES 8
#define MAX_PRIME 100000

void do_primes()
{
unsigned long i, num, primes = 0;
for (num = 1; num <= MAX_PRIME; ++num) {
for (i = 2; (i <= num) && (num % i != 0); ++i);
if (i == num)
++primes;
}
printf("Calculated %d primes.\n", primes);
}

int main(int argc, char ** argv)
{
time_t start, end;
time_t run_time;
unsigned long i;
pid_t pids[NUM_OF_CORES];

/* start of test */
start = time(NULL);
for (i = 0; i < NUM_OF_CORES; ++i) {
if (!(pids[i] = fork())) {
do_primes();
exit(0);
}
if (pids[i] < 0) {
perror("Fork");
exit(1);
}
}
for (i = 0; i < NUM_OF_CORES; ++i) {
waitpid(pids[i], NULL, 0);
}
end = time(NULL);
run_time = (end - start);
printf("This machine calculated all prime numbers under %d %d times "
"in %d seconds\n", MAX_PRIME, NUM_OF_CORES, run_time);
return 0;
}
``````
``````\$ ./primes
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
Calculated 9592 primes.
This machine calculated all prime numbers under 100000 8 times in 8 seconds
``````

## Solution #3

Your prime number generation algorithm is inefficient. On a Pentium II-350, primegen creates the 50847534 primes up to 1000000000 in less than 8 seconds.

You could tackle an embarrassingly parallel problem, such as computing the Mandelbrot set or using genetic programming to paint the Mona Lisa in numerous threads, to easily consume all CPUs (processes).

Another option is to transfer a benchmark application designed for the Cray supercomputer to a current PC.

## Solution #4

Your code consumes one core at 100%, which explains why you’re getting 15% on a hex core processor. 100/6 = 16.67%, which could easily be reported as 15% using a moving average with process scheduling (your process would be running at regular priority).

As a result, to use 100% CPU, you’d have to employ all of your CPU’s cores – launch 6 simultaneous execution code lines for a hex core CPU, and have this scale up to whichever many processors your Cray machine has:)

## Solution #5

Also, pay attention to how you’re using the CPU. A CPU can perform a variety of tasks, and while many of them will be described as “loading the CPU 100%,” they may each consume 100% of the CPU in different ways. To put it another way, comparing the performance of two distinct CPUs, especially two different CPU architectures, is extremely difficult. When performing activity A, one CPU may be favored over another, whereas when performing task B, the opposite may be true (since the two CPUs may have different resources internally and may execute code very differently).

This is why software is equally as crucial as hardware in making computers work at their best. This is also true in the case of “supercomputers.”

Instructions per second could be one metric for CPU performance, however not all instructions are created equal on different CPU architectures. Another metric to consider is cache IO performance, although not all cache infrastructure is created equal. The number of instructions per watt used might then be used as a metric, as power delivery and dissipation is often a limiting factor when developing a cluster computer.

As a result, the first question you should ask yourself is: Which performance parameter is most essential to you? What exactly are you looking to quantify? If you want to see which machine gets the most FPS out of Quake 4, the answer is easy; your gaming rig will, as the Cray can’t run that program at all 😉

Cheers, Steen