Hybrid Computing Dramatically Accelerates Software Performance

By Michael Calise

The good old days, when you could sit back and wait for a new generation of CPUs to effectively double the performance of your existing software, are gone. CPU clock frequencies have reached a practical limit of around 3 GHz, and your software needs a major rewrite to take advantage of today’s many-core CPUs. The good news is that you can turn this effort to your advantage by making use of software accelerators. These can execute certain computations an order of magnitude faster, and can be combined with traditional CPUs to form hybrid computing systems that deliver 10-100 times the performance of a non-accelerated solution.

Hybrid Computing

For decades, we have enjoyed constant performance improvements to our existing software as CPU clock speeds doubled roughly every 18 months. However, higher clock frequencies also mean higher power consumption, and processor heat dissipation has put an end to free performance improvements with each new generation of semiconductor manufacturing processes.

Figure 1. CPU Clock Frequency 1993-2006. Source: Tom’s Hardware Guide, “The Mother of all CPU Charts”

CPU manufacturers now use the improved processes to fit more and more CPU cores onto each device, producing generations of many-core processors, each running at about the same clock frequency as their predecessors. This means that if you are looking for faster execution, you must actively seek alternative ways to speed up your software.

Accelerators and Hybrid Computing Systems

Purely sequential software will see no further performance improvements; it must be rewritten to exploit any available parallelism and distribute its computation among several processing resources – for example, multiple CPU cores. Since the effort of making software run in parallel must be made anyway, the playing field is open for new types of processing resources to complement the traditional CPU architecture.

Today, cost-efficient accelerators are available from several vendors as commercial off-the-shelf (COTS) products. Accelerators are specialised processors that speed up specific processing tasks. By combining accelerators with traditional CPUs, you create a hybrid computing system in which each processing resource executes the parts of the software for which it delivers the highest performance, resulting in greatly increased system throughput and reduced execution time.

The main contenders in the COTS accelerator market are Field Programmable Gate Arrays (FPGAs) and Graphics Processors (GPUs). Unlike earlier accelerator technologies, such as vector processors or custom ASICs, both of these devices have strong mass markets outside the high-performance computing field:

  • Field Programmable Gate Arrays are integral parts of many advanced electronic devices and have continuously evolved to cater to the needs of developers of electronic products.
  • Graphics Processors are used in most personal computers with their development driven by the gaming market. When used for accelerated computing, these devices are referred to as general purpose GPUs (GPGPUs).

These two types of alternative devices allow cost-efficient high-performance computing accelerators to be built, with the added benefit of ongoing, well-funded core technology development.

Processing Sweet Spots in Hybrid Computing

The performance gains available from hybrid computing come from leveraging the strengths of many-core CPUs, FPGAs and GPGPUs in tight integration. Compared to the other two:

  • Many-core CPUs are an order of magnitude faster for command and control operations.
  • FPGAs are an order of magnitude faster for non-floating point operations. They provide very good performance and power efficiency in processing integer, character, binary or fixed point data. They can also deliver competitive performance for complex floating-point operations such as exponentials or logarithms.
  • GPGPUs are an order of magnitude faster on floating point operations.

Benefits of Hybrid Computing

Getting more compute performance from a server has a number of benefits. Most importantly, it reduces the number of servers needed for a specific workload. This means that Hybrid Computing delivers:

  • Reduced system cost
  • Greener computing and lower power cost, through lower power consumption and less energy spent on cooling
  • Smaller system footprint

Alternatively, Hybrid Computing can deliver capacity for larger workloads within given budgets.

FPGA devices typically use less power than CPUs and GPGPUs, and the remainder of this article focuses on hybrid computing using FPGAs. However, for power savings, the most important factor is the attained acceleration, so it is important to target the accelerator that provides the best acceleration for your algorithm. But that’s a whole different article.

FPGA-based Accelerators

Field Programmable Gate Arrays (FPGAs) are versatile, configurable electronic components that are used in accelerators to implement tailored computational logic, specific to the application being executed. In a hybrid computer system, the FPGA acts as a configurable co-processor to a CPU, allowing applications to take advantage of application-specific hardware. The FPGA can be reconfigured any number of times for new applications, making it possible to use the hybrid computer system for a wide range of tasks.

FPGA Basics

FPGAs are used to create digital logic circuitry – including tailor-made co-processors for a specific algorithm used as part of a hybrid computing system. Even though the P in FPGA stands for Programmable, it is important to realise that this is not programmable in the software sense. For an FPGA, it simply means that a circuit design can be loaded or re-loaded.

The bulk of an FPGA is made up of a large number of identical configurable logic blocks (CLBs). Each CLB consists of a number of slices, which in turn consist of a number of logic cells (typically two or four) that can be configured to perform basic logic functions (such as AND, OR, NOT) on digital signals using a lookup table (LUT). The CLBs are interconnected through programmable switch matrices (PSMs) to form units that perform more complex functionality.

Figure 2. Organisation of an FPGA

FPGAs also provide internal RAM memory banks and specialised multiply-accumulate circuits (MACs) for efficient multiplication and addition. FPGAs may have other blocks with specialised functions for purposes such as digital signal processing.
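Conceptually, the LUT described above is just a small truth table: the logic cell’s input bits form an index, and the stored bit at that index is the output. A minimal software sketch of the idea (Python, purely illustrative – a real LUT is hardware, not code):

```python
# A 4-input LUT is a 16-entry truth table: the 4 input bits form an
# index, and the bit stored at that index is the output.
def make_lut(func, n_inputs=4):
    """Precompute the truth table for an n-input boolean function."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> i) & 1 for i in range(n_inputs)]
        table.append(func(*bits))
    return table

def lut_eval(table, *bits):
    """Look up the output for the given input bits."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]

# "Configure" the LUT to implement (a AND b) OR ((NOT c) AND d)
table = make_lut(lambda a, b, c, d: (a & b) | ((1 - c) & d))
print(lut_eval(table, 1, 1, 0, 0))  # → 1
```

Reconfiguring the FPGA amounts to loading different tables (and different interconnections) into the same physical cells.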

Configuring the FPGA

Before the FPGA can perform any function at all, the configurable elements must be set up. This is done by “programming” the FPGA: uploading a binary configuration file known as a bitstream or bit file. This file holds the configuration settings for each CLB, PSM, MAC, I/O block and every other configurable element of the FPGA.

Synthesis, Place and Route

To create a bitstream that correctly configures the FPGA to perform a particular function, a circuit design for the desired function is needed. This design is created in a hardware description language (HDL) such as VHDL or Verilog, or, at a slightly higher level of abstraction, using any of numerous Electronic System Level (ESL) design tools. Alternatively, the software offered by Mitrionics can produce the circuit design in VHDL from programs written in the high-level Mitrion-C programming language.

The HDL circuit design is run through a process known as synthesis, in which the HDL is translated into interconnected basic logic functions. The output from synthesis is fed to place-and-route software, which maps the logic functions to the configurable logic on the FPGA and calculates how the logic should be interconnected to meet constraints such as the timing of electrical signals between logic cells.
 

FPGAs Compared to Microprocessors

FPGAs represent a computer architecture ideally suited to exploiting parallel execution. They are able to run some algorithms 10-100 times faster than a microprocessor. It is, however, important to note that:

  • An FPGA is not a better microprocessor, nor a microprocessor substitute.
  • FPGAs will not accelerate existing applications without significant porting efforts, and not all algorithms are suitable for FPGA acceleration.

Compared to microprocessors, FPGAs have a couple of notable performance disadvantages:

  • The maximum clock frequency for FPGAs is a few hundred MHz, while microprocessors run at a few GHz.
  • The FPGA’s configurability comes at the cost of a large overhead, leaving 10-100 times less logic available to the user compared to a microprocessor of similar size.

The reasons FPGAs are still able to outperform microprocessors for some algorithms are:

  • The FPGA is used to implement a circuit specialised for a specific task.
  • All the logic on the FPGA can be used to perform that task.
  • FPGAs deliver vast amounts of parallelism, with a large number of processing functions operating simultaneously.
  • FPGAs offer huge memory bandwidth through configurable logic, block RAM and local memories. See figure below.
 
FPGA memory hierarchy:

  • Configurable logic: TB/sec bandwidth, 10’s of KB capacity
  • Internal RAM: 100’s of GB/sec bandwidth, 100’s of KB capacity
  • Local memories: ~10 GB/sec bandwidth, 10-1000’s of MB capacity
  • System memory: 100-1000’s of MB/sec bandwidth, terabytes of capacity

Microprocessor memory hierarchy:

  • On-chip registers: 100’s of GB/sec bandwidth, 1-10’s of KB capacity
  • L1-L3 cache: 100’s of GB/sec bandwidth, 1-10’s of KB capacity
  • System memory: 100-1000’s of MB/sec bandwidth, terabytes of capacity


Figure 3. Memory hierarchies of FPGA and microprocessor systems.

Power Consumption

FPGAs have a distinct power-consumption advantage over microprocessors. A large FPGA running at full speed typically uses less than 25 W, compared to 100 W or more for a modern microprocessor. Even at moderate acceleration, an FPGA system will use dramatically less electrical power than an all-microprocessor system with the same performance.
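Using the power figures above, the energy argument is easy to quantify. The 5x speedup in this sketch is an assumed example for illustration, not a measured result:

```python
# Back-of-the-envelope energy comparison using the power figures above.
# The 5x speedup is an assumed example, not a measured result.
cpu_power_w = 100.0    # modern microprocessor running flat out
fpga_power_w = 25.0    # large FPGA at full speed
speedup = 5.0          # assumed acceleration factor

cpu_energy = cpu_power_w * 1.0                 # normalised job time = 1
fpga_energy = fpga_power_w * (1.0 / speedup)   # same job, 5x faster

print(cpu_energy / fpga_energy)  # energy ratio: 20x less energy
```

Because energy is power multiplied by time, a lower-power device that also finishes sooner wins on both factors at once.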

FPGA-based Hybrid Computer Systems

FPGA-based hybrid computing systems are typically built using FPGA modules. These FPGA modules are connected to a hosting computer system through a variety of system bus architectures, depending on the system and module. In addition to control and interface logic, the modules hold one or more FPGAs for user algorithms and usually have local memory attached directly to the FPGAs. Some modules allow the FPGA to read and/or write directly from or to the host system’s memory.


Applications control the FPGA modules through an Application Programming Interface (API) to perform tasks such as:

  • Upload bitstreams to the FPGA.
  • Start and stop algorithm execution on the FPGA.
  • Transfer data to and from the FPGA and/or locally attached memories. 
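From the host application’s side, that control flow might look something like the following sketch. Every class and function name here is hypothetical; the sketch mirrors the three tasks above rather than any specific vendor’s API:

```python
# Hypothetical host-side control flow for an FPGA accelerator module.
# None of these names come from a real vendor API; the mock class only
# mirrors the three API tasks listed above.
class FpgaModule:
    def __init__(self):
        self.bitstream = None
        self.memory = {}

    def upload_bitstream(self, path):
        # In a real system this would write the bit file to the device.
        self.bitstream = path

    def write(self, buffer_name, data):
        # Transfer data to the FPGA's locally attached memory.
        self.memory[buffer_name] = data

    def read(self, buffer_name):
        # Transfer results back to host memory.
        return self.memory[buffer_name]

    def run(self):
        # Start algorithm execution; a real call would block or poll.
        assert self.bitstream is not None, "configure the FPGA first"

fpga = FpgaModule()
fpga.upload_bitstream("search_algorithm.bit")   # 1. configure the FPGA
fpga.write("input", [3, 1, 4, 1, 5])            # 2. send input data
fpga.run()                                      # 3. execute the algorithm
print(fpga.read("input"))                       # 4. fetch results
```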

Developing Accelerated Applications

Development of FPGA-accelerated software adds two steps to the software development cycle:

  • Deciding on the partitioning of the application between the host CPU and the FPGA.
  • The design and implementation of the tailored co-processor circuit that will run the accelerated algorithm on the FPGA.

However, the Mitrion software acceleration platform addresses the latter through a virtual processor: an adaptable parallel processor capable of efficiently running software on the FPGA.

With such a platform, the steps of deploying FPGA-acceleration to an application are as follows:

  • Identify the computationally intense routines of the application and decide on a partitioning of tasks between CPU and FPGA.
  • Rewrite the routines that are to run on the FPGA in the high-level Mitrion-C programming language.
  • Replace the routines in the original program with calls to the FPGA.
  • Use the Mitrion software development kit to debug the Mitrion-C routines and their interaction with the host program.
  • Use the combined software development kit and FPGA vendor’s tools to generate configuration files for the FPGA.
  • Install and run on target system.

Figure 5. Running accelerated applications with the Mitrion Software Acceleration Platform

Algorithm Development for FPGAs

Hardware Design

As mentioned above, accelerating algorithms on FPGAs requires that a circuit is designed to perform the algorithm on the FPGA. Since FPGAs originated as highly versatile components for implementing electronic hardware, numerous hardware design tools are available to hardware designers specialised in creating circuits for FPGAs. They can choose to develop their design in VHDL, Verilog or using an ESL design tool.

Regardless of the tools chosen, hardware design requires a solid knowledge of electrical engineering and an understanding of FPGA circuit-design subjects such as gates, wires, CLBs, multipliers, signal timing and clocking. From the perspective of software development, there is a large difference between software source code (an instruction stream for a processor) and the gates and wires of the FPGA, making it a challenge to map software algorithms to the FPGA.

However, the Mitrion software acceleration platform allows software developers to benefit from FPGA-based software acceleration without having to deal with the complexities of electronic hardware design. What makes the platform unique is that it introduces a processor as an abstraction layer – a virtual processor – between the FPGA hardware and the software that is to be run. While typical FPGA programming solutions attempt to produce circuit designs directly from a high-level language, the virtual processor executes the software on the FPGA. The benefit is that the processor gives a complete separation of software from hardware, and the software developer is isolated from all aspects of the typically challenging FPGA hardware design. With the virtual processor, all the circuit design is taken care of.

Software Development Cycle and Portability

A major advantage of this platform is that the development of accelerated applications follows a software life cycle. When porting to other FPGA-based hybrid computer systems, the programmer only needs to change the details pertaining to the specifics of the machine organisation. This lets users immediately exploit FPGA evolution with regard to sizes, speeds and on-chip memory.

Figure 6. Software development cycle
 

Putting Accelerators to Work

In hybrid computing systems, FPGA accelerators provide significantly improved performance for applications that process integer, character or bit data. They can also provide exceptional performance for more complex floating-point operations such as exponentials and logarithms.

Genome Informatics

Today, some genome computing tasks take days, weeks or even months. In the future, the amount of data to process will increase while, at the same time, these tasks must be completed in seconds or minutes in order to be used in medical decisions.

The challenge is to shorten genome computing processes by several orders of magnitude, while at the same time keeping system size and power consumption in check.

Genome data is typically encoded using two bits per nucleotide base pair. The virtual processor supports flexible data types that allow highly efficient processing of the nucleotide data. The platform is used to accelerate applications and algorithms such as:

  • NCBI BLAST-N – NCBI BLAST is the most widely used sequence alignment application. Within the framework of the Mitrion-C Open Bio Project, acceleration of the BLAST-N nucleotide search algorithm has achieved up to 60 times the performance of an un-accelerated solution.
  • Phylogenetic trees – Reconstructing evolutionary history is a central problem in genomic research. Markov Chain Monte Carlo simulations for phylogenetic tree research have been accelerated by a factor of 15.
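The two-bit encoding mentioned above is simple to sketch in software. This is an illustrative Python version of the general idea, not Mitrion-C code:

```python
# Pack a nucleotide sequence at two bits per base, as described above.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODE = {code: base for base, code in ENCODE.items()}

def pack(sequence):
    """Return the sequence as one integer, two bits per base."""
    packed = 0
    for base in sequence:
        packed = (packed << 2) | ENCODE[base]
    return packed

def unpack(packed, length):
    """Recover the sequence from its packed form."""
    bases = []
    for i in range(length):
        shift = 2 * (length - 1 - i)
        bases.append(DECODE[(packed >> shift) & 0b11])
    return "".join(bases)

print(unpack(pack("GATTACA"), 7))  # → GATTACA
```

Packing four bases per byte quarters the data volume, and on an FPGA the same representation lets many base comparisons proceed in parallel on wide words.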

Internet Data Processing

Internet search engines, as well as new services such as video-on-demand and Internet telephony, add to the demand for ultra-scale data centres to power these services. It is a challenge to increase the power efficiency of ultra-scale data centres to allow for the introduction of new services and scaling to more users. FPGAs have proven efficient at processing large amounts of text-based data, in applications such as search, filtering and character set conversion.

  • Text search – A simple text search algorithm has been implemented on the Mitrion Virtual Processor that is capable of simultaneously searching for any of 4000 strings in a stream of data corresponding to a 10 Gbit Ethernet link.
  • Grep – The virtual processor has been used to accelerate the standard UNIX grep function, providing FPGA-accelerated regular expression searches. 
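A classic software counterpart to this kind of multi-string search is the Aho-Corasick automaton, which finds every occurrence of every pattern in a single pass over the data stream. The sketch below illustrates the technique in Python; it is not the Mitrion implementation:

```python
# Aho-Corasick multi-pattern search: build a trie with failure links,
# then stream the text through it once, reporting all matches.
from collections import deque

def build_automaton(patterns):
    trie = [{}]     # node -> {character: next node}
    out = [set()]   # node -> patterns ending at this node
    fail = [0]      # node -> longest proper suffix state
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in trie[node]:
                trie[node][ch] = len(trie)
                trie.append({})
                out.append(set())
                fail.append(0)
            node = trie[node][ch]
        out[node].add(pattern)
    # Breadth-first pass to compute failure links
    queue = deque(trie[0].values())
    while queue:
        node = queue.popleft()
        for ch, nxt in trie[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in trie[f]:
                f = fail[f]
            fail[nxt] = trie[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]  # inherit suffix matches
    return trie, fail, out

def search(text, trie, fail, out):
    """Report (position, pattern) for every match in one pass."""
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]:
            node = fail[node]
        node = trie[node].get(ch, 0)
        for pattern in out[node]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits

trie, fail, out = build_automaton(["he", "she", "his", "hers"])
print(sorted(search("ushers", trie, fail, out)))
# → [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The per-character work is constant on average regardless of how many patterns are loaded, which is why the same idea maps so well to streaming hardware.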

Business Process Optimization

Today, many businesses have workflows that require technology beyond standard enterprise solutions. This can involve applications that operate on very large datasets, for example data mining large databases, or that are very computationally intensive, for example performing statistical modelling or process simulations. Some applications, such as market feed processing, need to perform advanced operations on complex data in near real-time. Virtual processor acceleration can be used to expand the availability of high performance computing resources throughout organisations and deliver the performance required for critical workflows within constrained power envelopes and server footprints.  

Finance

Within the finance sector, the virtual processor enables high-performance fixed- and floating-point binary-coded-decimal arithmetic, required by law for some financial applications. The virtual processor also delivers massive performance for the exponentials and logarithms required for calculations such as Black-Scholes option pricing.

  •  Black-Scholes – The virtual processor has been applied to Black-Scholes option pricing, producing 100M results per second on a single FPGA. 
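For reference, the closed-form Black-Scholes price of a European call is dominated by exactly the logarithms and exponentials mentioned above. A plain Python version of the standard textbook formula (not accelerated code):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, maturity):
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) \
         / (volatility * math.sqrt(maturity))
    d2 = d1 - volatility * math.sqrt(maturity)
    return (spot * norm_cdf(d1)
            - strike * math.exp(-rate * maturity) * norm_cdf(d2))

# An at-the-money one-year call: S=K=100, r=5%, sigma=20%
price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(price, 2))  # → 10.45
```

Each evaluation costs a log, an exp, a square root and two error-function calls, which is why transcendental throughput sets the pace for this workload.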

Computational Chemistry

Computational Chemistry algorithms that depend on computing bit-patterns are well suited for the virtual processing approach.

  •  Tanimoto coefficients – Determining the similarity between two molecules is facilitated by calculating molecular fingerprints that represent the composition and structure of the molecules. The fingerprints are binary sequences in which each bit reflects some aspect of the molecule. The similarity can be measured using Tanimoto Coefficients, calculated by counting and comparing bits set in the molecular fingerprints. With the virtual processor, databases of all chemicals known in the world can be searched in seconds. 
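The bit counting at the heart of this computation is easy to sketch. Assuming fingerprints are held as integers, with the fingerprint values below made up purely for illustration:

```python
# Tanimoto coefficient: bits set in both fingerprints, divided by
# bits set in either, as described above.
def tanimoto(fp_a, fp_b):
    both = bin(fp_a & fp_b).count("1")
    either = bin(fp_a | fp_b).count("1")
    return both / either if either else 1.0

# Two hypothetical 16-bit molecular fingerprints
a = 0b1011001110001111
b = 0b1011000110001011
print(tanimoto(a, b))  # → 0.8
```

A population count over an AND and an OR is exactly the kind of wide bit-level operation an FPGA performs in a single cycle across thousands of fingerprints at once.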

Digital Content Creation and Image Processing

The efficient bit-manipulation capability inherent in the virtual processor approach allows the creation of highly efficient accelerated compression algorithms and codecs.

  •  Discrete cosine transform (DCT) – Discrete Cosine Transform is a core part of many audio, video and image compression algorithms, such as JPEG, MPEG and DV. DCTs have been accelerated by an order of magnitude.
  •  Rice coding – Rice coding can be used for lossless compression of audio and video data. Rice coding has been shown to attain a speedup of 7x.
  •  Convolution – In image processing, convolutional filtering is a common operation, for example in edge detection or Gaussian blurring. The Mitrion platform comes with example code that performs efficient image convolution.
  •  Thin plate splines – Thin plate splines are an interpolation method that finds a smooth surface passing through all given points. Used for image alignment and shape matching, they have shown a 10-fold acceleration over non-accelerated solutions. 

Random Number Generation

High quality random numbers are required by many algorithms. This technology can be used to efficiently implement a wide range of pseudo-random number generators.

  •  Mersenne twister – The Mersenne twister algorithm has been implemented to provide 800 million uniform or Gaussian random numbers per second, while at the same time leaving plenty of room for other algorithms on the FPGA.
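Incidentally, Python’s standard random module is also built on the Mersenne twister (MT19937), so the uniform/Gaussian pairing mentioned above can be tried directly on a CPU – albeit nowhere near 800 million numbers per second:

```python
# Python's random.Random is MT19937-based, the same core algorithm
# as the accelerated generator described above.
import random

rng = random.Random(42)  # fixed seed for reproducibility

uniform_samples = [rng.random() for _ in range(100000)]
gaussian_samples = [rng.gauss(0.0, 1.0) for _ in range(100000)]

# Uniform samples lie in [0, 1) and should average close to 0.5
mean = sum(uniform_samples) / len(uniform_samples)
print(min(uniform_samples) >= 0.0, max(uniform_samples) < 1.0)
print(abs(mean - 0.5) < 0.01)
```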

Evaluating Algorithms for FPGA Acceleration

To determine the potential for acceleration, your candidate algorithm must be carefully analysed. Important factors determining the acceleration potential are:

  • Does the application have one (or possibly a few) key routines that dominate computing time?
  • Computational intensity: is there enough work per word of data to offset the overhead to move the data to and from the FPGA module?
  • Is the data organised in a way that allows it to be transferred to and from the FPGA module without too much data marshalling overhead?
  • Do the accelerated parts of the application require floating point arithmetic or transcendental math functions?

In determining the expected performance, it is important to realise that only parts of the application are being accelerated. Some share of the execution time will be spent on the non-accelerated parts of the application running on the host CPU. Also, implementing acceleration introduces additional overhead for data transfers to and from the FPGA, and for data re-ordering and re-formatting. The illustration below shows the actual acceleration obtained if 90% of the execution time can be cut by a factor of 10 using the FPGA, assuming a 5% overhead for copying data to and from the FPGA.

Application speed-ups follow Amdahl’s law. The diagram below illustrates the total speed-up attainable by accelerating a certain percentage of an application’s execution time.

Figure 7. Total application speed-up related to FPGA speed-up.
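The figures above can be checked with Amdahl’s law directly: with 90% of the runtime accelerated 10-fold and a 5% transfer overhead, the remaining time is 0.10 + 0.90/10 + 0.05 = 0.24 of the original, an overall speed-up of roughly 4.2x:

```python
# Amdahl's law, extended with a fixed overhead term (as a fraction of
# the original runtime) for data transfer and marshalling.
def total_speedup(accelerated_fraction, accelerator_speedup, overhead=0.0):
    serial = 1.0 - accelerated_fraction
    accelerated = accelerated_fraction / accelerator_speedup
    return 1.0 / (serial + accelerated + overhead)

# The example from the text: 90% of runtime cut by 10x, 5% copy overhead
print(round(total_speedup(0.9, 10.0, 0.05), 2))  # → 4.17

# Even an infinitely fast accelerator is capped by the serial 10%
print(round(total_speedup(0.9, 1e9, 0.05), 2))   # → 6.67
```

The second line shows why the fraction of the application that can be accelerated matters more than the raw speed of the accelerator itself.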

Conclusion

Today, new commercial off-the-shelf (COTS) accelerators based on GPU and FPGA technologies are available from several vendors. These can be used to build highly efficient systems that significantly reduce power costs and server footprints. For applications that process integer, text and binary data, FPGAs deliver a low-power solution, and with a software acceleration platform, software development for these FPGA accelerators does not require hardware design skills.


Michael Calise is executive vice president and US general manager for Mitrionics. Previously he was president of ClearSpeed, an HPC leader in low-power floating-point accelerators. He has held management positions at Intel, Benchmarq Microelectronics (Texas Instruments), Catalyst Semiconductor, and SOC IP providers, Palmchip and Improv Systems. He received his B.S.E.E. from the University of Buffalo.

For more information, please visit Mitrionics' Web site.
