Friday, April 30, 2010

Domain-Specific Computing

So, I went to the public lecture at UCI today (the one I posted an event for on my Facebook page). These public lectures are by various guest speakers on a wide range of topics; each one is usually something of an overview of a particular field, often ones which the average computer science student has minimal or no exposure to in their classes. As with last time, there were free brownies and coffee/tea.

Note that this post is all from memory. I didn't take notes or anything, so there are likely to be omissions and vagueness compared to the actual lecture.

The topic this time was domain-specific computing. Essentially this refers to the use of some manner of special-purpose hardware to offload work from a general-purpose CPU, either eliminating the CPU entirely or allowing the use of a smaller, cheaper, and more efficient CPU. The ultimate purpose of this is to reduce overall cost and power consumption while increasing overall performance for a given task.

If this doesn't sound revolutionary, it's because it's not. Modular electronics have been around as long as electronics themselves. If you want a real-world example of something built this way, you can look at just about anything. Computers consist of many components (e.g. motherboard, video card), each composed of many further components (GPU, southbridge, northbridge, etc.), usually microchips.

The problem is that these chips are hard to manufacture. The process is expensive, requires skilled engineering, and is largely done manually: electrical engineers design a circuit by hand, then a huge fabrication plant uses very expensive machines to manufacture it, and only then is it available for use in various devices. Because of the expense and manufacturing requirements, most companies cannot produce custom chips this way.

As a result, very common specialized circuits are produced and sold off-the-shelf in large quantities, while less common functions are instead implemented as code running on a general-purpose CPU. After chip design and fabrication, another electrical engineer must analyze everything a particular device must do, determine where it would be beneficial to use specialized off-the-shelf components (which may either be used directly or require some manner of mapping to make the data fit the exact component), then determine which CPU to use for everything else and have the programmers write the code for it.

Done correctly, the resulting device is faster, cheaper to manufacture, and more efficient than simply using a fast CPU to do everything. One of the papers he cited examined different algorithms implemented on a general-purpose CPU, a general-purpose GPU, and a custom FPGA (a chip that can be programmed without a massive fabrication plant, but is less efficient than a standard fabricated chip). For one algorithm (seemingly a highly parallel one), the GPU performed 93x as fast as the CPU, and the FPGA 11x. That makes the GPU about 8x as fast as the FPGA in absolute terms (though presumably you could use an FPGA 8 times as large to match the GPU's total performance). But comparing performance per watt, the GPU was 60x the CPU and the FPGA about 220x: roughly 3.5x the efficiency of the GPU (and I'd wager the FPGA cost less than either of them). For other algorithms, the difference was tens of thousands of times.
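To make the arithmetic concrete, here's a tiny C++ sketch that just recomputes those ratios from the numbers as I remember them; the exact figures are in the paper, so treat these as illustrative:

```cpp
#include <iostream>

int main() {
    // Speedups relative to the CPU baseline, as I remember them from the talk.
    double gpu_speedup  = 93.0;   // GPU ran ~93x faster than the CPU
    double fpga_speedup = 11.0;   // FPGA ran ~11x faster than the CPU

    // Performance per watt, also relative to the CPU baseline.
    double gpu_perf_per_watt  = 60.0;
    double fpga_perf_per_watt = 220.0;

    // Absolute speed: GPU vs. FPGA (~8.5x, quoted as roughly 8x).
    std::cout << gpu_speedup / fpga_speedup << "\n";

    // Efficiency: FPGA vs. GPU (~3.7x, quoted as roughly 3.5x).
    std::cout << fpga_perf_per_watt / gpu_perf_per_watt << "\n";
    return 0;
}
```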

But it's still a whole lot of work. Consequently, this tends to be done for mass-produced electronic devices, while more niche tasks are simply run on a common computer, as it's cheaper to write a program than to engineer an electrical device.

This guy (the one who gave the lecture) is working to develop analysis and automation tools to reduce the cost and difficulty of custom hardware development, to significantly increase the range of things that are viable to produce this way. Ideally, you'd be able to write a specification or algorithm (e.g. a C++ program that describes an entire device's function), then automation would analyze the system, performing these steps (essentially the same ones I summarized above):
  1. Break it down into major functional units
  2. Determine which units can use off-the-shelf components and what the optimal component is
  3. Locate algorithms that would substantially benefit from the production of new components, then design those components for manufacturing as FPGAs, circuit boards, or actual fabricated chips
  4. Choose the optimal CPU and algorithm for the remaining tasks (those not worth implementing in hardware) and write the code. Different tasks run better or worse on different CPU architectures, so this step must find the best CPU for the given algorithm, or rewrite the algorithm in a way that runs better on a preferable (e.g. cheaper) CPU. For example, CPUs that lack branch prediction are cheaper and more energy-efficient but perform poorly on algorithms that branch heavily; depending on the task, however, an alternate algorithm might accomplish it with less branching (see the sketch after this list)
  5. Design an optimal bus architecture for all the different electronics to communicate over
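To illustrate what step 4 means by trading branches for arithmetic, here's a toy C++ example of my own (not from the lecture). Both functions sum the positive elements of an array, but the second replaces the data-dependent branch with a bit mask, so it would fare much better on a cheap CPU without branch prediction:

```cpp
#include <cstdint>
#include <cstdio>

// Branchy version: the comparison inside the loop is data-dependent,
// so a CPU without branch prediction stalls on it constantly.
int32_t sum_positives_branchy(const int32_t* data, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] > 0)
            sum += data[i];
    return sum;
}

// Branch-free version: the sign bit becomes an all-zeros/all-ones mask,
// so the loop body is straight-line arithmetic with no conditional.
// (Right-shifting a negative signed value is technically
// implementation-defined, but it's an arithmetic shift on mainstream compilers.)
int32_t sum_positives_branchless(const int32_t* data, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        int32_t mask = ~(data[i] >> 31);  // all ones if data[i] >= 0, else 0
        sum += data[i] & mask;
    }
    return sum;
}

int main() {
    int32_t data[] = {3, -1, 4, -1, 5, -9, 2, 6};
    printf("%d %d\n", static_cast<int>(sum_positives_branchy(data, 8)),
                      static_cast<int>(sum_positives_branchless(data, 8)));
    return 0;  // both print 20
}
```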
Of course, it's very likely that this ideal is fundamentally impossible - at least until artificial intelligence is able to perform as well as a skilled electrical engineer. However, he hopes to create tools that drastically reduce the amount of human time and cost needed for this process, and the talk was about recent research toward that goal. A few of the automation problems under research by different groups:
  • Identification of algorithms that exactly match existing circuits, as well as partial matches that can be made to work with some amount of data conversion
  • Profiling of different matching circuits to determine which performs optimally on a given problem
  • Determination of which algorithms would perform substantially better if implemented in custom hardware
  • Conversion of procedures (e.g. C++ programs) to electronic schematics for manufacturing (see the sketch after this list)
  • Profiling of different CPU architectures (e.g. x86 and ARM), models (e.g. i5 and i7), and variants (e.g. cache size, with or without optional subunits, etc.), to determine which is optimal for a given device, based on performance and cost
  • Profiling of different bus architectures (e.g. mesh and crossbar) and configurations (e.g. number and size of channels) to determine what is optimal for a device
  • Modularization for energy efficiency, to allow entire circuits to be shut off when they aren't in use (but others are)
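To give a flavor of the C++-to-schematics bullet, here's a toy example of my own (not the input format of any particular tool): a small filter written as ordinary C++ that a synthesis tool could, in principle, turn into parallel hardware instead of sequential CPU instructions.

```cpp
#include <cstdint>
#include <cstdio>

// A 4-tap FIR filter written as plain C++. The fixed trip count and
// constant coefficients are what make it hardware-friendly: a synthesis
// tool can fully unroll the loop into four multipliers feeding an adder
// tree, rather than emitting a sequential instruction stream for a CPU.
int32_t fir4(const int16_t sample[4]) {
    static const int16_t coeff[4] = {1, 3, 3, 1};
    int32_t acc = 0;
    for (int i = 0; i < 4; ++i)
        acc += static_cast<int32_t>(sample[i]) * coeff[i];
    return acc;
}

int main() {
    const int16_t samples[4] = {1, 2, 3, 4};
    printf("%d\n", static_cast<int>(fir4(samples)));  // 1*1 + 2*3 + 3*3 + 4*1 = 20
    return 0;
}
```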
I don't remember much more detail than what I've already said. One last point is that this guy seemed to be a big fan of the multiband radio-frequency interconnect (RF-I) bus for inter-component communication. This works essentially the same way as radio or broadcast TV, where the available bandwidth is split into separate channels, each occupying a different range of carrier frequencies. Since this allows bandwidth to be allocated arbitrarily among many bus channels (and may even be reconfigured in real time), it's very amenable to communication and power optimization.
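As a toy model of that kind of allocation (all the numbers here are made up for illustration), here's a sketch that divides a fixed pool of frequency bands among bus channels in proportion to their current traffic:

```cpp
#include <cstdio>

// Toy model of dividing a fixed pool of carrier bands among bus channels
// in proportion to each channel's share of the traffic, in the spirit of
// the runtime-reconfigurable allocation described above.
int main() {
    const int kTotalBands = 16;                  // carrier bands available
    const double demand[3] = {0.5, 0.3, 0.2};    // traffic share per channel

    int assigned = 0;
    for (int c = 0; c < 3; ++c) {
        // The last channel absorbs rounding error so every band is used.
        int bands = (c == 2) ? kTotalBands - assigned
                             : static_cast<int>(demand[c] * kTotalBands + 0.5);
        assigned += bands;
        printf("channel %d gets %d band(s)\n", c, bands);
    }
    return 0;
}
```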

Actually, it looks like the slides from the lecture are available online, if you want more detail than what I've written here.