Malcolm Mumme's Resume for patent #5,379,444

#5,379,444 "simplified synchronous mesh processor and array" Google Patents

This processor design is nearly the most extreme example of a minimal SIMD processor design. Here, I'll just describe the preferred variant of the design. There are four main parts to the design:
1 The control processor: a "standard" SISD processor sends commands to all other parts.
2 The main processor cell array: a 2D (rectangular) array of very simple processor cells. Each processor cell receives a single-bit input from each neighbor cell on the x axis and on the y axis, for a total of 4 single-bit inputs. All cells in the array also receive the same opcode from the control processor. Each cell also receives an x-enable signal and a y-enable signal from parts 3 and 4 respectively. The processor cells' internals are described later.
3 The x-enable generator: generates a vector of signals "along the x axis", to be distributed to the processor cells.
4 The y-enable generator: similar to the x-enable generator, but on the other axis.
Part 2, the main processor cell array, is where most of the data processing takes place. Consider each processor cell to be associated with a pair of natural numbers n,m. The single bit inputs to this processor cell are from its four nearest neighbors: n-1,m and n+1,m and n,m-1 and finally n,m+1. At the boundaries of the array, some inputs may be absent.
Each processor cell in the array contains (exactly) one single bit of storage. This is the most distinctive feature of this processor design, and accounts for most of its simplicity. The content of the single bit of storage is output from each cell and is provided as the input to its four nearest neighbors. The array is synchronous, and, on each clock cycle, a new value for the bit may be computed by the cell from the neighbors' inputs and from the "old" value of the bit. This computation depends on the opcode received from the control processor, and is conditioned on the x-enable and y-enable signals both being ones.
In parts 3 and 4, each signal in the vector of generated signals is associated with a natural number on the relevant axis. Each signal supplied by the x-enable generator is associated with a natural number n, while each signal from the y-enable generator is associated with a natural number m. Processor cell n,m then receives x-enable n and y-enable m as its enable signals. This arrangement is necessary in such a simple design: two vectors of signals, one per axis, stand in for a full rectangular array of per-cell enable signals. The internal design of the enable signal generators is not specified in the patent, but is assumed to be sufficiently versatile to provide any needed patterns.
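The way the two enable vectors combine can be shown with a few lines of Python. This is only an illustrative sketch: the function name enable_mask and the example vectors are made up here, and the patent does not prescribe any particular patterns.

```python
def enable_mask(x_enable, y_enable):
    """Return mask[n][m], which is 1 only when x_enable[n] and y_enable[m] are both 1."""
    return [[xe & ye for ye in y_enable] for xe in x_enable]


# Hypothetical example: a 4 x 3 array driven by 4 + 3 = 7 enable signals
# instead of 12 separate per-cell signals.
x_enable = [0, 1, 1, 0]
y_enable = [1, 1, 0]
mask = enable_mask(x_enable, y_enable)

for n, row in enumerate(mask):
    print(n, row)
```

Every pattern produced this way is a "product" of a set of x positions and a set of y positions, which is why the enabled cells in the example form a rectangle.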
Given that each cell has only 4 data inputs and the "current" value of its bit with which to compute results, the set of possible instruction sets for the cell is relatively small compared with most other processors. The most powerful possible instruction set would be as follows:

A 32-bit opcode provided by the control processor would designate an arbitrary 5-input binary function, with one opcode bit for each of the 32 possible combinations of the 5 inputs. This can be implemented easily as a 32-1 mux, with the 5 data inputs serving as the selector control inputs of the mux, and the opcode connected to the 32 data input positions of the mux. Although simple and powerful, this design has two main problems:

1: size: The relatively large size of a 32-1 mux compared to the size of the flip-flop that implements the stored bit means that the processor cell will be mostly occupied by the one large mux. The data storage density of the array will be extremely small.
2: size: A 32-bit wide instruction path must be present throughout the array. This means that the array chip would be occupied mostly by instruction wiring, leading to a surprisingly low processor density, further lowering the data storage density. The cells' logic would be farther from neighbors, increasing the length of data paths and decreasing speed.
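For concreteness, the 32-bit lookup-table cell discussed above can be sketched in Python as follows. The function names (lut5_cell, make_opcode), the neighbor names (west, east, south, north) and the order in which the five bits form the mux selector are all assumptions chosen for the illustration, not details from the patent.

```python
def lut5_cell(opcode, current, west, east, south, north):
    """New bit for one cell under a 32-bit truth-table opcode.

    The five data bits form a 5-bit index that picks one bit of the opcode,
    which is what a 32-1 mux does. The index bit order is an assumption.
    """
    index = (current << 4) | (west << 3) | (east << 2) | (south << 1) | north
    return (opcode >> index) & 1


def make_opcode(func):
    """Build the 32-bit opcode (truth table) for any 5-input function."""
    opcode = 0
    for index in range(32):
        bits = [(index >> k) & 1 for k in (4, 3, 2, 1, 0)]
        if func(*bits):
            opcode |= 1 << index
    return opcode


# Hypothetical example: parity (XOR) of the stored bit and all four neighbors.
PARITY = make_opcode(lambda c, w, e, s, n: c ^ w ^ e ^ s ^ n)
new_bit = lut5_cell(PARITY, current=1, west=0, east=1, south=1, north=0)
```

Any of the 2^32 possible 5-input functions can be selected this way, which is what makes this variant the most powerful one, at the cost of the mux and the 32 opcode wires per cell.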
For these reasons, the following design is preferred:

A 6-bit opcode provided by the control processor would be divided into two fields:

The 2-bit selector field will be used to select one of the four inputs from nearest neighbor cells. This is easily implemented with a 4-1 mux.
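The 4-bit function field will designate a binary function of 2 inputs, one of which will be the "current" bit, and the second of which will be the selected bit from one of the four nearest neighbors. This also is easily implemented with a 4-1 mux, with the two input bits serving as the selector control inputs and the function field connected to the 4 data input positions.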
Although not as powerful, the preferred design seems to work well for providing a reasonable processor density, while still being sufficiently powerful. Convenient algorithms exist to implement multi-bit arithmetic on groups of cells at reasonable speed.
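As a rough illustration of the preferred cell, the Python sketch below simulates one clock cycle of a small array. Several details are assumptions made for the example and are not taken from the patent: the position of the two fields within the opcode, the bit order of the 4-entry truth table, the neighbor order, reading an absent boundary input as 0, and the rule that a disabled cell simply keeps its bit. The names cell_step, array_step, COPY and XOR are hypothetical.

```python
def cell_step(opcode, current, neighbors, enabled=1):
    """One clock cycle for one cell under the 6-bit opcode.

    opcode bits 5..4: selector field, an index into `neighbors`
    opcode bits 3..0: function field, a 4-entry truth table
    """
    if not enabled:
        return current                    # assumed: a disabled cell keeps its bit
    selector = (opcode >> 4) & 0b11
    function = opcode & 0b1111
    selected = neighbors[selector]        # first 4-1 mux: pick a neighbor
    index = (current << 1) | selected     # second 4-1 mux: look up the result
    return (function >> index) & 1


def array_step(grid, opcode, x_enable, y_enable):
    """Synchronous update: every new bit is computed from the old grid."""
    width, height = len(grid), len(grid[0])

    def read(i, j):
        # Assumed: an absent input at the array boundary reads as 0.
        return grid[i][j] if 0 <= i < width and 0 <= j < height else 0

    return [
        [
            cell_step(
                opcode,
                grid[n][m],
                [read(n - 1, m), read(n + 1, m), read(n, m - 1), read(n, m + 1)],
                x_enable[n] & y_enable[m],
            )
            for m in range(height)
        ]
        for n in range(width)
    ]


COPY = 0b1010   # new bit = selected neighbor bit
XOR = 0b0110    # new bit = current bit XOR selected neighbor bit

# Shift the whole array one step along x: select neighbor n-1,m and copy it.
shift_x = (0 << 4) | COPY
grid = [[1, 0], [0, 0], [0, 1]]           # grid[n][m], 3 cells in x, 2 in y
shifted = array_step(grid, shift_x, [1, 1, 1], [1, 1])
partial = array_step(grid, shift_x, [1, 1, 0], [1, 1])
```

With all cells enabled, every bit moves one position along x. In the second call the cells with n = 2 are disabled and keep their old contents, which shows how the enable vectors restrict an operation to part of the array.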

The primary advantage of this type of array is that the simplicity of the cells allows one to have a great many of them in a small space. The nearest neighbor data connection scheme ensures that all data lines/wires will be short (too bad about the opcode/enable lines). The multitude of cells almost completely makes up for the minimal data/cell ratio. The short wires and simple logic allow the clock to be very fast, compensating for the fact that even "primitive" arithmetic operations must be programmed explicitly.
At the time of the invention, a major disadvantage was the need to put a great many cells on a single chip to obtain these advantages. Custom logic at that time would only permit an array of about 1K-2K cells on a chip. This was enough to do Winograd-style 31-point FFTs at better than average rates, but could not handle larger amounts of data. Fortunately, with Moore's law and the passage of time, a single custom chip should now be able to accommodate about 32K-64K processor cells. This design would be very hard to compete against. Now the primary problem is the long control lines: given the high clock rate desired, they must be driven by very high-power buffers, which would require a very large amount of power and cooling.