Fully Interconnecting Multiple Computers with Pipelined Sorting Nets

Hans P. Moravec
Artificial Intelligence Lab, Computer Science Dept.
Stanford University, Stanford, Ca. 94305
1978

Abstract

A pipelined multiprocessor interconnection method functionally equivalent to a full crossbar, but with a per-processor cost proportional to the square of the log of the total number of processors, is presented.

Index Terms

Multiprocessors, switching nets, sorting nets, MIMD organization, parallel computation.

Computers are becoming cheaper, and the demands on their processing power appear open ended. Multiple computers interacting to solve single problems are one consequence of these conditions. The usefulness and generality of such systems depend strongly on the richness of the interconnections between the machines. A crossbar provides full generality, and is economical for systems with up to about 100 computers. The crossbar construction uses a number of components proportional to the square of the number of computers interconnected, and becomes prohibitively expensive for larger systems.

In 1968 K.E. Batcher [1] described a method of interconnecting O(n log^2 n) simple sorting elements to produce a network able to sort n numbers. In the same paper he also presented an interprocessor message routing system, with the same growth rate, constructed of such nets and a small amount of auxiliary circuitry. In Batcher's design all the processors could simultaneously emit one message each to an arbitrary destination. Each processor could also receive one message at a time. Multiple simultaneous messages to a single processor were resolved by a priority-number, introduced by the message originator, carried in each message. The message with the highest priority to a given destination was always successfully delivered in a fixed time; the lower priority ones were always blocked. Simultaneously with the message deliveries, the processors originating the messages received an acknowledgment of successful delivery, or a notice of blockage.

The messages from the processors, and a like number of place-holding "dummy" messages, were introduced at one edge of the connection network. They filtered through the net to an exchange station, where portions were conditionally swapped, then retraced their paths backwards through the same net, which had to remember the forward routing, to arrive at their starting points. This organization demanded that the elements be bidirectional, and that the net handle only one message wave at a time. The switching elements of necessity sat idle a large fraction of the time.

This paper presents an extension of Batcher's organization that preserves all the essential behavior, but does not require bidirectional flow. It uses a little over twice as many primitive sorting elements, but the elements, being unidirectional, are simpler. The net can handle several message waves at a time, and one wave can immediately follow the previous one, with no delay, as in a pipeline.

Overview

The interconnection method delivers messages from N originating ports to N destination ports. Simultaneously the N originating ports each send out one message. Each message is addressed to one destination port, and includes a priority-number. These messages pass through the interconnection network and arrive at their destinations at the same time.
The network is pipelined, and the originating ports can begin transmitting a subsequent wave of N messages as soon as they finish the first, even though the first wave may not yet have reached its destinations. Each destination port receives the message with the highest priority addressed to it in each message wave. If no one sent it a message in that wave, it receives a dummy (null) message instead. At the same instant that a message wave arrives, the processors that originated each delivered message receive an acknowledgment, while the processors whose messages were blocked by higher priority ones receive notices of failure. The notice of failure is a copy of the undelivered message. The success notice is a dummy message.

Using 100MHz ECL logic and Batcher's sorting net construction, over a thousand small to medium size computers could be interconnected economically by this method. The interconnection net would cost about as much as the computers. Message waves would start every 500 nanoseconds, and each wave would arrive two microseconds after it had been sent.

Detail

In the interconnector, a sorting net permutes a list of the messages and of place holders for all the possible destinations into an order in which each unblocked message is next to the place holder for its desired destination. A relatively simple set of circuits examines contiguous pairs of entries in this sorted list, and copies the messages into their desired destination place holders. A second sorting network restores this modified list into its original order. At the output of this second net, the place holder slots act as physical message delivery ports. The scheme is diagrammed in figs. 1, 2 and 3.

Each processor is assigned an "address", a number from 1 to N. The processor with address i introduces its messages into port i in the lower left of the diagram, and receives messages sent to it through port i in the upper right. Acknowledgments about successful or failed deliveries of messages previously introduced by processor i come out on port i in the lower right. The sorters take their inputs in any order and reorder them so that the smaller numbers come out towards the top of the diagram. The merger takes two sorted lists of length N, and combines them into a single sorted list of length 2N. Note that the sorter/merger combination to the left of the exchanger in fig. 1 would be a complete sorter for 2N numbers with the addition of another sorter for N numbers in the upper left.

Messages are serial bit streams, and consist of a destination processor address, a priority-number (invented by the originating computer), a one-bit "dummy" flag field (set to 1 for actual messages), the address of the source processor (i.e. a return address), and the data to be communicated. A small number implies high priority; zero is the highest priority. The net is assumed to run at 100% duty cycle, with the processors emitting successive synchronized waves of messages. Every processor emits a message every message interval. (In an actual application a processor may not have a message to send every interval; in that case it can emit a message of lowest possible priority to an arbitrary destination. Such messages have absolutely no effect on higher priority mail.) The following discussion examines a single message wave.
Figure 1: Block diagram of the interconnection scheme. [Original ASCII diagram not reproduced: N processor message injection ports feed an N-input sorter keyed on the destination field; its outputs, together with N dummy messages, enter an N,N merger, then the exchanger, then a 2N-input sorter keyed on the source field, whose upper N outputs are the processor incoming message ports and whose lower N outputs are the acknowledgment ports.]

+-----------------+----------+---+-----------------+----------------------+
| Destination Ad  | Priority | 1 | Source Address  | Data                 |
+-----------------+----------+---+-----------------+----------------------+

Figure 2: Processor message format. High order bit is on the left. Messages are introduced into the net high order bit first.

+-----------------+----------+---+-----------------+----------------------+
| Position Number | Zero     | 0 | Position Number | Unused               |
+-----------------+----------+---+-----------------+----------------------+

Figure 3: Dummy message format.

The first sorting net orders the messages by destination address, and within a given destination by priority-number. Thus the lower inputs of the merger receive a list of messages, grouped by destination, with the highest priority message to each processor heading its group. The upper inputs of the merger receive N place-holding dummy messages, exactly one for each destination processor. The priority field is the highest possible (i.e. zero), the dummy flag is 0, the source address is the same as the destination, and the data portion is unused. Merging these dummies with the sorted list of real messages results in a list still grouped by destination, with each group headed by a dummy, by virtue of its high priority, followed immediately by the highest priority real message, if any, for that destination. Note that real messages, even those with zero priority-numbers, are sorted behind dummy messages because of the flag bit following the priority field, which acts as a least significant deciding bit in those cases.

This list is fed to the exchange network, which examines adjacent pairs of messages (considering overlapping pairs), and exchanges the data portions of a pair if the first member happens to be a dummy to a given address, and the second is a real message to the same address (i.e. it is the highest priority real message to that destination). The sorting network following the exchanger sorts the messages by the field beginning with the dummy flag, which acts as the high order bit, followed by the source address. Since there were N real messages, one from each processor, and N dummies, also nominally one from each processor, and since real messages are sorted behind dummy messages due to the high order bit (i.e. the dummy flag) being 1, the second sorter restores the messages into the same order in which they were introduced.

Each processor has two input ports, one labeled "incoming message" and the other "acknowledgment". The incoming message input of processor i is connected to output i of the second sorter. The acknowledgment input is connected to output N+i (i.e. the i'th output of the lower half).
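To make the routing just described concrete, the following is a minimal software sketch (not part of the original design) of one message wave passing through the interconnector. Python's built-in sort stands in for the hardware sorting nets and the merger, the names (Message, route_wave) are invented for illustration, and only the data portions are exchanged, as in the basic scheme above.

    # A minimal software sketch of one message wave through the interconnector.
    from dataclasses import dataclass, replace

    @dataclass
    class Message:
        dest: int        # destination processor address (fig. 2)
        priority: int    # zero is the highest priority
        flag: int        # 1 for real messages, 0 for dummies (fig. 3)
        source: int      # return address; for dummies, the position number
        data: object = None

    def route_wave(msgs, n):
        """msgs: exactly one real message from each processor 1..n.
        Returns (incoming, acknowledgment) lists indexed by processor address."""
        # First sorter: real messages ordered by destination, then priority.
        lower = sorted(msgs, key=lambda m: (m.dest, m.priority, m.flag))
        # Upper merger inputs: one place-holding dummy per destination.
        dummies = [Message(d, 0, 0, d) for d in range(1, n + 1)]
        merged = sorted(dummies + lower, key=lambda m: (m.dest, m.priority, m.flag))
        # Exchanger: swap data portions when a dummy is immediately followed by
        # a real message to the same destination (the highest priority one).
        for i in range(len(merged) - 1):
            a, b = merged[i], merged[i + 1]
            if a.flag == 0 and b.flag == 1 and a.dest == b.dest:
                merged[i], merged[i + 1] = replace(a, data=b.data), replace(b, data=a.data)
        # Second sorter: order by dummy flag, then source address, restoring the
        # original port order (dummies on top, real messages below).
        final = sorted(merged, key=lambda m: (m.flag, m.source))
        return final[:n], final[n:]   # incoming message ports, acknowledgment ports

For instance, if processors 1 and 3 both address processor 2 with priorities 5 and 2 respectively, processor 3's data appears on incoming port 2, processor 3 receives a dummy acknowledgment, and processor 1 gets its own message back as a notice of failure.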
In the absence of the exchange network, the i'th processor would receive its own message back on its acknowledgment input, and the i'th dummy message on the incoming message input. Because of the exchanger, however, if the message that processor i had sent happened to be the highest priority message to the requested destination, then the data portion of the message on the acknowledgment input would be that of the dummy it had been swapped with (signaling success). Also, if any messages had been addressed to processor i, the data portion of the highest priority one would arrive on the incoming message port, in place of the dummy message. Thus a processor receives the highest priority message addressed to it on its incoming message port, or a dummy if nobody wanted to talk to it. It receives a dummy on its acknowledgment port if its message has gotten through, or its own message back if it has not, due to the existence of a higher priority message to the same destination.

A Further Extension

The serial nature of the sorter causes the destination and priority fields to be lost in the source address sorter (they trail immediately behind the previous wave of messages). In the case of messages that fail to get delivered, this means that the originating processor must remember to whom it sent the message (about four message times ago in a typical design, due to the latency of the net), if it wants to try again. This is probably undesirable. Also, delivered messages contain no indication of who sent them, having had their source address fields exchanged with those of dummies. These shortcomings can be overcome if the exchanger shuffles the destination address, source address and priority fields in the manner suggested by figs. 4, 5 and 6. Such shuffling can be accomplished with an amount of storage at each exchanger position equal to the number of bits in the destination and priority fields.

Before:

+-----------------+----------+---+-----------------+----------------------+
| Destination Ad  | Zero     | 0 | Destination Ad  | Unused               |
+-----------------+----------+---+-----------------+----------------------+
+-----------------+----------+---+-----------------+----------------------+
| Destination Ad  | Priority | 1 | Source Address  | Data                 |
+-----------------+----------+---+-----------------+----------------------+

After:

+---+-----------------+----------------------+-----------------+----------+
| 0 | Destination Ad  | Data                 | Source Address  | Priority |
+---+-----------------+----------------------+-----------------+----------+
+---+-----------------+----------------------+-----------------+----------+
| 1 | Source Address  | Unused               | Destination Ad  | Zero     |
+---+-----------------+----------------------+-----------------+----------+

Figure 4: Rearrangements effected by the exchanger in an exchanged pair.

Before:

+-----------------+----------+---+-----------------+----------------------+
| Destination Ad  | Priority | 1 | Source Address  | Data                 |
+-----------------+----------+---+-----------------+----------------------+

After:

+---+-----------------+----------------------+-----------------+----------+
| 1 | Source Address  | Data                 | Destination Ad  | Priority |
+---+-----------------+----------------------+-----------------+----------+

Figure 5: Rearrangements in an unsuccessful message.
Before:

+-----------------+----------+---+-----------------+----------------------+
| Destination Ad  | Zero     | 0 | Destination Ad  | Unused               |
+-----------------+----------+---+-----------------+----------------------+

After:

+---+-----------------+----------------------+-----------------+----------+
| 0 | Destination Ad  | Unused               | Destination Ad  | Zero     |
+---+-----------------+----------------------+-----------------+----------+

Figure 6: Rearrangements in an isolated dummy message.

Hardware and Timing

Batcher's sorting nets consist of cascades of mergers. The mergers themselves are built up of simple two-number sorters. Although in principle the primitive sorters could deal with parallel numbers, they are much simpler when the inputs are presented as synchronized serial bit streams, high order bit first. In the latter case the element has two inputs, A and B, for the quantities to be sorted, two outputs, L and H, and a clock and a reset line. Prior to the arrival of the high order bits on the inputs, the element is reset, putting it into an "undecided" state. The inputs then stream in. Until they differ in a bit position, the element cannot and need not decide which is the larger, and it simply passes them on to the outputs. When the first bit position in which A differs from B comes along, the sorting element switches the input with the 0 to output L, and the input with the 1 to output H. The sorting element retains this interconnection state for all subsequent bits, until the next time it is reset.

Figure 7: A possible logic implementation for a primitive two-number sorting element.

A circuit with these properties can be constructed with about 40 logic gates. It introduces the equivalent of one shift register stage of delay (fig. 7). Mergers that take two sorted lists of length N/2 and produce a single sorted list of length N are constructed of (N/2) log2 N primitive elements, and introduce log2 N shift register stages worth of delay.

Batcher's bitonic sorter strategy provides a means of building large mergers out of smaller ones, and thus a method for grouping primitive elements into modular units. A 48 pin integrated circuit package can contain a merger taking two length 8 sequences and producing one of length 16. Larger mergers can be built principally with large numbers of this circuit, plus smaller numbers of three other similar packages containing two 4,4 to 8 mergers, four 2,2 to 4 mergers and eight primitive 1,1 to 2 elements (fig. 8). Such packages, each containing on the order of 1000 gates, would be only moderately complex by present-day standards, and could be built to have a 100MHz shift rate with ECL logic.

Figure 8: Packaging primitive sorters.

Each package type in fig. 8 contains mergers built following Batcher's bitonic sorting strategy. When presented with the concatenation of two sequences, one in ascending and the other in descending order, such mergers produce a single sorted sequence. The inputs are on the left of each package, the outputs are on the right. Each small rectangle is a primitive sorter. In addition to the pins shown, each package must have a clock line, a reset input and output, and power and ground.

The reset signals would be clocked through each package by a series of shift register stages, one for each column of primitive sorters. The input to the first stage would come from the reset input pin, and the output of the last stage would be wired to the reset output line. Each stage would reset the corresponding column of primitive elements in the package. By wiring the reset outputs of one column of sorting packages to the reset inputs of the next, one can construct sorting nets where a synchronous reset wave propagates just ahead of each new wave of numbers to be sorted.
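The decision behavior of the primitive serial element described above is simple enough to capture in a few lines of software. The following behavioral sketch is not from the paper; the class and method names are invented, and the one-stage pipeline delay of the real circuit is ignored.

    # A behavioral model of the primitive two-number serial sorting element:
    # one bit per clock, high order bit first.
    class SerialSorter:
        def __init__(self):
            self.reset()

        def reset(self):
            self.decided = False   # "undecided": A and B have not yet differed
            self.swap = False      # once decided: route A to H and B to L if True

        def clock(self, a_bit, b_bit):
            """Accept one bit from each input; return the (L, H) output bits."""
            if not self.decided and a_bit != b_bit:
                self.decided = True
                self.swap = a_bit > b_bit    # the input carrying the 1 goes to H
            if self.swap:
                return b_bit, a_bit
            return a_bit, b_bit              # pass-through, including the undecided case

    # Streaming 6 (0110) and 9 (1001): the element decides on the first bit and
    # routes 6 to output L and 9 to output H for the rest of the word.
    s = SerialSorter()
    low, high = zip(*(s.clock(a, b) for a, b in zip([0, 1, 1, 0], [1, 0, 0, 1])))
    assert list(low) == [0, 1, 1, 0] and list(high) == [1, 0, 0, 1]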
Figure 9: A 16,16 and a 32,32 merger made with standard packages.

Large mergers can be constructed from the packaged small ones (fig. 9). Each doubling in the length of the sequences to be merged requires an additional column of primitive sorting elements. The 1,1 merger package can contribute one such column. The 8,8 packages can add 4 layers, and an appropriate number of them can be used with 16 replicas of a given size merger to construct one 16 times as large. There are alternative, equally efficient, decompositions into package types for some merger sizes. In very large mergers the 8,8 would be the dominant package type.

Sorters are made of cascades of mergers. 1,1 mergers take inputs in pairs and make sorted lists of length 2, which are combined by 2,2 mergers into lists of length 4, and so on until the inputs are fully sorted. A system interconnecting 1024 processors could be built with a total of 107,008 primitive sorting elements, grouped into 4224 48-pin packages, not counting a considerably smaller number at the exchange station. The interconnection net would introduce 133 stages of delay. If the messages were 50 bits long, the delay from start of transmission to end of arrival would be less than 2 microseconds. Each processor could introduce a message every half microsecond. The bandwidth of the entire net would be over 100 gigabits/second, about half of which would be devoted to conveying addresses.

Processor-Memory Switch

The network as presented is ideal as an interprocessor message switch. The requirements for a network which connects a large number of processors to many independent memory units are slightly different. In the simplest arrangement of the latter case the processors send messages (read or write requests) only to memories, and memories send messages (memory cell contents) only to processors. The faster-than-linear growth rate of Batcher nets indicates that these two directions of messages are most economically handled by two separate routing nets. Note that the arrangement we have outlined does not actually require that the number of message sending ports be equal to the number of receivers. In an unsymmetric version the number of dummy messages would equal the number of receiving ports, and be different from the number of sending ports.

The biggest problem with using our net in a multi-port memory may be that, while its bandwidth is more than adequate, its latency is relatively long. A memory cache at each processor could alleviate some, but not all, of the ill effects. It is possible to decrease the transit time through the net, at the expense of bandwidth. Instead of re-clocking the bits at each primitive sorter, a number of sorting levels can be interconnected purely combinatorially. The settling time of such extended sorters would be longer than that of primitive elements, since signals would have to filter through more gate delays, so the maximum shift rate would be reduced. The number of clock cycles through the net would also be reduced, however, potentially decreasing the net's latency. Making portions of the net asynchronous permits further slight economies.
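The component counts and timing quoted in the Hardware and Timing discussion above can be cross-checked with a short calculation. The sketch below is not from the paper; it assumes the net of fig. 1 decomposes into an N-input bitonic sorter, an N,N merger and a 2N-input bitonic sorter, and uses the standard bitonic formulas (an n-input sorter needs (n/4) log2 n (log2 n + 1) comparators at a depth of log2 n (log2 n + 1)/2; an n,n merger needs n (log2 n + 1) comparators at a depth of log2 n + 1).

    # A rough cross-check of the 1024-processor figures quoted above.
    from math import log2

    def sorter(n):                # (elements, delay stages) of an n-input bitonic sorter
        k = int(log2(n))
        return n * k * (k + 1) // 4, k * (k + 1) // 2

    def merger(n):                # (elements, delay stages) of an n,n bitonic merger
        k = int(log2(n))
        return n * (k + 1), k + 1

    N = 1024
    parts = [sorter(N), merger(N), sorter(2 * N)]
    elements = sum(e for e, _ in parts)       # 28160 + 11264 + 67584 = 107008
    stages = sum(d for _, d in parts) + 1     # 55 + 11 + 66, plus 1 for the exchanger = 133

    bit_ns, message_bits = 10, 50             # 100MHz shift rate, 50 bit messages
    latency_us = (stages + message_bits) * bit_ns / 1000   # about 1.8 microseconds
    bandwidth_gbps = N * (1000 / bit_ns) / 1000            # about 102 Gbit/s across all ports
    print(elements, stages, latency_us, bandwidth_gbps)

Under these assumptions the totals, 107,008 elements and 133 stages of delay, agree with the figures given above, and a 50 bit message spends roughly 1.8 microseconds in flight.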
Sub-nets built with Batcher's odd-even strategy have the same maximum delay as those built with bitonic elements, but need somewhat less logic. Van Voorhis' constructions [2] can save yet a few more gates, and even offer a little speedup.

References and Notes

Affiliation of Author: Artificial Intelligence Lab, Computer Science Dept., Stanford University, Stanford, Ca. 94305.

[1] K.E. Batcher, "Sorting Networks and their Applications," AFIPS 1968 Spring Joint Computer Conf. Proceedings, April 1968, pp. 307-314.

[2] D.C. Van Voorhis, "An Economical Construction for Sorting Networks," AFIPS 1974 National Computer Conf. Proceedings, pp. 921-927.