COMMENT μί VALID 00036 PAGES
C REC PAGE DESCRIPTION
C00001 00001
C00004 00002 \|\\@macros.pox[doc,hpm]\άκ=0\ͺ\άκ=34\΅\F1\
C00005 00003 \FB
C00006 00004 \FB
C00009 00005 \FB
C00011 00006 \FB
C00012 00007 \FBSection 1: The Natural History of Intelligence
C00024 00008 \FDUnifying principles:\F0
C00031 00009 \FDHarangue:\F0
C00035 00010 \FDReferences:\F0
C00038 00011
C00040 00012
C00041 00013 \FBSection 2: Measuring Processing Power
C00046 00014 \FDEntropy measurement:\F0
C00051 00015 \FDA representative computer:\F0
C00057 00016 \FDA typical nervous system:\F0
C00062 00017 \FDThermodynamic efficiency:\F0
C00070 00018 \FDReferences:\F0
C00073 00019 \FBSection 3: The Growth of Processing Power
C00085 00020 \FDReferences:\F0
C00088 00021 \FBSection 4: Interconnecting Processors
C00093 00022 \F2\άκ=25\΅
C00098 00023 \FDCommunication scheme organization:\F0
C00103 00024 \F0\J
C00111 00025 \F2\άκ=25\΅
C00116 00026 \FDPackage counts:\F0
C00123 00027 \F0\J
C00128 00028 \FDSpeed calculations:\F0
C00131 00029 \FDPossible refinements:\F0
C00136 00030 \FDReferences:\F0
C00137 00031
C00140 00032 \FDA little Lisp:\F0
C00154 00033 \FDA little Algol:\F0
C00165 00034 \FDA little operating systems:\F0
C00171 00035 \FDDisclaimer:\F0
C00175 00036 \FB
C00179 ENDMK
Cμί;
\|\\@macros.pox[doc,hpm];\άκ=0;\ͺ\άκ=34;\΅\F1\
\δΗ'010000;\δΗ'1000002;\
\MBBDI40;
\MDNONMBI;
\MEMS25;
\MFCRTURZ;
\άκ'0;\άν#\ΰϋEVERYPAGE[\άκ#\+=1;\άν#\oEP{0 \D#}\WEP,=1435;=150;]
\FB
\C\fFC
\CThe Role of
\CRAW POWER
\Cin
\CINTELLIGENCE
\F1\RHans Moravec
\RMay 12, 1976
\FB
CONTENTS
\F0 1\FB Introduction
\F0 2\FB Acknowledgement
\F0 3\FB Section 1: The Natural History of Intelligence
\F0 3\FB \F0Product lines\FB
\F0 5\FB \F0Unifying principles\FB
\F0 6\FB \F0Harangue\FB
\F0 7\FB \F0References\FB
\F010\FB Section 2: Measuring Processing Power
\F010\FB \F0Low level vision\FB
\F011\FB \F0Entropy measurement\FB
\F013\FB \F0A representative computer\FB
\F014\FB \F0A typical nervous system\FB
\F015\FB \F0Thermodynamic efficiency\FB
\F017\FB \F0References\FB
\F018\FB Section 3: The Growth of Processing Power
\F020\FB \F0References\FB
\F021\FB Section 4: Interconnecting Processors
\F021\FB \F0\!jexp(Log,2); sorting net construction\FB
\F023\FB \F0Communication scheme organization\FB
\F027\FB \F0Package counts\FB
\F029\FB \F0Speed calculations\FB
\F030\FB \F0Possible refinements\FB
\F031\FB \F0References\FB
\F032\FB Section 5: Programming Interconnected Processors
\F033\FB \F0A little Lisp\FB
\F035\FB \F0A little Algol\FB
\F037\FB \F0A little operating systems\FB
\F038\FB \F0Disclaimer\FB
\F039\FB Section 6: Bombast
\δΗ'100;
\FB
Introduction:
\F0\J
This essay is an argument that an essential ingredient is
absent in many current conceptions of the road to Artificial
Intelligence.\.
\J
The first section discusses natural intelligence, and notes
two major branches of the animal kingdom in which it evolved
independently, and several offshoots. The suggestion is that
intelligence need not be so difficult to construct as is sometimes
assumed.\.
\J
The second part compares the information processing ability
of present computers with intelligent nervous systems, and finds a
factor of one million difference. This abyss is interpreted as a major
distorting influence in current work, and a reason for disappointing
progress.\.
\J
Section three examines the development of electronics, and
concludes the state of the art can provide more power than is now
available, and that the gap could be closed in a decade.\.
\J
Parts four and five introduce hardware and software aspects
of a system which is able to make use of the advancing technology.\.
\FB
Acknowledgement:
\F0\J
The following entities provided inspiration, encouragement,
suggestions, data, slave labor, proof reading services, etc.:
(in carefully randomized order) PDP-KA10, Scientific
American, Marc Le Brun, Andy Moorer, Ed Mcguire, Electronics
magazine, Don Gennery, Bill Gosper, John McCarthy, Macsyma, Mike
Farmwald, Russ Taylor, Cart Project, Les Earnest, Pierre van
Nypelseer, Robert Maas, Jeff Rubin, Bruce Baumgart, HAL-9000, Tom
Binford, Clem Smith, Tom Gafford, Brian Harvey, ...
\.
\FBSection 1: The Natural History of Intelligence
\F0
\FDProduct lines:\F0
\J
Natural evolution has produced a continuum of complexities of
behavior, from the mechanical simplicity of viruses to the magic of
mammals. In the higher animals most of the complexity resides in the
nervous system.
Evolution of the brain began in early multi-celled animals a
billion years ago with the development of cells capable of
transmitting electrochemical signals. Because neurons are more
localized than hormones they allow a greater variety of signals in a
given volume. They also provide evolution with a more uniform medium
for experiments in complexity.
The advantages of implementing behavioral complexity in
neural nets seem to have been overwhelming, since all modern animals
more than a few cells in size have them [animal].
Two major branches in the animal kingdom, vertebrates and
mollusks, contain species which can be considered intelligent. Both
stem from one of the earliest multi-celled organisms, an animal
something like a hydra made of a double layer of cells and possessing
a primitive nerve net.
Most mollusks are intellectually unimpressive sessile
shellfish, but one branch, the cephalopods, possesses high mobility,
large brains and imaging eyes, evolved independently of the
corresponding vertebrate structures. There are fascinating
differences. The optic nerve connects to the back of the retina, so
there is no blind spot. The brain is annular, forming a ring
encircling the esophagus. The circulatory system, also independently
evolved, has three blood pumps, a systemic heart pumping oxygenated
blood to the tissues and two gill hearts, each pumping venous blood
to one gill. The oxygen carrier is a green copper compound called
hemocyanin, evolved from an earlier protein that also became
hemoglobin.\.
\J
These animals have some unique abilities. Shallow water
octopus and squid are covered by a million individually controlled
color changing effectors called chromatophores, whose functions are
camouflage and communication. The capabilities of this arrangement
have been demonstrated by a cuttlefish accurately imitating a
checkerboard it was placed upon, and an octopus in flight which
produced a pattern like the seaweed it was traversing, coruscating
backward along the length of its body, diverting the eye from the
true motion. Deep sea squid have photophores capable of generating
large quantities of multicolored light. Some are as complex as eyes,
containing irises and lenses [squid]. The light show is modulated by
emotions in major and subtle ways. There has been little study of
these matters, but this must provide means of social interaction.
Since they also have good vision, there is the potential for high
bandwidth communication.
Cephalopod intelligence has not been extensively
investigated, but a few controlled experiments indicate rapid
learning in small octopus [Boycott]. The Cousteau film shows an
octopus' response to a "monkey and bananas" problem. A fishbowl
containing a lobster is sealed with a cork and dropped into the water
near it. The octopus is attracted, and spends a long while
alternately probing the container in various ways and returning to
its lair in iridescent frustration. On the final iteration it exits
its little hole in the ground and unhesitatingly wraps three
tentacles around the bowl, and one about the cork, and pulls. The
cork shoots to the surface and the octopus eats. The Time-Life film
contains a similar sequence, with a screw top instead of a cork! If
small octopus have almost mammalian behavior, what might giant deep
sea squid be capable of?
Birds are more closely related to humans than are
cephalopods, their common ancestor with us being a 300 million year
old early reptile. Size-limited by the dynamics of flying, some
birds have reached an intellectual level comparable to the highest
mammals.
Crows and ravens are notable for frequently outwitting man.
Their intuitive number sense (ability to perceive the cardinality of
a set without counting) extends to seven, as opposed to three or four
in man. Such a sense is useful for keeping track of the number of
eggs in a nest. Experiments have shown [Stettner] that most birds are
more capable of high order "reversal" and "learning set" learning
than all mammals except the higher primates. In mammals these
abilities increase with increasing cerebral cortex size. In birds
the same functions depend on areas not present in mammalian brains,
forebrain regions called the "Wulst" and the hyperstriatum. The
cortex is small and relatively unimportant. Clearly this is another
case of independent evolution of similar mental functions.
It is interesting to speculate whether penguins, now similar
to seals in behavior and habitat, will ever become fully aquatic, and
evolve analogously to the great whales.
The cetaceans are related to man through a small 30 million
year old primitive mammal. Some species of dolphin have body and
brain masses identical to ours, and archaeology reveals they have
been this way several times as long. They are as good as man at many
kinds of problem solving, and perhaps at language. The references
contain many anecdotes, and describe a few controlled experiments,
showing that dolphins can grasp and communicate complex ideas.
Killer whales have brains seven times human size, and their ability
to formulate plans is better than dolphins', occasionally being used
to feed on them. Sperm whales, not the largest animals, have the
world's largest brains. There may be intelligence mediated conflict
with large squid, their main food supply.
Elephants have brains about six times human size, matriarchal
tribal societies, and complex behavior. Indian domestic elephants
learn 500 commands, limited by the range of tractor-like tasks their
owners need done, and form voluntary mutual benefit relationships
with their trainers, exchanging labor for baths. They can solve
problems such as how to sneak into a plantation at night to steal
bananas, after having been belled (answer: stuff mud into the bells).
And they remember for decades. Inconvenience and cost has prevented
modern science from doing much work with them.
The apes are man's cousins. Chimps and gorillas can learn to
use tools and to communicate with human sign languages at a retarded
level. As chimps have one third, and gorillas one half, human
brainsize, similar results should be achievable with the larger
brained, but less man-like animals. Though no other species has
managed to develop cultures comparable to modern man's, it may be
that some of them can be made partners in ours, accelerating its
evolution by their unique capabilities.
\.
\FDUnifying principles:\F0
\J
A feature shared by all living organisms whose behavior is
complex enough to place them near human in intelligence is a hundred
billion neuron nervous system. Imaging vision requires a billion
neurons, while a million is associated with fast and interesting, but
stereotyped, behavior as in a bee. A thousand is adequate for slow
moving animals with minimal sensory input, such as slugs and worms.
A hundred runs most sessile animals. The portions of nervous systems
for which tentative wiring diagrams have been obtained, including
much of the brain of the large neuroned sea slug, Aplysia, the flight
controller of the locust and the early stages of some vertebrate
visual systems, reveal that the neurons are configured into
efficient, clever, assemblies. This should not be too surprising, as
unnecessary redundancy means unnecessary metabolic load, a distinct
selective disadvantage.
Evolution has stumbled on many ways of speeding up its own
progress, since species that adapt more quickly have a selective
advantage. Most of these, such as sex and individual death, have been
refinements of one of the oldest, the encoding of genetic information
in the easily mutated and modular DNA molecule. In the last few
million years the genetically evolved ability of animals, especially
mammals, to learn a significant fraction of their behavior after
birth has provided a new medium for growth of complexity. Modern
man, though probably not the most individually intelligent animal on
the planet, is the species in which this cultural evolution seems to
have had the greatest effect, making human culture the most potent
force on the earth's surface.
Cultural and technological evolution proceeds by massive
interchange of ideas and information, trial and error guided by the
ability to predict the outcome of simple situations, and other
techniques mediated by the intelligence of the participants. The
process is self reinforcing because its consequences, such as
improved communication methods and increased wealth and population,
allow more experiments and faster cross fertilization among different
lines of inquiry. Many of its techniques have not been available to
biological evolution. The effect is that present day global
civilization is developing capabilities orders of magnitude faster.
Of course biological evolution has had a massive head start.\.
\J
Although cultural evolution has developed methods beyond
those of its genetic counterpart, the overall process is essentially
the same. It involves trying large numbers of possibilities,
selecting the best ones, and combining successes from different lines
of investigation. This requires time and other finite resources.
Finding the optimum assembly of particular type of component
which achieves a desired function usually requires examination of a
number of possibilities exponential in the number of components in
the solution. With fixed resources this implies a design time rising
exponentially with complexity. Alternatively the resources can be
used in stages, to design subassemblies, which are then combined into
larger units, and so on, until the desired result is achieved. This
can be much faster since the effort rises exponentially with the
incremental size of each stage and linearly with the number of
stages, with an additional small term, for overall planning,
exponential in the number of stages. The resulting construct will
probably use more of the basic component and be less efficient than
an optimal design.
Biological evolution is affected by these considerations as
much as our technology. Presumably there is a way of using the
physics of the universe to construct entities functionally equivalent
to human beings, but vastly smaller and more efficient. Terrestrial
evolution has not had the time or space to develop such things. But
by building in the sequence atoms, amino acids, proteins, cells,
organs, animal (often concurrently), it produced a technological
civilization out of inanimate matter in only two billion years.
\.
\FDHarangue:\F0
\J
The existence of several examples of intelligence designed
under these constraints should give us great confidence that we can
achieve the same in short order.
The situation is analogous to the history of heavier than air
flight, where birds, bats and insects clearly demonstrated the
possibility before our culture mastered it. Flight without adequate
power to weight ratio is heartbreakingly difficult (vis. Langley's
steam powered aircraft or current attempts at man powered flight),
whereas with enough power (on good authority!) a shingle will fly.
Refinement of the aerodynamics of lift and turbulence is most
effectively tackled after some experience with suboptimal aeroplanes.
After the initial successes our culture was able to far surpass
biological flight in a few decades.
Although there are known brute force solutions to most AI
problems, current machinery makes their implementation impractical.
Instead we are forced to expend our human resources trying to find
computationally less intensive answers, even where there is no
evidence that they exist. This is honorable scientific endeavor,
but, like trying to design optimal airplanes from first principles, a
slow way to get the job done.
With more processing power, competing presently impractical
schemes could be compared by experiment, with the outcomes often
suggesting incremental or revolutionary improvements. Computationally
expensive highly optimizing compilers would permit efficient code
generation at less human cost. The expanded abilities of existing
systems such as MACSYMA, along with new experimental results, would
accelerate theoretical development. Gains made this way would improve
the very systems being used, causing more speedup. The intermediate
results would be inefficient kludges busily contributing to their own
improvement. The end result is systems as efficient and clever as
any designed by more theoretical approaches, but sooner, because more
of the labor has been done by machines.
With enough power anything will fly. The next section tries
to decide how much is needed.
\.
\C\fFB
\FDReferences:\F0
[animal]
JERISON, Harry J.,
"Paleoneurology and the Evolution of Mind",
Scientific American, Vol. 234, No. 1,
January 1976, 90-101.
BITTERMAN, M. E.,
"The Evolution of Intelligence",
Scientific American, Vol. 212, No. 1,
January 1965, 92-100.
GRIFFIN, Donald R., ed.
"Animal Engineering",
W.H. Freeman and Company,
San Francisco, June 1974.
BURIAN, Z. and Spinar, Z.V.,
"Life Before Man",
American Heritage Press, 1972.
RIOPELLE, A.J., ed.
"Animal Problem Solving",
Penguin Books, 1967.
GOODRICH, Edwin S.,
"Studies on the Structure and Development of Vertebrates",
Dover Publications Inc., New York, 1958.
BUCHSBAUM, Ralph,
"Animals without Backbones",
The University of Chicago Press, 1948.
FARAGO, Peter, and Lagnado, John,
"Life in Action"
Alfred A. Knopf, New York, 1972.
BONNER, John Tyler,
"Cells and Societies",
Princeton University Press, Princeton, 1955.
[squid]
COUSTEAU, Jacques-Yves and Diole, Philippe,
"Octopus and Squid",
Doubleday & Company, Inc., Garden City, New York, 1973.
(also a televised film of the same name)
"The Octopus",
a televised film, Time-Life films.
BOYCOTT, Brian B.,
"Learning in the Octopus",
Scientific American, Vol. 212, No. 3,
March 1965, 42-50.
LANE, Frank W.,
"The Kingdom of the Octopus",
Worlds of Science Book, Pyramid Publications Inc.
October, 1962.
[bird]
BAKKER, Robert T.,
"Dinosaur Renaissance",
Scientific American, Vol. 232, No. 4,
April 1975.
STETTNER, Laurence Jay and Matyniak, Kenneth A.
"The Brain of Birds",
Scientific American, Vol. 218, No. 6,
June 1968, 64-76.
[whale]
LILLY, John. C.,
"The Mind of the Dolphin" &
"Man and Dolphin",
Doubleday and Company, New York, 1967.
COUSTEAU, Jacques-Yves and Diole, Philippe,
"The Whale",
Doubleday & Company, Inc., Garden City, New York, 1972.
FICHTELIUS, Karl-Erik and Sjolander, Sverre,
"Smarter than Man?",
Ballantine Books, New York, 1974.
STENUIT, Robert,
"The Dolphin, Cousin to Man",
Bantam Books, New York, 1972.
"Whales and Dolphins",
A BBC produced film shown
in the NOVA television series
McINTYRE, Joan, ed.
"Mind in the Waters",
Charles Scribner's Sons, San Francisco, 1974.
[elephant]
RENSCH, Bernhard,
"The Intelligence of Elephants",
Scientific American,
February 1957, 44.
[primate]
"The First Words of Washoe",
Televised film shown in the NOVA series
LeGROS CLARK, W.E.,
"History of the Primates",
The University of Chicago Press, Chicago 1966.
PFEIFFER, John,
"The Human Brain",
Worlds of Science Books, Pyramid Publications Inc.,
New York, 1962.
\FBSection 2: Measuring Processing Power
\F0
\FDLow level vision:\F0
\J
The visual system of a few animals has been studied in some
detail, especially the layers of the optic nerve near the retina.
The neurons comprising these structures are used efficiently to
compute local operations like high pass filtering and edge,
curvature, orientation and motion detection.
Assuming the visual cortex (and possibly the optic nerve itself)
is as computationally intensive as the retina, successive layers producing
increasingly abstracted representations, we can estimate the total
capability. There are a million seperate fibers in a cross section of the
human optic nerve. At the retina each is connected to several of 10
million light sensitive rods and cones. The thickness of the optical
cortex is a thousand times the depth occupied by the neurons which apply a
single simple operation. The eye is capable of processing images at the
rate of ten per second (flicker at higher frequencies is detected by
special operators). This means that the human visual system evaluates
10,000 million pixel simple operators each second.
A tightly hand coded simple operator, like high pass
filtering by subtraction of a local average, applied to a 256\f2x256
(1/16 million) pixel picture takes at least 10 seconds when executed
on a PDP-10 (KA), not counting timesharing. A million pixel image
would require 160 seconds, if storage constraints permitted it. Since
the computer can evaluate only one at a time, the effective rate is
1/160 million pixel simple operators per second.
Thus a hand coded PDP-10 falls short of being the equal of
the human optical system by a speed factor of 1.6 million.
It may not be necessary to apply every operator to every
portion of every picture, and a general purpose computer, being more
versatile than the optic nerve, can take advantage of this. I grant
an order of magnitude for this effect, reducing the optic nerve to a
mere 100,000 PDP-10 equivalents.
The size of this factor is related to having chosen to
implement our algorithms in machine language. If we had opted to
disassemble a number of PDP-10's and reconfigure the components to do
the computation, far fewer would have been required. On the other
hand if we had run our algorithms in interpreted Lisp, 10 to 100
times as many would be needed. The tradeoff is that the design time
varies inversely with the execution efficiency. A good Lisp program
to compute a given function is easier to produce than an efficient
machine language program, or an equivalent piece of hardware.
Compilers and automatic design programs blur the issue a little by
passing some of the effort of implementation to a machine, but the
essential character remains.
\.
\FDEntropy measurement:\F0
\J
Is there a quantitative way in which the processing power of
a system, independent of its detailed nature, can be measured? A
feature of things which compute massively is that they change state
in complicated and unexpected ways. The reason for believing that,
say, a stationary rock does little computing is its high
predictability. By this criterion the amount of computing done by a
device is in the mind of the beholder. A machine with a digital
display which flashed 1, 2, 3, 4 etc., at fixed intervals would seem
highly predictable to an adult of our culture, but might be
justifiably considered to be doing an interesting, nontrivial and
informative computation by a young child. Information theory
provides a measure for this idea. If a system is in a given state and
can change to one of a number of next states, the information in the
transition, which I will call the Compute Energy (CE), is given by
\.
\CCE = \!jsum(i=1,N); -p\!jsub(i); log p\!jsub(i);
\J
where N is the number of next states, and p\!jsub(i); is the
probability that the i'th state will be occur next. If the base of
the logarithm is two, the measure is in binary digits, bits. If we
consider the system in the long run, considering all the states it
might ever eventually be in, then this measure expresses the total
potential variety of the system.
A machine which can accomplish a given thing faster is more
powerful than a slower one. A measure for Compute Power is obtained
by dividing each term in the above sum by the expected time of the
corresponding transition. This is
\.
\CCP = \!jsum(i=1,N); \!jdiv((-p\!jsub(i); log p\!jsub(i);)\N
,(t\!jsub(i);));
\J
and the units are bits/second.
These measures are highly analogous to the energy and power
capacities of a battery. Some properties follow:
They are linear, i.e. the compute power and energy of a system of
two or more independent machines is the sum of the individual power
and energies;
Speeding up a machine by a factor of n increases the CP by the same
factor;
A completely predictable system has a CP and CE of zero;
A machine with a high short term CP, which can reach a moderate
number of states in a short time, can yet have a low CE, if the total
number of states attainable in the long run is not high;
If the probabilities and times of all the transitions are
equal the measures simplify to
\.
\CCE = log\!jsub(2);N
\CCP = \!jdiv((log\!jsub(2);N),t);
\J
For N distinct outcomes and equal transition times the maximum CE and
CP is obtained if all the probabilities of occurrence are 1/N, so the
above simplifications represent an upper bound in cases where the
probabilities are variable or cannot be determined.\.
\FDA representative computer:\F0
\J
For the KA-PDP10, considering one instruction time, we have
(roughly) that in one microsecond this machine is able to execute one
of 2\!jsup(5); different instructions, involving one of 2\!jsup(4);
accumulators and one of 2\!jsup(18); memory locations, most of these
combinations resulting in distinct next sates. This corresponds to a
CP of
\.
\C\!jdiv((log\!jsub(2);(\!jexp(2,5); \f2x \!jexp(2,4); \f2x \!jexp(2,18);) bit),\
(\!jexp(10,-6); sec)); = 27 \f2x \!jexp(10,6); \!jdiv(bit,sec);
\J
This is an extreme upper bound, and represents the most efficient
code possible. It cannot be maintained for very long, because many
different sequences of instructions have the same outcome. Very
often, for instance, the order in which a number of instructions is
executed does not matter. Assuming for the moment that this is always
the case, we can calculate the effect on the CP, measured over a
second. The raw power says there are
\.
\C2\!jsup((27 \f2x 10\!jsup(6);));
\J
distinct possible states after a second, or one million instruction
times. If all permutations result in the same outcome, this must
be reduced by a factor of 1000000!. The log of a quotient is
the difference of the logs, so the adjusted CP is
\.
\C27 \f2x \!jexp(10,6); - log\!jsub(2);(\!jexp(10,6);!)\N
= 27 \f2x \!jexp(10,6); - 18.5 \f2x \!jexp(10,6);\N
= 8.5 \f2x \!jexp(10,6); \!jdiv(bit,sec);
\J
The CP is also limited by the total compute energy. If we
ignore external devices, this is simply the total amount of memory,
about 36\f2x2\!jsup(18); = 9.4\f2x10\!jsup(6); bits. The PDP 10
could execute at its maximum effectiveness for 9.4/8.5 = 1.1 seconds
before reaching a state which could have been arrived at more quickly
another way.
Large external storage devices such as disks and tapes can
increase the compute energy indefinitely, because they are a channel
through which independently obtained energy may be injected. On the
other hand, they have only a moderate effect on the power.
A disk channel connected to our KA-10 has a data rate of
6.4\f2x10\!jsup(6); bits/sec. If run at full speed, constantly
stuffing new information into memory, it would add slightly less than
that to the power, because it uses memory cycles the processor might
want. If, as is usually the case, the disk is used for both reading
and writing, the improvement is reduced by a factor of two. Further
reductions occur if the use is intermittent. The overall impact is
less than 10\!jsup(6); bits/sec, about 10% of the power of the raw
processor.\.
\J
Combining the above results, we conclude that the processing
power of a typical major AI center computer is at most 10\!jsup(7);
bits/sec. Time sharing reduces this to about 10\!jsup(6); b/s per
user. Programming in a moderately efficient high level language (vs.
optimal machine code), costs another factor of 10, and running under
an interpreter may result in a per user power of a mere 10,000
bits/sec, if the source code is efficient. These reductions can be
explained in the light of the CP measure by noting that the a
compiler or interpreter causes the executed code to be far more
predictable than optimal code, eg. by producing stereotyped
sequences of instructions for primitive high level constructs.\.
\FDA typical nervous system:\F0
\J
We now consider the processing ability of animal nervous
systems, using humans as an example. Since the data is even more
scanty than what we assumed about the PDP-10, some not unassailable
assumptions need to be made. The first is that the processing power
resides in the neurons and their interconnections, and not in more
compact nucleic acid or other chemical encodings. There is no
currently widely accepted evidence for the latter, while neural
mechanisms for memory and learning are being slowly revealed. A
second is that the neurons are used reasonably efficiently, as
detailed analysis of small nervous systems and small parts of large
ones reveals (and common sense applied to evolution suggests).
Thirdly, that neurons are fairly simple, and their state can be
represented by a binary variable, "firing" or "not firing", which can
change about once per millisecond. Finally we assume that human
nervous systems contain about 40 billion neurons.
Considering the space of all possible interconnections of
these 40 billion (treating this as the search space available to
natural evolution in its unwitting attempt to produce intelligence,
in the same sense that the space of all possible programs is
available to someone trying to create intelligence in a computer), we
note that there is no particular reason why every neuron should not
be able to change state every millisecond. The number of
combinations thus reachable from a given state is
\!jexp(2,(40\f2x10\!jsup(9);)); the binary log of which is
40\f2x10\!jsup(9);. This leads to a compute power of
\.
\C\!jdiv((40 \f2x \!jexp(10,9); bit),(\!jexp(10,-3); sec)); =\N
40 \f2x \!jexp(10,12); \!jdiv(bit,sec);
\J
which is about a million times the maximum power of the KA-10.
Keep in mind that much of this difference is due to the high
level of interpretation in the 10, compared to what we assumed for
the nervous system. Rewiring its gates or transistors for each new
task would greatly increase the CP, but also the programming time.
If the processor is made of 100,000 devices which can change state in
100 ns, the potential CP available through reconfiguration is
10\!jsup(5); bits/10\!jsup(-7); sec = 10\!jsup(12); b/s. The CE
would be unaffected. If automatic design and fabrication methods
result in small quantity integrated circuit manufacture becoming less
expensive and more widely practiced, my calculations may prove overly
pessimistic.
\.
\FDThermodynamic efficiency:\F0
\J
Thermodynamics and information theory provide us with a
minimum amount of energy per bit of information generated at a given
background temperature (the energy required to out shout the thermal
noise). This is approximately the Boltzmann constant,
\.
\C1.38 \f2x \!jexp(10,-16); \!jdiv(erg,deg molecule);
\J
for our purposes the units will be rephrased as erg/(deg bit), since
for normal matter molecules represent the limit of organization, due
to the quantum mechanical fuzziness of things. This measure allows
us to estimate the overall energy efficiency of computing engines.
For instance, we determined the computing rate of the brain, which
operates at 300 degrees K, to be 40\f2x10\!jsup(12); bits/sec. This
corresponds to a physical power of
\.
\C40 \f2x \!jexp(10,12); \!jdiv(bit,sec); \f2x 300 \N
deg \f2x 1.38 \f2x \!jexp(10,-16); \!jdiv(erg,deg bit); = \N
1.66 \!jdiv(erg,sec); = 1.66 \f2x \!jexp(10,-7); watt
\J
The brain runs on approximately 40 watts, so we conclude that
it is 10\!jsup(-8); times as efficient as the physical limits allow.
Doing the same calculation for the KA10, again at 300 deg, we
see that a CP of 8.5\f2x10\!jsup(6); bit/sec is worth
3.519\f2x10\!jsup(-14); watts. Since this machine needs 10
kilowatts the efficiency is only 10\!jsup(-18);. Conceivably a ten
watt, but otherwise equivalent, KA10 could be designed today, if care
were taken to use the best logic for the required speed in every
assembly. The efficiency would then still be only 10\!jsup(-15);.
As noted previously, there is a large cost inherent in the
organization of a general purpose computer. We might investigate the
computing efficiency of the logic gates of which it is constructed
(as was, in fact, done with the brain measure). A standard TTL gate
can change state in about 10ns, and consumes 10\!jsup(-3); watt.
The switching speed corresponds to a CP of 10\!jsup(8); bit/sec, or
a physical power of 4.14\f2x10\!jsup(-13); watt. So the efficiency
is 10\!jsup(-10);, only one hundred times worse than a vertebrate
neuron.\.
\J
The newer semiconductor logic families are even better. C-MOS
is twice as efficient as TTL, and Integrated Injection Logic is 100
times better, putting it on a par with neurons.
Experimental superconducting Josephson junction logic
operates at 4 deg K, switches in 10\!jsup(-11); sec, and uses
10\!jsup(-7); watts per gate. This implies a physical compute power
of 5\f2x10\!jsup(-12); watt, and an efficiency of
5\f2x10\!jsup(-5);, 1000 times better than neurons. At room
temperature it requires a refrigerator that consumes 100 times as
much energy as the logic, to pump the waste heat uphill from 4
degrees to 300. Since the background temperature of the universe is
about 4 degreees, this can probably eventually be done away with.
It is thus likely that there exist ways of interconnecting
gates made with known techniques which would result in behavior
effectively equivalent to that of human nervous systems. Using a
million I\!jsup(2);L gates, or 10 thousand Josephson junction gates,
and a trillion bits of slower bulk storage, all running at full
speed, such assemblies would consume as little as, or less than, the
power needed to operate a brain of the conventional type.
Past performance indicates that the amount of human and
electronic compute power available is inadequate to design such an
assembly within the next few years. The problem is much reduced if
the components used are suitable large subassemblies. Statements of
good high level computer languages are the most effective such
modularizations yet discovered, and are probably the quickest route
to human equivalence, if the necessary raw processing power can be
accessed through them. This section has indicated that a million
times the power of typical existing machines is required. The next
suggests this should be available at reasonable cost in about ten
years.
\.
\FDReferences:\F0
WILLOWS, A.O.D.,
"Giant Brain Cells in Mollusks",
Scientific American, Vol. 224, No. 2,
February 1971, 69-75.
KANDEL, Eric R.,
"Nerve Cells and Behavior",
Scientific American, Vol. 223, No. 1,
July 1970, 57-70.
AGRANOFF, Bernard W.,
"Memory and Protein Synthesis",
Scientific American, Vol. 216, No. 6,
June 1967, 115-122.
KENNEDY, Donald,
"Small Systems of Nerve Cells",
Scientific American, Vol. 216, No. 5,
May 1967, 44-52.
BAKER, Peter F.,
"The Nerve Axon",
Scientific American, Vol. 214, No. 3,
March 1966, 74-82.
HUBEL, David H.,
"The Visual Cortex of the Brain",
Scientific American,
November 1963, 54-62.
WILLIAMS, Peter L., Warwick, Roger
"Functional Neuroanatomy of Man",
W.B. Saunders Company, Philadelphia, 1975.
"PDP-10 Reference Handbook",
Digital Equipment Corporation, Maynard Mass., 1971.
TRIBUS, Myron and McIrvine, Edward C.,
"Energy and Information",
Scientific American, Vol. 224, No. 3,
Setember 1971, 179-188.
GLASSTONE, Samuel, Lewis, David,
"Elements of Physical Chemistry",
D. Van Nostrand Co. Inc., New York, 1960.
MILLER, Richard T.,
"Super Switch", 284, in Science Year annual 1975,
Field Enterprises Educational Corp., 1975.
LANDAUER, Rolf,
"Fundamental Limitations in the Computational Process",
IBM Research Report, Yorktown Heights N.Y., June 1976.
\FBSection 3: The Growth of Processing Power
\F0\J
The references below present, among other things, the
following data points on a price curve:
\.
\F2\άκ=25;\΅\άκ=-7;\ͺ
\άκ.\άνA
\F1Transistor price\F2
.0001\fE$\άκA\άν.\N
.01\fE$\άκA\άν.\N
1\fE$\άκA\άν.\N
\f1$1\άκA\άν.\N
\f1$100
δΗ±±±¦Δ±±±¦Δ±±±¦Δ±±±¦Δ±±±¦Δ±±±¦Δ±±±¦Δ±±±¦Δ±±±¦Δ±±±δΙ\F1 Year\F2
~ ~
΅± O ±ͺ\F1 1950\F2
~ # ~\F1 1951 \f1$100 transistor\F2
~ # ~\F1 1952 transistor hearing aid\F2
~ # ~\F1 1953\F2
~ # ~\F1 1954\F2
΅± # ±ͺ\F1 1955 transistor radios\F2
~ # ~\F1 1956\F2
~ O ~\F1 1957 \f1$10 transistor\F2
~ # ~\F1 1958\F2
~ # ~\F1 1959\F2
΅± O ±ͺ\F1 1960 \f1$1 transistor\F2
~ # ~\F1 1961\F2
~ # ~\F1 1962 \f1$100,000 small computer (IBM 1620)\F2
~ # ~\F1 1963\F2
~ O ~\F1 1964\F2
΅± # ±ͺ\F1 1965 \f1$0.08 transistor (IC)\F2
~ # ~\F1 1966 \f1$1000 4 func calculator\F2
~ # ~\F1 1967 \f1$6000 scientific calc.\F2
~ # ~\F1 1968 \f1$10,000 small computer (PDP 8)\F2
~ # ~\F1 1969\F2
΅± O ±ͺ\F1 1970 \f1$200 4 func calculator\F2
~ # ~\F1 1971\F2
~ # ~\F1 1972 1K RAMS (1 \fE$/bit)\F2
~ # ~\F1 1973\F2
~ # ~\F1 1974 \f1$1000 small computer (PDP 11)\F2
΅± O ±ͺ\F1 1975 4K RAMS (.1 \fE$/bit)\F2
~ # ~\F1 1976 \f1$5 4 func calc (.05 \fE$/trans)\F2
~ ~\F1 1977\F2
~ ~\F1 1978\F2
~ ~\F1 1979\F2
%±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±ΰΔ±±±$
\άκ=34;\΅\άκ=0;\ͺ
\J\F0
The numbers indicate a remarkably stable evolution. The price
per electronic function has declined by a steady factor of ten every
five years, if speed and reliability gains are included.
Occasionally there is a more precipitous drop, when a price threshold
which opens a mass market is reached. This makes for high
incentives, stiff price competition and mass production economies.
It happened in the early sixties with transistor radios, and is going
on now for pocket calculators and digital wristwatches. It is
begining for microcomputers, as these are incorporated into consumer
products such as stoves, washing machines, televisions and sewing
machines, and soon cars. During such periods the price can plummet by
a factor of 100 in a five year period. Since the range of
application for cheap processors is larger than for radios and
calculators, the explosion will be more pronounced.
The pace of these gains is in no danger of slackening in the
forseeable future. In the next decade the current period may seem to
be merely the flat portion of an exponential rise. On the immediate
horizon are the new semiconductor techniques, I\!jsup(2);L, and super
fast D-MOS, CCD for large sensors and fast bulk memory, and magnetic
bubbles for mass storage. The new 16K RAM designs use a folded
(thicker) cell structure to reduce the area required per bit, which
can be interpreted as the first step towards 3 dimensional
integration, which could vastly increase the density of circuitry.
The use of V-MOS, an IC technique that vertically stacks the elements
of a MOS transistor is expanding. In the same direction, electron
beam and X-ray lithography will permit smaller circuit elements.
In the longer run we have ultra fast and efficient Josephson
junction logic, of which small IC's exist in an IBM lab, optical
communication techniques, currently being incorporated into
intermediate distance telephone links, and other things now just
gleams in the eye of some fledgling physicist or engineer.\.
\J
My favorite fantasies include the "electronics" of
super-dense matter, either made of muonic atoms, where the electrons
are replaced by more massive negative particles or of atoms
constructed of magnetic monopoles which (if they exist) are very
massive and affect each other more strongly than electric charges.
The electronics and chemistry of such matter, where the "electron"
orbitals are extremely close to the nucleus, would be more energetic,
and circuitry built of it should be astronomically faster and
smaller, and probably hotter. Mechanically it should exhibit higher
strength to weight ratios. The critical superconducting transition
field strengths and temperatures would be higher. For monopoles
there is the possibility of combination magnetic electric circuitry
which can contain, among many other goodies, DC transformers, where
an electric current induces a monopole current at right angles to it,
which in turn induces another electric current. One might also
imagine quantum DC transformers, matter composed of a chainlike mesh
of alternating orbiting electric and magnetic charges.
I interpret these things to mean that the cost of computing
will fall by a factor of 100 during the next 5 years, as a
consequence of the processor explosion, and by the usual factor of 10
in the 5 years after that. As an approximation to what is available
today, note that in large quantities an LSI-11 sells for under $500.
This provides a moderately fast 16 bit processor with 4K of memory.
Another $500 could buy an additional 32K of memory, if we bought in
quantity. The result would be a respectable machine, somewhat less
powerful than the KA-10, for $1000. At the crude level of
approximation employed in the previous section, a million machines
of this type should permit human equivalence. A million dollars would
provide a thousand of them today (a much better buy, in terms of raw
processing power, than a million dollar large processor). In ten
years a million dollars should provide the equivalent of a million
such machines, in the form of a smaller number of faster processors,
putting human equivalence within reach.
We now have the problem of what to do with this roomful of
isolated computers. The next section describes what seems to be the
best known solution to the problem of interconnecting a large number
of processors with maximum generality, at reasonable cost.
\.
\FDReferences:\F0
McWHORTER, Eugene W.
"The Small Electronic Calculator",
Scientific American, Vol. 234, No. 3,
March 1976, 88-98.
HODGES, David A.,
"Trends in Computer Hardware Technology",
Computer Design, Vol. 15, No. 2,
February 1976, 77-85.
SCRUPSKI, Stephen E., et al.,
"Technology Update",
Electronics, Vol. 48, No. 21, McGraw-Hill,
October 16, 1975, 74-127.
VACROUX, Andre G.
"Microcomputers",
Scientific American, Vol. 232, No. 5,
May 1975, 32-40.
TURN, Rien
"Computers in the 1980's",
Columbia University Press,
Rand Corporation, 1974.
TIEN, P. K.
"Integrated Optics",
Scientific American, Vol. 230, No. 4,
April 1974, 28-35.
HITTINGER, William C.
"Metal-Oxide-Semiconductor Technology",
Scientific American, Vol. 229, No. 2,
August 1973, 48-57.
BOBECK, Andrew H. and Scovil, H. E. D.
"Magnetic Bubbles",
Scientific American, Vol. 224, No. 6,
June 1971, 78-90.
HEATH, F. G.
"Large-Scale Integration in Electronics",
Scientific American, Vol. 222, No. 3,
February 1970, 22-31.
RAJCHMAN, Jan A.
"Integrated Computer Memories",
Scientific American, Vol. 217, No. 1,
July 1967, 18-31.
HITTINGER, William C. and Sparks, Morgan
"Microelectronics",
Scientific American, Vol. 213, No. 5,
November 1965, 56-70.
\FBSection 4: Interconnecting Processors
\F0\J
The highest bandwidth and most flexible way for a number of
computers to interact is by shared memory. Systems of the size
considered here would require a large, but not unreasonable, address
space of 100 billion words (40 bits of address). They would also
demand memories with a thousand to a million ports. Although a
variant of the method below could be used to construct such monsters,
they would cost much more than the processors they served.
Alternatively, we might consider how a mass of the usual kind
of machine, each with a moderate amount of local memory, could
communicate through their bandwidth limited IO busses. Define a
message period as the minimum interval in which a processor may
transmit and receive one message. Assuming that messages are
addressed to particular destination machines, an ideal
interconnection system would allow every processor to issue a message
during each message period, and deliver the highest priority message
for each processor to it, providing notices of success or failure to
the originators. The delay between transmission and reception of a
message would be uniform and short, and the cost of the network would
be reasonable (on the order of the cost of the processors served).
It so happens that this specification can be met. The
following construction is a design due to K.E. Batcher [Batcher],
extended considerably to permit pipelining of messages and higher
speed.
\.
\FDLog\!jsup(2); sorting net construction:\F0
\J
Batcher's major contribution is a method for constructing
nets that can sort N numbers in log(N)\!jsup(2); time, using only
N\f2xlog\!jsup(2); N primitive two element sorters. Two nets plus a
little additional circuitry can accomplish the message routing task.
The construction consists of a method for making a merger for
two ordered lists of 2N numbers (call it a 2N-merger) from two
N-mergers and 2N additional primitive elements. A primitive sorting
element is a 1-merger.
An N-merger can thus be constructed from N\f2xlog\!jsub(2); N
primitive elements. A sorter is made of a cascade of ever larger
mergers, forming length two ordered lists from individual inputs with
1-mergers, combining these pairwise into length 4 lists with
2-mergers, and so on, until the entire sorted list of N numbers comes
out of a final N/2-merger. The number of primitive elements involved
is N\f2xlog\!jsub(2);N\f2x(log\!jsub(2);N+1)/4.
\.
\F2\άκ=25;\΅
Primitive
sorting
elements
δΗ±±±±±±±±δΙ
A1 ±±±±±±±±±±±±±±ͺ ΅±±±±±±±±±±±±±±±±±±±±±±±±±
δΗ±±±±±ͺ ~ δΗ±±±δΙ
A2 ±±±±±±δΙ ~δΗ±±±±ͺ 3 ΅±±±±±±±±±±±±±±±ͺ ΅±±±±±
~ ~~ ~ ~ δΗ±±±±±±±±±±±±ͺ ΅±±±±±
A3 ±±±±±±Ύ±$~ %±±δΙ M ΅±±Ύ±±±±±±±±±±δΙ %±±±$
~ ~ ~ E ~ ~ ~
A4 ±±±±±δΙ~ ~ δΗ±±$ R ΅±±Ύ±±±±±±±±±δΙ~
~~ ~δΗ±±±ͺ G ~ ~ ~~ δΗ±±±δΙ
A5 ±±±±±ΎΎ±±$~δΗ±±ͺ E ΅±±Ύ±±±±±±±±δΙ~%±ͺ ΅±±±±±
I ~~ ~~δΗ±ͺ R ~ ~δΗ±±±±±±±ΎΎ±±ͺ ΅±±±±± O
A6 ±±±±δΙ~~ ~~~ ~ ΅±±ΎΎ±±±±±±δΙ~~ %±±±$
N ~~~ ~~~ %±±±±±±±±$ ~~ ~~~ U
~~~ ~~~ ~~ ~~~
P ~~~ ~~~ ~~ ~~~ δΗ±±±δΙ T
~~~ ~~~ ~~ ~~%±±ͺ ΅±±±±±
U ~~~ ~~~ ~~δΗ±±±±±ΎΎ±±±ͺ ΅±±±±± P
~~~ ~~~ ~~~ ~~ %±±±$
T B1 ±±±±ΎΎΎ±±±$~~ δΗ±±±±±±±±δΙ ~~~ ~~ U
~~%±±±±ΎΎ±ͺ ΅±±$~~ ~~
S ~%±±±±±ΎΎ±ͺ ~ ~~ ~~ δΗ±±±δΙ T
B2 ±±±δΙ%±±±±±±ΎΎ±ͺ 3 ΅±±±$~ ~%±±±ͺ ΅±±±±±
~ ~~ ~ ~ ~δΗ±±±±Ύ±±±±ͺ ΅±±±±±
B3 ±±±Ύ±±±±±±±$~ %±±δΙ M ΅±±±±$~ ~ %±±±$
~ ~ ~ E ~ ~ ~
B4 ±±δΙ~ ~ δΗ±±$ R ΅±±±±±$ ~ δΗ±±±δΙ
~%±±±±±±±±Ύ±ͺ G ~ %±±±±ͺ ΅±±±±±
B5 ±±Ύ±±±±±±±±±$ ~ E ΅±±±±±±±±±±±±±±±ͺ ΅±±±±±
%±±±±±±±±±±±ͺ R ~ %±±±$
B6 ±±±±±±±±±±±±±±ͺ ΅±±±±±±±±±±±±±±±±±±±±±±±±±
%±±±±±±±±$
\άκ=34;\΅
\JFIG 1: \F0An illustration of Batcher's odd-even merger construction,
producing a 6-merger from two 3-mergers and five primitive
elements. The first and last outputs undergo less delay than the
intermediate ones, which is undesirable. The delays can be
equalized by passing the first and last outputs through an extra
primitive element. Batcher provides an alternate construction,
which he calls a "bitonic sorter" which merges, has constant
delay, and uses the same number of elements as the above
construction with the extra element added.\F2\.
\FDCommunication scheme organization:\F0
\F2\άκ=25;\΅
δΗ±±±±±±±±±δΙ δΗ±±±±±±±δΙ δΗ±±±δΙ δΗ±±±±±±±±±±δΙ
1 ±άν~ ~±±άν~ ~±±άν~ ~±±άν~ ~±άν 1
~ ~ ~ ~ ~ ~ ~ ~
2 ±άν~ ~±±άν~ ~±±άν~ ~±±άν~ ~±άν 2
Processor ~ N ~ ~ ~ ~ ~ ~ ~ Processor
3 ±άν~ ~±±άν~ ~±±άν~ ~±±άν~ ~±άν 3
Outgoing ~ Input ~ ~ ~ ~ ~ ~ ~ Acknowledge±
~ ~ ~ ~ ~ ~ ~ ~ ment
Message . ~ Sorter ~ . ~ ~ . ~ ~ . ~ ~ .
. ~ ~ . ~ ~ . ~ ~ . ~ ~ . Input
Ports . ~ (Desti± ~ . ~ ~ . ~ ~ . ~ 2N ~ .
~ nation ~ ~ ~ ~ ~ ~ ~ Ports
~ field) ~ ~ ~ ~ E ~ ~ Input ~
~ ~ ~ N ~ ~ x ~ ~ ~
N ±άν~ ~±±άν~ ~±±άν~ c ~±±άν~ Sorter ~±άν N
%±±±±±±±±±$ %±δΙ M ~ ~ h ~ ~ ~
~ e ~ ~ a ~ ~ (Source ~ ±±±±±±±±±±±±±±
δΗ±$ r ~ ~ n ~ ~ field) ~
1 ±±άν~ g ~±±άν~ g ~±±άν~ ~±άν 1
~ e ~ ~ e ~ ~ ~
2 ±±άν~ r ~±±άν~ r ~±±άν~ ~±άν 2
~ ~ ~ ~ ~ ~ Processor
Dummy 3 ±±άν~ ~±±άν~ ~±±άν~ ~±άν 3
~ ~ ~ ~ ~ ~ Incoming
Message ~ ~ ~ ~ ~ ~
. ~ ~ . ~ ~ . ~ ~ . Message
Injection . ~ ~ . ~ ~ . ~ ~ .
. ~ ~ . ~ ~ . ~ ~ . Ports
~ ~ ~ ~ ~ ~
~ ~ ~ ~ ~ ~
N ±±άν~ ~±±άν~ ~±±άν~ ~±άν N
%±±±±±±±$ %±±±$ %±±±±±±±±±±$
\άκ=34;\΅
\JFIG 2: \F0Block diagram of the multiprocessor interconnection scheme.
The cost per processor, and the delay in the net, grows as the
square of the log of N, the number of processors.\.
\F2\άκ=25;\΅
δΗ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±¦Δ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±δΙ
~ Data ~ Source Address ~ 0 ~ Priority ~ Destination Ad ~
%±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±$
FIG 3: \F0Processor message format. High order bit is on the right.\F2
δΗ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±¦Δ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±δΙ
~ Unused ~ Position Number ~ 1 ~ Zero ~ Position Number ~
%±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±$
FIG 4: \F0Dummy message format.\F2
\άκ=34;\΅
\F0\J
The interconnection scheme is diagrammed in Figs 2, 3 and 4.
Each processor is assigned a number, its "address", as indicated. In
the sorters and the merger the smaller numbers come out towards the
top of the diagram.
Messages are serial bit streams, and consist of a destination
processor address, a priority number (invented by the originating
computer), a one bit "dummy" flag field (set to 0 for actual
messages), the address of the source processor (i.e. a return
address), and the data to be communicated. A low priority number
implies high priority. Zero is the highest priority.
The net is assumed to run at 100% duty cycle, with the
processors emitting successive synchronized waves of messages. Every
processor emits a message every message interval. The following
discussion examines a single message wave.
The first sorting net orders the messages by destination
address, and within a given destination by priority number. Thus the
upper inputs of the merger receive a list of messages, grouped by
destination, with the highest priority message to each processor
heading its group.
The lower inputs of the merger receive N dummy messages,
exactly one for each destination processor. The priority field is
the highest possible (i.e. zero), the dummy flag is 1, the source
address is the same as the destination, and the data portion is
unused.
Merging these dummies with the sorted list of real messages
results in a list still grouped by destination, with each group
headed by a dummy, by virtue of its high priority, followed
immediately by the highest priority real message, if any, for the
destination.
This list is fed to the exchange network, which examines
adjacent pairs of messages (considering overlapping pairs), and
exchanges the data portions of a pair if the first member happens to
be a dummy to a given address, and the second is a real message to
the same address (i.e. it is the highest priority real message to
that destination).
The sorting network following the exchanger sorts the
messages by the field begining with the dummy flag, which acts as the
high order bit, followed by the source address. Since there were N
real messages, one from each processor, and N dummies, also nominally
one from each processor, and since real messages are sorted ahead of
dummy messages due to the high order bit, (i.e. the dummy flag) being
0, the second sorter restores the messages into the same order in
which they were introduced.\.
\J
Each processor has two input ports, one labeled
"acknowledgement", and the other "incoming message". The
acknowledgement input of processor i is connected to output i of the
second sorter. The incoming message input is connected to output N+i
(i.e. the i'th output of the lower half).
In the absence of the exchange network, the i'th processor
would receive its own message back on its acknowledgement input, and
the i'th dummy message on the incoming message input.
Because of the exchanger, however, if the message that
processor i had sent happened to be the highest priority message to
the requested destination, then the data portion of the message on
the acknowledgement input would be that of the dummy it had been
swapped with (signaling success). Also, if any messages had been
addressed to processor i, the data portion of the highest priority
one would arrive on the incoming message port, in place of the dummy
message.
Thus a processor receives the highest priority message
addressed to it on its incoming message port, or a dummy if nobody
wanted to talk to it. It receives a dummy on its acknowledgement port
if its message has gotten through, or the message back if it hasn't,
due to the existence of a higher priority message to the same
destination.
Actually the serial nature of the sorter causes the
destination and priority field to be lost in the source address
sorter (it tails after the previous wave of messages). In the case of
messages that fail to get delivered, this means that the originating
processor must remember to whom it sent the message (about four
message times ago in a typical design, due to the latency of the
net), if it wants to try again. This is probably undesirable. Also,
delivered messages contain no indication of who sent them, having had
their source address field exchanged with that of a dummy, unless the
source address is included in the data field.
These shortcomings can be overcome if the exchanger shuffles
the destination address, source address and priority fields in the
manner suggested by Figs 5, 6 and 7. Such shuffling can be
accomplished with an amount of storage at each exchanger position
equal to the number of bits in the destination and priority fields.
\.
\F2\άκ=25;\΅
\F0Before:\F2
δΗ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±¦Δ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±δΙ
~ Unused ~ Destination Ad ~ 1 ~ Zero ~ Destination Ad ~
%±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±$
δΗ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±¦Δ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±δΙ
~ Data ~ Source Address ~ 0 ~ Priority ~ Destination Ad ~
%±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±$
\F0After:\F2
δΗ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±δΙ
~ Priority ~ Source Address ~ Data ~ Destination Ad ~ 1 ~
%±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±$
δΗ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±δΙ
~ Zero ~ Destination Ad ~ Unused ~ Source Address ~ 0 ~
%±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±$
FIG 5: \F0Rearrangements effected by the exchanger in an exchanged pair.\F2
\F0Before:\F2
δΗ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±¦Δ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±δΙ
~ Data ~ Source Address ~ 0 ~ Priority ~ Destination Ad ~
%±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±$
\F0After:\F2
δΗ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±δΙ
~ Priority ~ Destination Ad ~ Data ~ Source Address ~ 0 ~
%±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±$
FIG 6: \F0Rearrangements in an unsuccessful message.\F2
\F0Before:\F2
δΗ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±¦Δ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±δΙ
~ Unused ~ Destination Ad ~ 1 ~ Zero ~ Destination Ad ~
%±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±$
\F0After:\F2
δΗ±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±¦Δ±±±δΙ
~ Zero ~ Destination Ad ~ Unused ~ Destination Ad ~ 1 ~
%±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±$
FIG 7: \F0Rearrangements in an isolated dummy message.\F2
\άκ=34;\΅
\FDPackage counts:\F0
\J
If the numbers to be sorted are sent into such a net as
serial bit streams, high order bit first, then a primitive sorting
element has two output wires labelled "smaller" and "larger", two
input wires and a reset and a clock line, and works as follows:
Just before the first bit time the element is reset. Bits
then stream into the input terminals, and simply stream out of the
output terminals until the first bit position in which the two inputs
differ comes along. At that instant, the input with the 0 is
connected to the "smaller" output, and the one with the 1 is
connected to "larger". This interconnection is latched by the element
and all subsequent bits stream from the inputs to the outputs on the
basis of it, until the next reset.
Such a unit can be built with approximately 20 gates, and
introduces one bit time of delay. Careful design should permit an ECL
version with a 100 or 200 MHz bit rate. These could be packed into 48
(say) pin LSI packages, 8 independent elements per package (the clock
and reset lines are common), 16 per package, configured into 4
2-mergers, 24 per package, arranged into two 4-mergers, and 32 per
package containing a single 8-merger.
Larger mergers can be constructed out of these using an
extension of the bitonic sorter strategies given in [Batcher],
resulting in total package counts (and partial eight merger, four,
two and single element package counts) shown in Fig. 8 (to
re-iterate, a merger size of N refers to one which takes two lists of
size N and produces a list of size 2N).
\.
\F2\άκ=25;\΅
δΗ±±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±δΙ
~ Merger size ~ Package counts ~
~ ~ Total Eights Fours Twos Ones ~
΅±±±±±±±±±±±±±±±Ύ±±±±±±±±±±±±±±Ύ±±±±±±±±±Ύ±±±±±±±±Ύ±±±±±±±±Ύ±±±±±±±±±ͺ
~ ~ ~ ~ ~ ~ ~
~ 1 ~ 1/8 ~ ~ ~ ~ 1/8 ~
~ 2 ~ 1/4 ~ ~ ~ 1/4 ~ ~
~ 4 ~ 1/2 ~ ~ 1/2 ~ ~ ~
~ 8 ~ 1 ~ 1 ~ ~ ~ ~
~ 16 ~ 4 ~ 2 ~ ~ ~ 2 ~
~ 32 ~ 8 ~ 4 ~ ~ 4 ~ ~
~ 64 ~ 16 ~ 8 ~ 8 ~ ~ ~
~ 128 ~ 32 ~ 32 ~ ~ ~ ~
~ 256 ~ 96 ~ 64 ~ ~ ~ 32 ~
~ 512 ~ 192 ~ 128 ~ ~ 64 ~ ~
~ 1,024 ~ 384 ~ 256 ~ 128 ~ ~ ~
΅±±±±±±±±±±±±±±±Ύ±±±±±±±±±±±±±±Ύ±±±±±±±±±Ύ±±±±±±±±Ύ±±±±±±±±Ύ±±±±±±±±±ͺ
~ 2,048 ~ 768 ~ 768 ~ ~ ~ ~
~ 4,096 ~ 2,048 ~ 1536 ~ ~ ~ 512 ~
~ 8,192 ~ 4,096 ~ 3072 ~ ~ 1024 ~ ~
~ 16,384 ~ 8,192 ~ 6144 ~ 2048 ~ ~ ~
~ 32,768 ~ 16,384 ~ 16384 ~ ~ ~ ~
~ 65,536 ~ 40,960 ~ 32768 ~ ~ ~ 8192 ~
~ 131,072 ~ 81,920 ~ 65536 ~ ~ 16384 ~ ~
~ 262,144 ~ 163,840 ~ 131072 ~ 32768 ~ ~ ~
~ 524,288 ~ 327,680 ~ 327680 ~ ~ ~ ~
~ 1,048,576 ~ 786,432 ~ 655360 ~ ~ ~ 131072 ~
~ ~ ~ ~ ~ ~ ~
%±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±ΰΔ±±±±±±±±±ΰΔ±±±±±±±±ΰΔ±±±±±±±±ΰΔ±±±±±±±±±$
FIG 8: \F0Package counts for mergers\F2
\άκ=34;\΅
\F0\J
These counts can now be used to calculate the number of
packages required to build sorters of various sizes:
\.
\F2\άκ=25;\΅
δΗ±±±±±±±±±±±±±±¦Δ±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±±δΙ
~ Sorter size ~ Package counts ~
~ ~ Total Eights Fours Twos Ones ~
΅±±±±±±±±±±±±±±Ύ±±±±±±±±±±±±±±±±±Ύ±±±±±±±±±±Ύ±±±±±±±±±Ύ±±±±±±±±±Ύ±±±±±±±±±±ͺ
~ ~ ~ ~ ~ ~ ~
~ 2 ~ 1/8 ~ ~ ~ ~ 1/8 ~
~ 4 ~ 1/2 ~ ~ ~ 1/4 ~ 1/4 ~
~ 8 ~ 3/2 ~ ~ 1/2 ~ 1/2 ~ 1/2 ~
~ 16 ~ 4 ~ 1 ~ 1 ~ 1 ~ 1 ~
~ 32 ~ 12 ~ 4 ~ 2 ~ 2 ~ 4 ~
~ 64 ~ 32 ~ 12 ~ 4 ~ 8 ~ 8 ~
~ 128 ~ 80 ~ 32 ~ 16 ~ 16 ~ 16 ~
~ 256 ~ 192 ~ 96 ~ 32 ~ 32 ~ 32 ~
~ 512 ~ 480 ~ 256 ~ 64 ~ 64 ~ 96 ~
~ 1,024 ~ 1,152 ~ 640 ~ 128 ~ 192 ~ 192 ~
~ 2,048 ~ 2,688 ~ 1536 ~ 384 ~ 384 ~ 384 ~
΅±±±±±±±±±±±±±±Ύ±±±±±±±±±±±±±±±±±Ύ±±±±±±±±±±Ύ±±±±±±±±±Ύ±±±±±±±±±Ύ±±±±±±±±±±ͺ
~ 4,096 ~ 6,144 ~ 3840 ~ 768 ~ 768 ~ 768 ~
~ 8,192 ~ 14,336 ~ 9216 ~ 1536 ~ 1536 ~ 2048 ~
~ 16,384 ~ 32,768 ~ 21504 ~ 3072 ~ 4096 ~ 4096 ~
~ 32,768 ~ 73,728 ~ 49152 ~ 8192 ~ 8192 ~ 8192 ~
~ 65,536 ~ 163,840 ~ 114688 ~ 16384 ~ 16384 ~ 16384 ~
~ 131,072 ~ 368,640 ~ 262144 ~ 32768 ~ 32768 ~ 40960 ~
~ 262,144 ~ 819,200 ~ 589824 ~ 65536 ~ 81920 ~ 81920 ~
~ 524,288 ~ 1,802,240 ~ 1310720 ~ 163840 ~ 163840 ~ 163840 ~
~ 1,048,576 ~ 3,932,160 ~ 2949120 ~ 327680 ~ 327680 ~ 327680 ~
~ 2,097,152 ~ 8,650,752 ~ 6553600 ~ 655360 ~ 655360 ~ 786432 ~
~ ~ ~ ~ ~ ~ ~
%±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±±±±±±±±ΰΔ±±±±±±±±±±ΰΔ±±±±±±±±±ΰΔ±±±±±±±±±ΰΔ±±±±±±±±±±$
FIG 9: \F0Package counts for sorters\F2
\άκ=34;\΅
\F0\J
An interconnection net for N processors involves an N sorter,
an N merger and a 2N sorter. Thus a 128 processor system would
require 304 48 pin sorter packages, 2.3 for each processor. A 1024
processor needs 4224, roughly four per processor. A size 16,384
system needs 7 for each computer. A million processor would have
12.75 per machine.
It is likely that the biggest versions of this system will
require denser packaging. Remember, though, that a thousand
processor system is sufficient for human equivalence if each machine
is big and fast enough. It will probably not be necessary to build a
megaprocessor in the decade envisioned here.\.
\FDSpeed calculations:\F0
\J
As outlined, a message consists of a destination address, a
priority, a bit, a source address and a data portion. The two
addresses must be at least large enough to uniquely specify each
machine. Considering the case of a thousand and a million machine
system, we note that the address lengths are 10 and 20 bits. Let's
make the priority field the same length, leaving room for
considerable flexibility in priority assignment schemes. The message
portion should be fairly long, to permit messages like memory write
requests, which require both a memory address and the data. Four
address lengths, say. This gives us a message length 7 addresses
long, 70 bits in a 1000 processor, 140 bits in a megaprocessor.
The full time taken by a message from start of transmission
to completion of reception, in bit times, is the message length, plus
the depth of the net (in primitive elements), plus an address and a
priority time due to the buffering at the exchanger.
The depth of a 1024 processor net is 110 elements. This
combines with the message and exchanger delays to result in a transit
time of 110+70+20 = 200 bit times. If the bit rate is 100 MHz then
messages are delivered in two microseconds. If the bit rate is 200
MHz, the time is 1 microsecond. The net contains about two full
message waves at any instant, and a message time is 700 nanoseconds
for 100 MHz, or 350 nanoseconds for 200 MHz.
The same calculation for a megaprocessor reveals a depth of
420 and a total transit time of 420+140+40 = 600 bit times. This
corresponds to 6 and 3 microsecond transits for 100 and 200 MHz data
rates. Corresponding message times are 1.4 microseconds and 700
nanoseconds. The net contains slightly more than 3 message waves at
a time.\.
\FDPossible refinements:\F0
\J
The transit time of the net can be decreased, at the cost of
increasing the message times a little, by running some of the
primitive sorter elements asynchronously, clocking (and introducing
bit time delays) at larger intervals. For instance, an 8-merger
might accept a bit time worth of inputs, which would then trickle
through four stages of primitive sorters built without shift register
delays between them. When everything had settled down, the entire
merger would be clocked, and those elements which had decided to
latch at this bit time would do so. The delay of the unit is one bit
time rather than four, which reduces the 110 or the 420 term in the
transit calculations. The settling time is longer, however, so the
bit rate would be slower. Perhaps a factor of two in transit time can
be gained in this way, at a cost of 1.5 in data rate.
At an increase in gate count, several levels of asynchronous
primitive sorters can be replaced by an equivalent circuit with fewer
gate delays. This might enable a given data rate to be maintained
while the transit time was reduced.
[Van Voorhis] offers some slight reductions in primitive
sorter count, essentially by substituting special case solutions
better than the systematic construction wherever possible in a large
net. Unfortunately these invariably have an uneven amount of delay
along the various paths, making them almost worthless as synchronous
nets. They may be useful as designs for asynchronous subnets.
I have generalizations of Batcher's constructions using
primitive elements that are M sorters (as opposed to 2 sorters).
These allow building a merger which combines M sorted lists of size
N\f2xM into a single sorted M\f2xM\f2xN length list, out of M mergers
each capable of merging M size N lists, and some layers of extra M
sorter primitive elements to combine the output of the mergers (these
are analogous to the single layer of 2-sorters following the mergers
in the odd-even merger of Fig 1. The number of such layers grows
(empirically) roughly as log(M)). Although the number of gates
needed for a given size sorter grows slowly with M, the number
packages used shrinks. This is because each package, being a sorter
rather than a merger, contains more logic. My constructions are
complete for Mβ§8, and partially complete for general M.
\.
\J
Well, now we have a room full of not only processors, but a
massive switching system as well. Can it be made to pay for its keep?
The next section examines some programming implications and
opportunities.\.
\FDReferences:\F0
BATCHER, K.E.
"Sorting Networks and their Applications",
1968 Spring Joint Computer Conference Proceedings
April 1968, 307-314.
VAN VOORHIS, David C.
"An Economical Construction for Sorting Networks",
1974 National Computer Conference Proceedings
April 1974, 921-927.
KNUTH, D.E.
"Sorting and Searching",
The Art of Computer Programming, Vol. 3
Addison-Wesley, 1973.
\FBSection 5: Programming Interconnected Processors
\F0\J
A major feature of the scheme outlined is its flexibility.
It can function as any of the fixed interconnection patterns of the
current lackluster multiprocessors, like Illiac IV, or as a hexagonal
mesh, or a 7 dimensional cubic lattice, should that be desired, or
the tree organization being considered in a Stanford proposal. It can
simulate programmed pipeline machines, such as CDC Star, by using
processors as arithmetic units. What is more, it can do all of these
things simultaneously, since messages within one isolated subset of
the processors have no effect on messages in a disjoint subset. This
permits a very convenient kind of "time" sharing, where individual
users get and return processors as their demands change.
Such mimicry fails to take advantage of the ability to
reconfigure the interconnection totally every message wave. It is
easy to find particular applications, such as tree searching, where
the net could be used effectively by special purpose programs. It is
clearly desirable to have systems which minimize the user effort
needed to implement these algorithms.
As an example of a high level language well suited to the
architecture, consider the following parallel variant of Lisp:\.
\FDA little Lisp:\F0
\J
Take an existing Lisp and purify it to something closer to
its lambda calculus foundations. Flush RPLACA and RPLACD and even
SETQ (a pseudo SETQ can be introduced later), and discourage PROGs,
which are inherently sequential and do things that could have been
stated more clearly as recursive functions. In this system recursive
programs are also more efficient.
An evaluation is handled as follows. A free processor (one
not currently doing an evaluation) receives a message demanding the
value of Expr with variable bindings given in ALIST. Call this the
task [Expr,ALIST]. If the top level of Expr is F(exprA,exprB,exprC),
it generates the tasks [exprA,ALIST], [exprB,ALIST] and
[exprC,ALIST], and passes them to three other free processors. These
evaluate them and sooner or later return valA (the value of exprA),
valB and valC to the original processor. This processor then refers
to the definition of F, and in particular to the names of the dummy
arguments in the definition. Suppose these were A, B and C. It
combines the ALIST it was given with the (name, value) pairs (A,
valA), (B, valB), (C, valC) to form a new ALIST'. Then it generates
the task [bodyF,ALIST'], where bodyF is the body of the definition of
F, which is passed to another free processor. On receiving the value
of this expression, it passes it back to whoever had given it the
original task, and then declares itself free.
If Expr had happened to be an atom, the processor would
simply have looked it up in ALIST, and returned its value. If F had
been a predefined system function (such as CAR or CONS) the sequence
of actions would have been whatever the machine code for those
functions (of which each processor would have a copy) specified.
The parallelism comes from the fact that the arguments in a
multi-argument function can be evaluated simultaneously. This causes
moderate parallelism when functions are nested, and can cause
explosive parallelism when a function which at some level uses a
multi-argument function, invokes itself recursively (as in a tree
search). Most non-trivial programs stated as recursive functions do
this.
The description above implies that the processor waits for
the results of tasks it has farmed out to other machines. These
waits can be arbitrarily long. Also, the number of tasks spawned can
easily become more than the number of processors. In that case,
presumably, a processor with a task to farm out would have to wait
until a machine becomes free. Keeping a processor idle under these
conditions is clearly undesirable.\.
\J
If each processor were time shared, pretending to be many
machines, then when the job being run becomes temporarily idle there
may be another to switch to. When a message pertaining to a given
waiting job arrives, the processor deals with it, and if it provides
the information necessary to allow that job to resume, it is resumed.
This scheme makes the number of processors available seem larger,
perhaps enough to make it possible to acquire a processor for a new
task simply by picking a machine at random and asking if it has a
free job slot. This works well if the answer is usually yes, (i.e.
if there reasonably more slots than tasks), and replaces a more
complicated free processor pool method that must be able to deal with
many requests simultaneously.
Alternatively instead of each processor having its own little
pool of running jobs, the whole Lisp system can maintain a communal
pool, which processors refer to as they become free. In this
organization a task that must pause is placed into the pool (freeing
the processor that was running it), to be taken up again by a free
processor when its requirements are met. Moderate processing power
is required to manage the pool.
The list structure is spread out among all the processors in
the system. Pointers consist of a processor number and address within
processor field. A machine evaluating an expression whose top level
function is CONS creates the new cell in its own memory. A CAR or CDR
involves sending a message to the processor which owns the cell
involved, and waiting for a reply (thus a CONS is usually cheaper
than a CAR!).
Since RPLACA and RPLACD are eliminated, circular lists
cannot be created. This permits garbage collection to be mediated by
reference counts. A small fixed reference count field (three bits
wide, say) is part of each cell. A processor doing a CONS sends
messages to the owners of the cells that the new cell will be
pointing to, indicating that their reference count should be
incremented. If a cell getting a message of this type has a reference
count that has already reached maximum (7 in our example), it sends
back an indignant message to the CONS'ing processor saying,
effectively, "I'm full, you can't point to me. But here are my CAR
and my CDR, roll your own and point to that". This not only makes the
reference count field size manageable, but reduces message traffic
jams to processors containing very popular cells. It does make it
mandatory to use EQUAL rather than EQ when comparing lists. When a
processor no longer requires a cell to which it had a pointer it
sends a message to the owner saying so. The reference count of the
cell is decreased by one, and if it reaches zero, it is freed, and
similar messages are sent to the cells pointed to by its CAR and CDR.
This garbage collection process goes on continuously and
independently of the main computations.\.
\J
How is the A-list of bindings to be handled? The canonical
representation, using a simple list of dotted pairs, is very
inefficient, since it forces a sequential search. A hash table, as in
most current Lisp implementations is also undesirable, partly because
it requires an incompatible data type (a relatively large contiguous
block of memory), but primarily because in this parallel system there
are many different contexts (in different branches of the
computation) active at one time. The entire hash table would need to
be duplicated whenever new bindings are made (i.e. at every level of
evaluation), since the older context is in use in another branch.
A nice alternative is to use essentially an A-list, but to
structure it as a balanced tree, with each node containing a binding,
a left subtree for those variables with names lexicographically (say)
less than it, and a right subtree for those greater. An entry in
such a structure can be found in time proportional to the logarithm
of the number of nodes, and a version of the list with an entry
added, deleted or modified can also be constructed in log time
without affecting the original in any way, by building a new path
down to the element (and sometimes a few side paths, to keep the tree
reasonably balanced), which points to many of the subtrees of the
original. Since this structure can be built of standard list
elements, it can be managed via the normal allocation and garbage
collection mechanisms. Call this structure an A-tree.
In general applications a balanced tree is capable of being
used in this system more effectively than a simple list, because a
parallel process can get to all the nodes in log time, whereas the
last node takes linear time in a linear list. This suggests that
programmers wanting maximum efficiency will be coerced into using
them much more frequently than is now the case. This is similar to
the pressure on present day Lisp programmers to use PROGS rather than
recursive functions, because compiled PROGS run faster.
The [expr,ALIST] task description is conveyed simply as a
pair of pointers, expr to an S expression version of expr, and ALIST
to the head of the balanced A-tree. The result returned in an
evaluation is likewise a pointer.
Since the list structure is only built upon, and never
altered, it is possible to speed up the operation of the system by
having individual processors maintain software managed caches of
frequently referred to cells that reside in remote machines. This
cuts down on the message traffic, and generally speeds things up.\.
\FDA little Algol:\F0
\J
Although recursive functions provide an excellent way of
exploiting a very parallel architecture, there are other ways. An
algorithmic (iterative) language can be made to serve, if a few
features are present.
Existing Algol unfortunately has few of these. Consider the
following pair of statements excerpted from an imaginary program:
\.
\F2
A := 3 ;
B := 4 ;
\F0\J
These say that first A is set to 3, then B is assigned the value 4.
There is an implied sequentialness. Presumably the programmer knew
that the order of the assignments was unimportant, and that they
could be done simultaneously. The syntax of the language provided no
way for him to indicate this.
A simple construct which permits information of this type to
be conveyed is the "parallel semicolon", which we will indicate by a
vertical bar "|". Statements seperated by parallel semicolons may be
executed simultaneously. The following is an example of its use:
\.
\F2
A := 3 |
B := 4 |
C := 6 ;
D := A+B+C ;
\F0\J
The first three statements may be executed together, the fourth must
wait for their completion. It is implied that compound statements,
bracketed by BEGIN END pairs can be similarly separated by parallel
as well as sequential semicolons.
If compound statements are permitted to return as a value the
value of their last component statement, then the structure
\.
\F2
T := BEGIN
INTEGER A,B,C ;
A := expressionA |
B := expressionB |
C := expressionC ;
expression involving A, B and C
END;
\F0\J
is equivalent to a lambda expression construct in Lisp. A, B and C
are the dummy arguments, evaluated in parallel, and the fourth
expression, which must wait for A, B and C, is the body of the
lambda. The constructs involved can be used in many other ways,
unlike a real lambda expression.
If the Algol also has recursion, then it is possible to
obtain massive parallelism in the same way it was achieved in Lisp,
namely by writing functions which recurse deeply, and invoke
themselves a few times in parallel at each level.
A minor form of parallelism already automatically available
resides in arithmetic expressions. Different subexpressions of larger
formulas can be evaluated simultaneously, and then combined. This is
analogous to the evaluation of Lisp expressions.
More massive concurrency can be obtained if the data types of
Algol are expanded to include a complete set of array operators and
genuine dynamic allocation of arrays. It is this type of data and
operator set that makes APL an extremely powerful language in spite
of an execrable control structure.
The idea is that operators such as addition and
multiplication work not only on simple variables representing single
numbers, but on whole multi-dimensional arrays. Since arrays are more
complex objects a much larger range of operators is required.
Included are such things as array restructuring, subscript
permutation (generalized transpose), element shuffling, subarray
extraction, generalized cross and dot products, etc.. In general an
operator combines one or more arrays and produces a value which is a
new array, often of a different size and shape.
Compounding of such operators substitutes for a large amount
of explicit loop control, and results in substantially more compact
source code. On conventional computers such code also runs much
faster because most the run time is spent in carefully hand coded
implementations of the basic operators, which do a great deal at each
invokation if the arrays are large. Typical equivalent Algol
programs execute second rate compiler produced code almost
exclusively. An operator set of this kind provides the same
capabilities as a hypothetical parallel FOR loop construct, with
greater clarity. Conditionals can be handled by first selecting out
subarrays according to the condition, and then applying the
appropriate operations.\.
\J
It is desirable in our system for large arrays to be spread
out among many machines, so that array operations can be carried out
in parallel where possible. An often occurring process is the
application of an arithmetic or other operator to one or more arrays
of a given size, producing a result of the same size. This could be
made maximally efficient if corresponding elements of the arrays
resided in the same processor. Assigning a processor to an array
element by means of a hash function of both the array dimensions and
the indices of the element will cause arrays of different size to be
stored in different places, but will put corresponding elements of
arrays of the same size in the same place.
Now comes the question of how a gaggle of processors managing
a given array gets word that an operation concerning the array is to
be executed. The initiator of the operation is the single processor
running the code requesting it. The fastest way to propagate such
information is to initiate a "chain letter". The originating
processor can consider the array as two equal (or almost equal)
pieces, and send a message to a member of each piece. Each recipient
then divides the piece of which it is a member into two, and sends a
message to a representative of each of those smaller subsets, etc.
When the subset size becomes one, the operation is performed.
It may be more efficient to store larger pieces of arrays in
individual processors. This adds a slight serial component to the
run times, but saves message handling time.
How are programs communicated from processor to processor, as
the number of executing instruction streams grows? It would be an
extravagant waste of memory to duplicate the entire source program in
every processor (besides, it might not fit). A software caching
scheme is called for. When a processor initiates a subtask, it
obtains a free processor and passes to it a moderate amount of the
code needed for the task. When the new processor runs out, it sends
a message back to the first machine for more. This machine trys to
obtain it from its own cache, and if it too is out, sends a message
further up in the hierarchy, to the machine which had initiated the
task which it is running, and so on until the requested code is
obtained. At top level the entire runnable program is assumed to be
available from some sort of file, maintained by the combined storage
of a number of processors.\.
\FDA little operating systems:\F0
\J
Assume that most input/output devices are connected to
individual processors, and that, to maximize the bandwidth, each of
these processors has only one, or a most a very small number, of
devices associated with it. Included among these are communications
lines to user terminals, so that each console talks directly to a
dedicated processor with sorting net access to the entire system.
If we expect to support more than one user on such a system,
it will be necessary to have some type of protection scheme, to
prevent one user's processes from accidentally interfering with
another's.
If each processor contains a monitor, then message sending
can handled by system calls. The monitor can then check for validity,
testing if the requested destination is within the set of processors
assigned to the user. This monitor, which every machine assigned to
a user must run, can be flexible enough to time share the machine it
runs on, to provide multiple simulated processors.
Controlling the state of the individual machines' monitors is
the task of a global system monitor, operated by several machines,
which maintains a pool of free processors, and parcels them out on
request, and which also handles file system requests (bulk storage
would be connected to a handful of the processors), and allocation of
other devices.
Processes belonging to a single user will be initiated by a
particular master machine, probably the one connected to his console.
This master can create a tree of subprocesses, possibly
intercommunicating, running on different machines. It should be
possible, for example, to have one subset configured as an array
processor for efficient implementation of low level vision
operations, while another is running an Algol/APL for the less
structured analytic geometry needed to interpret the image, and yet a
third is operating a Lisp system doing abstract reasoning about the
scene. Many existing systems permit this kind of organization, but
they are hampered by having an absurdly small amount of computing
power.\.
\J
How is a system of this kind initialized, and how does one
abort an out of control process taking place in part of it without
affecting the rest? A possibility is to have an "executive" class of
messages (perhaps signalled by a particular bit in the data portion),
which user jobs are not permitted to emit. Reception of such
messages might cause resetting of the processor, loading of memory
locations within it, and starting execution at a requested locations.
A single externally controllable machine can be used to get things
going, fairly quickly if emits a self replicating "chain letter".
Now consider reliability. The system can obviously tolerate
any reasonable number of inoperable processors, by simply declaring
them unavailable for use. Failures in the communication net are much
more serious, and under most situations will require the system to
stop operating normally. It is possible to write diagnostic programs
which can track down defective comparator elements or broken data
wires. If something should happen to the clock signals to a given
level it would be necessary to wheel out an oscilloscope. If
reliability were a critical issue it would be possible to include a
duplicate net, to run things the while other was being debugged.\.
\FDDisclaimer:\F0
\J
The software outlines are obviously only partly baked. This
is mostly due to the limited amount of thought and work that has gone
into them. On the other hand, it is my belief that even well thought
out designs at this point will look naive in the light of experience
with a working version of such a machine. Many of the fundamental
decisions depend on things difficult to estimate, including the
number of processors, communication speed compared with processor
speed, memory size, and most importantly the kinds, sizes and mix of
operations that people will tend to run (these will surely differ
from what is being done now, the limitations being so different).
This is not to say that more thought isn't called for. The
nature of the memory protection, interrupt scheme, word size,
instruction set, I/O structure, etc. of the individual machines
should be tailored to permit convenient implementation of the
individual processor operating systems, and typical message types.
What these are can best be determined by trying to write some
monitors, interpreters, compilers and array crunchers.
Simulation of the system on a conventional machine is next to
useless. Since the total amount of memory envisaged is larger than
conventional machines carry, and since a simulation will have to jump
around from simulated core image to simulated core image, in order to
keep a realistic message synchrony, there would be an enormous amount
of disk accessing. The slowness of this, combined with the inherent
speed difference, would allow only the most trivial (and therefore
unrepresentative) things to be tried, and not many of these.
Construction of a moderate size version is a better use of
manpower. Undoubtedly there will be mistakes made in the first (or
first few) models, which will become apparent after a bit of
experience.
The flexibility offered should make this design much more
attractive to a large class of programmers than current essentially
special purpose architectures such as ILLIAC IV. Making it out of
existing processors with proven machine languages will help too. I,
for one, can hardly wait to start programming even a flawed version
of a machine that can process and generate real time television with
programs written in Algol, and simultaneously jump over tall game and
proof trees in a single bound.\.
\FB
Bombast:
\F0\J
The enormous shortage of ability to compute is distorting our
work, creating problems where there are none, making others
impossibly difficult, and generally causing effort to be misdirected.
Shouldn't this view be more widespread, if it is as obvious as I
claim?
In the early days of AI the thought that existing machines
might be much too small was widespread, but there was hope that
clever mathematics and advancing computer technology could soon make
up the difference. Since then computers have improved by a factor of
ten every five years, but, in spite of reasonably diligent work by a
reasonable number of people, the results have been embarrassingly
sparse. The realization that available compute power might still be
vastly inadequate has since been swept under the rug, due to wishful
thinking and a feeling that there was nothing to be done about it
anyway and that voicing such an opinion could cause AI to be
considered impractical, resulting in reduced funding.
There is also an element of scientific snobbery. Many of the
most influential names in the field seem to feel that AI should be
like the theoretical side of physics, the essential problem being to
find the laws of universe relating to intelligence. Once these are
known, the thinking goes, construction of efficient intelligent
machines will be trivial. Suggestions that the problems are
essentially engineering ones of scale and complexity, and can be
solved by incremental improvements and occasional insights into
sub-problems, are treated with disdain.
This attitude is a variant of the philosophical notion that
all truth can be arrived at by pure thought, and is unfounded and
harmful. One wonders what state space travel would be in if the
Goddards and von Brauns had spent their time trying to find the
universal laws of rocket construction before trying to build space
ships. AI needs a stronger experimental base. Like other branches of
endeavor (notably physics, aeronautics and meteorology), we should
realize our desperate need for more computing, and do things about
it.
\.
\C\fFc