COMMENT ’μί   VALID 00036 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00004 00002	\|\\@macros.pox[doc,hpm]\’άκ=0\‘ͺ\’άκ=34\‘΅\F1\
C00005 00003	\FB
C00006 00004	\FB
C00009 00005	\FB
C00011 00006	\FB
C00012 00007	\FBSection 1:  The Natural History of Intelligence
C00024 00008	\FDUnifying principles:\F0
C00031 00009	\FDHarangue:\F0
C00035 00010	\FDReferences:\F0
C00038 00011	
C00040 00012	
C00041 00013	\FBSection 2:  Measuring Processing Power
C00046 00014	\FDEntropy measurement:\F0
C00051 00015	\FDA representative computer:\F0
C00057 00016	\FDA typical nervous system:\F0
C00062 00017	\FDThermodynamic efficiency:\F0
C00070 00018	\FDReferences:\F0
C00073 00019	\FBSection 3:  The Growth of Processing Power
C00085 00020	\FDReferences:\F0
C00088 00021	\FBSection 4:  Interconnecting Processors
C00093 00022	\F2\’άκ=25\‘΅
C00098 00023	\FDCommunication scheme organization:\F0
C00103 00024	\F0\J
C00111 00025	\F2\’άκ=25\‘΅
C00116 00026	\FDPackage counts:\F0
C00123 00027	\F0\J
C00128 00028	\FDSpeed calculations:\F0
C00131 00029	\FDPossible refinements:\F0
C00136 00030	\FDReferences:\F0
C00137 00031	
C00140 00032	\FDA little Lisp:\F0
C00154 00033	\FDA little Algol:\F0
C00165 00034	\FDA little operating systems:\F0
C00171 00035	\FDDisclaimer:\F0
C00175 00036	\FB
C00179 ENDMK
C’μί;
\|\\@macros.pox[doc,hpm];\’άκ=0;\‘ͺ\’άκ=34;\‘΅\F1\
\’δΗ'010000;\’δΗ'1000002;\
\MBBDI40;
\MDNONMBI;
\MEMS25;
\MFCRTURZ;
\’άκ'0;\’άν#\’ΰϋEVERYPAGE[\’άκ#\+=1;\’άν#\oEP{0    \D#}\WEP,=1435;=150;]
\FB





\C\fFC




\CThe Role of

\CRAW POWER

\Cin

\CINTELLIGENCE

















\F1\RHans Moravec
\RMay 12, 1976
\FB


CONTENTS


  \F0 1\FB      Introduction

  \F0 2\FB      Acknowledgement

  \F0 3\FB      Section 1:  The Natural History of Intelligence
  \F0 3\FB         \F0Product lines\FB
  \F0 5\FB         \F0Unifying principles\FB
  \F0 6\FB         \F0Harangue\FB
  \F0 7\FB         \F0References\FB

  \F010\FB      Section 2:  Measuring Processing Power
  \F010\FB         \F0Low level vision\FB
  \F011\FB         \F0Entropy measurement\FB
  \F013\FB         \F0A representative computer\FB
  \F014\FB         \F0A typical nervous system\FB
  \F015\FB         \F0Thermodynamic efficiency\FB
  \F017\FB         \F0References\FB

  \F018\FB      Section 3:  The Growth of Processing Power
  \F020\FB         \F0References\FB

  \F021\FB      Section 4:  Interconnecting Processors
  \F021\FB         \F0\!jexp(Log,2); sorting net construction\FB
  \F023\FB         \F0Communication scheme organization\FB
  \F027\FB         \F0Package counts\FB
  \F029\FB         \F0Speed calculations\FB
  \F030\FB         \F0Possible refinements\FB
  \F031\FB         \F0References\FB

  \F032\FB      Section 5:  Programming Interconnected Processors
  \F033\FB         \F0A little Lisp\FB
  \F035\FB         \F0A little Algol\FB
  \F037\FB         \F0A little operating systems\FB
  \F038\FB         \F0Disclaimer\FB

  \F039\FB      Section 6:  Bombast
\’δΗ'100;
\FB








Introduction:


\F0\J
	This essay  is an  argument that  an essential ingredient  is
absent  in  many  current  conceptions  of  the  road  to  Artificial
Intelligence.\.
\J
	The first section discusses  natural intelligence, and  notes
two  major  branches  of  the  animal kingdom  in  which  it  evolved
independently,   and  several  offshoots.   The  suggestion  is  that
intelligence need not  be so difficult  to construct as  is sometimes
assumed.\.
\J
	The second part  compares the information  processing ability
of  present computers with  intelligent nervous systems,  and finds a
factor of one million difference. This abyss is interpreted as a major
distorting influence in current work, and  a reason for disappointing
progress.\.
\J
	Section three  examines the  development of electronics,  and
concludes  the state of  the art can  provide more power  than is now
available, and that the gap could be closed in a decade.\.
\J	
	Parts four and  five introduce hardware and  software aspects
of a system which is able to make use of the advancing technology.\.
\FB














Acknowledgement:


\F0\J
	The following  entities provided inspiration,  encouragement,
suggestions, data, slave labor, proof reading services, etc.:

	(in   carefully  randomized   order)   PDP-KA10,   Scientific
American,  Marc   Le  Brun,  Andy  Moorer,  Ed  Mcguire,  Electronics
magazine, Don  Gennery,  Bill Gosper,  John McCarthy,  Macsyma,  Mike
Farmwald,  Russ  Taylor,  Cart   Project,  Les  Earnest,  Pierre  van
Nypelseer,  Robert Maas,  Jeff Rubin,  Bruce Baumgart,  HAL-9000, Tom
Binford, Clem Smith, Tom Gafford, Brian Harvey, ...
\.
\FBSection 1:  The Natural History of Intelligence

\F0
\FDProduct lines:\F0
\J
	Natural evolution has produced a continuum of complexities of
behavior,  from the  mechanical simplicity of viruses to the magic of
mammals. In the higher animals most of the complexity resides  in the
nervous system.

	Evolution of the brain began in  early multi-celled animals a
billion  years   ago  with  the  development   of  cells  capable  of
transmitting electrochemical  signals.    Because  neurons  are  more
localized than hormones they allow a  greater variety of signals in a
given volume.  They also provide evolution with a more uniform medium
for experiments in complexity.

	The  advantages  of  implementing  behavioral  complexity  in
neural nets seem  to have been overwhelming, since all modern animals
more than a few cells in size have them [animal].

	Two major  branches in  the animal  kingdom, vertebrates  and
mollusks, contain species  which can be considered intelligent.  Both
stem from  one  of the  earliest  multi-celled organisms,  an  animal
something like a hydra made of a double layer of cells and possessing
a primitive nerve net.

	Most  mollusks   are   intellectually  unimpressive   sessile
shellfish, but  one branch, the cephalopods, possesses high mobility,
large  brains  and  imaging   eyes,  evolved  independently  of   the
corresponding   vertebrate  structures.      There  are   fascinating
differences.   The optic nerve connects to the back of the retina, so
there is  no  blind spot.    The brain  is  annular, forming  a  ring
encircling the esophagus.  The circulatory system, also independently
evolved, has three blood pumps,  a systemic heart pumping  oxygenated
blood to the tissues  and two gill hearts, each  pumping venous blood
to  one gill. The  oxygen carrier is  a green  copper compound called
hemocyanin,  evolved  from  an  earlier  protein  that  also   became
hemoglobin.\.
\J
	These  animals have  some unique  abilities.   Shallow  water
octopus  and squid are  covered by a  million individually controlled
color changing effectors called  chromatophores, whose functions  are
camouflage and  communication.  The capabilities  of this arrangement
have  been  demonstrated  by  a  cuttlefish  accurately  imitating  a
checkerboard it  was  placed upon,  and an  octopus  in flight  which
produced a  pattern like the  seaweed it was  traversing, coruscating
backward along the  length of its  body, diverting the  eye from  the
true motion.  Deep  sea squid have photophores capable  of generating
large quantities of multicolored light.  Some are as complex as eyes,
containing irises and lenses [squid].  The light show is modulated by
emotions in  major and subtle  ways. There  has been little  study of
these  matters, but  this must  provide means of  social interaction.
Since they also  have good  vision, there is  the potential for  high
bandwidth communication.

	Cephalopod   intelligence    has    not   been    extensively
investigated,  but   a  few  controlled  experiments  indicate  rapid
learning in  small octopus  [Boycott].   The Cousteau  film shows  an
octopus'  response to  a "monkey  and bananas"  problem.   A fishbowl
containing a lobster is sealed with a cork and dropped into the water
near  it.  The  octopus  is  attracted,  and  spends   a  long  while
alternately probing  the container in  various ways and  returning to
its lair in iridescent frustration.  On the final iteration it  exits
its  little  hole  in  the  ground  and  unhesitatingly  wraps  three
tentacles  around the bowl, and  one about the cork,  and pulls.  The
cork shoots to the surface and the octopus eats.  The  Time-Life film
contains a similar sequence, with a screw  top instead of a cork!  If
small octopus have  almost mammalian behavior, what might  giant deep
sea squid be capable of?

	Birds  are   more  closely   related  to   humans  than   are
cephalopods, their  common ancestor with us being  a 300 million year
old early  reptile.   Size-limited by  the dynamics  of flying,  some
birds  have reached an  intellectual level comparable  to the highest
mammals.

	Crows and ravens  are notable for frequently  outwitting man.
Their intuitive number sense (ability to  perceive the cardinality of
a set without counting) extends to seven, as opposed to three or four
in man.  Such a  sense is useful for  keeping track of the number  of
eggs in a nest. Experiments have shown [Stettner] that most birds are
more  capable of  high order  "reversal" and "learning  set" learning
than  all mammals  except  the higher  primates.   In  mammals  these
abilities  increase with increasing  cerebral cortex size.   In birds
the same functions depend on  areas not present in mammalian  brains,
forebrain  regions called  the "Wulst"  and the  hyperstriatum.   The
cortex is  small and relatively unimportant.  Clearly this is another
case of independent evolution of similar mental functions.

	It is interesting to speculate  whether penguins, now similar
to seals in behavior and habitat, will ever become fully aquatic, and
evolve analogously to the great whales.

	The cetaceans are related  to man through a small  30 million
year  old primitive  mammal. Some  species of  dolphin have  body and
brain masses  identical to ours,  and archaeology  reveals they  have
been this way several times as long.  They are as good as man at many
kinds  of problem  solving, and  perhaps at language.  The references
contain many  anecdotes, and describe  a few controlled  experiments,
showing  that  dolphins  can  grasp  and communicate  complex  ideas.
Killer whales have brains seven  times human size, and their  ability
to formulate plans is better than  dolphins', occasionally being used
to  feed on  them. Sperm  whales, not the  largest animals,  have the
world's largest brains.  There may be intelligence  mediated conflict
with large squid, their main food supply.

	Elephants have brains about six times human size, matriarchal
tribal  societies, and  complex  behavior. Indian  domestic elephants
learn 500 commands, limited by the range of tractor-like  tasks their
owners  need done,  and form  voluntary mutual  benefit relationships
with  their trainers,  exchanging labor  for baths.   They  can solve
problems such as  how to sneak  into a plantation  at night to  steal
bananas, after having been belled (answer: stuff mud into the bells).
And they remember  for decades.  Inconvenience and cost has prevented
modern science from doing much work with them.

	The apes are man's cousins.  Chimps and gorillas can learn to
use tools and  to communicate with human sign languages at a retarded
level.   As  chimps have  one third,  and  gorillas one  half,  human
brainsize,  similar  results should  be  achievable  with the  larger
brained,  but  less man-like  animals.  Though no  other  species has
managed to  develop cultures comparable  to modern  man's, it may  be
that  some of them  can be  made partners  in ours,  accelerating its
evolution by their unique capabilities.
\.
\FDUnifying principles:\F0
\J
	A feature  shared by all  living organisms whose  behavior is
complex enough to  place them near human in intelligence is a hundred
billion neuron  nervous system.   Imaging vision  requires a  billion
neurons, while a million is associated with fast and interesting, but
stereotyped,  behavior as in a  bee. A thousand  is adequate for slow
moving animals with minimal  sensory input, such as slugs  and worms.
A hundred runs most sessile animals.  The portions of nervous systems
for which  tentative wiring  diagrams have  been obtained,  including
much of the brain of the large neuroned sea slug, Aplysia, the flight
controller  of the  locust and  the early  stages of  some vertebrate
visual  systems,  reveal  that   the  neurons  are  configured   into
efficient, clever, assemblies.  This should not be too surprising, as
unnecessary  redundancy means unnecessary  metabolic load, a distinct
selective disadvantage.

	Evolution has stumbled  on many ways  of speeding up  its own
progress,  since species  that adapt  more  quickly have  a selective
advantage. Most of these, such as sex and individual death, have been
refinements of one of the oldest, the encoding of genetic information
in  the easily  mutated  and modular  DNA molecule.  In the  last few
million years the genetically evolved ability  of animals, especially
mammals,  to learn  a significant  fraction of  their  behavior after
birth has provided  a new medium  for growth of  complexity.   Modern
man, though probably not the most  individually intelligent animal on
the planet, is  the species in which this cultural evolution seems to
have had the  greatest effect, making human  culture the most  potent
force on the earth's surface.

	Cultural  and  technological evolution  proceeds  by  massive
interchange of  ideas and information, trial and  error guided by the
ability to  predict  the  outcome  of simple  situations,  and  other
techniques  mediated by  the  intelligence of  the participants.  The
process  is  self  reinforcing  because  its  consequences,  such  as
improved communication methods  and increased wealth and  population,
allow more experiments and faster cross fertilization among different
lines of inquiry.  Many of its techniques  have not been available to
biological  evolution.    The  effect  is  that  present  day  global
civilization is developing  capabilities orders of  magnitude faster.
Of course biological evolution has had a massive head start.\.
\J
	Although  cultural  evolution has  developed  methods  beyond
those of its  genetic counterpart, the overall process is essentially
the  same.   It  involves  trying  large  numbers  of  possibilities,
selecting the best ones, and combining successes from different lines
of investigation.  This requires time and other finite resources.

	Finding the optimum assembly of  particular type of component
which achieves  a desired function usually  requires examination of a
number of possibilities  exponential in the  number of components  in
the solution.  With fixed resources this implies a design time rising
exponentially  with complexity.   Alternatively the  resources can be
used in stages, to design subassemblies, which are then combined into
larger units, and so on,  until the desired result is achieved.  This
can be  much faster  since the  effort rises  exponentially with  the
incremental  size of  each  stage and  linearly  with the  number  of
stages,  with   an  additional  small  term,  for  overall  planning,
exponential in  the number  of stages.  The resulting construct  will
probably use  more of the basic component  and be less efficient than
an optimal design.

	Biological evolution is  affected by these  considerations as
much  as our  technology.   Presumably there  is a  way of  using the
physics of the universe to construct entities functionally equivalent
to human beings, but vastly  smaller and more efficient.  Terrestrial
evolution has  not had the time or space to develop such things.  But
by building  in the  sequence  atoms, amino  acids, proteins,  cells,
organs,  animal  (often concurrently),  it  produced a  technological
civilization out of inanimate matter in only two billion years.
\.
\FDHarangue:\F0
\J
	The existence  of several  examples of intelligence  designed
under these  constraints should give us great  confidence that we can
achieve the same in short order.

	The situation is analogous to the history of heavier than air
flight,  where  birds,  bats  and  insects  clearly demonstrated  the
possibility before our culture  mastered it. Flight without  adequate
power to  weight ratio  is heartbreakingly difficult  (vis. Langley's
steam powered  aircraft or current  attempts at man  powered flight),
whereas  with enough power (on  good authority!) a  shingle will fly.
Refinement  of the  aerodynamics  of  lift  and  turbulence  is  most
effectively tackled after some experience with suboptimal aeroplanes.
After  the  initial successes  our culture  was  able to  far surpass
biological flight in a few decades.

	Although there  are known  brute force  solutions to most  AI
problems,  current machinery makes  their implementation impractical.
Instead we are  forced to expend  our human resources trying  to find
computationally  less  intensive  answers,  even  where  there is  no
evidence that  they exist.  This is  honorable scientific  endeavor,
but, like trying to design optimal airplanes from first principles, a
slow way to get the job done.

	With more  processing power, competing  presently impractical
schemes  could be  compared by  experiment,  with the  outcomes often
suggesting incremental or revolutionary improvements. Computationally
expensive  highly optimizing  compilers would  permit efficient  code
generation  at less human  cost.  The expanded  abilities of existing
systems such as MACSYMA,  along with new experimental  results, would
accelerate theoretical development. Gains made this way would improve
the very systems being used, causing more  speedup.  The intermediate
results would be inefficient kludges busily contributing to their own
improvement.   The end result  is systems as efficient  and clever as
any designed by more theoretical approaches, but sooner, because more
of the labor has been done by machines.

	With enough power  anything will fly. The next  section tries
to decide how much is needed.
\.









\C\fFB
\FDReferences:\F0

[animal]

   JERISON, Harry J.,
   "Paleoneurology and the Evolution of Mind",
   Scientific American, Vol. 234, No. 1,
   January 1976, 90-101.

   BITTERMAN, M. E.,
   "The Evolution of Intelligence",
   Scientific American, Vol. 212, No. 1,
   January 1965, 92-100.

   GRIFFIN, Donald R., ed.
   "Animal Engineering",
   W.H. Freeman and Company,
   San Francisco, June 1974.

   BURIAN, Z. and Spinar, Z.V.,
   "Life Before Man",
   American Heritage Press, 1972.

   RIOPELLE, A.J., ed.
   "Animal Problem Solving",
   Penguin Books, 1967.

   GOODRICH, Edwin S.,
   "Studies on the Structure and Development of Vertebrates",
   Dover Publications Inc., New York, 1958.

   BUCHSBAUM, Ralph,
   "Animals without Backbones",
   The University of Chicago Press, 1948.

   FARAGO, Peter, and Lagnado, John,
   "Life in Action"
   Alfred A. Knopf, New York, 1972.

   BONNER, John Tyler,
   "Cells and Societies",
   Princeton University Press, Princeton, 1955.

[squid]

   COUSTEAU, Jacques-Yves and Diole, Philippe,
   "Octopus and Squid",
   Doubleday & Company, Inc., Garden City, New York, 1973.
   (also a televised film of the same name)

   "The Octopus",
   a televised film, Time-Life films.

   BOYCOTT, Brian B.,
   "Learning in the Octopus",
   Scientific American, Vol. 212, No. 3,
   March 1965, 42-50.

   LANE, Frank W.,
   "The Kingdom of the Octopus",
   Worlds of Science Book, Pyramid Publications Inc.
   October, 1962.

[bird]

   BAKKER, Robert T.,
   "Dinosaur Renaissance",
   Scientific American, Vol. 232, No. 4,
   April 1975.

   STETTNER, Laurence Jay and Matyniak, Kenneth A.
   "The Brain of Birds",
   Scientific American, Vol. 218, No. 6,
   June 1968, 64-76.

[whale]

   LILLY, John. C.,
   "The Mind of the Dolphin" &
   "Man and Dolphin",
   Doubleday and Company, New York, 1967.

   COUSTEAU, Jacques-Yves and Diole, Philippe,
   "The Whale",
   Doubleday & Company, Inc., Garden City, New York, 1972.

   FICHTELIUS, Karl-Erik and Sjolander, Sverre,
   "Smarter than Man?",
   Ballantine Books, New York, 1974.

   STENUIT, Robert,
   "The Dolphin, Cousin to Man",
   Bantam Books, New York, 1972.

   "Whales and Dolphins",
   A BBC produced film shown
   in the NOVA television series

   McINTYRE, Joan, ed.
   "Mind in the Waters",
   Charles Scribner's Sons, San Francisco, 1974.

[elephant]

   RENSCH, Bernhard,
   "The Intelligence of Elephants",
   Scientific American,
   February 1957, 44.

[primate]

   "The First Words of Washoe",
   Televised film shown in the NOVA series

   LeGROS CLARK, W.E.,
   "History of the Primates",
   The University of Chicago Press, Chicago 1966.

   PFEIFFER, John,
   "The Human Brain",
   Worlds of Science Books, Pyramid Publications Inc.,
   New York, 1962.
\FBSection 2:  Measuring Processing Power

\F0
\FDLow level vision:\F0
\J
	The visual system of  a few animals has been  studied in some
detail,  especially the layers  of the  optic nerve near  the retina.
The neurons  comprising  these  structures are  used  efficiently  to
compute  local   operations  like  high  pass   filtering  and  edge,
curvature, orientation and motion detection.

	Assuming the visual cortex (and  possibly the optic nerve  itself)
is as computationally intensive as the retina, successive layers producing
increasingly  abstracted  representations,  we  can  estimate  the   total
capability.  There are a million seperate fibers in a cross section of the
human optic  nerve. At  the retina  each  is connected  to several  of  10
million light  sensitive rods  and  cones. The  thickness of  the  optical
cortex is a thousand times the depth occupied by the neurons which apply a
single simple operation. The  eye is capable of  processing images at  the
rate of  ten per  second (flicker  at higher  frequencies is  detected  by
special operators).  This  means that  the human  visual system  evaluates
10,000 million pixel simple operators each second.

	A  tightly  hand  coded  simple   operator,  like  high  pass
filtering  by subtraction of  a local  average, applied to  a 256\f2x256
(1/16 million) pixel picture takes at least 10 seconds  when executed
on a  PDP-10 (KA),  not counting timesharing.  A million  pixel image
would require 160 seconds, if storage constraints permitted it. Since
the computer can evaluate only  one at a time, the effective  rate is
1/160 million pixel simple operators per second.

	Thus a hand  coded PDP-10 falls  short of being the  equal of
the human optical system by a speed factor of 1.6 million.

	It may  not be  necessary to  apply every  operator to  every
portion of every  picture, and a general purpose computer, being more
versatile than the optic nerve, can take advantage of this.   I grant
an order of magnitude for this  effect, reducing the optic nerve to a
mere 100,000 PDP-10 equivalents.

	The  size of  this  factor is  related  to having  chosen  to
implement  our algorithms  in machine  language. If  we had  opted to
disassemble a number of PDP-10's and reconfigure the components to do
the computation,  far fewer  would have been  required. On  the other
hand if  we had  run our algorithms  in interpreted  Lisp, 10 to  100
times as many would  be needed.  The tradeoff is that the design time
varies inversely with the execution efficiency.  A good  Lisp program
to compute  a given function is  easier to produce than  an efficient
machine  language  program,  or  an  equivalent  piece  of  hardware.
Compilers and automatic  design programs blur  the issue a little  by
passing some  of the effort of  implementation to a  machine, but the
essential character remains.
\.
\FDEntropy measurement:\F0
\J
	Is there a quantitative way in which  the processing power of
a  system, independent of  its detailed nature,  can be  measured?  A
feature of things which  compute massively is that they  change state
in complicated  and unexpected ways.  The reason for  believing that,
say,   a  stationary   rock  does  little   computing  is   its  high
predictability.  By this criterion the amount of computing done  by a
device  is in  the mind  of the  beholder. A  machine with  a digital
display which  flashed 1, 2, 3, 4 etc., at fixed intervals would seem
highly  predictable  to  an  adult  of  our  culture,  but  might  be
justifiably  considered to  be doing  an interesting,  nontrivial and
informative  computation  by  a  young  child.    Information  theory
provides a measure for this idea. If a system is in a given state and
can change to one  of a number of next states, the information in the
transition, which I will call the Compute Energy (CE), is given by
\.

\CCE  =  \!jsum(i=1,N); -p\!jsub(i); log p\!jsub(i);
\J
where  N  is  the  number of  next  states,  and  p\!jsub(i); is  the
probability  that the i'th state  will be occur next.  If the base of
the logarithm is two, the measure  is in binary digits, bits.  If  we
consider the  system in the long  run, considering all the  states it
might  ever eventually be  in, then this measure  expresses the total
potential variety of the system.

	A machine which can  accomplish a given thing faster  is more
powerful than a  slower one.  A measure for Compute Power is obtained
by dividing each term  in the above sum by  the expected time of  the
corresponding transition. This is
\.

\CCP  =  \!jsum(i=1,N); \!jdiv((-p\!jsub(i); log p\!jsub(i);)\N
,(t\!jsub(i);));
\J
and the units are bits/second.

	These measures are  highly analogous to the energy  and power
capacities of a battery.  Some properties follow:

They are linear,  i.e.  the compute power  and energy of a  system of
two or  more independent machines is the  sum of the individual power
and energies;

Speeding up a machine by a factor  of n increases the CP by the  same
factor;

A completely  predictable system has a  CP and CE of  zero;

A machine  with a  high short  term CP,  which can  reach a  moderate
number of states in a short time, can yet have a low CE, if the total
number of states attainable in the long run is not high;


	If the  probabilities and  times of  all the transitions  are
equal the measures simplify to
\.

\CCE   =   log\!jsub(2);N

\CCP   =   \!jdiv((log\!jsub(2);N),t);
\J
For N distinct outcomes and equal transition times the maximum CE and
CP is obtained if all the probabilities of occurrence are 1/N, so the
above simplifications  represent an upper  bound in  cases where  the
probabilities are variable or cannot be determined.\.
\FDA representative computer:\F0
\J
	For the KA-PDP10,  considering one instruction time,  we have
(roughly) that in one microsecond this machine is able to execute one
of 2\!jsup(5); different instructions, involving one of 2\!jsup(4);
accumulators and one of 2\!jsup(18); memory locations, most of these
combinations resulting in distinct next sates.  This corresponds to a
CP of
\.

\C\!jdiv((log\!jsub(2);(\!jexp(2,5); \f2x \!jexp(2,4); \f2x \!jexp(2,18);) bit),\
(\!jexp(10,-6); sec));    =    27 \f2x \!jexp(10,6); \!jdiv(bit,sec);
\J
This is  an extreme upper  bound, and  represents the most  efficient
code possible.   It cannot be maintained  for very long, because many
different sequences  of  instructions  have the  same  outcome.  Very
often, for instance,  the order in which a number  of instructions is
executed does not matter. Assuming for the moment that this is always
the case,  we can  calculate the effect  on the  CP, measured over  a
second.  The raw power says there are
\.

\C2\!jsup((27 \f2x 10\!jsup(6);));
\J
distinct possible states  after a second, or one  million instruction
times.  If all permutations result in the same outcome, this must
be reduced by a  factor of 1000000!. The  log of a quotient  is
the difference of the logs, so the adjusted CP is
\.

\C27 \f2x \!jexp(10,6); -  log\!jsub(2);(\!jexp(10,6);!)\N
  =  27 \f2x \!jexp(10,6); -  18.5 \f2x \!jexp(10,6);\N 
  =  8.5 \f2x \!jexp(10,6); \!jdiv(bit,sec);
\J
	The CP is  also limited by the  total compute energy.   If we
ignore external  devices, this is simply the  total amount of memory,
about 36\f2x2\!jsup(18);  =  9.4\f2x10\!jsup(6); bits.   The  PDP  10
could execute at its maximum  effectiveness for 9.4/8.5 = 1.1 seconds
before reaching a state which could have been arrived at more quickly
another way.

	Large external  storage devices such  as disks and  tapes can
increase the compute  energy indefinitely, because they are a channel
through which independently  obtained energy may be injected.  On the
other hand, they have only a moderate effect on the power.

	A disk  channel connected  to our  KA-10 has  a data rate  of
6.4\f2x10\!jsup(6);  bits/sec.    If run  at  full  speed, constantly
stuffing new information into memory, it would add slightly less than
that to the power, because it  uses memory cycles the processor might
want.   If, as is usually the case, the disk is used for both reading
and writing, the improvement is reduced by a factor  of two.  Further
reductions occur if  the use is intermittent.   The overall impact is
less than 10\!jsup(6); bits/sec,  about 10% of the  power of the  raw
processor.\.
\J
	Combining the above results, we  conclude that the processing
power of a  typical major AI center computer is at most 10\!jsup(7);
bits/sec.  Time sharing reduces  this to about 10\!jsup(6); b/s  per
user.  Programming in a moderately efficient high level language (vs.
optimal machine  code), costs another factor of 10, and running under
an interpreter  may  result in  a per  user power  of  a mere  10,000
bits/sec, if the  source code is efficient.   These reductions can be
explained in  the  light of  the  CP measure  by  noting that  the  a
compiler or  interpreter  causes the  executed code  to  be far  more
predictable  than   optimal  code,  eg.    by  producing  stereotyped
sequences of instructions for primitive high level constructs.\.
\FDA typical nervous system:\F0
\J
	We  now consider  the processing  ability  of animal  nervous
systems,  using humans as  an example.   Since the data  is even more
scanty than what we assumed  about the PDP-10, some not  unassailable
assumptions need to be made.   The first is that the processing power
resides in the  neurons and their  interconnections, and not  in more
compact  nucleic  acid or  other  chemical encodings.    There  is no
currently widely  accepted  evidence  for the  latter,  while  neural
mechanisms  for memory  and learning  are being  slowly revealed.   A
second  is that  the  neurons  are  used reasonably  efficiently,  as
detailed analysis  of small nervous systems and  small parts of large
ones  reveals (and  common  sense  applied  to  evolution  suggests).
Thirdly,  that neurons  are fairly  simple,  and their  state can  be
represented by a binary variable, "firing" or "not firing", which can
change about  once per  millisecond.   Finally we  assume that  human
nervous systems contain about 40 billion neurons.

	Considering the  space  of all  possible interconnections  of
these  40 billion  (treating this  as the  search space  available to
natural evolution in its  unwitting attempt to produce  intelligence,
in  the  same sense  that  the  space  of  all possible  programs  is
available to someone trying to create intelligence in a computer), we
note that there is no  particular reason why every neuron  should not
be  able   to  change  state  every  millisecond.     The  number  of
combinations   thus    reachable    from    a    given    state    is
\!jexp(2,(40\f2x10\!jsup(9);)); the   binary    log  of   which    is
40\f2x10\!jsup(9);.  This leads to a compute power of
\.

\C\!jdiv((40 \f2x \!jexp(10,9); bit),(\!jexp(10,-3); sec)); =\N
  40 \f2x \!jexp(10,12); \!jdiv(bit,sec);
\J
which is about a million times the maximum power  of the KA-10.

	Keep in mind that much of this difference  is due to the high
level  of interpretation in the  10, compared to what  we assumed for
the nervous system.  Rewiring  its gates or transistors for each  new
task would  greatly increase the  CP, but also the  programming time.
If the processor is made of 100,000 devices which can change state in
100  ns,  the  potential  CP  available  through  reconfiguration  is
10\!jsup(5); bits/10\!jsup(-7);  sec = 10\!jsup(12); b/s.   The CE
would be unaffected.   If  automatic design  and fabrication  methods
result in small quantity integrated circuit manufacture becoming less
expensive and more widely practiced, my calculations may prove overly
pessimistic.
\.
\FDThermodynamic efficiency:\F0
\J	
	Thermodynamics  and  information theory  provide  us  with  a
minimum amount of  energy per bit of information generated at a given
background temperature (the energy required to out shout  the thermal
noise).  This is approximately the Boltzmann constant,
\.

\C1.38 \f2x \!jexp(10,-16); \!jdiv(erg,deg molecule);
\J
for our purposes the units will be  rephrased as erg/(deg bit), since
for normal matter  molecules represent the limit of organization, due
to the quantum mechanical fuzziness  of things.  This measure  allows
us to  estimate the overall  energy efficiency of  computing engines.
For  instance, we determined  the computing rate of  the brain, which
operates at 300 degrees K, to be 40\f2x10\!jsup(12); bits/sec.  This
corresponds to a physical power of
\.

\C40 \f2x \!jexp(10,12); \!jdiv(bit,sec); \f2x 300 \N
deg \f2x 1.38 \f2x \!jexp(10,-16); \!jdiv(erg,deg bit);  =  \N
1.66 \!jdiv(erg,sec);  =  1.66 \f2x \!jexp(10,-7); watt
\J
	The brain runs on approximately 40 watts, so we conclude that
it is 10\!jsup(-8); times as efficient as the physical limits allow.

	Doing the same calculation for the KA10, again at 300 deg, we
see   that   a   CP   of   8.5\f2x10\!jsup(6);   bit/sec  is   worth
3.519\f2x10\!jsup(-14);  watts.    Since   this  machine  needs   10
kilowatts the efficiency is only  10\!jsup(-18);.  Conceivably a ten
watt, but otherwise equivalent, KA10 could be designed today, if care
were taken  to use the  best logic  for the  required speed in  every
assembly.  The efficiency would then still be only 10\!jsup(-15);.

	As noted previously,  there is a  large cost inherent  in the
organization of a  general purpose computer. We might investigate the
computing efficiency of  the logic gates of  which it is  constructed
(as was, in fact, done with the  brain measure).  A standard TTL gate
can  change state  in about  10ns, and consumes  10\!jsup(-3); watt.
The switching speed corresponds to a CP of  10\!jsup(8); bit/sec, or
a physical  power of 4.14\f2x10\!jsup(-13); watt.  So the efficiency
is 10\!jsup(-10);, only  one hundred times  worse than a  vertebrate
neuron.\.
\J
	The newer semiconductor logic families are even better. C-MOS
is twice  as efficient as TTL, and  Integrated Injection Logic is 100
times better, putting it on a par with neurons.

	Experimental   superconducting   Josephson   junction   logic
operates  at  4 deg  K,  switches in  10\!jsup(-11);  sec,  and uses
10\!jsup(-7); watts per gate.  This implies a physical compute power
of    5\f2x10\!jsup(-12);    watt,    and     an    efficiency    of
5\f2x10\!jsup(-5);,  1000  times  better  than  neurons.    At  room
temperature it  requires a refrigerator  that consumes  100 times  as
much  energy as  the logic,  to  pump the  waste heat  uphill from  4
degrees to 300.  Since the background  temperature of the universe is
about 4 degreees, this can probably eventually be done away with.

	It is  thus likely that  there exist ways  of interconnecting
gates  made  with known  techniques  which would  result  in behavior
effectively equivalent to  that of  human nervous systems.   Using  a
million I\!jsup(2);L gates,  or 10 thousand Josephson  junction gates,
and  a trillion  bits of  slower  bulk storage,  all running  at full
speed, such assemblies would consume as little as, or less  than, the
power needed to operate a brain of the conventional type.

	Past  performance indicates  that  the  amount of  human  and
electronic  compute power available  is inadequate to  design such an
assembly within the next few years.   The problem is much reduced  if
the components used are suitable  large subassemblies.  Statements of
good  high  level  computer languages  are  the  most  effective such
modularizations yet discovered,  and are probably the  quickest route
to human  equivalence, if the  necessary raw processing  power can be
accessed through  them.  This  section has  indicated that a  million
times the  power of typical existing machines  is required.  The next
suggests this should  be available  at reasonable cost  in about  ten
years.
\.
\FDReferences:\F0

WILLOWS, A.O.D.,
"Giant Brain Cells in Mollusks",
Scientific American, Vol. 224, No. 2,
February 1971, 69-75.

KANDEL, Eric R.,
"Nerve Cells and Behavior",
Scientific American, Vol. 223, No. 1,
July 1970, 57-70.

AGRANOFF, Bernard W.,
"Memory and Protein Synthesis",
Scientific American, Vol. 216, No. 6,
June 1967, 115-122.

KENNEDY, Donald,
"Small Systems of Nerve Cells",
Scientific American, Vol. 216, No. 5,
May 1967, 44-52.

BAKER, Peter F.,
"The Nerve Axon",
Scientific American, Vol. 214, No. 3,
March 1966, 74-82.

HUBEL, David H.,
"The Visual Cortex of the Brain",
Scientific American,
November 1963, 54-62.

WILLIAMS, Peter L., Warwick, Roger
"Functional Neuroanatomy of Man",
W.B. Saunders Company, Philadelphia, 1975.

"PDP-10 Reference Handbook",
Digital Equipment Corporation, Maynard Mass., 1971.

TRIBUS, Myron and McIrvine, Edward C.,
"Energy and Information",
Scientific American, Vol. 224, No. 3,
Setember 1971, 179-188.

GLASSTONE, Samuel, Lewis, David,
"Elements of Physical Chemistry",
D. Van Nostrand Co. Inc., New York, 1960.

MILLER, Richard T.,
"Super Switch", 284, in Science Year annual 1975,
Field Enterprises Educational Corp., 1975.

LANDAUER, Rolf,
"Fundamental Limitations in the Computational Process", 
IBM Research Report, Yorktown Heights N.Y., June 1976.
\FBSection 3:  The Growth of Processing Power

\F0\J
	The  references  below  present,  among   other  things,  the
following data points on a price curve:
\.
\F2\’άκ=25;\‘΅\’άκ=-7;\‘ͺ
\’άκ.\’άνA
             \F1Transistor price\F2
 .0001\fE$\’άκA\’άν.\N
          .01\fE$\’άκA\’άν.\N
                   1\fE$\’άκA\’άν.\N
                           \f1$1\’άκA\’άν.\N
                                  \f1$100
’δΗ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±’δΙ\F1   Year\F2
~                                       ~
‘΅‘±                                  O  ‘±‘ͺ\F1   1950\F2
~                                  #    ~\F1   1951   \f1$100 transistor\F2
~                                  #    ~\F1   1952   transistor hearing aid\F2
~                                 #     ~\F1   1953\F2
~                                 #     ~\F1   1954\F2
‘΅‘±                               #     ‘±‘ͺ\F1   1955   transistor radios\F2
~                                #      ~\F1   1956\F2
~                               O       ~\F1   1957   \f1$10 transistor\F2
~                              #        ~\F1   1958\F2
~                            #          ~\F1   1959\F2
‘΅‘±                          O          ‘±‘ͺ\F1   1960   \f1$1 transistor\F2
~                         #             ~\F1   1961\F2
~                        #              ~\F1   1962   \f1$100,000 small computer (IBM 1620)\F2
~                       #               ~\F1   1963\F2
~                      O                ~\F1   1964\F2
‘΅‘±                     #               ‘±‘ͺ\F1   1965   \f1$0.08 transistor (IC)\F2
~                      #                ~\F1   1966   \f1$1000 4 func calculator\F2
~                     #                 ~\F1   1967   \f1$6000 scientific calc.\F2
~                    #                  ~\F1   1968   \f1$10,000 small computer (PDP 8)\F2
~                   #                   ~\F1   1969\F2
‘΅‘±                  O                  ‘±‘ͺ\F1   1970   \f1$200 4 func calculator\F2
~                  #                    ~\F1   1971\F2
~                 #                     ~\F1   1972   1K RAMS (1 \fE$/bit)\F2
~               #                       ~\F1   1973\F2
~              #                        ~\F1   1974   \f1$1000 small computer (PDP 11)\F2
‘΅‘±            O                        ‘±‘ͺ\F1   1975   4K RAMS (.1 \fE$/bit)\F2
~            #                          ~\F1   1976   \f1$5 4 func calc (.05 \fE$/trans)\F2
~                                       ~\F1   1977\F2
~                                       ~\F1   1978\F2
~                                       ~\F1   1979\F2
%‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±$


\’άκ=34;\‘΅\’άκ=0;\‘ͺ
\J\F0
	The numbers indicate a remarkably stable evolution. The price
per electronic function  has declined by a steady factor of ten every
five  years,   if   speed  and   reliability  gains   are   included.
Occasionally there is a more precipitous drop, when a price threshold
which  opens  a  mass  market  is  reached.    This  makes  for  high
incentives, stiff  price competition and  mass production  economies.
It happened in the early sixties with transistor radios, and is going
on now  for  pocket calculators  and  digital wristwatches.    It  is
begining for microcomputers, as these  are incorporated into consumer
products  such as  stoves, washing  machines, televisions  and sewing
machines, and soon cars. During such periods the price can plummet by
a  factor  of  100 in  a  five  year  period.   Since  the  range  of
application  for  cheap  processors is  larger  than  for  radios and
calculators, the explosion will be more pronounced.

	The pace of these gains is in no  danger of slackening in the
forseeable future.  In the next decade the current period may seem to
be merely the flat portion  of an exponential rise. On the  immediate
horizon are the new semiconductor  techniques, I\!jsup(2);L, and super
fast D-MOS, CCD  for large sensors and fast bulk memory, and magnetic
bubbles for  mass storage.   The  new 16K  RAM designs  use a  folded
(thicker) cell structure  to reduce the area required  per bit, which
can   be  interpreted  as  the  first   step  towards  3  dimensional
integration, which  could vastly increase  the density of  circuitry.
The use of V-MOS, an IC technique that vertically stacks the elements
of a MOS transistor  is expanding.  In  the same direction,  electron
beam and X-ray lithography will permit smaller circuit elements.

	In the longer run we have ultra  fast and efficient Josephson
junction  logic, of  which small  IC's exist in  an IBM  lab, optical
communication   techniques,   currently   being   incorporated   into
intermediate  distance telephone  links, and  other  things now  just
gleams in the eye of some fledgling physicist or engineer.\.
\J
	My   favorite   fantasies  include   the   "electronics"   of
super-dense matter, either  made of muonic atoms, where the electrons
are  replaced  by  more  massive  negative  particles  or   of  atoms
constructed  of magnetic  monopoles which  (if they  exist) are  very
massive  and affect each  other more strongly  than electric charges.
The electronics and  chemistry of such  matter, where the  "electron"
orbitals are extremely close to the nucleus, would be more energetic,
and circuitry  built  of  it  should  be  astronomically  faster  and
smaller, and probably hotter.  Mechanically  it should exhibit higher
strength  to weight  ratios. The critical  superconducting transition
field strengths  and temperatures  would be  higher.   For  monopoles
there is  the possibility of combination  magnetic electric circuitry
which  can contain, among many other  goodies, DC transformers, where
an electric current induces a monopole current at right angles to it,
which  in turn  induces  another electric  current.   One  might also
imagine quantum DC transformers, matter composed of a chainlike  mesh
of alternating orbiting electric and magnetic charges.

	I interpret these things  to mean that the cost  of computing
will  fall  by  a  factor of  100  during  the  next  5  years, as  a
consequence of the processor explosion, and by the usual factor of 10
in the 5 years after that.   As an approximation to what is available
today, note  that in large quantities an LSI-11 sells for under $500.
This provides a moderately fast  16 bit processor with 4K  of memory.
Another $500 could  buy an additional 32K of memory,  if we bought in
quantity. The result  would be a  respectable machine, somewhat  less
powerful  than  the  KA-10,   for  $1000.  At  the  crude   level  of
approximation  employed in the previous  section, a million machines
of this type should permit human equivalence. A million dollars would
provide a thousand of them today (a  much better buy, in terms of raw
processing  power, than  a million dollar  large processor).   In ten
years a million  dollars should provide  the equivalent of a  million
such machines, in the form  of a smaller number of faster processors,
putting human equivalence within reach.

	We now have  the problem of what  to do with this  roomful of
isolated computers.   The next section describes what seems to be the
best known solution to the problem of interconnecting a  large number
of processors with maximum generality, at reasonable cost.
\.
\FDReferences:\F0

McWHORTER, Eugene W.
"The Small Electronic Calculator",
Scientific American, Vol. 234, No. 3,
March 1976, 88-98.

HODGES, David A.,
"Trends  in  Computer  Hardware Technology",
Computer Design, Vol. 15, No. 2,
February 1976, 77-85.

SCRUPSKI, Stephen E., et al.,
"Technology Update",
Electronics, Vol. 48, No. 21, McGraw-Hill,
October 16, 1975, 74-127.

VACROUX, Andre G.
"Microcomputers",
Scientific American, Vol. 232, No. 5,
May 1975, 32-40.

TURN, Rien
"Computers in the 1980's",
Columbia University Press,
Rand Corporation, 1974.

TIEN, P. K.
"Integrated Optics",
Scientific American, Vol. 230, No. 4,
April 1974, 28-35.

HITTINGER, William C.
"Metal-Oxide-Semiconductor Technology",
Scientific American, Vol. 229, No. 2,
August 1973, 48-57.

BOBECK, Andrew H. and Scovil, H. E. D.
"Magnetic Bubbles",
Scientific American, Vol. 224, No. 6,
June 1971, 78-90.

HEATH, F. G.
"Large-Scale Integration in Electronics",
Scientific American, Vol. 222, No. 3,
February 1970, 22-31.

RAJCHMAN, Jan A.
"Integrated Computer Memories",
Scientific American, Vol. 217, No. 1,
July 1967, 18-31.

HITTINGER, William C. and Sparks, Morgan
"Microelectronics",
Scientific American, Vol. 213, No. 5,
November 1965, 56-70.
\FBSection 4:  Interconnecting Processors


\F0\J
	The highest bandwidth and  most flexible way for a  number of
computers  to interact  is by  shared  memory.   Systems of  the size
considered here would require a large, but not  unreasonable, address
space of  100 billion  words (40  bits of  address). They  would also
demand  memories with  a  thousand to  a million  ports.   Although a
variant of the method below could be used to construct such monsters,
they would cost much more than the processors they served.

	Alternatively, we might consider how a mass of the usual kind
of  machine,  each with  a  moderate amount  of  local  memory, could
communicate through  their bandwidth  limited IO  busses.   Define  a
message  period as  the minimum  interval  in which  a processor  may
transmit  and  receive  one  message.    Assuming  that messages  are
addressed   to    particular   destination    machines,   an    ideal
interconnection system would allow every processor to issue a message
during each message period, and deliver the highest priority  message
for each processor to it, providing notices  of success or failure to
the originators.   The delay between transmission  and reception of a
message would be uniform and short, and the cost of the network would
be reasonable (on the order of the cost of the processors served).

	It  so happens  that  this specification  can  be met.    The
following  construction is a design  due to  K.E. Batcher  [Batcher],
extended considerably  to permit  pipelining of  messages and  higher
speed.
\.

\FDLog\!jsup(2); sorting net construction:\F0
\J
	Batcher's major  contribution  is a  method for  constructing
nets  that can sort  N numbers  in log(N)\!jsup(2); time,  using only
N\f2xlog\!jsup(2); N primitive two element sorters.  Two nets  plus a
little additional circuitry can accomplish the message routing task.

	The construction consists of a method for making a merger for
two  ordered  lists of  2N numbers  (call  it a  2N-merger)  from two
N-mergers and 2N additional  primitive elements.  A primitive  sorting
element is a 1-merger.

	An N-merger can thus be  constructed from N\f2xlog\!jsub(2); N
primitive  elements.  A  sorter is made  of a cascade  of ever larger
mergers, forming length two ordered lists from individual inputs with
1-mergers,  combining  these  pairwise   into  length  4  lists  with
2-mergers, and so on, until the entire sorted list of N numbers comes
out of a final N/2-merger. The number of  primitive elements involved
is N\f2xlog\!jsub(2);N\f2x(log\!jsub(2);N+1)/4.
\.
\F2\’άκ=25;\‘΅





                                                  Primitive
                                                   sorting
                                                  elements

                          ’δΗ‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
         A1 ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ        ‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±
                    ’δΗ‘±‘±‘±‘±‘±‘ͺ        ~               ’δΗ‘±‘±‘±’δΙ
         A2 ‘±‘±‘±‘±‘±‘±’δΙ ~’δΗ‘±‘±‘±‘±‘ͺ     3  ‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
                  ~ ~~    ~        ~  ’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
         A3 ‘±‘±‘±‘±‘±‘±‘Ύ‘±$~    %‘±‘±’δΙ  M  ‘΅‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ %‘±‘±‘±$
                  ~  ~       ~  E  ~  ~          ~
         A4 ‘±‘±‘±‘±‘±’δΙ~  ~    ’δΗ‘±‘±$  R  ‘΅‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ~
                 ~~  ~’δΗ‘±‘±‘±‘ͺ     G  ~  ~         ~~ ’δΗ‘±‘±‘±’δΙ
         A5 ‘±‘±‘±‘±‘±‘Ύ‘Ύ‘±‘±$~’δΗ‘±‘±‘ͺ     E  ‘΅‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±’δΙ~%‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
  I              ~~   ~~’δΗ‘±‘ͺ     R  ~  ~’δΗ‘±‘±‘±‘±‘±‘±‘±‘Ύ‘Ύ‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±     O
         A6 ‘±‘±‘±‘±’δΙ~~   ~~~ ~        ‘΅‘±‘±‘Ύ‘Ύ‘±‘±‘±‘±‘±‘±’δΙ~~  %‘±‘±‘±$
  N             ~~~   ~~~ %ԱԱԱԱԱԱԱԱ$  ~~      ~~~                 U
                ~~~   ~~~             ~~      ~~~
  P             ~~~   ~~~             ~~      ~~~  ’δΗ‘±‘±‘±’δΙ          T
                ~~~   ~~~             ~~      ~~%‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
  U             ~~~   ~~~             ~~’δΗ‘±‘±‘±‘±‘±‘Ύ‘Ύ‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±     P
                ~~~   ~~~             ~~~     ~~   %ԱԱԱ$
  T      B1 ‘±‘±‘±‘±‘Ύ‘Ύ‘Ύ‘±‘±‘±$~~ ’δΗ‘±‘±‘±‘±‘±‘±‘±‘±’δΙ  ~~~     ~~                  U
                ~~%‘±‘±‘±‘±‘Ύ‘Ύ‘±‘ͺ        ‘΅‘±‘±$~~     ~~
  S             ~%‘±‘±‘±‘±‘±‘Ύ‘Ύ‘±‘ͺ        ~   ~~     ~~   ’δΗ‘±‘±‘±’δΙ          T
         B2 ‘±‘±‘±’δΙ%‘±‘±‘±‘±‘±‘±‘Ύ‘Ύ‘±‘ͺ     3  ‘΅‘±‘±‘±$~     ~%‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
               ~       ~~ ~        ~    ~’δΗ‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
         B3 ‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±$~ %‘±‘±’δΙ  M  ‘΅‘±‘±‘±‘±$~    ~    %‘±‘±‘±$
               ~        ~    ~  E  ~     ~    ~
         B4 ‘±‘±’δΙ~        ~ ’δΗ‘±‘±$  R  ‘΅‘±‘±‘±‘±‘±$    ~    ’δΗ‘±‘±‘±’δΙ
              ~%‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘ͺ     G  ~          %‘±‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
         B5 ‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±$ ~     E  ‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ   ‘΅‘±‘±‘±‘±‘±
              %‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ     R  ~               %‘±‘±‘±$
         B6 ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ        ‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±
                          %ԱԱԱԱԱԱԱԱ$




\’άκ=34;\‘΅

\JFIG 1:   \F0An illustration of Batcher's  odd-even merger  construction,
     producing a  6-merger from  two  3-mergers  and  five  primitive
     elements. The first and last outputs undergo less delay than the
     intermediate  ones,  which  is  undesirable.  The  delays can be
     equalized by passing the first and last outputs through an extra
     primitive element.  Batcher provides  an alternate construction,
     which he  calls a  "bitonic sorter"  which merges,  has constant
     delay,  and  uses  the  same  number  of  elements  as the above
     construction with the extra element added.\F2\.
\FDCommunication scheme organization:\F0
\F2\’άκ=25;\‘΅





              ’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ   ’δΗ‘±‘±‘±‘±‘±‘±‘±’δΙ   ’δΗ‘±‘±‘±’δΙ   ’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
          1 ‘±’άν~         ~‘±‘±’άν~       ~‘±‘±’άν~   ~‘±‘±’άν~          ~‘±’άν 1
              ~         ~   ~       ~   ~   ~   ~          ~
          2 ‘±’άν~         ~‘±‘±’άν~       ~‘±‘±’άν~   ~‘±‘±’άν~          ~‘±’άν 2
Processor     ~    N    ~   ~       ~   ~   ~   ~          ~      Processor
          3 ‘±’άν~         ~‘±‘±’άν~       ~‘±‘±’άν~   ~‘±‘±’άν~          ~‘±’άν 3
Outgoing      ~  Input  ~   ~       ~   ~   ~   ~          ~     Acknowledge‘±
              ~         ~   ~       ~   ~   ~   ~          ~         ment
 Message  .   ~  Sorter ~ . ~       ~ . ~   ~ . ~          ~ .
          .   ~         ~ . ~       ~ . ~   ~ . ~          ~ .      Input
  Ports   .   ~ (Desti‘± ~ . ~       ~ . ~   ~ . ~    2N    ~ .
              ~  nation ~   ~       ~   ~   ~   ~          ~        Ports
              ~  field) ~   ~       ~   ~ E ~   ~   Input  ~
              ~         ~   ~    N  ~   ~ x ~   ~          ~
          N ‘±’άν~         ~‘±‘±’άν~       ~‘±‘±’άν~ c ~‘±‘±’άν~  Sorter  ~‘±’άν N
              %‘±‘±‘±‘±‘±‘±‘±‘±‘±$   %‘±’δΙ  M  ~   ~ h ~   ~          ~
                              ~  e  ~   ~ a ~   ~ (Source  ~   ԱԱԱԱԱԱԱԱԱԱԱԱԱԱ
                            ’δΗ‘±$  r  ~   ~ n ~   ~   field) ~
                       1 ‘±‘±’άν~    g  ~‘±‘±’άν~ g ~‘±‘±’άν~          ~‘±’άν 1
                            ~    e  ~   ~ e ~   ~          ~
                       2 ‘±‘±’άν~    r  ~‘±‘±’άν~ r ~‘±‘±’άν~          ~‘±’άν 2
                            ~       ~   ~   ~   ~          ~      Processor
              Dummy    3 ‘±‘±’άν~       ~‘±‘±’άν~   ~‘±‘±’άν~          ~‘±’άν 3
                            ~       ~   ~   ~   ~          ~      Incoming
             Message        ~       ~   ~   ~   ~          ~
                          . ~       ~ . ~   ~ . ~          ~ .     Message
            Injection     . ~       ~ . ~   ~ . ~          ~ .
                          . ~       ~ . ~   ~ . ~          ~ .      Ports
                            ~       ~   ~   ~   ~          ~
                            ~       ~   ~   ~   ~          ~
                       N ‘±‘±’άν~       ~‘±‘±’άν~   ~‘±‘±’άν~          ~‘±’άν N
                            %ԱԱԱԱԱԱԱ$   %ԱԱԱ$   %ԱԱԱԱԱԱԱԱԱԱ$

\’άκ=34;\‘΅
   
\JFIG 2:    \F0Block diagram of the multiprocessor interconnection scheme.
The cost per processor, and the delay in the net, grows as the
square of the log of N, the number of processors.\.

\F2\’άκ=25;\‘΅




’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~         Data         ~ Source Address  ~ 0 ~ Priority ~ Destination Ad  ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$

FIG 3:   \F0Processor message format. High order bit is on the right.\F2



’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~        Unused        ~ Position Number ~ 1 ~   Zero   ~ Position Number ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$

FIG 4:   \F0Dummy message format.\F2
\’άκ=34;\‘΅
\F0\J
	The interconnection scheme is diagrammed in Figs  2, 3 and 4.
Each processor is assigned a number, its "address", as indicated.  In
the sorters and the merger  the smaller numbers come out towards  the
top of the diagram.

	Messages are serial bit streams, and consist of a destination
processor  address, a  priority number  (invented by  the originating
computer), a  one  bit  "dummy"  flag  field (set  to  0  for  actual
messages),  the address  of  the  source processor  (i.e.   a  return
address),  and the  data to  be communicated.  A low  priority number
implies high priority. Zero is the highest priority.

	The  net is  assumed to  run  at 100%  duty  cycle, with  the
processors emitting successive synchronized waves of messages.  Every
processor emits  a message  every message  interval.   The  following
discussion examines a single message wave.

	The  first sorting  net orders  the  messages by  destination
address, and within a given destination by priority number.  Thus the
upper inputs of  the merger receive  a list  of messages, grouped  by
destination,  with the  highest  priority message  to each  processor
heading its group.

	The lower  inputs  of the  merger receive  N dummy  messages,
exactly  one for each destination  processor.  The  priority field is
the highest possible  (i.e. zero), the  dummy flag  is 1, the  source
address  is the  same as  the destination,  and the  data portion  is
unused.

	Merging these dummies  with the sorted list of  real messages
results  in a  list  still grouped  by destination,  with  each group
headed  by  a  dummy,  by  virtue  of  its  high  priority,  followed
immediately by  the highest priority  real message,  if any, for  the
destination.

	This list  is fed  to the  exchange  network, which  examines
adjacent  pairs  of  messages (considering  overlapping  pairs),  and
exchanges the  data portions of a pair if the first member happens to
be a dummy to  a given address, and the  second is a real  message to
the same  address (i.e.  it is the  highest priority real  message to
that destination).

	The  sorting  network  following  the  exchanger  sorts   the
messages by the field begining with the dummy flag, which acts as the
high  order bit, followed by  the source address. Since  there were N
real messages, one from each processor, and N dummies, also nominally
one from each processor, and since  real messages are sorted ahead of
dummy messages due to the high order bit, (i.e. the dummy flag) being
0, the second  sorter restores  the messages into  the same order  in
which they were introduced.\.
\J
	Each   processor   has   two   input   ports,   one   labeled
"acknowledgement",   and   the   other   "incoming   message".    The
acknowledgement input of processor i is connected  to output i of the
second sorter. The  incoming message input is connected to output N+i
(i.e. the i'th output of the lower half).

	In the absence  of the exchange  network, the i'th  processor
would receive its own message  back on its acknowledgement input, and
the i'th dummy message on the incoming message input.

	Because  of  the  exchanger, however,  if  the  message  that
processor i had sent  happened to be the highest  priority message to
the  requested destination, then the  data portion of  the message on
the acknowledgement  input would be  that of  the dummy  it had  been
swapped with  (signaling success).   Also, if  any messages  had been
addressed  to processor i,  the data portion of  the highest priority
one would arrive on the incoming message port, in  place of the dummy
message.

	Thus  a  processor  receives  the  highest  priority  message
addressed to  it on its incoming  message port, or a  dummy if nobody
wanted to talk to it. It receives a dummy on its acknowledgement port
if its message has gotten through, or the  message back if it hasn't,
due  to  the existence  of  a  higher priority  message  to  the same
destination.

	Actually  the  serial  nature   of  the  sorter  causes   the
destination and  priority  field to  be lost  in  the source  address
sorter (it tails after the previous wave of messages). In the case of
messages that fail to get delivered, this means that  the originating
processor  must remember  to whom  it  sent the  message (about  four
message times  ago in  a typical design,  due to  the latency of  the
net), if it wants to try again.  This is probably undesirable.  Also,
delivered messages contain no indication of who sent them, having had
their source address field exchanged with that of a dummy, unless the
source address is included in the data field.

	These shortcomings can be overcome  if the exchanger shuffles
the  destination address, source  address and priority  fields in the
manner suggested  by  Figs  5,  6  and 7.    Such  shuffling  can  be
accomplished with  an amount  of storage  at each  exchanger position
equal to the number of bits in the destination and priority fields.
\.
\F2\’άκ=25;\‘΅



\F0Before:\F2

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~        Unused        ~ Destination Ad  ~ 1 ~   Zero   ~ Destination Ad  ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~         Data         ~ Source Address  ~ 0 ~ Priority ~ Destination Ad  ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$

\F0After:\F2

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±’δΙ
~ Priority ~ Source Address  ~          Data        ~ Destination Ad  ~ 1 ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±$

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±’δΙ
~   Zero   ~ Destination Ad  ~         Unused       ~ Source Address  ~ 0 ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±$


FIG 5:   \F0Rearrangements effected by the exchanger in an exchanged pair.\F2





\F0Before:\F2

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~         Data         ~ Source Address  ~ 0 ~ Priority ~ Destination Ad  ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$

\F0After:\F2

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±’δΙ
~ Priority ~ Destination Ad  ~          Data        ~ Source Address  ~ 0 ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±$


FIG 6:   \F0Rearrangements in an unsuccessful message.\F2





\F0Before:\F2

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~        Unused        ~ Destination Ad  ~ 1 ~   Zero   ~ Destination Ad  ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$

\F0After:\F2

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±’δΙ
~   Zero   ~ Destination Ad  ~         Unused       ~ Destination Ad  ~ 1 ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±$


FIG 7:   \F0Rearrangements in an isolated dummy message.\F2
\’άκ=34;\‘΅
\FDPackage counts:\F0
\J
	If the  numbers to  be sorted  are sent  into such  a net  as
serial  bit streams, high order  bit first, then  a primitive sorting
element has  two output wires  labelled "smaller"  and "larger",  two
input wires and a reset and a clock line, and works as follows:

	Just before  the first bit  time the  element is reset.  Bits
then  stream into the input  terminals, and simply stream  out of the
output terminals until the first bit position in which the two inputs
differ  comes along.    At that  instant,  the input  with  the 0  is
connected  to  the  "smaller"  output, and  the  one  with  the  1 is
connected to "larger". This interconnection is latched by the element
and all subsequent bits stream  from the inputs to the outputs on the
basis of it, until the next reset.

	Such a  unit can be  built with  approximately 20 gates,  and
introduces one bit time of delay. Careful design should permit an ECL
version with a 100 or 200 MHz bit rate. These could be packed into 48
(say) pin LSI packages, 8 independent elements per package (the clock
and  reset  lines are  common),  16  per package,  configured  into 4
2-mergers, 24 per  package, arranged into  two 4-mergers, and 32  per
package containing a single 8-merger.

	Larger  mergers  can  be constructed  out of  these using  an
extension  of  the  bitonic  sorter  strategies  given in  [Batcher],
resulting in total  package counts (and  partial eight merger,  four,
two  and  single  element  package  counts)  shown   in  Fig.  8  (to
re-iterate, a merger size of N refers to one which takes two lists of
size N and produces a list of size 2N).
\.


\F2\’άκ=25;\‘΅
’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~  Merger size  ~                        Package counts              ~
~               ~       Total     Eights    Fours    Twos      Ones  ~
‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ
~               ~              ~         ~        ~        ~         ~
~          1    ~         1/8  ~         ~        ~        ~    1/8  ~
~          2    ~         1/4  ~         ~        ~   1/4  ~         ~
~          4    ~         1/2  ~         ~   1/2  ~        ~         ~
~          8    ~           1  ~      1  ~        ~        ~         ~
~         16    ~           4  ~      2  ~        ~        ~      2  ~
~         32    ~           8  ~      4  ~        ~     4  ~         ~
~         64    ~          16  ~      8  ~     8  ~        ~         ~
~        128    ~          32  ~     32  ~        ~        ~         ~
~        256    ~          96  ~     64  ~        ~        ~     32  ~
~        512    ~         192  ~    128  ~        ~    64  ~         ~
~      1,024    ~         384  ~    256  ~   128  ~        ~         ~
‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ
~      2,048    ~         768  ~    768  ~        ~        ~         ~
~      4,096    ~       2,048  ~   1536  ~        ~        ~    512  ~
~      8,192    ~       4,096  ~   3072  ~        ~  1024  ~         ~
~     16,384    ~       8,192  ~   6144  ~  2048  ~        ~         ~
~     32,768    ~      16,384  ~  16384  ~        ~        ~         ~
~     65,536    ~      40,960  ~  32768  ~        ~        ~   8192  ~
~    131,072    ~      81,920  ~  65536  ~        ~ 16384  ~         ~
~    262,144    ~     163,840  ~ 131072  ~ 32768  ~        ~         ~
~    524,288    ~     327,680  ~ 327680  ~        ~        ~         ~
~  1,048,576    ~     786,432  ~ 655360  ~        ~        ~ 131072  ~
~               ~              ~         ~        ~        ~         ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±$


FIG 8:   \F0Package counts for mergers\F2
\’άκ=34;\‘΅
\F0\J
	These counts  can  now be  used to  calculate  the number  of
packages required to build sorters of various sizes:
\.


\F2\’άκ=25;\‘΅

’δΗ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±¦Δ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’δΙ
~  Sorter size ~                            Package counts                 ~
~              ~           Total     Eights    Fours       Twos      Ones  ~
‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ
~              ~                 ~          ~         ~         ~          ~
~         2    ~             1/8 ~          ~         ~         ~     1/8  ~
~         4    ~             1/2 ~          ~         ~     1/4 ~     1/4  ~
~         8    ~             3/2 ~          ~     1/2 ~     1/2 ~     1/2  ~
~        16    ~               4 ~        1 ~       1 ~       1 ~       1  ~
~        32    ~              12 ~        4 ~       2 ~       2 ~       4  ~
~        64    ~              32 ~       12 ~       4 ~       8 ~       8  ~
~       128    ~              80 ~       32 ~      16 ~      16 ~      16  ~
~       256    ~             192 ~       96 ~      32 ~      32 ~      32  ~
~       512    ~             480 ~      256 ~      64 ~      64 ~      96  ~
~     1,024    ~           1,152 ~      640 ~     128 ~     192 ~     192  ~
~     2,048    ~           2,688 ~     1536 ~     384 ~     384 ~     384  ~
‘΅‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘Ύ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘ͺ
~     4,096    ~           6,144 ~     3840 ~     768 ~     768 ~     768  ~
~     8,192    ~          14,336 ~     9216 ~    1536 ~    1536 ~    2048  ~
~    16,384    ~          32,768 ~    21504 ~    3072 ~    4096 ~    4096  ~
~    32,768    ~          73,728 ~    49152 ~    8192 ~    8192 ~    8192  ~
~    65,536    ~         163,840 ~   114688 ~   16384 ~   16384 ~   16384  ~
~   131,072    ~         368,640 ~   262144 ~   32768 ~   32768 ~   40960  ~
~   262,144    ~         819,200 ~   589824 ~   65536 ~   81920 ~   81920  ~
~   524,288    ~       1,802,240 ~  1310720 ~  163840 ~  163840 ~  163840  ~
~ 1,048,576    ~       3,932,160 ~  2949120 ~  327680 ~  327680 ~  327680  ~
~ 2,097,152    ~       8,650,752 ~  6553600 ~  655360 ~  655360 ~  786432  ~
~              ~                 ~          ~         ~         ~          ~
%‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±’ΰΔ‘±‘±‘±‘±‘±‘±‘±‘±‘±‘±$


FIG 9:   \F0Package counts for sorters\F2
\’άκ=34;\‘΅

\F0\J
	An interconnection net for N processors involves an N sorter,
an  N merger  and  a 2N  sorter. Thus  a  128 processor  system would
require 304 48 pin  sorter packages, 2.3  for each processor. A  1024
processor  needs 4224,  roughly  four per  processor.  A size  16,384
system needs  7 for each  computer.  A  million processor  would have
12.75 per machine.

	It is likely  that the biggest  versions of this  system will
require  denser   packaging.    Remember,  though,  that  a  thousand
processor system is sufficient for human equivalence if  each machine
is big and fast enough. It will  probably not be necessary to build a
megaprocessor in the decade envisioned here.\.
\FDSpeed calculations:\F0
\J
	As outlined, a  message consists of a destination  address, a
priority,  a  bit, a  source  address  and a  data  portion.  The two
addresses must  be at  least large  enough to  uniquely specify  each
machine. Considering  the case  of a thousand  and a  million machine
system,  we note that the  address lengths are 10  and 20 bits. Let's
make  the  priority   field  the  same   length,  leaving  room   for
considerable flexibility in  priority assignment schemes. The message
portion should be fairly long,  to permit messages like memory  write
requests, which  require both  a memory  address and  the data.  Four
address  lengths, say.  This gives  us a  message length  7 addresses
long, 70 bits in a 1000 processor, 140 bits in a megaprocessor.

	The full time taken by  a message from start of  transmission
to completion of reception, in bit times, is the message length, plus
the  depth of the net (in primitive  elements), plus an address and a
priority time due to the buffering at the exchanger.

	The depth  of  a 1024  processor net  is  110 elements.  This
combines with the message and exchanger delays to result in a transit
time of 110+70+20 = 200  bit times. If the  bit rate is 100 MHz  then
messages are delivered  in two microseconds.  If the  bit rate is 200
MHz,  the time  is  1 microsecond.  The net  contains about  two full
message waves at any instant,  and a message time is  700 nanoseconds
for 100 MHz, or 350 nanoseconds for 200 MHz.

	The same calculation  for a megaprocessor reveals  a depth of
420  and a  total transit time  of 420+140+40  = 600  bit times. This
corresponds to 6 and 3 microsecond transits for  100 and 200 MHz data
rates.   Corresponding  message times  are  1.4 microseconds  and 700
nanoseconds.  The net contains slightly more than 3 message  waves at
a time.\.
\FDPossible refinements:\F0
\J
	The transit time of the net can be  decreased, at the cost of
increasing  the  message  times a  little,  by  running  some of  the
primitive sorter elements  asynchronously, clocking (and  introducing
bit  time delays)  at larger  intervals.   For instance,  an 8-merger
might  accept a bit  time worth  of inputs, which  would then trickle
through four stages of primitive sorters built without shift register
delays  between them. When  everything had  settled down,  the entire
merger would  be clocked,  and those  elements which  had decided  to
latch at this bit time would do so.  The delay of the unit is one bit
time rather  than four, which reduces the 110 or  the 420 term in the
transit calculations. The  settling time is  longer, however, so  the
bit rate would be slower. Perhaps a factor of two in transit time can
be gained in this way, at a cost of 1.5 in data rate.

	At an increase in gate count,  several levels of asynchronous
primitive sorters can be replaced by an equivalent circuit with fewer
gate delays.   This might enable a  given data rate to  be maintained
while the transit time was reduced.

	[Van  Voorhis] offers  some  slight reductions  in  primitive
sorter  count,  essentially by  substituting  special  case solutions
better than the systematic construction wherever possible in  a large
net.  Unfortunately  these invariably have an uneven  amount of delay
along the  various paths, making them almost worthless as synchronous
nets. They may be useful as designs for asynchronous subnets.

	I  have  generalizations  of  Batcher's  constructions  using
primitive  elements that  are M  sorters (as  opposed to  2 sorters).
These allow building a merger  which combines M sorted lists of  size
N\f2xM into a single sorted M\f2xM\f2xN length list, out of M mergers
each capable  of merging M size  N lists, and some  layers of extra M
sorter primitive elements to combine the output of the mergers (these
are analogous to the single  layer of 2-sorters following the mergers
in  the odd-even merger  of Fig 1.   The number of  such layers grows
(empirically) roughly  as  log(M)).   Although  the number  of  gates
needed  for a  given  size sorter  grows  slowly with  M, the  number
packages used shrinks.  This is because each package, being a  sorter
rather than  a merger,  contains more  logic.   My constructions  are
complete for M’β§8, and partially complete for general M.
\.


\J
	Well, now we have  a room full of not only  processors, but a
massive switching system as well. Can it be made to pay for its keep?
The  next  section   examines  some   programming  implications   and
opportunities.\.
\FDReferences:\F0

BATCHER, K.E.
"Sorting Networks and their Applications",
1968 Spring Joint Computer Conference Proceedings
April 1968, 307-314.

VAN VOORHIS, David C.
"An Economical Construction for Sorting Networks",
1974 National Computer Conference Proceedings
April 1974, 921-927.

KNUTH, D.E.
"Sorting and Searching",
The Art of Computer Programming, Vol. 3
Addison-Wesley, 1973.


\FBSection 5:  Programming Interconnected Processors


\F0\J
	A major  feature of the  scheme outlined is  its flexibility.
It can  function as any of the  fixed interconnection patterns of the
current lackluster multiprocessors, like Illiac IV, or as a hexagonal
mesh, or a  7 dimensional cubic  lattice, should that be  desired, or
the tree organization being considered in a Stanford proposal. It can
simulate programmed  pipeline machines, such  as CDC  Star, by  using
processors as arithmetic units. What is  more, it can do all of these
things  simultaneously, since messages within  one isolated subset of
the processors have no effect on messages in a disjoint subset.  This
permits a  very convenient kind  of "time" sharing,  where individual
users get and return processors as their demands change.

	Such  mimicry fails  to  take  advantage  of the  ability  to
reconfigure  the interconnection  totally every  message wave.  It is
easy to find particular  applications, such as tree searching,  where
the net could be used effectively by special purpose programs.  It is
clearly  desirable to  have  systems which  minimize the  user effort
needed to implement these algorithms.



	As an example  of a high  level language  well suited to  the
architecture, consider the following parallel variant of Lisp:\.
\FDA little Lisp:\F0
\J
	Take an existing  Lisp and purify  it to something  closer to
its lambda calculus  foundations.  Flush  RPLACA and RPLACD  and even
SETQ (a  pseudo SETQ can be introduced  later), and discourage PROGs,
which are inherently  sequential and do things  that could have  been
stated more clearly as recursive functions.  In this system recursive
programs are also more efficient.

	An evaluation is handled  as follows.  A free  processor (one
not currently  doing an evaluation) receives  a message demanding the
value of Expr with variable bindings  given in ALIST.  Call this  the
task [Expr,ALIST].  If the top level of Expr is F(exprA,exprB,exprC),
it    generates   the   tasks    [exprA,ALIST],   [exprB,ALIST]   and
[exprC,ALIST], and passes them to three other  free processors. These
evaluate them and  sooner or later return valA  (the value of exprA),
valB and valC to the original processor.  This processor then  refers
to the definition of F,  and in particular to the names  of the dummy
arguments  in the definition.   Suppose  these were A,  B and C.   It
combines the  ALIST it was  given with  the (name,  value) pairs  (A,
valA), (B, valB), (C, valC) to form  a new ALIST'.  Then it generates
the task [bodyF,ALIST'], where bodyF is the body of the definition of
F, which is passed to another free processor. On  receiving the value
of this  expression, it passes  it back to  whoever had given  it the
original task, and then declares itself free.

	If Expr  had  happened to  be an  atom,  the processor  would
simply have  looked it up in ALIST, and returned  its value. If F had
been a predefined system function (such as CAR or CONS)  the sequence
of  actions would  have  been  whatever the  machine  code for  those
functions (of which each processor would have a copy) specified.

	The parallelism comes from  the fact that the arguments  in a
multi-argument function can be evaluated simultaneously.  This causes
moderate  parallelism  when  functions  are  nested,  and  can  cause
explosive  parallelism when  a function  which at  some level  uses a
multi-argument  function, invokes  itself recursively  (as in  a tree
search).  Most non-trivial programs stated as  recursive functions do
this.

	The description  above implies that  the processor  waits for
the  results of  tasks it has  farmed out  to other machines.   These
waits can be arbitrarily long. Also, the number of tasks  spawned can
easily  become more  than the  number  of processors.  In that  case,
presumably,  a processor with a  task to farm out  would have to wait
until a machine becomes free.   Keeping a processor idle  under these
conditions is clearly undesirable.\.
\J
	If each  processor were  time shared,  pretending to be  many
machines, then when  the job being run becomes temporarily idle there
may be another  to switch to.  When a message  pertaining to a  given
waiting job arrives, the processor deals  with it, and if it provides
the information necessary to allow that job to resume, it is resumed.
This scheme  makes the  number of processors  available seem  larger,
perhaps enough to  make it possible to acquire a  processor for a new
task simply by picking  a machine at  random and asking  if it has  a
free job slot.  This  works well if the answer is  usually yes, (i.e.
if  there reasonably  more  slots than  tasks), and  replaces  a more
complicated free processor pool method that must be able to deal with
many requests simultaneously.

	Alternatively instead of each processor having its own little
pool of  running jobs, the whole Lisp  system can maintain a communal
pool,  which processors  refer  to  as  they  become  free.  In  this
organization a task that must pause  is placed into the pool (freeing
the  processor that was running  it), to be taken up  again by a free
processor when its requirements  are met.  Moderate  processing power
is required to manage the pool.

	The list structure is spread out among  all the processors in
the system. Pointers consist of a processor number and address within
processor field.  A machine evaluating an expression whose  top level
function is CONS creates the new cell in its own memory. A CAR or CDR
involves  sending  a message  to the  processor  which owns  the cell
involved, and waiting  for a  reply (thus a  CONS is usually  cheaper
than a CAR!).

	Since  RPLACA and  RPLACD  are eliminated,  circular  lists
cannot be created.  This permits garbage collection to be mediated by
reference counts.   A small fixed reference  count field (three  bits
wide, say)  is part  of each cell.   A processor  doing a  CONS sends
messages  to  the owners  of  the cells  that  the new  cell  will be
pointing  to,  indicating  that  their  reference   count  should  be
incremented. If a cell getting a message of this type has a reference
count that has already reached maximum  (7 in our example), it  sends
back  an  indignant   message  to  the  CONS'ing   processor  saying,
effectively,  "I'm full, you can't  point to me. But  here are my CAR
and my CDR, roll your own and point to that". This not only makes the
reference count  field size  manageable, but reduces  message traffic
jams  to processors containing very  popular cells.   It does make it
mandatory to use EQUAL rather  than EQ when comparing lists.   When a
processor  no longer requires  a cell  to which it  had a  pointer it
sends a message to  the owner saying so.  The reference count of  the
cell is decreased  by one, and if it  reaches zero, it is  freed, and
similar messages are sent to the cells pointed to by its CAR and CDR.
This  garbage   collection   process   goes   on   continuously   and
independently of the main computations.\.
\J
	How is the A-list  of bindings to be handled?   The canonical
representation,  using  a  simple  list  of  dotted  pairs,  is  very
inefficient, since it forces a sequential search. A hash table, as in
most current Lisp implementations is also undesirable, partly because
it requires  an incompatible data type (a relatively large contiguous
block of memory), but primarily because in this parallel system there
are   many  different   contexts  (in   different  branches   of  the
computation) active at one time.  The entire hash table would need to
be duplicated whenever new bindings are made (i.e.  at every level of
evaluation), since the older context is in use in another branch.

	A nice alternative  is to use  essentially an A-list,  but to
structure it as a balanced tree, with each node containing a binding,
a left subtree for those variables with names lexicographically (say)
less  than it, and a  right subtree for  those greater.   An entry in
such a structure can be  found in time proportional to the  logarithm
of the  number of  nodes, and  a version  of the  list with  an entry
added,  deleted  or modified  can  also  be constructed  in  log time
without affecting the  original in  any way, by  building a new  path
down to the element (and sometimes a few side paths, to keep the tree
reasonably balanced), which  points to  many of the  subtrees of  the
original.   Since  this  structure  can  be built  of  standard  list
elements,  it can be  managed via  the normal allocation  and garbage
collection mechanisms. Call this structure an A-tree.

	In general applications  a balanced tree is capable  of being
used  in this system more  effectively than a simple  list, because a
parallel process can get  to all the nodes  in log time, whereas  the
last node  takes linear time  in a linear  list.  This  suggests that
programmers  wanting maximum  efficiency will  be coerced  into using
them much more  frequently than is now  the case. This is  similar to
the pressure on present day Lisp programmers to use PROGS rather than
recursive functions, because compiled PROGS run faster.

	The [expr,ALIST]  task description  is conveyed  simply as  a
pair of pointers,  expr to an S expression version of expr, and ALIST
to the  head of  the  balanced A-tree.   The  result returned  in  an
evaluation is likewise a pointer.

	Since  the list  structure  is  only  built upon,  and  never
altered,  it is possible to  speed up the operation  of the system by
having individual  processors  maintain software  managed  caches  of
frequently referred  to cells  that reside  in remote  machines. This
cuts down on the message traffic, and generally speeds things up.\.
\FDA little Algol:\F0
\J
	Although  recursive functions  provide  an excellent  way  of
exploiting  a very  parallel architecture, there  are other  ways. An
algorithmic (iterative)  language can  be  made to  serve, if  a  few
features are present.

	Existing Algol unfortunately has few of  these.  Consider the
following pair of statements excerpted from an imaginary program:
\.
\F2
	A := 3 ;
	B := 4 ;
\F0\J
These say that first A  is set to 3, then B is assigned  the value 4.
There  is an implied  sequentialness. Presumably  the programmer knew
that the  order of  the assignments  was unimportant,  and that  they
could be done simultaneously. The  syntax of the language provided no
way for him to indicate this.

	A simple  construct which permits information of this type to
be conveyed is the "parallel semicolon", which we will  indicate by a
vertical bar "|". Statements  seperated by parallel semicolons may be
executed simultaneously. The following is an example of its use:
\.
\F2
	A := 3 |
	B := 4 |
	C := 6 ;
	D := A+B+C ;
\F0\J
The first three statements may be executed  together, the fourth must
wait  for their completion.  It is implied  that compound statements,
bracketed by BEGIN END pairs  can be similarly separated by  parallel
as well as sequential semicolons.

	If compound statements are permitted to return as a value the
value of their last component statement, then the structure
\.
\F2
         T :=  BEGIN
               INTEGER A,B,C ;
 
               A := expressionA |
               B := expressionB |
               C := expressionC ;

               expression involving A, B and C

               END;
\F0\J
is equivalent to  a lambda expression construct  in Lisp. A, B  and C
are  the  dummy  arguments, evaluated  in  parallel,  and the  fourth
expression, which  must wait  for A,  B and  C, is  the  body of  the
lambda.   The constructs  involved can  be used  in many  other ways,
unlike a real lambda expression.

	If the  Algol  also has  recursion, then  it  is possible  to
obtain massive  parallelism in the same way  it was achieved in Lisp,
namely  by  writing  functions  which  recurse  deeply,   and  invoke
themselves a few times in parallel at each level.

	A minor form  of parallelism already  automatically available
resides in arithmetic expressions. Different subexpressions of larger
formulas can be evaluated simultaneously, and then combined.  This is
analogous to the evaluation of Lisp expressions.

	More massive concurrency can be obtained if the data types of
Algol are  expanded to include a complete  set of array operators and
genuine dynamic allocation  of arrays. It  is this  type of data  and
operator set that  makes APL an extremely powerful  language in spite
of an execrable control structure.

	The   idea  is   that   operators   such  as   addition   and
multiplication work  not only on simple variables representing single
numbers, but on whole multi-dimensional arrays. Since arrays are more
complex  objects  a  much larger  range  of  operators  is  required.
Included   are  such   things   as  array   restructuring,  subscript
permutation  (generalized  transpose),  element  shuffling,  subarray
extraction, generalized cross and dot  products, etc..  In general an
operator combines  one or more arrays and produces a value which is a
new array, often of a different size and shape.

	Compounding of such operators substitutes  for a large amount
of explicit  loop control, and results  in substantially more compact
source code.  On  conventional computers  such  code also  runs  much
faster because  most the run  time is  spent in carefully  hand coded
implementations of the basic operators, which do a great deal at each
invokation  if  the arrays  are  large.    Typical  equivalent  Algol
programs   execute  second   rate  compiler   produced   code  almost
exclusively.  An  operator  set  of  this  kind  provides  the   same
capabilities  as a  hypothetical parallel  FOR  loop construct,  with
greater clarity.  Conditionals can be handled  by first selecting out
subarrays  according  to  the   condition,  and  then  applying   the
appropriate operations.\.
\J
	It is desirable in  our system for large arrays  to be spread
out among many  machines, so that array operations can be carried out
in parallel  where  possible.   An  often  occurring process  is  the
application of an arithmetic or other  operator to one or more arrays
of a  given size, producing a result of the same size.  This could be
made maximally  efficient  if corresponding  elements  of the  arrays
resided in  the same  processor.  Assigning  a processor to  an array
element by means of a hash function of both the array dimensions  and
the indices of the element will cause arrays  of different size to be
stored  in different places,  but will put  corresponding elements of
arrays of the same size in the same place.

	Now comes the question of how a gaggle of processors managing
a given array  gets word that an operation concerning the array is to
be executed. The initiator of  the operation is the single  processor
running the  code requesting it.   The fastest way to  propagate such
information  is  to  initiate  a  "chain  letter".   The  originating
processor can  consider  the array  as two  equal  (or almost  equal)
pieces, and send a message  to a member of each piece. Each recipient
then divides the piece of which it is a member into two, and sends  a
message to a  representative of each  of those smaller  subsets, etc.
When the subset size becomes one, the operation is performed.

	It may be more efficient to store  larger pieces of arrays in
individual  processors.  This adds  a slight serial  component to the
run times, but saves message handling time.

	How are programs communicated from processor to processor, as
the  number of executing instruction  streams grows?  It  would be an
extravagant waste of memory to duplicate the entire source program in
every  processor (besides,  it might  not fit).   A  software caching
scheme  is  called for.  When  a  processor initiates  a  subtask, it
obtains a free  processor and passes to  it a moderate amount  of the
code needed for the task.   When the new processor runs out, it sends
a message back to the first machine  for more.  This machine trys  to
obtain it from its own  cache, and if it too is out,  sends a message
further up  in the hierarchy, to the  machine which had initiated the
task which  it is  running, and  so on  until the  requested code  is
obtained. At top  level the entire runnable program is  assumed to be
available from  some sort of file, maintained by the combined storage
of a number of processors.\.
\FDA little operating systems:\F0
\J
	Assume  that  most  input/output  devices  are  connected  to
individual processors,  and that, to maximize  the bandwidth, each of
these processors has  only one,  or a  most a very  small number,  of
devices associated with  it. Included among these  are communications
lines  to user terminals,  so that each  console talks  directly to a
dedicated processor with sorting net access to the entire system.

	If we expect to support more than one  user on such a system,
it  will be  necessary to  have  some type  of protection  scheme, to
prevent  one  user's processes  from  accidentally  interfering  with
another's.

	If each  processor contains a  monitor, then  message sending
can handled by system calls. The monitor can then check for validity,
testing if the requested destination is within the set  of processors
assigned to the user.  This  monitor, which every machine assigned to
a user  must run, can be flexible enough to time share the machine it
runs on, to provide multiple simulated processors.

	Controlling the state of the individual machines' monitors is
the  task of a global  system monitor, operated  by several machines,
which maintains a pool  of free processors,  and parcels them out  on
request, and  which also handles  file system requests  (bulk storage
would be connected to a handful of the processors), and allocation of
other devices.

	Processes belonging to a  single user will be initiated  by a
particular master machine, probably the one connected to his console.
This  master   can   create   a  tree   of   subprocesses,   possibly
intercommunicating,  running  on  different machines.  It  should  be
possible,  for example,  to have  one subset  configured as  an array
processor  for   efficient  implementation   of   low  level   vision
operations,  while  another is  running  an  Algol/APL  for the  less
structured analytic geometry needed to interpret the image, and yet a
third is operating a  Lisp system doing abstract reasoning  about the
scene.  Many existing systems  permit this kind  of organization, but
they are hampered  by having  an absurdly small  amount of  computing
power.\.
\J
	How is a  system of this  kind initialized, and how  does one
abort  an out of control  process taking place in  part of it without
affecting the rest?  A possibility is to have an "executive" class of
messages (perhaps signalled by a particular bit in the data portion),
which  user  jobs  are not  permitted  to emit.    Reception  of such
messages might cause  resetting of the  processor, loading of  memory
locations within it, and starting execution at a requested locations.
A single externally controllable  machine can be  used to get  things
going, fairly quickly if emits a self replicating "chain letter".

	Now consider reliability.  The  system can obviously tolerate
any  reasonable number of inoperable  processors, by simply declaring
them unavailable for use. Failures in the communication net  are much
more serious,  and under most  situations will require the  system to
stop operating  normally. It is possible to write diagnostic programs
which can  track down  defective comparator elements  or broken  data
wires. If  something should  happen to the  clock signals to  a given
level it  would  be necessary  to  wheel  out an  oscilloscope.    If
reliability were a critical  issue it would be possible  to include a
duplicate net, to run things the while other was being debugged.\.
\FDDisclaimer:\F0
\J
	The software outlines  are obviously only partly  baked. This
is mostly due to the limited amount of thought and work that has gone
into them. On the other hand, it is my  belief that even well thought
out designs at  this point will look naive in the light of experience
with a working version  of such a machine.   Many of the  fundamental
decisions  depend on  things  difficult  to estimate,  including  the
number  of processors,  communication  speed compared  with processor
speed, memory size, and most importantly the kinds, sizes and  mix of
operations that  people will  tend to run  (these will  surely differ
from what is being done now, the limitations being so different).

	This is not to  say that more thought isn't called  for.  The
nature  of  the  memory  protection,  interrupt  scheme,  word  size,
instruction set,  I/O structure,  etc.   of  the individual  machines
should  be  tailored  to  permit  convenient  implementation  of  the
individual  processor operating  systems, and  typical message types.
What these  are  can  best be  determined  by  trying to  write  some
monitors, interpreters, compilers and array crunchers.

	Simulation of the system on a conventional machine is next to
useless.   Since the total amount of  memory envisaged is larger than
conventional machines carry, and since a simulation will have to jump
around from simulated core image to simulated core image, in order to
keep a realistic message synchrony, there would be an enormous amount
of disk accessing. The  slowness of this, combined with  the inherent
speed difference,  would allow  only the most trivial (and  therefore
unrepresentative) things to be tried, and not many of these.

	Construction of a  moderate size version  is a better  use of
manpower.  Undoubtedly there will be  mistakes made in  the first (or
first  few) models,  which  will  become  apparent  after  a  bit  of
experience.

	The flexibility  offered should  make this  design much  more
attractive to  a large class of  programmers than current essentially
special purpose architectures  such as  ILLIAC IV. Making  it out  of
existing processors with proven machine  languages will help too.  I,
for one, can hardly  wait to start programming even  a flawed version
of a machine  that can process and generate real time television with
programs written in Algol, and simultaneously jump over tall game and
proof trees in a single bound.\.
\FB
Bombast:


\F0\J
	The enormous shortage of ability to compute is distorting our
work,   creating  problems  where  there   are  none,  making  others
impossibly difficult, and generally causing effort to be misdirected.
Shouldn't this  view be  more widespread,  if it is  as obvious  as I
claim?

	In the early  days of AI  the thought that  existing machines
might  be much  too small  was  widespread, but  there was  hope that
clever mathematics and advancing computer technology could  soon make
up the difference. Since then computers  have improved by a factor of
ten every  five years, but, in spite of reasonably diligent work by a
reasonable number  of people,  the results  have been  embarrassingly
sparse.  The realization  that available compute power might still be
vastly inadequate has since been swept under the rug, due to  wishful
thinking and  a feeling that  there was nothing  to be done  about it
anyway  and  that  voicing  such an  opinion  could  cause  AI to  be
considered impractical, resulting in reduced funding.

	There is also an element of scientific  snobbery. Many of the
most  influential names in the  field seem to feel  that AI should be
like the theoretical side of physics, the essential problem  being to
find the  laws of universe  relating to intelligence. Once  these are
known,  the  thinking  goes,  construction  of  efficient intelligent
machines  will  be  trivial.    Suggestions  that  the  problems  are
essentially  engineering ones  of scale  and  complexity, and  can be
solved by  incremental  improvements  and  occasional  insights  into
sub-problems, are treated with disdain.

	This attitude  is a variant  of the philosophical  notion that
all  truth can be  arrived at by  pure thought, and  is unfounded and
harmful. One  wonders what  state space  travel would  be  in if  the
Goddards  and von  Brauns had  spent their  time trying  to find  the
universal  laws of rocket  construction before trying  to build space
ships. AI needs a stronger experimental base.  Like other branches of
endeavor  (notably physics, aeronautics  and meteorology),  we should
realize our desperate need for more computing, and do things about
it.
\.





\C\fFc