\|\\@macroz.pox[doc,hpm];\_=0;\\_=34;\\F1\N \MBMETLBI;\MDNONMBI;\MEMS25;\MFNONMI;\N \_'0;\#\EVERYPAGE[\P\_#\+=1;\#\oEP{2 \D#}\WEP,=1435;=150;\p]\N \FB\Y=700; \CIntelligent Machines: \CHow to get there from here \Cand \CWhat to do afterwards \Y=1600; \F1\_R\r\-=300;\l \cHans P. Moravec \cComputer Science Dept. \cStanford University \cSeptember 3, 1977 \cthis version \c\!DATE; \'100; \Y=660; \FDAcknowledgement:\F1 \F1\J The following entities provided inspiration, encouragement, suggestions, proofreading: (in carefully randomized order) HAL-9000, Bruce Baumgart, Marc Le Brun, Andy Moorer, Bill Gosper, PDP-KA10, Scientific American, Don Gennery, Mike Farmwald, Erik Gilbert, John McCarthy, Macsyma, Pierre van Nypelseer, Robert Maas, Ed Mcguire, Electronics magazine, Russ Taylor, Elaine Kant, Les Earnest, Arthur Thomas, Cart Project, Polle Zellweger, Jeff Rubin, Tom Binford, Clem Smith, Tom Gafford, Brian Harvey, ... \. \F1\JThis essay is an amalgam of "\FFThe Role of Raw Power in Intelligence\F1" and "\FFTodays Computers, Intelligent Machines and Our Future\F1".\. \_=34;\\_=0;\\N \FBIntroduction:\F1 \J The unprecedented opportunities for experiments in complexity presented by the first modern computers in the late 1940's raised hopes in early computer scientists (eg. John von Neumann and Alan Turing) that the ability to think, our greatest asset in our dealings with the world, might soon be understood well enough to be duplicated. Success in such an endeavor would extend mankind's mind in the same way that the development of energy machinery extended his muscles. In the thirty years since then computers have become vastly more capable, but the goal of human performance in most areas seems as elusive as ever, in spite of a great deal of effort. The last ten years, in particular, has seen thousands of people years devoted directly to the problem, referred to as Artificial Intelligence or AI. Attempts have been made to develop computer programs which do mathematics, computer programming and common sense reasoning, are able to understand natural languages and interpret scenes seen through cameras and spoken language heard through microphones and to play games humans find challenging. There has been some progress. Samuel's checker program can occasionally beat checker champions. Chess programs regularly play at good amateur level, and in March 1977 a chess program from Northwestern University, running on a CDC Cyber-176 (which is about 20 times as fast as previous computers used to play chess) won the Minnesota Open Championship, against a slate of class A and expert players. A ten year effort at MIT has produced a system, Macsyma, capable of doing symbolic algebra, trigonometry and calculus operations better in many ways than most humans experienced in those fields. Programs exist which can understand English sentences with restricted grammar and vocabulary, given the letter sequence, or interpret spoken commands from hundred word vocabularies. Some can do very simple visual inspection tasks, such as deciding whether or not a screw is at the end of a shaft. The most difficult tasks to automate, for which computer performance to date has been most disappointing, are those that humans do most naturally, such as seeing, hearing and common sense reasoning. A major reason for the difficulty has become very clear to me in the course of my work on computer vision. It is simply that the machines with which we are working are still a hundred thousand to a million times too slow to match the performance of human nervous systems in those functions for which humans are specially wired. This enormous discrepancy is distorting our work, creating problems where there are none, making others impossibly difficult, and generally causing effort to be misdirected. In the early days of AI the thought that existing machines might be much too small was widespread, but people hoped that clever mathematics and advancing computer technology could soon make up the difference. The idea that available compute power might still be vastly inadequate has since been swept under the rug, due to wishful thinking and a feeling that there was nothing to be done about it anyway and that voicing such an opinion could cause AI to be considered impractical, resulting in reduced funding. This attitude has had some bad effects, one of them being that AI research has been centered on computers less powerful than absolutely necessary.\. \J The first section of this essay discusses natural intelligence. It notes two major branches of the animal kingdom in which intelligence evolved independently, and suggests that it is easier to construct than is sometimes assumed. The second part compares the information processing ability of present computers with intelligent nervous systems. The factor of one million is derived in two different ways. Section three examines the development of electronics, and concludes the state of the art can provide more power than is now available, and that the one million gap could be closed in ten years. Parts four and five introduce some hardware and software aspects of a system which would be able to make use of the advancing technology, providing a means for achieving human equivalence, perhaps by the next decade. Part six considers the implications of the emergence of intelligent machines, and concludes that they are the final step in a revolution in the nature of life. Classical evolution based on DNA, random mutations and natural selection may be completely replaced by the much faster process of intelligence mediated cultural and technological evolution.\. \, \FBSection 1: The Natural History of Intelligence\F1 \FDProduct lines:\F1 \J Natural evolution has produced a continuum of complexities of behavior, from the mechanical simplicity of viruses to the magic of mammals. In the higher animals most of the complexity resides in the nervous system. Evolution of the brain began in early multi-celled animals a billion years ago with the development of cells capable of transmitting electrochemical signals. Because neurons are more localized than hormones they allow a greater variety of signals in a given volume. They also provide evolution with a more uniform medium for experiments in complexity. The advantages of implementing behavioral complexity in neural nets seem to have been overwhelming, since all modern animals more than a few cells in size have them [animal refs.]. Two major branches in the animal kingdom, vertebrates and mollusks, contain species which can be considered intelligent. Both stem from one of the earliest multi-celled organisms, an animal something like a hydra made of a double layer of cells and possessing a primitive nerve net. Most mollusks are intellectually unimpressive sessile shellfish, but one branch, the cephalopods, possesses high mobility, large brains and imaging eyes. These structures evolved independently of the corresponding equipment in vertebrates and there are fascinating differences. The optic nerve connects to the back of the retina, so there is no blind spot. The brain is annular, forming a ring encircling the esophagus. The circulatory system, also independently evolved, has three blood pumps, a systemic heart pumping oxygenated blood to the tissues and two gill hearts, each pumping venous blood to one gill. The oxygen carrier is a green copper compound called hemocyanin, evolved from an earlier protein that also became hemoglobin.\. \J These animals have some unique abilities. Shallow water octopus and squid are covered by a million individually controlled color changing effectors called chromatophores, whose functions are camouflage and communication. The capabilities of this arrangement have been demonstrated by a cuttlefish accurately imitating a checkerboard it was placed upon, and an octopus in flight which produced a pattern like the seaweed it was traversing, coruscating backward along the length of its body, diverting the eye from the true motion. Deep sea squid have photophores capable of generating large quantities of multicolored light. Some are as complex as eyes, containing irises and lenses [squid]. The light show is modulated by emotions in major and subtle ways. There has been little study of these matters, but this must provide means of social interaction. Since they also have good vision, there is the potential for high bandwidth communication. Cephalopod intelligence has not been extensively investigated, but a few controlled experiments indicate rapid learning in small octopus [Boycott]. The Cousteau film in the references shows an octopus' response to a problem requiring a two stage solution. A fishbowl containing a lobster is sealed with a cork and dropped into the water near it. The octopus is attracted, and spends a long while alternately probing the container in various ways and returning to its lair in iridescent frustration. On the final iteration it exits its little hole in the ground and unhesitatingly wraps three tentacles around the bowl, and one about the cork, and pulls. The cork shoots to the surface and the octopus eats. The Time-Life film contains a similar sequence, with a screw top instead of a cork! If small octopus have almost mammalian behavior, what might giant deep sea squid be capable of? The behavior of these large brained, apparently shy, animals has virtually never been observed. Birds are more closely related to humans than are cephalopods, their common ancestor with us being a 300 million year old early reptile. Size-limited by the dynamics of flying, some birds have reached an intellectual level comparable to the highest mammals. Crows and ravens are notable for frequently outwitting people. Their intuitive number sense (ability to perceive the cardinality of a set without counting) extends to seven, as opposed to three or four in us. Such a sense is useful for keeping track of the number of eggs in a nest. Experiments have shown [Stettner] that most birds are more capable of high order "reversal" and "learning set" learning than all mammals except the higher primates. In mammals these abilities increase with increasing cerebral cortex size. In birds the same functions depend on areas not present in mammalian brains, forebrain regions called the "Wulst" and the hyperstriatum. The cortex is small and relatively unimportant. Clearly this is another case of independent evolution of similar mental functions. Penguins, now similar to seals in behavior and habitat, might be expected to become fully aquatic, and evolve analogously to the great whales.\. \J The cetaceans are related to us through a small 30 million year old primitive mammal. Some species of dolphin have body and brain masses identical to ours, and archaeology reveals they have been this way several times as long. They are as good as us at many kinds of problem solving, and perhaps at language. The references contain many anecdotes, and describe a few controlled experiments, showing that dolphins can grasp and communicate complex ideas. Killer whales have brains seven times human size, and their ability to formulate plans is better than the dolphins', on whom they occasionally feed. Sperm whales, though not the largest animals, have the world's largest brains. There may be intelligence mediated conflict with large squid, their main food supply.\. \J Elephants have brains about five times human size, matriarchal tribal societies, and complex behavior. Indian domestic elephants usually learn 500 commands, limited by the range of tractor-like tasks their owners need done, and form voluntary mutual benefit relationships with their trainers, exchanging labor for baths. They can solve problems such as how to sneak into a plantation at night to steal bananas, after having been belled (answer: stuff mud into the bells). And they remember for decades. Inconvenience and cost has prevented more elephant research. The apes are our cousins. Chimps and gorillas can learn to use tools and to communicate with human sign languages at a retarded level. As chimps have one third, and gorillas one half, human brainsize, similar results should be achievable with the larger brained, but less human-like animals. Though no other species has managed to develop a technological culture, it may be that some of them can be made partners in ours, accelerating its evolution with their unique capabilities. \. \, \F0\_=25;\\_=-7;\ Time before present Representative Creatures Significant events 0 (you are here) ~ ~ ~ ~ ~ computers massive technology 2.5 million years ~ ~ ~ ~ ~ ~ 10 ~ ~ ~ ~ elephants ~ tool use ~ ~ ~ whales ~ primates 40 ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 90 octopus squid ~ ~ ~ ~ ~ ~ ~ %$ 160 %$ birds mammals ~ ~ ~ learned behavior 250 early squid %$ warm bloodedness ~ reptiles 360 ~ ~ cephalopods fish ~ 490 ~ ~ amphibians land vertebrates % %$ 640 mollusks vertebrates ~ ~ 810 ~ ~ complex nerve centers %$ 1 billion years ~ invention of the neuron ~ old age death 1.21 ~ sex in animals perfected ~ 1.44 ~ multi-cellular animals animals 1.69 ~ plants ~ 1.96 ~ ~ oxygen to support animals % 2.25 ~ ~ 2.56 blue-green ~ nucleated cells algae ~ 2.89 % ~ DNA genetics? 3.24 ~ photosynthesis earliest cells reliable reproduction 3.61 ~ invention of the cell ~ inorganic protein microspheres 4 billion years non-living chemicals amino acid formation \_=34;\\_=0;\ \F1\JFIGURE: Highlights in the evolution of terrestrial intelligence. The distance along the edge of the tree is proportional to the square root of the time from the present. This seems to space things nicely.\. \_=34;\\_=0;\ \, \FDNervous System Size and Intelligence\F1 \J A feature shared by all living organisms whose behavior is complex enough to indicate near-human intelligence is a nervous system of a hundred billion neurons. Imaging vision requires a billion neurons. A million brain cells usually permits fast and interesting, but stereotyped, behavior as in a bee. A thousand is adequate for slow moving animals with minimal sensory input, such as slugs and worms. A hundred runs most sessile animals. The portions of nervous systems for which tentative wiring diagrams have been obtained (eg. much of the brain of the large neuroned sea slug, Aplysia, the flight controller of the locust and the early stages of some vertebrate visual systems) reveal that the neurons are configured into efficient, clever, assemblies. This should not be surprising, as unnecessary redundancy means unnecessary metabolic load, a distinct selective disadvantage.\. \, \J Evolution has stumbled on many ways of speeding up its own progress, since species that adapt more quickly have a selective advantage. Most of these speedups, such as sex and dying of old age, are refinements of one of the oldest, the encoding of genetic information in the easily mutated and modular DNA molecule. In the last few million years the genetically evolved ability of animals, especially mammals, to learn a significant fraction of their behavior after birth has provided a new medium for growth of complexity. Modern man, though perhaps not the most individually intelligent animal on the planet, is the species in which this cultural evolution seems to have had the greatest effect, making human culture the most potent force on the earth's surface. Our cultural and technological evolution has proceeded by massive interchange of ideas and information, trial and error guided by the ability to predict the outcome of simple situations, and other techniques mediated by our intelligence. The process is self reinforcing because its consequences, such as improved communication methods and increased wealth and population, allow more experiments and faster cross fertilization among different lines of inquiry. Many of its techniques have not been available to biological evolution. The effect is that present day global civilization is developing capabilities orders of magnitude faster. Of course biological evolution has had a massive head start.\. \J Although cultural evolution has developed methods beyond those of its genetic counterpart, the overall process is essentially the same. It involves trying large numbers of possibilities, selecting the best ones, and combining successes from different lines of investigation. This requires time and other finite resources. Finding the optimum assembly of particular type of component which achieves a desired function usually requires examination of a number of possibilities exponential in the number of components in the solution. With fixed resources this implies a design time rising exponentially with complexity. Alternatively the resources can be used in stages, to design subassemblies, which are then combined into larger units, and so on, until the desired result is achieved. This can be much faster since the effort rises exponentially with the incremental size of each stage and linearly with the number of stages, with an additional small term, for overall planning, exponential in the number of stages. The resulting construct will probably use more of the basic component and be less efficient than an optimal design. Biological evolution is affected by these considerations as much as our technology. If a device is so difficult to design that our technology cannot build it, then neither should we expect to find it in the biological world. Conversely, if we find some naturally evolved thing, we can rest assured that designing an equally good one one is not an impossibly difficult task. Presumably there is a way of using the physics of the universe to construct entities functionally equivalent to human beings, but vastly smaller and more efficient. Terrestrial evolution has not had the time or space to develop such things. But by building within the sequence atoms, amino acids, proteins, cells, organs, animal (often concurrently), it produced a technological civilization out of inanimate matter in only two billion years. \. \, \FDHarangue:\F1 \J The existence of several examples of intelligence designed under these constraints should give us great confidence that we can achieve the same in a time span similar to that of other technological accomplishments. The situation is analogous to the history of heavier than air flight, where birds, bats and insects clearly demonstrated the possibility before our culture mastered it. Flight without adequate power to weight ratio is heartbreakingly difficult (vis. Langley's steam powered aircraft or current attempts at man powered flight), whereas with enough power (on good authority!) a shingle will fly. Refinement of the aerodynamics of lift and turbulence is most effectively tackled after some experience with suboptimal aeroplanes. After the initial successes our culture was able to far surpass biological flight in a few decades. Although there are known brute force solutions to most AI problems, current machinery makes their implementation impractical. Instead we are forced to expend our human resources trying to find computationally less intensive answers, even where there is no evidence that they exist. This is honorable scientific endeavor, but, like trying to design optimal airplanes from first principles, a slow way to get the job done. With more processing power, competing presently impractical schemes could be compared by experiment, with the outcomes often suggesting incremental or revolutionary improvements. Computationally expensive highly optimizing compilers would permit efficient code generation at less human cost. The expanded abilities of existing systems such as Macsyma, the symbolic mathematics system from MIT, which can be used as a desk calculator for doing algebra and trigonometry as well as arithmetic, along with new experimental results, would accelerate theoretical development. Gains made this way would improve the very systems being used, causing more speedup. The intermediate results would be inefficient kludges busily contributing to their own improvement. The end result is systems as efficient and clever as any designed by more theoretical approaches, but sooner, because more of the labor has been done by machines. With enough power anything will fly. The next section examines how much is needed. \. \, \FBSection 2: Measuring Processing Power\F1 \J During the past ten years Digital Equipment Corporation's PDP-10 has become the standard computer for AI and related research, partly because it was designed with advanced techniques, such as time sharing and unusual computer languages, in mind. When first introduced, the PDP-10 was considered a large machine. By today's standards it is medium size. The PDP-10 dealt with in this section is the KA model, the standard until very recently. The very largest scientific computers, heavily used in physics, chemistry and other fields, made by companies such as Control Data Corp. and IBM, are about 100 times the speed of the KA. When it was new a KA system cost about half a million dollars. Large computers sell for around 10 million.\. \, \FDLow level vision:\F1 \J The visual system of a few animals has been studied in some detail, especially the layers of the optic nerve near the retina. The neurons comprising these structures are used efficiently to compute local operations like high pass filtering and edge, curvature, orientation and motion detection. Assuming the visual cortex (and possibly the optic nerve itself) is as computationally intensive as the retina, successive layers producing increasingly abstracted representations, we can estimate the total capability. There are a million separate fibers in a cross section of the human optic nerve. The thickness of the optical cortex is a thousand times the depth occupied by the neurons which apply a single simple operation. The eye is capable of processing images at the rate of ten per second (flicker at higher frequencies is detected by special operators). This means that the human visual system evaluates 10,000 million pixel simple operators each second.\. \, \F0\_=25;\\_=-7;\ 14~ sperm whale * 10 ~ ~ human * 13~ chimp * 10 ~ ~ human vision * 12~ 10 ~ ~ 11~ proposed NASA wind  10 ~ tunnel simulator ~ 10~ 10 ~ ~ 9 ~ Cray  10 ~ bee * ~ CDC 7600, IBM 360/195  8 ~ 10 ~ KL-10  CP ~ 7 ~ KA-10  bit 10 ~  ~ sec 6 ~ 10 ~ slug * ~ 5 ~ 10 ~ * sponge (alive) ~ 4 ~ 10 ~ ~ 3 ~ 10 ~  pocket calculator ~ % 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 CE (bits)\N \_=34;\\_=0;\ \F1\J FIGURE: Compute Power and Energy of various devices. Scales are logarithmic. The Cray machine is an extremely fast and large scientific computer. The NASA simulator would probably be a general purpose computer 100 times as powerful as the biggest existing machines. It has not been designed yet.\. \, \_=34;\\_=0;\\N \J A tightly hand coded simple operator, like high pass filtering by subtraction of a local average, applied to a million pixel picture takes at least 160 seconds when executed on a PDP-10, not counting timesharing. Since the computer can evaluate only one at a time, the effective rate is 1/160 million pixel simple operators per second. Thus a hand coded PDP-10 falls short of being the equal of the human visual system by a speed factor of 1.6 million. It may not be necessary to apply every operator to every portion of every picture, and a general purpose computer, being more versatile than the optic nerve, can take advantage of this. I grant an order of magnitude for this effect, reducing the optic nerve to a mere 100,000 PDP-10 equivalents.\. \J The size of this factor is related to having chosen to implement our algorithms in machine language. If we had opted to disassemble a number of PDP-10's and reconfigure the components to do the computation, far fewer (perhaps only one!) would have been required. On the other hand if we had run our algorithms in interpreted Lisp, 10 to 100 times as many would be needed. The tradeoff is that the design time varies inversely with the execution efficiency. A good Lisp program to compute a given function is much easier to produce than an efficient machine language program, or an equivalent piece of hardware.\. \J As a practical example of the kind of problem this gap poses in current research, consider my work. The task is to construct a program which can drive a vehicle sensing the world with a TV camera through terrain cluttered with obstacles, avoiding the obstacles and getting to a desired place. The programs are written efficiently and in the spirit of computing only as much as is actually required to track objects from one image to the next, and to judge their distance from the parallax caused by vehicle motion. In spite of this it takes a large program several minutes of computing to process each frame. Differences in performance caused by changes in the program can often be determined only after tens of images have been processed, implying a run time of hours. This greatly limits experimentation. Also, many ideas on how to significantly improve performance cannot reasonably be tried because they slow down the computation by another factor of 10 or 100, increasing typical runs to days and weeks! Many (such as taking pictures at much smaller intervals than the current two foot motions) require very little additional programming, and would be almost certain to improve things.\. \, \FDEntropy measurement:\F1 \J Is there a quantitative way in which the processing power of a system, independent of its detailed nature, can be measured? A feature of things which compute massively is that they change state in complicated and unexpected ways. The reason for believing that, say, a stationary rock does little computing is its high predictability. By this criterion the amount of computing done by a device is in the mind of the beholder. A machine with a digital display which flashed 1, 2, 3, 4 etc., at fixed intervals would seem highly predictable to an adult of our culture, but might be justifiably considered to be doing an interesting, nontrivial and informative computation by a young child. Information theory provides a measure for this idea. If a system is in a given state and can change to one of a number of next states with equal probability, the information in the transition, which I will call the Compute Energy (\F2CE\F1), is given by \. \C\F2CE = \!jsum(i=1,N); -p\!jsub(i); log p\!jsub(i);\F1 \Jwhere \f2N is the number of next states. The measure is in binary digits, bits. If we consider the system in the long run, considering all the states it might ever eventually be in, then this measure expresses the total potential variety of the system. A machine which can accomplish a given thing faster is more powerful than a slower one. A measure for Compute Power is obtained by dividing each term in the above sum by the time required for a transition. Thus:\. \C\F2CP = \!jsum(i=1,N); \!jdiv((-p\!jsub(i); log p\!jsub(i);)\N ,(t\!jsub(i);));\F1 \JThe units are bits/second. Slightly more complicated formulas, which give lower values, apply if the transitions probabilities and times are not all equal. These measures are highly analogous to the energy and power capacities of a battery. Some properties follow: They are linear, i.e. the compute power and energy of a system of two or more independent machines is the sum of the individual power and energies; Speeding up a machine by a factor of n increases the \F2CP\F1 by the same factor; A completely predictable system has a \F2CP\F1 and \F2CE\F1 of zero; A machine with a high short term \F2CP\F1, which can reach a moderate number of states in a short time, can yet have a low \F2CE\F1, if the total number of states attainable in the long run is not high. \. \J If the probabilities and times of all the transitions are equal the measures simplify to \. \C\F2CE = log\!jsub(2);N\F1 \C\F2CP = \!jdiv((log\!jsub(2);N),t);\F1 \J For \f2N distinct outcomes and equal transition times the maximum \F2CE\F1 and CP is obtained if all the probabilities of occurrence are \F21/N\F1, so the above simplifications represent an upper bound in cases where the probabilities are variable or cannot be determined.\. \, \FDA representative computer:\F1 \J For the KA-PDP10, considering one instruction time, we have (roughly) that in one microsecond this machine is able to execute one of \F22\!jsup(5);\F1 different instructions, involving one of \F22\!jsup(4);\F1 accumulators and one of \F22\!jsup(18);\F1 memory locations, most of these combinations resulting in distinct next sates. This corresponds to a \F2CP\F1 of\. \C\F2\!jdiv((log\!jsub(2);(\!jexp(2,5); \f0x \!jexp(2,4); \f0x \!jexp(2,18);) bit),\ (\!jexp(10,-6); sec)); = 27 \f0x \!jexp(10,6); \!jdiv(bit,sec);\F1 \J This is an extreme upper bound, and represents the most efficient code possible. It cannot be maintained for very long, because many different sequences of instructions have the same outcome. Very often, for instance, the order in which a number of instructions is executed does not matter. Assuming for the moment that this is always the case, we can calculate the effect on the \F2CP\F1, measured over a second. The raw power says there are \. \C\F22\!jsup((27 \f0x 10\!jsup(6);));\F1 \J distinct possible states after a second, or one million instruction times. If all permutations result in the same outcome, this must be reduced by a factor of \F21000000!\F1. The log of a quotient is the difference of the logs, so the adjusted \F2CP\F1 is \. \C\F227 \f0x \!jexp(10,6); - log\!jsub(2);(\!jexp(10,6);!)\N = 27 \f0x \!jexp(10,6); - 18.5 \f0x \!jexp(10,6);\N = 8.5 \f0x \!jexp(10,6); \!jdiv(bit,sec);\F1 \J The \F2CP\F1 is also limited by the total compute energy. If we ignore external devices, this is simply the total amount of memory, about \F236\f0x2\!jsup(18); = 9.4\f0x10\!jsup(6);\F1 bits. The PDP 10 could execute at its maximum effectiveness for \F29.4/8.5 = 1.1\F1 seconds before reaching a state which could have been arrived at more quickly another way. Large external storage devices such as disks and tapes can increase the compute energy indefinitely, because they are a channel through which independently obtained energy may be injected. On the other hand, they have only a moderate effect on the power. A disk channel connected to our KA-10 has a data rate of \F26.4\f0x10\!jsup(6);\F1 bits/sec. If run at full speed, constantly stuffing new information into memory, it would add slightly less than that to the power, because it uses memory cycles the processor might want. If, as is usually the case, the disk is used for both reading and writing, the improvement is reduced by a factor of two. Further reductions occur if the use is intermittent. The overall impact is less than \F210\!jsup(6);\F1 bits/sec, about 10% of the power of the raw processor.\. \J Overall, the processing power of a typical major AI center computer is at most \F210\!jsup(7);\F1 bits/sec. Time sharing reduces this to about \F210\!jsup(6);\F1 b/s per user. Programming in a moderately efficient high level language costs another factor of 10, and running under an interpreter may result in a per user power of a mere 10,000 bits/sec, if the source code is efficient. These reductions can be explained in the light of the CP measure by noting that the a compiler or interpreter causes the executed code to be far more predictable than optimal code, eg. by producing stereotyped sequences of instructions for primitive high level constructs.\. \, \FDA typical nervous system:\F1 \J We now consider the processing ability of animal nervous systems, using humans as an example. Since the data is even more scanty than what we assumed about the PDP-10, some not unassailable assumptions need to be made. The first is that the processing power resides in the neurons and their interconnections, and not in more compact nucleic acid or other chemical encodings. There is no currently widely accepted evidence for the latter, while neural mechanisms for memory and learning are being slowly revealed. A second is that the neurons are used reasonably efficiently, as detailed analysis of small nervous systems and small parts of large ones reveals (and common sense applied to evolution suggests). Thirdly, that neurons are fairly simple, and their state can be represented by a binary variable, "firing" or "not firing", which can change about once per millisecond. Finally we assume that human nervous systems contain about 40 billion neurons. Considering the space of all possible interconnections of these 40 billion (treating this as the search space available to natural evolution in its unwitting attempt to produce intelligence, in the same sense that the space of all possible programs is available to someone trying to create intelligence in a computer), we note that there is no particular reason why every neuron should not be able to change state every millisecond. The number of combinations thus reachable from a given state is \F2\!jexp(2,(40\f0x10\!jsup(9);));\F1 the binary log of which gives \F2CE = 40\f0x10\!jsup(9);\F1. This leads to a compute power of \. \C\F2CP = \!jdiv((40 \f0x \!jexp(10,9); bit),(\!jexp(10,-3); sec)); =\N 40 \f0x \!jexp(10,12); \!jdiv(bit,sec);\F1 \Jwhich is about a million times the maximum power of the KA-10. Keep in mind that much of this difference is due to the high level of interpretation in the KA, compared to what we assumed for the nervous system. Rewiring its gates or transistors for each new task would greatly increase the \F2CP\F1, but also the programming time. If the processor is made of 100,000 devices which can change state in 100 ns, the potential \F2CP\F1 available through reconfiguration is \F210\!jsup(5); \F1bits\F2/10\!jsup(-7);\F1 sec \F2= 10\!jsup(12);\F1 b/s. The \F2CE\F1 would be unaffected. If automatic design and fabrication methods result in small quantity integrated circuit manufacture becoming less expensive and more widely practiced, my calculations may prove overly pessimistic. \. \, \FDThermodynamic efficiency:\F1 \J Thermodynamics and information theory provide us with a minimum amount of energy per bit of information generated at a given background temperature (the energy required to out shout the thermal noise). This is approximately the Boltzmann constant,\. \C\F21.38 \f0x \!jexp(10,-16); \!jdiv(erg,deg variable); = \N 0.96 \f0x \!jexp(10,-16); \!jdiv(erg,deg bit);\F1 \JThe reduction is due to the theoretical fact that a "variable", also known as a degree of freedom, is worth \F2log\!jsub(2);e\F1 bits, about 1.44 bits. This measure allows us to estimate the overall energy efficiency of computing engines. For instance, we determined the computing power of the brain, which operates at 300 degrees K, to be \F240\f0x10\!jsup(12);\F1 bits/sec. This corresponds to a physical power of\. \C\F240 \f0x \!jexp(10,12); \!jdiv(bit,sec); \f0x 300 \N deg \f0x 0.96 \f0x \!jexp(10,-16); \!jdiv(erg,deg bit); = \N 1.15 \!jdiv(erg,sec); = 1.15 \f0x \!jexp(10,-7); watt\F1 \J The brain runs on approximately 40 watts, so we conclude that it is \F210\!jsup(-8);\F1 times as efficient as the physical limits allow. Doing the same calculation for the KA10, again at 300 deg, we see that a \F2CP\F1 of \F28.5\f0x10\!jsup(6);\F1 bit/sec is worth \F22.44\f0x10\!jsup(-14);\F1 watts. Since this machine needs 10 kilowatts the efficiency is only \F210\!jsup(-18);\F1. Conceivably a ten watt, but otherwise equivalent, KA10 could be designed today, if care were taken to use the best logic for the required speed in every assembly. The efficiency would then still be only \F210\!jsup(-15);\F1. As noted previously, there is a large cost inherent in the organization of a general purpose computer. We might investigate the computing efficiency of the logic gates of which it is constructed (as was, in fact, done with the brain measure). A standard TTL gate can change state in about 10ns, and consumes \F210\!jsup(-3);\F1 watt. The switching speed corresponds to a CP of \F210\!jsup(8);\F1 bit/sec, or a physical power of \F22.87\f0x10\!jsup(-13);\F1 watt. So the efficiency is \F210\!jsup(-10);\F1, only one hundred times worse than a vertebrate neuron.\. \J The newer semiconductor logic families are even better. C-MOS is twice as efficient as TTL, and Integrated Injection Logic is 100 times better, putting it on a par with neurons. Experimental superconducting Josephson junction logic operates at 4 deg K, switches in \F210\!jsup(-11);\F1 sec, and uses \F210\!jsup(-7);\F1 watts per gate. This implies a physical compute power of \F13.5\f0x10\!jsup(-12);\F1 watt, and an efficiency of \F27\f0x10\!jsup(-5);\F1, 1000 times better than neurons. At room temperature it requires a refrigerator that consumes 100 times as much energy as the logic, to pump the waste heat uphill from 4 degrees to 300. Since the background temperature of the universe is about 4 degreees, this can probably eventually be done away with. It is thus likely that there exist ways of interconnecting gates made with known techniques which would result in behavior effectively equivalent to that of human nervous systems. Using a million I\!jsup(2);L gates, or 10 thousand Josephson junction gates, and a trillion bits of slower bulk storage, all running at full speed, such assemblies would consume as little as, or less than, the power needed to operate a brain of the conventional type. Past performance indicates that the amount of human and electronic compute power available is inadequate to design such an assembly within the next few years. The problem is much reduced if the components used are suitable large subassemblies. Statements of good high level computer languages are the most effective such modularizations yet discovered, and are probably the quickest route to human equivalence, if the necessary raw processing power can be accessed through them. This section has indicated that a million times the power of typical existing machines is required. The next suggests this should be available at reasonable cost in about ten years.\. \, \FBSection 3: The Growth of Processing Power \F1\J The references show that the price per electronic switch has declined by a steady factor of ten every five years, if speed and reliability gains are included. Occasionally there is a more precipitous drop, when a price threshold which opens a mass market is reached. This makes for high incentives, stiff price competition and mass production economies. It happened in the early sixties with transistor radios, and is going on now for pocket calculators and digital wristwatches. It is begining for microcomputers, as these are incorporated into consumer products such as stoves, washing machines, televisions and sewing machines, and soon cars. During such periods the price can plummet by a factor of 100 in a five year period. Since the range of application for cheap processors is larger than for radios and calculators, the explosion will be more pronounced. The pace of these gains is in no danger of slackening in the forseeable future. In the next decade the current period may seem to be merely the flat portion of an exponential rise. On the immediate horizon are the new semiconductor techniques, I\!jsup(2);L, and super fast D-MOS, CCD for large sensors and fast bulk memory, and magnetic bubbles for mass storage. The new 16K RAM designs use a folded (thicker) cell structure to reduce the area required per bit, which can be interpreted as the first step towards 3 dimensional integration, which could vastly increase the density of circuitry. The use of V-MOS, an IC technique that vertically stacks the elements of a MOS transistor is expanding. In the same direction, electron beam and X-ray lithography will permit smaller circuit elements. In the longer run we have ultra fast and efficient Josephson junction logic, of which small IC's exist in an IBM lab, optical communication techniques, currently being incorporated into intermediate distance telephone links, and other things now just gleams in the eye of some fledgling physicist or engineer.\. \, \F0\_=25;\\_=-7;\\N \_.\A \F1Transistor price\F0 .0001\fE$\_A\.\N .01\fE$\_A\.\N 1\fE$\_A\.\N \f1$1\_A\.\N \f1$100 \F1 Year\F0 ~ ~  O \F1 1950\F0 ~ # ~\F1 1951 \f1$100 transistor\F0 ~ # ~\F1 1952 transistor hearing aid\F0 ~ # ~\F1 1953\F0 ~ # ~\F1 1954\F0  # \F1 1955 transistor radios\F0 ~ # ~\F1 1956\F0 ~ O ~\F1 1957 \f1$10 transistor\F0 ~ # ~\F1 1958\F0 ~ # ~\F1 1959\F0  O \F1 1960 \f1$1 transistor\F0 ~ # ~\F1 1961\F0 ~ # ~\F1 1962 \f1$100,000 small computer (IBM 1620)\F0 ~ # ~\F1 1963\F0 ~ O ~\F1 1964\F0  # \F1 1965 \f1$0.08 transistor (IC)\F0 ~ # ~\F1 1966 \f1$1000 4 func calculator\F0 ~ # ~\F1 1967 \f1$6000 scientific calc.\F0 ~ # ~\F1 1968 \f1$10,000 small computer (PDP 8)\F0 ~ # ~\F1 1969\F0  O \F1 1970 \f1$200 4 func calculator\F0 ~ # ~\F1 1971\F0 ~ # ~\F1 1972 1K RAMS (1 \fE$/bit)\F0 ~ # ~\F1 1973\F0 ~ # ~\F1 1974 \f1$1000 small computer (PDP 11)\F0  O \F1 1975 4K RAMS (.1 \fE$/bit)\F0 ~ # ~\F1 1976 \f1$5 4 func calc (.05 \fE$/trans)\F0 ~ ~\F1 1977\F0 ~ ~\F1 1978\F0 ~ ~\F1 1979\F0 %$ \_=34;\\_=0;\\F1 \, \J My favorite fantasies include the "electronics" of super-dense matter, either made of muonic atoms, where the electrons are replaced by more massive negative particles or of atoms constructed of magnetic monopoles which (if they exist) are very massive and affect each other more strongly than electric charges. The electronics and chemistry of such matter, where the "electron" orbitals are extremely close to the nucleus, would be more energetic, and circuitry built of it should be astronomically faster and smaller, and probably hotter. Mechanically it should exhibit higher strength to weight ratios. The critical superconducting transition field strengths and temperatures would be higher. For monopoles there is the possibility of combination magnetic electric circuitry which can contain, among many other goodies, DC transformers, where an electric current induces a monopole current at right angles to it, which in turn induces another electric current. One might also imagine quantum DC transformers, matter composed of a chainlike mesh of alternating orbiting electric and magnetic charges.\. \J I interpret these things to mean that the cost of computing will fall by a factor of 100 during the next 5 years, as a consequence of the processor explosion, and by the usual factor of 10 in the 5 years after that. As an approximation to what is available today, note that in large quantities an LSI-11 sells for under $500. This provides a moderately fast 16 bit processor with 4K of memory. Another $500 could buy an additional 32K of memory, if we bought in quantity. The result would be a respectable machine, somewhat less powerful than the KA-10, for $1000. At the crude level of approximation employed in the previous section, a million machines of this type should permit human equivalence. A million dollars would provide a thousand of them today (a much better buy, in terms of raw processing power, than a million dollar large processor). In ten years a million dollars should provide the equivalent of a million such machines, in the form of a smaller number of faster processors, putting human equivalence within reach. We now have the problem of what to do with this roomful of isolated computers. The next section describes what seems to be the best known solution to the problem of interconnecting a large number of processors with maximum generality, at reasonable cost. \. \, \FBSection 4: Mega Processing\F1 \J The highest bandwidth and most flexible way for a number of computers to interact is by shared memory. Systems of the size considered here would require a large, but not unreasonable, address space of 100 billion words (40 bits of address). They would also demand memories with a thousand to a million ports. Although a variant of the method below could be used to construct such monsters, they would cost much more than the processors they served. Alternatively, we might consider how a mass of the usual kind of machine, each with a moderate amount of local memory, could communicate through their bandwidth limited IO busses. Define a message period as the minimum interval in which a processor may transmit and receive one message. Assuming that messages are addressed to particular destination machines, an ideal interconnection system would allow every processor to issue a message during each message period, and deliver the highest priority message for each processor to it, providing notices of success or failure to the originators. The delay between transmission and reception of a message would be uniform and short, and the cost of the network would be reasonable (on the order of the cost of the processors served). It so happens that this specification can be met. The following construction is a design due to K.E. Batcher [Batcher], extended considerably to permit pipelining of messages and higher speed. It has the properties:\. \_L\+=100;\L\_R\-=100;\R\J Every processor may send a fairly long message to any other processor about every quarter of a microsecond. The messages from all the processors are emitted in synchronized waves. A wave takes one microsecond to filter through the interconnection net, causing there to be four waves in the net at one time. Each message includes a priority number introduced by the sending computer. The network delivers to each processor the message with the highest priority addressed to it, if any. The processor sending each delivered message receives an acknowledgement, the processors whose messages were blocked by higher priority ones receive notices of failure. The amount of network logic per processor is small, and grows as the square of the log of the number of processors. This low growth rate ensures that even in a system of a million processors the cost of the interconnection is no greater than the cost of the processors.\. \_L\-=100;\L\_R\+=100;\R \, \FDLog\!jsup(2); sorting net construction:\F1 \J Batcher's major contribution is a method for constructing nets that can sort \f2N numbers in \F2log(N)\!jsup(2);\F1 time, using only \F2N\f0xlog\!jsup(2); N\F1 primitive two element sorters. Two nets plus a little additional circuitry can accomplish the message routing task. The nets are constructed by wiring together primitive sorting elements. A primitive sorting element has two inputs, accepting a pair of numbers, and two outputs on which the numbers appear, sorted by their magnitude. If the numbers are represented as serial bit streams, high order bit first, such an element can be built with about 20 gates. Central to the construction is a method for making a merger for two ordered lists of \F22N\F1 numbers (call it a \F12N\F1-merger) from two \f2N-mergers and \F22N\F1 additional primitive elements. A primitive sorting element is itself a \f21-merger. A special case application of the method is shown in fig. 1. Batcher's paper contains proofs of its correctness. An \f2N-merger can thus be constructed from \F2N\f0xlog\!jsub(2); N\F1 primitive elements. A sorter is made of a cascade of ever larger mergers, forming length two ordered lists from individual inputs with \f21-mergers, combining these pairwise into length 4 lists with \f22-mergers, and so on, until the entire sorted list of N numbers comes out of a final \!jfrx(N,2);-merger. The number of primitive elements involved is \F2N\f0xlog\!jsub(2);N\f0x(log\!jsub(2);N+1)/4\F1.\. \, \F0\_=25;\\N Primitive sorting elements  A1    ~  A2  ~ 3   ~ ~~ ~ ~   A3 $~ % M  %$ ~ ~ ~ E ~ ~ ~ A4 ~ ~ $ R ~ ~~ ~ G ~ ~ ~~  A5 $~ E ~%  I ~~ ~~ R ~ ~  O A6 ~~ ~~~ ~ ~~ %$ N ~~~ ~~~ %$ ~~ ~~~ U ~~~ ~~~ ~~ ~~~ P ~~~ ~~~ ~~ ~~~  T ~~~ ~~~ ~~ ~~%  U ~~~ ~~~ ~~  P ~~~ ~~~ ~~~ ~~ %$ T B1 $~~  ~~~ ~~ U ~~% $~~ ~~ S ~% ~ ~~ ~~  T B2 % 3 $~ ~%  ~ ~~ ~ ~ ~  B3 $~ % M $~ ~ %$ ~ ~ ~ E ~ ~ ~ B4 ~ ~ $ R $ ~  ~% G ~ %  B5 $ ~ E   % R ~ %$ B6   %$ \_=34;\ \JFIG 1: \F1An illustration of Batcher's odd-even merger construction, producing a 6-merger from two 3-mergers and five primitive elements. The first and last outputs undergo less delay than the intermediate ones, which is undesirable. The delays can be equalized by passing the first and last outputs through an extra primitive element. Batcher provides an alternate construction, which he calls a "bitonic sorter" which merges, has constant delay, and uses the same number of elements as the above construction with the extra element added.\. \FDCommunication scheme organization:\F1 \F0\_=25;\     1 ~ ~~ ~~ ~~ ~ 1 ~ ~ ~ ~ ~ ~ ~ ~ 2 ~ ~~ ~~ ~~ ~ 2 Processor ~ N ~ ~ ~ ~ ~ ~ ~ Processor 3 ~ ~~ ~~ ~~ ~ 3 Outgoing ~ Input ~ ~ ~ ~ ~ ~ ~ Acknowledge ~ ~ ~ ~ ~ ~ ~ ~ ment Message . ~ Sorter ~ . ~ ~ . ~ ~ . ~ ~ . . ~ ~ . ~ ~ . ~ ~ . ~ ~ . Input Ports . ~ (Desti ~ . ~ ~ . ~ ~ . ~ 2N ~ . ~ nation ~ ~ ~ ~ ~ ~ ~ Ports ~ field) ~ ~ ~ ~ E ~ ~ Input ~ ~ ~ ~ N ~ ~ x ~ ~ ~ N ~ ~~ ~~ c ~~ Sorter ~ N %$ % M ~ ~ h ~ ~ ~ ~ e ~ ~ a ~ ~ (Source ~  $ r ~ ~ n ~ ~ field) ~ 1 ~ g ~~ g ~~ ~ 1 ~ e ~ ~ e ~ ~ ~ 2 ~ r ~~ r ~~ ~ 2 ~ ~ ~ ~ ~ ~ Processor Dummy 3 ~ ~~ ~~ ~ 3 ~ ~ ~ ~ ~ ~ Incoming Message ~ ~ ~ ~ ~ ~ . ~ ~ . ~ ~ . ~ ~ . Message Injection . ~ ~ . ~ ~ . ~ ~ . . ~ ~ . ~ ~ . ~ ~ . Ports ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ N ~ ~~ ~~ ~ N %$ %$ %$ \_=34;\ \JFIG 2: \F1Block diagram of the multiprocessor interconnection scheme. The cost per processor, and the delay in the net, grows as the square of the log of N, the number of processors.\. \F0\_=25;\  ~ Data ~ Source Address ~ 0 ~ Priority ~ Destination Ad ~ %$ FIG 3: \F1Processor message format. High order bit is on the right.\F0  ~ Unused ~ Position Number ~ 1 ~ Zero ~ Position Number ~ %$ FIG 4: \F1Dummy message format.\F0 \_=34;\ \F1\J The interconnection scheme is diagrammed in Figs 2, 3 and 4. Each processor is assigned a number, its "address", as indicated. In the sorters and the merger the smaller numbers come out towards the top of the diagram. Messages are serial bit streams, and consist of a destination processor address, a priority number (invented by the originating computer), a one bit "dummy" flag field (set to 0 for actual messages), the address of the source processor (i.e. a return address), and the data to be communicated. A low priority number implies high priority. Zero is the highest priority. The net is assumed to run at 100% duty cycle, with the processors emitting successive synchronized waves of messages. Every processor emits a message every message interval. The following discussion examines a single message wave. The first sorting net orders the messages by destination address, and within a given destination by priority number. Thus the upper inputs of the merger receive a list of messages, grouped by destination, with the highest priority message to each processor heading its group. The lower inputs of the merger receive N dummy messages, exactly one for each destination processor. The priority field is the highest possible (i.e. zero), the dummy flag is 1, the source address is the same as the destination, and the data portion is unused. Merging these dummies with the sorted list of real messages results in a list still grouped by destination, with each group headed by a dummy, by virtue of its high priority, followed immediately by the highest priority real message, if any, for the destination. This list is fed to the exchange network, which examines adjacent pairs of messages (considering overlapping pairs), and exchanges the data portions of a pair if the first member happens to be a dummy to a given address, and the second is a real message to the same address (i.e. it is the highest priority real message to that destination). The sorting network following the exchanger sorts the messages by the field begining with the dummy flag, which acts as the high order bit, followed by the source address. Since there were N real messages, one from each processor, and N dummies, also nominally one from each processor, and since real messages are sorted ahead of dummy messages due to the high order bit, (i.e. the dummy flag) being 0, the second sorter restores the messages into the same order in which they were introduced.\. \J Each processor has two input ports, one labeled "acknowledgement", and the other "incoming message". The acknowledgement input of processor i is connected to output i of the second sorter. The incoming message input is connected to output N+i (i.e. the i'th output of the lower half). In the absence of the exchange network, the i'th processor would receive its own message back on its acknowledgement input, and the i'th dummy message on the incoming message input. Because of the exchanger, however, if the message that processor i had sent happened to be the highest priority message to the requested destination, then the data portion of the message on the acknowledgement input would be that of the dummy it had been swapped with (signaling success). Also, if any messages had been addressed to processor i, the data portion of the highest priority one would arrive on the incoming message port, in place of the dummy message. Thus a processor receives the highest priority message addressed to it on its incoming message port, or a dummy if nobody wanted to talk to it. It receives a dummy on its acknowledgement port if its message has gotten through, or the message back if it hasn't, due to the existence of a higher priority message to the same destination. Actually the serial nature of the sorter causes the destination and priority field to be lost in the source address sorter (it tails after the previous wave of messages). In the case of messages that fail to get delivered, this means that the originating processor must remember to whom it sent the message (about four message times ago in a typical design, due to the latency of the net), if it wants to try again. This is probably undesirable. Also, delivered messages contain no indication of who sent them, having had their source address field exchanged with that of a dummy, unless the source address is included in the data field. These shortcomings can be overcome if the exchanger shuffles the destination address, source address and priority fields in the manner suggested by Figs 5, 6 and 7. Such shuffling can be accomplished with an amount of storage at each exchanger position equal to the number of bits in the destination and priority fields. \. \F0\_=25;\ \F1Before:\F0  ~ Unused ~ Destination Ad ~ 1 ~ Zero ~ Destination Ad ~ %$  ~ Data ~ Source Address ~ 0 ~ Priority ~ Destination Ad ~ %$ \F1After:\F0  ~ Priority ~ Source Address ~ Data ~ Destination Ad ~ 1 ~ %$  ~ Zero ~ Destination Ad ~ Unused ~ Source Address ~ 0 ~ %$ FIG 5: \F1Rearrangements effected by the exchanger in an exchanged pair.\F0 \F1Before:\F0  ~ Data ~ Source Address ~ 0 ~ Priority ~ Destination Ad ~ %$ \F1After:\F0  ~ Priority ~ Destination Ad ~ Data ~ Source Address ~ 0 ~ %$ FIG 6: \F1Rearrangements in an unsuccessful message.\F0 \F1Before:\F0  ~ Unused ~ Destination Ad ~ 1 ~ Zero ~ Destination Ad ~ %$ \F1After:\F0  ~ Zero ~ Destination Ad ~ Unused ~ Destination Ad ~ 1 ~ %$ FIG 7: \F1Rearrangements in an isolated dummy message.\F0 \_=34;\ \FDPackage counts:\F1 \J If the numbers to be sorted are sent into such a net as serial bit streams, high order bit first, then a primitive sorting element has two output wires labelled "smaller" and "larger", two input wires and a reset and a clock line, and works as follows: Just before the first bit time the element is reset. Bits then stream into the input terminals, and simply stream out of the output terminals until the first bit position in which the two inputs differ comes along. At that instant, the input with the 0 is connected to the "smaller" output, and the one with the 1 is connected to "larger". This interconnection is latched by the element and all subsequent bits stream from the inputs to the outputs on the basis of it, until the next reset. Such a unit can be built with approximately 20 gates, and introduces one bit time of delay. Careful design should permit an ECL version with a 100 or 200 MHz bit rate. These could be packed into 48 (say) pin LSI packages, 8 independent elements per package (the clock and reset lines are common), 16 per package, configured into 4 2-mergers, 24 per package, arranged into two 4-mergers, and 32 per package containing a single 8-merger. Larger mergers can be constructed out of these using an extension of the bitonic sorter strategies given in [Batcher], resulting in total package counts (and partial eight merger, four, two and single element package counts) shown in Fig. 8 (to re-iterate, a merger size of N refers to one which takes two lists of size N and produces a list of size 2N). \. \F0\_=25;\  ~ Merger size ~ Package counts ~ ~ ~ Total Eights Fours Twos Ones ~  ~ ~ ~ ~ ~ ~ ~ ~ 1 ~ 1/8 ~ ~ ~ ~ 1/8 ~ ~ 2 ~ 1/4 ~ ~ ~ 1/4 ~ ~ ~ 4 ~ 1/2 ~ ~ 1/2 ~ ~ ~ ~ 8 ~ 1 ~ 1 ~ ~ ~ ~ ~ 16 ~ 4 ~ 2 ~ ~ ~ 2 ~ ~ 32 ~ 8 ~ 4 ~ ~ 4 ~ ~ ~ 64 ~ 16 ~ 8 ~ 8 ~ ~ ~ ~ 128 ~ 32 ~ 32 ~ ~ ~ ~ ~ 256 ~ 96 ~ 64 ~ ~ ~ 32 ~ ~ 512 ~ 192 ~ 128 ~ ~ 64 ~ ~ ~ 1,024 ~ 384 ~ 256 ~ 128 ~ ~ ~  ~ 2,048 ~ 768 ~ 768 ~ ~ ~ ~ ~ 4,096 ~ 2,048 ~ 1536 ~ ~ ~ 512 ~ ~ 8,192 ~ 4,096 ~ 3072 ~ ~ 1024 ~ ~ ~ 16,384 ~ 8,192 ~ 6144 ~ 2048 ~ ~ ~ ~ 32,768 ~ 16,384 ~ 16384 ~ ~ ~ ~ ~ 65,536 ~ 40,960 ~ 32768 ~ ~ ~ 8192 ~ ~ 131,072 ~ 81,920 ~ 65536 ~ ~ 16384 ~ ~ ~ 262,144 ~ 163,840 ~ 131072 ~ 32768 ~ ~ ~ ~ 524,288 ~ 327,680 ~ 327680 ~ ~ ~ ~ ~ 1,048,576 ~ 786,432 ~ 655360 ~ ~ ~ 131072 ~ ~ ~ ~ ~ ~ ~ ~ %$ FIG 8: \F1Package counts for mergers\F0 \_=34;\ \, \F1\J These counts can now be used to calculate the number of packages required to build sorters of various sizes: \. \F0\_=25;\  ~ Sorter size ~ Package counts ~ ~ ~ Total Eights Fours Twos Ones ~  ~ ~ ~ ~ ~ ~ ~ ~ 2 ~ 1/8 ~ ~ ~ ~ 1/8 ~ ~ 4 ~ 1/2 ~ ~ ~ 1/4 ~ 1/4 ~ ~ 8 ~ 3/2 ~ ~ 1/2 ~ 1/2 ~ 1/2 ~ ~ 16 ~ 4 ~ 1 ~ 1 ~ 1 ~ 1 ~ ~ 32 ~ 12 ~ 4 ~ 2 ~ 2 ~ 4 ~ ~ 64 ~ 32 ~ 12 ~ 4 ~ 8 ~ 8 ~ ~ 128 ~ 80 ~ 32 ~ 16 ~ 16 ~ 16 ~ ~ 256 ~ 192 ~ 96 ~ 32 ~ 32 ~ 32 ~ ~ 512 ~ 480 ~ 256 ~ 64 ~ 64 ~ 96 ~ ~ 1,024 ~ 1,152 ~ 640 ~ 128 ~ 192 ~ 192 ~ ~ 2,048 ~ 2,688 ~ 1536 ~ 384 ~ 384 ~ 384 ~  ~ 4,096 ~ 6,144 ~ 3840 ~ 768 ~ 768 ~ 768 ~ ~ 8,192 ~ 14,336 ~ 9216 ~ 1536 ~ 1536 ~ 2048 ~ ~ 16,384 ~ 32,768 ~ 21504 ~ 3072 ~ 4096 ~ 4096 ~ ~ 32,768 ~ 73,728 ~ 49152 ~ 8192 ~ 8192 ~ 8192 ~ ~ 65,536 ~ 163,840 ~ 114688 ~ 16384 ~ 16384 ~ 16384 ~ ~ 131,072 ~ 368,640 ~ 262144 ~ 32768 ~ 32768 ~ 40960 ~ ~ 262,144 ~ 819,200 ~ 589824 ~ 65536 ~ 81920 ~ 81920 ~ ~ 524,288 ~ 1,802,240 ~ 1310720 ~ 163840 ~ 163840 ~ 163840 ~ ~ 1,048,576 ~ 3,932,160 ~ 2949120 ~ 327680 ~ 327680 ~ 327680 ~ ~ 2,097,152 ~ 8,650,752 ~ 6553600 ~ 655360 ~ 655360 ~ 786432 ~ ~ ~ ~ ~ ~ ~ ~ %$ FIG 9: \F1Package counts for sorters\F0 \_=34;\ \F1\J An interconnection net for N processors involves an N sorter, an N merger and a 2N sorter. Thus a 128 processor system would require 304 48 pin sorter packages, 2.3 for each processor. A 1024 processor needs 4224, roughly four per processor. A size 16,384 system needs 7 for each computer. A million processor would have 12.75 per machine. It is likely that the biggest versions of this system will require denser packaging. Remember, though, that a thousand processor system is sufficient for human equivalence if each machine is big and fast enough. It will probably not be necessary to build a megaprocessor in the decade envisioned here.\. \, \FDSpeed calculations:\F1 \J As outlined, a message consists of a destination address, a priority, a bit, a source address and a data portion. The two addresses must be at least large enough to uniquely specify each machine. Considering the case of a thousand and a million machine system, we note that the address lengths are 10 and 20 bits. Let's make the priority field the same length, leaving room for considerable flexibility in priority assignment schemes. The message portion should be fairly long, to permit messages like memory write requests, which require both a memory address and the data. Four address lengths, say. This gives us a message length 7 addresses long, 70 bits in a 1000 processor, 140 bits in a megaprocessor. The full time taken by a message from start of transmission to completion of reception, in bit times, is the message length, plus the depth of the net (in primitive elements), plus an address and a priority time due to the buffering at the exchanger. The depth of a 1024 processor net is 110 elements. This combines with the message and exchanger delays to result in a transit time of 110+70+20 = 200 bit times. If the bit rate is 100 MHz then messages are delivered in two microseconds. If the bit rate is 200 MHz, the time is 1 microsecond. The net contains about two full message waves at any instant, and a message time is 700 nanoseconds for 100 MHz, or 350 nanoseconds for 200 MHz. The same calculation for a megaprocessor reveals a depth of 420 and a total transit time of 420+140+40 = 600 bit times. This corresponds to 6 and 3 microsecond transits for 100 and 200 MHz data rates. Corresponding message times are 1.4 microseconds and 700 nanoseconds. The net contains slightly more than 3 message waves at a time.\. \, \FDPossible refinements:\F1 \J The transit time of the net can be decreased, at the cost of increasing the message times a little, by running some of the primitive sorter elements asynchronously, clocking (and introducing bit time delays) at larger intervals. For instance, an 8-merger might accept a bit time worth of inputs, which would then trickle through four stages of primitive sorters built without shift register delays between them. When everything had settled down, the entire merger would be clocked, and those elements which had decided to latch at this bit time would do so. The delay of the unit is one bit time rather than four, which reduces the 110 or the 420 term in the transit calculations. The settling time is longer, however, so the bit rate would be slower. Perhaps a factor of two in transit time can be gained in this way, at a cost of 1.5 in data rate. At an increase in gate count, several levels of asynchronous primitive sorters can be replaced by an equivalent circuit with fewer gate delays. This might enable a given data rate to be maintained while the transit time was reduced. [Van Voorhis] offers some slight reductions in primitive sorter count, essentially by substituting special case solutions better than the systematic construction wherever possible in a large net. Unfortunately these invariably have an uneven amount of delay along the various paths, making them almost worthless as synchronous nets. They may be useful as designs for asynchronous subnets. I have generalizations of Batcher's constructions using primitive elements that are M sorters (as opposed to 2 sorters). These allow building a merger which combines M sorted lists of size N\f0xM into a single sorted M\f0xM\f0xN length list, out of M mergers each capable of merging M size N lists, and some layers of extra M sorter primitive elements to combine the output of the mergers (these are analogous to the single layer of 2-sorters following the mergers in the odd-even merger of Fig 1. The number of such layers grows (empirically) roughly as log(M)). Although the number of gates needed for a given size sorter grows slowly with M, the number packages used shrinks. This is because each package, being a sorter rather than a merger, contains more logic. My constructions are complete for M8, and partially complete for general M. \. \J Well, now we have a room full of not only processors, but a massive switching system as well. Can it be made to pay for its keep? The next section examines some programming implications and opportunities.\. \, \FBSection 5: Programming Interconnected Processors\F1 \J A major feature of this scheme is its flexibility. It can function as any of the fixed interconnection patterns of current experimental multiprocessors, or as a hexagonal mesh, or a 7 dimensional cubic lattice, should that be desired, or the tree organization being considered in a Stanford proposal. It can simulate programmed pipeline machines, where numbers stream between units that combine and transform them. What is more, it can do all of these things simultaneously, since messages within one isolated subset of the processors have no effect on messages in a disjoint subset. This permits a very convenient kind of "time" sharing, where individual users get and return processors as their demands change. Such mimicry fails to take advantage of the ability to reconfigure the interconnection totally every message wave. There are many applications, such as searching a tree of possibilities in reasoning or game playing where this could be used very effectively. Several existing programming languages can be extended to make this capability conveniently available to programmers. \. \, \FDA little Lisp:\F1 \J Take an existing Lisp and purify it to something closer to its lambda calculus foundations. Flush RPLACA and RPLACD and even SETQ (a pseudo SETQ can be introduced later), and discourage PROGs, which are inherently sequential and do things that could have been stated more clearly as recursive functions. In this system recursive programs are also more efficient. An evaluation is handled as follows. A free processor (one not currently doing an evaluation) receives a message demanding the value of Expr with variable bindings given in ALIST. Call this the task [Expr,ALIST]. If the top level of Expr is F(exprA,exprB,exprC), it generates the tasks [exprA,ALIST], [exprB,ALIST] and [exprC,ALIST], and passes them to three other free processors. These evaluate them and sooner or later return valA (the value of exprA), valB and valC to the original processor. This processor then refers to the definition of F, and in particular to the names of the dummy arguments in the definition. Suppose these were A, B and C. It combines the ALIST it was given with the (name, value) pairs (A, valA), (B, valB), (C, valC) to form a new ALIST'. Then it generates the task [bodyF,ALIST'], where bodyF is the body of the definition of F, which is passed to another free processor. On receiving the value of this expression, it passes it back to whoever had given it the original task, and then declares itself free. If Expr had happened to be an atom, the processor would simply have looked it up in ALIST, and returned its value. If F had been a predefined system function (such as CAR or CONS) the sequence of actions would have been whatever the machine code for those functions (of which each processor would have a copy) specified. The parallelism comes from the fact that the arguments in a multi-argument function can be evaluated simultaneously. This causes moderate parallelism when functions are nested, and can cause explosive parallelism when a function which at some level uses a multi-argument function, invokes itself recursively (as in a tree search). Most non-trivial programs stated as recursive functions do this. The description above implies that the processor waits for the results of tasks it has farmed out to other machines. These waits can be arbitrarily long. Also, the number of tasks spawned can easily become more than the number of processors. In that case, presumably, a processor with a task to farm out would have to wait until a machine becomes free. Keeping a processor idle under these conditions is clearly undesirable.\. \J If each processor were time shared, pretending to be many machines, then when the job being run becomes temporarily idle there may be another to switch to. When a message pertaining to a given waiting job arrives, the processor deals with it, and if it provides the information necessary to allow that job to resume, it is resumed. This scheme makes the number of processors available seem larger, perhaps enough to make it possible to acquire a processor for a new task simply by picking a machine at random and asking if it has a free job slot. This works well if the answer is usually yes, (i.e. if there reasonably more slots than tasks), and replaces a more complicated free processor pool method that must be able to deal with many requests simultaneously. Alternatively instead of each processor having its own little pool of running jobs, the whole Lisp system can maintain a communal pool, which processors refer to as they become free. In this organization a task that must pause is placed into the pool (freeing the processor that was running it), to be taken up again by a free processor when its requirements are met. Moderate processing power is required to manage the pool. The list structure is spread out among all the processors in the system. Pointers consist of a processor number and address within processor field. A machine evaluating an expression whose top level function is CONS creates the new cell in its own memory. A CAR or CDR involves sending a message to the processor which owns the cell involved, and waiting for a reply (thus a CONS is usually cheaper than a CAR!). Since RPLACA and RPLACD are eliminated, circular lists cannot be created. This permits garbage collection to be mediated by reference counts. A small fixed reference count field (three bits wide, say) is part of each cell. A processor doing a CONS sends messages to the owners of the cells that the new cell will be pointing to, indicating that their reference count should be incremented. If a cell getting a message of this type has a reference count that has already reached maximum (7 in our example), it sends back an indignant message to the CONS'ing processor saying, effectively, "I'm full, you can't point to me. But here are my CAR and my CDR, roll your own and point to that". This not only makes the reference count field size manageable, but reduces message traffic jams to processors containing very popular cells. It does make it mandatory to use EQUAL rather than EQ when comparing lists. When a processor no longer requires a cell to which it had a pointer it sends a message to the owner saying so. The reference count of the cell is decreased by one, and if it reaches zero, it is freed, and similar messages are sent to the cells pointed to by its CAR and CDR. This garbage collection process goes on continuously and independently of the main computations.\. \J How is the A-list of bindings to be handled? The canonical representation, using a simple list of dotted pairs, is very inefficient, since it forces a sequential search. A hash table, as in most current Lisp implementations is also undesirable, partly because it requires an incompatible data type (a relatively large contiguous block of memory), but primarily because in this parallel system there are many different contexts (in different branches of the computation) active at one time. The entire hash table would need to be duplicated whenever new bindings are made (i.e. at every level of evaluation), since the older context is in use in another branch. A nice alternative is to use essentially an A-list, but to structure it as a balanced tree, with each node containing a binding, a left subtree for those variables with names lexicographically (say) less than it, and a right subtree for those greater. An entry in such a structure can be found in time proportional to the logarithm of the number of nodes, and a version of the list with an entry added, deleted or modified can also be constructed in log time without affecting the original in any way, by building a new path down to the element (and sometimes a few side paths, to keep the tree reasonably balanced), which points to many of the subtrees of the original. Since this structure can be built of standard list elements, it can be managed via the normal allocation and garbage collection mechanisms. Call this structure an A-tree. In general applications a balanced tree is capable of being used in this system more effectively than a simple list, because a parallel process can get to all the nodes in log time, whereas the last node takes linear time in a linear list. This suggests that programmers wanting maximum efficiency will be coerced into using them much more frequently than is now the case. This is similar to the pressure on present day Lisp programmers to use PROGS rather than recursive functions, because compiled PROGS run faster. The [expr,ALIST] task description is conveyed simply as a pair of pointers, expr to an S expression version of expr, and ALIST to the head of the balanced A-tree. The result returned in an evaluation is likewise a pointer. Since the list structure is only built upon, and never altered, it is possible to speed up the operation of the system by having individual processors maintain software managed caches of frequently referred to cells that reside in remote machines. This cuts down on the message traffic, and generally speeds things up.\. \, \FDA little Algol:\F1 \J Although recursive functions provide an excellent way of exploiting a very parallel architecture, there are other ways. An algorithmic (iterative) language can be made to serve, if a few features are present. Existing Algol unfortunately has few of these. Consider the following pair of statements excerpted from an imaginary program: \. \F0 A := 3 ; B := 4 ; \F1\J These say that first A is set to 3, then B is assigned the value 4. There is an implied sequentialness. Presumably the programmer knew that the order of the assignments was unimportant, and that they could be done simultaneously. The syntax of the language provided no way for him to indicate this. A simple construct which permits information of this type to be conveyed is the "parallel semicolon", which we will indicate by a vertical bar "|". Statements seperated by parallel semicolons may be executed simultaneously. The following is an example of its use: \. \F0 A := 3 | B := 4 | C := 6 ; D := A+B+C ; \F1\J The first three statements may be executed together, the fourth must wait for their completion. It is implied that compound statements, bracketed by BEGIN END pairs can be similarly separated by parallel as well as sequential semicolons. If compound statements are permitted to return as a value the value of their last component statement, then the structure \. \F0 T := BEGIN INTEGER A,B,C ; A := expressionA | B := expressionB | C := expressionC ; expression involving A, B and C END; \F1\J is equivalent to a lambda expression construct in Lisp. A, B and C are the dummy arguments, evaluated in parallel, and the fourth expression, which must wait for A, B and C, is the body of the lambda. The constructs involved can be used in many other ways, unlike a real lambda expression. If the Algol also has recursion, then it is possible to obtain massive parallelism in the same way it was achieved in Lisp, namely by writing functions which recurse deeply, and invoke themselves a few times in parallel at each level. A minor form of parallelism already automatically available resides in arithmetic expressions. Different subexpressions of larger formulas can be evaluated simultaneously, and then combined. This is analogous to the evaluation of Lisp expressions. More massive concurrency can be obtained if the data types of Algol are expanded to include a complete set of array operators and genuine dynamic allocation of arrays. It is this type of data and operator set that makes APL an extremely powerful language in spite of an execrable control structure. The idea is that operators such as addition and multiplication work not only on simple variables representing single numbers, but on whole multi-dimensional arrays. Since arrays are more complex objects a much larger range of operators is required. Included are such things as array restructuring, subscript permutation (generalized transpose), element shuffling, subarray extraction, generalized cross and dot products, etc.. In general an operator combines one or more arrays and produces a value which is a new array, often of a different size and shape. Compounding of such operators substitutes for a large amount of explicit loop control, and results in substantially more compact source code. On conventional computers such code also runs much faster because most the run time is spent in carefully hand coded implementations of the basic operators, which do a great deal at each invokation if the arrays are large. Typical equivalent Algol programs execute second rate compiler produced code almost exclusively. An operator set of this kind provides the same capabilities as a hypothetical parallel FOR loop construct, with greater clarity. Conditionals can be handled by first selecting out subarrays according to the condition, and then applying the appropriate operations.\. \J It is desirable in our system for large arrays to be spread out among many machines, so that array operations can be carried out in parallel where possible. An often occurring process is the application of an arithmetic or other operator to one or more arrays of a given size, producing a result of the same size. This could be made maximally efficient if corresponding elements of the arrays resided in the same processor. Assigning a processor to an array element by means of a hash function of both the array dimensions and the indices of the element will cause arrays of different size to be stored in different places, but will put corresponding elements of arrays of the same size in the same place. Now comes the question of how a gaggle of processors managing a given array gets word that an operation concerning the array is to be executed. The initiator of the operation is the single processor running the code requesting it. The fastest way to propagate such information is to initiate a "chain letter". The originating processor can consider the array as two equal (or almost equal) pieces, and send a message to a member of each piece. Each recipient then divides the piece of which it is a member into two, and sends a message to a representative of each of those smaller subsets, etc. When the subset size becomes one, the operation is performed. It may be more efficient to store larger pieces of arrays in individual processors. This adds a slight serial component to the run times, but saves message handling time. How are programs communicated from processor to processor, as the number of executing instruction streams grows? It would be an extravagant waste of memory to duplicate the entire source program in every processor (besides, it might not fit). A software caching scheme is called for. When a processor initiates a subtask, it obtains a free processor and passes to it a moderate amount of the code needed for the task. When the new processor runs out, it sends a message back to the first machine for more. This machine trys to obtain it from its own cache, and if it too is out, sends a message further up in the hierarchy, to the machine which had initiated the task which it is running, and so on until the requested code is obtained. At top level the entire runnable program is assumed to be available from some sort of file, maintained by the combined storage of a number of processors.\. \, \FDA little operating systems:\F1 \J Assume that most input/output devices are connected to individual processors, and that, to maximize the bandwidth, each of these processors has only one, or a most a very small number, of devices associated with it. Included among these are communications lines to user terminals, so that each console talks directly to a dedicated processor with sorting net access to the entire system. If we expect to support more than one user on such a system, it will be necessary to have some type of protection scheme, to prevent one user's processes from accidentally interfering with another's. If each processor contains a monitor, then message sending can handled by system calls. The monitor can then check for validity, testing if the requested destination is within the set of processors assigned to the user. This monitor, which every machine assigned to a user must run, can be flexible enough to time share the machine it runs on, to provide multiple simulated processors. Controlling the state of the individual machines' monitors is the task of a global system monitor, operated by several machines, which maintains a pool of free processors, and parcels them out on request, and which also handles file system requests (bulk storage would be connected to a handful of the processors), and allocation of other devices. Processes belonging to a single user will be initiated by a particular master machine, probably the one connected to his console. This master can create a tree of subprocesses, possibly intercommunicating, running on different machines. It should be possible, for example, to have one subset configured as an array processor for efficient implementation of low level vision operations, while another is running an Algol/APL for the less structured analytic geometry needed to interpret the image, and yet a third is operating a Lisp system doing abstract reasoning about the scene. Many existing systems permit this kind of organization, but they are hampered by having an absurdly small amount of computing power.\. \J How is a system of this kind initialized, and how does one abort an out of control process taking place in part of it without affecting the rest? A possibility is to have an "executive" class of messages (perhaps signalled by a particular bit in the data portion), which user jobs are not permitted to emit. Reception of such messages might cause resetting of the processor, loading of memory locations within it, and starting execution at a requested locations. A single externally controllable machine can be used to get things going, fairly quickly if it emits a self replicating "chain letter". Now consider reliability. The system can obviously tolerate any reasonable number of inoperable processors, by simply declaring them unavailable for use. Failures in the communication net are much more serious, and under most situations will require the system to stop operating normally. It is possible to write diagnostic programs which can track down defective comparator elements or broken data wires. If something should happen to the clock signals to a given level it would be necessary to wheel out an oscilloscope. If reliability were a critical issue it would be possible to include a duplicate net, to run things the while other was being debugged.\. \, \FDDisclaimer:\F1 \J The software outlines are obviously only partly baked. This is mostly due to the limited amount of thought and work that has gone into them. On the other hand, it is my belief that even well thought out designs at this point will look naive in the light of experience with a working version of such a machine. Many of the fundamental decisions depend on things difficult to estimate, including the number of processors, communication speed compared with processor speed, memory size, and most importantly the kinds, sizes and mix of operations that people will tend to run (these will surely differ from what is being done now, the limitations being so different). This is not to say that more thought isn't called for. The nature of the memory protection, interrupt scheme, word size, instruction set, I/O structure, etc. of the individual machines should be tailored to permit convenient implementation of the individual processor operating systems, and typical message types. What these are can best be determined by trying to write some monitors, interpreters, compilers and array crunchers. Simulation of the system on a conventional machine is next to useless. Since the total amount of memory envisaged is larger than conventional machines carry, and since a simulation will have to jump around from simulated core image to simulated core image, in order to keep a realistic message synchrony, there would be an enormous amount of disk accessing. The slowness of this, combined with the inherent speed difference, would allow only the most trivial (and therefore unrepresentative) things to be tried, and not many of these. Construction of a moderate size version is a better use of manpower. Undoubtedly there will be mistakes made in the first (or first few) models, which will become apparent after a bit of experience. The flexibility offered should make this design much more attractive to a large class of programmers than current essentially special purpose architectures such as ILLIAC IV. Making it out of existing processors with proven machine languages will help too. I, for one, can hardly wait to start programming even a flawed version of a machine that can process and generate real time television with programs written in Algol, and simultaneously jump over tall game and proof trees in a single bound.\. \, \FBSection 6: The Future \F1\J Suppose my projections are correct, and the hardware requirements for human equivalence are available in 10 years for about the current price of a medium large computer. Suppose further that software development keeps pace (and it should be increasingly easy, because big computers are great programming aids), and machines able to think as well as humans begin to appear in 10 years. If the cost of electronics continues to plummet beyond then (and the existence of increasingly cheaper and better robot labor, in addition to scientific and engineering improvements, should ensure that), an additional 15 years should bring human equivalence into the pocket calculator price range. I also assume that sensors and effectors for such devices will be able to match human performance, since even today's technology is able to supercede it in many areas. What then? Well, even if these machines are only as clever as human beings, they will have enormous advantages over humans in competitive situations. Their production and upkeep is vastly less expensive, so more of them can be put to work with given resources. They can be easily specialized for given tasks, and be programmed to work tirelessly. Because we are not constrained to use any particular type of component in building them, versions can be designed to work efficiently in environments in which sustaining humans is very expensive, such as deep in the oceans, and more importantly in boundless outer space. Most significantly of all, they can be put to work as programmers and engineers, with the task of optimizing the software and hardware which make them what they are. The successive generations of machines produced this way will be increasingly smarter and more cost effective. Of course, there is no reason to assume that human equivalence represents any sort of upper bound. When pocket calculators can out-think humans, what will a really big computer be like? Regardless of how benevolent these machines are made, homo sapiens will simply be outclassed.\. \J Societies and economies are as surely subject to evolutionary pressures as biological organisms. Failing social systems eventually wither and die, and are replaced by more successful competitors, and those that can sustain the most rapid expansion dominate sooner or later. I expect the human race to expand into space in the near future, and O'Neill's habitats for people will be part of this. But as soon as machines are able to match human performance, the economics against human colonies become very persuasive. Just as it was much cheaper to send Pioneer to Jupiter and Viking to Mars than men to the Moon, so it will be cheaper to build orbiting power stations with robot rather than human labor. A machine can be designed to live in free space and love it, drinking in unattenuated sunlight and tolerating hard radiation. And instead of expensive pressurized, gravitied, decorated human colonies, the machines could be put to work converting lunar material into orbiting automatic factories. The doubling time for a machine society of this type would be much shorter than for human habitats, and the productive capability would expand correspondingly faster. The first societies in space will be composed of co-operating humans and machines, but as the capabilities of the self-improving machine component grow, the human portion will function more and more as a parasitic drag. Communities with a higher ratio of machines to people will be able to expand faster, and will become the bulk of the intelligent activity in the solar system. In the long run the sheer physical inability of humans to keep up with these rapidly evolving progeny of our minds will ensure that the ratio of people to machines approaches zero, and that a direct descendant of our culture, but not our genes, inherits the universe. This may not be as bad as it sounds, since the machine society can, and for its own benefit probably should, take along with it everything we consider important, up to and including the information in our minds and genes. Real live human beings, and a whole human community, could then be reconstituted if an appropriate circumstance ever arose. Since biology has committed us to personal death anyway, with whatever immortality we can hope for residing only in our children and our culture, shouldn't we be happy to see that culture become as capable as possible? In fact, attempting to hobble its growth is an almost certain recipe for long term suicide. The universe is one random event after another. Sooner or later an unstoppable virus deadly to humans will evolve, or a major asteroid will collide with the earth, or the sun will go nova, or we will be invaded from the stars by a culture that didn't try to slow down its own evolution, or any number of other things. The bigger, more diverse and competent our offspring are, the more capable they will be of detecting and dealing with the problems that arise. For the egomaniacs among us there is another possibility. The main problem in keeping up with the machines is that we evolve by the old DNA + nucleated cell + sex + personal death method, while our machines evolve by the new improved intelligence + language + culture + science + technology technique, which is so very much faster that our biology seems to stand still in comparison. If we could somehow transfer our evolution to the faster form, we should be able hold our own.\. \J At first thought genetic engineering might seem to be the key. Successive generations of human beings could be designed by engineering mathematics and on the basis of computer simulations just like airplanes and computers are now. But this is just like building robots out of proteins instead of metal and plastic. Being made of protein is in fact a major drawback. That stuff is stable only in a narrow temperature and pressure range, sensitive to all sorts of high energy disturbances, and so on, and rules out many construction techniques and components. Is there some way to retain our essential humanness, at least temporarily until we think of something better, while transferring ourselves to a more malleable form? Imagine the following process (meant to suggest a variety of ways such a thing could be done). You are in an operating theater, and a brain surgeon (probably a machine) is in attendance. On a table next to yours is a potentially human equivalent computer, dormant now for lack of a program to run. Your skull, but not your brain, is under the influence of a local anaesthetic. You are fully conscious. Your brain case is opened, and the surgeon peers inside. Its attention is directed at a small clump of about 100 neurons somewhere near the surface. It examines, non-destructively, the three dimensional structure and chemical makeup of that clump with neutron tomography, phased array radio encephalography, etc., and derives all the relevant parameters. It then writes a program which can simulate the behavior of the clump as a whole, and starts it running on a small portion of the computer next to you. It then carefully runs very fine wires from the computer to the edges of the neuron assembly, to provide the simulation with the same inputs the neurons are getting. You and it both check out the accuracy of the simulation. After you are satisfied, it carefully inserts tiny relays between the edges of the clump and the rest of the brain, and runs another set of wires from the relays to the computer. Initially these simply transmit the clump's signals through to the brain, but on command they can connect the simulation instead. A button which activates the relays when pressed is placed in your hand. You press it, release it and press it again. There should be no difference. As soon as you are satisfied, the simulation connection is established firmly, and the now unconnected clump of neurons is removed.\. \J The process is repeated over and over for adjoining clumps, until the entire brain has been dealt with. Occasionally several clump simulations are combined into a single equivalent but more efficient program. Though you have not lost consciousness, or even your train of thought, your mind has been removed from the brain and transferred to the machine. A final step is the disconnection of the your old sensory and motor system, to be replaced by higher quality ones in your new home. This last part is no different than the installation of functioning artificial arms, legs, pacemakers, kidneys, ears and hearts and eyes being done or contemplated now. Advantages become apparent as soon as the process is complete. Somewhere in your machine is a control labelled "speed". It was initially set to "slow", to enable the simulations to remain synchronized with the rest of your old brain, but now the setting is changed to "fast". You can communicate, react and think at a thousand times your former rate. But this is only a minor first step.\. \J Major possibilities stem from the fact that the machine has a port which enables the changing program that is you to be read out, non-destructively, and also permits new portions of program to be read in. This allows you to conveniently examine, modify, improve and extend yourself in ways currently completely out of the question. Or, your entire program can be copied into a similar machine, resulting in two thinking, feeling versions of you. Or a thousand, if you want. And your mind can be moved to computers better suited for given environments, or simply technologically improved, far more conveniently than the difficult first transfer. The program can also be copied to a dormant information storage medium, such as magnetic tape. In case the machine you inhabit is fatally clobbered, a copy of this kind can be read into an unprogrammed computer, resulting in another you, minus the memories accumulated since the copy was made. By making frequent copies, the concept of personal death could be made virtually meaningless. Another plus is that since the essence of you is an information packet, it can be sent over information channels. Your program can be read out, radioed to the moon, say, and infused there into a waiting computer. This is travel at the speed of light. The copy that is left behind could be shut down until the trip is over, at which time the program representing you with lunar experiences is radioed back, and transfused into the old body. But what if the original were not shut down during the trip? There would then be two separate versions of you, with different memories for the trip interval.\. \J When the organization of the programs making up humans is adequately understood, it should become possible to merge two sets of memories. To avoid confusion, they would be carefully labelled as to which had happened where, just as our current memories are usually labelled with the time of the events they record. This technique opens another vast realm of possibilities. Merging should be possible not only between two versions of the same individual but also between different persons. And there is no particular reason why mergings cannot be selective, involving some of the other person's memories, and not others. This is a very superior form of communication, in which memories, skills, attitudes and personalities can be rapidly and effectively shared.\. \J The amount of memory storage an individual will typically carry will certainly be greater than humans make do with today, but the growth of knowledge will insure the impracticability of everybody lugging around all the world's knowledge. This implies that individuals will have to pick and choose what their minds contain at any one time. There will often be knowledge and skills available from others superior to a person's own. The incentive to substitute those talents for native ones will be overwhelming most of the time. This will result in a gradual erosion of individuality, and formation of an incredibly potent community mind.\. \J A pleasant possibility presents itself. Why should the mind transferral process be limited to human beings? Earthly life contains several species with brains as large as or larger than man's, from dolphins, our cephalic equals, to elephants and the large whales, and perhaps giant squid, with brains up to twenty times as big. If the technical problem of translation can be overcome, and it may be quite difficult for squid, in particular, since their minds are evolved entirely independently, then our culture could be fused with theirs, with each component used according to its value. In fact, a synthesis of all terrestrial life is desirable with the simpler organisms contributing only the information in their DNA, if that's all they have. In this way all the knowledge generated by terrestrial biological and cultural evolution will be retained in the data banks, available whenever needed. This is a far more secure form of storage than the present one, where genes and ideas are lost as species become extinct and individuals die. We now have a picture of a super-consciousness, the synthesis of terrestrial life, and perhaps jovian and martian life as well, constantly improving and extending itself, spreading outwards from the solar system, converting non-life into mind. There may be other such bubbles expanding from elsewhere. What happens when we meet another? Well, it's presumptuous of me to say at this tender stage of the evolution, but fusion of us with them is certainly a possibility, requiring only a translation scheme between the data representations. This process, possibly occuring now elsewhere, might convert the entire universe into an extended thinking entity.\. \_=34;\\_=0;\ \FBReferences \_L\+R\/=2;\A \FDSection 1: The Natural History of Intelligence\F3 [animal] JERISON, Harry J., \.RIOPELLE, A.J., ed. "Paleoneurology and the Evolution of Mind",\."Animal Problem Solving", Scientific American, Vol. 234, No. 1, \.Penguin Books, 1967. January 1976, 90-101. \.GOODRICH, Edwin S., BITTERMAN, M. E., \."Studies on the Structure and Development of Vertebrates", "The Evolution of Intelligence", \.Dover Publications Inc., New York, 1958. Scientific American, Vol. 212, No. 1, January 1965, 92-100. \.BUCHSBAUM, Ralph, \."Animals without Backbones", GRIFFIN, Donald R., ed. \.The University of Chicago Press, 1948. "Animal Engineering", W.H. Freeman and Company, \.FARAGO, Peter, and Lagnado, John, San Francisco, June 1974. \."Life in Action" \.Alfred A. Knopf, New York, 1972. BURIAN, Z. and Spinar, Z.V., "Life Before Man", \.BONNER, John Tyler, American Heritage Press, 1972. \."Cells and Societies", \.Princeton University Press, Princeton, 1955. \, [squid] COUSTEAU, Jacques-Yves and Diole, Philippe, \.BOYCOTT, Brian B., "Octopus and Squid", \."Learning in the Octopus", Doubleday & Company, Garden City, N.Y., 1973. \.Scientific American, Vol. 212, No. 3, (also a televised film of the same name) \.March 1965, 42-50. "The Octopus", \.LANE, Frank W., a televised film, Time-Life films. \."The Kingdom of the Octopus", \.Worlds of Science Book, Pyramid Publications Inc. \.October, 1962. [bird] BAKKER, Robert T., \.STETTNER, Laurence Jay and Matyniak, Kenneth A. "Dinosaur Renaissance", \."The Brain of Birds", Scientific American, Vol. 232, No. 4, \.Scientific American, Vol. 218, No. 6, April 1975. \.June 1968, 64-76. [whale] LILLY, John. C., \.STENUIT, Robert, "The Mind of the Dolphin" & \."The Dolphin, Cousin to Man", "Man and Dolphin", \.Bantam Books, New York, 1972. Doubleday and Company, New York, 1967. COUSTEAU, Jacques-Yves and Diole, Philippe,\."Whales and Dolphins", "The Whale", \.A BBC produced film shown Doubleday & Company, Garden City, N.Y., 1972.\.in the NOVA television series FICHTELIUS, Karl-Erik and Sjolander, Sverre,\.McINTYRE, Joan, ed. "Smarter than Man?", \."Mind in the Waters", Ballantine Books, New York, 1974. \.Charles Scribner's Sons, San Francisco, 1974. [elephant] RENSCH, Bernhard, "The Intelligence of Elephants", Scientific American, February 1957, 44. \, [primate] "The First Words of Washoe", \.PFEIFFER, John, Televised film shown in the NOVA series \."The Human Brain", Pyramid, N.Y. 1962. LeGROS CLARK, W.E., \.SAGAN, Carl, "History of the Primates", \."The Dragons of Eden", The University of Chicago Press, Chicago 1966.\.Random House, N.Y. 1977. \, \FDSection 2: Measuring Processing Power\F3 WILLOWS, A.O.D., \.HUBEL, David H., "Giant Brain Cells in Mollusks", \."The Visual Cortex of the Brain", Scientific American, Vol. 224, No. 2, \.Scientific American, February 1971, 69-75. \.November 1963, 54-62. KANDEL, Eric R., \.WILLIAMS, Peter L., Warwick, Roger "Nerve Cells and Behavior", \."Functional Neuroanatomy of Man", Scientific American, Vol. 223, No. 1, \.W.B. Saunders Company, Philadelphia, 1975. July 1970, 57-70. \."PDP-10 Reference Handbook", AGRANOFF, Bernard W., \.Digital Equipment Corporation, Maynard Mass., 1971. "Memory and Protein Synthesis", Scientific American, Vol. 216, No. 6, \.TRIBUS, Myron and McIrvine, Edward C., June 1967, 115-122. \."Energy and Information", \.Scientific American, Vol. 224, No. 3, KENNEDY, Donald, \.September 1971, 179-188. "Small Systems of Nerve Cells", Scientific American, Vol. 216, No. 5, \.GLASSTONE, Samuel, Lewis, David, May 1967, 44-52. \."Elements of Physical Chemistry", \.D. Van Nostrand Co. Inc., New York, 1960. BAKER, Peter F., "The Nerve Axon", \.MILLER, Richard T., Scientific American, Vol. 214, No. 3, \."Super Switch", 284, in Science Year annual 1975, March 1966, 74-82. \.Field Enterprises Educational Corp., 1975. \.LANDAUER, Rolf, \."Fundamental Limitations in the Computational \.Process", IBM, Yorktown Heights N.Y. 1976 \FDSection 3: The Growth of Processing Power\F3 McWHORTER, Eugene W. \.TIEN, P. K. "The Small Electronic Calculator", \."Integrated Optics", Scientific American, Vol. 234, No. 3, \.Scientific American, Vol. 230, No. 4, March 1976, 88-98. \.April 1974, 28-35. HODGES, David A., \.HITTINGER, William C. "Trends in Computer Hardware Technology", \.Metal-Oxide-Semiconductor Technology", Computer Design, Vol. 15, No. 2, \.Scientific American, Vol. 229, No. 2, February 1976, 77-85. \.August 1973, 48-57. SCRUPSKI, Stephen E., et al., \.BOBECK, Andrew H. and Scovil, H. E. D. "Technology Update", \."Magnetic Bubbles", Electronics, Vol. 48, No. 21, McGraw-Hill, \.Scientific American, Vol. 224, No. 6, October 16, 1975, 74-127. \.June 1971, 78-90. VACROUX, Andre G. \.HEATH, F. G. "Microcomputers", \."Large-Scale Integration in Electronics", Scientific American, Vol. 232, No. 5, \.Scientific American, Vol. 222, No. 3, May 1975, 32-40. \.February 1970, 22-31. TURN, Rien \.RAJCHMAN, Jan A. "Computers in the 1980's", \."Integrated Computer Memories", Columbia University Press, \.Scientific American, Vol. 217, No. 1, Rand Corporation, 1974. \.July 1967, 18-31. HITTINGER, William C. and Sparks, Morgan "Microelectronics", Scientific American, Vol. 213, No. 5, November 1965, 56-70. \, \FDSection 4: Mega Processing\F3 BATCHER, K.E. \.KNUTH, D.E. "Sorting Networks and their Applications", \."Sorting and Searching", 1968 Spring Joint Computer Conf. Proceedings\.The Art of Computer Programming, Vol. 3 April 1968, 307-314. \.Addison-Wesley, 1973. VAN VOORHIS, David C. \.BAKER, Henry G. and Hewitt, Carl "An Economical Construction for Sorting \."The Incremental Garbage Collection Networks", 1974 NCC Proceedings \.of Processes", AI*PL proceedings, April 1974, 921-927. \.Sigplan Notices, Aug 1977, 55-59.