Great Microprocessors of the Past and Present
Editor's Note:
John's Remote Copy may be more up-to-date.
Great Microprocessors of the Past and Present (V 11.7.0)
last major update: February 2000
last minor update: February 2000
Feel free to send me comments at (new email address):
john.bayko@sk.sympatico.ca
Laugh at my own amateur attempt at designing a processor architecture
at:
http://www.cs.uregina.ca/~bayko/design/design.html
Introduction: What's a "Great CPU"?
This list is not intended to be an exhaustive compilation of
microprocessors, but rather a description of designs that are either
unique (such as the RCA 1802, Acorn ARM, or INMOS Transputer), or
representative designs typical of the period (such as the 6502 or 8080,
68000, and R2000). Not necessarily the first of their kind, or the
best.
A microprocessor generally means a CPU on a single silicon chip,
but exceptions have been made (and are documented) when the CPU
includes particularly interesting design ideas, and is generally the
result of the microprocessor design philosophy. However, towards the
more modern designs, designs from other fields overlap, and this
criterion becomes rather fuzzy. In addition, parts that used to be
separate (FPU, MMU) are now usually considered part of the CPU design.
Another note on terminology - because of the muddling of the
terms "RISC" and "CISC" by marketroids, I've avoided using those terms here to
refer to architectures. And anyway, there are in fact four architecture
families, not two. So I use "memory-data" and "load-store" to refer to
CISC and RISC architectures.
This file is not intended as a reference work, though all attempts
(well, many attempts) have been made to ensure its accuracy. It
includes material from text books, magazine articles and papers,
authoritative descriptions and half remembered folklore from obscure
sources (and net.people who I'd like to thank for their many helpful
comments). As such, it has no bibliography or list of references.
In other words, "For entertainment use only".
Enjoy, criticize, distribute and quote from this list freely.
By: John Bayko (Tau).
Internet: john.bayko@sk.sympatico.ca
An explanation of the version numbers:
##.##.##
| | |
| | +-- small, usually 2 sentences or less.
| +--- changes a paragraph or more, or several descriptions
+---- CPU added or deleted.
Table of Contents
- Section One: Before the Great Dark Cloud.
- Section Two: Forgotten/Innovative Designs before the Great Dark Cloud
- Section Three: The Great Dark Cloud Falls:
- Section Four: Unix and RISC, a New Hope
- Part I: TRON, between the ages (1987)
- Part II: SPARC, an extreme windowed RISC (1987)
- Part III: AMD 29000, a flexible register set (1987)
- Part IV: Siemens 80C166, Embedded load-store with register windows.
- Part V: MIPS R2000, the other approach. (June 1986)
- Part VI: Hewlett-Packard PA-RISC, a conservative RISC (Oct 1986)
- Part VII: Motorola 88000, Late but elegant (mid 1988)
- Part VIII: Fairchild/Intergraph Clipper, An also-ran (1986)
- Part IX: Acorn ARM, RISC for the masses (1986)
- Part X: Hitachi SuperH series, Embedded, small, economical (1992)
- Part XI: Motorola MCore, RISC brother to ColdFire (Early 1998)
- Section Five: Born Beyond Scalar
- Section Six: Weird and Innovative Chips
- Appendices
Table of contents provided by Steve Simmons <scs@lokkur.dexter.mi.us>
Quick Index (in no particular order):
Processors:
- Intel 4004, 4040
- Intel 8008, 8080, 8085
- Intel 8048, 8051, 8052
- Intel 80x86, Pentium, AMD K5/K6, Cyrix M1, Nx586, IA-64
- Intel 80960
- Intel 80860
- Intel i432
- Motorola MC14500B
- Motorola 680x, 6809, Hitachi 6309
- Motorola 680x0, ColdFire
- Motorola 88000
- Motorola DSP96002/DSP56000
- Motorola MCore
- AMD 2901, 2903 (and 2910)
- AMD 9511 math processor
- AMD 29000
- Zilog Z-80, Z-280
- Zilog Z-8000, Z80000
- Fairchild F8
- Fairchild 9440
- Fairchild/Intergraph Clipper
- National Semiconductor SC/MP (and COP)
- National Semiconductor 320xx, Swordfish
- TI TMS1000 4-bit
- TI 9900 16-bit
- TI TMS320Cx0 DSP
- MIPS/SGI CPUs
- MOS Technologies 650x, Western Design Center 65816
- Microchip Technology PIC 16x
- RCA 1802
- Ferranti F100-L
- Western Digital MCP-1600
- Signetics 2650
- Hitachi 6301
- Signetics 8x300
- Siemens 80C166
- MISC M17
- Rekursiv
- AT&T CRISP/Hobbit
- INMOS Transputer T-212, T-414, T-800, T-9000
Architectures:
Virtual Machines:
Definitions And Explanations
Section One: Before the Great Dark Cloud.
Part I: The Intel 4004, the first (Nov 1971)
The first single chip CPU was the Intel 4004, a 4-bit processor
meant for a calculator. It processed data in 4 bits, but its
instructions were 8 bits long. Program and data memory were separate,
with 1K data memory and 4K program memory addressed by a 12-bit PC (held
in a 4 level stack, used for CALL and RET instructions). There were also
sixteen 4-bit (or eight 8-bit) general purpose registers.
The 4004 had 46 instructions, using only 2,300 transistors in a
16-pin DIP. It ran at a clock rate of 740kHz (eight clock cycles per
CPU cycle of 10.8 microseconds) - the original goal was 1MHz, to
allow it to compute BCD arithmetic as fast (per digit) as a 1960's era
IBM 1620.
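The cycle time quoted above follows directly from the clock rate. A quick check, using only the figures in the text (the variable names are just for illustration):

```python
# Instruction-cycle time of the 4004 from the figures given above.
CLOCK_HZ = 740_000          # 740kHz clock
CLOCKS_PER_CPU_CYCLE = 8    # eight clock cycles per CPU cycle

cycle_us = CLOCKS_PER_CPU_CYCLE / CLOCK_HZ * 1e6
print(f"CPU cycle: {cycle_us:.1f} microseconds")   # CPU cycle: 10.8 microseconds
```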
The 4040 (1972) was an enhanced version of the 4004, adding 14
instructions, larger (8 level) stack, 8K program space, and interrupt
abilities (including shadows of the first 8 registers).
[for additional information, see Appendix E]
- Intel Corporation:
- http://www.intel.com/
- Intel 25th Anniversary of the Microprocessor:
- http://www.intel.com/intel/museum/25anniv/index.htm
Part II: TMS 1000, First microcontroller (1974)
Texas Instruments followed the Intel 4004/4040 closely with the 4-bit
TMS 1000, which was the first microprocessor to include enough RAM,
and space for a program ROM, to allow it to operate without multiple
external support chips. It also had an innovative way to add
custom instructions to the CPU.
It included a 4-bit accumulator, 4-bit
Y register and 2 or 3-bit X register, which combined to create a 6 or
7 bit index register for the 64 or 128 nybbles of on chip RAM. A 1-bit
status register was used for various purposes in different contexts. The
6-bit PC combined with a 4 bit page register and an optional 1 bit bank
('chapter') register to produce 10 or 11 address bits to 1KB or
2KB of on-chip program ROM. There was also a 6-bit subroutine return
register and 4-bit page buffer, used as the destination on a branch,
or exchanged with the PC and page registers for a subroutine (amounting
to a 1-element stack, branches could not be performed within a
subroutine).
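As a rough sketch of how those register fields might combine into addresses. This assumes, since the text does not say, that X supplies the high bits of the RAM index and that the chapter bit sits above the page register; the function names are made up for illustration:

```python
def ram_address(x: int, y: int) -> int:
    """Combine the X register (2 or 3 bits) with the 4-bit Y register.

    Assumption: X forms the high bits, Y the low four bits.
    """
    return (x << 4) | (y & 0xF)


def rom_address(pc: int, page: int, chapter: int = 0) -> int:
    """Combine 6-bit PC, 4-bit page and optional 1-bit chapter register.

    Assumption: chapter is the top bit, then page, then PC.
    """
    return ((chapter & 1) << 10) | ((page & 0xF) << 6) | (pc & 0x3F)


print(ram_address(0b11, 0xF))      # 63   (last nybble of a 64-nybble RAM)
print(ram_address(0b111, 0xF))     # 127  (last nybble of a 128-nybble RAM)
print(rom_address(0x3F, 0xF))      # 1023 (top of a 1KB ROM)
print(rom_address(0x3F, 0xF, 1))   # 2047 (top of a 2KB ROM)
```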
An interesting feature of the PC is that it was incremented
using a feedback shift register, not a counter, so instructions were
not consecutive in memory, but since all memory was internal, this was
not a problem. Instructions were 8 bits with twelve hardwired, and with
a 31X16 element PLA allowing 31 custom microprogrammed instructions. All
hardwired instructions were single cycle, and no interrupts were
allowed.
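To see why a feedback shift register gives a non-consecutive but still complete walk through a page, here is a generic 6-bit example. The feedback taps are an illustrative choice that happens to give a maximal-length sequence; they are not taken from TMS 1000 documentation, and the real part may differ:

```python
def lfsr_next(pc: int) -> int:
    """Advance a 6-bit feedback shift register by one step.

    The taps (bits 5 and 4) are an assumption made for illustration only.
    """
    feedback = ((pc >> 5) ^ (pc >> 4)) & 1
    return ((pc << 1) | feedback) & 0x3F


def sequence(start: int = 1) -> list:
    """Collect the values visited until the register repeats."""
    seen = []
    pc = start
    while pc not in seen:
        seen.append(pc)
        pc = lfsr_next(pc)
    return seen


order = sequence()
print(len(order))     # 63 distinct values before the sequence repeats
print(order[:6])      # [1, 2, 4, 8, 16, 33] - not consecutive addresses
```

A plain shift register like this one never reaches the all-zero value, so it covers 63 of the 64 locations; a real design would need extra logic to use the last one.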
It gained fame in the movie "E.T. the Extra-Terrestrial" as the brains
in the Texas Instruments "Speak and Spell" educational toy.
- Texas Instruments:
- http://www.ti.com/
- TMS 1000 One-Chip Microcomputers:
- http://www.ti.com/corp/docs/history/tms.htm
Part III: The Intel 8080 (April 1974)
The 8080 was the successor to the 8008 (April 1972, intended as a
terminal controller, and similar to the 4040). While the 8008 had a 14 bit
PC and addressing, the 8080 had a 16 bit address bus and an 8 bit data bus.
Internally it had seven 8 bit registers (A-E, H, L - pairs BC, DE and HL
could be combined as 16 bit registers), a 16 bit stack pointer to memory
which replaced the 8 level internal stack of the 8008, and a 16 bit
program counter. It also had several I/O ports - 256 of them, so I/O
devices could be hooked up without taking away or interfering with the
addressing space, and a signal pin that allowed the stack to occupy a
separate bank of memory.
The 8080 was used in the Altair 8800, the first widely-known
personal computer (though the definition of 'first PC' is fuzzy. Some
claim that the 12-bit LINC (Laboratory INstruments Computer) was the
first 'personal computer'. Developed at MIT (Lincoln Labs) in 1963
using DEC components, it inspired DEC to design its own
PDP-8 in 1965, also considered an early 'personal
computer'. 'Home computer' would probably be a better term
here, though).
Intel updated the design with the 8085 (1976), which added two
instructions to enable/disable three added interrupt pins (and the
serial I/O pins), and simplified hardware by only using +5V power, and
adding clock generator and bus controller circuits on-chip.
- Intel Corporation:
- http://www.intel.com/
- Intel 25th Anniversary of the Microprocessor:
- http://www.intel.com/intel/museum/25anniv/index.htm
Part IV: The Zilog Z-80 - End of an 8-bit line (July 1976)
The Z-80 was intended to be an improved 8080 (designed by ex-Intel
engineers), and it was - vastly improved. It also used 8 bit data and
16 bit addressing, and could execute all of the 8080 (but not 8085) op
codes, but included 80 more instructions (1, 4, 8 and 16 bit operations
and even block move and block I/O). The register set was doubled, with two
banks of data registers (including A and F) that could be switched between.
This allowed fast operating system or interrupt context switches. The
Z-80 also added two index registers (IX and IY) and 2 types of relocatable
vectored interrupts (direct or via the 8-bit I register).
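A minimal sketch of why a second register bank makes context switches fast: nothing is copied to memory, the banks are simply exchanged. The method names follow the Z-80's EX AF,AF' and EXX instructions; the class itself is a toy model, not an emulator:

```python
class Z80Registers:
    """Main and alternate banks of the Z-80 data registers (toy model)."""

    NAMES = ("A", "F", "B", "C", "D", "E", "H", "L")

    def __init__(self):
        self.main = {name: 0 for name in self.NAMES}
        self.alt = {name: 0 for name in self.NAMES}

    def _swap(self, names):
        for name in names:
            self.main[name], self.alt[name] = self.alt[name], self.main[name]

    def ex_af(self):
        """Model of EX AF,AF': exchange A and F with the other bank."""
        self._swap(("A", "F"))

    def exx(self):
        """Model of EXX: exchange BC, DE and HL with the other bank."""
        self._swap(("B", "C", "D", "E", "H", "L"))


regs = Z80Registers()
regs.main["A"] = 0x42         # value in use by the interrupted program
regs.ex_af(); regs.exx()      # interrupt entry: switch banks, nothing pushed
regs.main["A"] = 0x99         # the handler works in the other bank
regs.ex_af(); regs.exx()      # interrupt exit: switch back
print(hex(regs.main["A"]))    # 0x42
```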
Clock speeds ranged from the original Z-80 2.5MHz to the Z80-H
(later called Z80-C) at 8MHz, and later a CMOS version at 10MHz.
Like many processors (including the 8085), the Z-80 featured many
undocumented instructions. In some cases, they were a by-product of
early designs (which did not trap invalid op codes, but tried to
interpret them as best they could), and in other cases chip area near
the edge was used for added instructions, but fabrication made the
failure rate high. Instructions that often failed were just not
documented, increasing chip yield. Later fabrication made these more
reliable.
But the thing that really made the Z-80 popular in designs was the
memory interface - the CPU generated its own RAM refresh signals,
which meant easier design and lower system cost, the deciding factor
in its selection for the TRS-80 Model 1. That and its 8080
compatibility, and CP/M, the first standard microprocessor operating
system, made it the first choice of many systems.
Embedded variants of the Z-80 were also produced. Hitachi produced
the 64180 (1984) with added components (two
16 bit timers, two DMA controllers, three serial ports, and a
segmented MMU mapping a 20 bit (1M) address
space to any three variable sized segments in the 16 bit (64K) Z-80
memory map), a design Zilog and Hitachi later refined to produce the
Z-180 and HD64180Z (1987?) which were compatible with Z-80 peripheral
chips, plus variants (Z-181, Z-182). The Z-280 was a 16 bit
version introduced about July, 1987 (loosely based on the ill-fated
Z-800), with a paged (like Z-180) 24 bit (16M) MMU (8 or 16 bit bus
resizing), user/supervisor modes and features for multitasking, a 256
byte (4-way) cache, 4 channel DMA, and a huge number of new op codes
tacked on (total of almost 3,500, including previously undocumented
Z-80 instructions), though the size made some very slow.
Internal clock could be run at twice the external clock (e.g. 16MHz CPU
with an 8MHz bus), and additional on-chip components were available. A
16/32 bit Z-380 version also exists (1994) with added 32-bit linear
addressing mode (not Z-80 compatible).
The Z-8 (1979) was an embedded processor with
on-chip RAM (actually a set of 124 general and 20 special purpose
registers) and ROM (often a BASIC interpreter), and is available in a
variety of custom configurations up to 20MHz. Not actually related to
the Z-80.
- Zilog Corporation:
- http://www.zilog.com/
- 20th Anniversary of the TRS-80:
- http://www.radioshack.com/trs_80/
Part V: The 650x, Another Direction (1975)
Shortly after Intel's 8080, Motorola introduced the 6800. Some of the
designers left to start MOS Technologies (later bought by
Commodore), which introduced the 650x series which included the 6501
(pin compatible with the 6800, taken off the market almost immediately
for legal reasons) and the 6502 (used in early Commodores, Apples and
Ataris). Like the 6800 series, variants were produced which
added features like I/O ports (6510 in the Commodore 64) or
reduced costs with smaller address buses (6507 13-bit 8K address bus
in the Atari 2600).
The 650x was little endian (lower address byte could be added to an
index register while higher byte was fetched) and had a completely
different instruction set from the big endian 6800.
Apple designer Steve Wozniak
described it as the first chip you could get for less than a hundred
dollars (actually a quarter of the 6800 price) -
it became the CPU of choice for many early home computers (8 bit
Commodore and Atari products).
Unlike the 8080 and its kind, the 6502 (and 6800) had very few
registers. It was an 8 bit processor, with 16 bit address bus. Inside was
one 8 bit data register, two 8 bit index registers, and an 8 bit stack
pointer (stack was preset from address 256 ($100 hex) to 511 ($1FF)). It
used these index and stack registers effectively, with more addressing
modes, including a fast zero-page mode that accessed memory addresses
from address 0 to 255 ($FF) with an 8-bit address that sped up operations
(it didn't have to fetch a second byte for the address).
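A small sketch of the two fixed pages described above. The helper names are hypothetical, and the byte counts cover only the address part of an instruction, not the op code:

```python
STACK_PAGE = 0x0100   # the stack is fixed at $100-$1FF


def stack_address(sp: int) -> int:
    """Full 16-bit address selected by the 8-bit stack pointer."""
    return STACK_PAGE | (sp & 0xFF)


def address_bytes(address: int) -> int:
    """Address bytes an instruction must carry: 1 in zero page, else 2."""
    return 1 if 0 <= address <= 0xFF else 2


print(hex(stack_address(0xFF)))   # 0x1ff
print(hex(stack_address(0x00)))   # 0x100
print(address_bytes(0x0080))      # 1 (zero page, one fewer byte to fetch)
print(address_bytes(0x1234))      # 2
```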
Back when the 6502 was introduced, RAM was actually faster than
microprocessors, so it made sense to optimize for RAM access rather than
increase the number of registers on a chip. It also had a lower gate
count (and cost) than its competitors.
The 650x also had undocumented instructions.
The CMOS 65C02/65C02S fixed some original 6502 design flaws, and
the 65816 (officially W65C816S, both designed by Bill Mensch of Western
Design Center Inc.) extended the 650x to 16 bits internally, including
index and stack registers, with a 16-bit direct page register (similar
to the 6809), and 24-bit address bus
(16 bit registers plus 8 bit data/program bank registers).
It included an 8-bit emulation mode. Microcontroller versions of both
exist, and a 32-bit version (the 65832) is planned. Various licensed
versions are supplied by GTE (16 bit G65SC802 (pin compatible with 6502),
and G65SC816 (support for VM, I/D cache, and multiprocessing)) and Rockwell
(R65C40), and Mitsubishi has a redesigned compatible version.
The 6502 remains surprisingly popular largely because of the variety of
sources and support for it.
The 6502-based Apple II line (not backwards compatible with the
Apple I) was among the first microcomputers introduced and became the
longest running PC line, eventually including the 65816-based Apple IIgs.
The 6502 was also used in the Nintendo Entertainment System (NES),
and the 65816 was used in its 16-bit successor, the Super NES, before
Nintendo switched to MIPS embedded processors.
- The Western Design Center, Inc.:
- http://www.wdesignc.com
Part VI: The 6809, extending the 680x (1977)
Like the 6502, the 6809 was based on the Motorola 6800 (August 1974),
though the 6809 expanded
the design significantly. The 6809 had two 8 bit accumulators (A & B)
and could combine them into a single 16 bit register (D). It also
featured two index registers (X & Y) and two stack pointers (S & U),
which allowed for some very advanced addressing modes (The 6800 had
A & B (and D) accumulators, one index register and one stack register).
The 6809 was source compatible with the 6800, even though the 6800 had
78 instructions and the 6809 only had around 59. Some instructions were
replaced by more general ones which the assembler would translate, and
some were even replaced by addressing modes. While the 6800 and 6502 both
had a fast 8 bit mode to address the first 256 bytes of RAM, the 6809 had
an 8 bit Direct Page register to locate this fast address page anywhere
in the 64K address space.
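The direct page idea can be sketched in a couple of lines: the 8 bit register supplies the high byte of the address and the instruction supplies the low byte (the function name is made up for illustration):

```python
def direct_address(dp: int, offset: int) -> int:
    """16-bit address formed from the Direct Page register and an
    8-bit offset taken from the instruction."""
    return ((dp & 0xFF) << 8) | (offset & 0xFF)


print(hex(direct_address(0x00, 0x42)))   # 0x42   - same page as the 6800/6502
print(hex(direct_address(0x20, 0x42)))   # 0x2042 - page moved elsewhere in 64K
```

With the register set to zero this behaves like the fixed fast page of the 6800 and 6502.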
Other features were one of the first multiplication instructions of
the time, 16 bit arithmetic, and a special fast interrupt. But it was
also highly optimized, gaining up to five times the speed of the 6800
series CPU. Like the 6800, it included the undocumented HCF (Halt Catch
Fire) instruction to incrementally strobe the address lines for bus
testing ("jump to accumulator (A or B)" in the 6800, implemented and
documented as $00 in the 68HC11 which is described below).
The 6800 and 6809, like the 6502 series, used a single clock cycle
(the base cycle, plus a cycle rotated 90 degrees out of phase) to
generate the timing for four internal execution stages, so that there
were instructions which executed in one external 'cycle' (this is
different from clock-doubling, which uses a phase-locked-loop to generate
a faster internal clock which is synchronised with an external clock).
Most CPUs, such as the 8080, used the external clock directly, so an
equivalent instruction would take four cycles, meaning a 2MHz 6809
would be roughly equivalent to an 8MHz 8080.
The 680x and 650x only accessed memory every other cycle, allowing a
peripheral (such as video, or even a second CPU) to access the same
memory without conflict. Motorola later produced CPUs in this line with a
standard four-cycle clock.
The 6800 lived on as well, becoming the 6801/3, which included ROM,
some RAM, a serial I/O port, and other goodies on the chip (as an
embedded controller, minimizing part counts - but expensive at 35,000
transistors. The 6805 was a cheaper 6801/3, dropping seldom used
instructions and features). Later the 68HC11 version (two 8 bit/one
16 bit data register, two 16 bit index, and
one 16 bit stack register, and an expanded instruction set with 16 bit
multiply operations) was extended to 16 bits as the 68HC16 (additional
16-bit accumulator E, three index registers IX, IY, IZ, plus extension
registers to add 4 bits to addresses and accumulator E for a 1M address
space, plus 16-bit multiply registers HR and IR and 36-bit AM
accumulator), and a lower cost 16 bit 68HC12 (May 1996). It remains a
popular embedded processor (with over 2 billion 6800 variants sold), and
radiation hardened versions of the 68HC11 have been used in
communications satellites. But the 6809 was a very fast and flexible
chip for its time, particularly with the addition of the OS-9 operating
system.
Of course, I'm a 6809 fan myself...
As a note, Hitachi produced a version called the 6309. Compatible
with the 6809, it added 2 new 8-bit registers (E and F) that could be
combined to form a second 16 bit register (W), and all four 8-bit
registers could form a 32 bit register (Q). It also featured hardware
division, and some 32 bit arithmetic, a zero register (always 0 on
read), block move, and was generally 30% faster in native mode. Also,
unlike the 6809, the 6309 could trap on an illegal instruction. These
enhancements, surprisingly, never appeared in official Hitachi
documentation.
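How the 8-bit registers stack up into wider ones can be sketched as follows. The byte order (A as the most significant byte of Q, F as the least) is an assumption based on D being A:B and W being E:F; the helper names are hypothetical:

```python
def pair16(high: int, low: int) -> int:
    """Two 8-bit registers viewed as one 16-bit register (D = A:B, W = E:F)."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


def quad32(a: int, b: int, e: int, f: int) -> int:
    """All four 8-bit registers viewed as the 32-bit Q register (D:W)."""
    return (pair16(a, b) << 16) | pair16(e, f)


print(hex(pair16(0x12, 0x34)))               # 0x1234     (D)
print(hex(quad32(0x12, 0x34, 0x56, 0x78)))   # 0x12345678 (Q)
```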
- Motorola:
- http://www.mot.com/
- Motorola Microcontrollers:
- http://www.mcu.motsps.com/mc.html
- TRS-80 Color Computer Homepage (has 6809/6309 links):
- http://www.sfn.saskatoon.sk.ca/~ab594/coco/coco.html
Part VII: Advanced Micro Devices Am2901, a few bits at a time
Bit slice processors were modular processors. Mostly, they
consisted of an ALU of 1, 2, 4, or 8 bits, and control lines (including
carry or overflow signals usually internal to the CPU). Two 4-bit ALUs
could be arranged side by side, with control lines between them, to
form an ALU of 8-bits, for example. A sequencer would execute a program
to provide data and control signals.
The Am2901, from Advanced Micro Devices, was a popular 4-bit-slice
processor. It featured sixteen 4-bit registers and a 4-bit ALU, and
operation signals to allow carry/borrow or shift operations and such to
operate across any number of other 2901s. An address sequencer (such as
the 2910) could provide control signals with the use of custom
microcode in ROM.
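A minimal sketch of the chaining idea, modelling each slice as a 4-bit adder with carry in and carry out. This illustrates cascading in general; it is not a model of the Am2901's actual signals or operations, and the function names are made up:

```python
def slice_add(a4: int, b4: int, carry_in: int):
    """One 4-bit slice: returns (4-bit sum, carry out)."""
    total = (a4 & 0xF) + (b4 & 0xF) + carry_in
    return total & 0xF, total >> 4


def add_with_slices(a: int, b: int, slices: int = 2):
    """Add two numbers using `slices` 4-bit slices side by side.

    The carry out of each slice feeds the carry in of the next one.
    """
    result = 0
    carry = 0
    for i in range(slices):
        shift = 4 * i
        nibble, carry = slice_add(a >> shift, b >> shift, carry)
        result |= nibble << shift
    return result, carry


print(add_with_slices(0x3C, 0x55))              # (145, 0)  0x3C + 0x55 = 0x91
print(add_with_slices(0xF0, 0x20))              # (16, 1)   8-bit overflow
print(add_with_slices(0x1234, 0x0FFF, slices=4))  # (8755, 0) = 0x2233
```

Adding more slices widens the ALU without changing the slice itself, which is the appeal of the approach.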
The Am2903 featured hardware multiply.
Legend holds that some Soviet clones of the PDP-11 were assembled from
Soviet clones of the Am2901.
Since it doesn't fit anywhere else in this list, I'll mention it
here...
AMD also produced what is probably the first floating point
"coprocessor" for microprocessors, the AMD 9511 "arithmetic circuit"
(1979), which performed 32 bit (23 + 7 bit floating point) RPN-style
operations (4 element stack) under CPU control - the 64-bit 9512 (1980)
lacked the transcendental functions. It was based on a 16-bit ALU,
performed add, subtract, multiply, and divide (plus sine and cosine),
and while faster than software on microprocessors of the time (about 4X
speedup over a 4MHz Z-80), it was much slower (at 200+ cycles for
32*32->32 bit multiply) than more modern math coprocessors are.
It was used in some CP/M (Z-80)
systems, and possibly on an S-100 bus math card for NorthStar systems
(I've heard conflicting information about
whether it used an AMD unit). Calculator circuits (such as the National
Semiconductor MM57109 (1980), actually a 4-bit NS COP400 processor with
floating point routines in ROM) were also sometimes used, with emulated
keypresses sent to it and results read back, to simplify programming
rather than for speed.
Part VIII: Intel 8051, Descendant of the 8048.
Initially similar to the Fairchild F8, the Intel 8048 was also designed
as a microcontroller rather than a microprocessor - low cost and small
size were the main goals. For this reason, data is stored on-chip, while
program code is external (a true Harvard architecture).
The 8048 was eventually replaced by the very popular but bizarre
8051 and 8052.
While the 8048 used 1-byte instructions, the 8051 has a more
flexible 2-byte instruction set. It has eight 8-bit registers, plus an
accumulator A.
Data space is 128 bytes accessed directly or indirectly by a
register, plus another 128 above that in the 8052 which can only be
accessed indirectly (usually for a stack). External memory occupies
the same address space, and can be accessed directly (in a 256 byte
page via I/O ports) or through the 16 bit DPTR address register much
like in the RCA 1802. Direct data above location
32 is bit-addressable. Although complicated, these memory models allow
flexibility in embedded designs, making the 8051 very popular (over 1
billion sold since 1988).
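The bit-addressable area can be sketched as a mapping from a bit address to a byte and a mask. This assumes the usual 8051 arrangement, which the text above does not spell out: 128 bit addresses covering the 16 bytes starting at location 32. The function name is hypothetical:

```python
BIT_AREA_BASE = 0x20   # location 32, start of the bit-addressable bytes


def bit_location(bit_address: int):
    """Map a bit address (0-127) to (byte address, bit mask).

    Assumption: eight consecutive bit addresses share one byte.
    """
    if not 0 <= bit_address <= 0x7F:
        raise ValueError("bit address out of range")
    return BIT_AREA_BASE + (bit_address >> 3), 1 << (bit_address & 7)


print(bit_location(0x00))   # (32, 1)
print(bit_location(0x0A))   # (33, 4)
print(bit_location(0x7F))   # (47, 128)
```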
The Siemens 80C517 adds a math coprocessor to the CPU which
provides 16 and 32 bit integer support plus basic floating point
assistance (32 bit normalise and shift), reminiscent of the old
AMD 9511. The Texas Instruments TMS370 is similar
to the 8051, adding a B accumulator and some 16 bit support.
As a side note, the 4-bit Texas Instruments TMS1000 was the first
CPU to integrate RAM (32 bytes), ROM (1K), a clock, and I/O support on
a single chip, making it the first microcontroller.
- Intel Corporation:
- http://www.intel.com/
- Intel MCS (R) 51 Family:
- http://support.intel.com/oem_developer/mcs51.htm
Part IX: Microchip Technology PIC 16x/17x, call it RISC (1975)
The roots of the PIC originated at Harvard University (see
Harvard Architecture) for a Defense Department
project, but the design was beaten by a simpler (and more reliable at the
time) single memory design from Princeton. Harvard Architecture was first
used in the Signetics 8x300, and was adapted by General Instruments for
use as a peripheral interface controller (PIC) which was designed to
compensate for poor I/O in its 16 bit CP1600 CPU. The microelectronics
division was eventually spun
off into Arizona Microchip Technology (around 1985), with the PIC as its
main product.
The PIC has a large register set (from 25
to 192 8-bit registers, compared to the Z-8's
144). There are up to 31 direct registers, plus an accumulator W,
though R1 to R8 also have special functions - R2 is the PC (with
implicit stack (2 to 16 level)), and R5 to R8 control I/O ports. R0 is
mapped to the register R4 (FSR) points to (similar to the ISAR in the
F8, it's the only way to access R32 or above).
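The R0/R4 arrangement amounts to indirect addressing through a pointer register. A toy model, assuming a 64-register part and a direct register field that can only name R0 to R31 (the class and method names are made up for illustration):

```python
class PicRegisterFile:
    """Toy model of PIC register access through R0 and R4 (FSR)."""

    INDF = 0   # R0: not a real register, stands for "whatever FSR points to"
    FSR = 4    # R4: holds the number of the register to access indirectly

    def __init__(self, size: int = 64):
        self.regs = [0] * size

    def _resolve(self, n: int) -> int:
        # An access to R0 is redirected to the register selected by FSR.
        if n == self.INDF:
            return self.regs[self.FSR] % len(self.regs)
        return n

    def read(self, n: int) -> int:
        return self.regs[self._resolve(n)]

    def write(self, n: int, value: int) -> None:
        self.regs[self._resolve(n)] = value & 0xFF


rf = PicRegisterFile()
rf.write(PicRegisterFile.FSR, 40)      # point FSR at R40
rf.write(PicRegisterFile.INDF, 0xAB)   # a write to R0 lands in R40
print(hex(rf.regs[40]))                # 0xab
print(hex(rf.read(PicRegisterFile.INDF)))   # 0xab
```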
The 16x is very simple and RISC-like (but less so than the
RCA 1802 or the more recent 8-bit Atmel AVR
microcontroller which is a canonical simple load-store design - 16-bit
instructions, 2-stage pipeline, thirty-two 8-bit data registers (six
usable as three 16-bit X, Y, and Z address registers), load/store
architecture (plus data/subroutine stack)).
It has only 33 fixed length 12-bit
instructions, including several with a skip-on-condition flag to skip
the next instruction (for loops and conditional branches), producing
tight code important in embedded applications. It's marginally
pipelined (2 stages - fetch and execute) - combined with single cycle
execution (except for branches - 2 cycles), performance is very good
for its processor category.
The 17x has more addressing modes (direct, indirect,
and relative - indirect mode instructions take 2 execution cycles),
more instructions (58 16-bit), more registers (232 to 454),
plus up to 64K-word program space (2K to 8K on chip). The high end
versions also have single cycle 8-bit unsigned multiply instructions.
The PIC 16x is an interesting look at an 8 bit design made with
slightly newer design techniques than other 8 bit CPUs in this list
- around 1978 by General Instruments (the 1650, a
successor to the more general 1600). It lost out to
more popular CPUs and was later sold to Microchip Technology, which
still sells it for small embedded applications. An example of this
microprocessor is a small PC board called the BASIC Stamp,
consisting of 2 ICs - an 18-pin PIC 16C56 CPU (with a BASIC interpreter
in 512 word ROM (yes, 512)) and 8-pin 256 byte serial EEPROM (also
made by Microchip) on an I/O port where user programs (about 80
tokenized lines of BASIC) are stored.
- Microchip Technology:
- http://www.microchip.com/
- Atmel:
- http://eu.atmel.com/
Section Two: Forgotten/Innovative Designs before the Great Dark Cloud
Part I: RCA 1802, weirdness at its best (1974)
The RCA 1802 was an odd beast, extremely simple and fabricated in
CMOS, which allowed it to run at 6.4 MHz (at 10V, but very fast for
1974) or suspended with the clock stopped. It was a single chip
version of the previous two-chip 1801, an 8 bit processor,
with 16 bit addressing, but the major features were its extreme
simplicity, and the flexibility of its large register set. Simplicity
was the primary design goal, and in that sense it was one of the first
"RISC" chips.
It had sixteen 16-bit registers, which could be accessed as
thirty-two 8 bit registers, and an accumulator D used for arithmetic
and memory access - memory to D, then D to registers, and vice versa,
using one 16-bit register as an address. This led to one person
describing the 1802 as having 32 bytes of RAM and 65535 I/O ports. A
4-bit control register P selected any one general register as the
program counter, while control registers X and N selected registers for
I/O Index, and the operand for current instruction. All instructions
were 8 bits - a 4-bit op code (total of 16 operations) and 4-bit
operand register stored in N.
There was no real conditional branching (there were conditional
skips which could implement it, though), no subroutine support, and no
actual stack, but clever use of the register set allowed these to be
implemented - for example, changing P to another register allowed jump
to a subroutine. Similarly, on an interrupt P and X were saved, then R1
and R2 were selected for P and X until an RTI restored them.
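A toy model of the "change P to call a subroutine" trick. The register numbers used for the main program and the subroutine are arbitrary choices for the example, and the class is a sketch rather than an 1802 emulator:

```python
class Cosmac:
    """Sixteen 16-bit registers, any of which can act as program counter."""

    def __init__(self):
        self.r = [0] * 16
        self.p = 0           # 4-bit selector: which register is the PC

    @property
    def pc(self) -> int:
        return self.r[self.p]

    def step(self) -> None:
        """Advance whichever register is currently the PC."""
        self.r[self.p] = (self.r[self.p] + 1) & 0xFFFF

    def sep(self, n: int) -> None:
        """Select register n as the program counter."""
        self.p = n & 0xF


cpu = Cosmac()
cpu.r[3] = 0x0100      # main program counter (arbitrary choice of R3)
cpu.r[4] = 0x0200      # subroutine entry point held in R4
cpu.p = 3

cpu.sep(4)             # "call": execution continues at the address in R4
print(hex(cpu.pc))     # 0x200
cpu.step()             # the subroutine runs, advancing R4
cpu.sep(3)             # "return": R3 was never touched, so main resumes
print(hex(cpu.pc))     # 0x100
```

No return address is saved anywhere; the caller's position simply stays in its own register while another register does the counting.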
A later version, the 1805, was enhanced, adding several Forth
language primitives (Forth is commonly used in control applications).
Apart from the COSMAC microcomputer kit, the 1802 saw action in
some video games from RCA and Radio Shack, and the chip is the heart of
the Voyager, Viking and Galileo (along with some AMD 2900 bit slice
processors) probes. One reason for this is that a version of the 1802
used silicon on sapphire (SOS) technology, which leads to radiation and
static resistance, ideal for space operation. It is still available
from Harris Semiconductors.
- Harris Semiconductors:
- http://www.semi.harris.com/
- Microcontroller Primer FAQ:
- http://www.hitex.com/FAQ/primer/
Part II: Fairchild F8, Register windows
The F8 was an 8 bit processor. The processor itself didn't have an
address bus - program and data memory access were contained in separate
units, which reduced the number of pins, and the associated cost. It
also featured 64 registers, accessed by the ISAR register in cells
(windows) of eight, which meant external RAM wasn't always needed for
small applications. In addition, the 2-chip processor didn't need
support chips, unlike others which needed seven or more. The F8
inspired other similar CPUs, such as the Intel 8048.
The use of the ISAR register allowed a subroutine to be entered
without saving a bunch of registers, speeding execution - the ISAR
would just be changed. Special purpose registers were stored in the
second cell (regs 9-15), and the first eight registers were accessed
directly (globally).
The windowing concept was useful, but only the register pointed to
by the ISAR could be accessed - to access other registers the ISAR was
incremented or decremented through the window.
Fairchild ended up as part of National Semiconductor, before being
spun off again in 1997.
- Fairchild Semiconductor:
- http://www.national.com/fairchild/
Part III: SC/MP, early advanced multiprocessing (April 1976)
The National Semiconductor SC/MP (Single Chip/Micro Processor,
nicknamed "Scamp") was a typical
8 bit processor intended for control applications (a simple BASIC 2.5K
ROM was added to one version). It featured 16 bit addressing, with 12
address lines and 4 lines borrowed from the data bus (it was common to
borrow lines (sometimes all of them) from the data bus for addressing -
however only the lower 12 index register/PC bits were incremented (4K
pages), special instructions modified upper 4 bits). Internally, it
included four index registers (P1 to P3, plus the PC/P0) and two 8 bit
registers. It had no stack pointer or subroutine instructions (though
they could be emulated with index registers). During interrupts, the PC
and P3 were swapped. It was meant for embedded control, and many
features were omitted for cost reasons. It was also bit serial
internally to keep it cheap.
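The 4K paging described above can be sketched as an increment that only touches the lower 12 bits of a pointer (the function name is made up for illustration):

```python
def increment_pointer(ptr: int) -> int:
    """Increment the low 12 bits of a 16-bit pointer, keeping the page.

    The upper 4 bits select a 4K page and are left unchanged, so the
    pointer wraps to the start of its own page instead of the next one.
    """
    return (ptr & 0xF000) | ((ptr + 1) & 0x0FFF)


print(hex(increment_pointer(0x1FFE)))   # 0x1fff
print(hex(increment_pointer(0x1FFF)))   # 0x1000 - wraps within the page
```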
The unique feature was the ability to completely share a system bus
with other processors. Most processors of the time assumed they were
the only ones accessing memory or I/O devices. Multiple SC/MPs (as well
as other intelligent devices, such as DMA controllers) could be hooked
up to the bus. A control line (ENOUT (Enable Out) to ENIN) could be
chained along the processors to allow cooperative processing. This was
very advanced for the time, compared to other CPUs, but the bit-serial
CPU was slow (even simple instructions took 5-7 cycles, while memory
access was 2 cycles, which allowed them to share a memory bus without
saturating it, as opposed to a 6502
which could share memory with at most one other CPU, and only then
because of the way the CPU clock was used). However this feature was
almost never used for multiprocessing.
In addition to I/O ports like the 8080, the SC/MP also had
instructions and one pin for serial input and one for output.
National Semiconductor eventually replaced the SC/MP with the COP4
(4 bit) and COP8 (8 bit) embedded controllers,
with only two index registers, but adding stack support.
- National Semiconductor:
- http://www.national.com/
- National Semiconductor Microcontroller Technology:
- http://www.national.com/appinfo/mcu/
Part IV: F100-L, a self expanding design
The Ferranti F100-L was designed by a British company for the
British Military. It was an 8 bit processor, with 16 bit addressing,
but it could only access 32K of memory (1 bit for
indirection).
The unique feature of the F100-L was that it had a complete control
bus available for a coprocessor that could be added on. Any instruction
the F100-L couldn't decode was sent directly to the coprocessor for
processing. Applications for coprocessors at the time were limited, but
the design is still used in some modern processors, such as the
National Semiconductor 320xx
series (the predecessor of the Swordfish
processor, described later), which included FPU, MMU, and other
coprocessors that could just be added to the CPU's coprocessor bus in a
chain. Other units not foreseen could be added later.
Ferranti, which built the Ferranti Mark 1 (Britain's first commercial
electronic computer), no longer makes microprocessors.
Part V: The Western Digital 3-chip CPU (June 1976)
The Western Digital MCP-1600 was probably the most flexible
processor available. It consisted of at least four separate chips,
including the control circuitry unit, the ALU, two or four ROM chips
with customisable microcode (like the old 4-bit Texas Instruments
TMS 1000), and timing circuitry. It doesn't really count as a
microprocessor, but neither do bit-slice processors (AMD 2901).
The ALU chip contained twenty six 8 bit registers and an 8 bit ALU,
while the control unit supervised the moving of data, memory access,
and other control functions. The ROM allowed the chip to function as
either an 8 bit chip or 16 bit, with clever use of the 8 bit ALU. Even
more, microcode allowed the addition of floating point routines (40 + 8
bit format), simplifying programming (and possibly producing a Floating
Point Coprocessor).
Two standard microcode ROMs were available. This flexibility was
one reason it was also used to implement the DEC LSI-11 processor as
well as the WD Pascal Microengine.
Part VI: Intersil 6100, old design in a new package
The IMS 6100 was a single chip design of the PDP-8 minicomputer
(1965) from DEC (low cost successor to the PDP-5 (1963)).
The old PDP-8 design was very strange, and if it hadn't been
so popular, an awkward CPU like the 6100 would have never had a reason
to exist.
The 6100 was a 12 bit processor, which had exactly three registers
- the PC, AC (an accumulator), and MQ. All 2 operand instructions read
AC and MQ, and wrote back to AC. It had a 12 bit address bus, limiting
RAM to only 4K words. Memory references used a 7 bit (128 word) offset,
either from address 0 or from the start of the 128 word page holding
the current instruction (the upper bits of the PC).
It had no stack. Subroutines stored the PC in the first word of the
subroutine code itself, so recursion wasn't possible without fancy
programming.
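The calling convention above can be sketched in a few lines. This is a
minimal illustration rather than 6100 code: the jms and ret helper
names and the Python list standing in for memory are assumptions made
for the example.

```python
MEM_WORDS = 4096  # 12 bit address space

def jms(memory, pc, target):
    """Call the subroutine at `target` from the instruction at `pc`.

    The return address is written into the first word of the subroutine
    itself, and execution continues at the word after it.
    """
    memory[target] = (pc + 1) % MEM_WORDS
    return (target + 1) % MEM_WORDS

def ret(memory, target):
    """Return by jumping indirectly through the subroutine's first word."""
    return memory[target]

memory = [0] * MEM_WORDS
pc = jms(memory, 0o200, 0o400)   # first call, made from address 0200
saved = ret(memory, 0o400)       # where the first caller would resume
jms(memory, 0o410, 0o400)        # a nested call to the same routine...
clobbered = ret(memory, 0o400)   # ...replaces the saved return address
```

Because there is only one slot per subroutine, the second call loses
the first return address, which is why recursion needs extra work.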
4K RAM was pretty much hopeless for general purpose use. The 6102
support chip (included on chip in the 6120) added 3 address lines,
expanding memory to 32K the same way that the PDP-8/E expanded the
PDP-8. Two registers, IFR and DFR, held the page for instructions and
data respectively (IFR always used until a data address was detected).
At the top of the 4K page, the PC wrapped back to 0, so the last
instruction on a page had to load a new value into the IFR if execution
was to continue.
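A rough sketch of how the extended address might be formed, assuming
the 3 bit page register simply supplies the top bits of a 15 bit
address. The function names are hypothetical and only illustrate the
description above.

```python
def physical_address(field, addr12):
    """Join a 3 bit page register (IFR or DFR) to a 12 bit address."""
    return ((field & 0o7) << 12) | (addr12 & 0o7777)

def next_pc(pc):
    """Advance the PC; it wraps inside the 4K page and leaves the IFR alone."""
    return (pc + 1) & 0o7777
```

With eight 4K pages this covers 32K words, and next_pc shows why code
running off the end of a page lands back at address 0 of the same page
unless the IFR is reloaded first.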
The PDP-8 itself was succeeded by the PDP-11 (though a version
called the PDP-12 was produced, it was part of the PDP-8 series, not a
replacement). The IMS 6120 was used in the DECmate (1980), DEC's
original competition for the IBM PC, which lacked the processor and RAM
capacity (a Z-80 or 8086 card could be added (reducing the 6120 to an
I/O coprocessor) but lacked IBM PC compatibility). DEC also tried
competing with the 8086 based Rainbow, and the PDP-11 based PRO-325
personal computers, but none caught on.
Intersil was eventually bought by Harris Semiconductors, which
produces versions of the 8088 and 8086, 1802, and 68HC05.
- PDP-8 Models and Options:
- http://www.cis.ohio-state.edu/hypertext/faq/usenet/dec-faq/pdp8-models/faq.html
- Harris Semiconductors:
- http://www.semi.harris.com/
Part VII: NOVA, another popular adaptation
Like the PDP-8, the Data General Nova was also copied, not just in
one, but two implementations - the Data General MN601 (MicroNova),
and Fairchild 9440 (used in the Nova 4 series). However, the NOVA (1969)
was a more mature design (by PDP-8 designer Edson DeCastro, who came to
Data General from DEC).
The NOVA had four 16-bit accumulators, AC0 to AC3. There were also
15-bit system registers - Program Counter, Stack pointer, and Stack
Frame pointer (the last two were only on the MicroNova and Nova 3, not
the original Nova or Fairchild CPU). AC2 and AC3 could be used for
indexed addresses. Apart from the small register set, the NOVA was an
ordinary CPU design.
Another CPU, the National Semiconductor PACE, was based on the NOVA
design, but featured 16 bit addressing, more addressing modes, and a 10
level stack (like the 8008), but lacked hardware multiply and divide.
The 32 bit ECLIPSE (pre 1983) was Data General's successor to the 16
bit Nova. Like the Nova, the ECLIPSE had four integer accumulators (now
32 bit), and added four stack registers and four 64 bit floating
point registers (in the MV series). There are twelve special purpose
registers. The ECLIPSE was eventually implemented in a microprocessor
form as well.
Data General later switched architectures and became an early
supporter of the Motorola 88K series load-store microprocessor in the
AViiON Unix based systems (designers originally wanted to call it the
Nova II, but that idea was rejected, so instead they reversed the name
and inserted the II in the middle, switching upper and lower case).
Unfortunately, Motorola didn't keep up with competing CPUs (eventually
switching its main support to the PowerPC), forcing Data General to
invest heavily in multiprocessing to boost performance, until the
company gave up on Motorola and switched to Intel Pentium CPUs (as
Intergraph did).
This has nothing to actually do with the Nova CPU, but is a little
bit interesting anyway.
- Data General:
- http://www.dg.com/
- Data General Nova:
- http://www.dg.com/about/html/data_general_nova.html
Part VIII: Signetics 2650, enhanced accumulator based (1978?)
Superficially similar to the PDP-8 (and IMS 6100), the Signetics
2650 was based around a set of 8 bit registers with R0 used as an
accumulator, and six other registers arranged in two sets (R1A-R3A and
R1B-R3B) - a status bit determined which register bank was active. The
other registers were generally used for address calculations (e.g.
offsets) within the 15 bit address range. This kept the instruction set
simple - all loads/stores to registers went through R0.
It also had a subroutine stack of eight 15 bit elements, with no
provision for spilling over into memory.
Signetics was bought by Valvo, which was later bought by
Philips.
Part IX: Signetics 8x300, Early cambrian DSP ancestor (1978)
Originally developed by a company called SMS, the 8x300 was bought
and became a product of Signetics. Presented as a microcontroller, it
had many DSP-like features (plus a bipolar fabrication) that made it
very fast at the time, for some applications at least, but lacked many
standard features and was slightly out of step with some conventions of
the time (for example, bits were numbered in reverse, bit 0 as MSB and
7 as LSB).
The 8x300 could address sixteen registers, but some register
addresses were used to specify non-register operations, so only
eight 8-bit general purpose registers were available - R0 (AUX, the
auxiliary register), R1 to R6, and R9. Register R8 was a single carry
bit. In addition, an 8-bit I/O buffer register IVB was available, and
is the only way to access data memory (similar to the D register in
the RCA 1802) - all data was through 8-bit I/O
ports, organised as two banks (left and right) of 256 each plus an
address indicating which port of that bank I/O operations would use.
The exact operation is specified by the source or destination
register field - if it's not an actual register, then it signifies an
operation. Ports could be attached directly to memory,
or two ports could be used to generate an address,
with another as a data buffer if more storage was needed.
The CPU consisted of multiple units strung together in a pipeline
(one instruction at a time, no overlapping stages as in modern CPUs).
The first operand could be taken from the IVB (as an I/O operation
from an I/O port in the left or right bank) or any of the general
registers, the second operand (if any) came from the AUX register. The
first operand would be processed through a rotate unit, then a mask
unit, to the ALU which performed only four operations - MOV, ADD, AND
and XOR. The result would be returned either to a general register, or
would be processed through a shifter and merge unit to the IVB register
(and output to the appropriate left or right I/O port) - this would
allow a subfield of the IVB to be replaced by bits from the result,
instead of the whole register.
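The data path described above (rotate, mask, ALU, then shift and merge)
can be sketched as a chain of small functions. This is only an
illustration under assumptions: the function names are hypothetical,
the real instruction field encodings are not modelled, and bits are
numbered in the conventional way (bit 0 as LSB) rather than in the
8x300's reversed order.

```python
def rotate_right(value, count):
    """Rotate an 8 bit value right by `count` places."""
    count %= 8
    return ((value >> count) | (value << (8 - count))) & 0xFF

def mask(value, length):
    """Keep only the low `length` bits (1 to 8) of the rotated operand."""
    return value & ((1 << length) - 1)

def alu(op, operand, aux):
    """The four ALU operations; AUX supplies the second operand."""
    if op == "MOV":
        return operand
    if op == "ADD":
        return (operand + aux) & 0xFF
    if op == "AND":
        return operand & aux
    if op == "XOR":
        return operand ^ aux
    raise ValueError(op)

def merge(ivb, result, shift, length):
    """Replace a `length` bit subfield of the IVB, starting at bit `shift`."""
    field = ((1 << length) - 1) << shift
    return (ivb & ~field & 0xFF) | ((result << shift) & field)
```

For example, merge(0xFF, 0b010, 2, 3) rewrites only bits 2 to 4 of the
buffer and leaves the other five bits untouched.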
The design was also limited with no interrupt support, no stack or
index registers (though the port addresses could function as such),
and no subroutine support (the XEC instruction would execute an
instruction without incrementing the PC, and could be used to implement
subroutines). Data values couldn't be accessed from program memory.
- IDaSS: ASIC lay-out of a 'Peripheral Control Cell' processor core:
- http://www.eb.ele.tue.nl/proj/idass8x3.html
Part X: Hitachi 6301 - Small and microcoded (1983)
The HD6301 was an 8-bit CPU designed using microcode to bring the
simpler design techniques of 16 and 32 bit CPUs at the time down to
8-bit designs. Inspired by the Motorola 6800, the
6301 featured A and B accumulators, one stack and one index register.
These, along with the PC, were mapped to a bank of sixteen 8-bit
registers (R0L, R0H, R1L etc. up to R7H), which along with two data
buffer registers (DBR and DBL), and memory address registers (MARL and
MARH), were accessed by the microcode to execute the CPU instructions
using one 8-bit ALU and a simpler 8-bit arithmetic unit. A simple
2-stage pipeline was used.
Part XI: Motorola MC14500B ICU, one bit at a time
Probably the limit in small processors was the 1 bit 14500B from
Motorola. It had a 4 bit instruction, and controlled a single data
read/write line, used for application control. It had no address bus -
that was an external unit that was added on. Another CPU could be used
to feed control instructions to the 14500B in an application.
It had only 16 pins, fewer than a typical RAM chip, and ran at 1
MHz.
Section Three: The Great Dark Cloud Falls:
IBM's Choice.
Part I: DEC PDP-11, benchmark for the first 16/32 bit generation. (1970)
The DEC PDP-11 was the most popular in the PDP (Programmed Data
Processors) line of minicomputers, a successor to the previously
popular PDP-8 (the PDP-8 continued for a while in certain applications,
while the PDP-10 (1967) was a higher capacity 36-bit mainframe-like
system (sixteen general registers and floating point operations), much
adored and rumoured to have souls), and remained in production until
the decision to discontinue the line as of September 30, 1997 (over 25
years - see note on the DEC Alpha intended lifetime). Many of the PDP-11
features have been carried forward to newer processors because the
PDP-11 was the basis for the C programming language, which became the
most prolific programming language in the world (in terms of variety of
applications, not number) and which includes several low level processor
dependent features which were useful to replicate in newer CPUs for this
reason.
The PDP-11 had eight general purpose 16-bit registers (R0 to R7 - R6
was also the SP and R7 was the PC). It featured powerful register
oriented (little-endian, byte addressable) addressing modes. Since the
PC was treated as a general purpose register, constants were loaded
using an autoincrement (post-increment indirect) mode on R7, which had
the effect of loading the 16 bit word following the current
instruction, then incrementing the PC to the next instruction before
fetching. The SP could be accessed the same way (and any register
could be used for a user stack (useful for FORTH)).
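The trick of loading constants through the PC can be shown with a small
sketch. It assumes a post-increment indirect read on a 16 bit register
file with little-endian byte memory; the autoincrement_read name and
the example addresses are made up for illustration.

```python
def autoincrement_read(regs, memory, r):
    """Read the word addressed by register r, then advance r by 2."""
    addr = regs[r]
    value = memory[addr] | (memory[(addr + 1) & 0xFFFF] << 8)  # little-endian
    regs[r] = (addr + 2) & 0xFFFF
    return value

regs = [0] * 8                       # R0 to R7 (R6 = SP, R7 = PC)
memory = bytearray(65536)

# The instruction word at 01000 has been fetched, so the PC already
# points at the constant stored in the following word.
memory[0o1002:0o1004] = (0x1234).to_bytes(2, "little")
regs[7] = 0o1002
constant = autoincrement_read(regs, memory, 7)   # PC now skips the constant
```

Calling the same function with register 6 pops a word from the stack,
and with any other register it pops from a user stack.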
A CC (or PSW) register held results from every instruction that
executed.
Adjacent registers could be implicitly grouped into a 32 bit
register for multiply and divide results (the multiply result is stored
in two registers if the destination is an even register, but not if
it's odd. The divide source must be grouped - the quotient is stored in
the high order (low number) register, the remainder in the low order).
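The register pairing rules can be sketched as follows. This is a
simplified model under assumptions: the mul and div names are
hypothetical, division is assumed to truncate toward zero, and overflow
and divide-by-zero handling are left out.

```python
def to_signed(value, bits):
    """Interpret an unsigned field as a two's complement number."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mul(regs, r, src):
    """Multiply register r by src; an even r receives the full 32 bits."""
    product = (to_signed(regs[r], 16) * to_signed(src, 16)) & 0xFFFFFFFF
    if r % 2 == 0:
        regs[r] = product >> 16          # high word in the even register
        regs[r + 1] = product & 0xFFFF   # low word in the next register
    else:
        regs[r] = product & 0xFFFF       # odd register: low word only

def div(regs, r, src):
    """Divide the 32 bit pair r, r+1 by src; r must be even."""
    if r % 2:
        raise ValueError("divide needs an even register pair")
    dividend = to_signed((regs[r] << 16) | regs[r + 1], 32)
    divisor = to_signed(src, 16)
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    remainder = dividend - quotient * divisor
    regs[r] = quotient & 0xFFFF          # quotient in the low numbered register
    regs[r + 1] = remainder & 0xFFFF     # remainder in the next register
```

Multiplying 300 by 400 into R0 leaves the 32 bit product split across
R0 and R1, and dividing that pair by 7 then leaves the quotient in R0
and the remainder in R1.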
A floating point unit could be added which contains six 64 bit
accumulators (AC0 to AC5, can also be used as six 32-bit registers -
values can only be loaded or stored using the first four registers).
PDP-11 addresses were 16 bits, limiting program space to 64K,
though an MMU could be used to expand total address space (18-bits and
22-bits in different PDP-11 versions).
The LSI-11 (1975-ish) was a popular microprocessor implementation
of the PDP-11 using the Western Digital MCP1600 microprogrammable CPU,
and the architecture influenced the Motorola 68000, NS 320xx, and Zilog
Z-8000 microprocessors in particular. There was also a 32-bit PDP-11
plan as far back as its 1969 introduction. The PDP-11 was finally
replaced by the VAX architecture (early versions included a PDP-11
emulation mode, and were called VAX-11).
- PDP-11 FAQ:
- http://www.village.org/pdp11/faq.pages/faq.html
Part II: TMS 9900, first of the 16 bits (June 1976)
One of the first true 16 bit microprocessors was the TMS 9900, by
Texas Instruments (the first are probably National Semiconductor IMP-16
or AMD 2901 bit slice processors in 16 bit configuration). It was
designed as a single chip version of the TI 990 minicomputer series,
much like the Intersil 6100 was a single chip PDP-8, and the Fairchild
9440 and Data General mN601 were both one chip versions of Data
General's Nova. Unlike the IMS 6100, however, the TMS 9900 had a
mature, well thought out design.
It had a 15 bit address space and two internal 16 bit registers.
One unique feature, though, was that all user registers were actually
kept in memory - this included stack pointers and the program counter.
A single workspace register pointed to the 16 register set in RAM, so
when a subroutine was entered or an interrupt was processed, only the
single workspace register had to be changed - unlike some CPUs which
required a dozen or more register saves before acknowledging a context
switch.
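A small sketch of the workspace idea follows. It assumes registers are
16 bit big-endian words stored at workspace pointer + 2 x register
number in byte-addressed memory, with an even workspace pointer. The
class name and the example addresses are hypothetical.

```python
class WorkspaceSketch:
    """Registers R0 to R15 live in RAM at the workspace pointer."""

    def __init__(self):
        self.memory = bytearray(65536)
        self.wp = 0                      # workspace pointer (assumed even)

    def _addr(self, n):
        return (self.wp + 2 * n) & 0xFFFF

    def read_reg(self, n):
        a = self._addr(n)
        return int.from_bytes(self.memory[a:a + 2], "big")

    def write_reg(self, n, value):
        a = self._addr(n)
        self.memory[a:a + 2] = (value & 0xFFFF).to_bytes(2, "big")

cpu = WorkspaceSketch()
cpu.wp = 0x8300
cpu.write_reg(1, 111)    # R1 of the first context
cpu.wp = 0x8320          # "context switch": only the pointer changes
cpu.write_reg(1, 222)    # R1 of the second context, a different RAM word
cpu.wp = 0x8300          # switch back; the first context's R1 is intact
```

No register contents are copied at any point, which is the whole
attraction of the scheme when RAM is as fast as the CPU.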
This was feasible at the time because RAM was often faster than the
CPUs. A few modern designs, such as the INMOS Transputers, use this
same design using caches or rotating buffers, for the same reason of
improved context switches. Other chips of the time, such as the 650x
series had a similar philosophy, using index registers, but the TMS
9900 went the farthest in this direction. Later versions added a
write-through register buffer/cache.
That wasn't the only positive feature of the chip. It had good
interrupt handling features and a very good instruction set. Serial I/O
was available through address lines. In typical comparisons with the
Intel 8086, the TMS9900 had smaller and faster programs. The only
disadvantages were the small address space and need for fast RAM.
Despite the very poor support from Texas Instruments, the TMS 9900
had the potential at one point to surpass the 8086 in popularity. TI
also produced an embedded version, the TMS 9940.
- TMS9900 Information Sheet:
- http://www.nashscene.com/~jmoses/contemplate/technico/9900_datasheet.html
Part III: Zilog Z-8000, another direct competitor
The Z-8000 was introduced not long after the 8086, but had superior
features. It was basically a 16 bit processor, but could address up to
23 bits in some versions by using segment registers (to supply the
upper 7 bits). There was also an unsegmented version, but both could be
extended further with an additional MMU that used 64 segment registers.
The Z-8070 was a memory mapped FPU.
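As a rough illustration of the 23 bit addressing described above, the
sketch below assumes the 7 bit segment number is simply placed above a
16 bit offset (with no external MMU involved). The function name is
hypothetical.

```python
def segmented_address(segment, offset):
    """Place a 7 bit segment number above a 16 bit offset."""
    return ((segment & 0x7F) << 16) | (offset & 0xFFFF)
```

Under this assumption the largest address is segment 0x7F with offset
0xFFFF, which is 2**23 - 1, and arithmetic on the 16 bit offset alone
never carries into the segment number.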
Internally, the Z-8000 had sixteen 16 bit registers, but register
size and use were exceedingly flexible - the first eight Z-8000
registers could be used as sixteen 8 bit subregisters (identified RH0,
RL0, RH1 ...), or all sixteen could be grouped into eight 32 bit
registers (RR0, RR2, RR4 ...), or four 64 bit registers. They were all
general purpose registers - the stack pointer was typically register
15, with register 14 holding the stack segment (both accessed as one 32
bit register (RR14) for painless address calculations). The instruction
set included 32-bit multiply (into 64 bits) and divide.
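The overlapping register views can be sketched as different ways of
reading one array of sixteen 16 bit words. The class and method names
are hypothetical, and the sketch assumes the lower numbered register
holds the more significant half of a pair.

```python
class Z8000RegsSketch:
    def __init__(self):
        self.r = [0] * 16                # R0 to R15, 16 bits each

    def rh(self, n):
        """High byte of R0 to R7 (RH0 to RH7)."""
        return self.r[n] >> 8

    def rl(self, n):
        """Low byte of R0 to R7 (RL0 to RL7)."""
        return self.r[n] & 0xFF

    def rr(self, n):
        """32 bit pair starting at an even register (RR0, RR2, ...)."""
        return (self.r[n] << 16) | self.r[n + 1]

    def set_rr(self, n, value):
        self.r[n] = (value >> 16) & 0xFFFF
        self.r[n + 1] = value & 0xFFFF

    def rq(self, n):
        """64 bit group starting at a register number divisible by 4."""
        return (self.rr(n) << 32) | self.rr(n + 2)
```

Writing RR14 through set_rr updates R14 and R15 together, which is how
a stack segment and stack offset can be handled as one 32 bit value.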
The Z-8000 was one of the first to feature two modes, one for the
operating system and one for user programs. The user mode prevented the
user from messing about with interrupt handling and other potentially
dangerous stuff (each mode had its own stack register).
Finally, like the Z-80, the Z-8000 featured automatic RAM refresh
circuitry. Unfortunately the processor was somewhat slow, but the
features generally made up for that.
A later version, the Z-80000, was introduced at about the beginning
of 1986, at about the same time as the 32 bit MC68020 and Intel 80386
CPUs, though the Z-80000 was quite a bit more advanced. It was fully
expanded to 32 bits internally, giving it sixteen 32 bit physical
registers (the 16 bit registers became subregisters), doubling the
number of 32 bit and 64 bit registers (sixteen 8-bit and 16-bit
subregisters, 32-bit physical registers, eight 64-bit double registers).
The system stack remained in RR14.
In addition to the addressing modes of the Z-8000, larger 24 bit
(16MB) segment addressing was added, as well as an integrated MMU
(absent in the 68020 but added later in the 68030) which included an on
chip 16 line 256-byte fully associative write-through cache (which could
be set to cache only data, instructions, or both, and could also be
frozen by software once 'primed' - also found on later versions of the
AMD 29K). It also featured multiprocessor
support by defining some memory pages to be exclusive and others to be
shared (and non-cacheable), with separate memory signals for each
(including GREQ (Global memory REQuest) and GACK lines). There was also
support for coprocessors, which would monitor the data bus and identify
instructions meant for them (the CPU had two coprocessor control lines
(one in, one out), and would produce any needed bus transactions).
Finally, the Z-80000 was fully pipelined (six stages), while the
fully pipelined 80486 and 68040 weren't introduced until 1989 and 1991
respectively.
But despite being technically advanced, the Z-8000 and Z-80000
series never met mainstream acceptance, due to initial bugs in the
Z-8000 (the complex design did not use microcode - it used only 17,500
transistors) and to delays in the Z-80000. There was a radiation
resistant military version, and a CMOS
version of the Z-80000 (the Z-320). Zilog eventually gave up and became
a second source for the AT&T WE32000 32-bit (1986) CPU instead (a
VAX-like microprocessor derived from the Bellmac 32A
minicomputer, which also became obsolete).
The Z-8001 was used for Commodore's CBM 900 prototype, but the
Unix based machine was never released - instead, Commodore bought
Amiga, and released the 68000 based machine it was designing. A few
companies did produce Z-8000 based computers, with Olivetti being the
most famous, and the Plexus P40 being the last - the 68000 quickly
became the processor of choice.
Part IV: Motorola 68000, a refined 16/32 bit CPU (September 1979)
The initial 8MHz 68000 was actually a 32 bit architecture
internally, but had only a 16 bit data bus and 24 bit address bus to fit
in a 64 pin package (address and data shared a bus in the 40 pin
packages of the 8086 and Z-8000). Later the 68008 reduced the data bus
to 8 bits and address to 20 bits, and the 68020 was fully 32 bit
externally. Addresses were computed as 32 bits (without using segment
registers) - unused upper bits in the 68000 or 68008 were ignored, but
some programmers stored type tags in the upper 8 bits, causing
compatibility problems with the 68020's 32 bit addresses. Lack of
forced segments made programming the 68000 easier
than some competing processors, without the 64K size limit on directly
accessed arrays or data structures.
Looking back it was a logical design decision, since most 8 bit
processors featured direct 16 bit addressing without segments.
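The tagged pointer problem can be illustrated with a short sketch. The
masks model only which address bits reach the bus (24 on the 68000, all
32 on the 68020). The names and the example tag value are made up for
illustration.

```python
ADDRESS_MASK_68000 = 0x00FFFFFF   # only 24 address bits leave the chip
ADDRESS_MASK_68020 = 0xFFFFFFFF   # all 32 bits are significant

def tag_pointer(addr, tag):
    """Hide an 8 bit type tag in the unused top byte of a pointer."""
    return ((tag & 0xFF) << 24) | (addr & 0x00FFFFFF)

def bus_address(pointer, mask):
    """The address the CPU actually puts on the bus."""
    return pointer & mask

p = tag_pointer(0x001234, 0x80)
on_68000 = bus_address(p, ADDRESS_MASK_68000)   # tag silently dropped
on_68020 = bus_address(p, ADDRESS_MASK_68020)   # tag becomes part of the address
```

On the 68000 the tagged pointer still reaches address 0x001234, while
on the 68020 the same value refers to a completely different location.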
The 68000 had sixteen 32-bit registers, split into eight data and
eight address registers. One address register was reserved for the
Stack Pointer.
Data registers could be used for any operation, including offset from
an address register, but not as the source of an address itself.
Operations on address registers were limited to move, add/subtract, or
load effective address.
Like the Z-8000, the 68000 featured a supervisor and user mode (each
with its own Stack Pointer). The Z-8000 and 68000 were similar in
capabilities, but the 68000 was 32 bit internally (using 16 bit ALUs -
two in parallel for 32-bit data, one for addresses - which made some
32-bit operations slower than 16-bit), making it faster and eliminating
forced segments. It was designed for
expansion, including specifications for floating point and string
operations (floating point was added in the 68040 (1991), with eight 80
bit floating point registers compatible with the 68881/2 coprocessors).
Like many other CPUs of the time, the 68000 could fetch the next
instruction during execution (a 2 stage pipeline).
The 68010 (1982) added virtual memory support (the 68000 couldn't
restart interrupted instructions) and a special loop mode - small
decrement-and-branch loops could be executed from the instruction
fetch buffer. The 68020 (1984) expanded the external data and address
busses to 32 bits, and added a simple 3-stage pipeline and a 256 byte
cache (loop buffer), while the 68030 (1987) brought the MMU onto the
chip (it supported two level pages (logical, physical) rather than the
segment/page mapping of the Intel 80386 and IBM S/360 mainframe). The
68040 (January 1991) added fully cached Harvard busses (4K each for
data and instructions), a 6 stage pipeline, and an on chip FPU.
The 68060 (April 1994) expanded the design to a superscalar
version, like the Intel Pentium and NS320xx (Swordfish) series before
it.
Like the National Semiconductor Swordfish, and later the Nx586, AMD K5,
and Intel's "Pentium Pro", the third stage of the 10-stage 68060
pipeline translates the 680x0 instructions to a decoded RISC-like form
(stored in a 16 entry buffer in stage four). There is also a branch
cache, and branches are folded into the decoded instruction stream like
the AT&T Hobbit and other more recent processors, then dispatched to
two pipelines (three stages: Decode, addr gen, operand fetch) and
finally to two of three execution units (2 integer, 1 floating point)
before reaching two 'writeback' stages. Cache sizes are doubled over
the 68040.
The 68060 also includes many innovative power-saving features
(3.3V operation, execution unit pipelines could actually be shut down,
reducing power consumption at the expense of slower execution, and the
clock could be reduced to zero) so power use is lower than the 68040
(3.9-4.9 watts vs. 4-6 for the 68040). Another innovation is that
simple register-register instructions which don't generate addresses
may use the address stage ALU to execute 2 cycles early.
The embedded market became the main market for the 680x0 series
after workstation vendors (and the Apple Macintosh) turned to faster
load-store processors, so a variety of embedded versions were introduced.
Later, Motorola designed a successor called Coldfire (early 1995), in
which complex instructions and addressing modes (added to the 68020)
were removed and the instruction set was recoded,
simplifying it at the expense of compatibility (source only, not binary)
with the 680x0 line.
The Coldfire 52xx (version 2 - the 51xx version 1 was a
68040-based/compatible core) architecture resembles a stripped (single
pipeline) 68060. The 5 stage pipeline is literally folded over itself -
after two fetch
stages and a 12-byte buffer, instructions pass through the decode and
address generate stages, then loop back so the decode becomes the
operand fetch stage, and the address
generate becomes the execute stage (so only one ALU is required for
address and execution calculations). Simple (non-memory) instructions
don't need to loop back. There is no translator stage as in the 68060
because Coldfire instructions are already in RISC-like form.
The 53xx added a multiply-accumulate (MAC) unit and internal clock
doubling.
The 54xx adds branch and assignment folding with other instructions for
a cheap form of superscalar execution with little added complexity, and
uses a Harvard architecture for faster memory access, plus
enhancements to the instruction set to improve code density and
performance, and to add flexibility to the MAC unit.
At a quarter the physical size and a fraction of the power
consumption, Coldfire is about as fast as a 68040 at the same clock rate,
but the smaller design allows a faster clock rate to be achieved.
Few people wonder why Apple chose the Motorola 68000 for the
Macintosh, while IBM's decision to use Intel's 8088 for the IBM PC has
baffled many. It wasn't a straightforward decision though. The Apple
Lisa was the predecessor to the Macintosh, and also used a 68000
(eventually - 8086 and slower bitslice CPUs (which Steve Wozniak thought
were neat) were initially considered before the 68000 was available).
It also included a fully multitasking, GUI based operating system,
highly integrated software, high capacity (but incompatible) 'twiggy'
5 1/4" disk drives, and a large workstation-like monitor. It was better
than the Macintosh in almost every way, but was correspondingly more
expensive.
The Macintosh was to include the best features of the Lisa, but at
an affordable price - in fact the original Macintosh came with only 128K
of RAM and no expansion slots. Cost was such a factor that the 8 bit
Motorola 6809 was the original design choice, and
some prototypes were built, but they quickly realised that it didn't
have the power for a GUI based OS, and they used the Lisa's 68000,
borrowing some of the Lisa low level functions (such as graphics toolkit
routines) for the Macintosh.
Competing personal computers such as the Amiga and Atari ST, and
early workstations by Sun, Apollo, NeXT and most others also used 680x0
CPUs (including one of the earliest workstations, the Tandy TRS-80
Model 16, which used a 68000 CPU and Z-80
for I/O and VM support).
- Motorola:
- http://www.mot.com/
- Motorola Microprocessors:
- http://www.mot.com/SPS/MMTG/mp.html
- Absolute Mac US History:
- http://www.absolutemac.com/US/HISTOIRE/history.html
- Amiga International:
- http://www.amiga.de/
- Atari compatible Milan computers:
- http://www.milan-computer.de/html_gb/produkt.html
- Atari compatible Medusa computers:
- http://www.stud.ee.ethz.ch/~caschwan/medusa.html
- Atari compatible C-Lab computers:
- http://www.ataricentral.com/st/c-lab/mkx.html
Part V: National Semiconductor 32032, similar but different
Like the 68000, the 320xx family consisted of a CPU which was
32-bit internally, and 16 bits externally (later also 32 and 8),
as indicated by the first and last two digits (originally reversed, but
16032 just seemed less impressive). It appeared a little later than
the others here, and so was not really a choice for the IBM PC, but is
still representative of the era.
Elegance and regular design were main goals of this processor, as
well as completeness. It was similar to the 68000 in basic features,
such as byte addressing, 24-bit address bus in the first version, memory
to memory instructions, and so on (the 320xx also includes a string and
array instruction). Unlike the 68000, the 320xx had eight instead of
sixteen 32-bit registers, and they were all general purpose, not split into
data and address registers. There was also a useful scaled-index
addressing mode, and unlike other CPUs of the time, only a few
operations affected the condition codes (as in more modern CPUs).
Also different, the PC and stack registers were separate from the
general register set - they were special purpose registers, along with
the interrupt stack, and several "base registers" to provide
multitasking support - the base data register pointed to the working
memory of the current module (or process), the interrupt base register
pointed to a table of interrupt handling procedures anywhere in memory
(rather than a fixed location), and the module register pointed to a
table of active modules.
The 320xx also had a coprocessor bus, similar to the 8-bit Ferranti
F100-L CPU, and coprocessor instructions. Coprocessors included an MMU,
and a Floating Point unit which included eight 32-bit registers, which
could be used as four 64-bit registers.
The series found use mainly in embedded applications, and was
expanded to that end, with timers, graphics enhancements, and even a
Digital Signal Processor unit in the Swordfish version (1991, also
known as 32732 and 32764). The Swordfish was among the first truly
superscalar microprocessors, with two 5-stage pipelines (integer A, and
B, which consisted of an integer and floating point pipeline - an
instruction dispatched to B would execute in the appropriate pipe,
leaving the other with an empty slot. The integer pipe could cycle
twice in the memory stage to synchronise with the result of the
floating point pipe, to ensure in-order completion when floating point
operations could trap. B could also execute branches).
This strategy was influenced by the Multiflow VLIW design.
Instructions were always fetched two at a time from the instruction
cache which partially decoded the instruction pairs and set a bit to
indicate whether they were dependent or could be issued simultaneously
(effectively generating two-word VLIWs in the
cache from an external stream of instructions). The cache decoder also
generated branch target addresses to reduce branch latency as in the
AT&T CRISP/Hobbit CPU.
The Swordfish implemented the NS32K instruction set using a
reduced instruction core - NS32K instructions were translated by the
cache decoder into either: one internal instruction, a pair of internal
instructions in the cache, or a partially decoded NS32K instruction
which would be fully decoded into internal instructions after being
fetched by the CPU. The Swordfish also had
dynamic bus resizing (8, 16, 32, or 64 bits, allowing 2 instructions to
be fetched at once) and clock doubling, 2 DMA channels, and in circuit
emulation (ICE) support for debugging.
The Swordfish was later simplified into a load-store design and
used to implement an instruction set called CompactRISC (also known as
Piranha, an implementation independent instruction set supporting
designs from 8 to 64 bits).
It seems interesting to note that in the case of the NS320xx and
Z-80000, non mainstream processors gained many advanced design features
well ahead of the more mainstream processors, which presumably had more
development resources available. One possible reason for this is the
greater importance of compatibility in processors used for computers
and workstations, which limits the freedom of the designers. Or perhaps
the non-mainstream processors were just more flexible designs to begin
with. Or some might not have made it to the mainstream because the more
ambitious designs resulted in more implementation bugs than competitors.
- National Semiconductor - CompactRISC:
-
Part VI: MIL-STD-1750 - Military artificial intelligence (February 1979)
The USAF created a draft standard for a 16-bit microprocessor meant
to be used in all airborne computers and weapons systems, allowing
software developed for one such system to be portable to other similar
applications, similar to the intent behind the creation of Ada as the
standard high level programming language for the U.S. Department of
Defense (MIL-STD-1815 accepted October 1979 - 1815 was the year
Ada Augusta, Countess of Lovelace and the world's first programmer, was
born).
Like other 16 bit designs of the time, 1750 was inspired by
the PDP-11, but differs significantly.
Sixteen 16-bit registers were specified, and any adjacent pairs (such
as R0+R1 or R1+R2) could be used as 32-bit registers (the Z-8000 and
PDP-11 could only use even pairs, and the PDP-11 only for specific
uses) for integer or floating point (FP) values
(no separate FP registers), or triples for 48-bit extended precision FP
(with the mantissa concatenated after the exponent - eg. 32-bit FP was
[1s][23mantissa][8exp], 48-bit was [1s][23mantissa][8exp][16ext],
meaning any 48-bit FP was also a valid 32-bit FP, only losing the
extra precision). Also, only the upper four registers (R12 to R15)
could be used as an address base (2 instruction bits instead of 4), and
R0 can't be used as an index (using R0 implies no indexing, similar
to the PowerPC). R15 is used as an implicit
stack pointer, and the program counter is not user accessible.
Address space is 16 bit word addressed (not bytes), but the design
allows for an MMU to extend this to 20 bits. In addition, program and
data memory can be separated using the MMU. A 4-bit Address State
field in the processor status word (PSW) selects one of sixteen page
groups, each containing sixteen registers for data memory and another
sixteen for program memory (16x16x2 = 512 total). The top 4 bits of an
address select a register from the current AS group, which provides
the upper 8 bits of a 20 bit address.
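The page register lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name translate, the nested-list layout of the page registers, and the sample register contents are all hypothetical, and the remaining 12 bits of the logical address are assumed to pass through unchanged as the offset within a page.

```python
PAGE_BITS = 12        # assumed: low 12 bits of the logical address are the page offset
GROUPS = 16           # page groups selected by the 4-bit Address State field
REGS_PER_GROUP = 16   # page registers per group, selected by the top 4 address bits

def translate(page_regs, address_state, logical, is_instruction=False):
    """Map a 16-bit logical word address to a 20-bit physical word address."""
    if not 0 <= address_state < GROUPS:
        raise ValueError("Address State must fit in 4 bits")
    if not 0 <= logical <= 0xFFFF:
        raise ValueError("logical address must fit in 16 bits")
    bank = 1 if is_instruction else 0           # separate program and data registers
    reg_index = logical >> PAGE_BITS            # top 4 bits pick the page register
    page = page_regs[bank][address_state][reg_index] & 0xFF   # upper 8 physical bits
    return (page << PAGE_BITS) | (logical & 0xFFF)

# Hypothetical register file: 2 banks x 16 groups x 16 registers = 512 entries.
page_regs = [[[0] * REGS_PER_GROUP for _ in range(GROUPS)] for _ in range(2)]
page_regs[0][3][0xA] = 0x5C   # data page register 10 of group 3
page_regs[1][3][0xA] = 0x21   # program page register 10 of group 3

print(hex(translate(page_regs, 3, 0xA123)))                       # 0x5c123
print(hex(translate(page_regs, 3, 0xA123, is_instruction=True)))  # 0x21123
```

The same logical address lands in different physical pages depending on whether it is a data or an instruction access, which is how program and data memory can be separated.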
Each page register also has a 4-bit access key. While other CPUs at
the time provided user and supervisor modes, the 1750 provided for
sixteen modes, from supervisor (mode 0, could access all pages),
fourteen user modes (1 to 14 can only access pages with the same key, or
key 15), and an unprivileged mode (mode 15 can only access pages with
key 15). Program memory can occupy the same logical address space as
data, but will select from the program page registers. Pages can also
be write or execute protected.
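As a rough sketch of the key matching rule just described (ignoring the separate write and execute protection bits), the check can be written as a small Python function. The name may_access is hypothetical and is not taken from the standard.

```python
def may_access(mode, page_key):
    """Return True if a task running in `mode` may touch a page with `page_key`."""
    if not (0 <= mode <= 15 and 0 <= page_key <= 15):
        raise ValueError("mode and key are 4-bit values")
    if mode == 0:                 # supervisor: every page is accessible
        return True
    if mode == 15:                # unprivileged: only pages with key 15
        return page_key == 15
    # user modes 1 to 14: own pages, plus the shared key-15 pages
    return page_key == mode or page_key == 15

print(may_access(0, 9), may_access(5, 5), may_access(5, 15), may_access(5, 6))
# True True True False
```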
Several I/O instructions are also included, and are used to access
processor state registers.
The 1750 is a very practical 16 bit design, and is still being
produced, mainly in expensive radiation resistant forms. It did not
achieve widespread acceptance, likely because of the rapid advance of
technology and the rise of the RISC paradigm.
- CPU Tech:
- http://www.cputech.com/
- CPU Tech MIL-STD-1750A Cores:
-
http://www.cputech.com/milstd.htm
- Public Ada Library (PAL):
-
http://wuarchive.wustl.edu/languages/ada/
- Origin of Ada:
-
http://wuarchive.wustl.edu/languages/ada/ajpo/pol-hist/history/holwg-93/1.htm
Part VII: Intel 8086, IBM's choice (1978)
The Intel 8086 was based on the design of the 8080/8085 (source
compatible with the 8080) with a similar register set, but was expanded
to 16 bits. The Bus Interface Unit fed the instruction stream to the
Execution Unit through a 6 byte prefetch queue, so fetch and execution
were concurrent - a primitive form of pipelining (8086 instructions
varied from 1 to 6 bytes, not counting prefixes).
It featured four 16 bit general registers, which could also be
accessed as eight 8 bit registers, and four 16 bit index registers
(including the stack pointer). The data registers were often used
implicitly by instructions, complicating register allocation for
temporary values. It featured 64K 8-bit I/O (or 32K 16-bit) ports and
fixed vectored interrupts. There were also four
segment registers that
could be set from index registers.
The segment registers allowed the CPU to access 1 megabyte of memory
through an odd process. Rather than just supplying missing bytes, as
most segmented processors do, the 8086 actually added the segment
register (X 16, or shifted left 4 bits) to the address. As a strange
result of this unsuccessful attempt at extending the address space
without adding address bits, it was possible to have two pointers
with the same value point to two different memory locations, or two
pointers with different values pointing to the same location, and
typical data structures were limited to less than 64K. Most
people consider this a brain damaged design (a better method might
have been that developed for the MIL-STD-1750 MMU).
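The aliasing is easy to see with a little arithmetic. The following Python sketch (the function name physical is just for illustration) forms a 20-bit address the way the paragraph describes, by shifting the segment left 4 bits and adding the 16-bit offset, wrapping at 1 megabyte as a CPU with 20 address lines would.

```python
def physical(segment, offset):
    """20-bit physical address from a 16-bit segment and a 16-bit offset."""
    return ((segment << 4) + offset) & 0xFFFFF   # wrap at 1 megabyte

# Two different segment:offset pointers naming the same byte:
print(hex(physical(0x1234, 0x0005)))   # 0x12345
print(hex(physical(0x1000, 0x2345)))   # 0x12345

# The same 16-bit offset naming different bytes under different segments:
print(hex(physical(0x0000, 0x0010)))   # 0x10
print(hex(physical(0x0001, 0x0010)))   # 0x20
```

Since the offset is only 16 bits, any single object addressed through one segment value is confined to a 64K range, which is where the data structure limit comes from.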
Although this was largely acceptable for assembly language, where
control of the segments was complete (it could even be useful then), in
higher level languages it caused constant confusion (ex. near/far
pointers). Even worse, this made expanding the address space to more
than 1 meg difficult. The 80286 (1982) expanded the address space to 24
bits only by adding a new mode (switching from 'Real' to 'Protected' mode
was supported, but switching back required using a bug in the original
80286, which then had to be preserved) which greatly increased the
number of segments by using a 16 bit selector for a 'segment
descriptor', which contained the location within a 24 bit address space,
size (still less than 64K), and attributes (for Virtual Memory support)
of a segment.
But all memory access was still restricted to 64K segments until the
80386 (1985), which included much improved addressing:
base reg + index reg * scale (1, 2, 4 or 8) + displacement (8 or
32 bit constant) = 32 bit address, in the form of paged segments (using
six 16-bit segment registers, like the IBM S/360
series, and unlike the Motorola 68030).
It also had
several processor modes (including separate paged and segmented modes)
for compatibility with the previous awkward design. In fact, with
the right assembler, code written for the 8008
can still be run on the most recent Pentium Pro. The 80386 also added an
MMU, security modes (called "rings" of privilege - kernel, system
services, application services, applications) and new op codes in a
fashion similar to the
Z-80 (and Z-280).
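The 80386 address calculation mentioned above (base plus scaled index plus displacement) can be written out as a short Python sketch. The function and parameter names are hypothetical, and the result is only the 32 bit offset within a segment; segment and paging translation are left out.

```python
def effective_address(base=0, index=0, scale=1, disp=0):
    """Offset = base + index * scale + displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF

# Fourth element (index 3) of an array of 4-byte items that starts
# 8 bytes into a structure located at 0x1000:
print(hex(effective_address(base=0x1000, index=3, scale=4, disp=8)))   # 0x1014

# A negative displacement, as used for locals below a frame pointer:
print(hex(effective_address(base=0x1000, disp=-4)))                    # 0xffc
```

The scale factors match the common element sizes, so array indexing needs no separate multiply or shift instruction.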
The 80486 (1989) added full pipelines, single on chip 8K cache,
integrated FPU (based on the eight element 80-bit stack-oriented FPU in
the 80387 FPU), and clock doubling versions (like the
Z-280).
The Pentium (late 1993) was superscalar (up to two instructions at once
in dual integer units and single FPU) with separate 8K I/D caches.
The Pentium was the name Intel gave the 80586 version because it
could not legally protect the name "586" to prevent other companies
from using it - and in fact, the Pentium compatible CPU from NexGen
is called the Nx586 (early 1995). Due to its popularity, the 80x86 line
has been the most widely cloned processor family, from the NEC V20/V30
(slightly faster clones of the 8088/8086 (could also run 8085 code)),
AMD and Cyrix clones of the
80386 and 80486, to versions of the Pentium within less than two years
of its introduction.
MMX (initially reported as MultiMedia eXtension, but later said by
Intel to mean Matrix Math eXtension) is very similar to the earlier
SPARC VIS or
HP-PA MAX,
or later MIPS MDMX
instructions - they perform
integer operations on vectors of 8, 16, or 32 bit words, using the 80 bit
FPU stack elements as eight 64 bit registers (switching between FPU and
MMX modes as needed - it's very difficult to use them as a stack and as
MMX registers at the same time). The P55C Pentium version (January 1997)
is the first Intel CPU to include MMX instructions, followed by the AMD
K6, and Pentium II. Cyrix also added these instructions in
its M2 CPU (6x86MX, June 1997), as well as IDT with its C6.
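To give a feel for what operating on "vectors of 8, 16, or 32 bit words" packed in one 64 bit register means, here is a minimal Python sketch of a packed add with wraparound in each lane. It is an illustration of the idea only, not a model of any particular MMX opcode, and the name packed_add is made up for the example.

```python
def packed_add(a, b, lane_bits):
    """Add two 64-bit values lane by lane; carries never cross a lane boundary."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lanes are 8, 16 or 32 bits wide")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane = ((a >> shift) + (b >> shift)) & mask   # wraparound within the lane
        result |= lane << shift
    return result

a = 0x00FF00FF00FF00FF
b = 0x0001000100010001
print(hex(packed_add(a, b, 8)))    # 0x0 (each 0xFF + 0x01 wraps to 0x00)
print(hex(packed_add(a, b, 16)))   # 0x100010001000100
```

The same 64 bits give different answers depending on the lane width, which is why each packed instruction has to name the element size it works on.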
Interestingly, the old architecture is such a barrier
to improvements that most of the Pentium compatible CPUs (NexGen
Nx586/Nx686, AMD K5, IDT-C6), and even the "Pentium Pro"
(Pentium's successor, late 1995) don't clone the
Pentium, but emulate it with specialized hardware decoders like those
introduced in the
National Semiconductor Swordfish, which convert
Pentium instructions to RISC-like instructions which are executed on
specially designed superscalar RISC-style cores faster than the Pentium
itself. Intel also
used BiCMOS in the Pentium and Pentium Pro to achieve clock rates
competitive with CMOS load-store processors (the Pentium P55C (early 1997)
version is a pure CMOS design).
IBM had been developing hardware to translate Pentium
instructions for the PowerPC
in a similar manner as part of the PowerPC 615 CPU
(able to switch between the 80x86, 32-bit PowerPC and 64-bit PowerPC
instruction sets in five cycles (to drain the execution pipeline)),
but the project was killed after significant development for marketing
reasons. Rumour has it that engineers who worked on the project went
on to Transmeta corporation to work on a VLIW
processor which can
dynamically translate a foreign (expected to be 80x86, but possibly
changeable on the fly to implement others such as the
Java Virtual Machine) instruction set into a cache,
with the effects of reducing the decode complexity compared to Intel,
AMD, and other translation designs, while taking advantage of a VLIW
design in a way that also hides the traditional VLIW disadvantages.
The eventual product is expected to be fabricated and sold by IBM
Microelectronics.
The Cyrix 6x86 (early 1996), initially manufactured by IBM before
Cyrix merged with National Semiconductor, still directly executes 80x86
instructions (in two integer and one FPU pipeline), but partly
out of order, making it faster than a
Pentium at the same clock speed. Cyrix also sells an integrated
version with graphics and
audio on-chip called the MediaGX. MMX instructions were added to the
6x86MX, and 3DNow! graphics instructions to the 6x86MXi. The M3 (mid 1998)
turned to superpipelining (eleven stages compared to six (seven?) for the M2)
for a higher clock rate (partly for marketing purposes, as MHz is
often preferred to performance in the PC market), and provides dual
floating point/MMX/3DNow! units. Cyrix was purchased by PC chipset maker
Via.
The Pentium Pro (P6 execution core) is a 1 or 2-chip (CPU plus 256K
or 512K L2 cache - I/D L1 cache (8K each) is on the CPU), 14-stage
superpipelined processor. It uses extensive multiple branch prediction
and speculative execution
via register renaming.
Three decoders (one for
complex instructions, two for simpler ones (four or fewer micro-ops))
each decode one 80x86 instruction into micro-ops (one per simple
decoder + up to four from the complex decoder = three to six per
cycle). Up to five (usually three) micro-ops can be issued in parallel
and out of order (six units - FPU, 2 integer,
2 address, 1
load/store), but are held and retired (results written to registers
or memory) as a group to prevent an inconsistent state (equivalent to
half an instruction being executed when an interrupt occurs, for
example). 80x86 instructions may produce several micro-ops in CPUs like
this (and the Nx586 and AMD K5), so the actual instruction rate is
lower. In fact, due to problems handling instruction alignment in the
Pentium Pro, emulated 16-bit instructions execute slower than on a
Pentium. The Pentium II (April 1997) added MMX instructions to the P6
core, doubled cache to 32K, and was packaged in a processor
card instead of an IC package. The Pentium III added Streaming SIMD
Extensions (SSE) to the P6 core, which included eight 128-bit registers
which could be used as vectors of four 32-bit integer or floating point
values (like the PowerPC AltiVec
extensions, but with fewer operations or data types). Unlike MMX (and
like AltiVec), the SSE registers need to be saved separately during
context switches, requiring OS modifications.
The AMD K5 translates 80x86 code to ROPs (RISC OPerations), which
execute on a RISC-style core based on the unproduced superscalar
AMD 29K.
Up to four ROPs can be dispatched to six units (two integer, one FPU, two
load/store, one branch unit), and five can be retired at a time. The
complexity led to low clock speeds for the K5, prompting AMD to buy
NexGen and integrate its designs for the next generation K6.
The NexGen/AMD Nx586 (early 1995) is unique by being able to execute
its micro-ops (called RISC86 code) directly, allowing optimised RISC86
programs to be written which are faster than an equivalent x86 program
would be, but this feature is seldom used. It also features two 16K I/D
L1 caches, a dedicated L2 cache bus (like that in the Pentium Pro 2-chip
module) and an off-chip FPU (either separate chip, or later as in 2-chip
module).
The Nx586 successor, the K6 (April 1997), actually has three
caches - 32K each for data and instructions, and a half-size 16K cache
containing instruction decode information. It also brings the FPU
on-chip and eliminates the dedicated cache bus of the Nx586, allowing it
to be pin-compatible with the P54C model Pentium. Another decoder is
added (two complex decoders, compared to the Pentium Pro's one complex
and two simple decoders) producing up to four micro-ops and issuing up
to six (to seven units - load, store, complex/simple integer, FPU,
branch, multimedia) and retiring four per cycle. It includes MMX
instructions, licensed from Intel, and AMD has designed and added 3DNow!
graphics extensions without waiting for Intel's SSE additions.
AMD aggressively pursued superscalar designs for the K7 (mid 1999),
decoding x86 instructions into 'MacroOps' (made up of one or
two 'micro-ops', a process similar to the branch folding in the
AT&T Hobbit or instruction grouping in the
T9000 Transputer and the
Sun microJava CPU)
in two decoders (one for simple and one for complex
instructions) producing up to three MacroOps per cycle. Up to nine
decoded operations per cycle can be issued in six MacroOps to six
functional units (three integer, each able to execute one simple integer
and one address op simultaneously, and three FPU/MMX/3DNow! instructions
with extensive stack and register renaming, and a separate integer
multiply unit which follows integer ALU 0, and can forward results to
either ALU 0 or 1). The K7 replaces the Intel-compatible bus of the
K6 with the high speed Alpha EV6
bus because Intel decided to prevent competitors from using its own
higher speed bus designs (Dirk Meyer was director of engineering for
the K7, as well as co-architect of the Alpha 21064 and 21264). This
makes it easier to use either Alpha or AMD K7 processors in a single
design. At introduction, the K7 managed to out-perform Intel's fastest
CPU.
Centaur, a subsidiary of Integrated Device Technology, introduced
the IDT-C6 WinChip (May 1997), which uses
a much simpler (6-stage, 2 way integer/simple-FPU execution) design than
Intel and AMD translation-based designs by using micro-ops more closely
resembling 80x86 than RISC code, which allows for a higher clock
rate and larger L1 (32K each I/D) and TLB caches in a lower cost, lower
power consumption design. Simplifications include replacing branch
prediction (less important with a short pipeline) with an eight entry
call/return stack, depending more on caches. The FPU unit includes MMX
support. The C6+ version adds a second FPU/MMX unit and 3D
graphics enhancements. The Centaur division was bought by PC chipset
maker Via (as with Cyrix).
Like Cyrix, IDT opted for a superpipelined
eleven-stage design for added performance, combined with
sophisticated early branch prediction in its WinChip 4. The design
also pays attention to supporting common code sequences - for example,
loads occur earlier in the pipeline than stores, allowing
load-alu-store sequences to be more efficient.
Intel, with partner Hewlett-Packard, has begun development of a
next generation 64-bit processor architecture called IA-64 (the 80x86
design was renamed IA-32). It is expected to be a variable length
instruction group (or what HP/Intel call EPIC
(Explicit Parallel Instruction Computing)) with instruction
dependencies grouped from 1 to 9+. IA-64 is expected to read 41-bit
instructions in 128 bit bundles of three plus five "template bits" which
indicate dependencies/types (Merced is expected to execute two at a
time, to two load/store, four integer, two dual-pipe FP units). Most
instructions are predicated,
a design very similar to the TI 320C6x DSP,
but with 128 general 64-bit and 128 floating point registers, and 64
predicate bits (a type of condition code). To reduce page faults,
a speculative load instruction executes a load, but does not trap if
there is an exception until a second instruction completes it -
if the second instruction is predicated and never executes, then a
page fault is avoided, and loads can be rescheduled more flexibly.
It's expected to be compatible in some way with both the PA-RISC and
80x86 - it will include 80x86
data and segment registers, with additional instructions to switch
between instruction/register sets and transfer
data between 80x86 and IA-64 registers. It is expected to
translate 80x86 instructions into VLIW instructions (or directly to
decoded instructions) the same way that
Intel P6 and AMD K5/K6/K7 CPUs do, but with a larger number of
instructions issued using the VLIW design, it should be faster. However,
if native IA-64 code is even faster, this may finally produce the
incentive to let the 80x86 architecture finally fade away.
On the other hand, the demand for compatibility will remain a strong
market force. AMD announced its intention to extend the K7 design to
produce a 64-bit 80x86 compatible K8, in competition with the Intel
Merced.
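The IA-64 bundle arithmetic mentioned above works out exactly: three 41-bit instructions plus five template bits fill 128 bits. A small Python sketch makes the packing concrete. The bit positions are an assumption for illustration (the template is placed in the lowest five bits, with the three slots above it), and the function names are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template & 0x1F
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + SLOT_BITS * i)
    return bundle

def split_bundle(bundle):
    """Recover the template and the three instruction slots from a bundle."""
    template = bundle & 0x1F
    slots = [(bundle >> (TEMPLATE_BITS + SLOT_BITS * i)) & SLOT_MASK
             for i in range(3)]
    return template, slots

bundle = pack_bundle(0x10, [0x1, 0x2, SLOT_MASK])
print(bundle.bit_length() <= 128)   # True: 5 + 3 * 41 = 128
print(split_bundle(bundle) == (0x10, [0x1, 0x2, SLOT_MASK]))   # True
```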
So why did IBM choose the 8-bit 8088 (1979) version of the 8086 for
the IBM 5150 PC (1981) when most of the alternatives
were so much better? Apparently IBM's own engineers wanted to use the
68000,
and it was used later in the forgotten IBM Instruments 9000
Laboratory Computer, but IBM already had rights to manufacture the
8086, in exchange for giving Intel the rights to its
bubble memory
designs. Apparently IBM was using 8086s in the IBM Displaywriter word
processor.
Other factors were the fact that the 8-bit 8088 could use
existing low cost 8085-type
components, and allowed the computer to be based on
a modified 8085
design. 68000
components were not widely available, though it could use
6800 components to an extent. After the failure and
expense of the IBM 5100 (1974/5/6? - their first attempt at a personal
computer - discrete random logic CPU with no bus, built in BASIC and APL
as the OS), cost was a large factor in the design of the PC.
The availability of CP/M-86 is also likely a factor, since
CP/M was the operating system standard for the computer industry at
the time. However Digital Research founder Gary Kildall was unhappy with
the legal demands of IBM, so Microsoft, a programming language company,
was hired instead to provide the operating system (initially known at
varying times as QDOS, SCP-DOS, and finally 86-DOS, it was purchased by
Microsoft from Seattle Computer Products and renamed MS-DOS).
Digital Research did eventually produce CP/M 68K for the
68000 series, making the operating system choice
less relevant than other factors.
Intel bubble memory
was on the market for a while, but faded away
as better and cheaper memory technologies arrived.
- Intel Corporation:
- http://www.intel.com/
- Intel Product Info:
-
http://www.intel.com/intel/product/index.htm
- AMD, Inc.:
- http://www.amd.com/
- AMD PC Processors Plus:
-
http://www.amd.com/products/cpg/cpg.html
- Cyrix:
- http://www.cyrix.com/
- Centaur:
- http://www.centtech.com/
- Hewlett-Packard:
-
http://www.hp.com/
- HP Microprocessors:
- http://CPUs.hp.com/
- Interview with Bill Gates:
-
http://innovate.si.edu/history/gates/gatestoc.htm
- Merced Facts and Speculations:
-
http://www.microprocessor.sscc.ru/Merced/
Section Four: Unix and RISC, a New Hope
Part I: TRON, between the ages (1987)
TRON stands for The Real-time Operating Nucleus, and was a grand
scheme conceived by Prof.
Ken Sakamura of the University of Tokyo around 1984 to design a unified
architecture for computer systems from the CPU, to operating systems
and languages, to large scale networks. The TRON CPU was designed just
as load-store architectures
were set to rise, but retained the memory-data design philosophies - it could
be considered a last gasp, though that doesn't do justice to the intent
behind the design and its part in the TRON architecture.
The basic design is scalable, from 32 to 48 and 64 bit designs,
with 16 general purpose registers. It is a memory-data instruction set, but an
elegant one. One early design was the Mitsubishi M32 (mid 1987), which
optimised the simple and often used TRON instructions, much like the
80486
and 68040
did. It featured a 5 stage pipeline, dynamic branch
prediction with a target branch buffer similar to that in the
AMD 29K.
It also featured an instruction prefetch queue, but being a prototype,
had no MMU support or FPU.
Commercial versions such as the Gmicro/200 (1988) and other Gmicro/
from Fujitsu/Hitachi/Mitsubishi, and the Toshiba Tx1 were also introduced,
and a 64 bit version (CHIP64) began development, but they didn't catch
on in
the non-Japanese market (definitive specifications or descriptions of
the OS's actual operation were hard to come by, while research systems
like Mach or BSD Unix were widely available for experimentation). In
addition, newer techniques (such as load-store designs)
overshadowed the TRON standard. Companies such as
Hitachi switched to
load-store designs, and many American companies (Sun, MIPS) licensed their
(faster) designs openly to Japanese companies. TRON's promise of
a unified architecture (when complete) was less important to companies
than raw performance and immediate compatibility (Unix, MS-DOS/MS
Windows, Macintosh), and TRON has not become significant in the industry, though
TRON operating system development continued as an embedded and distributed
operating system (such as the Intelligent House project, or more
recently the TiPO handheld digital assistant from Seiko (February 1997))
implemented on non-TRON CPUs.
NEC produced a similar memory-data design around the same time, the
V60/V70
series, using thirty two registers, a seven stage pipeline, and
preprocessed branches. NEC later developed the 32-bit load-store V800
series, and became a source of 64-bit
MIPS load-store processors.
- TRON Project Information
-
http://tron.um.u-tokyo.ac.jp/TRON/
- Kahaner's report on TRON
-
http://www.baker.com/grand-unification-theory/archive/199103/04.html
Part II: SPARC, an extreme windowed RISC (1987)
SPARC, or the Scalable (originally Sun) Processor ARChitecture was
designed by Sun Microsystems for their own use. Sun was a maker of
workstations, and used standard 68000-based
CPUs and a standard
operating system, Unix. Research versions of load-store processors had
promised a major step forward in speed [See
Appendix A], but existing
manufacturers were slow to introduce a RISC processor, so Sun
went ahead and developed its own (based on
Berkeley's design). In keeping
with their open philosophy, they licensed it to other companies, rather
than manufacture it themselves.
SPARC was not the first RISC processor. The
AMD 29000 (see below)
came before it, as did the MIPS R2000
(based on Stanford's experimental
design) and Hewlett-Packard PA-RISC
CPU, among others. The SPARC design
was radical at the time, even omitting multiple cycle multiply and
divide instructions (added in later versions), using single-cycle
"step" instructions instead, while most RISC CPUs were more
conventional.
SPARC usually contains about 128 or 144 integer registers
(memory-data designs typically had 16 or fewer).
At any time 32 registers are available -
8 are global, the rest are allocated in a 'window' from a stack of
registers. The window is moved 16 registers down the stack during a
function call, so that the upper and lower 8 registers are shared
between functions, to pass and return values, and 8 are local. The
window is moved up on return, so registers are loaded or saved only at
the top or bottom of the register stack. This allows functions to be
called in as little as 1 cycle. Later versions added 32 (non-windowed)
FPU registers. Like most RISC processors, global
register zero is wired to zero to simplify instructions, and SPARC is
pipelined for performance (a new instruction can start execution before
a previous one has finished), but not as deeply as others - like the
MIPS CPUs, it has branch delay slots. Also like
previous processors, a dedicated condition code register (CCR) holds
comparison results.
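The overlap between windows is easier to follow with a concrete mapping from the 32 visible register numbers to a physical register file. The Python sketch below is one plausible layout, not a description of any specific SPARC chip: it assumes r0-r7 are the globals, r8-r15 the outgoing registers, r16-r23 the locals and r24-r31 the incoming registers, that a call moves the window pointer down by one window (16 registers), and that the stack wraps around. The function name and the default of 8 windows are hypothetical.

```python
def physical_register(r, cwp, nwindows=8):
    """Map visible register r (0-31) in window cwp to a physical register index."""
    if not 0 <= r < 32:
        raise ValueError("visible registers are r0 to r31")
    if r < 8:
        return r                                  # globals are not windowed
    windowed = 16 * nwindows                      # size of the circular register stack
    return 8 + (cwp * 16 + (r - 8)) % windowed    # 24 visible, overlapping by 8

caller, callee = 3, 2          # a call moves the window pointer down by one

# The caller's first outgoing register is the callee's first incoming one:
print(physical_register(8, caller), physical_register(24, callee))    # 56 56

# Locals are private to each window:
print(physical_register(16, caller), physical_register(16, callee))   # 64 48
```

Because arguments are already sitting in the shared registers, a call only has to move the window pointer, and memory traffic is needed only when the circular stack overflows or underflows.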
SPARC is 'scalable' mainly because the register stack can be
expanded (up to 512, or 32 windows), to reduce loads and saves between
functions, or scaled down to reduce interrupt or context switch time,
when the entire register set has to be saved. Function calls are
usually much more frequent than interrupts, so the large register set
is usually a plus, but compilers now can usually produce code which
uses a fixed register set as efficiently as a windowed register set
across function calls.
SPARC is not a chip, but a specification, and so there are various
designs of it. It has undergone revisions, and now has multiply and
divide instructions. Original versions were 32 bits, but 64 bit and
superscalar versions were designed and implemented (beginning with the
Texas Instruments SuperSPARC in late 1992), but performance lagged behind
other load-store and even Intel 80x86
processors until the UltraSPARC (late 1995) from Texas Instruments and
Sun, and superscalar HAL/Fujitsu SPARC64 multichip CPU. Most emphasis by
licensees other than Sun and HAL/Fujitsu has been on low cost, embedded
versions.
The UltraSPARC is a 64-bit superscalar processor series
which can issue up to four instructions at once (but not
out of order)
to any of nine units: two integer units, two of the five floating
point/graphics
units, the branch and load/store unit. The UltraSPARC also added a
block move instruction which bypasses the caches (2-way 16K instr, 16K
direct mapped data), to avoid disrupting them, and specialized pixel
operations (VIS - the Visual Instruction Set) which can operate in
parallel on 8, 16, or 32-bit integer values packed in a 64-bit floating
point register (for example, four 8 X 16 -> 16 bit multiplications in
a 64 bit word, a sort of simple SIMD/vector operation). More extensive
than the Intel MMX instructions, or earlier HP PA-RISC MAX and
Motorola 88110 graphics extensions, VIS also
includes some 3D to 2D conversion, edge processing and pixel distance
(for MPEG, pattern-matching support).
The HAL/Fujitsu SPARC64 (used in Fujitsu servers using Sun Solaris
software) can issue up to four in order instructions
simultaneously to four buffers, then to four integer, two floating
point, two load/store, and the branch unit, and may complete out of
order unlike UltraSPARC (an instruction completes when it finishes
without error, is committed when all instructions ahead of it have
completed, and is retired when its resources are freed - these are
'invisible' stages in the SPARC64 pipeline). A combination of
register renaming, a
branch history table, and processor state storage (like in the
Motorola 88K) allow for
speculative execution
while maintaining precise exceptions/interrupts
(renamed integer, floating, and CC registers -
trap levels are also renamed
and can be entered speculatively).
- Sun Microsystems:
- http://www.sun.com/
- Sun Microprocessor Products:
-
http://www.sun.com/microelectronics/products/microproc.html
- HAL Computer Systems, Inc.:
- http://www.hal.com/
Part III: AMD 29000, a flexible register set (1987)
The AMD 29000 is another load-store CPU descended from the
Berkeley RISC design
(and the IBM 801 project),
as a modern successor to the earlier 2900
bitslice series (beginning around 1981). Like the
SPARC
design that was intro