============================ The 4MHz CoCo3 Project ============================

I've wanted to put together a text about this project for a long time now, but
have usually managed to lose track of my time. This time, I'm finally going to
get this thing written, before I forget too much about it!

This project is actually a pretty old idea. The idea of pushing more speed out
of the CoCo3 is almost as old as the CoCo3 itself. Soon after the introduction
of the CoCo3, many people felt that it was nice and all, but it just wasn't
enough. Sure, you could do 2MHz, but so could the CoCo2 - to some extent. With
the increasing speed of *other* computers, the CoCo3 was starting to lag
behind in CPU power.

To clarify a few things, this project isn't really 4MHz. Nor is a stock CoCo3
really 2MHz. This little bit of mislabelling actually started at the very
beginning with the CoCo1 - which was actually 0.89MHz. Since it was close
enough to 1MHz, most people just called it that because it was easier to say.
When the 'fast' pokes were discovered, people simply carried on the tradition
by saying it ran at 2MHz because it was twice as fast. In actuality, the CoCo3
is really 1.789MHz, and a '4MHz' CoCo3 would be twice that - 3.579MHz. But
there are other reasons why I convince myself that I can get away with calling
it 4MHz. I'll get into those later.

How to speed up your CoCo3
--------------------------

There are quite a number of ways you can increase the clock rate of your
CoCo3. We'll start with the simplest methods and work our way up to the more
complex hardware modifications.

Method 1
--------

The easiest modification you can do is to simply replace the main clock
crystal on the motherboard with one of another frequency. The CoCo3 has a
single clock crystal on the board, 28.636363MHz. This crystal gets divided
down and provides the timing for every single piece of hardware in the
machine. If you change it, EVERYTHING changes speed, including the rate of the
video display.
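If you want to check the arithmetic, the derived rates work out like this
(a quick Python sketch; the divide-by-16 is my assumption of how the CPU clock
is derived from the crystal, but it matches the stock figures):

```python
# Clock arithmetic for the crystal-swap method (a sketch; assumes the
# CPU clock is the master crystal divided by 16).

MASTER = 28.636363  # MHz, the stock CoCo3 crystal

def cpu_mhz(crystal_mhz):
    """CPU clock rate derived from the master crystal (divide-by-16 assumed)."""
    return crystal_mhz / 16

print(round(cpu_mhz(MASTER), 3))  # stock 'fast' rate: 1.79 (really 1.789...)
print(round(cpu_mhz(32.0), 3))    # a 32MHz crystal gives a true 2.0 MHz CPU
```

Which also shows why the '2MHz' label is generous: the stock fast rate is
really 1.789MHz, and doubling the crystal only buys you a true 2MHz.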
Because of this, the usefulness of accelerating your CoCo this way is a bit
limited. You can only change the speed by a few percent before your monitor
stops being able to display the video properly. Even a 32MHz crystal, which
would result in 2MHz operation, is enough to stop my CM-8 monitor from syncing
up properly with the video. Of course, you can also take apart your monitor
and alter its timing as well to match the new video rate. If you're lucky,
your monitor may even be able to sync up to the faster video signal without
any modifications.

All that aside, there are more problems, especially if you're planning on
going beyond 2MHz. If you have an RS-232 Pak, its operation starts failing
around this point, and there is no solution that I can see unless there is a
faster version of the 6551 chip that you can buy and plug into your RS-232
Pak. Printer and Cassette port operation also change rate accordingly. The
Cassette routines will no longer be able to read your old tapes, and there is
no easy solution - but then, it probably doesn't matter. Not too many people
use the Cassette port. The Printer port is the same, but it's easy to poke new
values into the baud rate selector. This turns out to be a very minor thing.

Method 2
--------

The next easiest thing you can do is simply toss out the 6809 and replace it
with a Hitachi 6309. While this doesn't actually increase the clock speed of
the CPU, it's a common way of increasing performance, so I've included it
here. Without doing anything else, the CPU performs at exactly the same speed
as a 6809, but once you kick it into 6309 Native mode, things start looking
up. All old programs automatically get an average speed increase of about 12%
(everyone gets a different number, but this is my measurement). When you start
using programs that were specifically written for the 6309, things get
drastically better. Some operations can be many times faster, but on average,
I'd say it's somewhere near 40% faster.
The performance increases of the 6309 are due to a number of things, but the
two important ones are that the CPU executes most instructions in fewer clock
cycles than a 6809, and that it has a bunch of new registers, functions and
instructions that the 6809 does not.

One good thing about the 6309 is that, even if you do get to try one of the
other methods written below, you can combine the acceleration hardware with a
6309 and increase your performance even more! (That, by the way, is my excuse
for getting away with calling my 4MHz accelerator project '4MHz'.) Even though
my accelerator is really 3.579MHz, the extra performance increase of using it
along with a 6309 makes it not so much of a lie to call it 4MHz :)

Method 3
--------

For real speed increases, what you need to do is run the CPU at a different
rate while still running the rest of the CoCo hardware at its original speed.
There are some bottlenecks to this concept. The biggest enemy is actually the
famous GIME chip itself. The GIME controls all access to memory, and since
it's a big chunk of circuitry all built into a single package, it becomes
impossible to tinker with it. The conclusion quickly becomes that even though
the CPU can run at whatever rate you want it to, all memory access STILL has
to happen at the old rate of 1.789MHz.

The next idea is that the clock rate of the CPU be generated by a separate new
circuit that is independent of the main clock circuit. While I was thinking
about the 4MHz Project, I had a friend of mine also thinking about it. Both of
us came up with completely different concepts. This is his idea:

The new clock circuit for the CPU can run at whatever rate you want. He had
ideas of making it selectable to any speed, going as fast as 8MHz. The circuit
attempts to keep the CPU running at this rate as much as possible, but every
time the CPU needs to access memory, it halts the clock generator and waits
for synchronization with the old 1.789MHz Bus.
Once synchronized with a Bus cycle, the new clock generator proceeds to
execute a single 1.789MHz cycle. Thus memory is accessed, the rest of the CoCo
is happy, and then the CPU tries to resume its idea of running at 8 (or
whatever) MHz.

Sound cool? There turns out to be a hitch. In order for true 8MHz operation to
happen, the CPU has to never access memory! In most operations, the CPU
accesses memory most of the time. True, there are plenty of 'non-memory'
cycles, but even single cycles that don't use memory don't help. For any
single cycle that doesn't use memory, the CPU DOES execute at 8MHz, but for
the next cycle that does use memory, the circuitry has to halt the clock until
it reaches a point of synchronization with the Bus. The reality of this is
that it has just spent 7/8ths of a Bus clock cycle waiting for that
synchronization point - hence no speed increase whatsoever.

Ah, but what about cases where there are two or more CPU cycles in a row that
don't use memory? In those cases, the CPU runs happily along at 8MHz, until
the next cycle that needs memory comes along. There are a number of cases
where there are two or more 'non-memory' cycles in a row, most notably the MUL
instruction, which only reads memory twice and then runs the next 9 cycles
internally. With this circuit, a MUL will execute in 4, possibly 3 (if you
clock the CPU at just over 8MHz) Bus cycles, instead of the stock 11 cycles -
that's over 3.6 times the speed! This is a best case scenario though. On
average, this circuit will only increase performance by a fairly small
percentage.

Method 4
--------

Surely, there must be some other way, you say? And yes, there are plenty of
other tricks you can use. That was my friend's idea; here is what my idea at
the time was (and we came up with these ideas independently, without knowledge
of the drawbacks of each other's idea).
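(As an aside, before getting into my idea: Method 3's Bus-cycle arithmetic can
be checked with a small simulation. This is purely my own model of the
halt-until-sync behaviour, not his actual circuit, and it counts
conservatively - letting the tail of an instruction overlap the next opcode
fetch is where a 'possibly 3' for MUL would come from.)

```python
import math

BUS = 1.78977  # MHz, the stock Bus rate

def bus_cycles(pattern, cpu_mhz):
    """Bus cycles consumed under the Method 3 halt-until-sync scheme.

    pattern: one char per CPU cycle, 'M' = memory access, 'I' = internal.
    Memory cycles must line up with the 1.79MHz Bus grid; internal cycles
    run freely at cpu_mhz.  (A simplified model, not the real circuit.)
    """
    t = 0.0  # elapsed time, measured in Bus cycles
    for c in pattern:
        if c == 'M':
            t = math.ceil(t - 1e-9)  # stall until the next Bus boundary...
            t += 1.0                 # ...then spend one full Bus cycle
        else:
            t += BUS / cpu_mhz       # internal cycle at the fast clock
    return math.ceil(t - 1e-9)

mul = 'MM' + 'I' * 9                    # MUL: 2 memory reads + 9 internal
print(bus_cycles(mul, 8.1))             # just over 8MHz: 4 Bus cycles
print(bus_cycles(mul, 1.78977))         # stock speed for comparison: 11
print(bus_cycles('MIM', 8.1))           # a lone internal cycle: 3, no gain
```

The 'MIM' case shows the hitch described above: a single non-memory cycle
between two memory cycles buys nothing, because all the time saved is spent
waiting to resynchronize with the Bus.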
My figuring was that the best way to boost performance would be to try to
sneak extra CPU cycles by the GIME without the GIME ever noticing what had
happened. (Sounds just like me, doesn't it?) This concept doesn't actually run
the CPU at a separate clock speed. The plan was that some extra circuitry
could analyse the signals coming out of the CPU and then decide when it could
get away with a sneaky 'burst' cycle without the rest of the motherboard
noticing.

The circuit is designed to run the CPU at the same old speed it usually ran at
(1.789MHz), except that the circuit knows when the CPU will not be using
memory (note the future tense) in the next CPU cycle. Every time this happens,
the new clock circuit takes over. It waits for a point near the end of the
1.789MHz Bus cycle, then bypasses the normal clock signals and starts feeding
a new signal to the CPU instead. The new signal is effectively pasted in the
place of the old one. The general gist of it is that the last bit of the
normal clock pulse gets cut short, after a small amount of time a very fast
extra clock pulse is generated (somewhere around 6 or 7MHz) and completed,
then it leaves a bit of blank time for the CPU to settle, and then restores
the normal clock operation after shaving a little bit of time off the
beginning of the next 1.789MHz Bus cycle. The effect? The CPU just executed
two CPU cycles in one Bus cycle and the GIME never saw it.

The advantage of this method is that it works even when there is only one
non-memory CPU cycle between two cycles that do use memory. This circuit can
make the CPU run at 4MHz even when it has to access memory half of the time.
In reality, the CPU uses memory more often than it does not, so the circuit
doesn't quite reach 4MHz performance levels. But it is noticeably more
effective than the previous method. But after constructing this circuit, I
realized I had other obstacles...

Method 5
--------

We are now in the realm of the theoretical.
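(Backing up to Method 4 for a second: here's a little Python sketch, again
purely my own model, of how the burst trick pays off on different cycle
patterns. As before, 'M' is a cycle that touches memory and 'I' is an internal
one; the circuit can squeeze at most one extra cycle into each Bus cycle.)

```python
def bus_cycles_burst(pattern):
    """Bus cycles under the Method 4 'burst' trick: whenever the NEXT CPU
    cycle won't touch memory, it gets pasted into the tail of the current
    Bus cycle as a fast extra pulse.  (A rough model, not the circuit.)"""
    bus = 0
    i = 0
    while i < len(pattern):
        bus += 1                     # current cycle occupies one Bus cycle...
        if i + 1 < len(pattern) and pattern[i + 1] == 'I':
            i += 2                   # ...and carries a free internal cycle
        else:
            i += 1
    return bus

print(bus_cycles_burst('MIM'))       # 2 - even a lone internal cycle helps
print(bus_cycles_burst('MIMIMIMI'))  # 4 - memory half the time: true 2x speed
print(bus_cycles_burst('MMMM'))      # 4 - all-memory code gains nothing
```

Compare the 'MIM' case against Method 3, where a lone internal cycle bought
nothing at all - that's exactly the advantage described above.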
The next ideas have never been built, but offer even better performance.

The problem with the previous method is, oddly enough, that the 6809 was
designed in a way that was meant to improve performance. Motorola's design has
the CPU read memory while it's decoding the opcode, in the expectation that
the value just read will be used in the future. This works in many cases, but
for certain operations, that value that was just read from memory gets
discarded because it was not needed. To make matters worse, when it completes
executing that opcode, the CPU then reads the SAME memory location again to
get the next opcode to execute. The CPU has just read one memory location
twice in a row! I was expecting better performance in my accelerator project,
and now I figured out why it wasn't running as fast as it should have - each
of those 'twice in a row' memory reads takes two CPU cycles, one of which
effectively becomes a wasted Bus cycle to my accelerator.

The solution? A one-byte cache for the 4MHz Project. Every memory location
that gets read by the CPU gets stored in this cache. Every time the CPU tries
to read the same memory location twice in a row, a new part of the circuit
would decide that this value is already stored in the cache, and enable the
'burst' cycle to the CPU while putting this value on the data Bus. This would
effectively eliminate all slowdowns caused by the CPU's inefficient usage of
the Bus.

Method 6
--------

Of course, even still, the project will not actually run the CPU at a true
double speed. In cases where the CPU actually needs to access memory many
times in a row, the circuit has to shut down and let the CPU access the Bus at
1.789MHz. How could we solve that problem? Well, there's also a way to cross
that obstacle. It just so happens that the GIME chip actually accesses memory
16 bits at a time.
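(Back to Method 5's one-byte cache for a moment: its behaviour works out to
something like this little Python sketch of mine. A 'hit' means the burst
cycle fires and no Bus cycle is spent.)

```python
class OneByteCache:
    """Method 5's one-byte cache (a sketch): remember the last address the
    CPU read, so the 6809's 'read the same byte twice in a row' pattern can
    be served by a burst cycle instead of a second Bus cycle."""

    def __init__(self):
        self.addr = None
        self.value = None

    def read(self, memory, addr):
        """Return (value, hit); hit=True means no Bus cycle was needed."""
        if addr == self.addr:
            return self.value, True          # same byte again: burst it
        self.addr, self.value = addr, memory[addr]
        return self.value, False             # real 1.79MHz Bus read

# The wasteful 'fetch, discard, fetch the same byte as an opcode' pattern:
mem = {0x100: 0x3F, 0x101: 0x12}
cache = OneByteCache()
print(cache.read(mem, 0x100))   # (63, False) - Bus read
print(cache.read(mem, 0x100))   # (63, True)  - the double read is now free
```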
That's how we can have those hi-res 16 color graphics while only needing to
access memory twice as fast as on a CoCo2. The GIME manages to pull in twice
as much data per memory access for video than the CoCo2 did. There's a way to
use this to your advantage. Every time the CPU reads something from memory,
the GIME actually reads a 16-bit value, but only feeds the CPU whichever 8
bits it wants. If we put a 16-bit cache on the data lines going from the
memory to the GIME, we can store the whole 16-bit value and then already have
the next memory byte ready for the CPU when it wants to read it during the
next Bus cycle. (90% of the time, the CPU reads memory sequentially, one byte
after the next.) Always having the second byte in advance, we can give the CPU
closer to true double speed operation while still retaining the old 1.789MHz
Bus speed. Since we're caching the full 16-bit value read from memory, it's
also possible to speed up the times when the CPU reads the same memory
location twice in a row, and even some cases where the CPU reads memory
locations in reverse order (if any?). With this idea, we can eliminate the
one-byte CPU cache from the previous method, and use this 16-bit memory cache
in its place to get even better speed increases.

Sounds good? There is one minor hitch that needs to be worked out, but I don't
think it should be too hard. The cache should be smart enough to know when the
CPU *isn't* actually reading memory. The problem would arise when the CPU
tries to read a 16-bit value from some hardware registers (let's say we're
reading the MMU bank values). Upon reading the first byte during one cycle,
the cache will try to feed the CPU the next byte in the next CPU cycle - but
the CPU isn't trying to read from memory, so it will be fed the wrong value.
One way of avoiding this would be to simply disable the cache when the CPU is
reading from the $FFxx page of memory (which is where all the hardware
registers are located).
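To make the idea concrete, here's a Python sketch of how the 16-bit cache
would decide between a Bus read and a burst, including the $FFxx bypass (my
own illustrative model; the real thing would be logic chips, not code):

```python
class WordCache:
    """Method 6 (a sketch): the GIME fetches 16 bits per memory access, so
    cache the whole word; the CPU's next sequential byte is then already on
    hand.  Reads in the $FFxx hardware page bypass the cache entirely."""

    def __init__(self):
        self.base = None   # even address of the cached 16-bit word
        self.word = None   # (byte at base, byte at base+1)

    def read(self, memory, addr):
        """Return (value, hit); hit=True means no Bus cycle was needed."""
        if addr >= 0xFF00:                    # hardware registers: no caching
            return memory[addr], False
        base = addr & ~1                      # word-align the access
        if base == self.base:
            return self.word[addr & 1], True  # already have it: burst cycle
        self.base = base
        self.word = (memory[base], memory[base + 1])
        return self.word[addr & 1], False     # real 1.79MHz Bus read

# Sequential code: every second byte comes for free.
mem = {0x100: 1, 0x101: 2, 0x102: 3, 0x103: 4}
cache = WordCache()
print([cache.read(mem, a)[1] for a in (0x100, 0x101, 0x102, 0x103)])
```

With purely sequential reads, half the accesses hit - which is where the
'closer to true double speed' figure comes from.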
Method 7
--------

Can we go even further? You must think this is getting ridiculous. Since we've
already incorporated a cache into the CoCo, it might be a good idea to take
better advantage of it. Why not throw in a whole 8K of cache? It gets a little
tricky when you have an MMU. The CPU only sees 64K at a time, and if you start
swapping different banks of memory in and out, the cache would get awfully
confused. One possibility is something that might sound a little strange at
first, but it works out quite nicely in the CoCo3's case. Sure, put in an 8K
cache, but make it 8K by 14 bits. The lower 8 bits of each cache entry hold
the 8-bit value from memory, and the upper 6 bits hold the number of the MMU
bank that this memory byte originated from. The address bits don't need to be
stored, because the address itself is the index into the cache.

Starting to become interesting? This cache will always automatically retain
the memory last used by the CPU, with no hardware design hassles. With
something like this, Method 3 in the text becomes extremely appealing. It
solves the problem of constantly waiting for synchronization with the Bus, and
gives us the possibility of running the CPU at TRUE 8MHz with only minimal
slowdowns in worst case scenarios! Thinking ahead, it would also be a good
idea to throw the extra 2 bits of data into the cache so that it could
properly handle 2Meg machines, making the cache an even 8K by 16 bits: 8 bits
for the data, and 8 bits for the MMU bank value.

The limitation of this idea is that the CPU can still only write to memory at
1.789MHz. All the speed increases are for memory reads only. But reading is
what the CPU does most of the time, so the CPU speed will still be drastically
increased.

Method 8
--------

More? No, I haven't gotten this far. But the next step would be to allow the
cache to buffer memory writes as well, hence removing the second-to-last
bottleneck from true accelerated CPU performance.
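Here's a Python sketch of the Method 7 cache organization (again my own
illustrative model): a direct-mapped cache indexed by the 13-bit offset within
an 8K bank, with the MMU bank number stored as the tag. Remapping a bank
through the MMU just makes the stale entries miss, with no hardware hassles.

```python
class MmuCache:
    """Method 7 (a sketch): an 8K direct-mapped cache.  Index = the 13-bit
    offset within an 8K MMU bank; each entry holds the data byte plus the
    6-bit MMU bank number it came from as a tag (8K x 14 bits in total)."""

    SIZE = 8192

    def __init__(self):
        self.tag = [None] * self.SIZE   # MMU bank number per entry
        self.data = [0] * self.SIZE

    def read(self, physical_memory, mmu, cpu_addr):
        """mmu: 8-entry table mapping the CPU's 8K slots to bank numbers.
        Return (value, hit); hit=True means full CPU-speed access."""
        bank = mmu[cpu_addr >> 13]      # which bank this 8K slot maps to
        index = cpu_addr & 0x1FFF       # offset in the bank = cache index
        if self.tag[index] == bank:
            return self.data[index], True             # hit: no Bus cycle
        value = physical_memory[(bank << 13) | index]  # miss: 1.79MHz read
        self.tag[index] = bank
        self.data[index] = value
        return value, False

# Bank swapping sorts itself out: remap a slot and the old entry misses.
ram = {(56 << 13) | 0x10: 0xAA, (48 << 13) | 0x10: 0xBB}
mmu = [56, 57, 58, 59, 60, 61, 62, 63]   # a typical 512K MMU setup
cache = MmuCache()
print(cache.read(ram, mmu, 0x0010))      # (170, False) - first read misses
print(cache.read(ram, mmu, 0x0010))      # (170, True)  - now full speed
mmu[0] = 48                              # swap a different bank into slot 0
print(cache.read(ram, mmu, 0x0010))      # (187, False) - tag mismatch, miss
```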
The very last bottleneck is the 1.789MHz Bus speed limitation. And that is
something to deal with another time...

John Kowalski (Sock Master)
http://users.axess.com/twilight/sock/