============================ The 4MHz CoCo3 Project ============================

I've wanted to put together a text about this project for a long time now, but
have usually managed to lose track of my time. This time, I'm finally going to
get this thing written, before I forget too much about it!

This project is actually a pretty old idea. The idea of pushing more speed out
of the CoCo3 is almost as old as the CoCo3 itself. Soon after the introduction
of the CoCo3, many people felt that it was nice and all, but it just wasn't
enough. Sure, you could do 2MHz, but so could the CoCo2 - to some extent. With
the increasing speed of *other* computers, the CoCo3 was starting to lag
behind in CPU power.

To clarify a few things, this project isn't really 4MHz. Nor is a stock CoCo3
really 2MHz. This little bit of mislabelling actually started at the very
beginning with the CoCo1 - which was actually 0.89MHz. Since it was close
enough to 1MHz, most people just called it that because it was easier to say.
When the 'fast' pokes were discovered, people simply carried on the tradition
by saying it ran at 2MHz because it was twice as fast. In actuality, the CoCo3
is really 1.789MHz, and a '4MHz' CoCo3 would be twice that - 3.579MHz. But
there are other reasons why I convince myself that I can get away with calling
it 4MHz. I'll get into those later.

How to speed up your CoCo3
--------------------------

There are quite a number of ways you can increase the clock rate of your
CoCo3. We'll start with the simplest methods and work our way up to the more
complex hardware modifications.

Method 1
--------

The easiest modification you can do is to simply replace the main clock
crystal on the motherboard with one of another frequency. The CoCo3 has a
single clock crystal on the board, 28.636363MHz. This crystal gets divided
down and provides the timing for every single piece of hardware in the
machine. If you change it, EVERYTHING changes speed, including the rate of the
video display.
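If you want to check the arithmetic, the derived rates work out like this
(a quick Python sketch; the divide-by-16 is my assumption of how the CPU clock
is derived from the crystal, but it matches the stock figures):

```python
# Clock arithmetic for the crystal-swap method (a sketch; assumes the
# CPU clock is the master crystal divided by 16).

MASTER = 28.636363  # MHz, the stock CoCo3 crystal

def cpu_mhz(crystal_mhz):
    """CPU clock rate derived from the master crystal (divide-by-16 assumed)."""
    return crystal_mhz / 16

print(round(cpu_mhz(MASTER), 3))  # stock 'fast' rate: 1.79 (really 1.789...)
print(round(cpu_mhz(32.0), 3))    # a 32MHz crystal gives a true 2.0 MHz CPU
```

Which also shows why the '2MHz' label is generous: the stock fast rate is
really 1.789MHz, and doubling the crystal only buys you a true 2MHz.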
Because of this, the usefulness of accelerating your CoCo this way is a bit
limited. You can only change the speed by a few percent before your monitor
stops being able to display the video properly. Even a 32MHz crystal, which
would result in 2MHz operation, is enough to stop my CM-8 monitor from syncing
up properly with the video. Of course, you can also take apart your monitor
and alter its timing as well to match the new video rate. If you're lucky,
your monitor may even be able to sync up to the faster video signal without
any modifications.

All that aside, there are more problems, especially if you're planning on
going beyond 2MHz. If you have an RS-232 Pak, its operation starts failing
around this point, and there is no solution that I can see unless there is a
faster version of the 6551 chip that you can buy and plug into your RS-232
Pak. Printer and Cassette port operation also change rate accordingly. The
Cassette routines will no longer be able to read your old tapes, and there is
no easy solution - but then, it probably doesn't matter. Not too many people
use the Cassette port. The Printer port is the same, but it's easy to poke new
values into the baud rate selector. This turns out to be a very minor thing.

Method 2
--------

The next easiest thing you can do is simply toss out the 6809 and replace it
with a Hitachi 6309. While this doesn't actually increase the clock speed of
the CPU, it's a common way of increasing performance, so I've included it
here. Without doing anything else, the CPU performs at exactly the same speed
as a 6809, but once you kick it into 6309 Native mode, things start looking
up. All old programs automatically get an average speed increase of about 12%
(everyone gets a different number, but this is my measurement). When you start
using programs that were specifically written for the 6309, things get
drastically better. Some operations can be many times faster, but on average,
I'd say it's somewhere near 40% faster.
The performance increases of the 6309 are due to a number of things, but the
two important ones are that the CPU executes most instructions in fewer clock
cycles than a 6809, and that it has a bunch of new registers, functions and
instructions that the 6809 does not.

One good thing about the 6309 is that, even if you do get to try one of the
other methods written below, you can combine the acceleration hardware with a
6309 and increase your performance even more! (That, by the way, is my excuse
for getting away with calling my 4MHz accelerator project '4MHz'.) Even though
my accelerator is really 3.579MHz, the extra performance increase of using it
along with a 6309 makes it not so much of a lie to call it 4MHz :)

Method 3
--------

For real speed increases, what you need to do is run the CPU at a different
rate while still running the rest of the CoCo hardware at its original speed.
There are some bottlenecks to this concept. The biggest enemy is actually the
famous GIME chip itself. The GIME controls all access to memory, and since
it's a big chunk of circuitry all built into a single package, it becomes
impossible to tinker with it. The conclusion quickly becomes that even though
the CPU can run at whatever rate you want it to, all memory access STILL has
to happen at the old rate of 1.789MHz.

The next idea is that the clock rate of the CPU be generated by a separate new
circuit that is independent of the main clock circuit. While I was thinking
about the 4MHz Project, I had a friend of mine also thinking about it. Both of
us came up with completely different concepts. This is his idea:

The new clock circuit for the CPU can run at whatever rate you want. He had
ideas of making it selectable to any speed, going as fast as 8MHz. The circuit
attempts to keep the CPU running at this rate as much as possible, but every
time the CPU needs to access memory, it halts the clock generator and waits
for synchronization with the old 1.789MHz Bus.
Once synchronized with a Bus cycle, the new clock generator proceeds to
execute a single 1.789MHz cycle. Thus memory is accessed, the rest of the CoCo
is happy, and then the CPU tries to resume its idea of running at 8 (or
whatever) MHz.

Sound cool? There turns out to be a hitch. In order for true 8MHz operation to
happen, the CPU has to never access memory! In most operations, the CPU
accesses memory most of the time. True, there are plenty of 'non-memory'
cycles, but even single cycles that don't use memory don't help. For any
single cycle that doesn't use memory, the CPU DOES execute at 8MHz, but for
the next cycle that does use memory, the circuitry has to halt the clock until
it reaches a point of synchronization with the Bus. The reality of this is
that it has just spent 7/8ths of a Bus clock cycle waiting for that
synchronization point - hence no speed increase whatsoever.

Ah, but what about cases where there are two or more CPU cycles in a row that
don't use memory? In those cases, the CPU runs happily along at 8MHz, until
the next cycle that needs memory comes along. There are a number of cases
where there are two or more 'non-memory' cycles in a row, most notably the MUL
instruction, which only reads memory twice and then runs the next 9 cycles
internally. With this circuit, a MUL will execute in 4, possibly 3 (if you
clock the CPU at just over 8MHz) Bus cycles, instead of the stock 11 cycles -
that's over 3.6 times the speed! This is a best case scenario though. On
average, this circuit will only increase performance by a fairly small
percentage.

Method 4
--------

Surely, there must be some other way, you say? And yes, there are plenty of
other tricks you can use. That was my friend's idea; here is what my idea at
the time was (and we came up with these ideas independently, without knowledge
of the drawbacks of each other's idea).
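(As an aside, before getting into my idea: Method 3's Bus-cycle arithmetic can
be checked with a small simulation. This is purely my own model of the
halt-until-sync behaviour, not his actual circuit, and it counts
conservatively - letting the tail of an instruction overlap the next opcode
fetch is where a 'possibly 3' for MUL would come from.)

```python
import math

BUS = 1.78977  # MHz, the stock Bus rate

def bus_cycles(pattern, cpu_mhz):
    """Bus cycles consumed under the Method 3 halt-until-sync scheme.

    pattern: one char per CPU cycle, 'M' = memory access, 'I' = internal.
    Memory cycles must line up with the 1.79MHz Bus grid; internal cycles
    run freely at cpu_mhz.  (A simplified model, not the real circuit.)
    """
    t = 0.0  # elapsed time, measured in Bus cycles
    for c in pattern:
        if c == 'M':
            t = math.ceil(t - 1e-9)  # stall until the next Bus boundary...
            t += 1.0                 # ...then spend one full Bus cycle
        else:
            t += BUS / cpu_mhz       # internal cycle at the fast clock
    return math.ceil(t - 1e-9)

mul = 'MM' + 'I' * 9                    # MUL: 2 memory reads + 9 internal
print(bus_cycles(mul, 8.1))             # just over 8MHz: 4 Bus cycles
print(bus_cycles(mul, 1.78977))         # stock speed for comparison: 11
print(bus_cycles('MIM', 8.1))           # a lone internal cycle: 3, no gain
```

The 'MIM' case shows the hitch described above: a single non-memory cycle
between two memory cycles buys nothing, because all the time saved is spent
waiting to resynchronize with the Bus.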
My figuring was that the best way to boost performance would be to try to
sneak extra CPU cycles by the GIME without the GIME ever noticing what had
happened. (Sounds just like me, doesn't it?) This concept doesn't actually run
the CPU at a separate clock speed. The plan was that some extra circuitry
could analyse the signals coming out of the CPU and then decide when it could
get away with a sneaky 'burst' cycle without the rest of the motherboard
noticing.

The circuit is designed to run the CPU at the same old speed it usually ran at
(1.789MHz), except that the circuit knows when the CPU will not be using
memory (note the future tense) in the next CPU cycle. Every time this happens,
the new clock circuit takes over. It waits for a point near the end of the
1.789MHz Bus cycle, then bypasses the normal clock signals and starts feeding
a new signal to the CPU instead. The new signal is effectively pasted in the
place of the old one. The general gist of it is that the last bit of the
normal clock pulse gets cut short, after a small amount of time a very fast
extra clock pulse is generated (somewhere around 6 or 7MHz) and completed,
then it leaves a bit of blank time for the CPU to settle, and then restores
the normal clock operation after shaving a little bit of time off the
beginning of the next 1.789MHz Bus cycle. The effect? The CPU just executed
two CPU cycles in one Bus cycle and the GIME never saw it.

The advantage of this method is that it works even when there is only one
non-memory CPU cycle between two cycles that do use memory. This circuit can
make the CPU run at 4MHz even when it has to access memory half of the time.
In reality, the CPU uses memory more often than it does not, so the circuit
doesn't quite reach 4MHz performance levels. But it is noticeably more
effective than the previous method. But after constructing this circuit, I
realized I had other obstacles...

Method 5
--------

We are now in the realm of the theoretical.
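(Backing up to Method 4 for a second: here's a little Python sketch, again
purely my own model, of how the burst trick pays off on different cycle
patterns. As before, 'M' is a cycle that touches memory and 'I' is an internal
one; the circuit can squeeze at most one extra cycle into each Bus cycle.)

```python
def bus_cycles_burst(pattern):
    """Bus cycles under the Method 4 'burst' trick: whenever the NEXT CPU
    cycle won't touch memory, it gets pasted into the tail of the current
    Bus cycle as a fast extra pulse.  (A rough model, not the circuit.)"""
    bus = 0
    i = 0
    while i < len(pattern):
        bus += 1                     # current cycle occupies one Bus cycle...
        if i + 1 < len(pattern) and pattern[i + 1] == 'I':
            i += 2                   # ...and carries a free internal cycle
        else:
            i += 1
    return bus

print(bus_cycles_burst('MIM'))       # 2 - even a lone internal cycle helps
print(bus_cycles_burst('MIMIMIMI'))  # 4 - memory half the time: true 2x speed
print(bus_cycles_burst('MMMM'))      # 4 - all-memory code gains nothing
```

Compare the 'MIM' case against Method 3, where a lone internal cycle bought
nothing at all - that's exactly the advantage described above.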
The next ideas have never been built, but offer even better performance.

The problem with the previous method is, oddly enough, that the 6809 was
designed in a way that was meant to improve performance. Motorola's design has
the CPU read memory while it's decoding the opcode, in the expectation that
the value just read will be used in the future. This works in many cases, but
for certain operations, that value that was just read from memory gets
discarded because it was not needed. To make matters worse, when it completes
executing that opcode, the CPU then reads the SAME memory location again to
get the next opcode to execute. The CPU has just read one memory location
twice in a row! I was expecting better performance in my accelerator project,
and now I figured out why it wasn't running as fast as it should have - each
of those 'twice in a row' memory reads takes two CPU cycles, one of which
effectively becomes a wasted Bus cycle to my accelerator.

The solution? A one-byte cache for the 4MHz Project. Every memory location
that gets read by the CPU gets stored in this cache. Every time the CPU tries
to read the same memory location twice in a row, a new part of the circuit
would decide that this value is already stored in the cache, and enable the
'burst' cycle to the CPU while putting this value on the data Bus. This would
effectively eliminate all slowdowns caused by the CPU's inefficient usage of
the Bus.

Method 6
--------

Of course, even still, the project will not actually run the CPU at a true
double speed. In cases where the CPU actually needs to access memory many
times in a row, the circuit has to shut down and let the CPU access the Bus at
1.789MHz. How could we solve that problem? Well, there's also a way to cross
that obstacle. It just so happens that the GIME chip actually accesses memory
16 bits at a time.
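(Back to Method 5's one-byte cache for a moment: its behaviour works out to
something like this little Python sketch of mine. A 'hit' means the burst
cycle fires and no Bus cycle is spent.)

```python
class OneByteCache:
    """Method 5's one-byte cache (a sketch): remember the last address the
    CPU read, so the 6809's 'read the same byte twice in a row' pattern can
    be served by a burst cycle instead of a second Bus cycle."""

    def __init__(self):
        self.addr = None
        self.value = None

    def read(self, memory, addr):
        """Return (value, hit); hit=True means no Bus cycle was needed."""
        if addr == self.addr:
            return self.value, True          # same byte again: burst it
        self.addr, self.value = addr, memory[addr]
        return self.value, False             # real 1.79MHz Bus read

# The wasteful 'fetch, discard, fetch the same byte as an opcode' pattern:
mem = {0x100: 0x3F, 0x101: 0x12}
cache = OneByteCache()
print(cache.read(mem, 0x100))   # (63, False) - Bus read
print(cache.read(mem, 0x100))   # (63, True)  - the double read is now free
```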
That's how we can have those hi-res 16 color graphics while only needing to
access memory twice as fast as on a CoCo2. The GIME manages to pull in twice
as much data per memory access for video than the CoCo2 did. There's a way to
use this to your advantage. Every time the CPU reads something from memory,
the GIME actually reads a 16-bit value, but only feeds the CPU whichever 8
bits it wants. If we put a 16-bit cache on the data lines going from the
memory to the GIME, we can store the whole 16-bit value and then already have
the next memory byte ready for the CPU when it wants to read it during the
next Bus cycle. (90% of the time, the CPU reads memory sequentially, one byte
after the next.) Always having the second byte in advance, we can give the CPU
closer to true double speed operation while still retaining the old 1.789MHz
Bus speed. Since we're caching the full 16-bit value read from memory, it's
also possible to speed up the times when the CPU reads the same memory
location twice in a row, and even some cases where the CPU reads memory
locations in reverse order (if any?). With this idea, we can eliminate the
one-byte CPU cache from the previous method, and use this 16-bit memory cache
in its place to get even better speed increases.

Sounds good? There is one minor hitch that needs to be worked out, but I don't
think it should be too hard. The cache should be smart enough to know when the
CPU *isn't* actually reading memory. The problem would arise when the CPU
tries to read a 16-bit value from some hardware registers (let's say we're
reading the MMU bank values). Upon reading the first byte during one cycle,
the cache will try to feed the CPU the next byte in the next CPU cycle - but
the CPU isn't trying to read from memory, so it will be fed the wrong value.
One way of avoiding this would be to simply disable the cache when the CPU is
reading from the $FFxx page of memory (which is where all the hardware
registers are located).
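To make the idea concrete, here's a Python sketch of how the 16-bit cache
would decide between a Bus read and a burst, including the $FFxx bypass (my
own illustrative model; the real thing would be logic chips, not code):

```python
class WordCache:
    """Method 6 (a sketch): the GIME fetches 16 bits per memory access, so
    cache the whole word; the CPU's next sequential byte is then already on
    hand.  Reads in the $FFxx hardware page bypass the cache entirely."""

    def __init__(self):
        self.base = None   # even address of the cached 16-bit word
        self.word = None   # (byte at base, byte at base+1)

    def read(self, memory, addr):
        """Return (value, hit); hit=True means no Bus cycle was needed."""
        if addr >= 0xFF00:                    # hardware registers: no caching
            return memory[addr], False
        base = addr & ~1                      # word-align the access
        if base == self.base:
            return self.word[addr & 1], True  # already have it: burst cycle
        self.base = base
        self.word = (memory[base], memory[base + 1])
        return self.word[addr & 1], False     # real 1.79MHz Bus read

# Sequential code: every second byte comes for free.
mem = {0x100: 1, 0x101: 2, 0x102: 3, 0x103: 4}
cache = WordCache()
print([cache.read(mem, a)[1] for a in (0x100, 0x101, 0x102, 0x103)])
```

With purely sequential reads, half the accesses hit - which is where the
'closer to true double speed' figure comes from.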
Method 7
--------

Can we go even further? You must think this is getting ridiculous. Since we've
already incorporated a cache into the CoCo, it might be a good idea to take
better advantage of it. Why not throw in a whole 8K of cache? It gets a little
tricky when you have an MMU. The CPU only sees 64K at a time, and if you start
swapping different banks of memory in and out, the cache would get awfully
confused. One possibility is something that might sound a little strange at
first, but it works out quite nicely in the CoCo3's case. Sure, put in an 8K
cache, but make it 8K by 14 bits. The lower 8 bits of each cache entry hold
the 8-bit value from memory, and the upper 6 bits hold the number of the MMU
bank that this memory byte originated from. The address bits don't need to be
stored, because the address itself is the index into the cache.

Starting to become interesting? This cache will always automatically retain
the memory last used by the CPU, with no hardware design hassles. With
something like this, Method 3 in the text becomes extremely appealing. It
solves the problem of constantly waiting for synchronization with the Bus, and
gives us the possibility of running the CPU at TRUE 8MHz with only minimal
slowdowns in worst case scenarios! Thinking ahead, it would also be a good
idea to throw the extra 2 bits of data into the cache so that it could
properly handle 2Meg machines, making the cache an even 8K by 16 bits: 8 bits
for the data, and 8 bits for the MMU bank value.

The limitation of this idea is that the CPU can still only write to memory at
1.789MHz. All the speed increases are for memory reads only. But reading is
what the CPU does most of the time, so the CPU speed will still be drastically
increased.

Method 8
--------

More? No, I haven't gotten this far. But the next step would be to allow the
cache to buffer memory writes as well, hence removing the second-to-last
bottleneck from true accelerated CPU performance.
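Here's a Python sketch of the Method 7 cache organization (again my own
illustrative model): a direct-mapped cache indexed by the 13-bit offset within
an 8K bank, with the MMU bank number stored as the tag. Remapping a bank
through the MMU just makes the stale entries miss, with no hardware hassles.

```python
class MmuCache:
    """Method 7 (a sketch): an 8K direct-mapped cache.  Index = the 13-bit
    offset within an 8K MMU bank; each entry holds the data byte plus the
    6-bit MMU bank number it came from as a tag (8K x 14 bits in total)."""

    SIZE = 8192

    def __init__(self):
        self.tag = [None] * self.SIZE   # MMU bank number per entry
        self.data = [0] * self.SIZE

    def read(self, physical_memory, mmu, cpu_addr):
        """mmu: 8-entry table mapping the CPU's 8K slots to bank numbers.
        Return (value, hit); hit=True means full CPU-speed access."""
        bank = mmu[cpu_addr >> 13]      # which bank this 8K slot maps to
        index = cpu_addr & 0x1FFF       # offset in the bank = cache index
        if self.tag[index] == bank:
            return self.data[index], True             # hit: no Bus cycle
        value = physical_memory[(bank << 13) | index]  # miss: 1.79MHz read
        self.tag[index] = bank
        self.data[index] = value
        return value, False

# Bank swapping sorts itself out: remap a slot and the old entry misses.
ram = {(56 << 13) | 0x10: 0xAA, (48 << 13) | 0x10: 0xBB}
mmu = [56, 57, 58, 59, 60, 61, 62, 63]   # a typical 512K MMU setup
cache = MmuCache()
print(cache.read(ram, mmu, 0x0010))      # (170, False) - first read misses
print(cache.read(ram, mmu, 0x0010))      # (170, True)  - now full speed
mmu[0] = 48                              # swap a different bank into slot 0
print(cache.read(ram, mmu, 0x0010))      # (187, False) - tag mismatch, miss
```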
The very last bottleneck is the 1.789MHz Bus speed limitation. And that is
something to deal with another time...

John Kowalski (Sock Master)
http://users.axess.com/twilight/sock/