NanoOs

6-Feb-2025 - RV32I VM

I spent about a day working with Claude on a VM for the RV32I instruction set architecture. It was at least an order of magnitude simpler than trying to implement a VM for WASM. There are really only two memory segments to manage: Physical Memory and Mapped Memory. Of that, only physical memory was super critical. I do have to be able to handle both in the VM eventually, but most of what’s needed only requires physical memory. On top of that, there wound up only being 10 opcodes that I needed to support for the base implementation.

After doing a little research on how RV32I programs are organized, I realized that I was going to need a way to copy one virtual block of virtual memory to another. The reason for this was that, unlike the WASM programs, the standard base address of RV32I programs is 0x1000, not 0. So, I needed to copy the RV32I binary to address 0x1000 of the physical memory segment.

Because all virtual memory is backed by files, this meant that I needed a new function to copy the contents of a piece of one file into another file. Then I stopped to think about this… Why isn’t there a function to do this in the C standard? There are functions to read a file, write (and by extension, create) a file, rename (and by extension, move) a file, and delete a file. Why, in the 50+ years of the language’s existence, haven’t we added a function to copy contents of one file to another? Meh. Well, it doesn’t exist, so I had to create something.

This turned out to be a bit of a problem. In an application, it would be pretty straightforward: Declare a buffer one block in size for each file, read one block into the destination file’s buffer from the destination offset, read one block into the source file’s buffer from the source offset, copy the piece of the source you need into the appropriate location in the destination, then write the destination block back out. Lather, rinse, and repeat until you copy all the data you need.

In an embedded environment, this won’t work. I MIGHT have enough RAM to store two buffers, but I really can’t count on that. I can really only expect to have one buffer to work with. This means I have to use the same buffer for both reads and writes. That, in turn, means that I have to copy multiples of blocks and that the offsets have to be block aligned. This, however, is not really a big deal for what I’m doing. Since I’m copying an entire file, I just read a block from the source and then write it to the destination. If the length of the source file isn’t block aligned, it makes no difference. So, I made a function in the filesystem process to do just that. Making the wrapper function in the virtual memory library was trivial.

New virtual memory functionality in place, I wrote code to initialize the physical memory segment and copy a specified RV32I-compiled executable file into the appropriate offset of the physical memory segment. So far so good. At this point, I still had a little over 10% of the program flash available on the Nano Every. Time to start putting in handlers for instructions.

This was a job for Claude. I did not want to put all the case statements in place for this. I’ll spare you the details on this one, but it took some “coaching” on my part to get it to make something useful. It’s first take on it was to put every handler for every opcode and every subfucntion inline in one giant switch statement. There were multiple levels of switch statements. It also didn’t use defines or enums for any of the values, so everything used hardcoded magic numbers. Bad call.

Anyway, once I finally got all the opcode handler logic in place, it was time to compile it and see what happened. To my great surprise, it actually fit in the available space! Hooray!!! OK! Time to figure out how to make this thing do something useful! How about a “Hello world” program, Claude?

So, I asked Claude how an RV32I-compiled program would print out a string. Its answer was… less than desirable. The RV32I side it produced was fine. However, part of its answer was to extend one of the handlers that it had previously produced. It seems that in the case of the instruction that was needed here (SYSTEM ECALL) it had just returned rather than actually handling the instruction. So, I had to add in the piece that it had skipped over earlier. And, to probably nobody’s surprise, adding that code pushed the total binary size over the flash size.

But, not by much. Close enough that I thought I might be able to identify some functionality I could cut and get back within the limit. The first thing I did was remove all the code that was still related to the old way of running processes. Better, but not good enough. Then, I turned off debugging messages. Then, I commented out all the error messages in the SD library since it’s not really an issue any longer. Getting close! Then, I commented out all the prints that happen at boot up. Not only did that get me under the limit, it got me over 1 KB under the limit! Enough space to do some cleanup too! Cool!!!

Claude’s implementation for the RV32I_SYSCALL_WRITE handler was both suboptimal and flat out wouldn’t work in NanoOs. For one thing, it declared a 512-byte host-side buffer on the stack. Stacks are only 340 bytes. For another, it copied data from process memory to host memory by getting one byte at a time in a for loop. There’s a dedicated function in the virtual memory library for reading data out of a process’s virtual memory into host memory that’s much more efficient than that. So, I changed the handler to allocate a buffer of the needed size out of dynamic memory and use the virtualMemoryRead function to read from the process’s memory straight into the buffer. I hadn’t wanted to do that until I had reclaimed some flash storage because I figured it would probably take up more space to do it that way. I was right: It took 10 extra bytes.

OK, so now I needed to run the compiled file that was on the SD card. Good thing I reclaimed some extra space because I immediately ran into an issue. The scheduler is responsible for launching all user commands. However, this is done by sending a message to the scheduler from one of the shell processes. In this case, I needed to launch the process directly from the scheduler. You can think of this as the difference between launching a command from a shell and the kernel invoking the init command at bootup. Very different mechanisms involved here. So, I had to invent a way for the scheduler to launch a command on its own. And, of course, this consumed some flash space.

Alright, now we can launch the command at bootup. Does it run? No, of course not. Why would it work on the first try? Fine. What’s the issue, then? Well, the only way to figure this out is to put debug statements back in and see what’s going on.

The logic in the VM code that had been generated always incremented the program counter by four bytes (the size of a single instruction in RV32I). Flow control instructions could modify its value, of course, but increment by four bytes was the default. As a first step, I decided to print out the value of the PC register and the instruction at that address for each instruction that was being encountered. What I found did not amuse me.

What I saw was that the same four instructions were being executed over and over again. The program counter was continuing to increment by four bytes after every instruction, but it was reading the same values for bytes 16 to 31 as it was for bytes 0 to 15. It continued to read those same values for bytes 32 to 47, 48 to 63, and so on all the way up to byte 512. At that point, it read all 0s, failed to process the instruction, and effectively crashed. Super…

Something was very clearly wrong with the file-backed virtual memory system I had developed. Not surprising since this was the first time any of this had been tested. Unfortunately, there was a whole lot of insfrastructure code for it, so debugging it was definitely non-trivial.

I’m going to skip over the debug flow, but the issue turned out to not be in the VirtualMemory library itself. The issue was how the fseek function had been implemented. Claude had made a mistake when it had done the initial implementation. Rather than seeking to an exact byte in a file, it was seeking to the beginning of the corresponding file cluster. Since I now had the program binary and the stack in the same virtual memory space, it was seeking ahead to do stack operations, then back to get instructions from the binary. Both seeks were wrong. In both cases, it was reading and writing at the beginning of the cluster of the specified memory location. So, for program reads, it was reading the same data repeatedly and for stack writes, it was writing to the same location repeatedly.

OK, bug fixed, let’s run and see what happens. The first thing that happened was that the system filled the virtual memory file with 16 MB of 0s like it should have to start with. Cool! This brought about a problem, though. To keep things simple, I had created the root partition to only be 32 MB. Given the need for two virtual memory files (one to emulate program space and one to emulate mapped memory), this meant that the partition was only large enough to run a single process. OK, resize and reformat the partiton before we go any further so that I can add more than one set of virtual memory files.

Then, nothing worked. And, I mean nothing. Even the basic file reads that had always worked stopped working. What now?!?!?! This turned out to be a much more indideous problem and took much longer to debug. Again skipping the debugging details, the issue turned out to be integer type sizes. When Claude had put together the structure to represent a FAT partition, it followed the FAT16 metadata format and used types of the corresponding sizes. When I extended the partition, the format of the new filesystem used 65,536 byte clusters. When you try and put that value into a 16-bit unsigned integer, you get… 0x0000. This was being used in multiplications to compute the block to seek to. If you remember your 3rd grade multiplication tables, zero times anything is zero, so literally every read and write was going to block zero. Teriffic… So, I went through the FAT16 implementation and cast all the variables used in block calculations to uint32_t.

“But, James!” I hear someone interjecting, “You could just change all the types in the structure to uint32_t and not have to do all the explicit casts!”

Yeah, I tried that. You have to remember, though, that this is an 8-bit processor. Dealing with larger types requires more instructions to get the processor to manage the data correctly. If I change the types in the structure then ALL the operations that use those values requrie more instructions. If I just focus on the block calculations, then only those areas of the code require more instructions. Since program flash is now exremely limited, I couldn’t afford to change the types in the structures, so I had to deal with explicit casts.

Block calculations fixed, time to try running the binary again. Give it a shot and see what happens. Well, it does get past the 16th byte of the binary now… but… each instruction takes a full second to execute. sigh Back to injecting print statements to narrow down the culprit. Well…

Seems I wasn’t quite done with battling fseek bugs. The performance was terrible. Seeking forward 16 MB for a stack operation was what was taking so long. I took a look at the loop that moved the file forward and I was pretty sure I knew what the issue was. I thought about working on it myself, but then I thought, “No, Claude made this mistake. Give it a chance to fix it.” I uploaded the code to my Claude project and asked it for a faster implementation. It produced a new version which, on first glance, looked like it would probably be better. Put it in, gave it a shot and *POOF* things zipped along! But… I still wasn’t seeing the “Hello, world!” I was expecting… OK, put some more prints in and… Oh, look!!! It’s in an infinite loop!!!

At this point, I’ve consumed so much code space fixing bugs that I’m having a hard time getting debug messages to fit. I don’t have enough space to print out exactly what’s happening after the instruction has been decoded, all I can really print is the address being accessed and the instruction at that address.

OK, Claude, show me your stuff! I feed it the instructions I’m seeing. It tells me it looks like it’s in an infinte loop trying to determine the length of the string. OK, then. I track down the handler it’s in and print out the value of the register it’s evaluating through every pass of the loop. Good news: The value of the register is changing. Bad news: It’s not stopping when it reaches the terminating zero byte of the string.

Then ensues a very long debug process of trying to understand why it’s not stopping. It was a very long debug process because Claude made a mistake at the beginning of the process. It initially told me that the decoding of the comparison instruction was comparing register x15 to register x7. I spent a very long time analyzing register x7 (which held a value of 0) and trying to figure out why comparing that to register x15 (which also held a value of 0) wasn’t evaluating to equal. I eventually realized that Claude had incorrectly decoded the instruction and it was not comparing x15 to x7, it was comparing x15 to x0. In the RV32I instruction set architecture, register x0 is always supposed to be 0. The register value is, in fact, initialized to 0 at the beginning of the VM. Yet, somehow, this comparison was failng. So, I print out the value of register x0. It is not 0. So, something somewhere in this process wrote a non-zero value to register x0 and at the time of the comparison, the register value is non-zero, so the comparison always fails. OK… Move the initialization of the register from before the beginning of the loop to every iteration of the loop before the fetched instruction is executed.

I try again. And it goes through its instruction evaluation sequence, gets to the terminating zero byte, evaluates correctly and prints “Hello, world!” as expected! HOORAY!!! OK! Time to see what this really looks like without all the debugging noise. Take out all the debug prints, re-run, and it prints out “Hello, world!” again… in about 12 seconds.

Twelve… seconds. I couldn’t believe it. After all those bug fixes and optimizations, a simple “Hello, world!” program took 12 seconds to run. I put in a counter to see how many instructions it took to do that. 136 was the magic number. At 12 seconds to execute, that means my VM runs at about 11.3 Hz. Note the conspicuous lack of a letter in front of that “Hz”. The ATMEGA runs at 20 MHz, so I wasn’t expecting anything in the MHz range, but I thought I’d get something in the kHz range. We’re darn near single digits here.

So, it does work and seems to work reliably, but it’s impressively slow. There are a few things I might be able to do to improve performance, some more extreme than others. However, at the moment, none of them are possible because I’m now completely out of program flash space. I had to comment out the VM cleanup code in order to add the instruction counter because there was literally no more space for even that variable left. I’ll likely wind up commenting out a bunch of error and diagnostic messages to free up some space. I have some other ideas that might help too.

Anyway, here we are. The code for the VM is still in the rv32i-vm branch since it doesn’t currently have any way to run a shell and I don’t want to have something on dev or on main that doesn’t have a shell. I’m going to keep it there and continue to do optimization experiments to see if I can get something that runs in a more reasonable amount of time. We shall see. To be continued…

Table of Contents