To recap, there were four things I was thinking about to improve the performance of the VM in the general case.
I started with the last one since I already knew one thing to do there. I put the optimization that skips seeking backward when the target cluster and the current cluster are the same back into the codebase. Obviously, this has no effect with the current program, but there’s no reason not to have it in place now. It also consumed virtually zero program space, so I just went ahead and put it back in.
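Conceptually, the check is as simple as it sounds. Here’s a rough sketch of the shape of it, with made-up struct and function names rather than the actual driver code:

```cpp
#include <stdint.h>

// Hypothetical sketch of the seek short-circuit; the struct and function
// names are illustrative, not the actual FAT driver's API.
struct FatFile {
    uint32_t position;      // current byte offset within the file
    uint32_t cluster_size;  // bytes per cluster for this volume
};

int fat_seek_slow(FatFile *f, uint32_t target);  // full seek: walks the cluster chain

int fat_seek(FatFile *f, uint32_t target) {
    // If the target lands in the same cluster we're already in, there's no
    // need to rewind to the start of the chain and walk the FAT forward again.
    if (target / f->cluster_size == f->position / f->cluster_size) {
        f->position = target;
        return 0;
    }
    return fat_seek_slow(f, target);  // otherwise do the full (possibly backward) seek
}
```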
I took a look at the second item next. I recalled that the driver I had put together with Claude had initialized the SPI to slow speed during card initialization, but I didn’t recall that it had ever increased the speed before leaving the initialization function. I took a look at the code and confirmed this was the case. The base speed it ran at was 400 kHz. So, I changed the code to reconfigure the SPI interface to 8 MHz before completing the SD card init function. I then reconfigured the VM to use the screwy 144-byte cache buffer again to see what the performance difference would be. Result: About a 12.5% performance improvement. Not bad. I also checked what the performance difference was with the 256-byte cache. Result: About a 7.5% performance improvement. The best case performance benchmark was now 2.267 kHz.
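For reference, the fix itself was tiny: just reconfigure the clock once the card is up. The shape of it is roughly this, with sd_spi_set_clock() and the other names standing in for the real driver functions (they are not its actual API):

```cpp
#include <stdint.h>

// Hypothetical sketch; these declarations are stand-ins for the real driver.
void    sd_spi_set_clock(uint32_t hz);
uint8_t sd_send_reset_and_wait_ready(void);

static const uint32_t SD_SPI_INIT_HZ = 400000UL;   // cards must be initialized at <= 400 kHz
static const uint32_t SD_SPI_FAST_HZ = 8000000UL;  // full-speed clock for normal transfers

uint8_t sd_card_init(void) {
    sd_spi_set_clock(SD_SPI_INIT_HZ);      // slow clock while the card comes up
    if (!sd_send_reset_and_wait_ready()) {
        return 0;                          // card never left the idle state
    }
    sd_spi_set_clock(SD_SPI_FAST_HZ);      // the missing step: speed back up before returning
    return 1;
}
```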
Implementing the third item would help achieve the first, so it was obviously the next thing to consider. Here, however, there was a problem. The binary format I had at this point was just a raw blob with the instructions at the beginning and the data at the end. There was no metadata to indicate where the instructions ended and the data began. If I had left the binary in ELF format, I would obviously have had the metadata I needed. This, however, was out of the question due to the limited binary space I had remaining. There was simply no room to implement even a partial ELF parser and loader. So, to fully implement this piece, I was going to have to come up with some kind of file format that would provide the information needed.
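To be clear about how little metadata that actually is: a couple of sizes would cover it. Something like this purely hypothetical header would be enough (this is not a format I had designed or committed to, just an illustration):

```cpp
#include <stdint.h>

// Purely hypothetical header, to show how little metadata is required.
struct BinHeader {
    uint16_t magic;      // marks the file as one of these binaries
    uint16_t text_size;  // bytes of instructions immediately after the header
    uint16_t data_size;  // bytes of initialized data following the instructions
};
```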
But, it made absolutely no sense to go about implementing such a format unless I had some confidence that doing the separation was going to provide the kind of performance I needed. I had to have a way to test this. Thankfully, I had an idea on this one. Since I knew that declaring a 144-byte cache buffer basically split the binary into half, I figured that I could just declare two virtual memory segments with buffers of that size and use one for the instruction access and one for the data access. Also, since I knew that the code segment of my “Hello, world!” program was 128 bytes long, I could cheat and do some simple, hard-coded math to determine which segment to use for the data access.
The first thing to do was to create and initialize a fourth virtual memory segment for data access. The cool thing about this is that it could use the exact same file as the program segment. I just had two file pointers open to the same file. This also meant that I only had to do a single copy of the original binary to the code space. I renamed the existing “physical” memory segment to “program” memory and created a new “data” memory segment. I then updated the address and segment lookup function to use the data memory segment instead of the program memory segment if the address was beyond the end of the program memory. Result: 94 millisecond runtime or about 1.447 kHz. Significantly slower than the 60 millisecond runtime when the whole program was in RAM, but still way over 1 kHz, so I was happy.
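The lookup change itself amounted to a one-line decision. A sketch of the cheat, with illustrative names and the hard-coded 128-byte boundary:

```cpp
#include <stdint.h>

// Hypothetical sketch of the hard-coded segment lookup; names are illustrative.
// 128 is the known length of the "Hello, world!" code segment.
static const uint16_t PROGRAM_END = 128;

struct VirtMem;  // virtual memory segment type, defined elsewhere

struct VmState {
    VirtMem *program_mem;  // one file pointer into the binary
    VirtMem *data_mem;     // a second file pointer open on the same file
};

VirtMem *segment_for(VmState *vm, uint16_t addr) {
    // Anything past the end of the code goes through the data segment's
    // cache buffer instead of the program segment's.
    return (addr < PROGRAM_END) ? vm->program_mem : vm->data_mem;
}
```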
I should note here that the increase in time had nothing to do with the fact that there were now two virtual memory segments being used. The issue was that the program segment straddled the 144-byte boundary. That meant that the second part of the program had to be loaded to complete the execution.
This, however, did not make a lot of sense to me. I started factoring out initial reads into the VM init function so that the start of memory would already be loaded before the main program execution began. As I did that, I saw total program execution time drop by about 5 milliseconds at a time. I eventually got the main program logic execution time down to 44 milliseconds (over 3 kHz!!!). So, why was a single read in the middle of program execution adding tens of milliseconds to the total execution time?
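For what it’s worth, “factoring out” here just means priming the cache during initialization. Roughly this, with virtmem_read() standing in for whatever the real cache-fill routine is:

```cpp
#include <stdint.h>

// Hypothetical sketch; virtmem_read() and the segment type are stand-ins
// for the real virtual memory API.
struct VirtMem;
uint8_t virtmem_read(VirtMem *seg, uint16_t addr);

void vm_prime_caches(VirtMem *program_mem) {
    // Called at the end of VM init: touching address 0 pulls the first
    // chunk of the program into the cache buffer before execution starts,
    // so the first instruction fetch doesn't pay for an SD card read.
    virtmem_read(program_mem, 0);
}
```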
I decided to do an experiment and changed the program memory buffer to 64 bytes, thereby forcing it to re-read both halves of the program every iteration of the loop. Result: 891 millisecond runtime. Relative to the 44 millisecond runtime, this was an increase of 847 milliseconds. There are 15 characters in the string (including the newline and terminating byte), and each iteration forced two reads (one for each half of the program), so that works out to about 28 milliseconds per read. That’s over 5X the cost of a read as measured by moving the initial read from the program execution to the VM initialization. Why the difference?
Then, it hit me: The way the virtual memory library was implemented, it always wrote the buffer out to the file if the buffer had valid bytes in it. This was wrong. There’s no reason to write the buffer if it hasn’t changed. But nothing was currently tracking whether or not the buffer had been modified, so I needed a new variable in the virtual memory state object to keep track of that. That was undesirable, though; I wanted to keep the virtual memory state as small as possible. But we were only talking about an extra bit here, so I borrowed one from the buffer size. Then I realized that the buffer size was a uint16_t. There was no reason for it to be that big, so I changed it to a uint8_t and just used 7 bits for the size. Then I changed the logic in the code to store one less than the size passed into the constructor and always add one to the value before using it. That effectively allowed for a buffer of up to 128 bytes.
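The packed byte and the write-avoidance check end up looking roughly like this (field and function names are mine for illustration; the real state object has more in it than this):

```cpp
#include <stdint.h>

// Hypothetical sketch of the packed size/dirty byte; names are illustrative.
struct VirtMemState {
    uint8_t size_and_dirty;  // bit 7: buffer modified, bits 0-6: buffer size - 1
    // ... file handle, buffer base address, buffer pointer, etc.
};

static inline uint8_t buf_size(const VirtMemState *s) {
    return (uint8_t)((s->size_and_dirty & 0x7F) + 1);  // stored as size - 1, so up to 128
}

static inline void mark_dirty(VirtMemState *s)  { s->size_and_dirty |= 0x80; }  // on every buffered write
static inline void clear_dirty(VirtMemState *s) { s->size_and_dirty &= 0x7F; }  // after a flush

// The actual fix: only write the buffer back to the SD card if something
// in it changed since it was loaded.
void flush_if_dirty(VirtMemState *s) {
    if (s->size_and_dirty & 0x80) {
        // ... write buf_size(s) bytes back to the file here ...
        clear_dirty(s);
    }
}
```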
OK, code changed, time to see how it worked without all the unnecessary writes with a 64-byte buffer. Result: 175 millisecond runtime or about 777 Hz. Definitely better, but also definitely short of my goal of keeping the speed above 1 kHz. Then, I changed the buffer back to 128 bytes. Result: 19 millisecond runtime or about 7.15 kHz. WOW!!!
So, a 128-byte program buffer is the clear winner here. With the current data structure I can’t make it any larger than that anyway, and anything bigger would consume an unacceptable amount of dynamic memory. With a 128-byte buffer, I suspect the system will be able to average 1 kHz or better.
Memory is a concern, though. At one point in my experiments, I set the size of both the program buffer and the data buffer to 256 bytes. Result: The VM failed to initialize. Basically, I ran out of dynamic memory. I realized that because of all the files being opened now, there was actually quite a lot of dynamic memory in use. I had also made the virtual memory stack buffer 64 bytes. I thought about the possibility of declaring some of the buffers on the virtual machine process’s stack, but that brought up another issue: How big was the VM state now? I put in a print and discovered the state was now consuming 295 bytes of the 340 bytes of process stack space. Too much. Stacks really need around 100 bytes of headroom in order to not overflow into their neighbors, and 45 bytes was not going to cut it.
I’m going to skip all the details of the things I changed to reclaim space because there were several very tedious revisions that I had to go through to fine-tune things. The end result was that I got the state down to 267 bytes, including the new buffers that avoid putting all of the virtual memory buffers in dynamic memory. Still not what I would like, but livable.
Fun side note since this project aims to recreate early UNIX: I went and looked up the clock speed of the PDP-7, which was the system used to develop the first version of UNIX. It was about 571 kHz. So my little VM is now running at a little better than 1% of the speed of the first UNIX system. Since the “kernel” space is running at 20 MHz (albeit on an 8-bit processor), I have to wonder what the effective speed of my system is relative to the original. At the moment, I have no way of doing any kind of analysis to get a good answer, so it’s more of just a curiosity.
I am going to have to do something about dynamic memory usage. In addition to the 267 bytes the virtual memory object takes up on a virtual machine process’s stack, initializing it also consumes an additional 372 bytes of dynamic memory. 160 bytes (plus memory management metadata, which is probably another 10 bytes) of that is taken up by the cache buffers, but I don’t understand right now what else is consuming so much dynamic memory. I’ll have to see if I can reduce that overhead. Alternatively or in conjunction, I could sacrifice another process slot. However, I’m not willing to do that unless it’s effectively a kernel process that gets sacrificed. I really want to have the ability to run at least 4 user processes in parallel.
I’m also going to have to come up with a usable executable format now that I’ve proven that having two virtual memory objects to manage the accesses works well. There are several things that are going to make that difficult, not the least of which is that I’m running out of code space again due to all these performance enhancements for the VM.
So, right now, there are three things I’m trying to balance: (1) OS binary size, (2) memory usage, and (3) execution speed. They are all affecting each other at this point. It’s getting increasingly tricky to keep (1) and (2) as small as possible while keeping (3) as high as possible. Such is the world of embedded programming, though! This is the kind of challenge that makes the work worth doing. Let’s see what I’m able to do. To be continued…