NanoOs

2-Mar-2025 - User Space Shell?

Time to make a shell for user space! Well, actually, I needed to make a little more than that. I also needed an init process to manage user logins. Regardless, though, I needed a lot of infrastructure and I needed it both in kernel space and in user space.

The first thing I started with was making a proper strlen function for my “Hello, world!” program rather than calculating the length inline in the _start function. This turned out to be a bigger problem than I thought it would be for a variety of reasons. The first reason was that the linker put the strlen function at the beginning of the binary instead of the _start function, so my VM was crashing. After I got that resolved, the program did run correctly. It was significantly slower than before due to the extra function call and the different memory alignment, but it was still over 1 kHz, so I’m not going to complain.

Then, I decided to get crafty. Since all the registers in the VM are 32-bit values, I thought it would probably take fewer instructions to do a version of strlen that used the 0x01010101 and 0x80808080 bit manipulation logic to evaluate 32 bits at a time instead of just a character at a time. And I was right! It took about a quarter of the total instructions (as one would expect). However, this version of strlen came with two significant downsides: (1) It took almost twice as long to run as the single-character version and (2) it didn’t result in “Hello, world!” actually being printed to the console.
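For reference, here is a minimal sketch of the well-known word-at-a-time zero-byte test that those two constants come from. It is not the code from this project: the name strlen_word is made up for illustration, and the sketch assumes that reading a few bytes past the terminator within an aligned 4-byte word is harmless, which holds for a flat VM memory but is not guaranteed by the C standard.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch of the technique, not the project's strlen. */
size_t strlen_word(const char *s)
{
    const char *p = s;

    /* Walk byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0') {
            return (size_t)(p - s);
        }
        p++;
    }

    /* Examine four bytes per iteration. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w); /* aligned 4-byte load */

        /* This expression is nonzero exactly when some byte of w is 0x00. */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0) {
            break;
        }
        p += 4;
    }

    /* The terminator is inside this word; find the exact byte. */
    while (*p != '\0') {
        p++;
    }
    return (size_t)(p - s);
}
```

Each loop iteration costs one 32-bit subtract, one NOT, and two ANDs in the VM, and every one of those is emulated on an 8-bit host, which is consistent with the slowdown described next.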

The reason for the performance penalty was obvious when I thought about it. The Nano’s processor is an 8-bit chip with an 8-bit bus and 8-bit registers. On top of that, the algorithm was no longer doing a simple comparison against a value of 0; it was now doing bit manipulation logic. There is simply no way for an 8-bit processor to do all that work as efficiently as it would compare a value to 0. Because the performance penalty was so high, the algorithm was flat-out unacceptable to use in my libc implementation. Consequently, there was no reason for me to debug why the string wasn’t being printed. So, I abandoned the logic and went back to the simple algorithm.

The next step was to STOP! I needed to create a proper build environment before I went any further with this. For one thing, I was getting increasingly nervous about the need to keep all the commands straight for building a program. For another, I was about to enter the realm of building libraries and I definitely did not want to start that work without having an automated way to build. So, I put makefiles in place for the work I’d done so far and some pre-enablements for the work that was coming.

Makefiles in place, it was time to start putting in more pieces of a libc implementation. I pulled the print logic out into an fputs function and pulled the exit logic out into an exit function. Compiled everything and gave it a shot. It was a little slower because of the increased number of instructions, but it ran just fine. So far so good.

Then, I thought I would go one step further and pull the write logic out into a formal fwrite function and make fputs call fwrite. For this, I decided to go all the way and actually make a for loop that would properly print everything. In order to do that, I had to multiply the size and nmemb parameters to the function to get a total length and then break the total length into chunks that could be managed by the OS. This turned out to be a problem. Skipping over the debugging details, the issue turned out to be that multiplication and division are not considered to be part of the base RV32I instruction set. Those instructions are considered to be the RV32M extension.
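The shape of that loop can be sketched as follows. Everything here is an assumption made for illustration: MAX_CHUNK is a hypothetical per-call limit, fake_sys_write stands in for the real write system call by appending to an in-memory buffer, and sketch_fwrite drops the FILE pointer that the real fwrite takes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNK 8u /* hypothetical limit on bytes per system call */

/* Stand-in for the real system call: appends to an in-memory "console". */
static char console[64];
static size_t console_len = 0;
static unsigned write_calls = 0;

static size_t fake_sys_write(const char *buf, size_t len)
{
    if (len > sizeof console - console_len) {
        len = sizeof console - console_len; /* short write when full */
    }
    memcpy(console + console_len, buf, len);
    console_len += len;
    write_calls++;
    return len;
}

/* fwrite-like sketch: returns the number of complete members written. */
size_t sketch_fwrite(const void *ptr, size_t size, size_t nmemb)
{
    const char *bytes = (const char *)ptr;
    size_t total;
    size_t written;

    if (size == 0 || nmemb == 0 || nmemb > SIZE_MAX / size) {
        return 0; /* nothing to do, or size * nmemb would overflow */
    }
    total = size * nmemb; /* needs a multiply */

    for (written = 0; written < total; ) {
        size_t chunk = total - written;
        size_t n;

        if (chunk > MAX_CHUNK) {
            chunk = MAX_CHUNK;
        }
        n = fake_sys_write(bytes + written, chunk);
        written += n;
        if (n < chunk) {
            break; /* short write: stop and report what was written */
        }
    }
    return written / size; /* needs a divide */
}
```

The multiply and the divide are the two operations that fall outside base RV32I, which is where the trouble described above comes from.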

One option would be to implement the logic in the libc implementation in terms of addition and subtraction. This was just not acceptable to me, however. The performance penalty for that would make any application unusable. I can live without floating point operation support, but multiplication and division instructions are a must.

So, back to Claude to see what it would take to implement support for those operations. As I figured, it was more code than I wanted, but not a horrible amount. I added the suggested code to the OS and it did fit in the available storage space but it used most of it.
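As a rough idea of what that support involves, here is a sketch of the eight RV32M operations as a single function. It is not the code that went into the OS. The function name is hypothetical, and the sketch assumes the caller has already checked the opcode and funct7 fields and passes only funct3 plus the two source register values. The divide-by-zero and overflow results follow the RISC-V specification, which defines them instead of trapping.

```c
#include <stdint.h>

/* Hypothetical helper: computes the result of one RV32M instruction.
 * Assumes the usual two's complement conversion between uint32_t and int32_t. */
uint32_t rv32m_execute(uint32_t funct3, uint32_t rs1, uint32_t rs2)
{
    int32_t s1 = (int32_t)rs1;
    int32_t s2 = (int32_t)rs2;

    switch (funct3 & 7u) {
    case 0: /* MUL: low 32 bits of the product */
        return rs1 * rs2;
    case 1: /* MULH: high 32 bits, signed x signed */
        return (uint32_t)((uint64_t)((int64_t)s1 * (int64_t)s2) >> 32);
    case 2: /* MULHSU: high 32 bits, signed x unsigned */
        return (uint32_t)((uint64_t)((int64_t)s1 * (int64_t)rs2) >> 32);
    case 3: /* MULHU: high 32 bits, unsigned x unsigned */
        return (uint32_t)(((uint64_t)rs1 * (uint64_t)rs2) >> 32);
    case 4: /* DIV */
        if (rs2 == 0) {
            return 0xFFFFFFFFu;          /* divide by zero yields -1 */
        }
        if (s1 == INT32_MIN && s2 == -1) {
            return (uint32_t)INT32_MIN;  /* overflow yields the dividend */
        }
        return (uint32_t)(s1 / s2);
    case 5: /* DIVU */
        return (rs2 == 0) ? 0xFFFFFFFFu : rs1 / rs2;
    case 6: /* REM */
        if (rs2 == 0) {
            return rs1;                  /* remainder by zero yields the dividend */
        }
        if (s1 == INT32_MIN && s2 == -1) {
            return 0;                    /* overflow yields 0 */
        }
        return (uint32_t)(s1 % s2);
    default: /* 7: REMU */
        return (rs2 == 0) ? rs1 : rs1 % rs2;
    }
}
```

The 64-bit intermediates are what make this expensive on an 8-bit host, so a space-constrained implementation might reasonably handle the high-half multiplies differently.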

Multiplication and division in place, I gave compilation another shot. It worked. Ran the program and that worked too. So far so good. However, at this point, the “Hello, world!” program was taking 277 milliseconds to run 249 instructions with all the function call overhead. I wanted to see if I could bring the total execution time down.

My idea for how to do that was to make the standard C calls static inline functions. That would (a) avoid a lot of stack I/O and (b) keep the code logic closer together. Size of program binaries is basically irrelevant with this system. If all the logic is inlined, that’s totally fine.

I restructured my program so that all the standard C calls were static inline functions, rebuilt and re-ran. And guess what… it crashed. After a little debugging, I discovered that making the library calls inline had caused the linker to put the _start function somewhere in the middle of the binary again. REALLY?! OK, back to Claude to see how I can get gcc to stop putting the start symbol in the wrong place. It recommended adding a few attributes to the _start function. That got it to put the code in the right place (FINALLY!) but the program was still crashing.

After another round of debugging, it turned out that the decoder for an immediate offset wasn’t sign extending values correctly. So, when certain negative values were used for jump instructions, it was jumping to the wrong address. After a little coaxing with Claude, I got it to come up with the correct decoder and was finally able to run the program. To my dismay, however, the inlined version of the program used the same number of instructions and actually took a few more milliseconds to run. The only reason it would take more time is if the loops were bouncing back and forth between virtual memory segments in the inlined version. That’s just happenstance, but it meant that there was absolutely nothing to be gained from the inlined code.
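To make the jump case concrete, here is a sketch of decoding the 21-bit immediate of a JAL instruction (the J-type format) with sign extension. The function names are hypothetical and this is not the decoder from the VM; the bit positions are the ones given in the RISC-V specification.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of value to a full 32-bit signed integer.
 * Assumes the usual two's complement conversion from uint32_t to int32_t. */
static int32_t sign_extend(uint32_t value, unsigned bits)
{
    uint32_t sign = 1u << (bits - 1);

    value &= (sign << 1) - 1u;            /* keep only the low `bits` bits */
    return (int32_t)((value ^ sign) - sign);
}

/* Decode the J-type immediate. The fields are scattered in the instruction:
 * inst[31] = imm[20], inst[30:21] = imm[10:1],
 * inst[20] = imm[11], inst[19:12] = imm[19:12]. Bit 0 is always zero. */
int32_t decode_j_imm(uint32_t inst)
{
    uint32_t imm = (((inst >> 31) & 0x1u)   << 20)
                 | (((inst >> 21) & 0x3FFu) << 1)
                 | (((inst >> 20) & 0x1u)   << 11)
                 | (((inst >> 12) & 0xFFu)  << 12);

    return sign_extend(imm, 21);
}
```

For example, 0xFFDFF06F encodes a jump of -4 bytes. Without the sign extension step, the same bits would be read as +2097148, which sends the program counter far forward instead of four bytes back.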

Then, I realized something: I hadn’t turned on any compiler optimizations yet. I turned on -O2 with the non-inlined version of the code and got it to run with 85 instructions in 37 milliseconds. Then, I recompiled with the inlined version and got it to run with 74 instructions in 22 milliseconds. So, the inlined code did make a difference. OK! I have a development strategy now: I’m going to write a header-only implementation of the standard C calls.

There is at least one place, however, that cannot be in a header: The implementation of the _start function. From a standards perspective, this is not really a big deal since that function is just a convention and isn’t part of the standard. So, I’ll have at least one “library” where support infrastructure will have to live. Not a huge deal.

What was a huge deal, though, was the amount of code space that I was now consuming on the Nano. With all the bug fixes, I was now down to exactly three (3) bytes of program flash left. I still needed to support some additional system calls for things like reading input into the program and I literally had no space for logic to do that. Time to free up more space.

The easiest way to do this was to just delete error messages. I will be the first to admit this is generally not a good idea. However, (a) the areas that I needed to extend were unrelated to the error messages I deleted and (b) the places I deleted messages from had already been shown to be working fairly reliably by this point. So, out they go! That reclaimed about 1.5 KB. Enough to start with.

Then began a debug session. My original problem was that my fwrite calls were producing mangled output on the user terminal. I hadn’t implemented a printf function in user space yet but I needed to see how many characters it thought it was printing. I came up with a very simple solution: Print a variable that holds a single byte that contains the character '0' plus the number of characters written. Since I was dealing with fairly small messages, this shouldn’t have been a big deal.
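The trick amounts to something like the sketch below. The function names are made up for illustration, and it assumes an ASCII character set, so counts above 9 come out as the characters that follow '9' (':' for 10, ';' for 11, and so on) and can still be decoded by subtracting '0'.

```c
#include <stddef.h>

/* Encode a small count as one printable byte: '0' + count.
 * Counts too large to stay printable are clamped to '~'. */
char count_to_marker(size_t count)
{
    if (count > (size_t)('~' - '0')) {
        return '~';
    }
    return (char)('0' + count);
}

/* Recover the count from the byte that was printed. */
size_t marker_to_count(char marker)
{
    return (size_t)(marker - '0');
}
```

A 14-character message would show up as '>' on the terminal, which decodes back to 14.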

However, it turned out to be an enormous deal. Writing a value to that variable was corrupting memory. Skipping over a lot of detail and an entire day of debugging, the problem turned out to be the way an immediate value was being parsed from instructions: It wasn’t being properly sign-extended. So, rather than adding a negative value to a base address, it was adding a positive value and coming up with I-don’t-know-what to manipulate.
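A store to a stack variable is a typical place for this to bite, because compilers usually address locals at negative offsets from a base register. The sketch below decodes the 12-bit immediate of a store instruction (the S-type format) and computes the target address. The function names are hypothetical and this is not the VM's actual code.

```c
#include <stdint.h>

/* Decode the S-type immediate: inst[31:25] = imm[11:5], inst[11:7] = imm[4:0].
 * Assumes the usual two's complement conversion from uint32_t to int32_t. */
int32_t decode_s_imm(uint32_t inst)
{
    uint32_t imm = (((inst >> 25) & 0x7Fu) << 5)
                 | ((inst >> 7) & 0x1Fu);

    if (imm & 0x800u) {        /* bit 11 is the sign bit */
        imm |= 0xFFFFF000u;    /* copy it into bits 31..12 */
    }
    return (int32_t)imm;
}

/* Effective address of a store: base register value plus signed offset. */
uint32_t store_address(uint32_t base, uint32_t inst)
{
    return base + (uint32_t)decode_s_imm(inst);
}
```

Take 0xFEA42623, which encodes sw a0,-20(s0). With a base of 0x1000 the store should land at 0x0FEC. If the offset is left zero-extended, it is read as +4076 and the store lands at 0x1FEC instead, overwriting whatever happens to live there.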

I had to wonder at this point if the guys who wrote the first version of UNIX had this problem. At the moment, there are always three places that a bug can be: My user space implementation, my kernel space implementation, or my VM implementation. Did they have to wonder if bugs they were seeing could be in the PDP-7 hardware they were using or did they have high confidence in it and could focus on just their kernel and user code? At the moment I don’t know.

Skipping over about 5 days’ worth of additional debugging of VM + OS + libc, I was eventually able to get an init process that read a username and password and printed "Login success!" if they matched and "Login failure!" if not. This, however, is not enough. A successful login needs to spawn a command shell. For that, I needed a system call that would run a command line.

And, this is where things became intractable. The infrastructure to run a command has been in place for a long time; however, adding the system call handler to do it from user space pushed the OS more than 300 bytes over the 48 KB program storage limit. There are no more strings to remove to get me space quickly. The only way I have of reclaiming space at this point is to refactor code and/or remove functionality.

During this work, I’ve also been reading a lot about other, non-UNIX-like OS architectures. I realized that I made an architectural mistake in the context of the current direction of the OS. My goal has always been to support as much of POSIX as possible. When I started, though, I was thinking about supporting POSIX in the context of the kernel processes. I had that line of thinking at the time because I didn’t see any path to being able to run arbitrary processes from a filesystem. Once the VM path proved viable, however, the separation of user space and kernel space emerged. POSIX is a user space specification. It says nothing about kernel space. If I really intend to support POSIX in a user-space VM, my focus in the kernel should be to enable that in the way that makes the most sense for user space. The consequence of trying to make the kernel space conform to POSIX is that I now have a lot of code that’s really unnecessary.

So, I have a problem now. I won’t merge my dev branch to main without a viable shell and that’s simply not possible with the kernel the way it is right now. If it’s possible to fix it, it will take serious restructuring of the kernel. That effort is likely to eliminate the possibility of kernel processes entirely. Meanwhile, what I have on main is a multitasking OS with the condition that the code for all the processes has to reside in program storage and not on an SD card filesystem.

I effectively have two competing architectures and directions now. The OS on main is really suited for an embedded environment. The OS on dev is headed in a direction that makes more sense in an environment with more resources but can’t be finished in the environment I’m working in. I don’t want to lose either one. That said, it doesn’t really make sense to keep the one on dev labeled as “dev” if I’m not going to continue to develop it with intent to merge it to main.

I think what needs to happen at this point is that I need to have two branches and I may need to fork the repo. What’s on main needs to remain focused on embedded environments and what’s on dev needs to be in a position that can be extended to a more robust operating system in the future if time, resources, and desire allow.

Right now, time definitely does not allow. I have an upcoming obligation that’s going to constrain the amount of time that I have to dedicate to this effort. While I will undoubtedly come back to this a short time later (because I’m a bit of a code-writing addict), I’ll have to mothball this for the time being.

So, I’ll organize the branches into a meaningful configuration and maybe fork the repo (not sure on that part) and then come back to this again in the future. We’ll see what happens after that. To (eventually) be continued…
