I started working on scheduler enhancements immediately after my last post. This has turned out to be a lot more trouble than I had anticipated.
The scheduler had been a true round-robin scheduler. NanoOs supports eight (8) concurrent processes, one of which is the scheduler itself, so the scheduler simply held an array of the other seven (7) processes and resumed each one in turn. It did no evaluation of any of the processes' states; it simply called resume on each of them. (If a process isn't resumable, the resume function simply exits early.)
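To make the starting point concrete, here's a minimal model of that round-robin pass. The names (`Process`, `processResume`, `runSchedulerPass`) and the structure are my illustrative stand-ins based on the description above, not the actual NanoOs identifiers:

```c
#include <stdbool.h>

/* Minimal model of the original round-robin pass. One of the eight
 * slots is the scheduler itself, so only seven processes are cycled. */
#define NUM_PROCESSES 8

typedef struct Process {
  bool resumable;
  int resumeCount;
} Process;

/* Resume is a no-op for a non-resumable process, mirroring the
 * early-exit behavior described in the post. */
void processResume(Process *p) {
  if (!p->resumable) {
    return;
  }
  p->resumeCount++;
}

/* One pass: resume each of the seven non-scheduler processes in turn,
 * with no evaluation of their states. */
void runSchedulerPass(Process processes[NUM_PROCESSES - 1]) {
  for (int i = 0; i < NUM_PROCESSES - 1; i++) {
    processResume(&processes[i]);
  }
}
```

The simplicity is the appeal: no bookkeeping at all, at the cost of resuming processes that have nothing to do.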
My goal had been to move to a system with four (4) queues: A ready queue, a free queue, a waiting queue, and a timed-waiting queue. Processes on the waiting queue would be processes that were waiting on a mutex or a condition with an infinite timeout. Processes on the timed-waiting queue would be processes that were waiting on a mutex or a condition with a defined timeout. Processes on the waiting queue would stay there until signalled, but processes on the timed-waiting queue would have to be checked each pass through the scheduler (until I start using hardware timers, which is a future item).
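The per-pass check of the timed-waiting queue might look something like the sketch below. The array-backed queue, the tick-based timeout, and all the names here are assumptions for illustration, not the planned NanoOs implementation:

```c
#include <stdint.h>

/* Illustrative fixed-capacity timed-waiting queue. In a real scheduler
 * the expired entries would be pushed onto the ready queue. */
#define QUEUE_CAPACITY 8

typedef struct TimedProcess {
  int pid;
  uint32_t wakeTime;  /* tick at which the timeout expires */
} TimedProcess;

typedef struct TimedQueue {
  TimedProcess items[QUEUE_CAPACITY];
  int length;
} TimedQueue;

/* Each pass through the scheduler: move every process whose timeout
 * has expired out of the timed-waiting queue. Returns the number
 * moved and writes their PIDs (now ready to run) into readyPids. */
int expireTimedWaits(TimedQueue *queue, uint32_t now, int readyPids[]) {
  int moved = 0;
  int kept = 0;
  for (int i = 0; i < queue->length; i++) {
    if (queue->items[i].wakeTime <= now) {
      readyPids[moved++] = queue->items[i].pid;
    } else {
      queue->items[kept++] = queue->items[i];
    }
  }
  queue->length = kept;
  return moved;
}
```

With a hardware timer, this polling loop would be replaced by an interrupt firing at the earliest `wakeTime`.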
Prior to my last post, I modified my coroutines library to allow for callbacks that are called when a mutex is unlocked or a condition is signalled. I intended to put processes on the waiting or timed-waiting queues when they blocked waiting for a mutex or condition and to remove them from those queues and put them back onto the ready queue when a callback was called. I haven’t gotten there yet. Just breaking the round-robin array up into the ready and free queues has been fraught with problems.
In order to do this, I had to develop a new SchedulerState object. Previously, I just held an array of all the process metadata as a file-local variable in the Processes library. This was ugly. I put the array and the four (4) new queues in the SchedulerState object and declared an instance of that from within the function that held the scheduler loop. That immediately resulted in crashes and hangs. The extra data on the stack overflowed the scheduler’s stack and corrupted the state of the process after that. So, I had to move some data around. I also had to factor out the part that ran the main loop from the part that did the variable initialization as both were becoming increasingly complex.
One of the things that processes have to do to communicate with each other is to look up another process's Coroutine pointer by its process ID. When the process metadata was an array in the Processes library, the lookup function simply did a bounds check on the process ID and returned the pointer from the corresponding index in the array. Since this information was now a local variable in the scheduler process, I added an inter-process command to the scheduler to look up the pointer and made the function send the command to the scheduler and wait for the reply.
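The array-based version of the lookup is about as small as a function gets; here's a sketch of the shape described above (the `Coroutine` type and function name are stand-ins, not the real NanoOs API):

```c
#include <stddef.h>

/* Illustrative stand-in for the real Coroutine structure. */
#define NUM_PROCESSES 8

typedef struct Coroutine {
  int id;
} Coroutine;

/* File-local array of per-process metadata, as in the original design. */
Coroutine coroutines[NUM_PROCESSES];

/* Bounds-check the process ID, then return the pointer directly from
 * the array: no message passing, no waiting, no extra stack usage. */
Coroutine* getCoroutineByPid(int pid) {
  if ((pid < 0) || (pid >= NUM_PROCESSES)) {
    return NULL;
  }
  return &coroutines[pid];
}
```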
This was a mistake. Adding the extra local variables and all the lower-level calls into the Coroutines library caused a stack overflow in the console process, which corrupted one of the other processes' state and resulted in nondeterministic behavior. This is the peril of using stack segmentation to achieve multitasking. It took me about two days to figure out what was happening.
Once I figured out the issue, though, I realized that I had a good test case for detecting coroutine state corruption. I added guard member elements to the Coroutine structure at the top and bottom of the structure and set them to well-known values. Then, I added a check in the coroutineResume function to ensure that the values remained intact before trying to use the state in the object.
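The guard technique looks roughly like this sketch. The guard constant, the field names, and the placeholder state members are my guesses at the pattern, not the actual Coroutine layout:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Well-known value written into the guards at initialization. A stack
 * overflow that runs into the structure will almost certainly clobber
 * one of them. */
#define GUARD_VALUE 0x5a5a5a5aU

typedef struct Coroutine {
  uint32_t guardTop;     /* guard at the top of the structure */
  void *stackPointer;    /* stand-in for the real saved state */
  int state;
  uint32_t guardBottom;  /* guard at the bottom of the structure */
} Coroutine;

void coroutineInit(Coroutine *c) {
  memset(c, 0, sizeof(*c));
  c->guardTop = GUARD_VALUE;
  c->guardBottom = GUARD_VALUE;
}

/* The check that would run at the top of coroutineResume before
 * trusting any of the saved state. */
bool coroutineIsCorrupted(const Coroutine *c) {
  return (c->guardTop != GUARD_VALUE) || (c->guardBottom != GUARD_VALUE);
}
```

This is the classic stack-canary idea applied to the coroutine state itself: cheap to check on every resume, and it turns silent corruption into an immediate, visible failure.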
At that point, I realized two things: (1) I was going to have to revert back to the array lookup instead of message passing and (2) I was going to have to manage the stack of the console process better. Among the things held in the console's stack are the buffers used for reading and writing. Each buffer was 96 characters plus some state, which meant that the start of the console's stack was already almost 400 bytes into the 512-byte stack. Add on command handlers and I was bumping up against the limit. The good news is that the coroutine corruption detection worked perfectly to detect when the stack was overflowing. So, I reduced each buffer to 80 characters to stay under the 512-byte limit.
So, then it was back to working on the queues. Since I had been working on this for so long at this point, I decided to scale back on my goals and just go for two queues: The ready queue and the free queue. The basic implementation of this worked fine. Initially, all processes were loaded onto the ready queue. (Some of the objects were loaded with a dummy process initially just to get them created and into the queue.) The scheduler popped a process from the ready queue and resumed it. It then checked the state of the process when it returned from being resumed. If it was still running, it was put back on the ready queue and, if not, it was put on the free queue. When a new user command was started, a process was popped from the free queue and used for the new command.
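One step of that two-queue loop can be sketched as follows. The queue representation, the state enum, and all the names are simplified stand-ins for illustration; the real scheduler resumes an actual coroutine where the comment indicates:

```c
/* Simplified two-queue model: PIDs in fixed-capacity FIFO queues. */
#define NUM_SLOTS 8

typedef enum { STATE_RUNNING, STATE_FINISHED } ProcState;

typedef struct Queue {
  int pids[NUM_SLOTS];
  int length;
} Queue;

void queuePush(Queue *q, int pid) {
  q->pids[q->length++] = pid;
}

int queuePop(Queue *q) {
  int pid = q->pids[0];
  for (int i = 1; i < q->length; i++) {
    q->pids[i - 1] = q->pids[i];
  }
  q->length--;
  return pid;
}

/* One scheduler step: pop a process from the ready queue, resume it,
 * then route it back to ready (still running) or over to the free
 * queue (completed), as described in the post. */
void scheduleOne(Queue *ready, Queue *freeQueue, const ProcState states[]) {
  if (ready->length == 0) {
    return;
  }
  int pid = queuePop(ready);
  /* coroutineResume() would actually run the process here; states[]
   * stands in for the post-resume state check. */
  if (states[pid] == STATE_RUNNING) {
    queuePush(ready, pid);
  } else {
    queuePush(freeQueue, pid);
  }
}
```

Starting a new user command is then just the mirror image: pop a slot from the free queue and push it onto the ready queue.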
There was one curveball, though: Killing a process. Among other things, killing a process requires removing the process from somewhere on the ready queue, but not necessarily the head of it (and, in fact, most likely not at the head of it). So, to do this, I needed a function to remove an arbitrary process from a queue.
It turned out that my first implementation of this function had a bug in it, although I didn't realize it at the time. It resulted in some… highly-undesirable behavior in some cases. Specifically, what I was seeing was that when all process slots were filled and I killed the last one, the shell would hang. Putting prints in the code, I discovered that the shell didn't just hang, it completely stopped executing. After about a day of banging my head on my desk (and, to be fair, fixing a few other bugs along the way), I realized that when I called the function that removed a process from the ready queue, it removed the wrong process. In this particular scenario, it removed the shell process instead of the target process. SIGH
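A correct arbitrary-removal function hinges on matching the target's identity rather than just manipulating the head. Here's one way it can be done, assuming a singly-linked-list queue; the list representation and names are my assumptions, not the actual NanoOs queue code:

```c
#include <stddef.h>

/* Illustrative singly-linked queue node keyed by process ID. */
typedef struct QueueNode {
  int pid;
  struct QueueNode *next;
} QueueNode;

/* Unlink and return the node with the given pid, wherever it sits in
 * the queue (head, middle, or tail). Returns NULL and leaves the queue
 * untouched if the pid isn't present. The pointer-to-pointer walk means
 * the head and interior cases need no special-casing. */
QueueNode* queueRemovePid(QueueNode **head, int pid) {
  QueueNode **link = head;
  while (*link != NULL) {
    if ((*link)->pid == pid) {
      QueueNode *removed = *link;
      *link = removed->next;
      removed->next = NULL;
      return removed;
    }
    link = &(*link)->next;
  }
  return NULL;
}
```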
So, bug fixed, I could start putting everything back together. I moved the SchedulerState object back to the scheduler's stack. I tried to move the static messages array onto it too, but that caused a stack overflow. (Fortunately, my corruption check detected that situation and immediately alerted me to it!) I wasn't willing to move the entire array out, so I compromised and reduced the array size by one message. That allowed it to fit on the stack without an overflow and conserved the amount of RAM available for dynamic memory. HOORAY!!!
I did lose quite a bit of dynamic memory with the enhancements to the coroutines library. All the extra member elements of the Coroutine structure multiplied by the number of processes took their toll on the available RAM. I'll have to do something about that in the future because I now only have about 650 bytes of dynamic memory to work with. I want to have at least 1 KB.
And, of course, I still need to finish the rest of the queues and working with the callbacks. Once I get that far, I will also have the ability to make the user processes preemptive instead of cooperative. I really want to be able to get that far because I don’t want user processes to have to worry about the internal mechanics of process management.
That, in turn, will get me closer to POSIX compliance. I’ve decided to make that a target for this system. I have some ideas on how I can get closer. After the scheduler enhancements and converting user processes to being preemptive, the next step will be to add the ability to run binaries that are on external storage. I’ve already ordered and received a MicroSD card reader and I have some ideas for what I can do to run a binary that’s located on that storage. I don’t know how much I can really accomplish in the amount of RAM that’s left, and I’m definitely going to have to reclaim some in order to do much. I’m also starting to push the limit of how much flash storage is available, so I’ll have to plan that carefully as well.
So, as always, progress, but still much more to do. On to the next steps!!! To be continued…