Tuesday, 25 August 2020

GSoC - Final report

Phase 1 -

  • My initial focus for the project was to completely isolate two thread-stacks. As a prerequisite, this required using the ARMv7 MMU to isolate blocks of memory dynamically. During the first two weeks, I worked on providing support for dynamic memory isolation. Code sample.
  • The next part of the task involved providing proper context switch support for setting/unsetting memory attributes of the thread-stacks. This completed my first phase of GSoC.
  • These 4 blog posts provide an explanation of my work in phase 1.
  • The complete work for phase 1 can be viewed here.

Reflection on Phase 1 - Phase 1 started with a fair bit of skepticism on my side, as I had never done anything remotely close to memory isolation. This was compounded by the fact that I was unfamiliar with the RTEMS codebase. Nevertheless, my mentors gave me some valuable suggestions for making incremental progress. Once I had written some code that successfully provided basic memory isolation, I felt more sure about the direction my project was heading in. Another major challenge in this phase was writing the context switching code for protected stacks in assembly, as I did not have a lot of experience with ARM assembly. Here, the good old GNU debugger came to my help, as I checked the changes in all the register values after every instruction. This was a slow and tedious process, and it took a week to get the context switch code right even though the volume of code was not very large.


Phase 2 - 

  • The focus in phase 2 was to get an end-to-end working, optimized solution for strict thread-stack isolation and to provide basic support for thread-stack sharing.
  • A simple test, in which a thread tries to read from the stack of another thread while that thread has been switched out, was used to test this end-to-end solution.
  • This test laid bare some of the dormant problems with my implementation that had gone unnoticed. I discuss them and their solution in this post.
  • Up until this point, I had been using separate structures for tracking protected stacks, and memory for these structures was allocated dynamically. I integrated these structures into the already present Stack Control structure so that they can be allocated statically, based on the number of threads that are configured through the application.
  • The next part involved providing a naming mechanism for the protected thread-stacks. This was done so that the memory region can be identified when it is opened through shm_open().

Reflection on Phase 2 - For the first two weeks of this phase I was relatively unproductive, as I had to appear for my end semester examinations. Once they were over, I quickly started making progress on my plans. This is when I encountered a major hurdle in my project: I was getting fatal exceptions left, right, and center in my context switch code, although theoretically my mechanism for isolating thread stacks seemed sound. I discussed this on the mailing list and got some useful advice. As that did not solve the problem completely, I had to rethink and go through all of my assumptions from scratch to figure out exactly what was going wrong. Finally, I spotted some stupid assumptions that, in hindsight, were glaringly obvious. A detailed explanation of what was going wrong and my solution can be found here. On a positive note, going through everything all over again solidified some of my fundamentals related to memory isolation and how an MMU works in general.


Phase 3
  • After completing the basic support for thread-stack sharing, my focus in this phase was to support thread-stack sharing through high-level POSIX APIs. Another important goal for this phase was to optimize my code and make it merge-ready by using more of the structures that are already present in the RTEMS codebase.
  • I have discussed the rationale behind high-level design choices for stack-sharing here. The relevant code for stack-sharing can be found here.
  • The above changes had some flaws. They were not optimized enough and the mechanism for stack-naming was not integrated properly into the stack control structure. These patches moved my code a bit closer to merging, but some of the issues with proper integration into the stack control structure still remained.
  • I also worked on providing a mechanism for the conditional configuration of thread-stack protection. This lets the user enable thread-stack protection conditionally at build time. This option is based on the new build system of RTEMS.


Reflection on Phase 3 - By the time phase 3 started, I had managed to come up with a working, stable solution for thread-stack isolation and sharing, or at least so I thought. When I tested my code against test cases in which thread dispatching was being done, my solution broke. After discussing it on the mailing list, I realized the mistake and was able to resolve the issue. This phase primarily consisted of refactoring my code to make it mergeable and smoothing out the rough edges where thread-stack isolation was failing.

Final work -  

Future Work - There are primarily two important things that need to be taken care of before this work becomes merge-ready - 
  • Currently, the mechanism for isolating thread-stacks requires setting/unsetting the memory attributes of the regions corresponding to the thread-stack. This can be optimized by changing the page-table base during each context switch.
  • The deletion of threads when their life-cycle finishes also needs to be handled.

Steps to re-create the current work -    

  • Clone the Final_release branch from my repo.
  • Since the configuration option is based on the new build system, the standard 'make' option is not compatible with this feature. Refer to Sebastian Huber's rtems-docs repo, chapter 7, for the basic setup of the new build system.
  • The important point to keep in mind is to set the RTEMS_THREAD_STACK_PROTECTION option to True in my repo's config.ini file before the './waf configure' command.
  • Currently, thread-stack isolation and sharing work only for the arm/realview_pbx_a9 BSP.
  • Simulation has been done on QEMU.
  • For a demonstration of thread-stack isolation, refer to the thread_stack_protection test in the testsuite. For the thread-stack sharing mechanism, refer to the thread_stack_sharing test.

Saturday, 22 August 2020

High level design and implementation of thread-stack sharing

In this post, we will discuss the high-level design for sharing thread stacks. Our focus will be to make the design as POSIX compliant as possible. But first - 

Why do we need to share thread stacks?

There are certain operations in RTEMS in which a thread reads from or writes to the stack of another thread. This includes IPC mechanisms such as message queues. In fact, all blocking reads (sockets, files, etc.) read/write to the stack of a different thread. Now if we have completely isolated thread stacks from each other, these valid operations will cause fatal exceptions whenever they read/write to the stack of a different thread. Hence we need to share thread stacks to enable these operations.

The mechanism for sharing thread-stacks - In the last two posts, we discussed the strict-isolation of thread-stacks. We saw that when a thread is executing, it only has access to its stack and the global data. This is made possible by unsetting (set to NO-ACCESS) the memory attributes of the previous thread during a context-switch.

Now if we want our target thread to have access to the stack of a given thread, we need to set the memory entries of the thread-stack we want to share in the 'context' of the target thread. There are a couple of important things to consider here - 

  1. The thread-stack that will be shared may have a memory access permission different from its intrinsic permission, i.e. if the thread-stack has R/W permission in its own 'context', it is possible that its access permission while shared is read-only.
  2. We need to keep track of all the memory regions associated with a thread, along with their access permissions. This is important because we need to set/unset all these memory regions during each context switch/restoration (set with the proper access permission).
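
To make the two points above concrete, here is a minimal host-side sketch in C. It is only a simulation under stated assumptions: the types `region_t` and `thread_regions_t`, the access levels, and the `sim_*` functions are hypothetical names invented for this illustration, and the "MMU" is just an array with one entry per 4K page. It is not the code from my RTEMS branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u
#define SIM_PAGES     16u
#define MAX_REGIONS   4u

/* Hypothetical access levels; real attribute values are MMU specific. */
typedef enum { ACCESS_NONE, ACCESS_READ_ONLY, ACCESS_READ_WRITE } access_t;

/* One memory region visible to a thread, with the permission it holds. */
typedef struct {
  uintptr_t base;
  size_t    size;
  access_t  access;
} region_t;

/* All regions tracked for one thread: its own stack plus shared stacks. */
typedef struct {
  region_t regions[MAX_REGIONS];
  size_t   count;
} thread_regions_t;

/* Simulated page attributes, one entry per 4K page (all ACCESS_NONE). */
access_t sim_page_access[SIM_PAGES];

void sim_set_access(uintptr_t base, size_t size, access_t access)
{
  assert(base % SIM_PAGE_SIZE == 0 && size % SIM_PAGE_SIZE == 0);
  assert((base + size) / SIM_PAGE_SIZE <= SIM_PAGES);
  for (uintptr_t p = base / SIM_PAGE_SIZE; p < (base + size) / SIM_PAGE_SIZE; ++p) {
    sim_page_access[p] = access;
  }
}

access_t sim_get_access(uintptr_t addr)
{
  return sim_page_access[addr / SIM_PAGE_SIZE];
}

/* On a switch, unset every region of the outgoing thread, then set every
 * region of the incoming thread with the permission recorded for it. */
void switch_regions(const thread_regions_t *prev, const thread_regions_t *next)
{
  for (size_t i = 0; i < prev->count; ++i) {
    sim_set_access(prev->regions[i].base, prev->regions[i].size, ACCESS_NONE);
  }
  for (size_t i = 0; i < next->count; ++i) {
    sim_set_access(next->regions[i].base, next->regions[i].size,
                   next->regions[i].access);
  }
}
```

Unsetting the outgoing thread first and setting the incoming thread second means that a stack shared read-only with the incoming thread ends up read-only, not inaccessible.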

Determining a POSIX compliant way of sharing thread-stacks - Since sharing stacks is, at its core, a mapping operation, the obvious call for sharing stacks is mmap(). The problem is that mmap() usually maps a file into the address space of the currently executing process, whereas in our case we need to map a memory region to a thread of our choice. To do this, we need to build our mmap operation around a few other calls. This can be achieved by the following sequence of calls -
  1. Get the file descriptor of the memory to be shared by opening a shared memory object through shm_open. Here we provide the access permission of the memory region to be shared. We also provide a fixed pattern of naming to the object (More on this in the next section).
  2. Make a call to ftruncate that sets the file size to the size of the stack, so that the shared memory object handler points to the stack address.
  3. Now we share this file with the target thread by making a call to mmap(). Here it is important to understand the various parameters we need to pass to mmap() for a successful mmap operation. This call is usually defined as mmap( void* addr, size_t length, int prot, int flags, int fd, off_t offset ). For our operations we need to do the following - 
      - addr - We pass the address of the target thread stack to indicate the thread with which we want to share the memory region, i.e. if we want to share the stack space of thread T2 with thread T1, we pass the stack address of T1.
      - length - This is the stack size of the sharing thread, i.e. the thread whose stack is being shared.
      - prot - This is the memory access attribute of the region. We have four options - 
          PROT_EXEC Pages may be executed.
          PROT_READ Pages may be read.
          PROT_WRITE Pages may be written.
          PROT_NONE Pages may not be accessed.
      - flags - For the stack sharing operation we necessarily need to provide the MAP_SHARED option.
      - fd - This will be the file descriptor of the shared memory object we discussed above.
      - offset - Since we want to share the complete stack space, we keep the offset at zero.
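
The sequence above can be sketched in C as follows. This is only an illustration of the call order using the standard POSIX signatures; the helper name `share_stack_sketch` is hypothetical and is not part of RTEMS or of my branch. The stack-sharing meaning of `addr` and of the object name is specific to the design described in this post, so on a normal POSIX host this code simply creates and maps an ordinary shared memory object. As a simplification, the object is always opened with O_RDWR (ftruncate needs a writable descriptor on a host) and the requested permission is applied through `prot` only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wraps shm_open -> ftruncate -> mmap.
 * name        - name of the shared memory object
 * target_addr - in the design of this post, the stack address of the
 *               thread we share with; on a host it is only a hint
 * stack_size  - size of the stack being shared
 * prot        - PROT_READ, PROT_WRITE, ... for the shared view
 * Returns the mapped address, or MAP_FAILED on error. */
void *share_stack_sketch(const char *name, void *target_addr,
                         size_t stack_size, int prot)
{
  assert(name != NULL);

  /* Step 1: open (or create) the shared memory object. */
  int fd = shm_open(name, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
  if (fd < 0) {
    return MAP_FAILED;
  }

  /* Step 2: size the object to the size of the stack. */
  if (ftruncate(fd, (off_t) stack_size) != 0) {
    close(fd);
    return MAP_FAILED;
  }

  /* Step 3: map the whole object, shared, starting at offset zero. */
  void *mapped = mmap(target_addr, stack_size, prot, MAP_SHARED, fd, 0);

  /* POSIX keeps the mapping valid after the descriptor is closed. */
  close(fd);
  return mapped;
}
```

A read-only share would pass PROT_READ as `prot`, while a writable share would pass PROT_READ | PROT_WRITE.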

Application requirements for sharing thread stacks-

  The following are some of the requirements that an application writer has to follow for sharing thread stacks - 

  1. Naming for shared memory objects is done in the application and the name follows a fixed naming pattern ( "/taskfs/" ). This is used to differentiate between a normal mmap operation and a stack sharing operation.
  2. We need to explicitly allocate stack memory from the application for stack sharing, and then set it on the thread through pthread_attr_setstack*().
  3. This one is a possible improvement that has not been integrated yet - Any application has to go through a series of repetitive steps (shm_open, ftruncate, mmap) for sharing a particular thread-stack. Maybe this can be wrapped in a function ( rtems_share_stack() ? ) so that we only call that function every time we have to share a thread stack.
  4. For an example of how this is done refer to this test application.
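
For requirement 2, a minimal sketch of allocating the stack in the application and handing it to a thread with pthread_attr_setstack() could look like the block below. It uses only standard POSIX calls and runs on a regular host; the names `run_thread_on_own_stack` and `demo_thread_body`, the 256K stack size, and the use of posix_memalign() as the allocator are assumptions made for this illustration, not what the test application does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define DEMO_STACK_SIZE  (256u * 1024u) /* assumed size, above typical minimums */
#define DEMO_STACK_ALIGN 4096u          /* 4K page alignment */

/* Thread body: only records that it ran on the provided stack. */
static void *demo_thread_body(void *arg)
{
  *(int *) arg = 1;
  return NULL;
}

/* Allocates a page-aligned stack in the application, sets it through
 * pthread_attr_setstack() and runs one thread on it.
 * Returns 1 if the thread ran, -1 on any setup error. */
int run_thread_on_own_stack(void)
{
  void          *stack = NULL;
  pthread_attr_t attr;
  pthread_t      thread;
  int            ran = 0;

  if (posix_memalign(&stack, DEMO_STACK_ALIGN, DEMO_STACK_SIZE) != 0) {
    return -1;
  }
  if (pthread_attr_init(&attr) != 0) {
    free(stack);
    return -1;
  }
  if (pthread_attr_setstack(&attr, stack, DEMO_STACK_SIZE) != 0 ||
      pthread_create(&thread, &attr, demo_thread_body, &ran) != 0) {
    pthread_attr_destroy(&attr);
    free(stack);
    return -1;
  }

  int rc = pthread_join(thread, NULL);
  assert(rc == 0);
  (void) rc;

  pthread_attr_destroy(&attr);
  free(stack); /* safe only after the thread has been joined */
  return ran;
}
```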

Wednesday, 19 August 2020

Thread-stack isolation-v2

In the previous post, we discussed a primitive mechanism for isolating thread-stacks. That mechanism has some inherent flaws, which will be discussed along with their solutions in this post.

Broadly there are two flaws with the previous implementation - 

Memory entries being set for 1 MB sections - The ARMv7 MMU implementation for changing the memory entries is defined for 1 MB sections, and this causes issues. Suppose we have two thread stacks T1 and T2. If the application writer does not explicitly state the stack size, RTEMS allocates 8K bytes to a stack. Now, on switching from T1 to T2, we unset the memory entries of T1 and set those of T2. The problem is that we are actually unsetting the memory attributes of the entire 1 MB section, which may contain global R/W data that is used by T2. This will cause unnecessary fatal exceptions whenever we try to access global data from T2.

Solution - The solution to the problem is pretty simple, but as we will see, the implementation poses some subtle problems. We should set/unset the memory entries for only those regions that contain the thread stack, i.e. if we have stacks of size 8K then we should set/unset the memory entries of these regions only. This requires finer-grained control, so we need multilevel page tables (2 levels for 4K pages). RTEMS, in fact, provides support for 2-level page tables.
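
As a small worked example of why the granularity matters, the sketch below computes how many bytes actually change attributes when we set/unset an 8K stack with 1 MB sections versus 4K pages. The function names and the stack address used in the usage note are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION_SIZE    0x100000u /* 1 MB section */
#define SMALL_PAGE_SIZE 0x1000u   /* 4K small page */

/* Round an address down to a multiple of granule (a power of two). */
uintptr_t align_down(uintptr_t addr, uintptr_t granule)
{
  assert(granule != 0 && (granule & (granule - 1u)) == 0);
  return addr & ~(granule - 1u);
}

/* Round an address up to a multiple of granule (a power of two). */
uintptr_t align_up(uintptr_t addr, uintptr_t granule)
{
  return align_down(addr + granule - 1u, granule);
}

/* Bytes whose attributes change when [begin, begin + size) is set/unset
 * and the MMU can only change whole granules. */
size_t affected_bytes(uintptr_t begin, size_t size, uintptr_t granule)
{
  return (size_t) (align_up(begin + size, granule) - align_down(begin, granule));
}
```

For an 8K stack at 0xfbf8000, `affected_bytes(0xfbf8000, 0x2000, SECTION_SIZE)` gives 0x100000 (the whole 1 MB section), while `affected_bytes(0xfbf8000, 0x2000, SMALL_PAGE_SIZE)` gives 0x2000 (just the stack).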
The problem lies in the fact that, for the Xilinx-zynq BSP, the translation table base is set at 0x100000 by the linker script and extends up to 0x104000 for section-based pages (16K in size). For small pages, however, it will extend up to 0x504000 (4 MB + 16K in size). This will possibly conflict with other data regions (.text, .bss, etc.) that are placed in this address space, and setting up the translation table for small pages will fail. This is a BSP-specific problem and depends on how the linker script sets up the address space of a particular BSP. We will thus have to change the linker script to place the translation table in an address range where it does not conflict with other memory regions. We can actually take help from (or switch to?) the realview_pbx_a9 BSP, which already supports 4K pages, to modify the linker script according to our needs. Here is a snippet - 


Tailoring our linker script according to the above snippet solves our problem and now we can set up translation tables for 4K pages. Now we can set/unset memory entries for our thread stacks without worrying about other memory regions, or maybe not 😏?
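
As a sanity check on the translation table addresses quoted earlier in this section, the arithmetic can be written out in C. The sketch assumes the ARMv7-A short-descriptor format (4096 first-level entries of 4 bytes, each covering 1 MB, and 256 second-level entries of 4 bytes per table, each covering 4K) and assumes that every first-level entry gets its own second-level table. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define L1_ENTRIES           4096u /* one entry per 1 MB of a 4 GB space */
#define L2_ENTRIES_PER_TABLE 256u  /* one entry per 4K of a 1 MB section */
#define ENTRY_SIZE           4u    /* bytes per descriptor */

/* End address of the translation tables placed at base.
 * small_pages == 0: first-level table only (1 MB sections).
 * small_pages != 0: first-level table plus one second-level table for
 *                   every first-level entry (4K pages everywhere). */
uint32_t translation_table_end(uint32_t base, int small_pages)
{
  uint32_t l1_size = L1_ENTRIES * ENTRY_SIZE;                        /* 0x4000 */
  uint32_t l2_size = L1_ENTRIES * L2_ENTRIES_PER_TABLE * ENTRY_SIZE; /* 0x400000 */

  assert(base % l1_size == 0); /* first-level table is 16K aligned */
  return base + l1_size + (small_pages ? l2_size : 0u);
}
```

With the base at 0x100000 this yields 0x104000 for sections and 0x504000 for small pages, i.e. 16K versus 4 MB + 16K.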

Allocated stacks are not page-aligned - As discussed in the previous post, we use a custom stack allocator, defined in the application, to allocate thread stacks from the workspace and set the memory entries of the stack. The stacks allocated from the workspace are not page-aligned, where we consider 4K pages. In practice, this means that the stack address is, for example, 0xfbf9b70 instead of 0xfbf9000. How is this a problem for us?

When we set the memory entries for 4K pages the entries are set per page, i.e. we have entry E1 for 0xfbf9000-0xfbfa000 and entry E2 for 0xfbfa000-0xfbfb000. Now when we get stack addresses from the workspace, it is possible that we have a stack S1 that ranges from 0xfbf9b70 to 0xfbfbb70 (8K in size) and a stack S2 that ranges from 0xfbf7b60 to 0xfbf9b60. So when we unset the memory entries of S1 (which begins at 0xfbf9b70) during a context switch and set the entries of S2 (which ends at 0xfbf9b60), we end up setting the memory entries for the entire 0xfbf9000-0xfbfa000 page (as entries are set per page). This leaves a part of the stack S1 still mapped in and we do not achieve perfect stack isolation.
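
The overlap can be checked with a few lines of C. The sketch below rounds each stack out to the 4K pages that cover it and tests whether the two page ranges intersect; the function names are hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000u

/* First page boundary at or below addr. */
uintptr_t page_floor(uintptr_t addr)
{
  return addr & ~(uintptr_t) (PAGE_SIZE_4K - 1u);
}

/* First page boundary at or above addr. */
uintptr_t page_ceil(uintptr_t addr)
{
  return page_floor(addr + PAGE_SIZE_4K - 1u);
}

/* Returns 1 if stacks [b1, e1) and [b2, e2) have at least one 4K page
 * in common once each is rounded out to whole pages, 0 otherwise. */
int stacks_share_page(uintptr_t b1, uintptr_t e1, uintptr_t b2, uintptr_t e2)
{
  assert(b1 < e1 && b2 < e2);
  return page_floor(b1) < page_ceil(e2) && page_floor(b2) < page_ceil(e1);
}
```

With the addresses from the text, S1 (0xfbf9b70 to 0xfbfbb70) and S2 (0xfbf7b60 to 0xfbf9b60) both touch the page starting at 0xfbf9000, so the function returns 1. If both stacks started on page boundaries (for example 0xfbf9000 and 0xfbf7000, each 8K long), it would return 0.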

Solution - Since the memory entries are set per page, if we allocate page-aligned stacks we will be able to set/unset the memory entries of exactly the required region. In RTEMS we can allocate memory with a given byte alignment using Heap_Allocate_aligned_with_boundary(). We set the alignment to 4096 as we want 4K-aligned addresses. Note that this allocation is done in the custom stack allocator.
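
The idea can be tried out on a host with the sketch below. Note that it swaps in the standard posix_memalign() call as a stand-in for the RTEMS heap call named above, so it only demonstrates the 4096-byte alignment, not the actual custom stack allocator. The function name `allocate_stack_sketch` is hypothetical, and rounding the size up to a whole number of pages is my own assumption, added so that the end of the stack also falls on a page boundary.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define STACK_PAGE_SIZE 4096u

/* Hypothetical stand-in for a custom stack allocator: returns a stack
 * that starts on a 4K boundary and spans a whole number of 4K pages,
 * or NULL if the allocation fails. Release it with free(). */
void *allocate_stack_sketch(size_t stack_size)
{
  assert(stack_size > 0);

  /* Round the size up so the stack ends on a page boundary too. */
  size_t rounded = (stack_size + STACK_PAGE_SIZE - 1u)
                   & ~(size_t) (STACK_PAGE_SIZE - 1u);

  void *stack = NULL;
  if (posix_memalign(&stack, STACK_PAGE_SIZE, rounded) != 0) {
    return NULL;
  }
  return stack;
}
```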