
Hardware Overlay Management for Data Intensive, Ultra-Low Power Edge Devices

Rich Collins, Product Marketing Manager, Synopsys

Introduction

Memory management techniques have been critical to processor architectures for many years. The processor’s physical address space defines the range of addresses to memory (RAM) that physically exists within the system. Memory management dynamically allocates portions of the physical memory to a process and frees it for reuse by other processes when not needed.

Virtual addressing separates the memory addresses used by these processes from the physical memory addresses, allowing the virtual address space to be larger than the physical memory space. A memory management unit (MMU) maps virtual page numbers to physical page numbers in main memory, so that memory a process is not currently using can be “paged” or “swapped” out to secondary storage (Figure 1).


Figure 1: Virtual to physical address translation
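The translation in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not a model of any particular MMU: the 4 KB page size, the `page_table` contents, and the `translate` helper are all assumptions chosen for the example.

```python
PAGE_SIZE = 4096                          # assumed page size (4 KB)
OFFSET_BITS = PAGE_SIZE.bit_length() - 1  # 12 offset bits for 4 KB pages

# Hypothetical page table: virtual page number -> physical page number.
page_table = {0x00010: 0x002, 0x00011: 0x007}


def translate(vaddr: int) -> int:
    """Translate a virtual address to a physical address."""
    vpn = vaddr >> OFFSET_BITS            # upper bits select the page
    offset = vaddr & (PAGE_SIZE - 1)      # lower bits pass through unchanged
    if vpn not in page_table:
        raise LookupError(f"page fault: virtual page {vpn:#x} is not mapped")
    return (page_table[vpn] << OFFSET_BITS) | offset
```

Only the page number is replaced; the offset within the page is the same in the virtual and the physical address.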

Most low-power embedded (and deeply embedded) applications do not need to leverage a rich operating system such as Linux. These applications typically run on “bare-metal” (no operating system) or under a real-time operating system (RTOS). These options do not require the virtual to physical translation provided by an architected MMU. Synopsys’ DesignWare® ARC® EM Processor IP is typically used in deeply embedded applications running an RTOS.

However, there are cases where virtual to physical address translation can help increase performance, such as when a large code base resides in slow secondary memory. Processes can then be paged into faster, smaller on-chip memory called page RAM (PRAM). Consider a system that runs all code as a single process (one Process ID, or PID) and uses a large virtual address space with a one-to-one correspondence between virtual addresses and a large selected area of secondary memory (such as flash memory or DRAM). In such a system, address translation can be used to detect when a section of code (one or more pages) is resident in the PRAM and to provide the physical address of the page in the PRAM.
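The single-process scheme described above can be sketched as follows. This is a simplified software model under stated assumptions, not the ARC EM implementation: the `Overlay` class, its method names, the 4 KB page size, and the four-page PRAM are all hypothetical, and eviction is left out (a full PRAM simply raises an error).

```python
PAGE_SIZE = 4096  # assumed page size (4 KB)


class Overlay:
    """Toy model: virtual addresses map one-to-one onto secondary memory."""

    def __init__(self, secondary: bytes, pram_pages: int = 4):
        self.secondary = secondary               # slow flash/DRAM image
        self.pram = bytearray(pram_pages * PAGE_SIZE)
        self.resident = {}                       # virtual page -> PRAM slot
        self.free_slots = list(range(pram_pages))
        self.page_ins = 0                        # number of pages copied in

    def physical_address(self, vaddr: int) -> int:
        """Return the PRAM address for vaddr, paging in on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        slot = self.resident.get(vpn)
        if slot is None:                         # page not resident in PRAM
            slot = self._page_in(vpn)
        return slot * PAGE_SIZE + offset

    def _page_in(self, vpn: int) -> int:
        if not self.free_slots:
            raise MemoryError("PRAM full; a victim page must be evicted first")
        slot = self.free_slots.pop(0)
        start = vpn * PAGE_SIZE                  # one-to-one correspondence
        page = self.secondary[start:start + PAGE_SIZE].ljust(PAGE_SIZE, b"\0")
        self.pram[slot * PAGE_SIZE:(slot + 1) * PAGE_SIZE] = page
        self.resident[vpn] = slot
        self.page_ins += 1
        return slot

    def fetch(self, vaddr: int) -> int:
        """Read one byte through the translation."""
        return self.pram[self.physical_address(vaddr)]
```

The first access to a page pays the cost of copying it from secondary memory; later accesses to the same page are served from the PRAM without another copy.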

Synopsys has recently added support for this concept, referred to as “hardware overlay management,” as an add-on option for the ARC EM processor.

ARC EM Hardware Overlay Manager (OLM) Option

The ARC EM processor supports virtual memory addressing when the Overlay Manager (OLM) is present (Figure 2). If the OLM option is not present or if it is present but is disabled, all virtual addresses are mapped directly to physical addresses.

As shown in Figure 2, the Overlay Manager features a Translation Lookaside Buffer (TLB) for address translation and protection of 4KB, 8KB or 16KB memory pages, and fixed mappings of untranslated memory. The upper half of the untranslated memory section is uncached (for IO use) and the lower half of the untranslated memory section is cached (for the operating system kernel).

With the OLM option enabled, the ARC EM core defines a common address space for both instruction and data accesses. The memory translation and protection systems can be arranged to provide separate, non-overlapping protected regions of memory for instruction and data access within a common address space. 


Figure 2: OLM components within the ARC EM pipeline

The TLB architecture of the OLM option can be thought of as a two-level cache for page descriptors: “micro-TLBs” for instruction and data (μI-TLB & μD-TLB) as level one, and the “Joint” TLB (J-TLB) as level two.

  • The μI-TLB and μD-TLB are physically located alongside the instruction cache and data cache respectively, where they perform single-cycle virtual to physical address translation and permission checking. The μI-TLB and μD-TLB are hardware managed. On a μI-TLB (or μD-TLB) page miss, the hardware fetches the missing page mapping from the J-TLB.

  • The J-TLB consists of a RAM-based 256- or 512-entry buffer and is software managed. On a J-TLB page miss, special kernel-mode TLB miss handlers fetch the missing page descriptor from memory and store it in the J-TLB through an auxiliary register interface.
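The two-level lookup described in the list above can be sketched as a small simulation. It is an illustrative model only: the `TwoLevelTlb` class, the four-entry micro-TLB, the replacement policies, and the 8 KB page size are assumptions, and permission checking and the auxiliary register interface are not modeled.

```python
from collections import OrderedDict

PAGE_SIZE = 8192      # one of the supported page sizes, chosen arbitrarily
UTLB_ENTRIES = 4      # assumed micro-TLB size
JTLB_ENTRIES = 256    # J-TLB size from the text (256 or 512 entries)


class TwoLevelTlb:
    def __init__(self, page_table: dict):
        self.page_table = page_table      # main page table: vpn -> ppn
        self.utlb = OrderedDict()         # level one, hardware managed
        self.jtlb = {}                    # level two, software managed
        self.stats = {"utlb_hit": 0, "jtlb_hit": 0, "jtlb_miss": 0}

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.utlb:
            self.stats["utlb_hit"] += 1
            self.utlb.move_to_end(vpn)
        else:
            if vpn in self.jtlb:
                self.stats["jtlb_hit"] += 1
            else:
                self.stats["jtlb_miss"] += 1
                self._miss_handler(vpn)   # stands in for the kernel handler
            self._fill_utlb(vpn, self.jtlb[vpn])
        return self.utlb[vpn] * PAGE_SIZE + offset

    def _miss_handler(self, vpn: int) -> None:
        """Model of the software handler: copy a descriptor into the J-TLB."""
        if vpn not in self.page_table:
            raise LookupError(f"virtual page {vpn:#x} is not mapped")
        if len(self.jtlb) >= JTLB_ENTRIES:
            # Assumed policy: drop the oldest inserted entry.
            self.jtlb.pop(next(iter(self.jtlb)))
        self.jtlb[vpn] = self.page_table[vpn]

    def _fill_utlb(self, vpn: int, ppn: int) -> None:
        """Model of the hardware refill of the micro-TLB from the J-TLB."""
        if len(self.utlb) >= UTLB_ENTRIES:
            self.utlb.popitem(last=False)  # assumed policy: evict oldest use
        self.utlb[vpn] = ppn
```

In this model, only a J-TLB miss involves software; a micro-TLB miss that hits in the J-TLB is resolved entirely by the refill path.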

The main page table contains the complete details of each page mapped for use by kernel or user tasks (Figure 3). The μTLBs, J-TLB, and miss handlers combine to provide cached access into the OS page table. It is up to the OS (or micro-kernel) to keep page table entries loaded into the OLM in sync (coherent) with the main page table in memory.

Bringing a new page in from secondary storage may involve evicting an existing page from the PRAM if the PRAM is full. The eviction is performed using the Least Recently Used (LRU) algorithm.


Figure 3: OLM page table structure

To facilitate efficient operation, an external, customer-defined module is required to track page usage and indicate to the software which victim page to replace when the physical memory is fully allocated and a new page must be loaded. The OLM module provides an LRU interface that gives an external module access to the internal signals needed to track used pages.
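The role of the external tracking module can be sketched in software as follows. This is a behavioral illustration only: the `LruTracker` class and the `load_page` helper are hypothetical names, and the real module is customer-defined hardware driven by the signals on the LRU interface rather than by function calls.

```python
from collections import OrderedDict
from typing import Optional


class LruTracker:
    """Tracks resident pages in order of use, least recently used first."""

    def __init__(self, pram_pages: int):
        self.capacity = pram_pages
        self.order = OrderedDict()        # virtual page -> None

    def touch(self, vpn: int) -> None:
        """Record an access to a resident page."""
        self.order[vpn] = None
        self.order.move_to_end(vpn)       # most recently used goes last

    def victim(self) -> Optional[int]:
        """Return the page to replace, or None while PRAM has free space."""
        if len(self.order) < self.capacity:
            return None
        return next(iter(self.order))     # least recently used page

    def evict(self, vpn: int) -> None:
        del self.order[vpn]


def load_page(tracker: LruTracker, vpn: int) -> Optional[int]:
    """Make vpn resident; return the evicted page, or None if none was."""
    if vpn in tracker.order:
        tracker.touch(vpn)
        return None
    victim = tracker.victim()
    if victim is not None:
        tracker.evict(victim)
    tracker.touch(vpn)
    return victim
```

For example, with a two-page PRAM, loading pages 10 and 11, touching page 10 again, and then loading page 12 selects page 11 as the victim, because it is the page that was used least recently.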

Summary

Typically, low-power embedded (and deeply embedded) applications rely on RTOSes and do not require the address translation provided by an MMU for rich operating systems such as Linux. However, the rapid growth of applications running on edge devices has pushed desktop-class system requirements onto ultra-low power embedded processors.

Since power and area are always at a premium, an architected MMU-based system is often too costly, and many software-based solutions (such as automated overlay management) have been implemented to address programmability within the limited memory resources of these embedded devices.

To complement these software solutions, Synopsys has provided a lightweight hardware-based overlay management solution for the ARC EM Processor IP, enabling address translation and access permission validation with minimal power and area overhead. This option makes it easier to run larger and more data-intensive applications on an ARC EM core, such as those increasingly prevalent within AIoT “always-on” and wireless baseband application spaces.