Cache address mapping

Source:   Editor: Alexander Update Time :2018-09-05

Before understanding the Cache address mapping, some basic knowledge of it is added. There is a difference between the Cache address mapping and the mapping in the MMU (memory management unit) and TLB Cache (translation lookup Cache).  The structure of Cache, TLB Cache, and MMU in the CPU is shown in Figure 1. Figure 1 shows the internal structure of the Cortex A9 Processor, which adopts the Harvard architecture separated instructions from data bus. The CPU accesses internal storage and external storage, and all kinds of peripheral space are physical addresses (hardware bus) at the hardware level. However, the concept of virtual address for applications is proposed to satisfy the multi-process vulnerable software system. MMU is responsible for the mapping of virtual addresses to physical addresses, whose conversion process shows that page tables are stored in memory. When using the first-level page tables for address translation, each read/write data needs to access memory twice. The first access to the first-level page table to get the physical address, the second is the real read / write data. When using the two-level page tables, each read / write data needs to access the memory three times, two page tables (primary and secondary page tables) accessed to get physical addresses and the third to obtain true read/write data. Due to the slow speed of this mechanism, TLB Cache was proposed to store recently used page table entries (segment/large page/small page/minimal page descriptor). Located in MMU, TLB Cache is a memory management unit for improving the conversion speed of virtual address to physical address. There is no in-depth analysis of MMU and TLB. Cache mapping is to establish links between physical blocks and physical blocks at the hardware level.

The capacity of Cache is generally very small, even the largest three-level Cache (L3) is only 20MB ~ 30MB. But the capacity of current memory is in GB as a unit. The CPU usually accesses the memory by reading and writing one word unit at a time. When the CPU fails to access the cache, it is necessary to loaded the word unit stored in main memory into the cache along with several subsequent words. The reason for this exiting in  write-back strategy is to enable subsequent accesses to hit Cache. Therefore, the data unit exchanged between the main memory and Cache should be a data block (the cache line mentioned in the previous article is generally 64 Bytes in size). The size of the data block is fixed, consisting of several words, and the data block size of the main memory and the Cache are the same.

In view of the Cache-main memory model, on the one hand, the access speed of CPU should be close to that of the cache, and on the other hand, the running space provided for the user program should be maintained as the storage space of the main storage capacity. Cache is transparent to user programs in a system that adopts the Cache-main memory hierarchy, that is, the user program does not need to know the existence of Cache. Therefore, each time the CPU accesses memory, it still gives a main memory address as in the case of not using the Cache. However, in the Cache-main memory hierarchy, the CPU doesn’t first accesses the main memory but the Cache. For this reason, a mechanism is needed to convert CPU's main memory address to Cache address. The translation between the main memory address and the cache address is closely related to the mapping relationship between the main memory block and the cache block. How to store the contents of the memory into the Cache requires a mapping algorithm and a blocking mechanism.

The mapping algorithm refers to mapping the memory address space to the Cache address space. Specifically, the content stored in the memory is loaded into the Cache according to certain rules, and the correspondence between the memory address and the Cache address is established. When the processor needs to access the contents of data block, it needs to convert the memory address into a Cache address, so that the data block can be found in the Cache and eventually returned to the processor. The mapping relationships between Cache and memory can be divided into three categories: full associative cache, direct mapped cache, and N-ways associative cache.

Full associative mapping means that any block in main memory can be mapped to any block in the Cache. That is to say, when a block in the main memory needs to be loaded into the Cache, one can be selected to be stored in the main memory block according to the block occupancy or allocation of the Cache at that time, and the selected Cache block can be any one of the Caches. For example, let Cache have m blocks, and main memory n blocks. When a block j of main memory needs to be transferred into Cache, it can be stored in block 0, block 1, ..., block i, ... or block m of Cache.  As shown in Figure 3, the difference is that the correspondence between the cache and the main memory block is different.

In the Cache, it’s necessary to create a directory table, where each table item consists of three parts: a memory address, a Cache block number, and a valid bit. When the processor needs to access a certain memory address, it first use the directory table to query whether the content is cached in the Cache, as shown in Figure 4. When a main memory block is loaded into the Cache, it is registered simultaneously in the associative memory that stores both the main memory block number and the Cache block number mapping table. When the CPU accesses the memory, the block address A of main memory is queried in the Cache's associated memory directory table. If the equivalent memory block address is found and the valid bit is checked, only if it is valid can the cache memory be found in the cache by the cache block number, and the block address B is added to find the corresponding data. At this time, it is called a Cache hit, and the processor gets the data back; otherwise it is called a miss, and the processor needs to read the corresponding data in the memory. With a full associative Cache, block collisions are minimal and Cache utilization is high, but a fast access to associative memory is required. As the capacity of a Cache increases, its circuit design becomes very complicated. Therefore, only the Cache with a small capacity will be designed to be full associative.

Direct mapped cache means that a block of main memory can only be mapped to a specific block of the Cache. The Cache directory table consists of only two parts: the area code and the valid bit. The search process is shown in Figure 5. First, the memory address is divided into three parts: area code A, block number B, and intrablock address C, where area code A and area code B are actually the main memory address A of the full associative cache. Find the id area code in the table of contents according to area code A, and in the case of valid bit, it is explained that the data is in the Cache.  Then, the block address in the Cache is obtained through the block number B of the memory address, plus the block address C, and eventually the data is found. If an equal area code is not found in the directory table, or if the valid bit is invalid, the data is not in the Cache and needs to be read in memory. The advantage of the direct associative mapping is that it is the simplest to compare the circuit, but the disadvantage is that the Cache block collision rate is higher, thereby reducing the utilization of the Cache.

The above two methods have their own advantages and disadvantages, and interestingly, their advantages and disadvantages are just the opposite. So the N-ways associative mapping emerges, which is the most commonly used mapping method. The N-ways associative cache memory is divided into a plurality of groups, one of which is the size of multiple cache lines, one of which is mapped to the corresponding consecutive cache lines. That is, a Cache group and any one of the groups can be mapped to any of the corresponding Cache groups. It can be seen that outside the group it adopts the direct mapped cache mapping mode, while in the group the full associative cache mapping mode is adopted.

Suppose there is a 4-ways associative Cache with a size of 1M and a Cache line of 64B, then there are 16K Cache lines in total. However,  in the case of 4-ways group association we don’t simply have 16K Cache lines, instead of  having 4K groups, each with four Cache lines.  A memory unit can be cached to any of the Cache lines in its corresponding group. Figure 6 shows a 4-ways associative Cache as an example to describe its search process in the Cache. The directory table consists of three parts, namely "area code + block number", Cache block number and valid bits. When a memory address is received, it is divided into four parts: area code A, group number B, block number C, and intrablock address D. Firstly, a set of directory table items is found by address according to the group number B. In the 4-ways group association, there are four table items, each of which may store the memory block. Then, an association lookup (i.e., parallel lookup to improve efficiency) is made in the group table items according to area number A and block number C. If the match and the valid bit are valid, it indicates that the data block is cached in the Cache, and the cache block number is obtained. Plus the block address D, the address mapped by the memory address in the Cache is obtained and eventually the data is found. If no match is found or the valid bit is invalid, it means that the memory block is not in the cache and needs to be read by the processor into the memory. 

This article refers to two books, computer architecture》《Computer  Architecture A Quantitative Approachand some bloggers'articles.