Flash memory data retention and integrity technology
Even giant text carved into rock will have vanished without a trace after tens of millions of years.
No storage technology can keep its data forever; every medium has a retention time. In flash memory, this problem is called Data Retention. Once the retention time is exceeded, the data goes bad: what is read back from the flash can no longer be corrected by ECC. Flash memory generally suffers from the following kinds of errors:
Electrical problems: defects such as cold solder joints or chip failures cause normal commands to fail or data error rates to be abnormally high. These are found in flash memory or SSD factory testing.
Read, write, or erase failures: a basic command fails to execute, and the failure is reported through the status bits. These problems can also occur while the chip is in use, but the probability is very small.
ECC error correction failure: the data error rate is too high and exceeds the error correction capability of the algorithm. Data Retention is one of the culprits.
Flash memory stores data by using quantum tunneling to move electrons into the floating gate layer, where they remain. Over time, these electrons still have some probability of leaving the floating gate and returning to the channel. If too many electrons leak away, a programmed cell may read back the same as an erased cell, and the data is wrong. Data Retention is related to the thickness of the oxide layer below the floating gate: the thicker the oxide layer, the less likely electrons are to escape. Studies suggest that with an oxide thickness of 4.5 nm, data can theoretically be retained for 10 years.
The SSDFans WeChat group has an expert member, Dr. Yu Cai. He studied at Carnegie Mellon University, specializing in flash memory error correction, and later went to LSI to continue research in this area. One of his papers, "Data Retention in MLC NAND Flash Memory: Characterization, Optimization, and Recovery", presents his analysis of flash data retention.
Figure 1. Floating gate transistor: (a) intrinsic electric field, (b) TAT effect
The figure shows a cross-section of a floating gate transistor (the basic cell of flash memory). The control gate is at the top and the floating gate is in the middle; the polysilicon oxide layer sits above the floating gate and the tunnel oxide layer below it. When a high voltage is applied to the control gate, quantum tunneling occurs: electrons pass from the substrate through the tunnel oxide into the floating gate and are stored there. This charges the cell and completes the program operation. Conversely, when a strong negative voltage is applied to the control gate, electrons tunnel from the floating gate back to the substrate; this operation is called erasing. However, even when no voltage is applied to the control gate, an electric field still exists in the oxide layer. It is called the intrinsic electric field and is generated by the electrons in the floating gate. Under this field, electrons slowly leak out of the floating gate, and if too many leak, the data becomes wrong. The time from the program operation, through charge leakage, to the first data error is the data retention time. In the SLC era this was many years, but in the TLC era it is less than a year, and for some parts only a few months.
So why does data retention get shorter the longer flash memory is used? The cause is an effect called trap-assisted tunneling (TAT). As panel b of Figure 1 shows, the tunnel oxide is normally an insulator, but as the flash is used and erased many times, the oxide ages and traps more and more charge, which makes the insulator somewhat conductive. Charge then escapes from the floating gate faster. So the more times a flash block has been erased, the shorter its data retention. When a block approaches its rated erase count, for example 3,000 cycles, even freshly programmed data becomes very error-prone.
However, the oxide layer does not hold trapped charge forever. Sometimes the charge leaves again, which is called charge de-trapping. Both positive and negative charges can leave, so the effect on the threshold voltage is bidirectional.
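To make the relationship between wear and retention concrete, here is a toy model in Python. It is only a sketch: it assumes the threshold voltage of a programmed cell decays exponentially, that the leak rate grows linearly with the erase count (a crude stand-in for TAT), and it ignores de-trapping entirely. Every constant and function name is an illustrative assumption, not measured device data.

```python
import math

# All constants are illustrative assumptions, not measured device data.
V_PROGRAMMED = 3.0        # threshold voltage right after programming (volts)
V_READ_REF = 1.0          # read reference; below this the cell reads as erased
BASE_LEAK_PER_DAY = 1e-4  # fractional charge loss per day for a fresh cell
WEAR_FACTOR = 1e-3        # extra leakage per erase cycle (stand-in for TAT)


def leak_rate(pe_cycles: int) -> float:
    """Fractional charge loss per day; grows as the oxide wears out."""
    return BASE_LEAK_PER_DAY * (1.0 + WEAR_FACTOR * pe_cycles)


def threshold_voltage(days: float, pe_cycles: int) -> float:
    """Threshold voltage of a programmed cell after `days` without power."""
    return V_PROGRAMMED * math.exp(-leak_rate(pe_cycles) * days)


def retention_days(pe_cycles: int) -> float:
    """Days until the cell voltage decays to the read reference."""
    return math.log(V_PROGRAMMED / V_READ_REF) / leak_rate(pe_cycles)


if __name__ == "__main__":
    for cycles in (0, 1000, 3000):
        print(cycles, "erase cycles ->", round(retention_days(cycles)), "days")
```

The absolute numbers mean nothing here. The point is only the trend: in this model, a block near its rated erase count loses its data several times faster than a fresh one.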
So how is the Data Retention problem solved? You cannot let users' data be lost after months or years. SSDs generally use Read Scrub, also known as data patrol or scan-and-rewrite.
If you know storage technology, the word Scrub will first bring to mind ZFS (Zettabyte File System), the famous file system developed by Sun. The ZFS designers found that many users leave data unread for a long time, let alone rewritten. Even in database applications with frequent reads, some data goes unaccessed for long periods. Yet whatever the type of disk, there is always some probability of bit flips, which cause data errors. By the time you need the data, the errors may already be so severe that the data cannot be recovered. In ZFS, every data block has its own checksum. Whenever a block is read, the checksum reveals whether the data is wrong, so a bad block can be corrected early. ZFS therefore provides a function called Scrub, which scans the file system, finds erroneous data in advance, and rewrites it.
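The checksum-and-scrub idea can be sketched in a few lines of Python. This is not ZFS code. It assumes each block is stored twice (a simple mirror), so a good copy exists to repair from, and it uses CRC32 as a stand-in checksum. The class `MirroredStore` and its methods are hypothetical names made up for this illustration.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC32 as a stand-in for a real file system checksum."""
    return zlib.crc32(data)


class MirroredStore:
    """Hypothetical store: two copies of each block plus one checksum."""

    def __init__(self):
        self.copies = ([], [])   # two mirrors, each a list of blocks
        self.checksums = []      # one checksum per block

    def write(self, data: bytes) -> int:
        for mirror in self.copies:
            mirror.append(bytearray(data))
        self.checksums.append(checksum(data))
        return len(self.checksums) - 1

    def scrub(self) -> list:
        """Verify every copy of every block and repair bad copies.

        Returns the ids of the blocks that were repaired.
        """
        repaired = []
        for block_id, expected in enumerate(self.checksums):
            good = None
            for mirror in self.copies:
                if checksum(bytes(mirror[block_id])) == expected:
                    good = bytes(mirror[block_id])
                    break
            if good is None:
                continue  # every copy is bad; nothing to repair from
            for mirror in self.copies:
                if checksum(bytes(mirror[block_id])) != expected:
                    mirror[block_id] = bytearray(good)  # rewrite bad copy
                    repaired.append(block_id)
        return repaired
```

A scrub pass that runs while only one copy has flipped bits can still repair it. If the scrub never runs and both copies eventually degrade, the block is lost, which is exactly the case scrubbing is meant to prevent.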
Read Scrub in SSDs is similar to ZFS scrubbing. When the SSD is not busy, it scans the whole drive according to some algorithm. If it finds that the number of flipped bits in a flash page exceeds a certain threshold, it rewrites the data to a new location. This prevents data from sitting so long that the number of bit flips exceeds the error correction capability of the ECC algorithm, and so reduces uncorrectable ECC errors.
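One patrol pass of such a Read Scrub might look like the Python sketch below. It is a simplified model, not real firmware: the `Flash` and `Page` classes, the logical-to-physical `mapping`, and both constants are assumptions made for illustration. The `bit_flips` field stands in for the corrected-bit count that an ECC decoder would report when a page is read.

```python
from dataclasses import dataclass, field

ECC_CAPABILITY = 40   # assumed: max bit flips per page that ECC can correct
SCRUB_THRESHOLD = 30  # assumed: relocate a page once it gets this close


@dataclass
class Page:
    data: bytes
    bit_flips: int = 0  # what the ECC decoder would report on a read


@dataclass
class Flash:
    pages: dict = field(default_factory=dict)  # physical address -> Page
    next_free: int = 0

    def program(self, data: bytes) -> int:
        """Write data to the next free page and return its address."""
        addr = self.next_free
        self.pages[addr] = Page(data)
        self.next_free += 1
        return addr


def read_scrub(flash: Flash, mapping: dict) -> list:
    """Run one patrol pass over all mapped pages.

    `mapping` maps logical to physical addresses and is updated in place.
    Returns the logical addresses whose data was relocated.
    """
    moved = []
    for logical, physical in list(mapping.items()):
        page = flash.pages[physical]
        if page.bit_flips > ECC_CAPABILITY:
            continue  # already uncorrectable; needs another recovery path
        if page.bit_flips >= SCRUB_THRESHOLD:
            # ECC can still correct the data, so rewrite it to a fresh page.
            mapping[logical] = flash.program(page.data)
            del flash.pages[physical]  # old page is now garbage
            moved.append(logical)
    return moved
```

Setting the threshold below the ECC limit is the key design choice: the page is moved while it can still be corrected, so the rewritten copy starts again with zero flipped bits.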
Flash memory data integrity
A characteristic of flash memory is that, with use and with longer storage time, the stored data becomes prone to bit flips, resulting in random errors. The problem gets worse as flash process nodes shrink. SSDs that use flash memory as their storage medium therefore need data integrity techniques to ensure that users' data is not lost. Common techniques are:
ECC error correction
RAID data recovery