Test practice and experience of distributed storage products

Source:   Editor: admin Update Time :2019-04-22

Background introduction 
Over the past few years, I have worked on the testing and development of distributed storage products, from their first launch through successive upgrades. I have taken part in the release of numerous versions that met users’ demand for our storage products. I am writing this article to summarize that working experience, and it would be a great honor if readers could draw some inspiration from it. The following aspects are covered.
1. Change of test roles in the development of distributed storage products
2. Test practice of distributed storage products
3. Problems in test practice
4. Experience

Change of test roles
In general, there are several roles in a project, including product manager, development engineer, test engineer, operation engineer and project manager.
Project development process
First, product managers collect users’ requirements, analyze business scenarios, and feed them back to development and test engineers. Second, development and test engineers discuss these requirements and define the functions to go online and their acceptance criteria. Third, project managers formulate the project plan and track progress. Fourth, development engineers write the code and hand it to the test engineers. Fifth, after completing the tests, test engineers hand the build to the operation engineers. Finally, operation engineers release the product online.

The different division of test roles in different product periods
The entire development of distributed storage products can be divided into two phases.
Phase I: Rapid iteration and release of initial products
There is one cluster, and the team consists of 2 development engineers, 1 test engineer, 1 operation engineer, 1 product manager and 1 project manager.
The team releases a new product version every day for a week.
This phase is characterized by fewer clusters, fewer users, fewer visits and fewer functions.
The team focuses on rapid development and iteration, mainly to meet functional requirements, and continuous trial and error is tolerated.
Work in this phase is driven by rapid release.
Because the functionality under development is relatively simple, 1 test engineer can basically meet the business needs.
Problems:
Due to time constraints, testing is necessarily a trade-off. There is no standardized process, so more problems have to be repaired later. In addition, each new requirement brings more frequent upgrades and tests, so the whole phase tends to fall into a vicious circle.

Test roles:
In this phase, test engineers mainly execute test cases. Once unit testing is complete, development engineers take on essentially no further testing tasks.
Phase II: Iteration and release of stable product
There are dozens of clusters, and the team consists of more than 10 development engineers, 1 test engineer, 3 operation engineers, 2 product managers and 1 project manager.
The team releases a new product version every two months.
This phase is characterized by more clusters, more users, more visits and more functions.
The team focuses on product stability, and trial and error is no longer acceptable.
Problems:
Unlike in the previous phase, upgrading becomes very troublesome because of the large number of clusters. Development engineers, test engineers and operation engineers belong to different departments, each with its own KPI. For development engineers, the faster a requirement goes online, the better, ideally with some code released every day; their KPI is features going online. For test engineers, test time is insufficient and frequent launches bring quality risks; their KPI is product quality. For operation engineers, the fewer releases, the better, since every release carries a risk of misoperation; their KPI is product stability. With interests this inconsistent, conflicts arise easily.

Test roles:
Test roles need to change when one individual can no longer complete so many test tasks.
So the final agreed outcome is to extend the release cycle and reduce the release frequency.
Improvement of development process: Introduce design and code review, static code scanning, and test coverage requirements, especially unit test (UT) coverage, line coverage and functional branch coverage.
Improvement of test process: Introduce test standards and strengthen automated testing. Test engineers no longer undertake all test tasks. They assign some tasks to development engineers and concentrate on evaluating the test scope, test plans and test cases.
Improvement of operation process: Introduce an automatic release process and strengthen online monitoring.
In this phase, test engineers mainly establish a set of testing mechanisms, while development engineers need to undertake test tasks.

The positioning of test roles from the perspective of company
In the process of product development, the company's positioning for test roles is constantly changing.
At first, test engineers and development engineers are on one team. Then they belong to different teams. Later they are on the same team again. Finally, there are no full-time test engineers, only full stack engineers.
Everyone may understand the term full stack engineer differently. I think full stack engineers are able to develop, test and operate. The concept has both supporters and opponents, and both views make sense, just as we need large, comprehensive department stores as well as small, beautiful specialty stores. All decisions of the company are premised on supporting business needs. What we need to do is to embrace its decisions, strive to be multi-skilled and adapt to the rapidly changing environment.

Test practice of distributed storage products 
What do test engineers do when testing distributed storage products?
1. Content of work
Review of requirements and designs
Testing needs to be involved at every stage. The acceptance criteria must be known by the time of the design review, which is the most important starting point. Users’ requirements are the benchmark for testing, and the acceptance criteria will deviate if those requirements are not understood.
Test scope
Since the go-live date is fixed and not every test can be run in the limited time, the test scope must be specified. This depends on the test engineers’ understanding of the whole system and their ability to communicate with development engineers.

Design and development of test cases
Design and development of test cases means writing the code of test tools or test cases based on the requirements. Common methods are described in testing books, so I will not discuss them further here.
Design and maintenance of automated test framework
Only automated testing can free people from simple, repetitive and tedious work. A continuous integration mechanism is introduced to find problems in the code promptly.
Determination of test object
This work determines the version to be tested, so that the version that was tested is the version that goes online.
Implementation and feedback of tests
After completing the test plan, test engineers write a test report and record the problems found during testing in the bug tracking system. They then collect these results so that project managers can make a quality assessment. Although not comprehensive, the report is an important reference.
Statistical analysis of test results
Test engineers need to summarize test coverage and track the areas that are not covered.
Note that sufficient test coverage does not mean testing is complete. It only means that the code was executed. Manual analysis of test completeness is still required.
On-line confirmation and writing of release memorandum
The final online version and its configuration files need to be confirmed. In addition, test engineers should notify partners by email of all the functions going online.
Tracking and feedback of online problems
Online problems need to be tracked and fed back so that the same problems do not recur in the next version.
The development and testing of distributed storage products is a huge project. The tests involved need to be categorized and graded. Therefore, test grades are introduced.
2. Test grade
Test grade; Test resources; Test purpose; Test frequency
Level 1: Unit test; Single machine, independent of other environments; Complete code function tests, using mocks to remove environmental dependencies; Every code submission
Level 2: Functional test; Small clusters simulating real scenarios; Complete function tests, depending on other modules; Every code submission
Level 3: System test; Small clusters simulating real scenarios; Complete system tests of combined functions, depending on other modules; Every code submission
Level 4: Primary performance test; Medium clusters simulating real scenarios; Complete performance tests focusing on latency, QPS, burr (spike) rate, throughput and other indicators, depending on other environments; Every release
Level 5: Secondary performance test; Medium clusters simulating real scenarios; Complete stress and failover tests focusing on system performance when CPU, memory, network and other resources are exhausted or unavailable; Every release
Level 6: Data compatibility and upgrade test; Small clusters simulating real scenarios; Complete storage-related and online-release-related tests; Every release
Level 7: End-to-end simulated user scenario test; Large clusters simulating user scenarios; Obtain test data; Every release
The main purpose of test grades is to divide the work.
Code is not accepted for testing until development engineers have completed the unit tests and functional tests. Of course, these requirements can be traded off for an emergency release.
Different levels require different amounts of test time: a unit test run and a performance test run take very different times to complete. Both level 1 and level 2 must pass, while the later levels can be run selectively.
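
The rule above (levels 1 and 2 always required, later levels chosen per release) can be sketched as a small release gate. This is only an illustration: the names `LevelResult`, `MANDATORY_LEVELS` and `can_release` are hypothetical and do not come from the tools we used.

```python
from dataclasses import dataclass

# Levels that must always pass, following the rule in the text.
MANDATORY_LEVELS = {1, 2}


@dataclass
class LevelResult:
    level: int    # test grade, 1 to 7
    passed: bool  # outcome of that grade's test run


def can_release(results, selected_levels):
    """Return True if every mandatory and every selected level passed.

    selected_levels is the (hypothetical) set of optional grades, 3 to 7,
    chosen for this particular release.
    """
    required = MANDATORY_LEVELS | set(selected_levels)
    passed = {r.level for r in results if r.passed}
    return required <= passed


results = [LevelResult(1, True), LevelResult(2, True), LevelResult(4, False)]
print(can_release(results, set()))  # only the mandatory levels are required
print(can_release(results, {4}))    # level 4 was selected but failed
```

A level that was never run counts as not passed, so selecting a level without running it blocks the release.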
Allocation of test resources

3. Test case types
Characteristics of distributed storage products
1. Store different types of massive data
2. Machine failures are common in a cluster
3. Massive user access
So, according to these characteristics of distributed storage products, the following types of test cases are designed.
Data compatibility test
The code keeps changing and there will be different types of data, so what should we do to ensure data compatibility?
Generally speaking, data written by the old version must remain readable by the new version, and this compatibility between versions has to be considered.
In practice, test engineers simulate users writing files of different sizes and types every day. Before each new version is upgraded to the pre-release environment, test engineers verify this data, which serves as the compatibility test.
Development engineers also test this in their unit tests. It is simply the same concern tested at a different level.
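
As a rough illustration of this daily write-then-verify routine, the sketch below records a checksum for every file written through the old version and checks the files again through the new version. The callables `write_fn` and `read_fn` are assumptions that stand in for whatever storage client is actually used. Here a plain dictionary plays that role.

```python
import hashlib
import os


def make_payloads(sizes):
    """Generate one random payload per requested size (in bytes)."""
    return {f"compat_{size}.bin": os.urandom(size) for size in sizes}


def build_manifest(payloads, write_fn):
    """Write every payload (old version) and record its SHA-256 digest."""
    manifest = {}
    for name, data in payloads.items():
        write_fn(name, data)
        manifest[name] = hashlib.sha256(data).hexdigest()
    return manifest


def verify_manifest(manifest, read_fn):
    """Read every object back (new version); return the names that differ."""
    mismatched = []
    for name, expected in manifest.items():
        data = read_fn(name)
        if data is None or hashlib.sha256(data).hexdigest() != expected:
            mismatched.append(name)
    return mismatched


# Stand-in store: a dict instead of a real cluster.
store = {}
manifest = build_manifest(make_payloads([0, 1024, 4096]), store.__setitem__)
print(verify_manifest(manifest, store.get))  # empty list when all data matches
```

Keeping the manifest outside the system under test matters: if the checksums were stored in the same cluster, an incompatible upgrade could damage both the data and the evidence.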
Data integrity test
Distributed storage products must never lose users’ data. This is the bottom line for storage products. In practice, test engineers scan newly written data daily to check its integrity and run a full data scan at regular intervals.
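
The difference between the daily scan of new data and the periodic full scan can be sketched as one function with an optional time cutoff. The data layout (a dictionary of key to write time and bytes) and the name `scan` are assumptions made for the example, not the format of our real scanner.

```python
import hashlib


def scan(objects, expected, since=None):
    """Return keys whose stored data no longer matches its recorded digest.

    objects:  key -> (written_at, data) as currently stored
    expected: key -> SHA-256 hex digest recorded at write time
    since:    if given, only objects written at or after this timestamp
              are checked (daily scan); None means a full scan.
    A missing object is always reported, because lost data is a failure
    regardless of when it was written.
    """
    corrupted = []
    for key, digest in expected.items():
        entry = objects.get(key)
        if entry is None:
            corrupted.append(key)
            continue
        written_at, data = entry
        if since is not None and written_at < since:
            continue  # older data is left to the full scan
        if hashlib.sha256(data).hexdigest() != digest:
            corrupted.append(key)
    return corrupted
```

The daily scan is cheap but only sees recent writes. Data that is damaged long after it was written (for example by a failing disk) is only caught by the full scan, which is why both are run.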
Performance test
Every time a new version is released, we need to know whether its performance has improved compared with the previous version. Performance is also what users perceive most directly. Performance testing is so complex that it is not elaborated here.
Performance test results depend on the client used, the code under test, the type of request, and the amount of data in the cluster. In practice, a similar test environment is selected for every comparison in order to reduce the impact of these variables on the results.
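
A minimal sketch of such a version-to-version comparison is shown below. It assumes both runs were made in a similar environment and produced lists of request latencies in milliseconds. The 5% tolerance and the function names are illustrative assumptions, not values from our process.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare_latency(baseline_ms, candidate_ms, tolerance=0.05):
    """Compare p50 and p99 latency and flag regressions beyond tolerance."""
    report = {}
    for pct in (50, 99):
        old = percentile(baseline_ms, pct)
        new = percentile(candidate_ms, pct)
        report[f"p{pct}"] = {
            "baseline": old,
            "candidate": new,
            "regressed": new > old * (1 + tolerance),
        }
    return report


baseline = list(range(1, 101))            # 1 ms .. 100 ms
candidate = [x * 1.2 for x in baseline]   # every request 20% slower
for name, row in compare_latency(baseline, candidate).items():
    print(name, row)
```

Looking at a high percentile as well as the median helps, because a version can keep the same typical latency while its slowest requests get noticeably worse.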
Pressure test
Test engineers simulate the exhaustion of network, disk, CPU and other resources to test system performance. Once a previously configured alarm threshold is exceeded, the system raises an alarm. Pressure test results can also serve as a reference for operation engineers.


Stability test
The test system observes the consumption of memory, network and CPU resources over a long period. The most common problem is a memory leak. If memory leaks only a little at a time, a short test cannot find the problem, so the system is generally required to run continuously for more than 7 days.
Security test
Slow HTTP attack test
High-concurrency simulated attack test
Other simulated attack tests
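
One simple way to make a slow leak visible in a long run is to fit a straight line through periodic memory samples and look at the slope. The sketch below does this with ordinary least squares. The threshold of 0.1 MB per hour is an arbitrary assumption for illustration and would need tuning for a real system.

```python
def leak_slope(samples):
    """Least-squares slope of memory usage over time.

    samples: list of (hours_since_start, resident_memory_mb) pairs.
    Returns the growth rate in MB per hour.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one point in time")
    return cov / var


def looks_like_leak(samples, max_mb_per_hour=0.1):
    """Flag a run whose memory grows faster than the allowed rate."""
    return leak_slope(samples) > max_mb_per_hour


# Seven days of hourly samples with a slow leak of 0.5 MB per hour.
week = [(h, 1000 + 0.5 * h) for h in range(168)]
print(looks_like_leak(week))
```

A leak of 0.5 MB per hour adds only a few megabytes in an afternoon, which is easily lost in normal fluctuation, but over 168 hours the trend is unmistakable.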
System robustness test
System robustness test refers to the failover test, which in practice is also organized in layers.
The primary test is based on modules: it simulates the failure of each module of the system, such as a process restart or an abnormal process restart.
The secondary test is based on machines. In a distributed system, machine resource failures are certain to happen, such as insufficient CPU, excessive memory use, an abnormal network card, a damaged disk, a machine losing power and so on. Automated tests can simulate these situations in software. In practice, we still need to run some real fault drills, for example cutting the power to one or more machines.
The final test is based on clusters. If the whole cluster is restarted after a power failure, test engineers should check whether any data in the system is lost. In addition, how long the system takes to resume service and whether cluster switching works must also be tested. Of course, these tests require test engineers to cooperate with operation engineers.
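
Measuring how long the system takes to resume service after an injected fault can be sketched as a polling loop. The health probe `is_healthy` is a hypothetical callable supplied by the test, and the clock and sleep functions are parameters only so that the example can be exercised without waiting in real time.

```python
import time


def measure_recovery(is_healthy, timeout_s=300.0, interval_s=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy after a fault has been injected.

    Returns the seconds elapsed until the first healthy probe, or None if
    the service did not recover within timeout_s.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


# Simulated run: the service answers healthy on the third probe.
fake_now = [0.0]
probes = iter([False, False, True])
elapsed = measure_recovery(
    lambda: next(probes),
    clock=lambda: fake_now[0],
    sleep=lambda s: fake_now.__setitem__(0, fake_now[0] + s),
)
print(elapsed)
```

The measured value is only as precise as the polling interval, so the interval should be small compared with the recovery time the team is willing to accept.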

4. Test tools 
A handy tool makes a handyman, so the choice of test tools is also very important. In our testing, we did not use any commercial software. Most of the tools we used were developed out of engineering practice.
Tool; Purpose
Cluster monitoring status collection and self-test tools; Collect monitoring data and automatically determine whether anomalies occur during testing, to help detect problems early.
Bug and case report analysis tools; Judge the quality risk points of the current product along multiple dimensions of bugs and cases.
Test result report analysis tools; Compare and analyze test results to facilitate the investigation of performance issues.
Performance pressure test tools; Simulate users’ request pressure or request types and easily obtain performance data.
System test framework tools; Customize test requirements, complete test tasks, issue test reports and submit test results.
Pre-check-in tools; Ensure that the code automatically passes the relevant test suite before submission.
Code coverage report analysis tools; Easily identify component code with insufficient coverage.
Static code test tools; Ensure that the code passes static code checks before submission, and provide reports.
Protocol-layer and tool-layer coverage test tools; Measure the coverage of the protocol layer and tool command layer of components to ensure test coverage.

5. Gray release
Even after so many tests, things may still go wrong. In practice, I think the more effective way is to adopt gray release: the new version is first released to only a part of the machines and observed, then released to further machines batch by batch, until everything is online if no problem appears.
Prerequisites for gray release are as follows:
1. A compatible system.
That is to say, the old and new versions can coexist in the system without affecting each other. If data written by the new version cannot be read by the old version, a compatible version needs to be released first.
2. Good monitoring tools
Visualization and monitoring of machine resources such as CPU, memory, network and so on.
Visualization and monitoring of availability indicators of modules in each layer such as success rate, queue length, fitness and so on.
Visualization and monitoring of key business data indicators such as request accuracy, performance, QPS and other business indicators.
Big data tools are introduced to analyze daily access requests and obtain real business requests.
It is necessary to do real-time monitoring to determine the stability of the system.
3. Responsible engineers
A responsible engineer will pay attention to whether the functions, logs, online machines and business are working properly after the release.
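
The batch-by-batch rollout described in this section can be sketched as follows. The callables `deploy`, `is_healthy` and `rollback` are hypothetical hooks into the release and monitoring systems, and the batch sizes are arbitrary. Rolling back at any step is only safe because of the first prerequisite, compatibility between the old and new versions.

```python
def plan_batches(machines, batch_sizes):
    """Split machines into batches; leftovers form one final batch."""
    batches, start = [], 0
    for size in batch_sizes:
        if start >= len(machines):
            break
        batches.append(machines[start:start + size])
        start += size
    if start < len(machines):
        batches.append(machines[start:])
    return batches


def gray_release(machines, batch_sizes, deploy, is_healthy, rollback):
    """Release batch by batch; roll everything back at the first problem.

    Returns True if all machines were upgraded, False after a rollback.
    """
    released = []
    for batch in plan_batches(machines, batch_sizes):
        for machine in batch:
            deploy(machine)
            released.append(machine)
        # Observe every upgraded machine before widening the rollout.
        if not all(is_healthy(m) for m in released):
            for machine in reversed(released):
                rollback(machine)
            return False
    return True


machines = ["m1", "m2", "m3", "m4", "m5", "m6"]
print(plan_batches(machines, [1, 2]))
```

Starting with a very small first batch limits how many users a bad version can affect, while the growing batch sizes keep the total release time reasonable.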

6. Track review after the release
If errors occur, they should be found promptly and tracked in the defect system until they are repaired and covered by a test case, in order to avoid repeating the same errors.

7. Guarantee of product quality
How to ensure the quality of a released product is a big topic.
Here are some ways to ensure product quality.
1. Scan the code statically;
2. Measure test coverage;
3. Test and review the code;
4. Execute a complete test;
5. Adopt gray release;
6. Write a summary after each release to increase test coverage and form a good closed loop.

Problems encountered in test practice
1. Test case instability
Tests often fail because the test cases themselves are unstable.
2. Problems in test environment
A single environment cannot meet several levels of test requirements. As sometimes test resources are limited, you need to plan well.
3. Problems in test efficiency
As product functions keep accumulating and overlapping, the regression set becomes larger and more complex, and regression runs take longer and longer. Therefore, test cases need to be refactored.
4. Issues of simultaneous release of more than four versions
At release time the product may have multiple branches in regression, such as the code branch under development and the code branch for online fixes. Because regression is not efficient, these branches can only wait in line. Test engineers need to improve test efficiency and reduce regression time.
5. Difficulties in test and investigation
The requirement of test cases is not as high as that of developing code, and the log support in test framework is not friendly enough, which makes it difficult to investigate the problem. As a result, logs need to be improved.
Test engineers still need to do a lot of work to make test faster and more effective.

Experience
1. Easier said than done. A responsible test engineer can do things well in spite of bad conditions.
2. Good quality depends on cooperation of responsible test engineers, development engineers and operation engineers. If every link is done well, there will be good quality.
3. Good product quality is the guarantee of a happy life.
4. In order to find more errors, full review must be introduced.