Test practice and experience of distributed storage products (end)
1. Store different types of massive data
2. Frequent machine failures in the cluster
3. Massive user access
Based on these characteristics of distributed storage products, the following tests are designed.
Data compatibility test
Code keeps changing and different types of data accumulate, so how do we ensure data compatibility?
Generally speaking, compatibility between data written by the old and new versions needs to be considered.
In practice, test engineers simulate users writing files of different sizes and types every day. Before each version is upgraded to pre-release, test engineers verify this data as the compatibility test.
Development engineers also cover this in unit tests; it is the same concern tested at a different level.
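The daily write-then-verify routine described above can be sketched roughly as follows. This is only an illustration: `FakeStore` is a hypothetical in-memory stand-in for the cluster client, and `write_daily_samples` and `verify_after_upgrade` are names made up for this example.

```python
import hashlib
import os


class FakeStore:
    """Hypothetical stand-in for the storage cluster client."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]


def write_daily_samples(store, day, sizes=(0, 1, 4096, 1 << 20)):
    """Write files of several sizes and return a manifest of key -> SHA-256."""
    manifest = {}
    for size in sizes:
        key = f"{day}/sample_{size}.bin"
        data = os.urandom(size)
        store.put(key, data)
        manifest[key] = hashlib.sha256(data).hexdigest()
    return manifest


def verify_after_upgrade(store, manifest):
    """Return keys that are unreadable or no longer match the recorded digest."""
    mismatched = []
    for key, digest in manifest.items():
        try:
            data = store.get(key)
        except KeyError:
            mismatched.append(key)
            continue
        if hashlib.sha256(data).hexdigest() != digest:
            mismatched.append(key)
    return mismatched
```

The manifest would be saved while the old version is running and checked again after the upgrade; an empty result means all sampled data is still readable and unchanged.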
Data integrity test
Distributed storage products must never lose users' data; this is the bottom line for a storage product. In practice, test engineers scan newly written data daily to check its integrity and run a full data scan regularly.
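The daily scan of new data and the regular full scan could share one routine. A minimal sketch, assuming a digest and a write day are recorded for every object at write time; the `catalog` layout and the `scan` function are assumptions for illustration only.

```python
import hashlib


def scan(catalog, read_object, since_day=None):
    """Check stored objects against the digests recorded at write time.

    catalog: dict key -> (sha256_hex, day_written).
    read_object: callable returning the stored bytes, or None if missing.
    since_day: only scan objects written on or after this day; None scans all.
    Returns a dict key -> "missing" or "corrupt" for every bad object.
    """
    problems = {}
    for key, (digest, day_written) in catalog.items():
        if since_day is not None and day_written < since_day:
            continue
        data = read_object(key)
        if data is None:
            problems[key] = "missing"
        elif hashlib.sha256(data).hexdigest() != digest:
            problems[key] = "corrupt"
    return problems
```

Calling `scan` with `since_day` set to today gives the daily scan of new data; calling it without `since_day` gives the full scan.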
Every time a new version is released, we need to know whether its performance has improved compared with the previous version. Performance is also what users perceive most directly. Performance testing is so complex that it is not elaborated here.
Performance test results depend on the client under test, the code used, the type of request, and the amount of data in the cluster. In practice, a similar test environment is selected for every comparison in order to reduce the impact of these variables on the results.
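Comparing a new version with the previous one in a similar environment might look like the sketch below. The choice of median and 99th-percentile latency and the 10% tolerance are assumptions made for this example, not figures from our practice.

```python
import statistics


def summarize(latencies_ms):
    """Return median and 99th-percentile latency for one test run."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_ms), "p99": cuts[98]}


def regressions(baseline, candidate, tolerance=0.10):
    """List metrics where the candidate is slower than baseline by more than tolerance."""
    base, cand = summarize(baseline), summarize(candidate)
    return [name for name in base if cand[name] > base[name] * (1 + tolerance)]
```

Both runs should use the same client, request mix and data volume; otherwise a difference in the numbers says little about the code change.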
Test engineers simulate the exhaustion of network, disk, CPU and other resources to test how the system behaves under pressure. Once a previously set alarm threshold is exceeded, the system should raise an alarm. Pressure test results can also serve as a reference for operation engineers.
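Checking that alarms actually fire when a threshold is exceeded can be automated. The sketch below is only an illustration: the metric names and threshold values in `ALARM_THRESHOLDS` are assumed, and `fired` stands for whatever alarm names the system under test reported.

```python
# Assumed thresholds for illustration; real values come from operations.
ALARM_THRESHOLDS = {"cpu_percent": 90.0, "disk_percent": 85.0, "net_percent": 80.0}


def triggered_alarms(sample, thresholds=ALARM_THRESHOLDS):
    """Return the sorted names of resources whose usage exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )


def missed_alarms(sample, fired, thresholds=ALARM_THRESHOLDS):
    """Alarms that should have fired for this sample but were not reported."""
    return [name for name in triggered_alarms(sample, thresholds) if name not in fired]
```

A non-empty result from `missed_alarms` during a pressure test points to an alarm that was configured but never raised.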
The test system observes the consumption of memory, network and CPU resources over a long period. The most common problem is a memory leak. If memory leaks only a little at a time, a short test cannot find it, so the system is generally required to run continuously for more than 7 days.
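A slow leak shows up as a steady upward trend in memory samples collected over the long run. One simple way to detect it is a least-squares slope over the samples; the sketch below assumes samples of (hours since start, resident memory in MB), and the 0.5 MB per hour limit is an arbitrary example value.

```python
def growth_per_hour(samples):
    """Least-squares slope of memory usage over time.

    samples: list of (hours_since_start, resident_memory_mb) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def looks_like_leak(samples, max_mb_per_hour=0.5):
    """Flag a run whose memory grows faster than the allowed rate."""
    return growth_per_hour(samples) > max_mb_per_hour
```

A leak of 1 MB per hour is only 24 MB after one day but 168 MB after seven days, which is why the short test misses it.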
Slow HTTP attack test
High-concurrency simulated attack test
Other simulated attack tests
System robustness test
System robustness test refers to the failover test, which in practice is also carried out in layers.
The primary test is based on modules: it simulates the failure of each module of the system, such as a normal process restart and an abnormal process restart.
The secondary test is based on machines. In a distributed system, machine resource failures are inevitable: insufficient CPU, excessive memory usage, a faulty network card, a damaged disk, a machine losing power and so on. Automated tests can simulate these situations in software. In practice, we still need to run some real fault drills, for example cutting the power of one or more machines.
The final test is based on clusters. If the whole cluster is restarted after a power failure, test engineers should check whether any data is lost. In addition, how long the system takes to resume service and whether cluster switchover works must also be tested. Of course, these tests require test engineers to cooperate with operation engineers.
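The machine-level failover test can be illustrated with a toy model. Everything in the sketch below is hypothetical: `Replica` and `ReplicatedStore` are a simplified in-memory stand-in for a cluster, not the product's architecture, and a failure is simulated simply by flipping a flag.

```python
class Replica:
    """Hypothetical storage node; `alive` is toggled to simulate failures."""

    def __init__(self):
        self.data = {}
        self.alive = True


class ReplicatedStore:
    """Toy store that writes to every live replica and reads from any live one."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]

    def put(self, key, value):
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica")
        for replica in live:
            replica.data[key] = value

    def get(self, key):
        for replica in self.replicas:
            if replica.alive and key in replica.data:
                return replica.data[key]
        raise KeyError(key)


def survives_single_failures(store, expected):
    """Fail each replica in turn and check that all expected data stays readable."""
    for replica in store.replicas:
        replica.alive = False          # simulate a machine going down
        try:
            for key, value in expected.items():
                if store.get(key) != value:
                    return False
        except KeyError:
            return False
        finally:
            replica.alive = True       # bring the machine back
    return True
```

The same loop structure applies to real drills: inject one fault, verify the data, restore, then move to the next machine.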
4. Test tools
A handy tool makes a handyman. The choice of test tools is also very important. In our testing we did not use any commercial software; most of the tools we used were developed from engineering practice.
Cluster monitoring status collection and self-check tools collect monitoring data and automatically determine whether anything is abnormal during testing, which helps detect problems early.
Bug and case report analysis tools judge the quality risk points of the current product from multiple dimensions of bugs and cases.
Test result report analysis tools compare and analyze test results to facilitate the investigation of performance issues.
Performance and pressure test tools simulate users' request pressure and request types and make it easy to obtain performance data.
System test framework tools allow test requirements to be customized, run test tasks, generate test reports and submit test results.
Pre-check-in tools ensure that the code automatically passes the relevant test suite before submission.
Code coverage report analysis tools readily identify component code with insufficient coverage.
Static code test tools ensure that the code passes static checks before submission and provide reports.
Protocol-layer and tool-layer coverage tools measure how well the tests cover each component's protocol layer and tool command layer.
5. Gray release
Even after so many tests, things may still go wrong. In practice, I think the more effective approach is gray release: release to only part of the machines and observe, then gradually release to further batches of machines, until everything is online if no problem appears.
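The batch-by-batch idea can be sketched as a small loop. This is a simplified illustration under assumptions: `deploy` and `is_healthy` are caller-supplied callables, the batch sizes are arbitrary, and a real rollout would observe each batch for some time before checking its health.

```python
def gray_release(machines, deploy, is_healthy, batch_sizes=(1, 2)):
    """Roll out batch by batch; stop at the first unhealthy batch.

    deploy(machine) and is_healthy(machine) are caller-supplied callables.
    batch_sizes gives the size of the first batches; the last size is reused
    until all machines are covered.
    Returns (released_machines, completed).
    """
    released = []
    remaining = list(machines)
    step = 0
    while remaining:
        size = batch_sizes[min(step, len(batch_sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for machine in batch:
            deploy(machine)
        released.extend(batch)
        if not all(is_healthy(m) for m in batch):
            return released, False     # halt the rollout for investigation
        step += 1
    return released, True
```

Starting with a single machine keeps the impact of a bad release small; the remaining machines keep running the old version until the problem is understood.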
Prerequisites for gray release are as follows:
1. A compatible system.
That is, the whole system is compatible with both the old and new versions, and they do not affect each other. If data written by the new version cannot be read by the old version, a compatible version must be released first.
2. Good monitoring tools
Visualization and monitoring of machine resources such as CPU, memory, network and so on.
Visualization and monitoring of availability indicators of modules in each layer such as success rate, queue length, fitness and so on.
Visualization and monitoring of key business data indicators such as request accuracy, performance, QPS and other business indicators.
Big data tools are introduced to analyze daily access requests and obtain the real business requests.
Real-time monitoring is necessary to determine the stability of the system.
3. Responsible engineers
A responsible engineer will pay attention to whether the functions, logs, online machines and business are working properly after the release.
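Two of the indicators named under the monitoring prerequisite, success rate and QPS, can be computed from request records as in the sketch below. The record layout (timestamp, succeeded) and the choice to report a success rate of 1.0 for an empty window are assumptions made for this example.

```python
def availability_metrics(requests, window_seconds):
    """Compute success rate and QPS for one monitoring window.

    requests: list of (timestamp_seconds, succeeded) tuples within the window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    succeeded = sum(1 for _, ok in requests if ok)
    return {
        "qps": total / window_seconds,
        # Assumption: an idle window counts as fully successful.
        "success_rate": succeeded / total if total else 1.0,
    }
```

During a gray release, comparing these values between machines on the new version and machines on the old version gives an early signal of trouble.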
6. Track review after the release
If errors occur, they should be found promptly and tracked in the defect system until they are fixed and covered by test cases, so that the same errors do not recur.
7. Guarantee of product quality
How to ensure the quality of a released product is a big topic.
There are several ways to ensure product quality:
1. Scan static code;
2. Test coverage;
3. Test and review code;
4. Execute a complete test;
5. Adopt gray release;
6. Summarize after each release to increase test coverage and form a good closed loop.
Problems encountered in test practice
1. Test case instability
Tests often fail because the test cases themselves are unstable.
2. Problems in test environment
A single environment cannot meet the test requirements of several levels. Because test resources are sometimes limited, they need to be planned well.
3. Problems in test efficiency
As product features keep accumulating, the regression set becomes larger and more complex, and a regression run takes longer and longer. Therefore, test cases need to be refactored.
4. Issues of simultaneous release of more than four versions
At release time the product may have multiple branches in regression, such as the code branch under development and the code branch carrying online fixes. Because regression is inefficient, they can only wait in line. Test engineers need to improve test efficiency and reduce regression time.
5. Difficulties in test and investigation
The quality requirements for test case code are not as high as those for product code, and log support in the test framework is not friendly enough, which makes problems difficult to investigate. As a result, logging needs to be improved.
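For the test case instability problem listed first, one common way to separate unstable cases from real failures is to rerun each case several times. A minimal sketch; the function name and the default of 10 runs are assumptions for illustration.

```python
def classify_test(test_func, runs=10):
    """Run a test repeatedly and classify it as 'stable-pass', 'stable-fail' or 'flaky'."""
    passed = 0
    for _ in range(runs):
        try:
            test_func()
            passed += 1
        except AssertionError:
            pass                       # count as a failed run and keep going
    if passed == runs:
        return "stable-pass"
    if passed == 0:
        return "stable-fail"
    return "flaky"
```

Cases classified as flaky can be moved out of the main regression set and repaired, so that a red regression run points to a product problem rather than to the test itself.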
Test engineers still need to do a lot of work to make testing faster and more effective.
1. Easier said than done. A responsible test engineer can do things well in spite of bad conditions.
2. Good quality depends on the cooperation of responsible test engineers, development engineers and operation engineers. If every link is done well, the quality will be good.
3. Good product quality is the guarantee of a happy life.
4. In order to find more errors, full review must be introduced.