🙈 By 陈潞波 2023-09-18
中断导致的服务中断可能会造成严重的业务后果,因此构建、运行和测试弹性系统非常重要。系统的弹性源自其各部分的弹性:系统的每个部分都能够处理一定数量的错误或故障。无论是后续的服务不可用、网络延迟还是数据可用性问题,分布式系统都充满了相应的错误处理的隐含非功能性需求。
什么故障注入?
故障注入是软件测试的补充技术,用于提高软件性能和弹性。通常的测试方法验证软件的正确行为,而故障注入则通过故意注入的故障来评估软件的正确行为。
HTTP故障注入,支持中止(abort)来自下游服务的Http请求,和/或延迟(delay)代理请求,一个故障规则必须具有延迟或中止或两者兼有。在将HTTP请求转发到路由中指定的目的地时,可以注入一个或多个故障。故障注入策略应用于客户端HTTP流量。
延迟规范用于在请求转发路径中注入延迟。
中止规范用于提前中止一个请求,并返回一个预先指定的错误码。
类型
故障注入主要有三种类型,同一个location下,只允许配置一种:
- Delay: 只注入延迟故障,延迟时间到后,再发起对upstream的连接请求
- Abort:只注入中止故障,收到客户端请求,直接将注入的状态码返回给客户端,结束请求
- Delay+Abort:同时注入延迟故障和abort故障,先进行延迟,延迟时间到后,直接返回abort设置的状态码给客户端,结束请求,不再发起对upstream的连接请求
指令(fault_inject):
fault_inject 指令
参数介绍:
参数 | 类型 | 必填 | 说明 |
---|---|---|---|
type={type} | string | 是 | delay、abort、delay_abort 三者之一delay:延迟注入类型abort:中止注入类型delay_abort: 延迟+中止注入类型 |
delay_duration={duration} | string | 是(delay和delay_abort 必填) | delay时间, 1h/1m/1s/1ms, 必须>=1ms |
status_code={code} | uint | 是 (abort和delay_abort 必填) | code设置注入返回码, [200, 600] |
delay_percentage={pct} | uint | 可选,默认100 (100%) | 设置注入delay请求的百分比,默认是100%, 范围: [1, 100], eg: 1表示1%, 10表示10% |
abort_percentage={pct} | uint | 可选,默认100 (100%) | 设置注入abort请求的百分比,默认是100%, 范围: [1, 100], eg: 1表示1%, 10表示10% |
Abort注入类型
功能: 按照设置的百分比概率,对命中的请求直接返回指定的status_code
参数介绍:
参数 | 类型 | 必填 | 说明 |
---|---|---|---|
status_code={code} | uint | 是 | code设置注入返回码, [200, 600] |
abort_percentage={pct} | uint | 可选,默认100 (100%) | 设置注入abort请求的百分比,默认是100%, 范围: [1, 100], eg: 1表示1%, 10表示10% |
Delay注入类型
功能: 按照设置的百分比概率,对命中的请求按照设定的duration时间延迟发起对上游upstream的请求
参数介绍:
参数介绍:
参数 | 类型 | 必填 | 说明 |
---|---|---|---|
delay_duration={duration} | string | 是 | delay时间, 1h/1m/1s/1ms, 必须>=1ms |
delay_percentage={pct} | uint | 可选,默认100 (100%) | 设置注入delay请求的百分比,默认是100%, 范围: [1, 100], eg: 1表示1%, 10表示10% |
1.3.3 Delay+Abort注入类型
• 功能: 按照设置的百分比概率,对命中的请求按照设定的duration时间延迟发起对上游upstream的请求
• 参数介绍:
参数 | 类型 | 必填 | 说明 |
---|---|---|---|
delay_duration={duration} | string | 是 | delay时间, 1h/1m/1s/1ms, 必须>=1ms |
status_code={code} | uint | 是 | code设置注入返回码, [200, 600] |
delay_percentage={pct} | uint | 可选,默认100 (100%) | 设置注入delay请求的百分比,默认是100, 范围: [1, 100], eg: 1表示1%, 100表示100% |
abort_percentage={pct} | uint | 可选,默认100 (100%) | 设置注入abort请求的百分比,默认是100, 范围: [1, 100], eg: 1表示1%, 100表示100% |
声明式API动态故障注入
接口url
GET http://{ip:port}/config/2/config/http_dyn_fault_inject
PUT http://{ip:port}/config/2/config/http_dyn_fault_inject
json格式如下:
{
"servers": [
{
"listens": [
"0.0.0.0:92"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
#此处type可为下面四种
#abort 这个表示中止注入
#delay 这个表示延迟注入
#delay_abort 这个表示延迟+中止注入
#none 这个表示取消故障注入或者不设置故障注入
"fault_inject_type": "delay_abort", #如果该值为none,则其他参数取值无意义
"delay_percentage": 100, #延迟故障注入百分比,如果是abort类型,该值无意义
"abort_percentage": 100, #中止故障注入百分比,如果是delay类型,该值无意义
"status_code": 405, #故障注入状态码,如果是delay类型,该值无意义
"delay_duration": "5s", #故障注入延迟时间,如果是abort类型,该值无意义
"locations": [
{
"location": "/demo",
"fault_inject_type": "none", #子location /demo 故障注入类型
}
]
}
]
}
]
}
OpenNJet配置文件
load_module modules/njt_http_dyn_fault_inject_module.so; **#使用动态故障注入功能需要load该模块**
模块编译
• 故障注入模块是静态编译进去,通过NJT_HTTP_FAULT_INJECT 宏控制开启
• 声明式api动态故障注入模块采用动态编译,需要通过loadmodule指令加载使用
功能测试
HTTP1.1 协议测试
abort故障注入测试
配置:
server {
listen 80 ;
server_name localhost;
location / {
fault_inject type=abort status_code=405 abort_percentage=100;
proxy_next_upstream_tries 0; #关闭重试
proxy_next_upstream_timeout 0; #关闭超时
proxy_pass http://back/;
}
}
测试效果:
浏览器http1.1 访问,立马返回405,如下:
OpenNJet日志:
2023/07/17 17:04:51 [debug] 13179#0: *35 fault injet abort 405, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", host: "192.168.40.136"
delay故障注入测试
配置:
server {
listen 80 ;
server_name localhost;
location / {
fault_inject type=delay delay_duration=10s delay_percentage=100;
proxy_next_upstream_tries 0; #关闭重试
proxy_next_upstream_timeout 0; #关闭超时
proxy_pass http://back/;
}
}
测试效果:
浏览器http1.1 访问,等待10s后返回访问成功页面,如下:
OpenNJet日志:
#等待10s 后正常处理并成功返回页面
2023/07/17 17:02:04 [debug] 13015#0: *28 fault inject start deleay, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 17:02:14 [debug] 13015#0: *28 fault njet delay success, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 17:02:14 [debug] 13015#0: *28 fault inject delay timer clean while closing request, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:8008/", host: "192.168.40.136"
delay_abort故障注入测试
配置:
server {
listen 80 ;
server_name localhost;
location / {
fault_inject type=delay_abort delay_duration=10s status_code=405 delay_percentage=100;
proxy_next_upstream_tries 0; #关闭重试
proxy_next_upstream_timeout 0; #关闭超时
proxy_pass http://back/;
}
}
测试效果:
浏览器http1.1 访问,等待10s后直接返回405,如下:
OpenNJet日志:
# 先delay等待10s
2023/07/17 16:57:48 [debug] 11894#0: *26 fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 16:57:58 [debug] 11894#0: *26 fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
#直接返回 405 abort
2023/07/17 16:57:58 [debug] 11894#0: *26 fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 16:57:58 [debug] 11894#0: *26 fault inject delay timer clean while closing request, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
HTTP2 协议测试
abort故障注入测试
配置:
server {
listen 81 ssl http2;
server_name 192.168.40.136;
ssl_certificate /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3.pem;
ssl_certificate_key /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3-key.pem;
location / {
fault_inject type=abort status_code=405 abort_percentage=100;
proxy_next_upstream_tries 0; #关闭重试
proxy_next_upstream_timeout 0; #关闭超时
proxy_pass http://back/;
}
}
测试效果:
浏览器http2访问,立马返回405,如下:
OpenNJet日志:
#直接返回405
2023/07/17 16:43:54 [debug] 11894#0: *8 fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
delay故障注入测试
配置:
server {
listen 81 ssl http2;
server_name 192.168.40.136;
ssl_certificate /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3.pem;
ssl_certificate_key /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3-key.pem;
location / {
fault_inject type=delay delay_duration=10s delay_percentage=100;
proxy_next_upstream_tries 0; #关闭重试
proxy_next_upstream_timeout 0; #关闭超时
proxy_pass http://back/;
}
}
测试效果:
浏览器http2访问,等待10s后成功访问页面,如下:
OpenNJet日志:
#delay 10s后成功访问到页面
2023/07/17 16:42:03 [debug] 11783#0: *4 fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:42:13 [debug] 11783#0: *4 fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:42:13 [debug] 11783#0: *4 fault inject delay timer clean while sending to client, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", upstream: "http://127.0.0.1:8008/", host: "192.168.40.136:81"
delay_abort故障注入测试
配置:
server {
listen 81 ssl http2;
server_name 192.168.40.136;
ssl_certificate /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3.pem;
ssl_certificate_key /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3-key.pem;
location / {
fault_inject type=delay_abort delay_duration=10s status_code=405 delay_percentage=100;
proxy_next_upstream_tries 0; #关闭重试
proxy_next_upstream_timeout 0; #关闭超时
proxy_pass http://back/;
}
}
测试效果:
浏览器http2 访问,等待10s后返回405,如下:
OpenNJet日志:
#先delay 10s
2023/07/17 16:39:10 [debug] 11598#0: *2 fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:39:20 [debug] 11598#0: *2 fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
#再直接返回405 abort
2023/07/17 16:39:20 [debug] 11598#0: *2 fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:39:20 [debug] 11598#0: *2 fault inject delay timer clean, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
声明式API动态故障注入测试
njet静态配置
...
load_module modules/njt_http_dyn_fault_inject_module.so; #load 动态模块so
...
http{
...
upstream back{
server 127.0.0.1:8008;
}
server{
listen 8008;
location / {
index 8008.html;
add_header Set-Cookie route=8008;
}
}
#8009 为普通server, 返回8008.html文件
server{
listen 8009;
location / {
proxy_pass http://back/;
}
}
...
}
执行get请求
curl http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject | jq
{
"servers": [
{
"listens": [
"0.0.0.0:8009"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
"fault_inject_type": "none", #8009 静态默认为none,即没有设置故障注入,
#相关参数没有意义
"delay_percentage": 100,
"abort_percentage": 100,
"status_code": 200,
"delay_duration": ""
}
]
},
{
"listens": [
"0.0.0.0:8008"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
"fault_inject_type": "none",
"delay_percentage": 100,
"abort_percentage": 100,
"status_code": 200,
"delay_duration": ""
}
]
}
]
}
浏览器访问
没有任何延迟,立马返回结果页面
执行put请求,修改
curl -X PUT http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject -d ‘@json.txt’
json.txt如下:
{
"servers": [
{
"listens": [
"0.0.0.0:8009"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
"fault_inject_type": "delay_abort", #修改为delay_abort, delay 5s, 然后返回405
"delay_percentage": 100,
"abort_percentage": 100,
"status_code": 405,
"delay_duration": "5s"
}
]
},
{
"listens": [
"0.0.0.0:8008"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
"fault_inject_type": "none",
"delay_percentage": 100,
"abort_percentage": 100,
"status_code": 200,
"delay_duration": ""
}
]
}
]
}
再次get
curl http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject | jq
{
"servers": [
{
"listens": [
"0.0.0.0:8009"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
"fault_inject_type": "delay_abort", #与put 修改一致,说明修改成功
"delay_percentage": 100,
"abort_percentage": 100,
"status_code": 405,
"delay_duration": "5s"
}
]
},
{
"listens": [
"0.0.0.0:8008"
],
"serverNames": [
""
],
"locations": [
{
"location": "/",
"fault_inject_type": "none",
"delay_percentage": 100,
"abort_percentage": 100,
"status_code": 200,
"delay_duration": ""
}
]
}
]
}
通过浏览器实际访问
实际效果,等待5s,返回405, 与预期一致
后台日志
#日志也显示delay 5s后,返回abort 405
2023/07/20 15:55:17 [debug] 30305#0: *12 fault inject start deleay,
client: 192.168.40.205, server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"
2023/07/20 15:55:22 [debug] 30305#0: *12 fault njet delay success, client: 192.168.40.205,
server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"
2023/07/20 15:55:22 [debug] 30305#0: *12 fault injet abort 405, client: 192.168.40.205, server: ,
request: "GET / HTTP/1.1", host: "192.168.40.136:8009"
2023/07/20 15:55:22 [debug] 30305#0: *12 fault inject delay timer clean while closing request,
client: 192.168.40.205, server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"