How to Configure Fault Injection Dynamically with OpenNJet

By 陈潞波, 2023-09-18

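Service outages can have serious business consequences, so it is important to build, run, and test resilient systems. A system's resilience comes from the resilience of its parts: every part must be able to tolerate a certain number of errors or failures. Whether the problem is an unavailable downstream service, network latency, or a data availability issue, distributed systems are full of implicit non-functional requirements for handling such errors.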

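What Is Fault Injection?

Fault injection is a complementary software testing technique used to improve software performance and resilience. Conventional testing verifies that software behaves correctly under normal conditions, whereas fault injection evaluates how the software behaves when faults are injected deliberately.

HTTP fault injection can abort HTTP requests coming from downstream services, delay proxied requests, or both. A fault rule must specify a delay, an abort, or both. One or more faults can be injected while an HTTP request is forwarded to the destination specified in the route. The fault injection policy applies to client HTTP traffic.

A delay specification injects latency into the request forwarding path.

An abort specification terminates a request early and returns a pre-specified error code.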

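Types

There are three types of fault injection. Only one type may be configured per location:

  • Delay: injects only a delay fault. Once the delay has elapsed, the connection request to the upstream is made.
  • Abort: injects only an abort fault. When a client request arrives, the injected status code is returned to the client immediately and the request ends.
  • Delay+Abort: injects both a delay fault and an abort fault. The request is delayed first; once the delay has elapsed, the status code configured for abort is returned to the client and the request ends. No connection request is made to the upstream.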
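
The three types can be pictured as a small decision function. The Python sketch below only illustrates the behaviour described in the list above; it is not OpenNJet's implementation. FaultRule, hit, and plan_request are hypothetical names, and treating delay_percentage and abort_percentage as two independent rolls is an assumption that the text does not confirm.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class FaultRule:
    type: str                    # "delay", "abort", or "delay_abort"
    delay_ms: int = 0
    status_code: int = 0
    delay_percentage: int = 100
    abort_percentage: int = 100


def hit(percentage: int, rng: random.Random) -> bool:
    """Return True for roughly `percentage` percent of calls."""
    return rng.randrange(100) < percentage


def plan_request(rule: FaultRule, rng: random.Random) -> List[str]:
    """List the steps one request would go through under `rule`."""
    steps = []
    # Assumption: the delay roll and the abort roll are independent.
    if rule.type in ("delay", "delay_abort") and hit(rule.delay_percentage, rng):
        steps.append(f"sleep {rule.delay_ms}ms")
    if rule.type in ("abort", "delay_abort") and hit(rule.abort_percentage, rng):
        steps.append(f"return {rule.status_code}")  # upstream is never contacted
    else:
        steps.append("proxy to upstream")
    return steps


rng = random.Random(0)
print(plan_request(FaultRule("delay_abort", delay_ms=10000, status_code=405), rng))
```

With both percentages at 100, the delay_abort rule always sleeps and then returns the status code, which matches the third bullet.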

Directive (fault_inject):

The fault_inject directive

[Figure: fault_inject directive syntax]

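Parameters:

Parameter | Type | Required | Description
type={type} | string | Yes | One of delay, abort, or delay_abort. delay: delay injection; abort: abort injection; delay_abort: delay + abort injection
delay_duration={duration} | string | Yes (for delay and delay_abort) | Delay time, in the form 1h/1m/1s/1ms; must be >= 1ms
status_code={code} | uint | Yes (for abort and delay_abort) | Status code to return, range [200, 600]
delay_percentage={pct} | uint | Optional, default 100 (100%) | Percentage of requests that receive the delay, range [1, 100]; e.g. 1 means 1%, 10 means 10%
abort_percentage={pct} | uint | Optional, default 100 (100%) | Percentage of requests that are aborted, range [1, 100]; e.g. 1 means 1%, 10 means 10%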
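
The constraints in the table can be checked before a configuration is written. The following Python sketch is one possible pre-flight check, not part of OpenNJet: parse_duration_ms and validate are hypothetical helpers, and the duration parser assumes only the four forms listed in the table (an integer followed by h, m, s, or ms).

```python
import re

_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_duration_ms(text: str) -> int:
    """Convert strings such as '5s' or '1ms' to milliseconds."""
    m = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if not m:
        raise ValueError(f"delay_duration {text!r} is not like 1h/1m/1s/1ms")
    ms = int(m.group(1)) * _UNIT_MS[m.group(2)]
    if ms < 1:
        raise ValueError("delay_duration must be >= 1ms")
    return ms


def validate(params: dict) -> list:
    """Return a list of problems; an empty list means the rule looks valid."""
    kind = params.get("type")
    if kind not in ("delay", "abort", "delay_abort"):
        return ["type must be delay, abort, or delay_abort"]
    errors = []
    if kind in ("delay", "delay_abort"):
        try:
            parse_duration_ms(params.get("delay_duration", ""))
        except ValueError as exc:
            errors.append(str(exc))
    if kind in ("abort", "delay_abort"):
        code = params.get("status_code")
        if not isinstance(code, int) or not 200 <= code <= 600:
            errors.append("status_code must be in [200, 600]")
    for key in ("delay_percentage", "abort_percentage"):
        pct = params.get(key, 100)  # both default to 100
        if not isinstance(pct, int) or not 1 <= pct <= 100:
            errors.append(f"{key} must be in [1, 100]")
    return errors


print(validate({"type": "delay_abort", "delay_duration": "10s", "status_code": 405}))
```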

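Abort injection type

Function: returns the specified status_code directly for matching requests, according to the configured percentage.

Parameters:

Parameter | Type | Required | Description
status_code={code} | uint | Yes | Status code to return, range [200, 600]
abort_percentage={pct} | uint | Optional, default 100 (100%) | Percentage of requests that are aborted, range [1, 100]; e.g. 1 means 1%, 10 means 10%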

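Delay injection type

Function: delays matching requests for the configured duration before sending them to the upstream, according to the configured percentage.

Parameters:

Parameter | Type | Required | Description
delay_duration={duration} | string | Yes | Delay time, in the form 1h/1m/1s/1ms; must be >= 1ms
delay_percentage={pct} | uint | Optional, default 100 (100%) | Percentage of requests that receive the delay, range [1, 100]; e.g. 1 means 1%, 10 means 10%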

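Delay+Abort injection type

Function: according to the configured percentages, matching requests are first delayed for the configured duration and then answered directly with the specified status_code. No request is sent to the upstream.

Parameters:

Parameter | Type | Required | Description
delay_duration={duration} | string | Yes | Delay time, in the form 1h/1m/1s/1ms; must be >= 1ms
status_code={code} | uint | Yes | Status code to return, range [200, 600]
delay_percentage={pct} | uint | Optional, default 100 (100%) | Percentage of requests that receive the delay, range [1, 100]; e.g. 1 means 1%, 100 means 100%
abort_percentage={pct} | uint | Optional, default 100 (100%) | Percentage of requests that are aborted, range [1, 100]; e.g. 1 means 1%, 100 means 100%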
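
To see how the parameters of each type combine into a single directive line, here is a small Python sketch that renders one. render_fault_inject is a hypothetical helper, and the parameter order simply follows the examples in this article; whether OpenNJet accepts other orders is not something the text states.

```python
def render_fault_inject(kind, delay_duration=None, status_code=None,
                        delay_percentage=100, abort_percentage=100) -> str:
    """Render a fault_inject directive line for one of the three types."""
    if kind not in ("delay", "abort", "delay_abort"):
        raise ValueError("type must be delay, abort, or delay_abort")
    wants_delay = kind in ("delay", "delay_abort")
    wants_abort = kind in ("abort", "delay_abort")

    parts = [f"type={kind}"]
    if wants_delay:
        if delay_duration is None:
            raise ValueError("delay_duration is required for delay and delay_abort")
        parts.append(f"delay_duration={delay_duration}")
    if wants_abort:
        if status_code is None:
            raise ValueError("status_code is required for abort and delay_abort")
        parts.append(f"status_code={status_code}")
    # Percentages are optional in the directive; they are written out here for clarity.
    if wants_delay:
        parts.append(f"delay_percentage={delay_percentage}")
    if wants_abort:
        parts.append(f"abort_percentage={abort_percentage}")
    return "fault_inject " + " ".join(parts) + ";"


print(render_fault_inject("abort", status_code=405))
print(render_fault_inject("delay", delay_duration="10s"))
print(render_fault_inject("delay_abort", delay_duration="10s", status_code=405))
```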

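Dynamic Fault Injection via the Declarative API

API URLs

 GET  http://{ip:port}/config/2/config/http_dyn_fault_inject

 PUT  http://{ip:port}/config/2/config/http_dyn_fault_inject

The JSON format is as follows. The # comments are annotations for the reader only; standard JSON has no comment syntax, so remove them before sending the body: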

{
    "servers": [
        {
            "listens": [
                "0.0.0.0:92"
            ],
            "serverNames": [
                ""
            ],
            "locations": [
                {
                    "location": "/",
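                    # fault_inject_type can be one of the following four values:
                    # abort        abort injection
                    # delay        delay injection
                    # delay_abort  delay + abort injection
                    # none         cancel fault injection, or leave it unset

                    "fault_inject_type": "delay_abort", # if this is none, the other parameters have no effect

                    "delay_percentage": 100,      # percentage of requests to delay; has no effect for type abort
                    "abort_percentage": 100,      # percentage of requests to abort; has no effect for type delay
                    "status_code": 405,         # status code to inject; has no effect for type delay
                    "delay_duration": "5s",     # delay time to inject; has no effect for type abort
                    "locations": [
                        {
                            "location": "/demo",
                            "fault_inject_type": "none" # fault injection type for the child location /demo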
                        }
                    ]
                }
            ]
        }
    ]
}
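
Because the annotated listing above is not valid JSON as written, it can be safer to generate the request body programmatically. The Python sketch below builds the same structure with the standard json module. location_entry and build_payload are hypothetical helpers, and the defaults for a "none" entry (status_code 200, empty delay_duration) are an assumption borrowed from the GET response shown later in this article.

```python
import json


def location_entry(path, fault_type="none", delay_duration="", status_code=200,
                   delay_percentage=100, abort_percentage=100, children=None):
    """Build one entry of the "locations" array."""
    entry = {
        "location": path,
        "fault_inject_type": fault_type,
        "delay_percentage": delay_percentage,
        "abort_percentage": abort_percentage,
        "status_code": status_code,
        "delay_duration": delay_duration,
    }
    if children:
        entry["locations"] = children  # nested (child) locations
    return entry


def build_payload(listen, server_name, locations):
    """Wrap location entries in the servers/listens/serverNames envelope."""
    return {"servers": [{"listens": [listen],
                         "serverNames": [server_name],
                         "locations": locations}]}


payload = build_payload("0.0.0.0:92", "", [
    location_entry("/", "delay_abort", delay_duration="5s", status_code=405,
                   children=[{"location": "/demo", "fault_inject_type": "none"}]),
])
body = json.dumps(payload, indent=4)
print(body)
```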

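OpenNJet configuration file

load_module modules/njt_http_dyn_fault_inject_module.so;  # this module must be loaded to use dynamic fault injection

Module compilation

• The fault injection module is compiled in statically and is enabled through the NJT_HTTP_FAULT_INJECT macro.

• The declarative API dynamic fault injection module is built as a dynamic module and must be loaded with the load_module directive.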

Functional Tests

HTTP/1.1 Tests

abort fault injection test

Configuration:

server {
    listen       80;
    server_name  localhost;

    location / {
        fault_inject type=abort status_code=405 abort_percentage=100;
        proxy_next_upstream_tries 0;      # 0 lifts the limit on retry attempts
        proxy_next_upstream_timeout 0;    # 0 lifts the retry time limit
        proxy_pass http://back/;
    }
}

Result:

Accessing the server from a browser over HTTP/1.1 returns 405 immediately.

OpenNJet log:

2023/07/17 17:04:51 [debug] 13179#0: *35  fault injet abort 405, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", host: "192.168.40.136"
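
The browser check can also be scripted. The following Python sketch sends one GET and reports the status code and the elapsed time, which covers both the abort case (status) and the delay cases (elapsed). probe is a hypothetical helper; the opener parameter exists only so the function can be exercised without a live server, and the address in the usage comment is the test host from this article.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 30.0, opener=urllib.request.urlopen):
    """Send one GET and return (status_code, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            resp.read()  # drain the body so the timing covers the full response
    except urllib.error.HTTPError as err:
        status = err.code  # urllib reports 4xx/5xx responses as exceptions
    return status, time.monotonic() - start


# Usage against the abort test server from this article (not executed here):
# status, elapsed = probe("http://192.168.40.136/")
# print(status, round(elapsed, 1))
```

For the abort configuration above, the expectation from the article is a 405 with negligible elapsed time; for the delay tests the elapsed time should be close to the configured delay_duration.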

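delay fault injection test

Configuration:

server {
    listen       80;
    server_name  localhost;

    location / {
        fault_inject type=delay delay_duration=10s delay_percentage=100;
        proxy_next_upstream_tries 0;      # 0 lifts the limit on retry attempts
        proxy_next_upstream_timeout 0;    # 0 lifts the retry time limit
        proxy_pass http://back/;
    }
}

Result:

Accessing the server from a browser over HTTP/1.1 returns the page successfully after a 10s wait.

OpenNJet log:

# after waiting 10s, the request is processed normally and the page is returned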
2023/07/17 17:02:04 [debug] 13015#0: *28  fault inject start deleay, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 17:02:14 [debug] 13015#0: *28  fault njet delay success, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 17:02:14 [debug] 13015#0: *28  fault inject delay timer clean while closing request, client: 192.168.40.205, server: localhost, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:8008/", host: "192.168.40.136"

delay_abort fault injection test

Configuration:

server {
    listen       80;
    server_name  localhost;

    location / {
        fault_inject type=delay_abort delay_duration=10s status_code=405 delay_percentage=100;
        proxy_next_upstream_tries 0;      # 0 lifts the limit on retry attempts
        proxy_next_upstream_timeout 0;    # 0 lifts the retry time limit
        proxy_pass http://back/;
    }
}

Result:

Accessing the server from a browser over HTTP/1.1 returns 405 directly after a 10s wait.

OpenNJet log:

# first, the 10s delay
2023/07/17 16:57:48 [debug] 11894#0: *26  fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 16:57:58 [debug] 11894#0: *26  fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"

# then 405 is returned directly (abort)
2023/07/17 16:57:58 [debug] 11894#0: *26  fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
2023/07/17 16:57:58 [debug] 11894#0: *26  fault inject delay timer clean while closing request, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"
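
The delay can be confirmed from the log timestamps rather than by eye. The Python sketch below parses the four log lines above and computes the gap between the start of the delay and its completion. LOG_LINE and fault_events are hypothetical names, and the regular expression is fitted only to the debug lines shown in this article; the message spellings are copied verbatim from that log output.

```python
import re
from datetime import datetime

# timestamp, [debug], pid#tid, *connection, message up to ", client:"
LOG_LINE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[debug\] \d+#\d+: \*(\d+)\s+(fault .*?), client:"
)


def fault_events(lines):
    """Return (timestamp, connection_id, message) for each fault-injection line."""
    events = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S")
            events.append((ts, int(m.group(2)), m.group(3)))
    return events


sample = [
    '2023/07/17 16:57:48 [debug] 11894#0: *26  fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"',
    '2023/07/17 16:57:58 [debug] 11894#0: *26  fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"',
    '2023/07/17 16:57:58 [debug] 11894#0: *26  fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"',
    '2023/07/17 16:57:58 [debug] 11894#0: *26  fault inject delay timer clean while closing request, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/1.1", host: "192.168.40.136"',
]

events = fault_events(sample)
delay_seconds = (events[1][0] - events[0][0]).total_seconds()
print(delay_seconds, [message for _, _, message in events])
```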

HTTP/2 Tests

abort fault injection test

Configuration:

server {
    listen       81 ssl http2;
    server_name  192.168.40.136;
    ssl_certificate                 /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3.pem;
    ssl_certificate_key             /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3-key.pem;

    location / {
        fault_inject type=abort status_code=405 abort_percentage=100;
        proxy_next_upstream_tries 0;      # 0 lifts the limit on retry attempts
        proxy_next_upstream_timeout 0;    # 0 lifts the retry time limit
        proxy_pass http://back/;
    }
}

Result:

Accessing the server from a browser over HTTP/2 returns 405 immediately.

OpenNJet log:

# 405 is returned directly
2023/07/17 16:43:54 [debug] 11894#0: *8  fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"

delay fault injection test

Configuration:

server {
    listen       81 ssl http2;
    server_name  192.168.40.136;
    ssl_certificate                 /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3.pem;
    ssl_certificate_key             /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3-key.pem;

    location / {
        fault_inject type=delay delay_duration=10s delay_percentage=100;
        proxy_next_upstream_tries 0;      # 0 lifts the limit on retry attempts
        proxy_next_upstream_timeout 0;    # 0 lifts the retry time limit
        proxy_pass http://back/;
    }
}

Result:

Accessing the server from a browser over HTTP/2 returns the page successfully after a 10s wait.

OpenNJet log:

# after the 10s delay, the page is returned successfully
2023/07/17 16:42:03 [debug] 11783#0: *4  fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:42:13 [debug] 11783#0: *4  fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:42:13 [debug] 11783#0: *4  fault inject delay timer clean while sending to client, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", upstream: "http://127.0.0.1:8008/", host: "192.168.40.136:81"

delay_abort fault injection test

Configuration:

server {
    listen       81 ssl http2;
    server_name  192.168.40.136;
    ssl_certificate                 /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3.pem;
    ssl_certificate_key             /root/bug/njet1.0/mycert/www.tmlake.xn--com-7m0aa+3-key.pem;

    location / {
        fault_inject type=delay_abort delay_duration=10s status_code=405 delay_percentage=100;
        proxy_next_upstream_tries 0;      # 0 lifts the limit on retry attempts
        proxy_next_upstream_timeout 0;    # 0 lifts the retry time limit
        proxy_pass http://back/;
    }
}

Result:

Accessing the server from a browser over HTTP/2 returns 405 after a 10s wait.

OpenNJet log:

# first, the 10s delay
2023/07/17 16:39:10 [debug] 11598#0: *2  fault inject start deleay, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:39:20 [debug] 11598#0: *2  fault njet delay success, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"

# then 405 is returned directly (abort)
2023/07/17 16:39:20 [debug] 11598#0: *2  fault injet abort 405, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"
2023/07/17 16:39:20 [debug] 11598#0: *2  fault inject delay timer clean, client: 192.168.40.205, server: 192.168.40.136, request: "GET / HTTP/2.0", host: "192.168.40.136:81"

Declarative API Dynamic Fault Injection Test

njet static configuration

...
load_module modules/njt_http_dyn_fault_inject_module.so;     # load the dynamic module
...

http{
...

    upstream back{
       server 127.0.0.1:8008;
    }

     server{
                listen 8008;
                location / {
                        index 8008.html;
                        add_header Set-Cookie route=8008;
                }
     }

     # 8009 is a regular server that proxies to back and returns the 8008.html file
     server{
                listen 8009;
                location / {
                        proxy_pass http://back/;
                }
     }
...
}

Issue a GET request

curl http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject | jq

{
  "servers": [
    {
      "listens": [
        "0.0.0.0:8009"
      ],
      "serverNames": [
        ""
      ],
      "locations": [
        {
          "location": "/",
          "fault_inject_type": "none",         # 8009 defaults to none in the static configuration, i.e. no fault injection is set,
                                               # so the related parameters have no effect
          "delay_percentage": 100,
          "abort_percentage": 100,
          "status_code": 200,
          "delay_duration": ""
        }
      ]
    },
    {
      "listens": [
        "0.0.0.0:8008"
      ],
      "serverNames": [
        ""
      ],
      "locations": [
        {
          "location": "/",
          "fault_inject_type": "none",
          "delay_percentage": 100,
          "abort_percentage": 100,
          "status_code": 200,
          "delay_duration": ""
        }
      ]
    }
  ]
}

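Browser access

The result page is returned immediately, without any delay.

Issue a PUT request to modify the configuration

curl -X PUT http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject -d '@json.txt'

json.txt contains the following: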

{
  "servers": [
    {
      "listens": [
        "0.0.0.0:8009"
      ],
      "serverNames": [
        ""
      ],
      "locations": [
        {
          "location": "/",
          "fault_inject_type": "delay_abort", # changed to delay_abort: delay 5s, then return 405
          "delay_percentage": 100,
          "abort_percentage": 100,
          "status_code": 405,
          "delay_duration": "5s"
        }
      ]
    },
    {
      "listens": [
        "0.0.0.0:8008"
      ],
      "serverNames": [
        ""
      ],
      "locations": [
        {
          "location": "/",
          "fault_inject_type": "none",
          "delay_percentage": 100,
          "abort_percentage": 100,
          "status_code": 200,
          "delay_duration": ""
        }
      ]
    }
  ]
}
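
The same PUT can be issued from a script. The Python sketch below builds the request with the standard urllib.request module. build_put is a hypothetical helper and API_URL is the test address used in this article. No Content-Type header is set explicitly, so urllib falls back to application/x-www-form-urlencoded, the same default that curl -d uses; whether the API expects a different content type is not stated in the article.

```python
import json
import urllib.request

API_URL = "http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject"


def build_put(url: str, config: dict) -> urllib.request.Request:
    """Serialize `config` to JSON and wrap it in a PUT request."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT")


config = {
    "servers": [{
        "listens": ["0.0.0.0:8009"],
        "serverNames": [""],
        "locations": [{
            "location": "/",
            "fault_inject_type": "delay_abort",
            "delay_percentage": 100,
            "abort_percentage": 100,
            "status_code": 405,
            "delay_duration": "5s",
        }],
    }]
}

request = build_put(API_URL, config)
print(request.get_method(), request.full_url)

# To actually send it (requires a reachable OpenNJet instance):
# with urllib.request.urlopen(request, timeout=10) as resp:
#     print(resp.status, resp.read().decode())
```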

Issue the GET request again

curl http://192.168.40.136:8081/config/2/config/http_dyn_fault_inject | jq

{
  "servers": [
    {
      "listens": [
        "0.0.0.0:8009"
      ],
      "serverNames": [
        ""
      ],
      "locations": [
        {
          "location": "/",
          "fault_inject_type": "delay_abort",     # matches the PUT, so the change succeeded
          "delay_percentage": 100,
          "abort_percentage": 100,
          "status_code": 405,
          "delay_duration": "5s"
        }
      ]
    },
    {
      "listens": [
        "0.0.0.0:8008"
      ],
      "serverNames": [
        ""
      ],
      "locations": [
        {
          "location": "/",
          "fault_inject_type": "none",
          "delay_percentage": 100,
          "abort_percentage": 100,
          "status_code": 200,
          "delay_duration": ""
        }
      ]
    }
  ]
}
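
Comparing the GET response with what was PUT can be automated. The Python sketch below looks up one location in a parsed response and checks that the expected fields match. find_location and matches are hypothetical helpers, and the lookup only covers top-level locations, not nested ones.

```python
def find_location(config: dict, listen: str, path: str):
    """Return the top-level location entry for `path` on the server bound to `listen`."""
    for server in config.get("servers", []):
        if listen in server.get("listens", []):
            for loc in server.get("locations", []):
                if loc.get("location") == path:
                    return loc
    return None


def matches(actual: dict, expected: dict) -> bool:
    """True if every expected key is present in `actual` with the same value."""
    return all(actual.get(key) == value for key, value in expected.items())


# Parsed GET response, trimmed to the fields shown in this article.
response = {
    "servers": [
        {"listens": ["0.0.0.0:8009"], "serverNames": [""], "locations": [
            {"location": "/", "fault_inject_type": "delay_abort",
             "delay_percentage": 100, "abort_percentage": 100,
             "status_code": 405, "delay_duration": "5s"}]},
        {"listens": ["0.0.0.0:8008"], "serverNames": [""], "locations": [
            {"location": "/", "fault_inject_type": "none",
             "delay_percentage": 100, "abort_percentage": 100,
             "status_code": 200, "delay_duration": ""}]},
    ]
}

expected = {"fault_inject_type": "delay_abort", "status_code": 405, "delay_duration": "5s"}
applied = matches(find_location(response, "0.0.0.0:8009", "/"), expected)
print(applied)
```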

Actual access through a browser

The browser waits 5s and then receives 405, as expected.

Backend log

# the log also shows a 5s delay followed by abort 405
2023/07/20 15:55:17 [debug] 30305#0: *12  fault inject start deleay, client: 192.168.40.205, server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"
2023/07/20 15:55:22 [debug] 30305#0: *12  fault njet delay success, client: 192.168.40.205, server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"
2023/07/20 15:55:22 [debug] 30305#0: *12  fault injet abort 405, client: 192.168.40.205, server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"
2023/07/20 15:55:22 [debug] 30305#0: *12  fault inject delay timer clean while closing request, client: 192.168.40.205, server: , request: "GET / HTTP/1.1", host: "192.168.40.136:8009"