Dynamically Adjusting the Number of Workers Based on CPU Usage (Part 1): Physical Machines

By 陈潞波 2023-10-11

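Many operations teams will face traffic swings during the upcoming "Double 11" (Singles' Day) shopping festival. Likewise, in telecom-operator scenarios, customer-service load fluctuates across different times of the day. So how can resources be managed effectively between peak and off-peak periods to keep the business stable and performant? Relying on people to watch CPU usage and tune the configuration by hand is hard to sustain in practice.

OpenNJet, an open-source project that evolves NGINX toward cloud native, implements a way to adjust the number of worker processes dynamically based on CPU usage. Requests are answered quickly when the business is busy, and surplus machine resources are released when request volume is low.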

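The module is loaded dynamically as a shared object (.so) on the OpenNJet control plane and is configured through a dedicated directive, sysguard_cpu. When CPU usage crosses the configured thresholds, the number of workers is adjusted dynamically: it is reduced when CPU usage is low and increased when CPU usage is high.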

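Average CPU usage: the average CPU usage of all OpenNJet worker processes. The data is collected from /proc/{pid}/stat, where pid is the process ID of each worker process.

System CPU usage: the combined usage of all CPU cores in the system. The data is collected from /proc/stat.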
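The two metrics can be illustrated with a minimal Python sketch. This is not OpenNJet's implementation: the formulas are inferred from the field names in the sample log later in this article (work = usr + nice + sys, total = work + idle, usage = work delta * 100 / total delta with integer division), and all function names are hypothetical.

```python
def parse_system_stat(text):
    """Return (work, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            usr, nice, sys_, idle = (int(v) for v in fields[1:5])
            work = usr + nice + sys_
            # Assumption taken from the log: total counts only work + idle.
            return work, work + idle
    raise ValueError("no aggregate cpu line found")


def parse_process_stat(text):
    """Return utime + stime + cutime + cstime from /proc/{pid}/stat content."""
    # The command name (field 2) may contain spaces, so split after the last ')'.
    rest = text[text.rfind(")") + 1:].split()
    # rest[0] is field 3 (state); utime..cstime are fields 14..17.
    return sum(int(v) for v in rest[11:15])


def usage_percent(work_delta, total_delta):
    """Integer percentage of work_delta relative to total_delta."""
    if total_delta <= 0:
        return 0
    return work_delta * 100 // total_delta


def average_usage(per_worker_usages):
    """Integer average of the per-worker usage values."""
    if not per_worker_usages:
        return 0
    return sum(per_worker_usages) // len(per_worker_usages)
```

Both samples have to be taken over the same interval: a worker's usage is its work delta divided by the system-wide total delta between two consecutive checks. With the deltas from the sample log, usage_percent(429, 1363) yields 31, which matches the logged cpu_usage:31 for worker 14095.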

Directive design

Directive:

Syntax sysguard_cpu interval={interval} low_threshold={low_threshold} high_threshold={high_threshold} worker_step={worker_step} min_worker={min_worker} max_worker={max_worker} sys_high_threshold={sys_high_threshold}
Default -
Context main

Parameters:

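| Parameter | Type | Default | Min | Max | Description |
| --- | --- | --- | --- | --- | --- |
| interval | int | 1 (minutes) | 1 | - | Check interval; CPU usage is checked once every interval |
| low_threshold | int | 10 (10%) | 10 | - | Lower bound for CPU usage; below this threshold, worker_step workers are removed automatically, limited by min_worker |
| high_threshold | int | 70 (70%) | 10 | - | Upper bound for CPU usage; above this threshold, worker_step workers are added automatically, limited by max_worker |
| worker_step | int | 1 | 1 | - | Number of workers added or removed per adjustment; the resulting count is limited by min_worker and max_worker |
| min_worker | int | 1 | 1 | - | Minimum number of workers |
| max_worker | int | ncpu (number of CPU cores in the system) | 1 | - | Maximum number of workers |
| sys_high_threshold | int | 80 (80%) | 10 | - | Overall system CPU usage; above this value, no workers are added |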
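To make the defaults and minimums concrete, here is a minimal Python sketch that parses a sysguard_cpu directive string into a configuration object. The class name, the function name, and the choice to reject out-of-range values with an error are assumptions made for illustration; the article does not describe how OpenNJet itself reports invalid values.

```python
import os
from dataclasses import dataclass, field


@dataclass
class SysguardCpuConf:
    interval: int = 1              # minutes
    low_threshold: int = 10        # percent
    high_threshold: int = 70       # percent
    worker_step: int = 1
    min_worker: int = 1
    # Default is the number of CPU cores (ncpu).
    max_worker: int = field(default_factory=lambda: os.cpu_count() or 1)
    sys_high_threshold: int = 80   # percent


# Documented minimum for each parameter; no maximum is documented.
MINIMUMS = {
    "interval": 1, "low_threshold": 10, "high_threshold": 10,
    "worker_step": 1, "min_worker": 1, "max_worker": 1,
    "sys_high_threshold": 10,
}


def parse_sysguard_cpu(directive):
    """Parse 'sysguard_cpu key=value ...;' into a SysguardCpuConf."""
    tokens = directive.strip().rstrip(";").split()
    if not tokens or tokens[0] != "sysguard_cpu":
        raise ValueError("not a sysguard_cpu directive")
    values = {}
    for token in tokens[1:]:
        key, sep, raw = token.partition("=")
        if not sep or key not in MINIMUMS:
            raise ValueError(f"unknown parameter: {token}")
        value = int(raw)
        if value < MINIMUMS[key]:
            raise ValueError(f"{key} must be >= {MINIMUMS[key]}")
        values[key] = value
    return SysguardCpuConf(**values)
```

Parameters omitted from the directive keep their defaults, so parse_sysguard_cpu("sysguard_cpu high_threshold=50;") changes only the upper threshold.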

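Other conventions:

• low_threshold <= high_threshold

• min_worker <= max_worker

• How do min_worker and max_worker relate to the statically configured number of workers?

  Until a load-based adjustment is triggered, the number of running workers is the initial, statically configured number.

• With a reasonable configuration, the behavior follows the expected logic: the number of workers increases when CPU usage is high and decreases when CPU usage is low.

• With an unreasonable configuration (for example, the static initial number is smaller than min_worker or larger than max_worker, or the number of workers has been changed dynamically through the API), the number of workers may fall outside the range from min_worker to max_worker:

If a decrease is triggered while the current number of workers is already below min_worker, the number of workers stays unchanged.

If an increase is triggered while the current number of workers is already above max_worker, the number of workers also stays unchanged.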
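The rules above can be summarized in a short Python sketch. It is an illustration, not OpenNJet source code: the function name is hypothetical, and the strict comparisons, the clamping of a step that would overshoot a limit, and the error raised for inconsistent thresholds are assumptions consistent with the parameter descriptions.

```python
def next_worker_count(current, avg_usage, sys_usage, *, low_threshold,
                      high_threshold, worker_step, min_worker, max_worker,
                      sys_high_threshold):
    """Return the worker count to use after one check interval."""
    if low_threshold > high_threshold or min_worker > max_worker:
        raise ValueError("require low_threshold <= high_threshold "
                         "and min_worker <= max_worker")

    if avg_usage > high_threshold:
        # No growth while the whole system is overloaded, or when the
        # count is already at or above max_worker.
        if sys_usage > sys_high_threshold or current >= max_worker:
            return current
        return min(current + worker_step, max_worker)

    if avg_usage < low_threshold:
        # No shrinking when the count is already at or below min_worker.
        if current <= min_worker:
            return current
        return max(current - worker_step, min_worker)

    return current
```

With the settings used in the test further down (low_threshold=11, high_threshold=20, worker_step=2, min_worker=3, max_worker=5, sys_high_threshold=120), an average usage of 31 moves 3 workers to 5, and an average usage of 10 moves 5 workers back to 3.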

Let's test it

ctrl configuration

load_module modules/njt_http_sendmsg_module.so;       # required dependency
load_module modules/njt_ctrl_config_api_module.so;
load_module modules/njt_helper_health_check_module.so;
load_module modules/njt_http_upstream_api_module.so;
load_module modules/njt_http_location_api_module.so;
load_module modules/njt_http_ssl_api_module.so;
load_module modules/njt_doc_module.so;
load_module modules/njt_http_vtsd_module.so;
load_module modules/njt_sysguard_cpu_module.so;       #sysguard_cpu module


user nobody;
sysguard_cpu interval=1 low_threshold=11  high_threshold=20 worker_step=2 min_worker=3 max_worker=5  sys_high_threshold=120;

events {
    worker_connections  1024;
}
error_log         logs/error_ctrl.log info;

http {
    dyn_sendmsg_conf  conf/iot-ctrl.conf;
    access_log        logs/access_ctrl.log combined;

    include           mime.types;

    server {
        listen       8081;
        location /hc {
            health_check_api;
        }

        location /api {
             api write=on;
        }

        location /kv {
            dyn_sendmsg_kv;
        }

        location /config {
            config_api;
        }

        location /ssl {
            dyn_ssl_api;
        }

        location /doc {
            doc_api;
        }
        location /dyn_loc {
            dyn_location_api;
        }

        location /metrics {
            vhost_traffic_status_display;
            vhost_traffic_status_display_format html;
        }
    }

}


cluster_name helper;
node_name node1;

Run a load test with wrk to drive up CPU usage, and observe how the number of workers changes.

./wrk -t 1 -c 500 -d 30s -L http://localhost:9002/

Log:

# Total CPU usage is 100% here
2023/09/14 16:45:21 [info] 14016#0:  total cpu usage:100  usr:18459855  nice:2938  sys:5048704 idle:416296648 work:23511497  prev_work:23510134 total:439808145  pre_total:439806782 work-:1363 total-:1363

# Three worker processes in total (14095, 14096, 14004)
2023/09/14 16:45:21 [info] 14016#0: get all pids:14095_14096_14004_
# CPU usage of each process
2023/09/14 16:45:21 [info] 14016#0:  get process:14095 cpu_usage:31 utime:364 stime:276 cutime:0 cstime:0 work:640 pre_work:211 diff_work:429 diff_total:1363
2023/09/14 16:45:21 [info] 14016#0:  get process:14096 cpu_usage:32 utime:379 stime:277 cutime:0 cstime:0 work:656 pre_work:218 diff_work:438 diff_total:1363
2023/09/14 16:45:21 [info] 14016#0:  get process:14004 cpu_usage:32 utime:1167 stime:753 cutime:0 cstime:0 work:1920 pre_work:1481 diff_work:439 diff_total:1363
2023/09/14 16:45:21 [info] 14016#0:  old pids:14095_14096_14004_ new pids:14095_14096_14004_

# Average usage
2023/09/14 16:45:21 [info] 14016#0:  average cpu usage:31
# The rule is triggered and the number of workers is adjusted from 3 to 5
2023/09/14 16:45:21 [info] 14016#0:  adjust worker num from 3 to 5

# Log after the change to 5 workers
2023/09/14 16:45:36 [info] 14016#0:  total cpu usage:54  usr:18460323  nice:2938  sys:5049015 idle:416297294 work:23512276  prev_work:23511497 total:439809570  pre_total:439808145 work-:779 total-:1425
2023/09/14 16:45:36 [info] 14016#0: get all pids:14095_14096_14215_14004_14216_
2023/09/14 16:45:36 [info] 14016#0:  get process:14095 cpu_usage:11 utime:464 stime:343 cutime:0 cstime:0 work:807 pre_work:640 diff_work:167 diff_total:1425
2023/09/14 16:45:36 [info] 14016#0:  get process:14096 cpu_usage:12 utime:485 stime:344 cutime:0 cstime:0 work:829 pre_work:656 diff_work:173 diff_total:1425
2023/09/14 16:45:36 [info] 14016#0:  get process:14215 cpu_usage:7 utime:56 stime:55 cutime:0 cstime:0 work:111 pre_work:0 diff_work:111 diff_total:1425
2023/09/14 16:45:36 [info] 14016#0:  get process:14004 cpu_usage:12 utime:1269 stime:822 cutime:0 cstime:0 work:2091 pre_work:1920 diff_work:171 diff_total:1425
2023/09/14 16:45:36 [info] 14016#0:  get process:14216 cpu_usage:8 utime:60 stime:54 cutime:0 cstime:0 work:114 pre_work:0 diff_work:114 diff_total:1425
2023/09/14 16:45:36 [info] 14016#0:  old pids:14095_14096_14004_ new pids:14095_14096_14215_14004_14216_


# As CPU usage drops, the number of workers is adjusted again, from 5 back to 3
2023/09/14 16:45:36 [info] 14016#0:  average cpu usage:10
2023/09/14 16:45:36 [info] 14016#0:  adjust worker num from 5 to 3
2023/09/14 16:45:51 [info] 14016#0:  total cpu usage:1  usr:18460332  nice:2938  sys:5049031 idle:416298764 work:23512301  prev_work:23512276 total:439811065  pre_total:439809570 work-:25 total-:1495
2023/09/14 16:45:51 [info] 14016#0: get all pids:14095_14096_14215_
2023/09/14 16:45:51 [info] 14016#0:  get process:14095 cpu_usage:0 utime:465 stime:344 cutime:0 cstime:0 work:809 pre_work:807 diff_work:2 diff_total:1495
2023/09/14 16:45:51 [info] 14016#0:  get process:14096 cpu_usage:0 utime:485 stime:345 cutime:0 cstime:0 work:830 pre_work:829 diff_work:1 diff_total:1495
2023/09/14 16:45:51 [info] 14016#0:  get process:14215 cpu_usage:0 utime:57 stime:56 cutime:0 cstime:0 work:113 pre_work:111 diff_work:2 diff_total:1495
2023/09/14 16:45:51 [info] 14016#0:  old pids:14095_14096_14215_14004_14216_ new pids:14095_14096_14215_
2023/09/14 16:45:51 [info] 14016#0:  old pid:14004 need remove
2023/09/14 16:45:51 [info] 14016#0:  old pid:14004 remove success
2023/09/14 16:45:51 [info] 14016#0:  old pid:14216 need remove
2023/09/14 16:45:51 [info] 14016#0:  old pid:14216 remove success
2023/09/14 16:45:51 [info] 14016#0:  average cpu usage:0

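Results

The tests passed in the following scenarios:

  1. Normal operation
  2. reload
  3. A worker process dies and is restarted (for example, after being killed)
  4. After the number of workers is changed through the dynamic API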