
kt-numa-nodes: with each instance configured to run on a single NUMA node and CPU/GPU resources split per NUMA node, TTFT is extremely high #1938

@poryfly

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[Image: NUMA binding resource information, shown above]

Reproduction

```shell
#!/bin/bash
source /data/miniconda3/etc/profile.d/conda.sh
conda activate kt_infer_new
#conda activate kt_infer_latest
export SGLANG_SET_CPU_AFFINITY=0

#######################################################################
# Configuration - adjust for your actual environment
#######################################################################

# Model path
MODEL_PATH="/data/.cache/models/Qwen/Qwen3.5-122B-A10B-FP8"

# GPU assignment (adjust based on the detection script's output)
# Assumption: GPUs 0-3 are on NUMA node 0, GPUs 4-7 on NUMA node 1
GPU_GROUP_0="0,1,2,3"  # GPUs on NUMA node 0
GPU_GROUP_1="4,5,6,7"  # GPUs on NUMA node 1

# NUMA node configuration
NUMA_NODE_0=0
NUMA_NODE_1=1

# CPU core ranges (adjust based on the detection script's output)
# Example: 48 physical cores per NUMA node plus their hyperthread siblings
CPU_CORES_0="0-47,96-143"    # CPU cores on NUMA node 0
CPU_CORES_1="48-95,144-191"  # CPU cores on NUMA node 1

# Port configuration
PORT_0=8173
PORT_1=9173

# Tensor-parallel size (number of GPUs per instance)
TP_SIZE=4

# Additional SGLang arguments
HOST="0.0.0.0"
extra_args=(
  "--fp8-gemm-backend" "cutlass"
  "--kt-expert-placement-strategy" "uniform"
  "--max-total-tokens" "8192"
  "--disable-shared-experts-fusion"
  "--speculative-algo" "NEXTN"
  "--speculative-num-steps" "3"
  "--speculative-eagle-topk" "1"
  "--speculative-num-draft-tokens" "4"
  "--disable-custom-all-reduce"
  "--kt-cpuinfer" "48"
  "--kt-threadpool-count" "1"
  "--kt-num-gpu-experts" "2"
  "--kt-method" "FP8"
  "--served-model-name" "qwen-3.5"
  "--kt-gpu-prefill-token-threshold" "1024"
  "--attention-backend" "triton"
  "--trust-remote-code"
  "--mem-fraction-static" "0.8"
  "--chunked-prefill-size" "4096"
  "--max-running-requests" "32"
  "--enable-mixed-chunk"
  "--enable-p2p-check"
)

#######################################################################
# Launch functions
#######################################################################
start_instance_0() {
  echo "Starting instance 0: GPUs $GPU_GROUP_0, NUMA node $NUMA_NODE_0, port $PORT_0"

  CUDA_VISIBLE_DEVICES=$GPU_GROUP_0 \
  numactl --physcpubind=$CPU_CORES_0 --membind=$NUMA_NODE_0 \
  python -m sglang.launch_server \
      --model $MODEL_PATH \
      --kt-weight-path $MODEL_PATH \
      --kt-numa-nodes $NUMA_NODE_0 \
      --tp $TP_SIZE \
      --host $HOST \
      --port $PORT_0 \
      "${extra_args[@]}" > logs/instance_0.log 2>&1 &

  PID_0=$!
  echo "Instance 0 PID: $PID_0"
  echo $PID_0 > logs/instance_0.pid
}

start_instance_1() {
  echo "Starting instance 1: GPUs $GPU_GROUP_1, NUMA node $NUMA_NODE_1, port $PORT_1"

  CUDA_VISIBLE_DEVICES=$GPU_GROUP_1 \
  numactl --physcpubind=$CPU_CORES_1 --membind=$NUMA_NODE_1 \
  python -m sglang.launch_server \
      --model $MODEL_PATH \
      --kt-weight-path $MODEL_PATH \
      --kt-numa-nodes $NUMA_NODE_1 \
      --tp $TP_SIZE \
      --host $HOST \
      --port $PORT_1 \
      "${extra_args[@]}" > logs/instance_1.log 2>&1 &

  PID_1=$!
  echo "Instance 1 PID: $PID_1"
  echo $PID_1 > logs/instance_1.pid
}

#######################################################################
# Main
#######################################################################

# Create the log directory
mkdir -p logs

echo "=========================================="
echo "Starting both SGLang servers"
echo "=========================================="

# Start instance 0
start_instance_0
sleep 10  # wait for the first instance to initialize

# Start instance 1
start_instance_1
echo -e "\n=========================================="
echo "Servers started"
echo "=========================================="
echo "Instance 0: http://$HOST:$PORT_0 (NUMA node $NUMA_NODE_0)"
echo "Instance 1: http://$HOST:$PORT_1 (NUMA node $NUMA_NODE_1)"
echo ""
echo "Tail the logs:"
echo "  tail -f logs/instance_0.log"
echo "  tail -f logs/instance_1.log"
echo ""
echo "Check process status:"
echo "  ps aux | grep sglang"
echo ""
echo "Verify NUMA binding:"
echo "  numastat -p $(cat logs/instance_0.pid)"
echo "  numastat -p $(cat logs/instance_1.pid)"
```
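The hard-coded `CPU_CORES_0`/`CPU_CORES_1` values follow a common layout: each NUMA node owns a contiguous block of physical cores, and each core's hyperthread sibling sits at an offset equal to the machine's total physical core count. Purely as an illustration (this helper is not part of the original script, and assumes that symmetric layout), the ranges could be derived instead of typed by hand:

```shell
#!/bin/sh
# Illustrative helper: derive the --physcpubind range for one NUMA node,
# assuming node N owns physical cores [N*per_node, (N+1)*per_node) and
# each core's hyperthread sibling is offset by the total physical count.
cpu_range_for_node() {
  node=$1
  per_node=$2
  total_phys=$3
  lo=$((node * per_node))
  hi=$((lo + per_node - 1))
  printf '%d-%d,%d-%d\n' "$lo" "$hi" "$((lo + total_phys))" "$((hi + total_phys))"
}

cpu_range_for_node 0 48 96   # -> 0-47,96-143  (matches CPU_CORES_0)
cpu_range_for_node 1 48 96   # -> 48-95,144-191 (matches CPU_CORES_1)
```

Check the real sibling mapping with `lscpu -e` or `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list` before trusting the offset assumption.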
The script above starts two servers split by NUMA node. Under load testing, TTFT becomes extremely high, and the increase is mostly in the prepare phase.

[Image: prepare-time measurements] Before resources were split by NUMA node, the same load test showed a prepare time of only about 50 ms.

That is nearly a 10x increase. In principle, with CPU and GPU resources partitioned per NUMA node there is no contention, so this latency should not grow.
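Since the GPU placement in the script is explicitly an assumption ("GPUs 0-3 on NUMA node 0"), it may be worth verifying it before ruling out cross-node traffic. A minimal sketch that reads the standard sysfs `numa_node` attribute for each PCI device (the optional base-directory argument exists only so the helper can be exercised outside a real machine); filter the output by the GPU bus IDs reported by `nvidia-smi`:

```shell
#!/bin/sh
# Sketch: print the NUMA node of each PCI device from sysfs.
# Compare against the GPU bus IDs shown by nvidia-smi to confirm
# which node each GPU actually sits on.
gpu_numa_map() {
  base="${1:-/sys/bus/pci/devices}"  # overridable base dir (for testing)
  for dev in "$base"/*; do
    [ -f "$dev/numa_node" ] || continue
    printf '%s -> NUMA node %s\n' "$(basename "$dev")" "$(cat "$dev/numa_node")"
  done
}

gpu_numa_map
```

On machines where the firmware does not report affinity, `numa_node` reads `-1`; in that case `nvidia-smi topo -m` (which prints a NUMA affinity column) is the fallback.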



Others

_No response_
