
fix health check bug #13136

Open
guilongyang wants to merge 1 commit into apache:master from guilongyang:master

Conversation

@guilongyang

Description

Hi @Baoyuantop. Based on the 3.15.0-ubuntu Docker image, I have verified the changes in this PR and confirmed that the issue has been resolved.

Which issue(s) this PR fixes:

Fixes #13101

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Mar 31, 2026
@Baoyuantop
Contributor

The current PR does not fix the root cause as described in the original issue. Please modify the technical solution.

@Baoyuantop
Contributor

Please revise your technical solution and reopen the PR. Thank you for your contribution.

@Baoyuantop Baoyuantop closed this Apr 7, 2026
@guilongyang
Author

Sorry for the late reply.

Technical Solution: Fix ai-proxy-multi Health Check Mechanism

Root Cause

The ai-proxy-multi plugin's health check mechanism has a structural bug:

  1. construct_upstream, which runs in timer context (reading from etcd), cannot access the _dns_value field
  2. That field is only set by resolve_endpoint in request context
  3. When health checker creation fails, the resource is permanently removed from waiting_pool, so it is never retried
  4. In the original code, fetch_checker returns nil and the resource is removed from waiting_pool without ever being retried

Solution

I. Changes to healthcheck_manager.lua

1. New SHM Proxy Checker (Lines 127-174)

```lua
-- Create a proxy checker object that reads health status directly from SHM.
-- This is used when the checker is not available in the current worker's local working_pool.
local function create_shm_proxy_checker(resource_path)
    -- Returns a proxy object that reads health status directly from shared memory
    -- Uses the internal key format of lua-resty-healthcheck
end

function _M.get_target_status_by_path(resource_path, ip, port, hostname)
    -- Query health status based on resource_path
end
```

Purpose:

  • Returns a proxy checker instead of nil when the checker is not yet created
  • Proxy reads health status directly from SHM, avoiding caller blocking while waiting
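
A minimal sketch of what such a proxy could look like. This is illustrative only: the SHM zone name and key layout below are assumptions, and the real implementation must match lua-resty-healthcheck's internal key format.

```lua
-- Hypothetical sketch; the SHM zone name and key format are assumptions,
-- not the actual lua-resty-healthcheck internals.
local function create_shm_proxy_checker(resource_path)
    local shm = ngx.shared["healthcheck-shm"]  -- assumed SHM zone name
    return {
        get_target_status = function(_, ip, port, hostname)
            local key = resource_path .. ":" .. ip .. ":" .. port
                        .. (hostname and (":" .. hostname) or "")
            local status = shm:get(key)
            -- A target with no recorded status defaults to healthy,
            -- so traffic is not dropped while the real checker starts up
            return status ~= false
        end,
    }
end
```

The important property is that the proxy exposes the same get_target_status signature as a real lua-resty-healthcheck checker, so callers do not need to distinguish the two.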

2. Modified fetch_checker Function (Lines 103-113)

```lua
-- Check if this is a known waiting resource with matching version
if waiting_pool[resource_path] == resource_ver then
    -- A checker is being created, but we need an immediate response.
    -- Return a proxy that reads from SHM while waiting.
    return create_shm_proxy_checker(resource_path)
end

-- First time requesting this resource: add it to the waiting pool.
-- The real checker is being created asynchronously, so nothing useful
-- can be returned yet.
waiting_pool[resource_path] = resource_ver
return nil

Purpose:

  • When resource is in waiting_pool, subsequent requests can immediately read status from SHM
  • Prevents callers from blocking due to checker not being created yet
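
From the caller's perspective, fetch_checker is now effectively non-blocking. A hypothetical call site (names are illustrative) could look like:

```lua
-- Hypothetical call site, for illustration only
local checker = healthcheck_manager.fetch_checker(resource_path, resource_ver)
if checker then
    -- checker is either the real lua-resty-healthcheck checker or the
    -- SHM proxy; both answer get_target_status(ip, port, hostname)
    local healthy = checker:get_target_status(node.host, node.port, host)
    if not healthy then
        -- skip this node
    end
end
```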

3. Added Error Logging (Lines 293-298)

```lua
upstream = plugin.construct_upstream(upstream_constructor_config)
if not upstream then
    core.log.error("[healthcheck] default upstream is nil | resource_path: ", resource_path,
                   " | upstream_constructor_config: ", core.json.encode(upstream_constructor_config))
end
```

II. Changes to ai-proxy-multi.lua

1. Modified fetch_health_instances Function (Lines 236-282)

```lua
local function fetch_health_instances(conf, checkers)
    local instances = conf.instances
    local new_instances = core.table.new(0, #instances)

    for i, ins in ipairs(instances) do
        local checker = checkers and checkers[ins.name]
        -- port and host (used below) are derived from the instance's endpoint config
        if checker then
            -- Checker exists: get health status via fetch_node_status
            local node = ins._dns_value
            local ok, err = healthcheck_manager.fetch_node_status(checker, node.host, port or node.port, host)
            if ok then
                transform_instances(new_instances, ins)
            end
        elseif checker == nil then
            -- Checker is being created: read health status from SHM
            local node = ins._dns_value
            local idx = ins._index or (i - 1)
            local resource_path = conf._meta.parent.resource_key ..
                                  "#plugins['ai-proxy-multi'].instances[" .. idx .. "]"
            local ok = healthcheck_manager.get_target_status_by_path(resource_path, node.host, port or node.port, host)
            if ok then
                transform_instances(new_instances, ins)
            end
        else
            -- No checker configured: include instance by default
            transform_instances(new_instances, ins)
        end
    end
    -- ...
end
```

Three states handled:

| State | Handling |
| --- | --- |
| Checker created | fetch_node_status to get status |
| Checker being created | get_target_status_by_path from SHM |
| No checker configured | Include by default |

2. Modified get_checkers_status_ver Function (Lines 223-232)

```lua
local function get_checkers_status_ver(checkers)
    local status_ver_total = 0
    for _, checker in pairs(checkers) do
        -- Only count status_ver from real checkers (proxy checkers don't have this attribute)
        if checker.status_ver then
            status_ver_total = status_ver_total + checker.status_ver
        end
    end
    return status_ver_total
end
```

Reason: the proxy checker has no status_ver attribute, so the field must be checked before summing.

3. Modified pick_target Function (Lines 356-359)

```lua
instances[i]._dns_value = instance._dns_value
instances[i]._nodes_ver = instance._nodes_ver
instances[i]._index = i - 1  -- Store index for later use
```

Key points:

  • _dns_value: Syncs node info computed in request context to config object
  • _index: Stores instance array index for building correct resource_path
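
As a concrete illustration of how _index feeds into the path, assume a hypothetical route whose resource_key is "/apisix/routes/1"; its second instance (index 1) would then map to:

```lua
-- Illustrative only; the actual resource_key depends on the route
local resource_path = "/apisix/routes/1"
    .. "#plugins['ai-proxy-multi'].instances[1]"
-- → "/apisix/routes/1#plugins['ai-proxy-multi'].instances[1]"
```

This string must be byte-for-byte identical on both sides (timer and request context), which is why the index is stored rather than recomputed.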

Core Mechanism: Why _dns_value Can Be Correctly Accessed in Timer Context

```
Request Context:
  resolve_endpoint(instance) → sets instance._dns_value
  ↓
  instances[i]._dns_value = instance._dns_value   [syncs to the etcd copy]
  ↓
  res_conf.value.plugins['ai-proxy-multi'].instances[i]._dns_value is set

Timer Context:
  res_conf.value fetched from etcd
  ↓
  jp.value(res_conf.value, json_path) extracts the instance config
  ↓
  upstream_constructor_config._dns_value exists ← because it was synced
  ↓
  construct_upstream works correctly
```

Key Understanding:

  • On first request, _dns_value is computed by resolve_endpoint and set
  • Immediately afterward, instances[i]._dns_value = instance._dns_value writes the value to the config object
  • When timer triggers, the config read from etcd already contains _dns_value
  • The entire flow is "lazy initialization + state persistence", not recalculation each time
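
The "lazy initialization + state persistence" pattern can be reduced to a few lines. This is a schematic, not the actual plugin code:

```lua
-- Schematic of the pattern, not the actual plugin code
if not instance._dns_value then
    -- compute once, in request context
    instance._dns_value = resolve_endpoint(instance)
end
-- persist onto the shared config object so other contexts can see it
instances[i]._dns_value = instance._dns_value
-- later, in timer context, the persisted value is read back from the
-- config instead of being recomputed
```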

Summary

| Problem | Solution |
| --- | --- |
| _dns_value missing in timer context | Sync the computed value to the config object in request context; the timer reads the synced value |
| Health checker lost after creation failure | Add an SHM proxy so status can be read even if the checker is not created |
| Caller blocks waiting for checker | Return a proxy checker that is immediately usable |
| status_ver attribute missing | Add a presence check to skip proxy checkers |

@guilongyang
Author

> Please revise your technical solution and reopen the PR. Thank you for your contribution.

I don't have permission to reopen the PR. What should I do next?

@Baoyuantop Baoyuantop reopened this Apr 7, 2026