
fix health check bug #13136

Open
guilongyang wants to merge 1 commit into apache:master from guilongyang:master

Conversation

@guilongyang

Description

Hi @Baoyuantop. Based on the 3.15.0-ubuntu Docker image, I have verified the changes in this PR and confirmed that the issue has been resolved.

Which issue(s) this PR fixes:

Fixes #13101

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Mar 31, 2026
@Baoyuantop
Contributor

The current PR does not fix the root cause as described in the original issue. Please modify the technical solution.

@Baoyuantop
Contributor

Please revise your technical solution and reopen the PR. Thank you for your contribution.

@Baoyuantop Baoyuantop closed this Apr 7, 2026
@guilongyang
Author

Sorry for the late reply.

Technical Solution: Fix ai-proxy-multi Health Check Mechanism

Root Cause

The ai-proxy-multi plugin's health check mechanism has a structural bug:

  1. construct_upstream, which runs in timer context (reading from etcd), cannot access the _dns_value field
  2. That field is only set by resolve_endpoint in request context
  3. When health checker creation fails, the resource is permanently removed from waiting_pool, so it is never retried
  4. In the original code, fetch_checker returns nil and the resource is removed from waiting_pool without ever being retried

Solution

I. Changes to healthcheck_manager.lua

1. New SHM Proxy Checker (Lines 127-174)

```lua
-- Create a proxy checker object that reads health status directly from SHM.
-- This is used when the checker is not available in the current worker's local working_pool.
local function create_shm_proxy_checker(resource_path)
    -- Returns a proxy object that reads health status directly from shared memory
    -- Uses the internal key format of lua-resty-healthcheck
end

function _M.get_target_status_by_path(resource_path, ip, port, hostname)
    -- Query health status based on resource_path
end
```

Purpose:

  • Returns a proxy checker instead of nil when the checker is not yet created
  • Proxy reads health status directly from SHM, avoiding caller blocking while waiting
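
A minimal sketch of what such a proxy could look like. This is illustrative only: the SHM zone name and key layout below are assumptions, and the real implementation must match lua-resty-healthcheck's internal key format.

```lua
-- Hypothetical sketch; the SHM zone name and key format are assumptions,
-- not the actual lua-resty-healthcheck internals.
local function create_shm_proxy_checker(resource_path)
    local shm = ngx.shared["healthcheck-shm"]  -- assumed SHM zone name
    return {
        get_target_status = function(_, ip, port, hostname)
            local key = resource_path .. ":" .. ip .. ":" .. port
                        .. (hostname and (":" .. hostname) or "")
            local status = shm:get(key)
            -- A target with no recorded status defaults to healthy,
            -- so traffic is not dropped while the real checker starts up
            return status ~= false
        end,
    }
end
```

The important property is that the proxy exposes the same get_target_status signature as a real lua-resty-healthcheck checker, so callers do not need to distinguish the two.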

2. Modified fetch_checker Function (Lines 103-113)

```lua
-- Check if this is a known waiting resource with matching version
if waiting_pool[resource_path] == resource_ver then
    -- A checker is being created, but we need an immediate response.
    -- Return a proxy that reads from SHM while waiting.
    return create_shm_proxy_checker(resource_path)
end

-- First time requesting this resource: add it to the waiting pool.
-- The real checker is being created asynchronously, so nothing useful
-- can be returned yet.
waiting_pool[resource_path] = resource_ver
return nil

Purpose:

  • When resource is in waiting_pool, subsequent requests can immediately read status from SHM
  • Prevents callers from blocking due to checker not being created yet
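
From the caller's perspective, fetch_checker is now effectively non-blocking. A hypothetical call site (names are illustrative) could look like:

```lua
-- Hypothetical call site, for illustration only
local checker = healthcheck_manager.fetch_checker(resource_path, resource_ver)
if checker then
    -- checker is either the real lua-resty-healthcheck checker or the
    -- SHM proxy; both answer get_target_status(ip, port, hostname)
    local healthy = checker:get_target_status(node.host, node.port, host)
    if not healthy then
        -- skip this node
    end
end
```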

3. Added Error Logging (Lines 293-298)

```lua
upstream = plugin.construct_upstream(upstream_constructor_config)
if not upstream then
    core.log.error("[healthcheck] default upstream is nil | resource_path: ", resource_path,
                   " | upstream_constructor_config: ", core.json.encode(upstream_constructor_config))
end
```

II. Changes to ai-proxy-multi.lua

1. Modified fetch_health_instances Function (Lines 236-282)

```lua
local function fetch_health_instances(conf, checkers)
    local instances = conf.instances
    local new_instances = core.table.new(0, #instances)

    for i, ins in ipairs(instances) do
        local checker = checkers and checkers[ins.name]
        -- port and host (used below) are derived from the instance's endpoint config
        if checker then
            -- Checker exists: get health status via fetch_node_status
            local node = ins._dns_value
            local ok, err = healthcheck_manager.fetch_node_status(checker, node.host, port or node.port, host)
            if ok then
                transform_instances(new_instances, ins)
            end
        elseif checker == nil then
            -- Checker is being created: read health status from SHM
            local node = ins._dns_value
            local idx = ins._index or (i - 1)
            local resource_path = conf._meta.parent.resource_key ..
                                  "#plugins['ai-proxy-multi'].instances[" .. idx .. "]"
            local ok = healthcheck_manager.get_target_status_by_path(resource_path, node.host, port or node.port, host)
            if ok then
                transform_instances(new_instances, ins)
            end
        else
            -- No checker configured: include instance by default
            transform_instances(new_instances, ins)
        end
    end
    -- ...
end
```

Three states handled:

| State | Handling |
| --- | --- |
| Checker created | fetch_node_status to get status |
| Checker being created | get_target_status_by_path from SHM |
| No checker configured | Include by default |

2. Modified get_checkers_status_ver Function (Lines 223-232)

```lua
local function get_checkers_status_ver(checkers)
    local status_ver_total = 0
    for _, checker in pairs(checkers) do
        -- Only count status_ver from real checkers (proxy checkers don't have this attribute)
        if checker.status_ver then
            status_ver_total = status_ver_total + checker.status_ver
        end
    end
    return status_ver_total
end
```

Reason: the proxy checker has no status_ver attribute, so the field must be checked before summing.

3. Modified pick_target Function (Lines 356-359)

```lua
instances[i]._dns_value = instance._dns_value
instances[i]._nodes_ver = instance._nodes_ver
instances[i]._index = i - 1  -- Store index for later use
```

Key points:

  • _dns_value: Syncs node info computed in request context to config object
  • _index: Stores instance array index for building correct resource_path
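
As a concrete illustration of how _index feeds into the path, assume a hypothetical route whose resource_key is "/apisix/routes/1"; its second instance (index 1) would then map to:

```lua
-- Illustrative only; the actual resource_key depends on the route
local resource_path = "/apisix/routes/1"
    .. "#plugins['ai-proxy-multi'].instances[1]"
-- → "/apisix/routes/1#plugins['ai-proxy-multi'].instances[1]"
```

This string must be byte-for-byte identical on both sides (timer and request context), which is why the index is stored rather than recomputed.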

Core Mechanism: Why _dns_value Can Be Correctly Accessed in Timer Context

```
Request Context:
  resolve_endpoint(instance) → sets instance._dns_value
  ↓
  instances[i]._dns_value = instance._dns_value   [syncs to the etcd copy]
  ↓
  res_conf.value.plugins['ai-proxy-multi'].instances[i]._dns_value is set

Timer Context:
  res_conf.value fetched from etcd
  ↓
  jp.value(res_conf.value, json_path) extracts the instance config
  ↓
  upstream_constructor_config._dns_value exists ← because it was synced
  ↓
  construct_upstream works correctly
```

Key Understanding:

  • On first request, _dns_value is computed by resolve_endpoint and set
  • Immediately afterward, instances[i]._dns_value = instance._dns_value writes the value to the config object
  • When timer triggers, the config read from etcd already contains _dns_value
  • The entire flow is "lazy initialization + state persistence", not recalculation each time
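
The "lazy initialization + state persistence" pattern can be reduced to a few lines. This is a schematic, not the actual plugin code:

```lua
-- Schematic of the pattern, not the actual plugin code
if not instance._dns_value then
    -- compute once, in request context
    instance._dns_value = resolve_endpoint(instance)
end
-- persist onto the shared config object so other contexts can see it
instances[i]._dns_value = instance._dns_value
-- later, in timer context, the persisted value is read back from the
-- config instead of being recomputed
```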

Summary

| Problem | Solution |
| --- | --- |
| _dns_value missing in timer context | Sync the computed value to the config object in request context; the timer reads the synced value |
| Health checker lost after creation failure | Add an SHM proxy so status can be read even if the checker is not created |
| Caller blocks waiting for checker | Return a proxy checker that is immediately usable |
| status_ver attribute missing | Add a presence check to skip proxy checkers |

@guilongyang
Author

> Please revise your technical solution and reopen the PR. Thank you for your contribution.

I don't have permission to reopen the PR. What should I do next?

@Baoyuantop Baoyuantop reopened this Apr 7, 2026