Discussion: testing strategy for instructlab/training as a backend library #705

@gkneighb

Picking up from the abandoned #404 per @RobotSail's invitation in that thread. The original PR drafted a four-tier testing methodology when this repo was the testing surface for the ilab CLI; that framing is stale now that training_hub is the user-facing front door and instructlab/training is positioned as one of several training backends.

This issue is to open a fresh discussion of what testing in this repo should actually look like given the current scope. Sketching a starting position — please push back, this is not a proposal yet.

What changed for testing

  • User-facing tests should live in training_hub, not here. Anything that exercises the run_training() interface contract from a user perspective belongs at the algorithm-level interface, not the SFT backend.
  • This repo's job is now narrower: correctness of SFT mechanics on a given backend matrix (FSDP/DeepSpeed × accelerator variant × feature flag set). The integration story is training_hub's problem.
  • Convergence is the harder problem. Smoke tests can verify that "it doesn't crash"; what training_hub can't verify for us is that this backend actually produces a model whose loss curve looks right on a fixed-seed run.
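
To make the "backend matrix" in the second bullet concrete, here is a minimal sketch of how the matrix and a hand-picked verified list could be represented. Everything in it is hypothetical: `MatrixEntry`, `full_matrix`, `VERIFIED`, and the axis values are illustrative names, not anything that exists in this repo today.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MatrixEntry:
    """One cell of the (hypothetical) backend matrix."""

    backend: str                      # e.g. "fsdp" or "deepspeed"
    accelerator: str                  # e.g. "nvidia" or "amd"
    features: frozenset = frozenset() # e.g. frozenset({"lora"})

    @property
    def name(self) -> str:
        # Stable, human-readable ID usable as a CI job name.
        return "-".join([self.accelerator, self.backend, *sorted(self.features)])


# Assumed axis values, for illustration only.
BACKENDS = ("fsdp", "deepspeed")
ACCELERATORS = ("nvidia", "amd")
FEATURE_SETS = (frozenset(), frozenset({"lora"}))


def full_matrix() -> list:
    """Every combination of backend x accelerator x feature set."""
    return [
        MatrixEntry(b, a, f)
        for b, a, f in product(BACKENDS, ACCELERATORS, FEATURE_SETS)
    ]


# The verified list is a reviewed subset of the full matrix,
# chosen to cover the unique code paths rather than every cell.
VERIFIED = (
    MatrixEntry("fsdp", "nvidia"),
    MatrixEntry("fsdp", "nvidia", frozenset({"lora"})),
    MatrixEntry("deepspeed", "amd"),
)
```

Keeping the verified list as plain data next to the full product means a PR that adds an accelerator or feature flag shows up as a small, reviewable diff.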

Sketched tiers

This is the part most likely to need revision — calling out the shape, not the details.

  1. Linting / type checks. Pre-test gate. Run on every PR.
  2. Unit tests. No GPU. Verify isolated mechanics — config serialization, batch metric accumulation, the BatchLossManager class contracts, checkpoint round-trip on a CPU-faked model. Always run.
  3. Smoke tests. GPU-required, per matrix entry. Verify that training runs end-to-end without crashing on a verified configuration list (a handful of named configs that cover the unique code paths — NVIDIA FSDP, NVIDIA FSDP+LoRA, AMD DeepSpeed, etc.). Hard cap somewhere around 30 minutes per entry, target under 10. Block PRs.
  4. Convergence checks. Per matrix entry, short fixed-seed runs on a fixed dataset, loss curve exported to a stable artifact store, compared against a baseline. The baseline refreshes deliberately on new arch, major dep bump, or training-loop change. Catches the "garbage gradients on accelerator X" case that smoke tests miss.
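
As a starting point for tier 4, the baseline comparison could be as simple as a per-step tolerance check. This is only a sketch under assumptions: the function name `compare_loss_curves`, the tolerance values, and the idea that a curve is a plain list of per-step losses are all placeholders, and a real check might compare smoothed curves or final-window averages instead.

```python
import math


def compare_loss_curves(baseline, candidate, rel_tol=0.05, abs_tol=1e-3):
    """Return the steps where the candidate diverges from the baseline.

    Each curve is a list of per-step loss values from a fixed-seed run.
    A step fails if the candidate loss is not finite (NaN or inf) or is
    outside the given tolerance of the baseline loss at the same step.
    """
    if len(baseline) != len(candidate):
        raise ValueError(
            f"curve length mismatch: {len(baseline)} vs {len(candidate)}"
        )

    failures = []
    for step, (b, c) in enumerate(zip(baseline, candidate)):
        within_tol = math.isclose(b, c, rel_tol=rel_tol, abs_tol=abs_tol)
        if not math.isfinite(c) or not within_tol:
            failures.append((step, b, c))
    return failures


if __name__ == "__main__":
    baseline = [2.0, 1.5, 1.0]
    candidate = [2.02, 1.49, 1.01]
    # An empty list means the run tracks the baseline within tolerance.
    print(compare_loss_curves(baseline, candidate))
```

The tolerances are the part that needs real data: too tight and nondeterministic kernels cause flakes, too loose and the "garbage gradients" case slips through.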

Downstream model benchmarks (MMLU, MT-Bench) live in training_hub or release qualification, not here.

Open questions

  1. training_hub integration tests: what does that side already cover, and where does it currently break or depend on this repo's tests to catch failures before integration?
  2. Verified configurations list: who maintains it? Living in this repo seems right but it has to be reviewable by anyone adding a new accelerator or feature.
  3. Convergence baseline storage: where does the baseline live? GitHub Actions artifact cache (with retention rules) is one option; an external S3 bucket is another. Different cost / reproducibility / sharing tradeoffs.
  4. Existing e2e job: #328 ("e2e test takes over an hour") has prior discussion about caching SDG-CI's pre-generated dataset. That predates the training_hub split. Is it still the right path, or does training_hub's user-level e2e make the in-repo long-running e2e job mostly redundant?

Happy to draft a more concrete proposal once there's directional agreement, or to pick up a single piece (e.g. convergence-check prototype, smoke matrix definition) if that's more useful than a big-doc-first approach.
