UPSTREAM PR #19317: cleanup llama-quantize --help output#1158

Open
loci-dev wants to merge 5 commits into main from loci/pr-19317-llama-quantize-help-cleanup

Conversation

loci-dev commented Feb 8, 2026

Note

Source pull request: ggml-org/llama.cpp#19317

More pleasant formatting for the output of llama-quantize --help.

Before this PR:

[Screenshot: llama-quantize --help output before the change]

After this PR:

[Screenshot: llama-quantize --help output after the change] (this image was previously wrong; it has been updated to match the current code)
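
The screenshots don't render here, so below is a minimal C sketch of the kind of cleanup involved: a single run-on usage line versus one aligned printf call per option. The flags are modeled on real llama-quantize options, but the layout and wording are illustrative, not the PR's actual output.

    /* Hypothetical sketch of the help-text cleanup -- not the actual
     * llama-quantize diff; flags modeled on the real tool's options. */
    #include <stdio.h>

    static void usage_before(const char * executable) {
        /* everything crammed into one line, hard to scan */
        printf("usage: %s [--help] [--allow-requantize] [--leave-output-tensor] "
               "[--pure] model-f32.gguf [model-quant.gguf] type [nthreads]\n",
               executable);
    }

    static void usage_after(const char * executable) {
        printf("usage: %s [OPTIONS] model-f32.gguf [model-quant.gguf] type [nthreads]\n\n",
               executable);
        printf("options:\n");
        /* one printf per option; a fixed field width keeps descriptions aligned */
        printf("  %-24s %s\n", "--help",                "show this help and exit");
        printf("  %-24s %s\n", "--allow-requantize",    "allow requantizing already-quantized tensors");
        printf("  %-24s %s\n", "--leave-output-tensor", "leave the output tensor unquantized");
        printf("  %-24s %s\n", "--pure",                "disable k-quant mixtures, quantize all tensors to the same type");
    }

    int main(void) {
        usage_before("llama-quantize");
        printf("\n");
        usage_after("llama-quantize");
        return 0;
    }

Splitting the text into one call per line is what drives the printf count up (the 22 → 43 noted in the review below) while making both the source and the output far easier to maintain.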

loci-review bot commented Feb 8, 2026

Overview

Analysis of 115,630 functions across 47 commits reveals minimal performance impact. Only 10 functions were modified (0.009%), all within the llama-quantize offline utility. No performance-critical inference pathway was affected.

Power Consumption Changes:

  • build.bin.llama-quantize: -0.11% (-48.02 nJ)
  • build.bin.llama-cvector-generator: 0.00%
  • build.bin.libmtmd.so: -0.00%
  • build.bin.llama-tts: 0.00%
  • build.bin.libllama.so: 0.00%
  • build.bin.llama-tokenize: 0.00%
  • build.bin.llama-qwen2vl-cli: 0.00%
  • build.bin.llama-gguf-split: 0.00%
  • build.bin.llama-llava-cli: 0.00%
  • build.bin.llama-minicpmv-cli: 0.00%
  • build.bin.llama-gemma3-cli: 0.00%
  • build.bin.libggml.so: 0.00%
  • build.bin.libggml-cpu.so: 0.00%
  • build.bin.libggml-base.so: 0.00%
  • build.bin.llama-bench: 0.00%

Function Analysis

Significant Improvements (Compiler Optimizations):

  • std::vector<common_adapter_lora_info>::end(): Response time -52.67% (-90.68ns), throughput time -60.27% (-90.68ns)
  • __val_comp_iter token comparator: Response time -49.49% (-117.39ns), throughput time -57.91% (-117.39ns)
  • regex_traits::operator|: Response time -24.95% (-46.79ns), throughput time -29.75% (-46.79ns)
  • std::vector<unsigned long>::end(): Response time -18.06% (-17.96ns), throughput time -23.10% (-17.96ns)

These improvements occurred without source code changes, indicating toolchain optimization benefits.

Intentional Regression:

  • usage() help text function: Response time +23.09% (+258.97ns), throughput time +35.95% (+95.70ns). Commit e34fe51 roughly doubled the number of printf calls (22 → 43) to improve the help text formatting. Zero practical impact, since the function runs only when --help is requested (see the sanity check below).
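
A rough sanity check (my arithmetic, not part of the report): +258.97 ns spread over the 21 added printf calls comes to about 12 ns per call (258.97 / 21 ≈ 12.3), a plausible fixed cost for buffered stdio, and one paid only once per --help invocation.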

Minor Regression:

  • iterator::operator- for kv_override: Response time +27.30% (+30.87ns), throughput time +33.78% (+33.78ns). Caused by sanitizer instrumentation from build system refactoring (commit 423bee4), affecting only debug builds.

Other analyzed functions showed improvements under 7% with negligible absolute changes.

Additional Findings

Extensive GPU backend work (30+ commits across Metal, Vulkan, CUDA) delivered Flash Attention optimizations, kernel consolidation, and bug fixes without impacting analyzed functions. This confirms proper architectural separation between GPU compute kernels and CPU-side utilities. All modified functions are in non-critical paths; core inference libraries (libllama.so, libggml-cpu.so) show zero power consumption change, confirming stability of performance-critical components.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-dev force-pushed the main branch 3 times, most recently from ef7afbe to d4c3480 on February 14, 2026 at 02:16
