@kubraaksux
Generic LLM benchmark suite for evaluating inference performance across different backends (vLLM, Ollama, OpenAI, MLX).

Features:

- Multiple workload categories: math (GSM8K), reasoning (BoolQ, LogiQA), summarization (XSum, CNN/DM), JSON extraction
- Pluggable backend architecture for different inference engines (see the sketch after this list)
- Performance metrics: latency, throughput, memory usage
- Accuracy evaluation per workload type
- HTML report generation
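
To make the pluggable backend architecture concrete, here is a minimal sketch of how a backend adapter could be shaped. This is an illustrative assumption, not the actual interface in this PR: `InferenceBackend`, `GenerationResult`, and `OpenAIBackend` are hypothetical names, and only the `openai` client calls reflect a real library API.

```python
# Hypothetical sketch of a pluggable backend interface; names are
# illustrative and do not correspond to the classes in this PR.
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class GenerationResult:
    text: str
    latency_s: float        # wall-clock time for the request
    tokens_generated: int   # used to derive throughput (tokens/s)


class InferenceBackend(ABC):
    """Common interface each engine (vLLM, Ollama, OpenAI, MLX) would implement."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> GenerationResult:
        ...


class OpenAIBackend(InferenceBackend):
    """Example adapter; assumes the `openai` client library is installed."""

    def __init__(self, model: str = "gpt-4o-mini"):
        from openai import OpenAI  # deferred import so other backends stay optional
        self._client = OpenAI()
        self._model = model

    def generate(self, prompt: str, max_tokens: int = 256) -> GenerationResult:
        start = time.perf_counter()
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        latency = time.perf_counter() - start
        return GenerationResult(
            text=resp.choices[0].message.content or "",
            latency_s=latency,
            tokens_generated=resp.usage.completion_tokens if resp.usage else 0,
        )
```

A benchmark runner would then loop over workload prompts, call `generate()` on whichever backend is configured, and derive throughput as `tokens_generated / latency_s`.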

This framework can be used to evaluate SystemDS LLM inference components once they are developed.
