diff --git a/website/blog/2025-10-24-dynamo-on-aks/index.md b/website/blog/2025-10-24-dynamo-on-aks/index.md
index 2ee7f1585..262ea0de9 100644
--- a/website/blog/2025-10-24-dynamo-on-aks/index.md
+++ b/website/blog/2025-10-24-dynamo-on-aks/index.md
@@ -82,7 +82,7 @@
 without constraining the compute-heavy prefill phase, improving overall
 resource utilization and performance.
 Dynamo includes an
-[SLA-based Planner](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner_quickstart.md)
+[SLA-based Planner](https://docs.nvidia.com/dynamo/v-0-9-0/components/planner)
 that proactively manages GPU scaling for prefill/decode (PD) disaggregated
 inference. Using pre-deployment profiling, it evaluates how model parallelism
 and batching affect performance, recommending configurations that meet
@@ -92,7 +92,7 @@
 time-series models, dynamically adjusting PD worker counts based on predicted
 demand and real-time metrics.
 The Dynamo
-[LLM-aware Router](https://github.com/ai-dynamo/dynamo/tree/main/docs/router)
+[LLM-aware Router](https://docs.nvidia.com/dynamo/v-0-9-0/components/router)
 manages the key-value (KV) cache across large GPU clusters by hashing requests
 and tracking cache locations. It calculates overlap scores between incoming
 requests and cached KV blocks, routing requests to GPUs that maximize cache
@@ -285,7 +285,7 @@
 This post focused on the foundational serving stack. In upcoming blogs, we
 will build on this foundation and explore more of Dynamo's advanced features,
 such as [Disaggregated Serving](https://github.com/ai-dynamo/dynamo/blob/9defc01b9b9c51a4a21abbb02907a4f1d5d2a2d2/examples/basics/disaggregated_serving/README.md#L4)
-and [SLA-based Planner](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner_quickstart.md).
+and [SLA-based Planner](https://docs.nvidia.com/dynamo/v-0-9-0/components/planner).
 We'll demonstrate how these features allow for even greater efficiency,
 moving from a static, holistic deployment to a flexible, phase-splitted
 architecture. Moving forward, we also plan to extend our testing to include larger
diff --git a/website/blog/2026-01-22-dynamo-on-aks-part-2/index.md b/website/blog/2026-01-22-dynamo-on-aks-part-2/index.md
index f938aeb43..55be325aa 100644
--- a/website/blog/2026-01-22-dynamo-on-aks-part-2/index.md
+++ b/website/blog/2026-01-22-dynamo-on-aks-part-2/index.md
@@ -22,9 +22,9 @@
 Today, we're shifting focus from raw throughput to **developer velocity** and
 **operational efficiency**. We will explore how the
-[**Dynamo Planner**](https://github.com/ai-dynamo/dynamo/blob/main/docs/planner/sla_planner.md)
+[**Dynamo Planner**](https://docs.nvidia.com/dynamo/v-0-9-0/components/planner)
 and
-[**Dynamo Profiler**](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler)
+[**Dynamo Profiler**](https://docs.nvidia.com/dynamo/v-0-9-0/components/profiler)
 remove the guesswork from performance tuning on AKS.
@@ -74,8 +74,6 @@
 prefill GPUs and idle decode GPUs).
 NVIDIA Dynamo addresses these gaps through two integrated components: the
 **Planner Profiler** and the **SLO-based Planner**.
----
-
 ### Let’s see it through an example application scenario
 Consider a mission-critical AI workload running on AKS: an airline’s
@@ -115,7 +113,7 @@
 The Dynamo Planner profiler is your pre-deployment simulation engine. Instead
 of burning GPU hours testing every possible configuration, you define your
 requirements in a **DynamoGraphDeploymentRequest (DGDR)** manifest. The
 profiler then executes an automated
-["sweep"](https://github.com/ai-dynamo/dynamo/blob/main/docs/benchmarks/sla_driven_profiling.md#profiling-method)
+["sweep"](https://github.com/ai-dynamo/dynamo/blob/release/0.8.1/docs/benchmarks/sla_driven_profiling.md)
 of the search space:
 * **Parallelization Mapping**: It tests different TP sizes for both prefill