Five Stages of Performance Optimization Maturity

Increasing the capacity and improving the performance of cloud-based applications is not a one-size-fits-all problem. Cloud app optimization requires solutions tailored to each stage of an application's maturity and scale, and to the requirements of the application owner.

 

Stage 1: Don’t Optimize, Scale Out

  • Early in development, esp. for new products, optimizations waste engineering time
  • Design your services to scale out
    • Be Agile about it
  • Throw resources at the problem, when needed:
    • Resources are cheap (or, at least, cheaper than engineers)
    • Invest in automatic scale-out (easy solutions are OK; but set limits!)
    • Keep headroom
  • When to stop doing this
    • It’ll start to get expensive
    • Response time of a single request will get high
    • You find yourself fixing the same problem twice
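
The "automatic scale-out, but set limits and keep headroom" advice can be sketched as a small sizing rule. This is only an illustration under stated assumptions: the function name, the 25% headroom default, and the replica limits are hypothetical and are not taken from any particular autoscaler.

```python
import math


def desired_replicas(current_load, capacity_per_replica, headroom=0.25,
                     min_replicas=1, max_replicas=20):
    """Return how many replicas to run for the given load.

    current_load and capacity_per_replica use the same unit (for example
    requests per second). headroom is the spare fraction to keep on top of
    the current load. All defaults are illustrative assumptions.
    """
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    # Size for the load plus headroom, rounding up to whole replicas.
    needed = math.ceil(current_load * (1 + headroom) / capacity_per_replica)
    # Clamp to the configured limits so scale-out cannot run away.
    return max(min_replicas, min(max_replicas, needed))
```

The upper limit is what keeps "throw resources at the problem" from turning into an unbounded bill; hitting it repeatedly is one signal that it is time to leave Stage 1.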

Stage 2: Monitor Production Performance

  • Identify your production performance metric(s)
    • Throughput
    • Response time
    • Error rate
    • The usual suspects
  • Identify how they affect your business
    • involve customers or customer advocates
  • Start monitoring production environments
    • SaaS-based monitoring services are cheap and easy (there are great on-prem options, too)
    • Don’t overdo it – start small, monitor what’s important, stay above the fray
    • Start watching it, especially around deployments; later on, add triggers and notifications
  • When bottlenecks develop
    • Identify the problem and fix it
  • When to stop the ad-hoc fixing
    • There are too many services to monitor manually around deployments
    • Performance becomes unpredictable in new deployments, and the regressions are not caused by bugs
    • You’re ready to shift left
    • But: keep on monitoring
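
As a rough sketch of the three metrics listed above (throughput, response time, error rate), the following computes them from a batch of request records. The `Request` record and the choice of the 95th percentile for response time are assumptions made for illustration; a monitoring service would normally compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record shape: one entry per handled request.
    duration_ms: float
    ok: bool


def summarize(requests, window_seconds, percentile=95):
    """Summarize one monitoring window of request records."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value with at least
    # `percentile` percent of the samples at or below it.
    rank = math.ceil(percentile * len(durations) / 100)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "response_ms": durations[rank - 1],
        "error_rate": errors / len(requests),
    }
```

Comparing these summaries for the windows just before and just after a deployment is a simple way to start "watching it around deployments" before adding triggers.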

Stage 3: Add Performance Testing to Your CI/CD Pipeline

  • Define and implement a performance test suite
    • Use an existing load generator or benchmarking tool
    • Capture production traffic and replay
    • Build a custom load generator
  • Make performance regression tests part of your CI/CD pipeline
    • Report results, overlay with other development process metrics: make it visible (good or bad)
    • Pay attention to measurement precision/repeatability
  • When performance regressions are caught
    • Roll back the deployment and return it to the developer to fix (the ugly version of shift-left)
    • Allow executive override
  • When to stop:
    • Runtime resources become too expensive
    • Developers rebel against frequent rollbacks/returns
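
A performance regression gate for the pipeline might look like the sketch below. It assumes each build is measured several times, and it treats noisy measurements as inconclusive rather than as a failure, which is one way to respect the precision/repeatability point above. The function name and the 5% and 10% thresholds are hypothetical.

```python
import statistics


def regression_gate(baseline_ms, candidate_ms, tolerance=0.05, max_spread=0.10):
    """Compare repeated response-time measurements of two builds.

    Returns "pass", "fail", or "inconclusive". Thresholds are
    illustrative assumptions, not recommended values.
    """
    base = statistics.median(baseline_ms)
    cand = statistics.median(candidate_ms)
    # Relative spread (standard deviation / median) of each sample set.
    spread = max(statistics.pstdev(baseline_ms) / base,
                 statistics.pstdev(candidate_ms) / cand)
    if spread > max_spread:
        # Too noisy to trust: rerun instead of blaming the change.
        return "inconclusive"
    return "fail" if cand > base * (1 + tolerance) else "pass"
```

Using the median of several runs rather than a single run reduces the chance that one slow outlier triggers a rollback, which helps with the "developers rebel" problem.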

Stage 4: Application Performance Management

  • Instrument your code
    • This process is usually language and framework-specific
    • Dedicate an (initially small) team that cares only about APM
      • Learn the tools
      • Continuous improvements
      • 3-6 months initial commitment
    • Broadcast early successes to engage other developers
  • Why do it
    • To make (thought-through) targeted code improvements
    • To identify possible improvements when everything else fails
  • When to stop/pause
    • Make sure it’s contained and does not take on a life of its own (balance cost/benefit)
    • When the codebase is too volatile (big architectural changes, e.g., migrating to microservices)
    • Good to maintain instrumentation and monitor, even if not actively optimizing (insurance)
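
Instrumentation is language- and framework-specific, and APM products usually provide their own agents. Purely as a minimal hand-rolled illustration of the idea, a Python timing decorator could look like this; `TIMINGS`, `timed`, and `handle_request` are hypothetical names, not part of any APM tool.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-process store: function name -> list of durations in seconds.
TIMINGS = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if func raises, so failures are timed too.
            TIMINGS[func.__qualname__].append(time.perf_counter() - start)
    return wrapper


@timed
def handle_request(payload):
    # Stand-in for real application work.
    return payload.upper()
```

Leaving a lightweight decorator like this in place costs little, which fits the "maintain instrumentation as insurance" point even when nobody is actively optimizing.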

Stage 5: Automated Tuning in the CI/CD Pipeline

  • Identify performance tuning “knobs” to tweak
    • Resources: CPU, memory: reserve or limit; VM instance type; I/O throughput
    • Middleware configuration: JVM GC type/parameters, worker threads, pool sizes, write delays
    • Kernel parameters: page sizes, jumbo packet sizes, scheduler tweaks (special cases only)
    • Your own app’s parameters: thread pools, cache timeouts, memory-vs-cpu tweaks/tradeoffs
  • “Auto-tune” for microservices
    • Grid search – try all combinations (if feasible or you can reduce dimensionality)
    • Heuristics – e.g., make small tweaks – one click up/down for each knob
    • Machine learning – Bayesian optimization, reinforcement learning, CMA …
  • Choose where in the CI/CD pipeline to insert the tuning step
    • Tune in staging only, using perf test suite; propagate results programmatically to production
    • Tune in production (canary or all) — requires no-downtime deployments (you have those, right?)
  • When to stop
    • Never, it’s continuous optimization 🙂
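
The first two auto-tune strategies above, grid search and the "one click up/down for each knob" heuristic, can be sketched as follows. This is an illustration under assumptions: `measure` stands for whatever runs the perf test suite and returns a cost where lower is better, and `fake_latency` is a made-up stand-in so the example runs without a real system.

```python
import itertools


def grid_search(knobs, measure):
    """Try every combination. knobs maps knob name -> list of values."""
    names = list(knobs)
    best_config, best_cost = None, float("inf")
    for values in itertools.product(*(knobs[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def hill_climb(knobs, measure, start, max_rounds=50):
    """Move one knob one step up or down at a time, keeping improvements.

    start maps knob name -> index into that knob's list of values.
    """
    def to_config(idx):
        return {name: knobs[name][i] for name, i in idx.items()}

    index = dict(start)
    best_cost = measure(to_config(index))
    for _ in range(max_rounds):
        improved = False
        for name in knobs:
            for step in (-1, 1):
                trial = dict(index)
                trial[name] += step
                if not 0 <= trial[name] < len(knobs[name]):
                    continue  # stepped past the end of the knob's range
                cost = measure(to_config(trial))
                if cost < best_cost:
                    index, best_cost, improved = trial, cost, True
        if not improved:
            break  # no single-step change helps any more
    return to_config(index), best_cost


def fake_latency(config):
    # Made-up cost surface with its minimum at 16 threads and pool size 32.
    return abs(config["worker_threads"] - 16) + abs(config["pool_size"] - 32) / 4 + 10


KNOBS = {"worker_threads": [4, 8, 16, 32], "pool_size": [8, 16, 32, 64]}
```

Grid search measures every combination (16 here), so it only scales to a few knobs; the hill climb needs far fewer measurements but can stop at a local optimum on a less friendly cost surface than this toy one.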