Top Grafana Dashboard Examples for Web Apps
Building effective Grafana dashboards for web applications requires understanding the distinction between dashboards that look comprehensive and dashboards that actually help you identify and resolve problems quickly. Most teams start by creating dashboards with every metric they can find, resulting in walls of graphs that are overwhelming during normal operations and impossible to interpret during incidents.
This guide presents proven dashboard patterns for web applications, organized by use case rather than metric category. You will see specific panel configurations, query examples, and design decisions that make the difference between a dashboard you check daily and one that gathers dust. Each example addresses a specific operational question and explains the tradeoffs involved in different visualization choices.
These patterns work with any data source (Prometheus, InfluxDB, CloudWatch, or application-specific metrics), but the examples use Prometheus because it is the most common data source for Kubernetes-based web applications.
The Golden Signals Dashboard Pattern
The Google SRE book introduced the concept of the four golden signals: latency, traffic, errors, and saturation. This framework provides immediate visibility into the health of any web service, and a single row of four panels answering these questions should be the first thing on every application dashboard.
The latency panel shows request duration at different percentiles. Most dashboards make the mistake of showing only average latency, which hides problems affecting a small percentage of requests. The 95th and 99th percentiles reveal the experience of your slowest users:
Panel: Request Latency (Graph)
Query 1 (p50): histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Query 2 (p95): histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Query 3 (p99): histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Legend: p50, p95, p99 (set per query; histogram_quantile output carries no quantile label)
Unit: seconds (s)
Decimals: 3
Showing multiple percentiles on one graph reveals the latency distribution shape. If p50 is 50ms but p99 is 5 seconds, you have a long tail problem that averaging would completely hide. The 5-minute rate window balances between showing recent changes and smoothing out single-request outliers.
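To make the long-tail point concrete, here is a minimal Python sketch of the bucket interpolation that a histogram quantile estimate relies on. It is a simplified illustration, not Prometheus's actual implementation: it assumes buckets are sorted, counts are cumulative, and the last bucket is +Inf. The bucket values are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes buckets are sorted by upper bound, counts are
    cumulative, and the final bucket has an upper bound of +Inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented bucket rates: most requests are fast, a small share is very slow.
buckets = [
    (0.05, 60.0),
    (0.1, 90.0),
    (0.5, 95.0),
    (1.0, 97.0),
    (5.0, 100.0),
    (math.inf, 100.0),
]

p50 = histogram_quantile(0.50, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p50={p50:.3f}s p99={p99:.3f}s")
```

With these numbers the median lands inside the fastest bucket while p99 lands in the 1 to 5 second bucket, which is the gap an average would smooth over. The estimate is only as precise as the bucket boundaries, so choose buckets that bracket your latency targets.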
The traffic panel quantifies demand on your system. For web applications, this is requests per second:
Panel: Request Rate (Graph)
Query: sum(rate(http_requests_total[5m]))
Legend: Requests/sec
Unit: requests/sec (reqps)
Fill: 0
Line width: 2
A single aggregate number is usually sufficient unless you need to distinguish between different types of requests. The visual pattern matters more than the absolute numbers: sudden drops indicate something stopped working, gradual increases show organic growth, and sharp spikes suggest either real traffic events or attacks.
The error panel shows the percentage of requests that fail. Absolute error counts are less useful than error rates because context matters: 100 errors per second is catastrophic at low traffic but might be noise at high traffic:
Panel: Error Rate (Graph)
Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Legend: Error %
Unit: percent (0-100)
Y-axis min: 0
Y-axis max: auto
Thresholds: warning at 1, critical at 5
This query divides 5xx errors by total requests to produce a percentage. The thresholds add color: green below 1%, yellow from 1-5%, red above 5%. These specific values depend on your SLOs, but the pattern of having clear thresholds helps during quick glances at the dashboard.
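The arithmetic behind the panel is simple enough to sketch in a few lines of Python. The function names and traffic figures below are illustrative assumptions; the thresholds mirror the 1% and 5% values used in the panel.

```python
def error_rate_percent(errors_per_sec, requests_per_sec):
    """Return failed requests as a percentage of all requests (0 when idle)."""
    if requests_per_sec == 0:
        return 0.0
    return 100 * errors_per_sec / requests_per_sec


def threshold_color(percent, warning=1.0, critical=5.0):
    """Map an error percentage to the color the panel thresholds would show."""
    if percent >= critical:
        return "red"
    if percent >= warning:
        return "yellow"
    return "green"


# The same 100 errors per second at two invented traffic levels.
low_traffic = error_rate_percent(100, 500)
high_traffic = error_rate_percent(100, 50_000)

print(threshold_color(low_traffic), threshold_color(high_traffic))
```

The identical error count produces a critical reading at low traffic and a healthy one at high traffic, which is why the panel plots a ratio rather than a count.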
The saturation panel measures how full your service is, typically CPU or memory usage for stateless services, or queue depth for asynchronous systems:
Panel: CPU Usage (Graph)
Query: sum(rate(container_cpu_usage_seconds_total{pod=~"myapp.*"}[5m])) / sum(container_spec_cpu_quota{pod=~"myapp.*"}/container_spec_cpu_period{pod=~"myapp.*"}) * 100
Legend: CPU %
Unit: percent (0-100)
Thresholds: warning at 70, critical at 85
This calculates CPU usage as a percentage of the configured limit. Saturation above 70% deserves attention before it becomes a problem at 100%. The specific threshold depends on whether your application can handle sustained high CPU: request-response services need headroom for traffic spikes, while batch processors might safely run at 95%.
Request Flow and Dependency Dashboard
Web applications rarely operate in isolation. Most requests flow through multiple services: an API gateway, an application server, a database, and often external APIs. A dependency dashboard visualizes this flow and shows where requests fail or slow down.
The service map panel requires the diagram plugin, which renders a visual representation of service dependencies. While Grafana does not natively provide automatic service discovery for this, you can manually create a useful diagram that shows the relationship between components:
Panel Type: Diagram (requires the jdbranham-diagram-panel plugin)
Metrics:
- Gateway requests: sum(rate(http_requests_total{service="gateway"}[5m]))
- App requests: sum(rate(http_requests_total{service="app"}[5m]))
- DB queries: sum(rate(db_queries_total[5m]))
Diagram definition: Draw boxes for each service with arrows showing request flow
The more practical approach for most teams is separate stat panels showing request rate and error rate for each service in the dependency chain, with one row per service stacked top to bottom in request order:
Row: API Gateway
Panel 1 - Gateway Traffic: sum(rate(http_requests_total{service="gateway"}[5m]))
Panel 2 - Gateway Errors: sum(rate(http_requests_total{service="gateway",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="gateway"}[5m])) * 100
Row: Application Server
Panel 3 - App Traffic: sum(rate(http_requests_total{service="app"}[5m]))
Panel 4 - App Errors: sum(rate(http_requests_total{service="app",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="app"}[5m])) * 100
Row: Database
Panel 5 - Query Rate: sum(rate(db_queries_total[5m]))
Panel 6 - Query Errors: sum(rate(db_query_errors_total[5m])) / sum(rate(db_queries_total[5m])) * 100
When investigating a production issue, you read this dashboard top to bottom. If the gateway shows high traffic but the app shows low traffic, requests are failing at the gateway. If the app shows errors and the database shows errors, the problem is likely database-related. This visual debugging pattern is much faster than reading logs from multiple services.
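The reading order described above can be written down as a small triage routine. This is a hedged sketch: the `Hop` type, the thresholds, and the sample numbers are all hypothetical, and the traffic comparison only makes sense between hops that count the same kind of request.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    requests_per_sec: float
    error_percent: float


def first_suspect(chain, max_error_percent=1.0, min_traffic_ratio=0.5):
    """Return the hop to investigate first, or None if everything looks healthy.

    `chain` is ordered from entry point to deepest dependency.
    """
    # An erroring downstream hop usually explains errors in its callers,
    # so check the deepest dependency first.
    for hop in reversed(chain):
        if hop.error_percent > max_error_percent:
            return hop.name
    # No errors anywhere: look for traffic that disappears between two hops.
    for upstream, downstream in zip(chain, chain[1:]):
        if downstream.requests_per_sec < upstream.requests_per_sec * min_traffic_ratio:
            return upstream.name
    return None


database_incident = [
    Hop("gateway", 1000.0, 0.2),
    Hop("app", 990.0, 3.0),
    Hop("database", 4000.0, 6.5),
]
gateway_incident = [
    Hop("gateway", 1000.0, 0.0),
    Hop("app", 100.0, 0.0),
    Hop("database", 400.0, 0.0),
]

print(first_suspect(database_incident), first_suspect(gateway_incident))
```

A routine like this is no substitute for the dashboard, but writing the rule out makes it clear which comparisons the panel layout needs to support at a glance.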
Endpoint-Level Breakdown
Aggregating all endpoints into one number hides problems with specific API paths. A table panel showing per-endpoint metrics reveals which endpoints are slow or failing:
Panel: Endpoint Performance (Table)
Columns:
- Endpoint: the path label returned by the queries below (run them as instant queries in table format)
- Request Rate: sum(rate(http_requests_total[5m])) by (path)
- Error Rate: sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100
- p95 Latency: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
Sort by: Error Rate (descending)
Display mode: Table with color thresholds on error rate column
Sorting by error rate puts failing endpoints at the top. During an incident, this immediately shows which endpoints are affected. The alternative approach of showing all endpoints in alphabetical order forces you to scan the entire table to find problems.
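The effect of the sort order is easy to see with a few rows of sample data. The endpoint names and rates in this Python sketch are invented for illustration.

```python
# Invented per-endpoint rates, as the table queries might return them.
endpoints = [
    {"path": "/api/products", "rps": 400.0, "errors_per_sec": 0.4},
    {"path": "/api/checkout", "rps": 50.0, "errors_per_sec": 4.0},
    {"path": "/api/cart", "rps": 120.0, "errors_per_sec": 0.6},
]


def error_percent(row):
    """Error rate as a percentage of the endpoint's own traffic."""
    if row["rps"] == 0:
        return 0.0
    return 100 * row["errors_per_sec"] / row["rps"]


# Failing endpoints first, which is what the table's sort setting does.
ranked = sorted(endpoints, key=error_percent, reverse=True)

for row in ranked:
    print(f'{row["path"]:<16} {error_percent(row):5.1f}%')
```

Here the lowest-traffic endpoint rises to the top because the sort is on the ratio, not on raw error counts.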
Database Performance Dashboard
Database bottlenecks are among the most common performance problems in web applications, and they manifest in ways that are not obvious from application metrics alone. A database dashboard needs to show both database-level metrics and how the application interacts with the database.
The connection pool panel shows whether your application is running out of database connections. This is a leading indicator of database problems: connection pool exhaustion often happens before other database metrics show issues:
Panel: Database Connection Pool (Graph)
Query 1 - Active: db_connection_pool_active
Query 2 - Idle: db_connection_pool_idle
Query 3 - Max: db_connection_pool_max
Legend: Active, Idle, Max (set per query)
Display: Stacked area for Active and Idle, unstacked line for Max
Unit: connections
When active connections approach the maximum, new requests queue for connections. This causes latency spikes in your application even if database query time is normal. Stack the active and idle series so you can see the total pool size at a glance, and draw the maximum as a plain line so it is not added on top of the stack.
The query duration panel breaks down database time by query type. Treating all queries the same hides the fact that one slow query type is causing problems:
Panel: Query Duration by Type (Graph)
Query (SELECT): histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket{query_type="SELECT"}[5m])) by (le))
Query (INSERT): histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket{query_type="INSERT"}[5m])) by (le))
Query (UPDATE): histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket{query_type="UPDATE"}[5m])) by (le))
Legend: SELECT, INSERT, UPDATE (set per query; aggregating by (le) drops the query_type label)
Unit: seconds
If SELECT queries suddenly slow down while INSERT and UPDATE remain fast, you are likely experiencing a table lock or an index problem. This granularity guides troubleshooting much more effectively than a single aggregated query time metric.
The slow query log panel shows the count of queries exceeding a threshold, typically 1 second. This is a stat panel that should display a large number and turn red when the count exceeds zero:
Panel: Slow Queries (Stat)
Query: sum(increase(db_slow_queries_total[5m]))
Thresholds:
- 0: green
- 1: red
Display: Large number with colored background
Unit: queries
Zero slow queries is the goal. Any non-zero number demands investigation. The stat panel format with color makes this instantly visible from across the room, which is the point of an operational dashboard.
Cache Hit Rate and Performance Dashboard
Caching systems like Redis or Memcached are critical for web application performance, but cache metrics are often ignored until cache failures cause visible problems. An effective cache dashboard shows whether the cache is working and whether it is sized correctly.
The hit rate panel is the most important cache metric. It shows the percentage of requests served from cache versus fetched from the origin:
Panel: Cache Hit Rate (Graph)
Query: sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100
Legend: Hit Rate %
Unit: percent (0-100)
Y-axis min: 0
Y-axis max: 100
Threshold: warning below 80
A healthy cache typically maintains a hit rate above 80%. If hit rate drops suddenly, either your cache is undersized and evicting entries too quickly, or your traffic pattern changed and requests are for data that was never cached. Gradual decline over days suggests cache capacity needs to increase as data volume grows.
The eviction rate panel shows how often the cache removes entries due to memory pressure. High eviction rates indicate the cache is too small for the working set:
Panel: Cache Evictions (Graph)
Query: sum(rate(cache_evictions_total[5m]))
Legend: Evictions/sec
Unit: evictions/sec
Display: Line graph
Some evictions are normal in an LRU cache, but if eviction rate is high while hit rate is low, you are thrashing: caching data, evicting it before it can be reused, then caching it again when requested. This is worse than not caching at all because you pay the overhead of cache operations without getting the benefit.
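The thrashing condition combines two panels, so it can help to state it as a rule. The Python sketch below is one possible heuristic, not a standard definition: it assumes every miss leads to a cache write, and the 80% and 0.5 cutoffs are hypothetical defaults you would tune.

```python
def hit_rate_percent(hits_per_sec, misses_per_sec):
    """Share of lookups served from the cache, as a percentage."""
    lookups = hits_per_sec + misses_per_sec
    if lookups == 0:
        return 0.0
    return 100 * hits_per_sec / lookups


def is_thrashing(hits_per_sec, misses_per_sec, evictions_per_sec,
                 min_hit_rate=80.0, eviction_ratio=0.5):
    """Flag a cache that evicts nearly as fast as it fills while serving few hits."""
    if misses_per_sec == 0:
        return False
    low_hit_rate = hit_rate_percent(hits_per_sec, misses_per_sec) < min_hit_rate
    # Assumption: each miss triggers one write, so evictions per miss
    # approximates how much of the newly cached data is pushed out again.
    high_churn = evictions_per_sec / misses_per_sec >= eviction_ratio
    return low_hit_rate and high_churn


print(is_thrashing(300, 700, 650))  # low hit rate, heavy eviction
print(is_thrashing(900, 100, 5))    # healthy cache with occasional eviction
```

Placing the hit rate and eviction panels next to each other lets you apply the same rule by eye.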
The memory usage panel shows current cache memory consumption versus the configured limit:
Panel: Cache Memory Usage (Gauge)
Query: cache_memory_used_bytes / cache_memory_max_bytes * 100
Display: Gauge
Unit: percent
Min: 0
Max: 100
Thresholds:
- 0-70: green
- 70-90: yellow
- 90-100: red
A gauge visualization works better than a graph for current status. You want to see at a glance whether the cache is nearly full. Sustained operation above 90% leaves little headroom, so a growing share of cache writes triggers an eviction, which increases eviction rate and reduces hit rate.
User Experience and Frontend Performance Dashboard
Backend metrics tell you about server health, but users experience the frontend. A Real User Monitoring dashboard shows the performance users actually see, which often differs significantly from what synthetic monitoring or backend metrics suggest.
The page load time panel uses Navigation Timing API metrics sent from the browser:
Panel: Page Load Time (Graph)
Query (p50): histogram_quantile(0.50, sum(rate(page_load_duration_seconds_bucket[5m])) by (le))
Query (p95): histogram_quantile(0.95, sum(rate(page_load_duration_seconds_bucket[5m])) by (le))
Query (p99): histogram_quantile(0.99, sum(rate(page_load_duration_seconds_bucket[5m])) by (le))
Legend: p50, p95, p99 (set per query; histogram_quantile output carries no quantile label)
Unit: seconds
Page load time includes everything: DNS lookup, TCP connection, TLS negotiation, server processing, data transfer, and client-side rendering. This is the complete user experience. A p95 of 2 seconds might be acceptable for a complex application but terrible for a simple marketing site. Know your targets and set thresholds accordingly.
The Time to First Byte panel isolates server performance from network and rendering time:
Panel: Time to First Byte (Graph)
Query: histogram_quantile(0.95, sum(rate(ttfb_duration_seconds_bucket[5m])) by (le))
Legend: TTFB p95
Unit: seconds (s)
If page load time is high but TTFB is low, the problem is frontend: large JavaScript bundles, unoptimized images, or slow client-side rendering. If TTFB is high, the problem is backend: slow database queries, inefficient API calls, or inadequate server resources. This distinction prevents wasting time optimizing the wrong layer.
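That decision rule fits in a few lines of Python. The function name and both targets (2 seconds for page load, 0.5 seconds for TTFB) are placeholder assumptions; substitute your own.

```python
def slow_layer(page_load_s, ttfb_s, page_target_s=2.0, ttfb_target_s=0.5):
    """Classify where to look first when pages are slow."""
    if page_load_s <= page_target_s:
        return "ok"
    if ttfb_s > ttfb_target_s:
        # The server is slow to respond, so start with the backend.
        return "backend"
    # The server answered quickly, so the time is spent in the browser or network.
    return "frontend"


print(slow_layer(4.0, 0.2))
print(slow_layer(4.0, 1.5))
print(slow_layer(1.2, 0.2))
```

A high TTFB does not rule out frontend problems as well, so treat the result as the first place to look rather than the only one.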
The JavaScript errors panel shows client-side errors that users encounter:
Panel: JavaScript Errors (Graph)
Query: sum(rate(js_errors_total[5m])) by (error_type)
Legend: {{error_type}}
Unit: errors/sec
Display: Stacked bars
Breaking down errors by type shows whether errors are concentrated in one problem area. A spike in one error type indicates a specific bug was introduced, while gradual increases across all error types might indicate a browser compatibility issue or a change in user behavior.
Resource Utilization and Capacity Dashboard
Capacity planning requires knowing how close to resource limits your application is running. This dashboard shows resource consumption trends to inform scaling decisions before you run out of capacity.
The pod count panel shows how many replicas are running and whether autoscaling is activating:
Panel: Pod Replicas (Graph)
Query: count(kube_pod_info{namespace="production", pod=~"myapp.*"})
Legend: Pod Count
Unit: pods
Display: Line graph with points
If pod count increases during traffic spikes and decreases afterward, your horizontal pod autoscaler is working. If pod count is constant despite variable traffic, you are either over-provisioned or autoscaling is not configured. If pod count hits a maximum and stays there while request latency increases, you need to raise the autoscaling maximum.
The resource requests versus limits panel shows how much of your configured resources you are actually using:
Panel: CPU Usage vs Limits (Graph)
Query 1 - Used: sum(rate(container_cpu_usage_seconds_total{pod=~"myapp.*"}[5m]))
Query 2 - Requested: sum(kube_pod_container_resource_requests{resource="cpu", pod=~"myapp.*"})
Query 3 - Limit: sum(kube_pod_container_resource_limits{resource="cpu", pod=~"myapp.*"})
Legend: Used, Requested, Limit (set per query)
Unit: CPU cores
Display: Line graph
If usage is consistently well below requests, you are wasting cluster capacity and can reduce requests. If usage frequently exceeds requests, the pods rely on spare node capacity that the scheduler does not guarantee and can be starved of CPU on a busy node, so you should increase requests. The limit should be higher than the request to allow bursting, but if usage regularly hits the limit, the container is being throttled and you should increase both request and limit.
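The comparison between the three lines can be sketched as a small Python helper. The function name and the 50% and 90% watermarks are hypothetical defaults, not recommendations from any tool.

```python
def cpu_sizing_advice(used_cores, requested_cores, limit_cores,
                      low_watermark=0.5, limit_watermark=0.9):
    """Suggest a sizing change from typical usage, request, and limit (in cores)."""
    if used_cores >= limit_cores * limit_watermark:
        return "raise request and limit"
    if used_cores > requested_cores:
        return "raise request"
    if used_cores < requested_cores * low_watermark:
        return "lower request"
    return "ok"


# Invented readings for a workload with a 2-core request and a 4-core limit.
for used in (0.4, 1.5, 2.5, 3.8):
    print(used, cpu_sizing_advice(used, 2.0, 4.0))
```

Feed it a representative value such as a high percentile of usage over several days rather than a momentary reading, since sizing decisions should follow sustained behavior.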
Cost Visibility for Cloud Resources
If running in a cloud environment, integrating cost metrics into operational dashboards creates awareness of the financial impact of technical decisions:
Panel: Estimated Monthly Cost (Stat)
Query: sum(aws_ec2_instance_price * on(instance_id) group_left kube_node_info) * 730
Display: Stat panel
Unit: USD ($)
Decimals: 0
Prefix: $
This example multiplies EC2 instance hourly price by 730 (hours in a month) to estimate monthly compute cost. The actual query depends on your cloud provider and how you ingest billing data, but the principle is showing cost alongside performance metrics. When you see that doubling pod count to improve latency will cost an extra $5,000 per month, it informs the capacity decision differently than performance metrics alone.
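The same estimate is easy to reproduce by hand or in a short Python sketch. The hourly price and node counts below are made-up inputs, not real pricing.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12 months


def monthly_cost_usd(hourly_price_usd, node_count):
    """Estimate monthly compute cost for identical nodes running all month."""
    return hourly_price_usd * node_count * HOURS_PER_MONTH


# Hypothetical price per node-hour and fleet sizes.
current = monthly_cost_usd(0.192, 10)
doubled = monthly_cost_usd(0.192, 20)

print(f"current: ${current:,.0f}/month")
print(f"extra cost of doubling: ${doubled - current:,.0f}/month")
```

The estimate ignores discounts, storage, and data transfer, so it is best read as a relative signal for comparing scaling options rather than a forecast of the bill.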
Alert Status and Incident Dashboard
A dashboard showing the current state of your alerting system helps during incidents by providing a single place to see all active alerts and recent firing history.
The active alerts panel lists currently firing alerts:
Panel: Active Alerts (Table)
Query: ALERTS{alertstate="firing"}
Columns:
- Alert Name: {{alertname}}
- Severity: {{severity}}
- Service: {{service}}
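- Description: not available from this query, because ALERTS carries rule labels but not annotations; read annotation text from Alertmanager if you need it
- Duration: time() - ALERTS_FOR_STATE (seconds since the alert condition first became true)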
Sort by: Severity, then duration
Display: Table with color coding by severity
This gives incident responders immediate visibility into what is actively wrong. The ALERTS metric is special: Prometheus populates it automatically from alerting rules, so you do not need to instrument anything beyond defining the alerts themselves.
The alert history panel shows how many alerts fired over the past day, broken down by alert name:
Panel: Alert Frequency (Heatmap)
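Query: sum(count_over_time(ALERTS{alertstate="firing"}[1h])) by (alertname)
Display: Heatmap
X-axis: Time
Y-axis: Alert name
Color: Samples in firing state per hour
The query uses count_over_time rather than increase because ALERTS is a gauge that stays at 1 while an alert fires, so increase() over it is always zero; counting samples gives a value proportional to how long each alert fired within the hour. This visualization reveals patterns: an alert that fires briefly every hour might indicate an aggressive threshold, while an alert that fired continuously for six hours indicates a sustained problem. The heatmap format shows both which alerts fire most often and when they fire, which is more informative than simple frequency counts.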
Deployment and Release Dashboard
Correlating application metrics with deployments helps answer the question "did this release cause problems?" A deployment dashboard overlays release events on performance metrics.
The deployment annotations feature in Grafana marks releases on graphs. You can create these from Git tags, CI/CD system webhooks, or manually:
Dashboard Setting: Annotations
Query: http_requests_total{tag=~"release-.*"}
Title: Release {{ tag }}
Tag: deployment
Display: Vertical line on all graphs with tag name
When you see a latency spike, the annotation shows whether it coincides with a deployment. If error rate increased immediately after a release annotation, the release is the likely cause. This visual correlation is faster than checking deployment logs separately.
The version distribution panel shows which versions are currently running, which is important during rollouts:
Panel: Active Versions (Stat)
Query: count(kube_pod_labels{label_version!=""}) by (label_version)
Display: Stat panel repeated per version
Unit: pods
Color mode: Value-based
During a rolling update, you see both the old and new version counts. If the rollout stops halfway, you can see how many pods are running each version. If pods keep restarting, the new version count fluctuates while the old version count stays stable.
Mobile App Specific Dashboards
Mobile applications require different metrics than web applications because of longer release cycles, variable network conditions, and diverse device capabilities.
The app version adoption panel shows which app versions users are running:
Panel: Version Distribution (Pie Chart)
Query: sum(increase(app_sessions_total[$__range])) by (app_version)
Display: Pie chart
Legend: Version {{app_version}}, with the Percent legend value enabled
Unlike web applications where you control deployment, mobile apps depend on user update behavior. If you released version 2.0 three months ago but 40% of users are still on version 1.5, you need to support both versions. This metric informs the decision of when to deprecate old API versions or how aggressively to prompt users to update.
The crash rate by device panel identifies device-specific issues:
Panel: Crash Rate by Device (Table)
Columns:
- Device: {{device_model}}
- OS Version: {{os_version}}
- Sessions: sum(app_sessions_total) by (device_model, os_version)
- Crashes: sum(app_crashes_total) by (device_model, os_version)
- Crash Rate: (crashes / sessions) * 100
Sort by: Crash Rate descending
Filter: Minimum 100 sessions
If one device model has a 5% crash rate while others have 0.5%, you have a device-specific bug, likely related to screen size, memory constraints, or OS-specific APIs. The minimum session filter prevents low-volume devices from dominating the list due to statistical noise.
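A short Python sketch shows how the minimum session filter changes the ranking. The device names and counts are invented, and the 100-session cutoff matches the filter in the panel.

```python
def crash_table(rows, min_sessions=100):
    """Return rows with a crash_rate column, noisy low-volume devices removed."""
    table = [
        {**row, "crash_rate": 100 * row["crashes"] / row["sessions"]}
        for row in rows
        if row["sessions"] >= min_sessions
    ]
    return sorted(table, key=lambda row: row["crash_rate"], reverse=True)


# Invented sample data: Model C has only three sessions, one of which crashed.
rows = [
    {"device_model": "Model A", "os_version": "14", "sessions": 20000, "crashes": 100},
    {"device_model": "Model B", "os_version": "13", "sessions": 4000, "crashes": 200},
    {"device_model": "Model C", "os_version": "12", "sessions": 3, "crashes": 1},
]

for row in crash_table(rows):
    print(f'{row["device_model"]} {row["os_version"]}: {row["crash_rate"]:.1f}%')
```

Without the filter, Model C would top the list with a 33% crash rate computed from three sessions, hiding the device that actually needs attention.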
Dashboard Design Principles
Beyond specific panel configurations, effective dashboards follow design principles that make them useful during both normal operations and incidents.
One dashboard per use case, not per service. A "production health" dashboard that shows key metrics from all services is more useful than separate dashboards for each service. During an incident, you want to scan one dashboard to identify which component is failing, not open ten dashboards to check each service individually. Create service-specific detail dashboards for deep investigation, but the overview dashboard should be service-agnostic.
Critical metrics at the top, details below. Dashboard readers scan top to bottom and left to right. The most important information should be in the top left corner. The golden signals belong at the top, resource utilization in the middle, and detailed breakdowns at the bottom. Someone glancing at the dashboard for 5 seconds should reach the same conclusion about system health as someone who studies it for 5 minutes, just with less detail.
Use color sparingly and consistently. Red should always mean "action required," not just "this number is high." If you use red for CPU usage above 50%, which might be normal, people learn to ignore red. Reserve red for actual problems: error rates above SLO, alerts firing, or resources exhausted. Green should mean "definitely okay," not "probably okay." Yellow is for "investigate soon," not "might be worth looking at."
Time range selection matters more than most people realize. A dashboard showing the last 15 minutes is useless for capacity planning but perfect for incident response. A dashboard showing the last 30 days is useless during incidents but perfect for capacity planning. Create separate dashboards with appropriate default time ranges rather than forcing users to adjust the time picker constantly.
FAQ
What is the difference between a Grafana dashboard and a Prometheus query?
Prometheus queries retrieve and calculate metric data using PromQL, while Grafana dashboards visualize that data through graphs, tables, and stat panels. You write PromQL queries in Grafana panels to fetch data from Prometheus, then Grafana renders the results visually. Grafana supports multiple data sources beyond Prometheus, including InfluxDB, CloudWatch, and Elasticsearch, making it a visualization layer that works with many backend monitoring systems.
How many dashboards should I create for my application?
Start with three: an operational health dashboard for daily monitoring, a detailed troubleshooting dashboard for investigation, and a capacity planning dashboard for resource trends. More than five dashboards for a single application usually indicates unclear dashboard scope or duplicated metrics. Consolidate rather than proliferate. Each dashboard should answer specific questions for a specific audience, not just be a different arrangement of the same metrics.
Should I create different dashboards for different environments like staging and production?
Use the same dashboard with a variable selector for environment rather than creating separate dashboards. This ensures staging and production are monitored consistently. Use Grafana template variables to filter metrics by environment, allowing one dashboard definition to display data from any environment selected via a dropdown. This approach reduces maintenance: when you add a panel to the dashboard, it automatically applies to all environments.
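As a rough illustration, the Python below builds a fragment of dashboard JSON with one environment variable and one panel that uses it. It assumes every metric carries an `environment` label, which is a hypothetical naming choice, and the dashboard title is made up. Verify the field names against a dashboard exported from your own Grafana version before relying on them.

```python
import json

# Assumption: metrics are labeled with `environment` (for example staging, production).
environment_variable = {
    "name": "environment",
    "label": "Environment",
    "type": "query",
    "query": "label_values(http_requests_total, environment)",
    "refresh": 1,  # re-run the variable query when the dashboard loads
    "multi": False,
    "includeAll": False,
}

# The panel query references the variable, so one definition serves every environment.
request_rate_panel = {
    "type": "timeseries",
    "title": "Request Rate ($environment)",
    "targets": [
        {"expr": 'sum(rate(http_requests_total{environment="$environment"}[5m]))'}
    ],
}

dashboard = {
    "title": "Web App Health",
    "templating": {"list": [environment_variable]},
    "panels": [request_rate_panel],
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Generating the JSON from code also fits the version-control workflow, since the output can be committed and loaded through provisioning.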
What is the ideal refresh rate for a Grafana dashboard?
For operational dashboards displayed on monitors, a refresh interval of 30-60 seconds balances freshness with system load. Faster refresh rates increase load on Prometheus and rarely provide actionable information because metrics only update once per scrape interval, which is usually 15 seconds or longer. For dashboards viewed occasionally, disable auto-refresh entirely and let users manually refresh when needed. During active incidents, temporarily shorten the refresh interval to 5-10 seconds for the specific dashboard you are using.
How do I make my Grafana dashboards load faster?
Reduce the number of panels, limit the time range queried, and avoid queries with high cardinality. Each panel executes separate queries, so 30 panels means 30 queries every refresh. Combine related metrics into single panels when possible. Queries spanning months of data are slow; use appropriate time ranges for each use case. The biggest performance killer is queries that return thousands of time series due to high-cardinality labels like user IDs or request IDs.
Can I create alerts directly in Grafana or do I need Prometheus alerting?
Grafana supports alerts, but Prometheus alerting is more robust for metric-based alerts. Prometheus evaluates alerting rules against local data and has sophisticated alert grouping and inhibition through Alertmanager. Grafana alerts work better for complex conditions involving multiple data sources or non-metric data. The standard pattern is Prometheus alerts for infrastructure and application metrics, with Grafana alerts for special cases like correlating metrics with logs.
What are dashboard variables and when should I use them?
Dashboard variables create dynamic dashboards where users select values from dropdowns to filter displayed data. Common variables include environment, service name, namespace, or time period. They prevent dashboard proliferation: instead of 10 dashboards for 10 services, create one dashboard with a service variable. Variables also enable ad-hoc investigation: during an incident, quickly switch the namespace variable to focus on the affected namespace without navigating to a different dashboard.
How do I share dashboards with my team?
Save dashboards to Grafana's built-in storage and organize them in folders by team or function. For version control, export dashboards as JSON and commit them to Git, then use provisioning to automatically load dashboards from the repository. This approach treats dashboards as code: changes go through pull requests, you can diff versions, and recovering deleted dashboards is trivial. The grafana-operator for Kubernetes can deploy dashboards from ConfigMaps.
What visualization type should I use for different metrics?
Use graphs for time-series data where trends matter, stat panels for current values where only the latest number matters, and tables for comparing multiple items across several dimensions. Gauges work for percentages and bounded values like CPU usage. Heatmaps show distribution over time. Bar charts compare discrete categories. The wrong visualization obscures information: a stat panel for latency hides whether it is increasing, while a graph for current pod count adds noise without insight.
Should I create separate dashboards for developers and operations teams?
Different roles need different information, but too much separation creates silos. A better approach is a shared operational dashboard that both teams use, supplemented by role-specific detail dashboards. Developers might need detailed application-level metrics and deployment history, while operations needs detailed infrastructure and capacity metrics. Both need to see the same top-level health indicators to maintain shared understanding of system state.
Conclusion
Effective Grafana dashboards serve specific purposes: operational monitoring, incident investigation, or capacity planning. Each use case demands different metrics, time ranges, and visualization choices. The patterns presented here provide starting points, but your specific application will require customization based on architecture, SLOs, and operational practices.
Start with the golden signals dashboard to establish baseline visibility, then add dashboards for specific components like databases and caches as needed. Resist the temptation to create comprehensive dashboards with every available metric. The goal is actionable insight, not exhaustive coverage.
Treat dashboards as living documents that evolve with your application and team understanding. During post-incident reviews, update dashboards to surface the metrics that would have identified the problem faster. Remove panels that no one looks at. The dashboard you actively maintain and reference daily is infinitely more valuable than the comprehensive dashboard that impresses visitors but provides no operational value.