Monitoring IT Infra with Prometheus and Grafana – Part 3


In the last post, we created dashboards for monitoring Linux nodes. Linux and Windows are not the only things that can be monitored, but I will leave the rest for you to explore if you have a use case.

This time, let’s create a Windows summary dashboard like the one below. It shows just one node here, but it will keep scaling as more nodes are added to the Prometheus target config.

Windows Dashboard

Let’s start with the Windows summary dashboard.

Windows Dashboard01

Just as we did for the Linux summary dashboard, go to the dashboard settings via the cog wheel, open Variables, and create a new variable named job. The type can be Constant or Custom, and the value should be the job name you defined in prometheus.yml.

Job

Now create rows in the dashboard to arrange the panels as you like, then add the panels. Here are all the queries, in order.


count(up{job="$job"})-sum(up{job="$job"}) # Number of nodes offline
sum(up{job="$job"}) # Number of nodes online
up{job="$job"} # Tabular status of nodes up/down
(last_over_time(windows_cs_hostname{job="$job"}[$__rate_interval]) == 1) * on(job, instance) group_left(product,version) windows_os_info{} * on(job, instance) group_left(timezone) windows_os_timezone{} # System information
(windows_os_info{job="$job"}* on(instance) group_right(product) windows_system_system_up_time)*1000 # Last Boot Time
windows_os_processes{job="$job"} # Number of processes
windows_logical_disk_free_bytes{job="$job",volume=~".*:"}/1073741824<(windows_logical_disk_size_bytes{job="$job",volume=~".*:"}/1073741824)*0.2 # Node volumes with less than 20% Disk Space left
100-100* windows_os_physical_memory_free_bytes{job="$job"}/windows_cs_physical_memory_bytes{job="$job"} # Memory utilization in %
max by (job,instance, volume) (windows_logical_disk_free_bytes{job="$job",volume=~".*:"}/windows_logical_disk_size_bytes{job="$job",volume=~".*:"})*100<20 # Disk utilization
(max by (job, instance)(windows_logical_disk_size_bytes{job="$job",volume=~".*:"}))/1073741824 # Disk utilization query 1
(max by (job, instance)(windows_logical_disk_free_bytes{job="$job",volume=~".*:"}))/1073741824 # Disk utilization query 2
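
To make the arithmetic behind the online/offline counts and the 20% disk threshold easier to follow, here is a minimal Python sketch of the same calculations. The function names and sample values are made up for illustration; they are not part of Prometheus or Grafana.

```python
GIB = 1073741824  # 2**30 bytes, the divisor used in the queries above


def nodes_online(up_values):
    """Mirror sum(up{...}): each target reports 1 when up and 0 when down."""
    return sum(up_values)


def nodes_offline(up_values):
    """Mirror count(up{...}) - sum(up{...})."""
    return len(up_values) - sum(up_values)


def low_disk_volumes(volumes, threshold=0.2):
    """Return (volume, free GiB) for volumes with free space below 20% of their size."""
    return [
        (name, free / GIB)
        for name, (free, size) in volumes.items()
        if free / GIB < (size / GIB) * threshold
    ]


# Hypothetical sample data: three targets, one of them down.
up_samples = [1, 1, 0]
# Hypothetical volumes: name -> (free bytes, size bytes).
sample_volumes = {"C:": (10 * GIB, 100 * GIB), "D:": (60 * GIB, 200 * GIB)}

print(nodes_online(up_samples), nodes_offline(up_samples))
print(low_disk_volumes(sample_volumes))
```

In PromQL, the `<` comparison acts as a filter: series that do not satisfy it are dropped from the result, which is why only the low-space volumes show up in the panel.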

One more thing: to add a hyperlink for individual nodes in tables, add a data link to each panel with the URL below.

http://[grafana-server-ip]:3000/d/[9_digit_code]/windows-metrices-detailed-base?orgId=1&var-host=${__data.fields.instance}&var-job=${__data.fields.job}

Copy the URL of the second dashboard, created in the same folder (Windows nodes), up to orgId=1 to make sure you get it right. We do this so that when someone clicks a row for a particular node, the link picks up that row’s job and instance values and opens the page for that node only. How? We will cover that next, in the detailed dashboard.
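
As a rough illustration of what the data link expands to for one table row, the Python sketch below builds the same kind of URL from a base address plus the row's instance and job. The host name, dashboard UID and target address are placeholders I made up, and build_data_link is a hypothetical helper, not a Grafana function.

```python
from urllib.parse import urlencode


def build_data_link(base_url, instance, job):
    """Append the var-host and var-job parameters that the detailed dashboard reads."""
    query = urlencode({"orgId": 1, "var-host": instance, "var-job": job})
    return f"{base_url}?{query}"


# Hypothetical values: "grafana.example", the UID "abc123xyz" and the target
# address are placeholders; use the URL copied from your own dashboard.
base = "http://grafana.example:3000/d/abc123xyz/windows-metrices-detailed-base"
print(build_data_link(base, "10.0.0.5:9182", "windows"))
```

The var- prefix is what tells Grafana to set the dashboard variables host and job from the URL, so the names after var- must match the variable names defined in the target dashboard.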

Go to the settings of the new dashboard and define two variables: first job, exactly as in the previous dashboard, and then another named host (you can choose instance as the name instead, but then you would need to replace $host in the queries that follow).

Host

The query shown above is this one:

label_values(up{job="$job"}, instance)
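
To show what this variable query returns, here is a small Python sketch that imitates it over a handful of label sets. It is only a simulation under assumed sample data: the label_values function below is my own stand-in, and the addresses are invented.

```python
def label_values(series, label, **matchers):
    """Return the sorted distinct values of `label` across series matching all matchers."""
    return sorted(
        {
            labels[label]
            for labels in series
            if label in labels
            and all(labels.get(k) == v for k, v in matchers.items())
        }
    )


# Hypothetical label sets, as Prometheus might hold them for the `up` metric.
up_series = [
    {"job": "windows", "instance": "10.0.0.5:9182"},
    {"job": "windows", "instance": "10.0.0.6:9182"},
    {"job": "linux", "instance": "10.0.0.7:9100"},
]

print(label_values(up_series, "instance", job="windows"))
```

Each returned value becomes one entry in the host drop-down, so the list grows on its own as new targets are added under the same job.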

Once the two variables are set, here are the queries for creating a dashboard like the one below:

Windows Dashboard

The service status panel is a little different because it uses a plugin that needs to be installed. I will list the query first anyway, and then show how to install the plugin.

Here are the queries, in order:

time()-windows_system_system_up_time{job="$job",instance="$host"} # Uptime
windows_cs_physical_memory_bytes{job="$job",instance="$host"}/1073741824 # Physical memory
100-(avg(irate(windows_cpu_time_total{job="$job",instance="$host",mode="idle"}[2m])))*100 # CPU load
windows_thermalzone_temperature_celsius{job="$job",instance="$host"} # Temperature
100-(windows_os_physical_memory_free_bytes{job="$job",instance="$host"}/windows_cs_physical_memory_bytes{job="$job",instance="$host"})*100 # Memory utilization
(max by (job,instance) (windows_logical_disk_free_bytes{job="$job",instance="$host"}))/1073741824 # Disk usages
sum(increase(windows_net_bytes_received_total{job="$job",instance="$host"}[24h])) # Data received in last 24 hrs
sum(increase(windows_net_bytes_sent_total{job="$job",instance="$host"}[24h])) # Data sent in last 24 hrs
((last_over_time(windows_cs_hostname{job="$job",instance="$host"}[$__rate_interval]) == 1) * on(job, instance) group_left(product,version) windows_os_info{} * on(job, instance) group_left(timezone) windows_os_timezone{} * on(job, instance) group_left() windows_system_system_up_time{}) *1000 # System Information
((max_over_time(windows_service_state{job="$job", instance="$host",name=~"w32time|wuauserv|bits|dosvc|mpssvc|windefend|termservice"}[$__interval]) == 1) * on(job, instance, name) group_left(display_name) windows_service_info{job="$job", instance="$host"}) * 0 # Service Status for select services for which names are listed
sum by (mode)(irate(windows_cpu_time_total{job="$job",instance="$host"}[5m])) # CPU usages
windows_cs_physical_memory_bytes{job="$job",instance="$host"} # Memory usages query 1
windows_os_physical_memory_free_bytes{job="$job",instance="$host"} # Memory usages query 2
windows_os_virtual_memory_bytes{job="$job",instance="$host"} # Memory usages query 3
windows_os_virtual_memory_free_bytes{job="$job",instance="$host"} # Memory usages query 4
irate(windows_net_bytes_sent_total{job="$job",instance="$host",nic!~'isatap.*|vpn.*'}[5m])*8 # Network usages query 1
irate(windows_net_bytes_received_total{job="$job",instance="$host",nic!~'isatap.*|vpn.*'}[5m])*8 # Network usages query 2
windows_logical_disk_free_bytes{job="$job",instance="$host",volume=~".*:"} # Disk usages
irate(windows_logical_disk_read_bytes_total{job="$job",instance="$host",volume=~".*:"}[5m]) # Disk activity
windows_os_processes{job="$job",instance="$host"} # Number of processes
windows_process_handles{job="$job",instance="$host"} # Process table
sum(windows_service_state{job="$job",instance="$host"}) by (state) # Service Status
((last_over_time(windows_service_state{job="$job",instance="$host"}[$__rate_interval]) == 1) * on(job, instance, name) group_left(display_name,run_as) windows_service_info{job="$job",instance="$host"}) # Service Status Table
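
Several of the queries above are simple arithmetic on raw metric values. The Python sketch below walks through the uptime, CPU load, memory utilization and network calculations with made-up numbers. All function names and sample values are assumptions for illustration, and the irate stand-in ignores counter resets, which the real PromQL function handles.

```python
def uptime_seconds(now, boot_time):
    """Mirror time() - windows_system_system_up_time (the metric holds the boot timestamp)."""
    return now - boot_time


def irate(prev_sample, last_sample):
    """Per-second rate from the last two (timestamp, value) samples of a counter."""
    (t0, v0), (t1, v1) = prev_sample, last_sample
    return (v1 - v0) / (t1 - t0)


def cpu_load_percent(idle_rates):
    """Mirror 100 - avg(irate(idle time)) * 100 across cores."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_used_percent(free_bytes, total_bytes):
    """Mirror 100 - (free / total) * 100."""
    return 100 - (free_bytes / total_bytes) * 100


def bits_per_second(bytes_per_second):
    """Mirror the *8 in the network queries (bytes to bits)."""
    return bytes_per_second * 8


# Hypothetical samples: idle-time counters for two cores, scraped 16 seconds apart.
idle = [
    irate((0, 100.0), (16, 112.0)),  # core 0 was idle 75% of the interval
    irate((0, 200.0), (16, 208.0)),  # core 1 was idle 50% of the interval
]
print(cpu_load_percent(idle))
print(memory_used_percent(4 * 2**30, 16 * 2**30))
print(uptime_seconds(1000, 400), bits_per_second(1000))
```

Because irate only looks at the last two samples in the range, it reacts quickly to spikes, which suits graphs; a longer-window function such as rate would smooth the same data.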

Again, these are by no means complete steps: you still need to set up each panel’s type, placement, thresholds, value mappings, field organization, renaming and so on. I am intentionally leaving those to you for two reasons. First, my dashboards are fairly primitive, and there is a lot more available in the Grafana dashboard gallery. Second, I trust you will come up with even better-looking dashboards suited to your environment.

Oh wait! I forgot about the plugin, didn’t I? Installing it is as simple as running the commands below:

grafana-cli plugins install flant-statusmap-panel
systemctl restart grafana-server
systemctl status grafana-server -l

Want more plugins? Check out https://grafana.com/grafana/plugins/

Let me know your feedback on how it’s going so far.