Monitoring IT Infra with Prometheus and Grafana – Part 3

In the last post, we created dashboards for monitoring Linux nodes. Note that Linux and Windows are not the only things that can be monitored, but I will leave the rest for you to figure out if you have a use case.

This time, let’s talk about creating a Windows summary dashboard. Mine shows just one node, but it will keep scaling as more nodes are added to the Prometheus target config.

Let’s start with the Windows summary dashboard.

As we did for the Linux summary dashboard, open the dashboard settings via the cog wheel, go to Variables, and create a new variable named job. The type can be Constant or Custom, and the value should be the job name you defined in prometheus.yml.

Now create rows in the dashboard to arrange the panels the way you want, then add the panels. The queries are listed below, in order.


count(up{job="$job"})-sum(up{job="$job"}) # Number of nodes offline
sum(up{job="$job"}) # Number of nodes online
up{job="$job"} # Tabular status of nodes up/down
(last_over_time(windows_cs_hostname{job="$job"}[$__rate_interval]) == 1) * on(job, instance) group_left(product,version) windows_os_info{} * on(job, instance) group_left(timezone) windows_os_timezone{} # System information
(windows_os_info{job="$job"}* on(instance) group_right(product) windows_system_system_up_time)*1000 # Last Boot Time
windows_os_processes{job="$job"} # Number of processes
windows_logical_disk_free_bytes{job="$job",volume=~".*:"}/1073741824<(windows_logical_disk_size_bytes{job="$job",volume=~".*:"}/1073741824)*0.2 # Node volumes with less than 20% Disk Space left
100-100* windows_os_physical_memory_free_bytes{job="$job"}/windows_cs_physical_memory_bytes{job="$job"} # Memory utilization in %
max by (job,instance, volume) (windows_logical_disk_free_bytes{job="$job",volume=~".*:"}/windows_logical_disk_size_bytes{job="$job",volume=~".*:"})*100<20 # Disk utilization
(max by (job, instance)(windows_logical_disk_size_bytes{job="$job",volume=~".*:"}))/1073741824 # Disk utilization query 1
(max by (job, instance)(windows_logical_disk_free_bytes{job="$job",volume=~".*:"}))/1073741824 # Disk utilization query 2
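
To make the arithmetic behind a few of these queries concrete, here is a rough Python sketch that mimics them on made-up values. The node names, port, and byte counts are hypothetical and only serve as illustration; the real evaluation happens inside Prometheus.

```python
# Hypothetical samples of up{job="$job"}: 1 = reachable, 0 = down.
up = {"win-node-1:9182": 1, "win-node-2:9182": 0, "win-node-3:9182": 1}

# sum(up) counts online nodes; count(up) - sum(up) counts offline nodes.
online = sum(up.values())
offline = len(up) - online

GIB = 1073741824  # bytes per GiB, the divisor used in the queries

# Hypothetical (free_bytes, size_bytes) per (instance, volume).
disks = {
    ("win-node-1:9182", "C:"): (10 * GIB, 100 * GIB),
    ("win-node-1:9182", "D:"): (300 * GIB, 500 * GIB),
    ("win-node-3:9182", "C:"): (45 * GIB, 50 * GIB),
}


def low_space_volumes(disks, threshold=0.2):
    """Return free GiB for volumes whose free space is below `threshold`
    of their size, mirroring the `free < size * 0.2` comparison."""
    return {
        key: free / GIB
        for key, (free, size) in disks.items()
        if free < size * threshold
    }


def memory_utilization_pct(free_bytes, total_bytes):
    """100 - 100 * free / total, as in the memory utilization query."""
    return 100 - 100 * free_bytes / total_bytes


print(online, offline)
print(low_space_volumes(disks))
print(memory_utilization_pct(4 * GIB, 16 * GIB))
```

Only the volume with less than 20% free space survives the filter, which is why the corresponding panel lists just the volumes that need attention.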

One more thing: to add a hyperlink for individual nodes in tables, add a data link to each of those panels using the URL below.

http://[grafana-server-ip]:3000/d/[9_digit_code]/windows-metrices-detailed-base?orgId=1&var-host=${__data.fields.instance}&var-job=${__data.fields.job}

Copy the URL of the second dashboard, created in the same folder (Windows nodes), up to orgId=1 to make sure you get it right. We do this so that when someone clicks a row for a particular node, the link picks up that row’s job and instance values and opens the page for that node only. How? We will cover that next, in the detailed dashboard.
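
As a small illustration of what the data link ends up looking like once Grafana fills in the row values, here is a Python sketch. The Grafana address, dashboard UID, instance, and job name are all hypothetical placeholders, and the helper build_node_link is made up for this example; Grafana performs the real substitution of ${__data.fields.instance} and ${__data.fields.job} itself.

```python
from urllib.parse import urlencode


def build_node_link(base_url, instance, job):
    """Append var-host and var-job to a dashboard URL copied up to orgId=1."""
    query = urlencode({"var-host": instance, "var-job": job})
    return f"{base_url}&{query}"


# Hypothetical values: use your own Grafana address and dashboard UID.
base = (
    "http://grafana.example.local:3000/d/abc123xyz/"
    "windows-metrices-detailed-base?orgId=1"
)
print(build_node_link(base, "win-node-1:9182", "windows"))
```

The var-host and var-job parameters are what set the host and job variables on the target dashboard when the link is opened.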

Go to the settings of the new dashboard and define two variables: first job, exactly as in the previous dashboard, and then another named host (you can name it instance instead, but then you would need to replace $host in the queries given below).

The query for the host variable is:

label_values(up{job="$job"}, instance)
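
Conceptually, label_values returns the distinct values of one label across the series that match the selector. The Python sketch below imitates that on a hand-written list of series; the series and the label_values helper are hypothetical stand-ins for illustration, not how Grafana implements it.

```python
def label_values(series, label, **matchers):
    """Return the sorted distinct values of `label` among the series
    whose labels equal every given matcher."""
    values = {
        s[label]
        for s in series
        if label in s and all(s.get(k) == v for k, v in matchers.items())
    }
    return sorted(values)


# Hypothetical `up` series from two jobs.
series = [
    {"__name__": "up", "job": "windows", "instance": "win-node-1:9182"},
    {"__name__": "up", "job": "windows", "instance": "win-node-2:9182"},
    {"__name__": "up", "job": "linux", "instance": "lin-node-1:9100"},
]

print(label_values(series, "instance", job="windows"))
```

Only the instances belonging to the selected job end up in the host dropdown, which keeps the detailed dashboard scoped to Windows nodes.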

Once the two variables are set, you can build the detailed dashboard with the queries listed below.

The service status panel is a little different since it uses a plugin that needs to be installed, but I will list the query first and then explain how to install the plugin.

Here are the queries, in order:

time()-windows_system_system_up_time{job="$job",instance="$host"} # Uptime
windows_cs_physical_memory_bytes{job="$job",instance="$host"}/1073741824 # Physical memory
100-(avg(irate(windows_cpu_time_total{job="$job",instance="$host",mode="idle"}[2m])))*100 # CPU load
windows_thermalzone_temperature_celsius{job="$job",instance="$host"} # Temperature
100-(windows_os_physical_memory_free_bytes{job="$job",instance="$host"}/windows_cs_physical_memory_bytes{job="$job",instance="$host"})*100 # Memory utilization
(max by (job,instance) (windows_logical_disk_free_bytes{job="$job",instance="$host"}))/1073741824 # Disk usages
sum(increase(windows_net_bytes_received_total{job="$job",instance="$host"}[24h])) # Data received in last 24 hrs
sum(increase(windows_net_bytes_sent_total{job="$job",instance="$host"}[24h])) # Data sent in last 24 hrs
((last_over_time(windows_cs_hostname{job="$job",instance="$host"}[$__rate_interval]) == 1) * on(job, instance) group_left(product,version) windows_os_info{} * on(job, instance) group_left(timezone) windows_os_timezone{} * on(job, instance) group_left() windows_system_system_up_time{}) *1000 # System Information
((max_over_time(windows_service_state{job="$job", instance="$host",name=~"w32time|wuauserv|bits|dosvc|mpssvc|windefend|termservice"}[$__interval]) == 1) * on(job, instance, name) group_left(display_name) windows_service_info{job="$job", instance="$host"}) * 0 # Service Status for select services for which names are listed
sum by (mode)(irate(windows_cpu_time_total{job="$job",instance="$host"}[5m])) # CPU usages
windows_cs_physical_memory_bytes{job="$job",instance="$host"} # Memory usages query 1
windows_os_physical_memory_free_bytes{job="$job",instance="$host"} # Memory usages query 2
windows_os_virtual_memory_bytes{job="$job",instance="$host"} # Memory usages query 3
windows_os_virtual_memory_free_bytes{job="$job",instance="$host"} # Memory usages query 4
irate(windows_net_bytes_sent_total{job="$job",instance="$host",nic!~'isatap.*|vpn.*'}[5m])*8 # Network usages query 1
irate(windows_net_bytes_received_total{job="$job",instance="$host",nic!~'isatap.*|vpn.*'}[5m])*8 # Network usages query 2
windows_logical_disk_free_bytes{job="$job",instance="$host",volume=~".*:"} # Disk usages
irate(windows_logical_disk_read_bytes_total{job="$job",instance="$host",volume=~".*:"}[5m]) # Disk activity
windows_os_processes{job="$job",instance="$host"} # Number of processes
windows_process_handles{job="$job",instance="$host"} # Process table
sum(windows_service_state{job="$job",instance="$host"}) by (state) # Service Status
((last_over_time(windows_service_state{job="$job",instance="$host"}[$__rate_interval]) == 1) * on(job, instance, name) group_left(display_name,run_as) windows_service_info{job="$job",instance="$host"}) # Service Status Table
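
Several of the queries above rely on irate over counters, so a rough Python sketch of that arithmetic may help. It assumes irate simply takes the last two samples in the range and ignores counter resets; the sample timestamps and counter values are invented for illustration.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples.
    Counter resets are ignored in this simplified sketch."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


def cpu_load_pct(idle_by_core):
    """100 - avg(idle rate) * 100, as in the CPU load query."""
    rates = [irate(s) for s in idle_by_core.values()]
    return 100 - (sum(rates) / len(rates)) * 100


# Hypothetical idle-time counters (seconds) for two cores, 20 s apart.
idle = {
    "0": [(0, 1000.0), (20, 1015.0)],  # idle 15 s out of 20 s
    "1": [(0, 2000.0), (20, 2010.0)],  # idle 10 s out of 20 s
}

# Hypothetical received-bytes counter; * 8 converts bytes/s to bits/s.
net = [(0, 1_000_000), (20, 1_500_000)]

print(cpu_load_pct(idle))
print(irate(net) * 8)
```

The multiplication by 8 in the network queries serves the same purpose: the exporter reports bytes, while network panels are usually read in bits per second.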

Again, these are by no means complete steps: for each type of panel you still need to set up placement, thresholds, value mappings, field organization, renaming, and so on. I am intentionally leaving those to you for two reasons. First, my dashboards are fairly primitive, and there is a lot more available in the Grafana dashboard gallery. Second, I trust you will come up with even better-looking dashboards suited to your environment.

Oh wait! I forgot about the plugin, didn’t I? Installing it is as simple as running the commands below:

grafana-cli plugins install flant-statusmap-panel
systemctl restart grafana-server
systemctl status grafana-server -l

Want more plugins? Check out https://grafana.com/grafana/plugins/

Let me know your feedback on how it’s going so far.


Published by Nitish Kumar
