node_exporter

Commit Graph

Author	SHA1	Message	Date
Benjamin Drung	dfb6002fad	btrfs_stats: Upgrade to Python 3 (#1359 ) Python 2.7 will not be maintained past 2020. Therefore upgrade `text_collector_examples/btrfs_stats.py` to Python 3. Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>	2019-05-29 06:57:23 -05:00
Paul Gier	bd3fc09b30	fix or ignore codespell issues (#1351 ) Signed-off-by: Paul Gier <pgier@redhat.com>	2019-05-20 13:05:39 -05:00
Henk	fbe390709f	Add nvme_metrics.sh text collector example (#1309 ) * Add nvme_metrics.sh text collector example Signed-off-by: Henk <henk@wearespindle.com>	2019-04-08 15:50:29 +02:00
Edgaras Giedrė	2f87b7cba6	Update smartmon.py to widen self_assessment_passed test (#1293 ) Signed-off-by: EdgarasG <edgaras.giedre@hostinger.com>	2019-03-20 09:38:41 +01:00
Slawomir Gonet	19e5bb6abd	yum.sh: yum update monitor (#1273 ) Signed-off-by: Slawomir Gonet <slawek@otwiera.cz>	2019-02-28 00:12:47 +01:00
Julian Kornberger	5110efc1cd	Translate smartmon.py to Python (#1225 ) * Add smartmon.py python port of the smartmon.sh bash script Signed-off-by: Arthur Skowronek <ags@digineo.de>	2019-02-27 22:19:55 +01:00
Saj Goonatilleke	d546916c6b	Add the inotify-instances text collector (#1186 ) This is an alternative take on the embedded inotify collector: https://github.com/prometheus/node_exporter/pull/988 The proposed embedded collector was not accepted for inclusion because it was not possible for a single unprivileged node_exporter process to detect inotify resource utilisation in other user domains. This text collector works around the problem by giving the operator a choice between the following: - Run only the text collector as root to gain visibility over all processes on the system. - Run one or more instances of the text collector as an unprivileged user to gain visibility over subsets of the system. In either case, the data generated by this collector can be useful when hunting down inotify instance leaks -- and when confirming the resolution of such leaks. Signed-off-by: Saj Goonatilleke <sg@redu.cx>	2019-02-27 01:03:25 +01:00
Cole White	83c9b11747	remove "-n" flag from /usr/bin/awk (#1269 ) This flag causes no ipmi data to be emitted and an error log is generated on each invocation: "awk: not an option: -nf". I was unable to locate a "-n" flag in the mawk or gawk man pages, so I tested it by manually changing the script on a running Debian buster system. The issue was resolved and metrics were emitted. Signed-off-by: Cole White <cwhite@wikimedia.org>	2019-02-23 18:37:06 +01:00
Nuno Tavares	0dc14762ef	ADD Cachevault_Info.Temp, being a distinct phy component, I think it's worth monitoring (#1268 ) Signed-off-by: Nuno Tavares <n.tavares@portavita.eu>	2019-02-21 14:12:45 +01:00
mpursley	1ba436e194	add md_info_detail.sh (#1204 ) Signed-off-by: Matt Pursley <mpursley@gmail.com>	2019-02-10 15:20:42 +01:00
mpursley	7d150d5782	add physical disk "state" to megaraid_pd_info metric (#1226 ) Signed-off-by: Matt Pursley <mpursley@gmail.com>	2019-01-31 12:40:37 +01:00
Dai Dang Van	085d872aaf	Add S.M.A.R.T metrics (#1209 ) Update metrics following SMART attributes in [1][2] - Seek_Error_Rate - ID: 7 - Reallocated_Event_Count - ID: 196 [1] https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes [2] https://en.wikibooks.org/wiki/Minimizing_Hard_Disk_Drive_Failure_and_Data_Loss/Self-Monitoring,_Analysis,_and_Reporting_Technology Signed-off-by: Dai, Dang Van <daikk115@gmail.com>	2019-01-03 18:12:28 +01:00
Anton Tolchanov	cf8b29d1fb	Add a sample btrfs stats collector script (#1200 ) Signed-off-by: Anton Tolchanov <commits@knyar.net>	2018-12-21 14:10:03 +01:00
dhewg	7c960fd683	smartmon.sh: add metric for active/low-power mode (#1192 ) Add this new metric (where sda is active and sdb is in standby mode): smartmon_device_active{disk="/dev/sda",type="sat"} 1 smartmon_device_active{disk="/dev/sdb",type="sat"} 0 Also skip further metrics if the drive is in a low-power mode. This prevents spinning up disks just to get the metrics (which matches e.g. debian's default behavior for smartd). Signed-off-by: Andre Heider <a.heider@gmail.com>	2018-12-13 16:11:23 +01:00
Andreas Wirooks	9c9e17aba7	Handle 'Unknown' as measurement value. (#1113 ) We use the output-compatible perccli and storcli.py does not handle 'Unknown' as a result: ``` sg="Error parsing \"/var/lib/node_exporter/perccli.prom\": text format parsing error in line 222: expected float as value, got \"Unknown\"" source="textfile.go:212" ``` I know, the perccli should not return 'Unknown' but this error breaks all other useful measurements because the prom file is not parsable. My if condition fixes this. Signed-off-by: Andreas Wirooks <andreas.wirooks@1und1.de>	2018-11-23 16:29:56 +01:00
Christopher Blum	1b98db9fa7	textfile example storcli enhancements (#1145 ) * storcli.py: Remove IntEnum This removes an external dependency. Moved VD state to VD info labels * storcli.py: Fix BBU health detection BBU Status is 0 for a healthy cache vault and 32 for a healthy BBU. * storcli.py: Strip all strings from PD Strip all strings that we get from PDs. They often contain whitespaces.... * storcli.py: Add formatting options Add help text explaining how this documented was formatted * storcli.py: Add DG to pd_info label Add disk group to pd_info. That way we can relate to PDs in the same DG. For example to check if all disks in one RAID use the same interface... * storcli.py: Fix promtool issues Fix linting issues reported by promtool check-metrics * storcli.py: Exit if storcli reports issues storcli reports if the command was a success. We should not continue if there are issues. * storcli.py: Try to parse metrics to float This will sanitize the values we hand over to node_exporter - eliminating any unforeseen values we read out... * storcli.py: Refactor code to implement handle_sas_controller() Move code into methods so that we can now also support HBA queries. * storcli.py: Sort inputs "...like a good python developer" - Daniel Swarbrick * storcli.py: Replace external dateutil library with internal datetime Removes external dependency... * storcli.py: Also collect temperature on megaraid cards We have already collected them on mpt3sas cards... * storcli.py: Clean up old code Removed dead code that is not used any more. * storcli.py: strip() all information for labels They often contain whitespaces... * storcli.py: Try to catch KeyErrors generally If some key we expect is not there, we will want to still print whatever we have collected so far... * storcli.py: Increment version number We have made some changes here and there. The general look of the data has not been changed. * storcli.py: Fix CodeSpell issue Split string to avoid issues with Codespell due to Celcius in JSON Key Signed-off-by: Christopher Blum <zeichenanonym@web.de>	2018-11-07 17:12:23 +01:00
Sven Haardiek	29d4629f55	Introduce example to get pending updates from pacman (#1114 ) * Introduce example to get pending updates from pacman Signed-off-by: Sven Haardiek <sven@haardiek.de>	2018-11-05 22:27:57 +01:00
Benjamin Drung	2d5fcdeef4	Add mellanox_hca_temp text collector example (#1128 ) * deleted_libraries: Upgrade to Python 3 Python 2.7 will not be maintained past 2020. Therefore upgrade text_collector_examples/deleted_libraries.py to Python 3. * Add mellanox_hca_temp text collector example mellanox_hca_temp is a script that reads Mellanox HCA temperature using the Mellanox mget_temp_ext tool. Signed-off-by: Benjamin Drung <benjamin.drung@cloud.ionos.com>	2018-11-01 12:23:06 +01:00
Christopher Blum	6aa5cfba6c	textfile example script rework (#1074 ) * textfile smartmon.sh Added functions to also parse megaraid disks. Added parsing to also detect the grown_defects counters. * textfile storcli.py Reworked the example file to export lots more information about megaraid attached controllers, VDs and PDs. Signed-off-by: Christopher Blum <christopher.blum@profitbricks.com>	2018-09-18 22:43:20 +02:00
Mateusz Piotrowski	b46cd80200	Note how to get moreutils on FreeBSD (#1073 ) Signed-off-by: Mateusz Piotrowski <0mp@FreeBSD.org>	2018-09-14 14:14:45 +02:00
Matt Bostock	9e0aee8ae7	Add metrics exposing extended md RAID info (#958 ) Add metrics that expose more information about MD RAID devices and disks: - the RAID level in use - the RAID set that a disk belongs to This allows for things like alert on unusually high I/O utilisation for a disk compared to other disks in the same RAID set, which usually means the disk is failing, and for comparing write/read latency across RAID sets. Output looks like: node_md_disk_info{disk_device="/dev/dm-0", md_device="md1", md_set="A"} 1 node_md_disk_info{disk_device="/dev/dm-3", md_device="md1", md_set="B"} 1 node_md_disk_info{disk_device="/dev/dm-2", md_device="md1", md_set="A"} 1 node_md_disk_info{disk_device="/dev/dm-1", md_device="md1", md_set="B"} 1 node_md_disk_info{disk_device="/dev/dm-4", md_device="md1", md_set="A"} 1 node_md_disk_info{disk_device="/dev/dm-5", md_device="md1", md_set="B"} 1 node_md_info{md_device="md1", md_name="foo", raid_level="10", md_metadata_version="1.2"} 1 The `node_md_info` metric, which gives additional information about the RAID array, is intentionally separate to avoid adding all of those labels to each disk. If you need to query using the labels contained in `node_md_info`, you can do that using PromQL: https://www.robustperception.io/how-to-have-labels-for-machine-roles/ I looked at adding the array UUID, but there's no sysfs entry for it and I'm not sure there's a strong use case for it. This patch to add a sysfs entry for the UUID was apparently not accepted: https://www.spinics.net/lists/raid/msg40667.html Add these metrics as a textfile script rather than adding them to the Go 'md' module as they're perhaps less commonly useful. If lots of people find them useful, we can later rewrite this in Go. Signed-off-by: Matt Bostock <mbostock@cloudflare.com>	2018-08-18 08:57:51 +00:00
Bernd Müller	ee1e1997bc	Add scsi smart data to prometheus exporter (#862 ) Add scsi smart data to prometheus exporter Signed-off-by: mueller <mueller@b1-systems.de>	2018-07-04 00:30:20 +02:00
Matt Bostock	f56e8fcdf4	Fix spelling of celsius in IPMI example script (#967 ) 'Celsius' should be spelt with an 's': https://en.wikipedia.org/wiki/Celsius Signed-off-by: Matt Bostock <mbostock@cloudflare.com>	2018-06-08 19:21:19 +02:00
Matt Bostock	516e5d4beb	Add metric for outdated libraries (#957 ) Add metrics that count how many running processes are linking to deleted libraries on each machine. Deleted libraries are usually outdated libraries, and outdated libraries may have known security vulnerabilities. The rationale behind storing these as metrics is allow the rollout of security fixes to be tracked across a fleet of machines, ensuring that all affected processes are restarted (e.g. via a reboot). I'm parsing the output from `/proc/*/maps` because it's using `lsof -d DEL` can be too slow, particularly if you have sockets that bind to thousands of IP addresses. The metric labels include the library path and the base filename, which allows us to pinpoint the exact path of the deleted library but also allows us to aggregate on the library name (or approximations of it) even if library locations differ between operating system versions. The metrics output and the CPU time consumed is as follows: user@host:~$ time sudo python processes.py # HELP node_processes_linking_deleted_libraries Count of running processes that link a deleted library # TYPE node_processes_linking_deleted_libraries gauge node_processes_linking_deleted_libraries{library_path="locale-archive", library_name="/usr/lib/locale"} 3 node_processes_linking_deleted_libraries{library_path="libevent-2.0.so.5.1.9", library_name="/usr/lib/x86_64-linux-gnu"} 4 real 0m0.071s user 0m0.030s sys 0m0.041s Including the library filename and path will result in reasonably high metrics cardinality, however I think the benefits when an urgent security patch is being deployed outweigh concerns around cardinality. This script assumes that library files do not contain spaces in their path. Signed-off-by: Matt Bostock <mbostock@cloudflare.com>	2018-05-25 18:20:42 +02:00
Sandor Zeestraten	578d814744	Fix metric name in directory size text collector example The directory size text collector example uses the wrong metric name in the HELP and TYPE lines rendering the comments unusable. This fixes that by using the same metric name. Signed-off-by: Sandor Zeestraten <sandor@zeestrataca.com>	2018-05-19 21:11:46 +02:00
mueller	770f420066	added additional smartmonattrs Signed-off-by: mueller <mueller@b1-systems.de>	2018-03-22 11:14:25 +01:00
Ben Kochie	483f59d110	Document use of atomic wrapper (#781 ) Document how to use `sponge` to atomic update textfiles.	2018-02-27 19:46:01 +01:00
anarcat	79ae03c4c7	add sample directory size exporter (#789 ) * add sample directory size exporter This is a possible workaround for the lack of metrics in the new storage backend, as documented in: https://github.com/prometheus/prometheus/issues/3684 Partly inspired by this post as well: https://www.robustperception.io/monitoring-directory-sizes-with-the-textfile-collector/ * properly escape backslashes and double-quotes	2018-02-21 16:24:48 +01:00
tobald	2978728b00	Fix apt.sh syntax (#811 ) This patch fixes: ./apt.test: command substitution: line 19: syntax error near unexpected token `\|' ./apt.test: command substitution: line 19: ` \| /usr/bin/sort \| /usr/bin/uniq -c \| awk '{ gsub(/\\\\/,	2018-02-05 20:43:25 +01:00
Shevchenko Vitaliy	4ed49e73fb	Escape double quotes in device model family (#772 )	2018-01-24 11:35:14 +01:00
Ben Kochie	1ad5ba4dc7	Fix smartmon.sh bugs (#792 ) * Fix smartmon.sh info label consistency. * Fix parsing of SMART-ID attributes <= 99.	2018-01-22 16:51:20 +01:00
Bruce Lee	8d3484d0ca	Update storcli.py (#783 )	2018-01-09 09:10:30 +01:00
Mario Trangoni	a40f7e78da	StorCli text collector: fix pylint issues and handle StorCli not installed (#758 ) * StorCli text collector: fix pylint issues and handle StorCli not installed * StorCli text collector: Add HELP and TYPE strings.	2017-12-12 18:48:06 +01:00
Filippo Giunchedi	af4cf20b46	apt.sh: handle multiple origins in apt-get output (#757 ) It might happen that a given upgrade comes from multiple origins, in which case the origins are separated by ", " and thus breaking whitespace-based split. For example: Inst package [1.2.3] (1.2.4 Debian:8.10/oldstable, Debian-Security:8/oldstable [amd64]) To workaround this case, mangle the apt-get output to remove whitespaces from the origins list.	2017-12-12 10:45:59 +01:00
Derek Marcotte	1527789f76	Added text collector conversion for ipmitool output. (#746 ) * Added text collector conversion for ipmitool output. * Sort metrics before exporting, add namespace. * Added HELP string, tidy up a bit. * Make status a gauge.	2017-12-01 12:58:39 +01:00
William	6ecd8780d9	added Wear_Leveling_Count attribute to smartmon.sh script (#707 )	2017-10-19 19:20:43 +02:00
Ben Kochie	1824ac3b9e	Fix smartmon.sh textfile script (#700 ) When there are no SMART compatible devices (Raspberry Pi for example) an error is returned, but the return code is still 0. `# scan_smart_devices: glob(3) aborted matching pattern /dev/discs/disc` Remove unused `disks` variable. * Filter for only valid `/dev` devices.	2017-10-18 07:37:47 +02:00
Ben Kochie	a47f033f1b	Add text file helper for apt-get. (#680 ) * Add metric for pending upgrades. * Add metric for pending reboot required.	2017-10-04 08:34:30 +02:00
Matt Bostock	89a2f21f45	Always try to return smartmon_device_info metric (#663 ) * Always try to return smartmon_device_info metric Sometimes the 'model family' field is not returned by `smartctl' because a disk is not in the disk database for the version of smartmontools installed on the system. In those cases, the device model and serial number is still returned (at least as far as I have observed. Re-work the logic to prefer the 'vendor' field first, and if not present, always output a `smartmon_device_info` metric even if some labels have empty values. On the box I'm testing this on, where previously no metric was returned, it now returns: # HELP smartmon_device_info SMART metric device_info # TYPE smartmon_device_info gauge smartmon_device_info{disk="/dev/sda",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1 smartmon_device_info{disk="/dev/sdb",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1 smartmon_device_info{disk="/dev/sdc",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1 smartmon_device_info{disk="/dev/sdd",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1 smartmon_device_info{disk="/dev/sde",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1 smartmon_device_info{disk="/dev/sdf",type="sat",model_family="",device_model="INTEL REDACTED",serial_number="REDACTED",firmware_version="REDACTED"} 1 * Add trailing newline Because POSIX: https://stackoverflow.com/a/729795	2017-08-31 18:00:42 +02:00
William Cooley	977aa94bd3	Added metric for overall health status check to smartmon.sh example script	2017-04-05 10:51:58 -04:00
Rene Treffer	d61fef8ce6	Handle smart raw values >2^31 "%d" in awk will truncate values at 2^31. S.M.A.R.T. values can exceed that, thus use a floating point notation instead to encode larger values (at the possible cost of some precision).	2017-03-21 10:47:27 +01:00
Ben Kochie	58c10628d8	Add ntpd metrics from ntpq rv Add some metrics using to the ntpd helper script using the "request value"[0] command. [0]: https://www.eecis.udel.edu/~mills/ntp/html/ntpq.html#system	2017-02-14 16:20:53 +01:00
Ben Kochie	bde6e5d290	Add a textfile helper for NTPd. Parse the output of `ntpq -np` to provide metrics from a local NTP daemon.	2017-02-10 16:38:39 +01:00
Matt Bostock	004bdca8e5	Add text_collector_examples README	2016-12-22 22:57:14 +00:00
Matt Bostock	2c02571040	Add StorCli text collector example script Collect metrics from the StorCLI utility on the health of MegaRAID hardware RAID controllers and write them to stdout so that they can be used by the textfile collector. We parse the JSON output that StorCLI provides. Script must be run as root or with appropriate capabilities for storcli to access the RAID card. Designed to run under Python 2.7, using the system Python provided with many Linux distributions. The metrics look like this: mbostock@host:~$ sudo ./storcli.py megaraid_status_code 0 megaraid_controllers_count 1 megaraid_emergency_hot_spare{controller="0"} 1 megaraid_scheduled_patrol_read{controller="0"} 1 megaraid_virtual_drives{controller="0"} 1 megaraid_drive_groups{controller="0"} 1 megaraid_virtual_drives_optimal{controller="0"} 1 megaraid_degraded{controller="0"} 0 megaraid_battery_backup_healthy{controller="0"} 1 megaraid_ports{controller="0"} 8 megaraid_failed{controller="0"} 0 megaraid_drive_groups_optimal{controller="0"} 1 megaraid_healthy{controller="0"} 1 megaraid_physical_drives{controller="0"} 24 megaraid_controller_info{controller="0", model="AVAGOMegaRAIDSASPCIExpressROMB"} 1 mbostock@host:~$	2016-12-22 22:55:58 +00:00
Ben Kochie	0d2314e2b4	Add text file utility for SMART metrics Add a utility to parse the output of `smartctl`. * Scans all disks. * Prints metrics for `smartctl --info`. * Prints metrics for `smartctl --attributes`.	2016-11-27 14:32:32 +01:00

46 Commits (dfb6002fad073a66f0439e0d620f4e4a8e963ec2)