resource_monitor(1) | Cooperative Computing Tools | resource_monitor(1) |
NAME¶
resource_monitor - monitors the cpu, memory, io, and disk usage of a tree of processes.
SYNOPSIS¶
resource_monitor [options] -- command [command-options]
DESCRIPTION¶
resource_monitor is a tool to monitor the computational resources used by the process created by the command given as an argument, and all its descendants. The monitor works indirectly, by observing how the environment changes while the process is running, so all the information reported should be considered an estimate (this is in contrast with direct methods, such as ptrace). It works on Linux, and can be used automatically by makeflow and work queue applications.
Additionally, the user can specify maximum resource limits in the form of a file, or a string given at the command line. If one of the resources goes over the limit specified, then the monitor terminates the task, and reports which resource went over the respective limits.
In systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used.
Currently, the monitor does not support interactive applications. That is, if a process issues a read call from standard input, and standard input has not been redirected, then the process tree is terminated. This is likely to change in future versions of the tool.
resource_monitor generates up to three log files: a summary file encoded as JSON with the maximum values of the resources used, a time series that shows the resources used at given time intervals, and a list of files that were opened during execution.
The summary file is a JSON document with the following fields. Unless indicated, all fields are an array with two values: a number that describes the measurement, and a string describing the units (e.g., [ measurement, "units" ]).
-
command: the command line given as an argument
start: time at start of execution, since the epoch
end: time at end of execution, since the epoch
exit_type: one of "normal", "signal" or "limit" (a string)
signal: number of the signal that terminated the process. Only present if exit_type is signal
cores: maximum number of cores used
cores_avg: number of cores as cpu_time/wall_time
exit_status: final status of the parent process
max_concurrent_processes: the maximum number of processes running concurrently
total_processes: count of all of the processes created
wall_time: duration of execution, end - start
cpu_time: user+system time of the execution
virtual_memory: maximum virtual memory across all processes
memory: maximum resident size across all processes
swap_memory: maximum swap usage across all processes
bytes_read: amount of data read from disk
bytes_written: amount of data written to disk
bytes_received: amount of data read from network interfaces
bytes_sent: amount of data written to network interfaces
bandwidth: maximum bandwidth used
total_files: total maximum number of files and directories of all the working directories in the tree
disk: size of all working directories in the tree
limits_exceeded: resources over the limit with -l, -L options (JSON object)
peak_times: seconds from start when a maximum occurred (JSON object)
snapshots: list of intermediate measurements, identified by snapshot_name (JSON object)
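As a rough illustration of the [ measurement, "units" ] convention described above, the following Python sketch reads a few fields from a summary. The sample text, its values, and the unit strings "s" and "MB" are made up for this example; the helper read_measurement is hypothetical and not part of resource_monitor.

```python
import json

# Illustrative summary text. The values and unit strings are assumptions
# made for this sketch, not output captured from resource_monitor.
SAMPLE_SUMMARY = """
{
  "exit_type": "normal",
  "wall_time": [10.02, "s"],
  "memory": [3, "MB"]
}
"""

def read_measurement(summary, field):
    """Return (value, units) for a field stored as [measurement, "units"].

    Returns None when the field is absent or is not a two-element array
    (for example exit_type, which is a plain string).
    """
    entry = summary.get(field)
    if isinstance(entry, list) and len(entry) == 2:
        return entry[0], entry[1]
    return None

summary = json.loads(SAMPLE_SUMMARY)
wall_time = read_measurement(summary, "wall_time")
memory = read_measurement(summary, "memory")
print(summary["exit_type"], wall_time, memory)
```

In practice the text would come from the file with extension .summary written by the monitor rather than from an inline string.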
The time-series log has a row per time sample. For each row, the columns have the following meaning (all columns are integers):
-
wall_clock: the sample time, since the epoch, in microseconds
cpu_time: accumulated user + kernel time, in microseconds
cores: current number of cores used
max_concurrent_processes: concurrent processes at the time of the sample
virtual_memory: current virtual memory size, in MB
memory: current resident memory size, in MB
swap_memory: current swap usage, in MB
bytes_read: accumulated number of bytes read, in bytes
bytes_written: accumulated number of bytes written, in bytes
bytes_received: accumulated number of bytes received, in bytes
bytes_sent: accumulated number of bytes sent, in bytes
bandwidth: current bandwidth, in bps
total_files: current number of files and directories, across all working directories in the tree
disk: current size of working directories in the tree, in MB
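The column list above can be turned into a small reader for the time-series log. This Python sketch assumes that the columns are whitespace-separated, appear in the order listed, and that lines starting with "#" are comments; none of these layout details are stated in this page, so adjust them to the actual file. The sample rows are made up.

```python
# Column names in the order they are described above (assumed file order).
COLUMNS = [
    "wall_clock", "cpu_time", "cores", "max_concurrent_processes",
    "virtual_memory", "memory", "swap_memory", "bytes_read",
    "bytes_written", "bytes_received", "bytes_sent", "bandwidth",
    "total_files", "disk",
]

# Illustrative rows only; not real monitor output.
SAMPLE_SERIES = """# illustrative header line
1500000000000000 0 0 1 10 2 0 0 0 0 0 0 5 1
1500000002000000 1500000 1 2 12 3 0 4096 1024 0 0 0 6 1
"""

def parse_series(text):
    """Parse time-series text into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        values = [int(v) for v in line.split()]
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
        rows.append(dict(zip(COLUMNS, values)))
    return rows

def elapsed_seconds(rows):
    """Wall-clock span covered by the samples (wall_clock is in microseconds)."""
    return (rows[-1]["wall_clock"] - rows[0]["wall_clock"]) / 1_000_000

rows = parse_series(SAMPLE_SERIES)
print(len(rows), elapsed_seconds(rows), max(r["memory"] for r in rows))
```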
OPTIONS¶
- -d, --debug=<subsystem>
- Enable debugging for this subsystem.
- -o, --debug-file=<file>
- Write debugging output to this file. By default, debugging is sent to stderr (":stderr"). You may specify logs to be sent to stdout (":stdout") instead.
- -v, --version
- Show version string.
- -h, --help
- Show help text.
- -i, --interval=<n>
- Maximum interval between observations, in seconds (default=1).
- --pid=pid
- Track pid instead of executing a command line (warning: less precise measurements).
- --accurate-short-processes
- Accurately measure short running processes (adds overhead).
- -c, --sh=<str>
- Read command line from str, and execute as '/bin/sh -c str'.
- -l, --limits-file=<file>
- Use file with a list of var: value pairs for resource limits.
- -L, --limits=<string>
- String of the form "var: value, var: value" to specify resource limits. (May be specified multiple times.)
- -f, --child-in-foreground
- Keep the monitored process in foreground (for interactive use).
- -O, --with-output-files=<template>
- Specify template for log files (default=resource-pid).
- --with-time-series
- Write resource time series to template.series.
- --with-inotify
- Write inotify statistics of opened files to template.files.
- -V, --verbatim-to-summary=<str>
- Include this string verbatim in a line in the summary. (May be specified multiple times.)
- --measure-dir=dir
- Follow the size of dir. By default the directory at the start of execution is followed. Can be specified multiple times. See --without-disk-footprint below.
- --follow-chdir
- Follow processes' current working directories.
- --without-disk-footprint
- Do not measure working directory footprint. Overrides --measure-dir.
- --no-pprint
- Do not pretty-print summaries.
- --snapshot-events=file
- Configuration file for snapshots on file patterns. See below.
- --catalog-task-name=<task-name>
- Report measurements to catalog server with "task"=<task-name>.
- --catalog-project=<project>
- Set project name of catalog update to <project> (default=<task-name>).
- --catalog=<catalog>
- Use catalog server <catalog>. (default=catalog.cse.nd.edu:9097).
- --catalog-interval=<interval>
- Send update to catalog every <interval> seconds. (default=30).
The limits file should contain lines of the form:
-
resource: max_value
It may contain any of the following fields, in the same units as defined for the summary file:
max_concurrent_processes, wall_time, cpu_time, virtual_memory, resident_memory, swap_memory, bytes_read, bytes_written, workdir_number_files_dirs, workdir_footprint
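A limits file, or the equivalent string for -L, can be produced from a dictionary. The Python sketch below is one way to do it; the helper names format_limits_file and format_limits_option are hypothetical, and the limit values are illustrative.

```python
# Resource names accepted in a limits file, as listed above.
KNOWN_LIMITS = {
    "max_concurrent_processes", "wall_time", "cpu_time", "virtual_memory",
    "resident_memory", "swap_memory", "bytes_read", "bytes_written",
    "workdir_number_files_dirs", "workdir_footprint",
}

def _check(limits):
    """Reject names that are not in the documented list."""
    unknown = sorted(set(limits) - KNOWN_LIMITS)
    if unknown:
        raise ValueError(f"unknown limit name(s): {', '.join(unknown)}")

def format_limits_file(limits):
    """Render limits as 'resource: max_value' lines for --limits-file."""
    _check(limits)
    return "".join(f"{name}: {value}\n" for name, value in limits.items())

def format_limits_option(limits):
    """Render limits as a 'var: value, var: value' string for -L."""
    _check(limits)
    return ", ".join(f"{name}: {value}" for name, value in limits.items())

example = {"wall_time": 5, "max_concurrent_processes": 4}
print(format_limits_file(example), end="")
print(format_limits_option(example))
```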
ENVIRONMENT VARIABLES¶
- •
- CCTOOLS_RESOURCE_MONITOR_HELPER Location of the desired helper library to wrap libc calls. If not provided
EXIT STATUS¶
-
0 The command exit status was 0, and the monitor process ran without errors.
-
1 The command exit status was non-zero, and the monitor process ran without errors.
-
2 The command was terminated because it ran out of resources (see options -l, -L).
-
3 The command did not run successfully because the monitor process had an error.
To obtain the exit status of the original command, see the generated file with extension .summary.
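A wrapper script can branch on these exit codes. The Python sketch below builds a command line using only the -O option and the -- separator documented above, and maps the monitor's exit status to a short description. The helpers build_argv, describe_exit and run_monitored are hypothetical names, and run_monitored assumes resource_monitor is available on the PATH.

```python
import subprocess

# Meanings of the monitor's exit status, as listed above.
EXIT_MEANINGS = {
    0: "command exited with status 0",
    1: "command exited with non-zero status",
    2: "command terminated: ran out of resources (see -l, -L)",
    3: "monitor error: command did not run successfully",
}

def describe_exit(code):
    """Return a short description for a resource_monitor exit status."""
    return EXIT_MEANINGS.get(code, "unexpected exit status")

def build_argv(command, template):
    """Build the argument list: resource_monitor -O template -- command..."""
    return ["resource_monitor", "-O", template, "--"] + list(command)

def run_monitored(command, template):
    """Run command under the monitor and return (status, description)."""
    completed = subprocess.run(build_argv(command, template))
    return completed.returncode, describe_exit(completed.returncode)
```

For instance, run_monitored(["sleep", "10"], "sleep-log") would run sleep 10 under the monitor. Since statuses 0 and 1 only say whether the command's own status was zero, the exact value still has to be read from the .summary file.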
SNAPSHOTS¶
The resource_monitor can be directed to take snapshots of the resources used according to the files created by the processes monitored. The typical use of monitoring snapshots is to set a watch on a log file, and generate a snapshot when a line in the log matches a pattern. To activate the snapshot facility, use the command line argument --snapshot-events=file, in which file is a JSON-encoded document with the following format:
-
{
    "FILENAME": {
        "from-start":boolean,
        "from-start-if-truncated":boolean,
        "delete-if-found":boolean,
        "events": [
            {
                "label":"EVENT_NAME",
                "on-create":boolean,
                "on-truncate":boolean,
                "on-pattern":"REGEXP",
                "count":integer
            },
            {
                "label":"EVENT_NAME",
                ...
            }
        ]
    },
    "FILENAME": {
        ...
    }
}
All fields but label are optional.
- FILENAME: Name of a file to watch.
- from-start: If FILENAME exists when the monitor starts running, process from line 1. Default: false, as monitored processes may be appending to already existing files.
- from-start-if-truncated: If FILENAME is truncated, process from line 1. Default: true, to account for log rotations.
- delete-if-found: Delete FILENAME when found. Default: false
- events:
  - label: Name that identifies the snapshot. Only alphanumeric, -, and _ characters are allowed.
  - on-create: Take a snapshot every time the file is created. Default: false
  - on-delete: Take a snapshot every time the file is deleted. Default: false
  - on-truncate: Take a snapshot when the file is truncated. Default: false
  - on-pattern: Take a snapshot when a line matches the regexp pattern. Default: none
  - count: Maximum number of snapshots for this label. Default: -1 (no limit)
The snapshots are recorded both in the main resource summary file under the key snapshots, and as separate JSON-encoded documents. The snapshots are identified with the key "snapshot_name", which is a comma-separated string of label(count) elements. A label corresponds to a name that identifies the snapshot, and the count is the number of times an event was triggered since the last check (several events may be triggered, for example, when several matching lines are written to the log). Several events may have the same label, and exactly one of on-create, on-truncate, and on-pattern should be specified per event.
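The "snapshot_name" value can be split back into its labels and counts. This Python sketch assumes each element has exactly the form label(count), with the label restricted to the characters allowed above; the function parse_snapshot_name and the sample string are illustrative.

```python
import re

# One element of snapshot_name: a label followed by a count in parentheses.
_ELEMENT = re.compile(r"^([A-Za-z0-9_-]+)\((\d+)\)$")

def parse_snapshot_name(name):
    """Split 'label(count),label(count)' into a dict of label -> count.

    Counts for a repeated label are added together.
    """
    counts = {}
    for element in name.split(","):
        match = _ELEMENT.match(element.strip())
        if match is None:
            raise ValueError(f"not a label(count) element: {element!r}")
        label, count = match.group(1), int(match.group(2))
        counts[label] = counts.get(label, 0) + count
    return counts

print(parse_snapshot_name("started(1),MY_LOG_LINE(3)"))
```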
EXAMPLES¶
To monitor 'sleep 10', at 2 second intervals, with output to sleep-log.summary, and with a monitor alarm at 5 seconds:
-
% resource_monitor --interval=2 -L"wall_time: 5" -O sleep-log -- sleep 10
Execute 'date' and redirect its output to a file:
-
% resource_monitor --sh 'date > date.output'
It can also be run automatically from makeflow, by specifying the '--monitor' option:
-
% makeflow --monitor=some-log-dir Makeflow
In this case, makeflow wraps every command line rule with the monitor, and writes the resulting logs per rule in the some-log-dir directory.
Additionally, it can be run automatically from Work Queue:
-
q = work_queue_create_monitoring(port);
work_queue_enable_monitoring(q, some-log-dir, /*kill tasks on exhaustion*/ 1);
Work Queue then wraps every task with the monitor and writes the resulting summaries in some-log-dir.
SNAPSHOTS EXAMPLES¶
Generate a snapshot when "my.log" is created:
-
{
    "my.log":
    {
        "events":[
            {
                "label":"MY_LOG_STARTED",
                "on-create":true
            }
        ]
    }
}
Generate snapshots every time a line is added to "my.log":
-
{
    "my.log":
    {
        "events":[
            {
                "label":"MY_LOG_LINE",
                "on-pattern":"^.*$"
            }
        ]
    }
}
Generate snapshots on particular lines of "my.log":
-
{
    "my.log":
    {
        "events":[
            {
                "label":"started",
                "on-pattern":"^# START"
            },
            {
                "label":"end-of-start",
                "on-pattern":"^# PROCESSING"
            },
            {
                "label":"end-of-processing",
                "on-pattern":"^# ANALYSIS"
            }
        ]
    }
}
The monitor can also generate a snapshot when a particular file is created. The monitor can detect this file, generate a snapshot, and delete the file to get ready for the next snapshot. In the following example the monitor takes a snapshot every time the file please-take-a-snapshot is created:
-
{
    "please-take-a-snapshot":
    {
        "delete-if-found":true,
        "events":[
            {
                "label":"manual-snapshot",
                "on-create":true
            }
        ]
    }
}
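Event files like the ones above are easy to get slightly wrong by hand (a missing quote or comma makes the JSON invalid), so it may be convenient to generate them. The Python sketch below checks the two rules stated in the SNAPSHOTS section (allowed label characters, and exactly one of on-create, on-truncate, on-pattern per event) before serializing. The helpers check_event and snapshot_config are hypothetical and not part of resource_monitor.

```python
import json
import re

LABEL_RE = re.compile(r"^[A-Za-z0-9_-]+$")
TRIGGERS = ("on-create", "on-truncate", "on-pattern")

def check_event(event):
    """Validate one event dict against the rules in the SNAPSHOTS section."""
    label = event.get("label", "")
    if not LABEL_RE.match(label):
        raise ValueError(f"invalid label: {label!r}")
    # A trigger counts as specified when it is true or a non-empty pattern.
    active = [t for t in TRIGGERS if event.get(t)]
    if len(active) != 1:
        raise ValueError(f"{label}: expected exactly one trigger, got {active}")
    return event

def snapshot_config(watches):
    """Validate every event and return the configuration as JSON text."""
    for spec in watches.values():
        for event in spec.get("events", []):
            check_event(event)
    return json.dumps(watches, indent=4)

config_text = snapshot_config({
    "please-take-a-snapshot": {
        "delete-if-found": True,
        "events": [{"label": "manual-snapshot", "on-create": True}],
    }
})
print(config_text)
```

The resulting text can be written to a file and passed with --snapshot-events=file.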
BUGS AND KNOWN ISSUES¶
- The monitor cannot track the children of statically linked executables.
- The option --snapshot-events assumes that the watched files are written by appending to them. File truncation may not be detected if, between checks, the size of the file is larger than or equal to the size after truncation. File checks are fixed at intervals of 1 second.
COPYRIGHT¶
The Cooperative Computing Tools are Copyright (C) 2005-2019 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
CCTools 8.0.0 DEVELOPMENT |