Monitoring¶
Warning
Gnocchi is not supported on the testbed. If you are interested in power monitoring, see this Trovi artifact.
Chameleon collects monitoring information, such as CPU load and power consumption, from various sources into an aggregation service. Data is kept in this service at a resolution that decreases over time. Users can retrieve these metrics via a command line interface (CLI).
In Chameleon, the aggregation service is implemented using the Gnocchi time series database. All Chameleon supported images, from which most of our users’ images are derived, are configured to send a selection of system metrics using the collectd system statistics collection daemon. This daemon can gather a wide range of metrics; by default only selected metrics are sent, but users can reconfigure the daemon at any time (see Configuring collectd) to adapt this set and monitor their experiments better. Another source of metrics is the infrastructure itself, for example the energy and power consumption metrics.
Tip
Reading the Command Line Interface (CLI) section is highly recommended before continuing to the following sections.
Setting up the Gnocchi CLI¶
In addition to Installing the CLI, you must also install the Gnocchi client plugin. To install on your local machine, run the following command:
pip install gnocchiclient
Then, set up your environment for OpenStack command line usage, as described in The OpenStack RC Script.
Retrieving Metrics¶
Now, you can run the gnocchi command line utility. To show the different kinds of metrics collected for a specific instance, run:
gnocchi resource show <instance_id>
Tip
You can get the instance’s ID from the GUI.
You can get your list of instances by running:
gnocchi resource list
It will print out a table similar to the one below:
+--------------------------------------+---------+------------+
| id | type | project_id |
+--------------------------------------+---------+------------+
| 8d643431-9a90-4100-8e00-f43d56a68d1e | generic | None |
| 39ff85e4-cf4e-4969-9408-af47a372ad06 | generic | None |
| 3c6c81ba-0566-4cde-a8c5-7ae4d4644293 | generic | None |
| 219f2fec-0e90-4e04-a5d7-1a78c9fde93b | generic | None |
| 57f2ba05-e57c-4241-bd27-bf95cca9c027 | generic | None |
| a0cc7bb7-0169-4973-8d4a-08151c52dec6 | generic | None |
| afb1d1e2-85db-463c-9769-2a2752eb447e | generic | None |
| 87e52c8d-c66e-43f5-b9fc-da376eccdf2d | generic | None |
| bf383c17-d76a-4e50-b347-426c96020d3b | generic | None |
| 9f25dffd-79f5-4c34-86b6-79767b8db086 | generic | None |
| 4b8ee1ce-9733-4808-921f-6d8ca92a6752 | generic | None |
| 5887a427-286f-47ad-bd4a-d7b9278bbc0f | generic | None |
| f5856741-89d5-462f-a0a2-f2423d9bfc38 | generic | None |
| fea54e18-9668-4df0-a511-5b2af4c76945 | generic | None |
| 304dc702-c57a-471c-81df-6e711d793e50 | generic | None |
+--------------------------------------+---------+------------+
Running gnocchi resource show on an instance returns a result like the following:
+-----------------------+-------------------------------------------------------------------+
| Field | Value |
+-----------------------+-------------------------------------------------------------------+
| created_by_project_id | 2c8f25efb722467eb9fc25f38996b7c4 |
| created_by_user_id | 7961a8c338ba4cb8a4ac6dfe0ab333f5 |
| creator | 7961a8c338ba4cb8a4ac6dfe0ab333f5:2c8f25efb722467eb9fc25f38996b7c4 |
| ended_at | None |
| id | 304dc702-c57a-471c-81df-6e711d793e50 |
| metrics | interface-eno1@if_dropped: 511abf80-d9e9-4e37-bde6-b34de19a7a87 |
| | interface-eno1@if_errors: 7bf316e3-ce63-424c-955c-1654541dafea |
| | interface-eno1@if_octets: 0b9a204b-38fd-4b4f-a5a1-c25b9b739c5c |
| | interface-eno1@if_packets: a62006be-d45a-4b2c-a201-4f1b4770f43c |
| | interface-eno2@if_dropped: 56bd5603-ed8c-401c-891e-05170e3b40a7 |
| | interface-eno2@if_errors: 5d2d1a60-1ca8-4256-a395-0125428cf395 |
| | interface-eno2@if_octets: 3f3daf4b-2ef8-4383-b031-294e51487ae9 |
| | interface-eno2@if_packets: 0aa3fb64-764f-402b-b9eb-6fb47e3d0efc |
| | interface-eno3@if_dropped: 23c59f0f-d018-4538-a387-90bd5809a0f0 |
| | interface-eno3@if_errors: c8ab32bb-02e7-48f7-8a67-92cf96aa6974 |
| | interface-eno3@if_octets: be37ef63-9ed5-4547-851e-46f1aa2e91d6 |
| | interface-eno3@if_packets: 149ae533-2f03-4a87-91a6-6aa0f8a541b3 |
| | interface-eno4@if_dropped: 6b8285d5-7e87-4f10-8abc-1ac848bf8240 |
| | interface-eno4@if_errors: 0dcd9925-c6e6-480d-88cb-6eb099cd4650 |
| | interface-eno4@if_octets: 4ff866fd-d5ef-4a55-aeab-7cfbe1ac1f28 |
| | interface-eno4@if_packets: 0fe10bf7-79ab-4bfb-aa6b-64efd3b925c1 |
| | interface-lo@if_dropped: 39318dc7-f008-4258-8832-457c90193924 |
| | interface-lo@if_errors: f3998907-786f-4ffd-a47b-bea1f4b9ad97 |
| | interface-lo@if_octets: f01791f8-8939-4bf3-aae7-abb1e4bffc2e |
| | interface-lo@if_packets: 6aaf06ee-5a8d-49f2-b7b9-c1d27841a89b |
| | load@load: 8d6132f8-6e60-409b-8d64-7092491aa9db |
| | memory@memory.buffered: a6ad6e20-f951-4152-aac3-d6d081c33c09 |
| | memory@memory.cached: ca0e3b30-b450-484b-ac41-a03424da279b |
| | memory@memory.free: 7aee53a8-93f9-4bac-92e3-7694b219c698 |
| | memory@memory.slab_recl: 074897b8-c40e-4538-9ef6-69338764bed3 |
| | memory@memory.slab_unrecl: 1bb6c19d-e788-40cd-98f0-0c5820e03563 |
| | memory@memory.used: 8b56e1ea-0aaa-4c1b-9462-f3698bad2ca7 |
| original_resource_id | 304dc702-c57a-471c-81df-6e711d793e50 |
| project_id | None |
| revision_end | None |
| revision_start | 2018-02-15T15:42:18.495824+00:00 |
| started_at | 2018-02-15T15:42:18.495781+00:00 |
| type | generic |
| user_id | None |
+-----------------------+-------------------------------------------------------------------+
To get all the measurements of a particular metric, run:
gnocchi measures show <metric_name> --resource-id <instance_id> --refresh
For example, to get measurements of used memory over time for instance d17d5191-af60-4407-9ed2-e3d48e86ac6d, run:
gnocchi measures show memory@memory.used --resource-id d17d5191-af60-4407-9ed2-e3d48e86ac6d --refresh
Tip
You may notice that each metric has been assigned a UUID. Instead of providing the metric name, you can provide the metric UUID.
This will show the latest measurements of that metric with granularity set to 1.0, as well as aggregate values (by default, the mean) over one minute and one hour. Other aggregation methods can be used with the --aggregation option, such as std, count, min, max and sum. Your results may appear like this:
+---------------------------+-------------+---------------+
| timestamp | granularity | value |
+---------------------------+-------------+---------------+
| 2017-12-22T18:00:00+01:00 | 3600.0 | 1222193280.0 |
| 2017-12-22T18:01:00+01:00 | 60.0 | 1222684672.0 |
| 2017-12-22T18:02:00+01:00 | 60.0 | 1222394538.67 |
| 2017-12-22T18:03:00+01:00 | 60.0 | 1222147413.33 |
| 2017-12-22T18:01:20+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:01:30+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:01:40+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:01:50+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:02:00+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:02:10+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:02:20+01:00 | 1.0 | 1222684672.0 |
| 2017-12-22T18:02:30+01:00 | 1.0 | 1221943296.0 |
| 2017-12-22T18:02:40+01:00 | 1.0 | 1222438912.0 |
| 2017-12-22T18:02:50+01:00 | 1.0 | 1221931008.0 |
| 2017-12-22T18:03:00+01:00 | 1.0 | 1221931008.0 |
| 2017-12-22T18:03:10+01:00 | 1.0 | 1221931008.0 |
| 2017-12-22T18:03:20+01:00 | 1.0 | 1221931008.0 |
| 2017-12-22T18:03:30+01:00 | 1.0 | 1222373376.0 |
| 2017-12-22T18:03:40+01:00 | 1.0 | 1222369280.0 |
| 2017-12-22T18:03:50+01:00 | 1.0 | 1222348800.0 |
+---------------------------+-------------+---------------+
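The table above mixes rows of several granularities, so it is often convenient to split them before plotting or averaging. The following is a minimal sketch, assuming the table text has been captured (for example, by redirecting the command's output to a file); the function names parse_measures and by_granularity are hypothetical helpers, not part of the Gnocchi client.

```python
def parse_measures(table_text):
    """Parse the ASCII table printed by `gnocchi measures show`.

    Returns a list of (timestamp, granularity, value) tuples.
    Border lines, the header row, and any non-table lines are skipped.
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines start with "+"; other text is ignored
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "timestamp":
            continue  # header row or unexpected layout
        rows.append((cells[0], float(cells[1]), float(cells[2])))
    return rows


def by_granularity(rows, granularity):
    """Keep only the rows recorded at the given granularity (in seconds)."""
    return [row for row in rows if row[1] == granularity]


# A few rows copied from the sample output above.
SAMPLE = """
+---------------------------+-------------+---------------+
| timestamp | granularity | value |
+---------------------------+-------------+---------------+
| 2017-12-22T18:00:00+01:00 | 3600.0 | 1222193280.0 |
| 2017-12-22T18:01:00+01:00 | 60.0 | 1222684672.0 |
| 2017-12-22T18:02:00+01:00 | 60.0 | 1222394538.67 |
| 2017-12-22T18:01:20+01:00 | 1.0 | 1222684672.0 |
+---------------------------+-------------+---------------+
"""

rows = parse_measures(SAMPLE)
for timestamp, _, value in by_granularity(rows, 60.0):
    print(timestamp, value)
```

Filtering on the granularity column keeps the hourly and per-minute aggregates from being mixed with the raw samples.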
By default, metrics are stored with an archive policy set to “high”, which is defined to keep data as:
- Per second granularity for the last hour
- Per minute granularity for the last week
- Per hour granularity for a year
Note, however, that collectd is configured to collect metrics only every 10 seconds, so measurements appear at 10-second intervals rather than every second.
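To make the retention figures concrete, the short calculation below counts how many time slots each level of the “high” policy keeps, and how many of those slots can actually hold a sample given the 10-second collectd interval. It is only an arithmetic sketch: the year is assumed to be 365 days, and the names HIGH_POLICY, slots and filled_slots are illustrative.

```python
# Retention levels of the "high" archive policy as listed above.
# Each entry is (granularity in seconds, retention window in seconds).
HIGH_POLICY = [
    (1, 3600),                # per-second points for the last hour
    (60, 7 * 24 * 3600),      # per-minute points for the last week
    (3600, 365 * 24 * 3600),  # per-hour points for a year (assumed 365 days)
]

COLLECTD_INTERVAL = 10  # seconds between collectd samples, per the text above


def slots(granularity, window):
    """Number of time slots the policy keeps at this granularity."""
    return window // granularity


def filled_slots(granularity, window, interval=COLLECTD_INTERVAL):
    """Slots that can receive at least one sample at the given sampling interval."""
    return window // max(granularity, interval)


for granularity, window in HIGH_POLICY:
    print(granularity, slots(granularity, window), filled_slots(granularity, window))
```

At one-second granularity only one slot in ten can be populated, which is why the sample output shows timestamps 10 seconds apart.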
Configuring collectd¶
While only a few collectd plugins are enabled by default, you can leverage the large collection of available plugins. To enable a plugin on your instance, edit the instance’s /etc/collectd.conf file. Uncomment each LoadPlugin <plugin_name> line that you wish to enable. Then, restart collectd with the command:
sudo systemctl restart collectd
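If you enable the same plugins on many instances, the uncommenting step can be scripted. The sketch below operates on the configuration text only and assumes that disabled plugins appear as lines of the form #LoadPlugin <plugin_name>; the function name enable_plugins is hypothetical, and you would still need to write the result back to /etc/collectd.conf and restart collectd yourself.

```python
import re

# Matches a commented-out LoadPlugin line, e.g. "#LoadPlugin cpu".
COMMENTED_PLUGIN = re.compile(r"^\s*#\s*LoadPlugin\s+(\S+)\s*$")


def enable_plugins(conf_text, plugins):
    """Return conf_text with the LoadPlugin lines for `plugins` uncommented.

    Lines for other plugins, and lines that are already enabled,
    are left exactly as they were.
    """
    wanted = set(plugins)
    out = []
    for line in conf_text.splitlines():
        match = COMMENTED_PLUGIN.match(line)
        if match and match.group(1).strip('"') in wanted:
            line = "LoadPlugin " + match.group(1)
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative configuration fragment (not the actual file shipped on the images).
SAMPLE_CONF = "#LoadPlugin cpu\nLoadPlugin load\n#LoadPlugin disk\n"
print(enable_plugins(SAMPLE_CONF, ["cpu"]))
```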
collectd is configured to send measurements in batches to minimize network traffic. However, if you want to avoid any interference during your experiments, you can disable collectd with the command:
sudo systemctl stop collectd && sudo systemctl disable collectd
Metrics for bare metal nodes¶
Chameleon automatically collects power usage and temperature data on all nodes in the system. Instantaneous power usage data (in watts) and temperature readings (in Celsius) are collected through the IPMI interface on the chassis controller for the nodes. This “out-of-band” approach does not consume additional power on the node itself and runs even when the node is powered off.
Attention
Temperature metrics are currently collected from the CPU sensor on each node. These temperature readings are only reported while the node is powered on.
As with the system metrics, retrieving these automatically collected metrics for a node requires the OpenStack CLI and the Gnocchi client plugin (see the installation instructions in Setting up the Gnocchi CLI above). To get a list of metrics available for a node, use this command:
$ gnocchi resource show <node_uuid>
To retrieve a specific reading:
$ gnocchi measures show <reading-name> --resource-id=<node_uuid> --refresh
Tip
The node UUID and the instance UUID are different. You can get a node’s UUID for a reservation from the Horizon GUI (https://chi.tacc.chameleoncloud.org for TACC reservations, https://chi.uc.chameleoncloud.org for UC reservations). Click on your lease name from within the list of leases on the Leases subtab within the Reservations tab. The node UUID is at the very bottom under the Nodes section. You can also find an individual instance’s node UUID on the instance details page: click on your instance name on the Instances tab and see Physical Host Name.
For example, issuing the following command:
$ gnocchi measures show power --resource-id=05dd5e25-440f-4492-b3b8-9d39af83b8bc --refresh
returns the following power results for the node with ID 05dd5e25-440f-4492-b3b8-9d39af83b8bc. The output below has been truncated:
+---------------------------+-------------+--------------------+
| timestamp | granularity | value |
+---------------------------+-------------+--------------------+
| 2018-03-21T07:00:00-05:00 | 3600.0 | 3.6990394736842047 |
| 2018-03-21T08:00:00-05:00 | 3600.0 | 3.6944069767441814 |
| 2018-03-21T09:00:00-05:00 | 3600.0 | 3.7072767295597435 |
| 2018-03-21T10:00:00-05:00 | 3600.0 | 3.674499999999995 |
| 2018-03-21T11:00:00-05:00 | 3600.0 | 3.708236024844716 |
| 2018-03-21T12:00:00-05:00 | 3600.0 | 3.6747818181818137 |
| 2018-03-21T13:00:00-05:00 | 3600.0 | 3.706847058823526 |
. . . . . .
| 2018-05-07T08:17:43-05:00 | 1.0 | 3.537 |
| 2018-05-07T08:18:03-05:00 | 1.0 | 3.996 |
| 2018-05-07T08:18:23-05:00 | 1.0 | 3.847 |
| 2018-05-07T08:19:03-05:00 | 1.0 | 4.145 |
| 2018-05-07T08:19:23-05:00 | 1.0 | 4.145 |
| 2018-05-07T08:19:43-05:00 | 1.0 | 3.686 |
| 2018-05-07T08:20:03-05:00 | 1.0 | 3.847 |
| 2018-05-07T08:20:23-05:00 | 1.0 | 3.686 |
| 2018-05-07T08:20:43-05:00 | 1.0 | 3.847 |
+---------------------------+-------------+--------------------+
To retrieve a metric for a specific time interval, pass the --start and --stop parameters; for example:
$ gnocchi measures show temperature_cpu --start 2018-11-27T02:00:00 --stop 2018-11-27T03:00:00 --resource-id=f3f47a67-d805-48d4-9584-f0143ae976cf --refresh
returns:
+---------------------------+-------------+---------------+
| timestamp | granularity | value |
+---------------------------+-------------+---------------+
| 2018-11-27T02:00:00-06:00 | 300.0 | 61.0 |
| 2018-11-27T02:05:00-06:00 | 300.0 | 61.0 |
| 2018-11-27T02:10:00-06:00 | 300.0 | 61.0 |
| 2018-11-27T02:15:00-06:00 | 300.0 | 61.0 |
| 2018-11-27T02:20:00-06:00 | 300.0 | 58.6 |
| 2018-11-27T02:25:00-06:00 | 300.0 | 56.5333333333 |
| 2018-11-27T02:30:00-06:00 | 300.0 | 56.0 |
| 2018-11-27T02:35:00-06:00 | 300.0 | 56.0 |
| 2018-11-27T02:40:00-06:00 | 300.0 | 56.0 |
| 2018-11-27T02:45:00-06:00 | 300.0 | 56.0 |
| 2018-11-27T02:50:00-06:00 | 300.0 | 56.0 |
| 2018-11-27T02:55:00-06:00 | 300.0 | 56.0 |
+---------------------------+-------------+---------------+
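When many intervals or nodes need to be queried, it can help to assemble the command programmatically. The sketch below builds the argument list for gnocchi measures show using only the options that appear in this section (--resource-id, --start, --stop, --aggregation and --refresh). The function names measures_command and run_measures are hypothetical, and run_measures assumes the gnocchi client is installed and your OpenStack environment is already configured.

```python
import subprocess


def measures_command(metric, resource_id, start=None, stop=None, aggregation=None):
    """Build the argument list for `gnocchi measures show`."""
    argv = ["gnocchi", "measures", "show", metric, "--resource-id", resource_id]
    if start is not None:
        argv += ["--start", start]
    if stop is not None:
        argv += ["--stop", stop]
    if aggregation is not None:
        argv += ["--aggregation", aggregation]
    argv.append("--refresh")
    return argv


def run_measures(*args, **kwargs):
    """Run the command and return its printed table as text.

    Raises subprocess.CalledProcessError if the CLI exits with an error.
    """
    result = subprocess.run(
        measures_command(*args, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Rebuild the temperature query shown above (printed, not executed).
command = measures_command(
    "temperature_cpu",
    "f3f47a67-d805-48d4-9584-f0143ae976cf",
    start="2018-11-27T02:00:00",
    stop="2018-11-27T03:00:00",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with metric names that contain characters such as @.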
Energy and Power Consumption Measurement with etrace2¶
The CC-CentOS7, CC-CentOS8, CC-Ubuntu16.04 and CC-Ubuntu18.04 appliances, as well as all Chameleon supported images derived from them, now include support for reporting energy and power consumption of each CPU socket and of memory DIMMs. This is done with the etrace2 utility, which relies on the Intel RAPL (Running Average Power Limit) interface.
Attention
Currently, etrace2 requires a kernel feature that is not supported on our ARM nodes.
To spawn your program and print energy consumption:
etrace2 <your_program>
To print power consumption every 0.5 seconds:
etrace2 -i 0.5 <your_program>
To print power consumption every 1 second for 10 seconds:
etrace2 -i 1.0 -t 10
For example, to report energy consumption during the generation of a large RSA private key:
$ etrace2 openssl genrsa -out private.pem 4096
# ETRACE2_VERSION=0.1
Generating RSA private key, 4096 bit long modulus
..............................................................................................................................................................................................................................................................................................................++
.............................................................................................................................................................++
e is 65537 (0x10001)
# ELAPSED=2.579472
# ENERGY=365.788208
# ENERGY_SOCKET0=99.037841
# ENERGY_DRAM0=78.577698
# ENERGY_SOCKET1=109.230103
# ENERGY_DRAM1=80.336548
The energy consumption is reported in joules.
etrace2 reports power and energy consumption of CPUs and memory of the node during the entire execution of the program. This will include consumption of other programs running during this period, as well as power and energy consumption of CPUs and memory under idle load.
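Because etrace2 reports both elapsed time and energy, the average power over a run follows directly as energy divided by time. The sketch below parses the "# KEY=value" summary lines from the example run above and computes that average; it assumes ELAPSED is expressed in seconds, and the function names parse_etrace2 and average_power are hypothetical helpers.

```python
def parse_etrace2(output):
    """Collect the numeric "# KEY=value" summary lines printed by etrace2."""
    fields = {}
    for line in output.splitlines():
        if not (line.startswith("# ") and "=" in line):
            continue  # output of the traced program itself
        key, _, value = line[2:].partition("=")
        try:
            fields[key.strip()] = float(value)
        except ValueError:
            pass  # ignore non-numeric values
    return fields


def average_power(fields):
    """Average power in watts: joules divided by seconds (ELAPSED assumed in seconds)."""
    return fields["ENERGY"] / fields["ELAPSED"]


# Summary lines copied from the example run above.
SAMPLE_OUTPUT = """\
# ETRACE2_VERSION=0.1
e is 65537 (0x10001)
# ELAPSED=2.579472
# ENERGY=365.788208
# ENERGY_SOCKET0=99.037841
# ENERGY_DRAM0=78.577698
# ENERGY_SOCKET1=109.230103
# ENERGY_DRAM1=80.336548
"""

fields = parse_etrace2(SAMPLE_OUTPUT)
print(round(average_power(fields), 1), "W")
```

For the example run this works out to roughly 142 W for the whole node's CPUs and memory, which includes idle consumption as noted above.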
Note the following caveats:
- Intel documents that RAPL is not an analog power meter, but rather uses a software power model. This model estimates energy usage from hardware performance counters and I/O models. According to Intel’s measurements, the estimates match actual power measurements.
- In some situations the total ENERGY value is incorrectly reported as a value equal or close to zero. However, the sum of ENERGY_SOCKET and ENERGY_DRAM values should be accurate.
- Monitoring periods longer than 10-15 minutes may be inaccurate due to RAPL registers overflowing if they’re not read regularly.
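The second caveat suggests a simple workaround: when the total ENERGY value looks wrong, sum the per-socket and per-DRAM values instead. The following is a minimal sketch of that fallback. The threshold used to decide that ENERGY is "close to zero" is an arbitrary assumption, and the function names component_energy and total_energy are hypothetical.

```python
def component_energy(fields):
    """Sum every ENERGY_SOCKET* and ENERGY_DRAM* value, in joules."""
    return sum(
        value for key, value in fields.items()
        if key.startswith("ENERGY_SOCKET") or key.startswith("ENERGY_DRAM")
    )


def total_energy(fields, epsilon=1e-3):
    """Return ENERGY, falling back to the component sum when it is near zero.

    `epsilon` (in joules) is an assumed cutoff, not a value defined by etrace2.
    """
    reported = fields.get("ENERGY", 0.0)
    if abs(reported) <= epsilon:
        return component_energy(fields)
    return reported


# Values from the example run above, with ENERGY replaced by a bogus zero.
fields = {
    "ENERGY": 0.0,
    "ENERGY_SOCKET0": 99.037841,
    "ENERGY_DRAM0": 78.577698,
    "ENERGY_SOCKET1": 109.230103,
    "ENERGY_DRAM1": 80.336548,
}
print(total_energy(fields))
```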
This utility was contributed by Chameleon user Kazutomo Yoshii of Argonne National Laboratory.
Note
The Linux kernel version of CC-Ubuntu16.04 is too old to use etrace2 on Chameleon Skylake nodes. To solve the problem, simply upgrade the Linux kernel.