Monitoring

Chameleon collects monitoring information, such as CPU load or power consumption data, from various sources into an aggregation service. The service keeps this data at a resolution that decreases over time. Users can retrieve the metrics via a command line interface (CLI).

In Chameleon, the aggregation service is implemented using the Gnocchi time series database. All Chameleon supported images, from which most of our users’ images are derived, are configured to send a selection of system metrics using the collectd system statistics collection daemon. This daemon can gather a wide range of quantities; by default only selected metrics are sent, but users can reconfigure the daemon at any time (see Configuring collectd) to better monitor their experiments. Another source of metrics is the infrastructure itself, for example the energy and power consumption metrics.

Tip

We highly recommend reading The Command Line Interface before continuing to the following sections.

Setting up the Gnocchi CLI

In addition to Installing the CLI, you must also install the Gnocchi client plugin. To install on your local machine, run the following command:

pip install gnocchiclient

Then, set up your environment for OpenStack command line usage, as described in The OpenStack RC Script.
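Before running any metric commands, it can help to confirm that both setup steps took effect. The following is a minimal sketch, not an official check: the function name check_metric_cli is hypothetical, and it assumes your RC script exports OS_AUTH_URL, as OpenStack RC scripts conventionally do.

```shell
# Report whether the shell looks ready for `openstack metric` commands.
# Assumption: the RC script exports OS_AUTH_URL; adjust the variable name
# if your RC script uses a different one.
check_metric_cli() {
  if ! command -v openstack >/dev/null 2>&1; then
    echo "openstack command not found; install the CLI first" >&2
    return 1
  fi
  if [ -z "${OS_AUTH_URL:-}" ]; then
    echo "OS_AUTH_URL is not set; source your OpenStack RC script" >&2
    return 1
  fi
  echo "environment looks ready"
}
```

Run check_metric_cli in the same shell in which you sourced the RC script. It only inspects the local environment and does not contact Chameleon.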

Retrieving Metrics

Now, you can run the openstack metric command line utility. To show the different kinds of metrics collected for a specific instance, run:

openstack metric resource show <instance_id>

Tip

You can get the instance’s ID from the GUI.

You can get your list of instances by running:

openstack metric resource list

It prints a table similar to the one below:

+--------------------------------------+---------+------------+
| id                                   | type    | project_id |
+--------------------------------------+---------+------------+
| 8d643431-9a90-4100-8e00-f43d56a68d1e | generic | None       |
| 39ff85e4-cf4e-4969-9408-af47a372ad06 | generic | None       |
| 3c6c81ba-0566-4cde-a8c5-7ae4d4644293 | generic | None       |
| 219f2fec-0e90-4e04-a5d7-1a78c9fde93b | generic | None       |
| 57f2ba05-e57c-4241-bd27-bf95cca9c027 | generic | None       |
| a0cc7bb7-0169-4973-8d4a-08151c52dec6 | generic | None       |
| afb1d1e2-85db-463c-9769-2a2752eb447e | generic | None       |
| 87e52c8d-c66e-43f5-b9fc-da376eccdf2d | generic | None       |
| bf383c17-d76a-4e50-b347-426c96020d3b | generic | None       |
| 9f25dffd-79f5-4c34-86b6-79767b8db086 | generic | None       |
| 4b8ee1ce-9733-4808-921f-6d8ca92a6752 | generic | None       |
| 5887a427-286f-47ad-bd4a-d7b9278bbc0f | generic | None       |
| f5856741-89d5-462f-a0a2-f2423d9bfc38 | generic | None       |
| fea54e18-9668-4df0-a511-5b2af4c76945 | generic | None       |
| 304dc702-c57a-471c-81df-6e711d793e50 | generic | None       |
+--------------------------------------+---------+------------+
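For scripting, the table borders get in the way. The sketch below relies on the generic output-formatting options of the openstack client (-f value and -c id) to print bare IDs, then loops over them. The function names are hypothetical, and the second function issues one request per resource, so it may be slow on a long list.

```shell
# Print one resource ID per line, without table borders.
# -f value selects plain-text output and -c id keeps only the id column.
list_resource_ids() {
  openstack metric resource list -f value -c id
}

# Show the details (including metric names) of every listed resource.
show_all_resources() {
  list_resource_ids | while read -r id; do
    echo "== $id"
    # </dev/null keeps the inner command from consuming the ID list.
    openstack metric resource show "$id" </dev/null
  done
}
```

Calling list_resource_ids on its own is usually enough when you only need to pick an ID to pass to other commands.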

Running openstack metric resource show on one of these IDs returns a result like the following:

+-----------------------+-------------------------------------------------------------------+
| Field                 | Value                                                             |
+-----------------------+-------------------------------------------------------------------+
| created_by_project_id | 2c8f25efb722467eb9fc25f38996b7c4                                  |
| created_by_user_id    | 7961a8c338ba4cb8a4ac6dfe0ab333f5                                  |
| creator               | 7961a8c338ba4cb8a4ac6dfe0ab333f5:2c8f25efb722467eb9fc25f38996b7c4 |
| ended_at              | None                                                              |
| id                    | 304dc702-c57a-471c-81df-6e711d793e50                              |
| metrics               | interface-eno1@if_dropped: 511abf80-d9e9-4e37-bde6-b34de19a7a87   |
|                       | interface-eno1@if_errors: 7bf316e3-ce63-424c-955c-1654541dafea    |
|                       | interface-eno1@if_octets: 0b9a204b-38fd-4b4f-a5a1-c25b9b739c5c    |
|                       | interface-eno1@if_packets: a62006be-d45a-4b2c-a201-4f1b4770f43c   |
|                       | interface-eno2@if_dropped: 56bd5603-ed8c-401c-891e-05170e3b40a7   |
|                       | interface-eno2@if_errors: 5d2d1a60-1ca8-4256-a395-0125428cf395    |
|                       | interface-eno2@if_octets: 3f3daf4b-2ef8-4383-b031-294e51487ae9    |
|                       | interface-eno2@if_packets: 0aa3fb64-764f-402b-b9eb-6fb47e3d0efc   |
|                       | interface-eno3@if_dropped: 23c59f0f-d018-4538-a387-90bd5809a0f0   |
|                       | interface-eno3@if_errors: c8ab32bb-02e7-48f7-8a67-92cf96aa6974    |
|                       | interface-eno3@if_octets: be37ef63-9ed5-4547-851e-46f1aa2e91d6    |
|                       | interface-eno3@if_packets: 149ae533-2f03-4a87-91a6-6aa0f8a541b3   |
|                       | interface-eno4@if_dropped: 6b8285d5-7e87-4f10-8abc-1ac848bf8240   |
|                       | interface-eno4@if_errors: 0dcd9925-c6e6-480d-88cb-6eb099cd4650    |
|                       | interface-eno4@if_octets: 4ff866fd-d5ef-4a55-aeab-7cfbe1ac1f28    |
|                       | interface-eno4@if_packets: 0fe10bf7-79ab-4bfb-aa6b-64efd3b925c1   |
|                       | interface-lo@if_dropped: 39318dc7-f008-4258-8832-457c90193924     |
|                       | interface-lo@if_errors: f3998907-786f-4ffd-a47b-bea1f4b9ad97      |
|                       | interface-lo@if_octets: f01791f8-8939-4bf3-aae7-abb1e4bffc2e      |
|                       | interface-lo@if_packets: 6aaf06ee-5a8d-49f2-b7b9-c1d27841a89b     |
|                       | load@load: 8d6132f8-6e60-409b-8d64-7092491aa9db                   |
|                       | memory@memory.buffered: a6ad6e20-f951-4152-aac3-d6d081c33c09      |
|                       | memory@memory.cached: ca0e3b30-b450-484b-ac41-a03424da279b        |
|                       | memory@memory.free: 7aee53a8-93f9-4bac-92e3-7694b219c698          |
|                       | memory@memory.slab_recl: 074897b8-c40e-4538-9ef6-69338764bed3     |
|                       | memory@memory.slab_unrecl: 1bb6c19d-e788-40cd-98f0-0c5820e03563   |
|                       | memory@memory.used: 8b56e1ea-0aaa-4c1b-9462-f3698bad2ca7          |
| original_resource_id  | 304dc702-c57a-471c-81df-6e711d793e50                              |
| project_id            | None                                                              |
| revision_end          | None                                                              |
| revision_start        | 2018-02-15T15:42:18.495824+00:00                                  |
| started_at            | 2018-02-15T15:42:18.495781+00:00                                  |
| type                  | generic                                                           |
| user_id               | None                                                              |
+-----------------------+-------------------------------------------------------------------+

To get all the measurements of a particular metric, run:

openstack metric measures show <metric_name> --resource-id <instance_id> --refresh

For example, to get measurements of used memory over time for instance d17d5191-af60-4407-9ed2-e3d48e86ac6d, run:

openstack metric measures show memory@memory.used --resource-id d17d5191-af60-4407-9ed2-e3d48e86ac6d --refresh

Tip

Each metric is assigned a UUID. Instead of providing the metric name, you can provide the metric’s UUID.

The measures show command prints the latest measurements of that metric at a granularity of 1.0 (one second), as well as aggregate values (by default, the mean) over one minute and one hour. Other aggregation methods, such as std, count, min, max and sum, can be selected with the --aggregation option. Your results may look like this:

+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2017-12-22T18:00:00+01:00 |      3600.0 |  1222193280.0 |
| 2017-12-22T18:01:00+01:00 |        60.0 |  1222684672.0 |
| 2017-12-22T18:02:00+01:00 |        60.0 | 1222394538.67 |
| 2017-12-22T18:03:00+01:00 |        60.0 | 1222147413.33 |
| 2017-12-22T18:01:20+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:01:30+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:01:40+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:01:50+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:02:00+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:02:10+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:02:20+01:00 |         1.0 |  1222684672.0 |
| 2017-12-22T18:02:30+01:00 |         1.0 |  1221943296.0 |
| 2017-12-22T18:02:40+01:00 |         1.0 |  1222438912.0 |
| 2017-12-22T18:02:50+01:00 |         1.0 |  1221931008.0 |
| 2017-12-22T18:03:00+01:00 |         1.0 |  1221931008.0 |
| 2017-12-22T18:03:10+01:00 |         1.0 |  1221931008.0 |
| 2017-12-22T18:03:20+01:00 |         1.0 |  1221931008.0 |
| 2017-12-22T18:03:30+01:00 |         1.0 |  1222373376.0 |
| 2017-12-22T18:03:40+01:00 |         1.0 |  1222369280.0 |
| 2017-12-22T18:03:50+01:00 |         1.0 |  1222348800.0 |
+---------------------------+-------------+---------------+
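Because the output mixes rows at several granularities, it is often convenient to keep only one of them. The following sketch filters the table text with awk. The function name measures_at is hypothetical, and the parsing assumes the three-column layout shown above.

```shell
# Read `openstack metric measures show` table output on stdin and print
# "<timestamp> <value>" for rows whose granularity equals $1 (in seconds).
measures_at() {
  awk -F'|' -v want="$1" '
    NF >= 5 && $3 ~ /[0-9]/ && ($3 + 0) == (want + 0) {
      gsub(/[ \t]/, "", $2)   # timestamp cell
      gsub(/[ \t]/, "", $4)   # value cell
      print $2, $4
    }'
}
```

For example, piping the command output into measures_at 60 keeps only the per-minute aggregates.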

By default, metrics are stored with the “high” archive policy, which keeps data at:

  • Per-second granularity for the last hour
  • Per-minute granularity for the last week
  • Per-hour granularity for a year

Note, however, that collectd is configured to collect metrics only every 10 seconds, so measurements appear at 10-second intervals rather than every second.
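To make the retention figures concrete, the arithmetic below works out how many time slots each level of the policy spans for a single metric, and how many of the per-second slots a 10-second collection interval actually fills. It is a back-of-the-envelope sketch that assumes a 365-day year; the variable names are only illustrative.

```shell
# Time slots covered by each level of the "high" archive policy.
per_second=$((60 * 60))        # one hour of 1 s slots
per_minute=$((7 * 24 * 60))    # one week of 60 s slots
per_hour=$((365 * 24))         # one year of 3600 s slots (365 days assumed)

# collectd reports every 10 s, so only a tenth of the 1 s slots hold data,
# and each per-minute aggregate is computed from about six samples.
collect_interval=10
filled_per_second=$((per_second / collect_interval))
samples_per_minute=$((60 / collect_interval))

echo "1 s level:    $per_second slots, $filled_per_second filled"
echo "60 s level:   $per_minute slots, $samples_per_minute samples each"
echo "3600 s level: $per_hour slots"
```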

Configuring collectd

While only a few collectd plugins are enabled by default, you can leverage the large collection of available plugins. To enable a plugin on your instance, edit the instance’s /etc/collectd.conf file. Uncomment each LoadPlugin <plugin_name> line that you wish to enable. Then, restart collectd with the command:

sudo systemctl restart collectd
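If you prefer to script the edit rather than open an editor, the sketch below uncomments the LoadPlugin line for a single plugin. The function name enable_collectd_plugin is hypothetical, and it assumes the line appears in the file as a commented-out "#LoadPlugin <plugin_name>" entry with a simple alphanumeric plugin name.

```shell
# Uncomment the "LoadPlugin <name>" line for one plugin in a collectd config.
# Usage: enable_collectd_plugin <plugin_name> [config_path]
enable_collectd_plugin() {
  plugin=$1
  conf=${2:-/etc/collectd.conf}
  tmp=$(mktemp) || return 1
  # Strip the leading "#" (and optional spaces) only from the matching line.
  sed -E "s/^[[:space:]]*#[[:space:]]*(LoadPlugin[[:space:]]+${plugin})[[:space:]]*$/\1/" \
    "$conf" > "$tmp" && cat "$tmp" > "$conf"
  status=$?
  rm -f "$tmp"
  return $status
}
```

Writing to /etc/collectd.conf requires root privileges, so run the function from a root shell, or apply it to a copy of the file and install the copy with sudo. Restart collectd afterwards for the change to take effect.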

collectd is configured to send measurements in batches to minimize network traffic. However, if you want to avoid any interference during your experiments, you can disable collectd with the command:

sudo systemctl stop collectd && sudo systemctl disable collectd

Power Consumption Metrics for Low Power Nodes

In addition to the system and power consumption metrics described above, Chameleon automatically collects power usage data on all low power nodes in the system. Instantaneous power usage data (in watts) are collected through the IPMI interface on the chassis controller for the nodes. This “out-of-band” approach does not consume additional power on the node itself and runs even when the node is powered off. Low power nodes for which power usage data are now being collected include all Intel Atoms, low power Xeons, and ARM64s.

As with the system metrics, retrieving the power consumption metrics for a low power node requires the OpenStack CLI and the Gnocchi client plugin (see the installation instructions in Setting up the Gnocchi CLI above). Retrieve the power usage metrics using the following command:

$ openstack metric measures show power --resource-id=<node_uuid> --refresh

Tip

The node UUID and the instance UUID are different. You can get a node’s UUID for a reservation from the Horizon GUI (https://chi.tacc.chameleoncloud.org for TACC reservations, https://chi.uc.chameleoncloud.org for UC reservations). Click on your lease name from within the list of leases on the Leases subtab within the Reservations tab. The node UUID is at the very bottom under the Nodes section.

For example, issuing the following command:

$ openstack metric measures show power --resource-id=05dd5e25-440f-4492-b3b8-9d39af83b8bc --refresh

returns the following power results for node with id 05dd5e25-440f-4492-b3b8-9d39af83b8bc. The output below has been truncated:

+---------------------------+-------------+--------------------+
| timestamp                 | granularity |              value |
+---------------------------+-------------+--------------------+
| 2018-03-21T07:00:00-05:00 |      3600.0 | 3.6990394736842047 |
| 2018-03-21T08:00:00-05:00 |      3600.0 | 3.6944069767441814 |
| 2018-03-21T09:00:00-05:00 |      3600.0 | 3.7072767295597435 |
| 2018-03-21T10:00:00-05:00 |      3600.0 |  3.674499999999995 |
| 2018-03-21T11:00:00-05:00 |      3600.0 |  3.708236024844716 |
| 2018-03-21T12:00:00-05:00 |      3600.0 | 3.6747818181818137 |
| 2018-03-21T13:00:00-05:00 |      3600.0 |  3.706847058823526 |

. . . . . .

| 2018-05-07T08:17:43-05:00 |         1.0 |              3.537 |
| 2018-05-07T08:18:03-05:00 |         1.0 |              3.996 |
| 2018-05-07T08:18:23-05:00 |         1.0 |              3.847 |
| 2018-05-07T08:19:03-05:00 |         1.0 |              4.145 |
| 2018-05-07T08:19:23-05:00 |         1.0 |              4.145 |
| 2018-05-07T08:19:43-05:00 |         1.0 |              3.686 |
| 2018-05-07T08:20:03-05:00 |         1.0 |              3.847 |
| 2018-05-07T08:20:23-05:00 |         1.0 |              3.686 |
| 2018-05-07T08:20:43-05:00 |         1.0 |              3.847 |
+---------------------------+-------------+--------------------+

To retrieve power metrics for a specific time interval, pass the start and stop parameters; for example:

$ openstack metric measures show power --start 2018-05-06T12:35:00 --stop 2018-05-06T12:40:00 --resource-id=05dd5e25-440f-4492-b3b8-9d39af83b8bc --refresh

returns:

+---------------------------+-------------+--------------------+
| timestamp                 | granularity |              value |
+---------------------------+-------------+--------------------+
| 2018-05-06T12:00:00-05:00 |      3600.0 | 3.6863040935672484 |
| 2018-05-06T12:35:00-05:00 |        60.0 | 3.5833333333333335 |
| 2018-05-06T12:36:00-05:00 |        60.0 | 3.6870000000000003 |
| 2018-05-06T12:37:00-05:00 |        60.0 | 3.5309999999999997 |
| 2018-05-06T12:38:00-05:00 |        60.0 |              3.537 |
| 2018-05-06T12:39:00-05:00 |        60.0 |              3.692 |
+---------------------------+-------------+--------------------+
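Since the values are power readings in watts, a rough energy figure for the interval follows from multiplying the mean power by the elapsed time. The sketch below does this for the rows at one granularity. The function name power_summary is hypothetical, and the estimate assumes each aggregate row represents its full time slot.

```shell
# Read measures table output on stdin. For rows at granularity $1 (seconds),
# print the mean power in watts and an energy estimate in joules
# (mean power multiplied by the total time the rows cover).
power_summary() {
  awk -F'|' -v g="$1" '
    NF >= 5 && $3 ~ /[0-9]/ && ($3 + 0) == (g + 0) { sum += $4; n++ }
    END {
      if (n == 0) { print "no rows at that granularity"; exit 1 }
      mean = sum / n
      printf "mean_watts=%.3f energy_joules=%.1f\n", mean, mean * n * g
    }'
}
```

Applied to the five per-minute rows above with power_summary 60, this reports a mean of about 3.606 W, or roughly 1081.8 J over the five minutes.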

Energy and Power Consumption Measurement with etrace2

The CC-CentOS7 and CC-Ubuntu16.04 appliances, as well as all Chameleon supported images derived from them, now include support for reporting energy and power consumption of each CPU socket and of memory DIMMs. This is done with the etrace2 utility, which relies on the Intel RAPL (Running Average Power Limit) interface.

To spawn your program and print energy consumption:

etrace2 <your_program>

To print power consumption every 0.5 seconds:

etrace2 -i 0.5 <your_program>

To print power consumption every 1 second for 10 seconds:

etrace2 -i 1.0 -t 10

For example, to report energy consumption during the generation of a large RSA private key:

$ etrace2 openssl genrsa -out private.pem 4096
# ETRACE2_VERSION=0.1
Generating RSA private key, 4096 bit long modulus
..............................................................................................................................................................................................................................................................................................................++
.............................................................................................................................................................++
e is 65537 (0x10001)
# ELAPSED=2.579472
# ENERGY=365.788208
# ENERGY_SOCKET0=99.037841
# ENERGY_DRAM0=78.577698
# ENERGY_SOCKET1=109.230103
# ENERGY_DRAM1=80.336548

The energy consumption is reported in joules.
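Dividing the reported energy by the elapsed time gives the average power over the run. The sketch below computes it from captured etrace2 output. The function name etrace2_avg_power is hypothetical, and it assumes the "# ELAPSED=" and "# ENERGY=" lines have the format shown above and have been saved to a file or pipe.

```shell
# Read etrace2 output on stdin and print the average power in watts,
# computed as ENERGY (joules) divided by ELAPSED (seconds).
etrace2_avg_power() {
  awk -F'=' '
    /^# ELAPSED=/ { elapsed = $2 }
    /^# ENERGY=/  { energy = $2 }   # matches the total only, not ENERGY_SOCKET0 etc.
    END {
      if (elapsed > 0) printf "%.1f\n", energy / elapsed
      else { print "no ELAPSED line found" > "/dev/stderr"; exit 1 }
    }'
}
```

For the run above (365.788208 J over 2.579472 s), this prints 141.8, that is, about 141.8 W averaged over the whole node's CPUs and memory.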

etrace2 reports power and energy consumption of CPUs and memory of the node during the entire execution of the program. This will include consumption of other programs running during this period, as well as power and energy consumption of CPUs and memory under idle load.

Note the following caveats:

  • Intel documents that RAPL is not an analog power meter, but rather uses a software power model. This model estimates energy usage from hardware performance counters and I/O models. According to Intel’s measurements, these estimates match actual power measurements.
  • In some situations the total ENERGY value is incorrectly reported as a value equal or close to zero. However, the sum of ENERGY_SOCKET and ENERGY_DRAM values should be accurate.
  • Monitoring periods longer than 10-15 minutes may be inaccurate, because the RAPL registers overflow if they are not read regularly.
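When the total ENERGY value looks wrong, the per-component values can be summed instead, as the second caveat suggests. The following is a minimal sketch of that fallback. The function name etrace2_component_energy is hypothetical, and it assumes the ENERGY_SOCKET and ENERGY_DRAM lines have the format shown in the example output above.

```shell
# Read etrace2 output on stdin and print the sum of all ENERGY_SOCKET* and
# ENERGY_DRAM* values, in joules.
etrace2_component_energy() {
  awk -F'=' '
    /^# ENERGY_(SOCKET|DRAM)[0-9]+=/ { total += $2 }
    END { printf "%.3f\n", total }'
}
```

For the RSA example above, the four component values add up to 367.182 J, close to the reported total of 365.788208 J.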

This utility was contributed by Chameleon user Kazutomo Yoshii of Argonne National Laboratory.