Mesos

Overview

This plugin collects runtime metrics from the Mesos masters and agents (slaves) in a cluster. It gathers information about resource usage and performance characteristics.

Note

This plugin is currently only available for x86_64 Linux.

Setup

The Mesos plugin is included with the AppOptics host agent by default. Follow the directions below to enable it on a given host. Note that the directions differ slightly for master and agent (slave) nodes.

Installation

Activate the plugin by symlinking the binary and its task configuration to the /opt/appoptics/autoload directory:

$ ln -s /opt/appoptics/bin/snap-plugin-collector-mesos /opt/appoptics/autoload/snap-plugin-collector-mesos
$ ln -s /opt/appoptics/etc/tasks.d/task-mesos-publish-appoptics.yaml /opt/appoptics/autoload/task-mesos-publish-appoptics.yaml

If you have an existing task configuration you would like to use, rename it to match the task file name above. Default configurations for the plugin and its tasks are provided in the Configuration section below.
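
As an optional sanity check, the two symlinks created above can be verified before moving on. The following is a minimal sketch, not part of the AppOptics tooling: check_autoload_link is a hypothetical helper name, and it only tests that a path is a symlink whose target exists.

```shell
# Hypothetical helper (not shipped with the host agent): succeeds only if
# the given path is a symlink AND the file it points to exists.
check_autoload_link() {
    link="$1"
    if [ ! -L "$link" ]; then
        echo "not a symlink: $link"
        return 1
    fi
    # -e follows the link, so it is false for a dangling symlink.
    if [ ! -e "$link" ]; then
        echo "dangling symlink: $link"
        return 1
    fi
    echo "ok: $link"
}

# Usage (paths taken from the ln commands above):
#   check_autoload_link /opt/appoptics/autoload/snap-plugin-collector-mesos
#   check_autoload_link /opt/appoptics/autoload/task-mesos-publish-appoptics.yaml
```

A non-zero return status suggests that the corresponding ln command did not take effect or that the source file is missing.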

Configuration

The host agent provides an example configuration file to help you get started quickly. It defines the plugin and task file to be loaded by the agent, but requires you to provide the correct settings for your Mesos deployment. To enable the plugin:

  1. Make a copy of the Mesos example configuration file /opt/appoptics/etc/plugins.d/mesos.yaml.example, renaming it to /opt/appoptics/etc/plugins.d/mesos.yaml:
$ sudo cp /opt/appoptics/etc/plugins.d/mesos.yaml.example /opt/appoptics/etc/plugins.d/mesos.yaml
  2. Mesos exposes a metrics endpoint that is enabled by default. To confirm that a master or agent is ready for scraping, run the following cURL commands and check that each GET request returns a JSON payload. If it does not, check your cluster configuration. Note that masters and agents use different ports:
$ curl http://<MASTER IP>:5050/metrics/snapshot
$ curl http://<AGENT IP>:5051/metrics/snapshot
  3. Update the /opt/appoptics/etc/plugins.d/mesos.yaml configuration file to indicate whether this instance of the plugin is monitoring a master or an agent (slave) node.

Master config

collector:
  mesos:
    all:
      master: "127.0.0.1:5050"

Agent config

collector:
  mesos:
    all:
      agent: "127.0.0.1:5051"

  4. The host agent provides a task configuration in /opt/appoptics/autoload/task-mesos-publish-appoptics.yaml. You shouldn’t need to change this file, but the default configuration is provided below for reference.

    version: 1
    schedule:
      type: cron
      interval: "0 * * * * *"
    workflow:
      collect:
        metrics:
          /mesos/*: {}
        publish:
        - plugin_name: publisher-appoptics
    
  5. Restart the host agent:

$ sudo service appoptics-snapteld restart
  6. Enable the Mesos plugin in the AppOptics UI.

On the Integrations Page you will see the Mesos plugin available if the previous steps were successful. If you do not see the plugin, see Troubleshooting.

Select the Mesos plugin to open the configuration menu in the UI, and enable the plugin.
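
If the plugin does not appear, one thing worth ruling out is the metrics endpoint itself. The cURL check from the Configuration steps above can be scripted roughly as follows. This is a sketch under assumptions: looks_like_snapshot and check_endpoint are hypothetical helper names, and the keys master/uptime_secs and slave/uptime_secs are assumed to appear in the snapshot of a master and an agent respectively. The check is a simple pattern match, not a full JSON validation.

```shell
# Hypothetical helper: reads a response body on stdin and succeeds if it
# looks like a JSON object that contains the given metric key.
looks_like_snapshot() {
    key="$1"
    body=$(cat)
    case "$body" in
        "{"*"\"$key\""*"}") return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: $1 is host:port, $2 is a key expected in the payload.
# -f makes curl fail on HTTP errors; --max-time bounds the wait to 5 seconds.
check_endpoint() {
    curl -fsS --max-time 5 "http://$1/metrics/snapshot" | looks_like_snapshot "$2"
}

# Usage (addresses are the defaults from the plugin configuration above):
#   check_endpoint 127.0.0.1:5050 master/uptime_secs && echo "master ready"
#   check_endpoint 127.0.0.1:5051 slave/uptime_secs  && echo "agent ready"
```

Separating the payload check from the network call keeps the matching logic testable without a running cluster.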

Metrics and Tags

The tables below outline the default set of metrics collected by the Mesos plugin, along with the optional metrics available.

Default Metrics

Namespace Description
mesos.master.master.cpus_percent Master CPU Usage % (gauge)
mesos.master.master.disk_percent Master disk usage % (gauge)
mesos.master.master.elected This metric indicates whether this is the elected master. Fetch this metric from all masters; the values should add up to 1. If the sum is not 1 for a sustained period of time, your system administrator should be notified (PagerDuty etc). (gauge)
mesos.master.master.mem_percent Master Memory Usage % (gauge)
mesos.master.master.messages_decline_offers This metric provides the number of declined offers. This number should equal the number of agents multiplied by the number of frameworks. If this number drops to a low value, something is probably getting starved. (counter)
mesos.master.master.messages_kill_task This metric provides the number of kill task messages. (counter)
mesos.master.master.recovery_slave_removals This metric provides the number of agents that were not re-registered during master failover. This is a broad endpoint that combines …reason_unhealthy, …reason_unregistered, and …reason_registered. You can monitor this explicitly, or use master.slave_removals.reason_unhealthy, master.slave_removals.reason_unregistered, and master.slave_removals.reason_registered for specifics. (counter)
mesos.agent.slave.uptime_secs This metric provides the agent uptime in seconds. This number should always be increasing. A reset to 0 indicates that the agent process has restarted. You can use this metric to detect “flapping”. For example, if the agent has an uptime of less than 1 minute (60 seconds) for more than 10 minutes, it has probably restarted 10 or more times. (gauge)
mesos.master.master.slave_removals This metric provides the number of agents removed for various reasons, including maintenance. Use this metric to detect network partitions after a large number of agents have disconnected. If this number deviates greatly from the previous number, your system administrator should be notified (PagerDuty etc). (counter)
mesos.master.master.slave_removals.reason_registered This metric provides the number of agents that were removed when new agents were registered at the same address. New agents replace old agents. This should be a rare event. If this number increases, your system administrator should be notified (PagerDuty etc). (counter)
mesos.master.master.slave_removals.reason_unhealthy This metric provides the number of agents removed because of failed health checks. This endpoint returns the total number of agents that were unhealthy. (counter)
mesos.master.master.slave_removals.reason_unregistered This metric provides the number of agents unregistered. If this number increases drastically, the master or agent is unable to communicate properly. Use this endpoint to detect network partitions. (counter)
mesos.master.master.slave_reregistrations This metric provides the number of agent re-registrations and restarts. Use this metric along with historical data to spot deviations and spikes that indicate a network partition. If this number increases drastically, the cluster has experienced an outage but has reconnected. (counter)
mesos.master.master.slaves_active This metric provides the number of active agents. The number of active agents is calculated by adding slaves_connected and slaves_disconnected. (counter)
mesos.master.master.slaves_disconnected This metric provides the number of disconnected agents. This metric is helpful along with master.slave_removals. If an agent disconnects this number will increase. If an agent reconnects this number will decrease. (gauge)
mesos.master.master.tasks_error This metric provides the number of invalid tasks. (counter)
mesos.master.master.tasks_failed This metric provides the number of failed tasks. (counter)
mesos.master.master.tasks_finished This metric provides the number of finished tasks. (counter)
mesos.master.master.tasks_killed This metric provides the number of killed tasks. (counter)
mesos.master.master.tasks_lost This metric provides the number of lost tasks. A lost task means a task was killed or disconnected by an external factor. Use this metric to detect when the number of lost tasks deviates significantly from its historical values. (counter)
mesos.master.master.tasks_running This metric provides the number of running tasks. (counter)
mesos.master.master.tasks_starting This metric provides the number of tasks starting. (counter)
mesos.master.master.uptime_secs This metric provides the master uptime in seconds. This number should be at least 5 minutes (300 seconds) to indicate a stable master. You can use this metric to detect “flapping”. For example if the master has an uptime of less than 1 minute (60 seconds) for more than 10 minutes it has probably restarted 10 or more times. (gauge)
mesos.agent.slave.cpus_percent Slaves CPU Usage % (gauge)
mesos.master.system.mem_free_bytes Slaves memory free bytes (counter)
mesos.agent.slave.mem_percent Slaves Memory Usage % (gauge)
mesos.agent.slave.disk_percent Slaves Disk Usage % (gauge)
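
Two of the rules described in the table above, the elected-master sum and the uptime-based flapping check, can be sketched as follows. This is an illustration under stated assumptions rather than anything the plugin provides: exactly_one_elected and is_flapping are hypothetical helper names, the values are passed in as plain arguments, uptime samples are assumed to be taken about once per minute, and "more than 10 minutes" is interpreted as more than 10 consecutive samples below 60 seconds.

```shell
# Hypothetical helper: succeeds only when the elected values reported by
# all masters (for example: 0 1 0) add up to exactly 1.
exactly_one_elected() {
    printf '%s\n' "$@" | awk '{ s += $1 } END { exit (s == 1 ? 0 : 1) }'
}

# Hypothetical helper: takes uptime_secs samples, oldest first, assumed to
# be one per minute. Succeeds (reports flapping) when more than 10
# consecutive samples are below 60 seconds.
is_flapping() {
    printf '%s\n' "$@" | awk -v limit=60 -v window=10 '
        $1 < limit { streak++; if (streak > window) found = 1; next }
        { streak = 0 }
        END { exit (found ? 0 : 1) }'
}

# Usage:
#   exactly_one_elected 0 1 0 || echo "alert: elected masters != 1"
#   is_flapping 5 12 30 8 44 3 20 15 9 50 2 && echo "alert: node is flapping"
```

In practice these thresholds would more likely be configured as alerts in the monitoring UI; the functions only make the arithmetic behind the table descriptions explicit.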

Tags

Tag Name Description
framework_id ID of the framework deployed on Mesos
executor_id ID of the respective task executor

Optional Metrics

Namespace
mesos.master.allocator.event_queue_dispatches
mesos.master.allocator.mesos.allocation_run_latency_ms
mesos.master.allocator.mesos.allocation_run_latency_ms.count
mesos.master.allocator.mesos.allocation_run_latency_ms.max
mesos.master.allocator.mesos.allocation_run_latency_ms.min
mesos.master.allocator.mesos.allocation_run_latency_ms.p50
mesos.master.allocator.mesos.allocation_run_latency_ms.p90
mesos.master.allocator.mesos.allocation_run_latency_ms.p95
mesos.master.allocator.mesos.allocation_run_latency_ms.p99
mesos.master.allocator.mesos.allocation_run_latency_ms.p999
mesos.master.allocator.mesos.allocation_run_latency_ms.p9999
mesos.master.allocator.mesos.allocation_run_ms
mesos.master.allocator.mesos.allocation_run_ms.count
mesos.master.allocator.mesos.allocation_run_ms.max
mesos.master.allocator.mesos.allocation_run_ms.min
mesos.master.allocator.mesos.allocation_run_ms.p50
mesos.master.allocator.mesos.allocation_run_ms.p90
mesos.master.allocator.mesos.allocation_run_ms.p95
mesos.master.allocator.mesos.allocation_run_ms.p99
mesos.master.allocator.mesos.allocation_run_ms.p999
mesos.master.allocator.mesos.allocation_run_ms.p9999
mesos.master.allocator.mesos.allocation_runs
mesos.master.allocator.mesos.event_queue_dispatches
mesos.master.allocator.mesos.offer_filters.roles.active
mesos.master.allocator.mesos.resources.cpus.offered_or_allocated
mesos.master.allocator.mesos.resources.cpus.total
mesos.master.allocator.mesos.resources.disk.offered_or_allocated
mesos.master.allocator.mesos.resources.disk.total
mesos.master.allocator.mesos.resources.mem.offered_or_allocated
mesos.master.allocator.mesos.roles.shares.dominant
mesos.master.framework.active
mesos.master.framework.id
mesos.master.framework.name
mesos.master.framework.offered_resources.cpus
mesos.master.framework.offered_resources.disk
mesos.master.framework.offered_resources.gpus
mesos.master.framework.offered_resources.mem
mesos.master.framework.resources.cpus
mesos.master.framework.resources.disk
mesos.master.framework.resources.gpus
mesos.master.framework.resources.mem
mesos.master.framework.used_resources.cpus
mesos.master.framework.used_resources.disk
mesos.master.framework.used_resources.gpus
mesos.master.framework.used_resources.mem
mesos.master.master.cpus_revocable_percent
mesos.master.master.cpus_revocable_total
mesos.master.master.cpus_revocable_used
mesos.master.master.cpus_total
mesos.master.master.cpus_used
mesos.master.master.disk_revocable_percent
mesos.master.master.disk_revocable_total
mesos.master.master.disk_revocable_used
mesos.master.master.disk_total
mesos.master.master.disk_used
mesos.master.master.dropped_messages
mesos.master.master.event_queue_dispatches
mesos.master.master.event_queue_http_requests
mesos.master.master.event_queue_messages
mesos.master.master.frameworks_active
mesos.master.master.frameworks_connected
mesos.master.master.frameworks_disconnected
mesos.master.master.frameworks_inactive
mesos.master.master.gpus_percent
mesos.master.master.gpus_revocable_percent
mesos.master.master.gpus_revocable_total
mesos.master.master.gpus_revocable_used
mesos.master.master.gpus_total
mesos.master.master.gpus_used
mesos.master.master.invalid_executor_to_framework_messages
mesos.master.master.invalid_framework_to_executor_messages
mesos.master.master.invalid_status_update_acknowledgements
mesos.master.master.invalid_status_updates
mesos.master.master.mem_revocable_percent
mesos.master.master.mem_revocable_total
mesos.master.master.mem_revocable_used
mesos.master.master.mem_total
mesos.master.master.mem_used
mesos.master.master.messages_authenticate
mesos.master.master.messages_deactivate_framework
mesos.master.master.messages_executor_to_framework
mesos.master.master.messages_exited_executor
mesos.master.master.messages_framework_to_executor
mesos.master.master.messages_launch_tasks
mesos.master.master.messages_reconcile_tasks
mesos.master.master.messages_register_framework
mesos.master.master.messages_register_slave
mesos.master.master.messages_reregister_framework
mesos.master.master.messages_reregister_slave
mesos.master.master.messages_resource_request
mesos.master.master.messages_revive_offers
mesos.master.master.messages_status_update
mesos.master.master.messages_status_update_acknowledgement
mesos.master.master.messages_suppress_offers
mesos.master.master.messages_unregister_framework
mesos.master.master.messages_unregister_slave
mesos.master.master.messages_update_slave
mesos.master.master.outstanding_offers
mesos.master.master.slave_registrations
mesos.master.master.slave_removals
mesos.master.master.slave_removals.reason_registered
mesos.master.master.slave_removals.reason_unhealthy
mesos.master.master.slave_removals.reason_unregistered
mesos.master.master.slave_shutdowns_canceled
mesos.master.master.slave_shutdowns_completed
mesos.master.master.slave_shutdowns_scheduled
mesos.master.master.slave_unreachable_canceled
mesos.master.master.slave_unreachable_completed
mesos.master.master.slave_unreachable_scheduled
mesos.master.master.slaves_connected
mesos.master.master.slaves_inactive
mesos.master.master.slaves_unreachable
mesos.master.master.tasks_dropped
mesos.master.master.tasks_gone
mesos.master.master.tasks_gone_by_operator
mesos.master.master.tasks_killing
mesos.master.master.tasks_staging
mesos.master.master.tasks_unreachable
mesos.master.master.valid_executor_to_framework_messages
mesos.master.master.valid_framework_to_executor_messages
mesos.master.master.valid_status_update_acknowledgements
mesos.master.master.valid_status_updates
mesos.master.registrar.log.ensemble_size
mesos.master.registrar.log.recovered
mesos.master.registrar.queued_operations
mesos.master.registrar.registry_size_bytes
mesos.master.registrar.state_fetch_ms
mesos.master.registrar.state_store_ms
mesos.master.registrar.state_store_ms.count
mesos.master.registrar.state_store_ms.max
mesos.master.registrar.state_store_ms.min
mesos.master.registrar.state_store_ms.p50
mesos.master.registrar.state_store_ms.p90
mesos.master.registrar.state_store_ms.p95
mesos.master.registrar.state_store_ms.p99
mesos.master.registrar.state_store_ms.p999
mesos.master.registrar.state_store_ms.p9999
mesos.master.system.cpus_total
mesos.master.system.load_15min
mesos.master.system.load_1min
mesos.master.system.load_5min
mesos.master.system.mem_total_bytes
mesos.agent.containerizer.fetcher.cache_size_total_bytes
mesos.agent.containerizer.fetcher.cache_size_used_bytes
mesos.agent.containerizer.fetcher.task_fetches_failed
mesos.agent.containerizer.fetcher.task_fetches_succeeded
mesos.agent.containerizer.mesos.container_destroy_errors
mesos.agent.containerizer.mesos.provisioner.bind.remove_rootfs_errors
mesos.agent.containerizer.mesos.provisioner.remove_container_errors
mesos.agent.executor.executor_id
mesos.agent.executor.executor_name
mesos.agent.executor.framework_id
mesos.agent.executor.source
mesos.agent.executor.statistics.cpus_limit
mesos.agent.executor.statistics.cpus_system_time_secs
mesos.agent.executor.statistics.cpus_user_time_secs
mesos.agent.executor.statistics.mem_limit_bytes
mesos.agent.executor.statistics.mem_rss_bytes
mesos.agent.executor.statistics.timestamp
mesos.agent.slave.container_launch_errors
mesos.agent.slave.cpus_revocable_percent
mesos.agent.slave.cpus_revocable_total
mesos.agent.slave.cpus_revocable_used
mesos.agent.slave.cpus_total
mesos.agent.slave.cpus_used
mesos.agent.slave.disk_revocable_percent
mesos.agent.slave.disk_revocable_total
mesos.agent.slave.disk_revocable_used
mesos.agent.slave.disk_total
mesos.agent.slave.disk_used
mesos.agent.slave.executor_directory_max_allowed_age_secs
mesos.agent.slave.executors_preempted
mesos.agent.slave.executors_registering
mesos.agent.slave.executors_running
mesos.agent.slave.executors_terminated
mesos.agent.slave.executors_terminating
mesos.agent.slave.frameworks_active
mesos.agent.slave.gpus_percent
mesos.agent.slave.gpus_revocable_percent
mesos.agent.slave.gpus_revocable_total
mesos.agent.slave.gpus_revocable_used
mesos.agent.slave.gpus_total
mesos.agent.slave.gpus_used
mesos.agent.slave.invalid_framework_messages
mesos.agent.slave.invalid_status_updates
mesos.agent.slave.mem_revocable_percent
mesos.agent.slave.mem_revocable_total
mesos.agent.slave.mem_revocable_used
mesos.agent.slave.mem_total
mesos.agent.slave.mem_used
mesos.agent.slave.recovery_errors
mesos.agent.slave.registered
mesos.agent.slave.tasks_failed
mesos.agent.slave.tasks_finished
mesos.agent.slave.tasks_gone
mesos.agent.slave.tasks_killed
mesos.agent.slave.tasks_killing
mesos.agent.slave.tasks_lost
mesos.agent.slave.tasks_running
mesos.agent.slave.tasks_staging
mesos.agent.slave.tasks_starting
mesos.agent.slave.valid_framework_messages
mesos.agent.slave.valid_status_updates
mesos.agent.system.cpus_total
mesos.agent.system.load_15min
mesos.agent.system.load_1min
mesos.agent.system.load_5min
mesos.agent.system.mem_free_bytes
mesos.agent.system.mem_total_bytes