Metrics Specification
Cyral publishes the metrics described here to track data activity and system health. Cyral metrics use a standard set of labels, listed at the end of this page.
Metrics format and exposure
The Cyral sidecar exposes a port that responds with metrics conforming to the OpenMetrics specification. The metrics port defaults to `9000`, but can be changed through configuration.
Refer to your sidecar deployment option documentation for configuration options and service discovery snippets.
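As a quick sanity check, you can fetch the metrics endpoint directly. The following is a minimal sketch, assuming the sidecar is reachable at `cyral-sidecar` on the default port (substitute your own address and port):

```python
import requests

# Scrape the sidecar metrics endpoint directly. The hostname and port are
# assumptions; substitute the address and metrics port of your deployment.
response = requests.get("http://cyral-sidecar:9000/metrics", timeout=5)
response.raise_for_status()

# Print the first few exposition lines to confirm the endpoint is serving metrics.
for line in response.text.splitlines()[:10]:
    print(line)
```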
Metric filtering
By default, all Cyral metrics are exposed through the metrics endpoint, but you can filter metrics by name in two different ways. At deployment time, you can set a default regex that filters the metrics returned by the endpoint by name. You can also override that default by setting the `name_regex` query parameter when scraping the metrics endpoint from your metrics scraper.
For example, to filter out any metrics that do not start with `cyral_`, you would configure your scraper to hit the following endpoint (with the default port):
cyral-sidecar:9000/metrics?name_regex="^cyral_.*"
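To verify the filter outside of your scraper, the same query parameter can be passed manually. A minimal sketch in Python, assuming the sidecar address shown above:

```python
import requests

# Host and port are assumptions; the regex value mirrors the example above.
# requests URL-encodes the query parameter for us.
params = {"name_regex": '"^cyral_.*"'}
response = requests.get("http://cyral-sidecar:9000/metrics", params=params, timeout=5)
response.raise_for_status()
print(response.text)
```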
Up metric
An `up` metric is exposed for each service inside the Cyral sidecar, along with a `sidecar_service_name` label. This metric indicates whether the given service responded to the latest metric poll.
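A monitoring script can use this metric to list which internal services answered the last poll. A minimal sketch using the prometheus_client parser (the sidecar address is an assumption):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Sidecar address is an assumption; adjust for your deployment.
text = requests.get("http://cyral-sidecar:9000/metrics", timeout=5).text

for family in text_string_to_metric_families(text):
    if family.name != "up":
        continue
    for sample in family.samples:
        service = sample.labels.get("sidecar_service_name", "unknown")
        state = "responding" if sample.value == 1 else "not responding"
        print(f"{service}: {state}")
```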
Health status metric
Each sidecar instance exposes a metric that represents its health status.
The metric has values and labels as follows:
Status | Metric Value | Labels |
---|---|---|
unknown | 0 | status="unknown" |
healthy | 1 | status="healthy" |
degraded | 2 | status="degraded",failed_components="component1;component2..." |
unhealthy | 3 | status="unhealthy",failed_components="component1;component2..." |
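When alerting on this metric, the numeric value and the failed_components label carry the useful information. Below is a minimal sketch of how a monitoring script might interpret a scraped sample, based on the table above (the sample values are illustrative):

```python
# Value-to-status mapping taken from the table above.
HEALTH_STATUS = {0: "unknown", 1: "healthy", 2: "degraded", 3: "unhealthy"}

def describe(value: int, labels: dict) -> str:
    """Turn a scraped health-status sample into a human-readable summary."""
    status = HEALTH_STATUS.get(value, "unknown")
    if value >= 2:  # degraded and unhealthy samples carry failed_components
        components = labels.get("failed_components", "").split(";")
        return f"{status}: failing components {components}"
    return status

# Illustrative sample, not real scrape output.
print(describe(2, {"status": "degraded", "failed_components": "component1;component2"}))
```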
Sidecar System Metrics
CPU
This metric tracks CPU utilization of the sidecar compute node. Short spikes are acceptable, but sustained use above 80% may cause performance degradation in the sidecar. This should be monitored at the individual sidecar instance level, not aggregated across a group of sidecar instances.
Recommendation: Increase the size of the autoscaling group to provision additional nodes or increase the capacity of each node in the autoscaling group.
Memory
This metric tracks memory utilization of the sidecar compute node. Short spikes are acceptable, but sustained use above 80% may increase the risk that the sidecar instance will have to restart. This should be monitored at an individual sidecar instance level, and not aggregated across a group of sidecar instances.
Recommendations:
Increase the size of the autoscaling group to provision additional nodes or increase the capacity of each node in the autoscaling group.
Consider applying a sidecar memory budget. For deployments that need to keep sidecar memory usage under a certain threshold, Cyral provides an optional mechanism to enforce memory budgets. This feature is disabled by default to ensure every query is analyzed.
If enabled, the sidecar memory budget limits the amount of memory the sidecar uses while parsing and analyzing queries and responses. In particular, the budget sets an upper bound on the maximum query/response size that will be analyzed. Queries/responses larger than what the current memory budget allows will not be analyzed.
tip
Please contact Cyral support for help changing the memory budget or setting the optimal budget for your use case.
Disk
The sidecar does not store any persistent data other than logs, which should be automatically rotated as they are forwarded to a log collector service. Sustained sidecar disk utilization above 50% should be investigated to ensure that log rotation is behaving correctly. Reaching 100% disk utilization may result in the sidecar compute node restarting.
Recommendation: Investigate the cause for increasing disk consumption by connecting to the sidecar instance and looking at the volume. Resolve by ensuring that log rotation is correctly configured.
Cyral Counters
System Health Metrics
Metric Name | Description |
---|---|
cyral_open_client_conns_count | Number of client connections established |
cyral_closed_connection_count | Number of monitored client connections closed |
cyral_query_duration_sum | Cumulative sum of query execution duration |
cyral_wire_dial_errors_count | Number of times wire was unreachable |
cyral_repo_dial_errors_count | Number of times repository was unreachable |
go_memstats_heap_inuse_bytes | Memory used by sidecar applications |
go_goroutines | Number of Goroutines the sidecar is using |
cyral_bypass_wire_count | Number of connections that went through bypass mode due to the unavailability of the wire |
See below for more detailed descriptions of these metrics.
Open connections
Calculated as: cyral_open_client_conns_count - cyral_closed_connection_count
This metric counts the number of concurrent connections to the sidecar/repo. It can be used to alert if the number of connections falls outside an expected range, such as:
- `Connections = 0`: may indicate a problem if an app is expected to maintain a persistent connection.
- `Connections < x`: a deviation from normal, based on expected use of the data, may indicate an issue.
Recommendation: If the number of connections falls outside the expected bounds, investigate the access logs to understand the behavior change. If connections have increased, the logs will reveal which client is driving the additional traffic. If connections have dropped, investigate the application for an outage, failed authentication, or similar issues.
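The calculation above can be evaluated directly against scraped counter values. A minimal sketch, assuming both counters are available from a single scrape and that an expected range of 1 to 200 concurrent connections suits the workload (tune these assumptions to your environment):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Expected connection range is an assumption; tune it to your workload.
MIN_EXPECTED, MAX_EXPECTED = 1, 200

# Sidecar address is an assumption; adjust for your deployment.
text = requests.get("http://cyral-sidecar:9000/metrics", timeout=5).text

totals = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        # Sum across label sets (e.g. per repo) to get sidecar-wide totals.
        totals[sample.name] = totals.get(sample.name, 0) + sample.value

open_conns = totals.get("cyral_open_client_conns_count", 0) - totals.get(
    "cyral_closed_connection_count", 0
)
if not MIN_EXPECTED <= open_conns <= MAX_EXPECTED:
    print(f"ALERT: {open_conns} open connections is outside the expected range")
```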
Average query duration
Calculated as: increase(cyral_query_duration_sum[1m]) / increase(cyral_query_duration_count[1m])
This metric records the average time taken for a query and can be used as an indicator of degraded application performance. An increase may indicate an issue with either the sidecar or the application. Note that this is an average over many queries and may not be indicative if queries are run on an ad hoc basis.
Recommendation: If the average query duration increases, check that the sidecar CPU/memory and repository CPU/memory are not reaching their limits. Using the access logs, determine whether all queries are taking longer than before, or only a subset. Investigate whether the nature of some or all queries has become more complex, resulting in longer repository processing time.
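If these metrics are already scraped into Prometheus, the same expression can be evaluated through the Prometheus HTTP API. A minimal sketch, assuming a Prometheus server reachable at `prometheus:9090`:

```python
import requests

# The Prometheus address is an assumption; point this at your own server.
PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = (
    "increase(cyral_query_duration_sum[1m]) / "
    "increase(cyral_query_duration_count[1m])"
)

result = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()
for series in result["data"]["result"]:
    # Each result carries its label set and the instant value as a string.
    print(series["metric"], series["value"][1], "average query duration")
```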
Dial errors
- `cyral_wire_dial_errors_count` measures errors in internal sidecar communication between services.
- `cyral_repo_dial_errors_count` measures errors in external communication with the repo.
These metrics indicate an error communicating internally among sidecar services or externally with the repository, respectively. A single, infrequent event may not be of concern, but a large number of events or an increase in frequency may indicate a connectivity or authentication issue.
Recommendation: Wire dial errors should be reported to Cyral. Repo dial errors indicate that the sidecar is unable to reach the configured repository. Check that the repository endpoint is correctly configured in the Cyral console, and that any security groups on the repository allow traffic from the sidecar on the configured port.
Golang memory usage
The `go_memstats_heap_inuse_bytes` metric reports how much memory the Cyral sidecar applications are using. A constant increase of this value, or reaching 80% of the node's capacity, could indicate a memory leak and may result in the sidecar restarting.
Recommendation: Report to Cyral for investigation
Goroutines
The `go_goroutines` metric represents how many goroutines the sidecar is using. Like memory, a constant increase may indicate a leak and should be investigated.
Recommendation: Report to Cyral for investigation
Unanalyzed queries
The `cyral_bypass_wire_count` metric can alert you when some database traffic is not being fully monitored by Cyral.
The sidecar supports a mode ("Enter passthrough mode on failure") which prioritizes data access over monitoring. This means that if an internal component of the sidecar has an issue, the sidecar will attempt to ensure traffic is still directed to the repository, even if analysis and monitoring cannot occur. Small increases in this metric may indicate a complex query that is not being analyzed correctly and should be reported to Cyral. Large increases (in line with the increase in queries) suggest that the sidecar has a partial failure; it should be investigated and restarted if the condition persists.
Recommendation: Report to Cyral for investigation
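To judge whether an increase is large relative to overall traffic, one option is to compare the bypass counter against the rate of new client connections. A minimal sketch using the Prometheus HTTP API (the server address, window, and choice of denominator are assumptions; pick whatever best represents traffic in your environment):

```python
import requests

# Prometheus address and window are assumptions; adjust to your environment.
PROM_URL = "http://prometheus:9090/api/v1/query"
# Share of new client connections that bypassed analysis over the last 10 minutes.
QUERY = (
    "increase(cyral_bypass_wire_count[10m]) / "
    "increase(cyral_open_client_conns_count[10m])"
)

result = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()
for series in result["data"]["result"]:
    share = float(series["value"][1])
    print(f"{share:.1%} of new connections bypassed analysis")
```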
Application Health Metrics
Metric Name | Description |
---|---|
cyral_authentication_failure_count | Number of authentication failures |
cyral_portscan_count | Number of port scans |
cyral_policy_violation_count | Number of queries that have resulted in policy violations |
cyral_blocked_queries_count | Number of queries blocked by Cyral due to policy violations |
cyral_queries_with_errors | Number of queries that resulted in database errors |
Authentication failures
The `cyral_authentication_failure_count` metric counts the number of authentication failures. A small increase in this metric may be due to someone mistyping a password. A moderate increase may indicate an incorrect or changed password in the repository. A large increase may indicate an attacker or attempted breach of the repository and should be investigated.
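One way to distinguish these cases is to alert on the rate of increase rather than the raw counter. A minimal sketch using the Prometheus HTTP API (the server address and thresholds are assumptions):

```python
import requests

# Prometheus address and thresholds are assumptions; adjust to your environment.
PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = "sum(increase(cyral_authentication_failure_count[15m]))"
WARN, CRITICAL = 5, 50  # failures per 15 minutes

result = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()
samples = result["data"]["result"]
failures = float(samples[0]["value"][1]) if samples else 0.0

if failures >= CRITICAL:
    print(f"CRITICAL: {failures:.0f} authentication failures in 15m, possible breach attempt")
elif failures >= WARN:
    print(f"WARNING: {failures:.0f} authentication failures in 15m, check repository credentials")
```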
Port scans
The `cyral_portscan_count` metric indicates that a client attempted to connect to the sidecar, did not progress the connection or provide any authentication, and then terminated the connection. This is typically done by an attacker scanning a network to discover which addresses/ports are open before attempting to connect. In a private/restricted network, these events are not expected and should be investigated.
Policy violations
- The `cyral_policy_violation_count` metric indicates how many policy violations have occurred. This can be used for verifying a policy before blocking mode is enabled, monitoring for malicious users, or detecting queries from applications that may not behave as expected.
- The `cyral_blocked_queries_count` metric indicates how many queries have been blocked due to policy violations. Increases may indicate a malicious user, or an application unable to complete its function (due to misconfigured code or a misconfigured policy).
Query errors
The `cyral_queries_with_errors` metric indicates that an error occurred at the repository while processing the request.
Labels
Label | Type | Description |
---|---|---|
repo_id | string | Repository ID |
repo_name | string | Repository name |
repo_type | string | Repository type (MySQL, PostgreSQL, and so on) |
client_host | string | Client IP address |
client_tls | boolean | Whether the client connected to the sidecar using TLS |
repo_tls | boolean | Whether the sidecar connected to the repository using TLS |
sensitive_request | boolean | Whether the request accessed sensitive data |
end_user | string | The user (SSO user or native data repository user) who connected to the repository |
service_name | string | The service that connected to the repository |
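These labels are useful for grouping and filtering when querying Cyral metrics from a time-series backend. For example, the following sketch totals policy violations per repository using the repo_name label via the Prometheus HTTP API (the server address is an assumption):

```python
import requests

# Prometheus address is an assumption; adjust to your environment.
PROM_URL = "http://prometheus:9090/api/v1/query"
# Group the policy violation counter by the repo_name label.
QUERY = "sum by (repo_name) (cyral_policy_violation_count)"

result = requests.get(PROM_URL, params={"query": QUERY}, timeout=5).json()
for series in result["data"]["result"]:
    print(series["metric"].get("repo_name", "<none>"), series["value"][1])
```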