Using the support-bundles Tool

The support-bundles tool collects various types of bundles, or modes, of Private Cloud Appliance diagnostic data such as health check status, command outputs, and logs.

Depending on the command options provided, these bundles might contain logs, status information, or both. All modes collect files into a bundle directory. Only one support bundle process can run at a time: a lock file is created at the beginning of bundle collection and removed when collection is complete.

All support-bundles commands return immediately, and the bundle collection runs in the background. This is because bundle collections might take hours to complete. Bundles are stored for two days, then automatically deleted.

The following types of bundles are supported:

  • Triage Mode. Collects data about the current status of the Private Cloud Appliance.

  • Time Slice Mode. Collects data by time slots. These results can be further narrowed by specifying pod name, job, and k8s_app label.

  • Combo Mode. Collects a combination of triage and time slice data.

  • Native Mode. Collects data from management, compute, and ZFS nodes and from ILOM and Cisco hosts.

A good way to start investigating an issue is to collect a combo bundle. Look for NOT_HEALTHY in the triage results, then compare against the corresponding time_slice results.

The support-bundles command requires a mode option. All modes accept the service request number option. See the following table. Time slice and native modes have additional options.

-m mode
    The type of bundle. Required.

-sr SR_number, --sr_number SR_number
    The service request number. Optional.

The support-bundles command output is stored in the following directory on the management node, where bundle-type is the mode: triage, time_slice, combo, or native:

/nfs/shared_storage/support_bundles/SR_number_bundle-type-bundle_timestamp/

The SR_number prefix appears only if you provided the -sr option. If you are creating the support bundle for a service request, specify the SR_number.

This directory contains a bundle collection progress file and an archive file, which are named as follows:

bundle-type_collection.log
SR_number_bundle-type-bundle_timestamp.tar.gz

The archive file contains a header.json file with the following default components:

  • current-time - the timestamp

  • create-support-bundle - the command line that was used

  • sr-number - the SR number associated with the archive file
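
The naming scheme above can be sketched in shell form. This is illustrative only: the SR number is a placeholder, and the timestamp format used here is an assumption, so substitute the actual names from your bundle directory.

```shell
# Compose the expected bundle paths from the naming scheme above.
# SR is a placeholder and TS uses an assumed timestamp format.
SR="3-xxxxxxxxxxx"
TYPE="triage"
TS=$(date -u +"%Y-%m-%dT%H-%M-%S")
BASE="/nfs/shared_storage/support_bundles/${SR}_${TYPE}-bundle_${TS}"
echo "Progress log: ${BASE}/${TYPE}_collection.log"
echo "Archive:      ${BASE}/${SR}_${TYPE}-bundle_${TS}.tar.gz"
```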

Logging in to the Management Node

To use the support-bundles command, log in as root to the management node that is running the Pacemaker resources. Collect data from that node first, then from other management nodes as needed.

If you do not know which management node is running the Pacemaker resources, log in to any management node and check the Pacemaker cluster status. In the following example output, the Pacemaker cluster resources are running on pcamn01.

[root@pcamn01 ~]# pcs status
Cluster name: mncluster
Stack: corosync
Current DC: pcamn01
...
Full list of resources:

scsi_fencing (stonith:fence_scsi): Stopped (disabled)
Resource Group: mgmt-rg
vip-mgmt-int (ocf::heartbeat:IPaddr2): Started pcamn01
vip-mgmt-host (ocf::heartbeat:IPaddr2): Started pcamn01
vip-mgmt-ilom (ocf::heartbeat:IPaddr2): Started pcamn01
vip-mgmt-lb (ocf::heartbeat:IPaddr2): Started pcamn01
vip-mgmt-ext (ocf::heartbeat:IPaddr2): Started pcamn01
l1api (systemd:l1api): Started pcamn01
haproxy (ocf::heartbeat:haproxy): Started pcamn01
pca-node-state (systemd:pca_node_state): Started pcamn01
dhcp (ocf::heartbeat:dhcpd): Started pcamn01
hw-monitor (systemd:hw_monitor): Started pcamn01

Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
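
If you need this check in a script, the active node can be parsed from the pcs status output. The following is a minimal sketch using a sample of the output above; in practice, pipe `pcs status` into the awk command directly.

```shell
# Parse the node that hosts the mgmt-rg resource group from `pcs status` output.
# Sample output is embedded here for illustration; replace with: pcs status
pcs_output='Resource Group: mgmt-rg
 vip-mgmt-int (ocf::heartbeat:IPaddr2): Started pcamn01
 vip-mgmt-host (ocf::heartbeat:IPaddr2): Started pcamn01'
# Print the node name from the first "Started" line.
active_node=$(printf '%s\n' "$pcs_output" | awk '/Started/ {print $NF; exit}')
echo "$active_node"
```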

Triage Mode

In triage mode, Prometheus platform_health_check is queried for both HEALTHY and NOT_HEALTHY status. If NOT_HEALTHY is found, use time_slice mode to get more detail.

# support-bundles -m triage

The following files are in the output archive file.

header.json
    Timestamp and command line used to generate this bundle.

compute_node_info.json
    Pods running on the compute nodes.

hardware_info.json
    Hardware component list retrieved from hms, ipmitool fru output from all management and compute nodes in the ready state, and information about the ZFSSA heads.

management_node_info.json
    Pods running on the management nodes.

rack_info.json
    Rack installation time and build version.

loki_search_results.log.n
    Chunk files in JSON format.
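
Once the triage archive is extracted, a quick scan for NOT_HEALTHY entries points you at the failing checks. The following is a minimal sketch; the sample lines below are illustrative stand-ins for extracted bundle content, and in a real bundle directory you would run grep over the extracted files instead.

```shell
# Scan for failing health checks; the sample lines below stand in for
# extracted bundle files (the field names here are illustrative).
sample='{"check":"flannel-checker","status":"NOT_HEALTHY"}
{"check":"etcd-checker","status":"HEALTHY"}'
printf '%s\n' "$sample" | grep NOT_HEALTHY
# In a real extracted bundle directory, use: grep -rl NOT_HEALTHY .
```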

Time Slice Mode

In time slice mode, data is collected by specifying start and end timestamps. Both of the following options are required:

  • -s start_date

  • -e end_date

Time slice mode has the following options in addition to the mode and service request number options. These options help narrow the data collection.

  • Only one of --job_name, --all, and --k8s_app can be specified.

  • If none of --job_name, --all, or --k8s_app is specified, the default job filter (.+checker) is applied, which matches all health checker jobs.

  • The --all option can collect a huge amount of data. You might want to limit your time slice to 48 hours.

Example:

# support-bundles -m time_slice -j flannel-checker -s 2021-05-29T22:40:00.000Z \
-e 2021-06-29T22:40:00.000Z -l INFO

See more examples below.

-s timestamp, --start_date timestamp
    Start date in the format yyyy-mm-ddTHH:mm:ss. The minimum argument is yyyy-mm-dd. Required.

-e timestamp, --end_date timestamp
    End date in the format yyyy-mm-ddTHH:mm:ss. The minimum argument is yyyy-mm-dd. Required.

-j job_name, --job_name job_name
    Loki job name. Default value: .+checker. See Label List Query below. Optional.

--k8s_app label
    The k8s_app label value to query within the k8s-stdout-logs job. See Label List Query below. Optional.

--all
    Queries all job names except jobs known to log excessively, such as audit, kubernetes-audit, and vault-audit, and the k8s_app label pcacoredns. Optional.

-l level, --levelname level
    Message level. Optional.

--pod_name pod_name
    The pod name (such as kube or network-checker) used to filter output by pod. Only the initial letters of the name are necessary. Optional.

-t timeout, --timeout timeout
    Timeout in seconds for a single Loki query. Default value: 180 seconds. Optional.

Label List Query

Use the label list query to list the available job names and k8s_app label values.

# support-bundles -m label_list
2021-10-14T23:19:18.265 - support_bundles - INFO - Starting Support Bundles
2021-10-14T23:19:18.317 - support_bundles - INFO - Locating filter-logs Pod
2021-10-14T23:19:18.344 - support_bundles - INFO - Executing command - ['python3', 
'/usr/lib/python3.6/site-packages/filter_logs/label_list.py']
2021-10-14T23:19:18.666 - support_bundles - INFO -
Label:  job
Values: ['admin', 'api-server', 'asr-client', 'asrclient-checker', 'audit', 'cert-checker', 'ceui', 
'compute', 'corosync', 'etcd', 'etcd-checker', 'filesystem', 'filter-logs', 'flannel-checker', 
'his', 'hms', 'iam', 'k8s-stdout-logs', 'kubelet', 'kubernetes-audit', 'kubernetes-checker', 
'l0-cluster-services-checker', 'messages', 'mysql-cluster-checker', 'network-checker', 'ovm-agent', 
'ovn-controller', 'ovs-vswitchd', 'ovsdb-server', 'pca-healthchecker', 'pca-nwctl', 'pca-platform-l0', 
'pca-platform-l1api', 'pca-upgrader', 'pcsd', 'registry-checker', 'sauron-checker', 'secure', 
'storagectl', 'uws', 'vault', 'vault-audit', 'vault-checker', 'zfssa-checker', 'zfssa-log-exporter']
 
Label:  k8s_app
Values: ['admin', 'api', 'asr-client', 'asrclient-checker', 'brs', 'cert-checker', 'compute', 
'default-http-backend', 'dr-admin', 'etcd', 'etcd-checker', 'filesystem', 'filter-logs', 
'flannel-checker', 'fluentd', 'ha-cluster-exporter', 'has', 'his', 'hms', 'iam', 'ilom', 
'kube-apiserver', 'kube-controller-manager', 'kube-proxy', 'kubernetes-checker', 
'l0-cluster-services-checker', 'loki', 'loki-bnr', 'mysql-cluster-checker', 'mysqld-exporter', 
'network-checker', 'pcacoredns', 'pcadnsmgr', 'pcanetwork', 'pcaswitchmgr', 'prometheus', 'rabbitmq', 
'registry-checker', 'sauron-api', 'sauron-checker', 'sauron-grafana', 'sauron-ingress-controller', 
'sauron-mandos', 'sauron-operator', 'sauron-prometheus', 'sauron-prometheus-gw', 
'sauron-sauron-exporter', 'sauron.oracledx.com', 'storagectl', 'switch-metric', 'uws', 'vault-checker', 
'vmconsole', 'zfssa-analytics-exporter', 'zfssa-csi-nodeplugin', 'zfssa-csi-provisioner', 'zfssa-log-exporter']

Examples:

  • No job label, no k8s_app label; collect logs from all health checkers.

    # support-bundles -m time_slice -sr 3-xxxxxxxxxxx -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

  • One job, ceui.

    # support-bundles -m time_slice -sr 3-xxxxxxxxxxx -j ceui -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

  • One k8s_app label, network-checker.

    # support-bundles -m time_slice -sr 3-xxxxxxxxxxx --k8s_app network-checker -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"

  • Default jobs, with the time window generated by the date command.

    # support-bundles -m time_slice -sr 3-xxxxxxxxxxx -s `date -d "2 days ago" -u +"%Y-%m-%dT%H:%M:%S.000Z"` -e `date -u +"%Y-%m-%dT%H:%M:%S.000Z"`

  • All jobs.

    # support-bundles -m time_slice -sr 3-xxxxxxxxxxx --all -s "2022-01-11T00:00:00" -e "2022-01-12T23:59:59"
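
The date-based example relies on GNU date. A minimal sketch for generating a recent time window in the expected format (here, the 48 hours suggested above for --all collections) is:

```shell
# Build a 48-hour window ending now, in the timestamp format that
# time_slice expects (requires GNU date, as on Oracle Linux).
START=$(date -u -d "2 days ago" +"%Y-%m-%dT%H:%M:%S.000Z")
END=$(date -u +"%Y-%m-%dT%H:%M:%S.000Z")
# Print the resulting command rather than running it.
echo "support-bundles -m time_slice --all -s $START -e $END"
```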

The following files are in the output archive file.

header.json
    Timestamp and command line used to generate this bundle.

loki_search_results.log.n
    Chunk files in JSON format. Time slice bundles have a limit of 500,000 logs per query, counted from the start time.

rack_info.json
    Rack installation time and build version.

Combo Mode

The combo mode is a combination of a triage bundle and a time slice bundle. The output includes an archive file and two collection log files: triage_collection.log and time_slice_collection.log.

The following files are in the output archive file.

triage-bundle_timestamp.tar.gz
    The triage bundle archive file.

time_slice-bundle_timestamp.tar.gz
    The time slice bundle archive file.

The time slice data collected covers --all jobs from one hour before the current time to the current time.

Native Mode

The native_collection.log file in the bundle directory provides collection progress information. Native bundles can take hours to collect.

The native mode has the following parameters in addition to mode and SR number.

-t nativetype, --type nativetype
    The type of native bundle: zfs_bundle, sosreport, ilom_snapshot, or cisco_bundle. Default value: zfs_bundle. Optional.

-c component, --component component
    Component name, such as the name of a management, compute, or ZFS node, or an ILOM or Cisco host. Optional.

The following files are in the output archive file.

header.json
    Timestamp and command line used to generate this bundle.

Native bundle files
    Files specific to the nativetype specified.

rack_info.json
    Rack installation time and build version.

ZFS Bundle

When nativetype is zfs_bundle, collection starts on both ZFS nodes, and the new ZFS support bundles are downloaded into the bundle directory. When nativetype is not specified, zfs_bundle is collected by default.

# support-bundles -m native -t zfs_bundle

SOS Report Bundle

When nativetype is sosreport, the report is collected from the management node or compute node specified by the --component parameter. If --component is not specified, the report is collected from all management and compute nodes.

# support-bundles -m native -t sosreport -c pcamn01

ILOM Snapshot

When nativetype is ilom_snapshot, the value of the --component parameter is the ILOM host name of a management node or compute node. If the --component parameter is not specified, the report is collected from all ILOM hosts.

# support-bundles -m native -t ilom_snapshot -c ilom-pcacn007

Cisco Bundle

When nativetype is cisco-bundle, the value of the --component parameter is an internal Cisco management, aggregation, or access switch management host name.

# support-bundles -m native -t cisco-bundle -c accsn01

To create a cisco-bundle type of collection, the following conditions must be met:

  • The Cisco OBFL module must be enabled on all Private Cloud Appliance Cisco switches. It is enabled by default.

  • The Cisco EEM module must be enabled on all Private Cloud Appliance Cisco switches. It is enabled by default.

  • EEM (Embedded Event Manager) policy