Increasing CPU Cycle Reservation for Orchestrator VMs
Cisco ACI Multi-Site Orchestrator VMs require a dedicated CPU cycle reservation to function optimally. New deployments set the necessary CPU reservation automatically, but if you upgraded from a release prior to Release 2.1(1), you must manually adjust each Orchestrator VM's settings.
Why Increase CPU Reservation?
Properly configuring CPU cycle reservations can help resolve or prevent various unpredictable issues, such as:
Delayed GUI Loading: Orchestrator GUI elements may require multiple attempts to load.
Node Status Fluctuations: Nodes might intermittently switch to an "Unknown" status before reverting to "Ready" on their own.
# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
t8wl1zoke0vpxdl9fysqu9otb node1 Ready Active Reachable 18.03.0-ce
kyriihdfhy1k1tlggan6e1ahs * node2 Unknown Active Reachable 18.03.0-ce
yburwactxd86dorindmx8b4y1 node3 Ready Active Leader 18.03.0-ce
Heartbeat Failures: Transient heartbeat misses may appear in the Orchestrator logs, indicating communication issues between the nodes.
node2 dockerd: [...] level=error msg="agent: session failed" backoff=100ms
error="rpc error: code = Canceled desc = context canceled" module=node/agent [...]
node2 dockerd: [...] level=error msg="heartbeat to manager [...] failed"
error="rpc error: code = Canceled desc = context canceled" [...]
Enabling NTP for Orchestrator Nodes
Clock synchronization is crucial for Orchestrator nodes. Without it, you might encounter issues like random GUI session log-offs due to expired authentication tokens.
Procedure to Enable NTP
1) Log in directly to one of the Orchestrator VMs.
2) Navigate to the scripts directory:
cd /opt/cisco/msc/scripts
3) Configure NTP settings using the svm-msc-tz-ntp script to set the time zone and enable NTP.
Parameters:
-tz <time-zone>: Specify your time zone (e.g., US/Pacific).
-ne: Enable NTP.
-ns <ntp-server>: Specify your NTP server (e.g., ntp.esl.cisco.com).
Example Command:
./svm-msc-tz-ntp -tz US/Pacific -ne -ns ntp.esl.cisco.com
svm-msc-tz-ntp: Start
svm-msc-tz-ntp: Executing timedatectl set-timezone US/Pacific
svm-msc-tz-ntp: Executing sed -i 's|^server|\# server|' /etc/ntp.conf
svm-msc-tz-ntp: Executing timedatectl set-ntp true
svm-msc-tz-ntp: Sleeping 10 seconds
svm-msc-tz-ntp: Checking NTP status
svm-msc-tz-ntp: Executing ntpstat;ntpq -p
unsynchronised
polling server every 64 s
remote refid st t when poll reach delay offset jitter
==============================================================================
mtv5-ai27-dcm10 .GNSS. 1 u - 64 1 1.581 -0.002 0.030
4) Verify the NTP configuration by checking the NTP status:
ntpstat; ntpq -p
unsynchronised
polling server every 64 s
remote refid st t when poll reach delay offset jitter
==============================================================================
*mtv5-ai27-dcm10 .GNSS. 1 u 14 64 1 3.522 -0.140 0.128
5) Confirm the date and time:
date
Mon Jul 8 14:19:26 PDT 2019
6) Repeat this NTP configuration procedure on each of the remaining Orchestrator nodes.
Updating DNS for MSO OVA Deployments in VMware ESX
Note: This procedure is only for MSO OVA deployments in VMware ESX. It does not apply to Application Services Engine or Nexus Dashboard deployments.
Procedure to Update DNS
Access the Cluster Node:
SSH into one of the cluster nodes using the root user account.
Update DNS Configuration:
Use the nmcli command to set the DNS server IP address.
Single DNS Server:
nmcli connection modify eth0 ipv4.dns "<dns-server-ip>"
Multiple DNS Servers:
nmcli connection modify eth0 ipv4.dns "<dns-server-ip-1> <dns-server-ip-2>"
Restart the Network Interface:
Apply the DNS changes by restarting the eth0 interface.
nmcli connection down eth0 && nmcli connection up eth0
Reboot the Node:
Restart the node to ensure all changes take effect.
Repeat for Other Nodes:
Perform the same steps on the remaining two cluster nodes.
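After the reboot, you can quickly confirm that the new DNS servers are in effect; this check is optional and not part of the documented procedure:
cat /etc/resolv.conf
nmcli -g ipv4.dns connection show eth0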
Restarting Cluster Nodes
Restarting a Single Node Temporarily Down
Restart the Affected Node:
Simply restart the node that is down.
No additional steps are needed; the cluster will automatically recover.
Restarting Two Nodes Temporarily Down
Back Up MongoDB:
Before attempting recovery, back up the MongoDB to prevent data loss.
Important: Ensure at least two nodes are running to keep the cluster operational.
Restart the Two Affected Nodes:
Restart both nodes that are down.
No additional steps are needed; the cluster will automatically recover.
Backing Up MongoDB for Cisco ACI Multi-Site
Recommendation: Always back up MongoDB before performing any upgrades or downgrades of the Cisco ACI Multi-Site Orchestrator.
Procedure to Back Up MongoDB
Log In to the Orchestrator VM:
Access the Cisco ACI Multi-Site Orchestrator virtual machine.
Run the Backup Script:
Execute the backup script to create a backup file.
~/msc_scripts/msc_db_backup.sh
A backup file named msc_backup_<date+%Y%m%d%H%M>.archive will be created.
Secure the Backup File:
Copy the backup file to a safe location for future use.
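For example, you could copy the archive to a remote backup host over SSH; the host name and destination path below are placeholders:
scp ~/msc_backup_*.archive admin@backup-host:/backups/mso/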
Restoring MongoDB for Cisco ACI Multi-Site
Procedure to Restore MongoDB
Log In to the Orchestrator VM:
Access the Cisco ACI Multi-Site Orchestrator virtual machine.
Transfer the Backup File:
Copy your msc_backup_<date+%Y%m%d%H%M>.archive file to the VM.
Run the Restore Script:
Execute the restore script to restore the database from the backup file.
~/msc_scripts/msc_db_restore.sh
Push the Schemas:
After restoring, push the schemas to ensure everything is up to date.
msc_push_schemas.py
Custom Certificates Troubleshooting
This section describes how to resolve common issues that can occur when using custom SSL certificates with Cisco ACI Multi-Site Orchestrator.
Unable to Load the Orchestrator GUI
If you can't access the Orchestrator GUI after installing a custom certificate, the issue might be due to incorrect certificate placement on the Orchestrator nodes. Follow these steps to recover the default certificates and reinstall the new ones.
Steps to Recover Default Certificates and Reinstall Custom Certificates
Log In to Each Orchestrator Node:
Access each node directly using SSH or your preferred method.
Navigate to the Certificates Directory:
cd /data/msc/secrets
Restore Default Certificates:
Replace the existing certificate files with the backup copies.
mv msc.key_backup msc.key
mv msc.crt_backup msc.crt
Restart the Orchestrator GUI Service:
docker service update msc_ui --force
Reinstall and Activate the New Certificates:
Follow the certificate installation procedure as outlined in previous documentation to ensure the new certificates are correctly applied.
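To confirm that the UI service restarted cleanly after the forced update, you can list its tasks from any node; this is an optional check rather than part of the documented recovery steps:
docker service ps msc_ui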
Adding a New Orchestrator Node to the Cluster
When you add a new node to your Multi-Site Orchestrator cluster, ensure the key is activated to maintain cluster security and functionality.
Steps to Add a New Orchestrator Node
Log In to the Orchestrator GUI:
Access the GUI using your web browser.
Re-activate the Key:
Follow the key activation steps as described in the "Activating Custom Keyring" section to integrate the new node securely into the cluster.
Unable to Install a New Keyring After the Default Keyring Expired
If the default keyring has expired and you can't install a new one, it's likely that the custom keyring wasn't properly installed on the cluster nodes. Follow these steps to delete the old keyring and create a new one.
Steps to Create a New Keyring
Access All Cluster Nodes:
SSH into each node in the cluster.
Remove Old Keyring Files:
cd /data/msc/secrets
rm -rf msc.key msc.crt msc.key_backup msc.crt_backup
Generate a New Keyring:
Create new key and certificate files using OpenSSL.
openssl req -newkey rsa:2048 -nodes -keyout msc.key -x509 -days 365 -out msc.crt -subj '/CN=MSC'
Back Up the New Keyring Files:
cp msc.key msc.key_backup
cp msc.crt msc.crt_backup
Set Proper Permissions:
chmod 777 msc.key msc.key_backup msc.crt msc.crt_backup
Force Update the Orchestrator GUI Service:
docker service update msc_ui --force
Re-install and Activate the New Certificates:
Follow the certificate installation procedure described earlier to apply and activate the new certificates.
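At any point you can also inspect the newly generated certificate's subject and validity dates as a sanity check; this step is optional and not part of the documented procedure:
openssl x509 -in /data/msc/secrets/msc.crt -noout -subject -dates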
Replacing a Single Node of the Cluster with a New Node
If one node (e.g., node1) goes down and you need to replace it with a new node, follow these steps:
Step 1: Identify the Down Node
On any existing node, get the ID of the node that is down by running:
# docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
11624powztg5tl9nlfoubydtp * node2 Ready Active Leader
fsrca74nl7byt5jcv93ndebco node3 Ready Active Reachable
wnfs9oc687vuusbzd3o7idllw node1 Down Active Unreachable
Note the ID of the down node (node1).
Step 2: Demote the Down Node
Demote the down node by running:
docker node demote <node ID>
Replace <node ID> with the ID from Step 1.
Example:
docker node demote wnfs9oc687vuusbzd3o7idllw
You'll see a message: Manager <node ID> demoted in the swarm.
Step 3: Remove the Down Node
Remove the down node from the swarm:
docker node rm <node ID>
Example:
docker node rm wnfs9oc687vuusbzd3o7idllw
Step 4: Navigate to the 'prodha' Directory
On any existing node, change to the prodha directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Replace <build_number> with your actual build number.
Step 5: Obtain the Swarm Join Token
Get the token needed to join the swarm:
docker swarm join-token manager
This command will display a join command containing the token and IP address.
Example Output:
docker swarm join --token SWMTKN-1-... <IP_address>:2377
Note: Copy the entire join command or at least the token and IP address for later use.
Step 6: Note the Leader's IP Address
Identify the leader node by running:
docker node ls
On the leader node, get its IP address:
ifconfig
Look for the inet value under the appropriate network interface (e.g., eth0).
inet 10.23.230.152 netmask 255.255.255.0 broadcast 192.168.99.255
Note: The IP address is 10.23.230.152.
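As an alternative to ifconfig, a manager node can report its advertised swarm address (including the management port) directly; this optional shortcut assumes you run it on the leader node itself:
docker node inspect self --format '{{ .ManagerStatus.Addr }}'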
Step 7: Prepare the New Node
Set the Hostname:
hostnamectl set-hostname <new_node_name>
Replace <new_node_name> with the desired hostname (e.g., node1).
Step 8: Navigate to the 'prodha' Directory on New Node
On the new node, change to the prodha directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Step 9: Join the New Node to the Swarm
Run the join command using the token and leader IP address obtained earlier:
./msc_cfg_join.py <token> <leader_IP_address>
Replace:
<token> with the token from Step 5.
<leader_IP_address> with the IP address from Step 6.
Example:
./msc_cfg_join.py SWMTKN-1-... 10.23.230.152
Step 10: Deploy the Configuration
On any node, navigate to the prodha directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Run the deployment script:
./msc_deploy.py
Result: All services should be up, and the database replicated.
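To verify the recovered cluster, you can check node and service status from any node; all three nodes should show Ready and every service should report its full replica count. This is an optional check, not part of the documented procedure:
docker node ls
docker service ls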
Replacing Two Existing Nodes of the Cluster with New Nodes
If two nodes are down and you need to replace them, follow these steps:
Before You Begin
Important: Since there's a lack of quorum (only one node is up), the Multi-Site Orchestrator won't be available.
Recommendation: Back up the MongoDB database before proceeding.
Step 1: Prepare New Nodes
Bring Up Two New Nodes:
Ensure they are properly set up and connected.
Set Unique Hostnames for Each New Node:
hostnamectl set-hostname <new_node_name>
Repeat for both new nodes, using unique names.
Step 2: Remove Down Nodes from Swarm
SSH into the Only Live Node.
List All Nodes:
docker node ls
Example Output:
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
g3mebdulaed2n0cyywjrtum31 node2 Down Active Reachable
ucgd7mm2e2divnw9kvm4in7r7 node1 Ready Active Leader
zjt4dsodu3bff3ipn0dg5h3po * node3 Down Active Reachable
Remove Nodes with 'Down' Status:
docker node rm <node-id>
Example:
docker node rm g3mebdulaed2n0cyywjrtum31
docker node rm zjt4dsodu3bff3ipn0dg5h3po
Step 3: Re-initialize the Docker Swarm
Leave the Existing Swarm:
docker swarm leave --force
Navigate to the 'prodha' Directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Initialize a New Swarm:
./msc_cfg_init.py
This command will provide a new token and IP address.
Step 4: Join New Nodes to the Swarm
On Each New Node:
SSH into the Node.
Navigate to the 'prodha' Directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Join the Node to the Swarm:
./msc_cfg_join.py <token> <leader_IP_address>
Replace:
<token> with the token from Step 3.
<leader_IP_address> with the IP address of the first node (from Step 3).
Step 5: Deploy the Configuration
On Any Node:
Navigate to the 'prodha' Directory:
cd /opt/cisco/msc/builds/<build_number>/prodha
Run the Deployment Script:
./msc_deploy.py
Result: The new cluster should be operational with all services up.
Relocating Multi-Site Nodes to a Different Subnet
When you need to move one or more Multi-Site nodes from one subnet to another—such as spreading nodes across different data centers—you can follow this simplified procedure.
It's important to relocate one node at a time to maintain redundancy during the migration.
Scenario: Relocating node3 from Data Center 1 (subnet 10.1.1.1/24) to Data Center 2 (subnet 11.1.1.1/24).
Steps:
Demote node3 on node1:
On node1, run:
docker node demote node3
Power Down node3:
Shut down the virtual machine (VM) for node3.
Remove node3 from the Cluster:
On node1, execute:
docker node rm node3
Deploy a New VM for node3 in Data Center 2:
Install the Multi-Site VM (same version as node1 and node2).
Configure it with the new IP settings for the 11.1.1.1/24 subnet.
Set the hostname to node3.
Power Up node3 and Test Connectivity:
Start the new node3 VM.
Verify connectivity to node1 and node2:
ping [node1_IP]
ping [node2_IP]
Obtain the Swarm Join Token from node1:
On node1, get the join token:
docker swarm join-token manager
Note the provided command and token.
Join node3 to the Swarm:
On node3, use the token to join the cluster:
docker swarm join --token [token] [node1_IP]:2377
Verify Cluster Health:
On any node, check the status:
docker node ls
Ensure each node shows:
STATUS: Ready
AVAILABILITY: Active
MANAGER STATUS: One node as Leader, others as Reachable
Update the Swarm Label for node3:
On node1, update the label:
docker node update node3 --label-add msc-node=msc-node3
Check Docker Services Status:
On any node, list services:
docker service ls
Confirm that services are running (e.g., REPLICAS show 1/1 or 3/3).
Wait up to 15 minutes for synchronization if necessary.
Delete the Original node3 VM:
After confirming everything is functioning, remove the old node3 VM from Data Center 1.
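As a final check, you can confirm that the swarm label applied earlier is present on the relocated node; this verification is optional and not part of the documented steps:
docker node inspect node3 --format '{{ .Spec.Labels }}'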
By following these steps, you can successfully relocate a Multi-Site node to a different subnet while maintaining cluster integrity and minimizing downtime.