PBR TROUBLESHOOTING
- Mukesh Chanderia
- Feb 25
Service node – An external device where PBR redirects traffic, like a firewall or load balancer.
Service leaf – An ACI leaf switch that connects to the service node.
Troubleshooting Unmanaged Mode Service Graph with PBR in Cisco ACI
Without PBR Service Graph:
Make sure both consumer and provider endpoints are learned.
Confirm that these endpoints can talk to each other.
Service Graph Deployment:
Ensure that the deployed graph shows no errors.
Check that VLANs and class IDs for the service node are set up.
Verify that the service node endpoints are detected.
Forwarding Path:
Confirm that the policy is correctly applied on the leaf nodes.
Capture traffic on the service node to see if it’s being redirected.
Capture traffic on the ACI leaf to check that traffic returns to the ACI fabric after PBR.
Finally, verify that traffic reaches the consumer and provider endpoints and that return traffic is generated.
Troubleshooting When a Service Graph Is Not Deployed
What is Expected:
After you define and apply a Service Graph policy to a contract, a graph instance should appear in the ACI GUI under 'Tenant > Services > L4-L7 > Deployed Graph Instances'.
Step 1: Check Configuration and Look for Faults
Confirm that these basic configurations are complete and error-free:
VRF and BDs for the consumer EPG, provider EPG, and service node
Consumer and provider EPGs
Contract and filters
Note: You do not need to manually create an EPG for the service node (Firewall or Load Balancer); it is automatically created during Service Graph deployment.
The configuration steps for Service Graph with PBR are:
Create the L4-L7 Device (Logical Device)
Create the Service Graph
Create the PBR policy
Create the Device Selection policy
Associate the Service Graph with the contract subject
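As a quick sanity check that the first two objects exist, you can query their classes from the APIC CLI (a sketch only; the apic1 prompt is illustrative):
apic1# moquery -c vnsLDevVip | grep dn
apic1# moquery -c vnsAbsGraph | grep dn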
Step 2: Check for Service Graph Deployment Issues
In the ACI UI, verify that a deployed graph instance appears after associating the Service Graph with the contract.
The location is 'Tenant > Services > L4-L7 > Deployed Graph Instances'
Common issues if it does not appear:
The contract is missing a consumer or provider EPG
The contract subject has no filters
The contract scope is set to VRF even though it is for inter-VRF or inter-tenant communication
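The same check can be done from the APIC CLI; if the class query below returns nothing, no graph instance was rendered and one of the causes above is likely (prompt illustrative):
apic1# moquery -c vnsGraphInst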
Step 3: Identify Faults in the Deployed Graph Instance
If there are faults, they indicate problems with the Service Graph configuration. Fault F1690 is commonly raised for the issues below; each is listed with its cause and fix:
"Configuration is invalid due to ID allocation failure"
Issue: The encapsulated VLAN for the service node is unavailable (no dynamic VLAN available in the VLAN pool associated to the VMM domain used in the Logical Device).
Fix: Check the VLAN pool for the Logical Device or encapsulated VLAN in AAEP Profile for the physical domain.
"Configuration is invalid due to no device context found for Logical Device"
Issue: The Logical Device cannot be found for Service Graph because no Device Selection Policy matches the contract.
Fix: Verify that the Device Selection Policy is defined correctly.
Tenant > Services > L4-L7 > Device Selection Policy
Fault: "Configuration is invalid due to no cluster interface found"
Issue: The cluster interface for the service node is missing or not specified.
Fix: Ensure the cluster interface and correct connector name are set in the Device Selection Policy.
Fault: "Configuration is invalid due to no BD found"
Issue: The BD (Bridge Domain) for the service node is missing or not specified.
Fix: Confirm that the BD and correct connector name are set in the Device Selection Policy.
Fault: "Configuration is invalid due to invalid service redirect policy"
Issue: The PBR policy is not selected, even though the redirect is enabled on the service function.
Fix: Select the correct PBR policy in the Device Selection Policy.
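To pull all instances of this fault from the APIC CLI in one step, a filtered class query like the following can be used (sketch only):
apic1# moquery -c faultInst -f 'fault.Inst.code=="F1690"'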
Troubleshooting the PBR Forwarding Path
1. Check VLAN Deployment and Endpoint Learning on the Leaf Node
After a Service Graph is successfully deployed without faults, EPGs and BDs for the service node are automatically created.
You can find the encapsulated VLAN IDs and class IDs of service node interfaces in the ACI GUI under:
Tenant > Services > L4-L7 > Deployed Graph instances > Function Nodes
Example:
Consumer side of a firewall: Class ID 16386 with VLAN 1000
Provider side of a firewall: Class ID 49157 with VLAN 1102
Verify that these VLANs are deployed on the service leaf node interfaces using CLI commands like:
show vlan extended
show endpoint vrf Prod:VRF1
If the service node endpoints are not learned, check for these potential issues:
Ensure the service node is connected to the correct leaf downlink port.
For a physical domain: The leaf static path and encap VLAN must be defined in the Logical Device.
For a VMM domain: Confirm the VMM domain is functioning and that the port group is correctly attached to the service node VM.
Verify that the leaf downlink port (or the hypervisor port for a VM) is UP.
Confirm that the service node has the correct VLAN and IP address.
Make sure any intermediate switch between the service leaf and service node has the correct VLAN configuration.
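A minimal set of leaf-side checks for the list above might look like this, reusing the VLAN and service node IP from the earlier example (the port number is illustrative):
Pod1-Leaf1# show interface ethernet 1/10
Pod1-Leaf1# show vlan extended | grep 1000
Pod1-Leaf1# show endpoint vrf Prod:VRF1 | grep 192.168.101.100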
2. Check the Expected Traffic Paths
If end-to-end traffic stops working after PBR is enabled—even when the service node endpoints are learned—review the expected traffic paths.
The endpoints must already be learned on the leaf nodes for traffic paths to work correctly.

Note: The PBR policy is enforced on either the consumer or the provider leaf. ACI PBR rewrites the destination MAC, and traffic sent toward the PBR destination MAC always goes via the spine proxy, even if the source endpoint and the PBR destination are under the same leaf.
The provided figures illustrate examples of where traffic might be redirected. However, the actual location where policy is enforced depends on:
The contract configuration.
The endpoint learning status.
Example scenarios:
External Endpoint in L3Out EPG (VRF1) → Web EPG (VRF1), Ingress Enforcement Mode:
If VRF1 is set to ingress enforcement mode and an external endpoint in L3Out EPG (VRF1) tries to reach a Web EPG endpoint in the same VRF (VRF1), the traffic is redirected by the leaf where the Web EPG endpoint is located.
This happens regardless of the contract direction (which EPG is the consumer and which is the provider).
Consumer Web EPG (VRF1) → Provider App EPG (VRF1), Both Endpoints Learned on Consumer and Provider Leafs:
If a consumer endpoint in Web EPG (VRF1) tries to reach a provider endpoint in App EPG (VRF1), and both endpoints are learned on their respective leaf nodes (consumer leaf and provider leaf), the traffic is redirected by the ingress leaf (the leaf that first receives the traffic).
Consumer Web EPG (VRF1) → Provider App EPG (VRF2):
If a consumer endpoint in Web EPG (VRF1) tries to access a provider endpoint in App EPG (VRF2), the traffic is redirected by the consumer leaf that hosts the consumer endpoint.
This occurs regardless of the VRF enforcement mode setting.
3. Verify Traffic Arrival and Forwarding with ELAM
Once the expected forwarding path is known, ELAM can be used to check whether traffic arrives at the switch nodes and to analyze the forwarding decision made on those nodes.
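For example, on -EX/-FX leaf hardware an ELAM trigger for the consumer-to-provider flow can be set roughly as follows. This is only a sketch: the in-select/out-select values shown are typical for traffic arriving on a downlink port, and the source/destination IPs are illustrative. The ELAM Assistant app provides the same information with less typing.
Pod1-Leaf1# vsh_lc
module-1# debug platform internal tah elam asic 0
module-1(DBG-elam)# trigger reset
module-1(DBG-elam)# trigger init in-select 6 out-select 0
module-1(DBG-elam-insel6)# set outer ipv4 src_ip 192.168.1.10 dst_ip 192.168.2.10
module-1(DBG-elam-insel6)# start
module-1(DBG-elam-insel6)# status
module-1(DBG-elam-insel6)# ereport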
Check the Policies Programmed on Leaf Nodes
If traffic isn't being forwarded or redirected as expected, the next troubleshooting step is to examine the policies programmed on the leaf nodes.
The VRF scope ID can be found under 'Tenant > Networking > VRF'.
Pod1-Leaf1# show zoning-rule scope 2752513
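To focus on the redirect entries in a large zoning-rule table, you can filter the output (the scope ID is the one shown above):
Pod1-Leaf1# show zoning-rule scope 2752513 | grep redir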

Once the Service Graph is deployed, EPGs for the service node are established, and policies are modified to direct traffic between the consumer and provider EPGs.

Pod1-Leaf1# show service redir info

If zoning rules are set up correctly but traffic isn't being redirected or forwarded as intended, consider the following common mistakes:
• Use ELAM to verify that the source and destination class IDs are resolved correctly. If one is not, note the incorrect class ID and review the EPG derivation criteria, such as the path and encap VLAN.
• If the source and destination class IDs resolve correctly and the PBR policy is applied, but traffic does not reach the PBR node, ensure the IP, MAC, and VRF of the destgrp in the redir action ('show service redir info') are correct, as shown below.
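A hedged way to cross-check the destgrp details against what the fabric has actually learned (the group number, IP, and VRF below reuse the earlier example values):
Pod1-Leaf1# show service redir info group 27
Pod1-Leaf1# show system internal epm endpoint ip 192.168.101.100
The MAC and IP in the destgrp output should match the service node interface, and the epm output should show the endpoint learned in the expected VRF and VLAN.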
Contract_parser
The contract_parser tool can also help to verify the policies.
Pod1-Leaf1# contract_parser.py --vrf Prod:VRF1

Running contract_parser.py --vrf Prod:VRF1 on Pod1-Leaf1 displays the contract rules for VRF "Prod:VRF1" in a human-readable form.
Detailed Rule Summaries:
Rule [7:4213]:
VRF: Prod:VRF1
Action: Permit
Traffic: IP TCP traffic
Source: Endpoint group "C-consumer" (with identifier 16386) from the ASAv-VM1 context.
Destination: Endpoint group "epg-Web" (with identifier 32772).
Port: Traffic on port 80.
Contract: uni/tn-Prod/brc-web-to-app.
Usage: This rule has not been hit (hit count = 0).
Rule [7:4237]:
VRF: Prod:VRF1
Action: Redirect
Traffic: IP TCP traffic
Source: Endpoint group "epg-Web" (with identifier 32772).
Destination: Redirects to endpoint group "epg-App" (with identifier 32773).
Port: Traffic on port 80.
Contract: uni/tn-Prod/brc-web-to-app.
Usage: This rule has not been hit (hit count = 0).
Additional Info: The destination group labeled "destgrp-27" is associated with VRF Prod:VRF1, has IP 192.168.101.100, MAC 00:50:56:AF:3C:60, and belongs to the bridge domain "uni/tn-Prod/BD-Service-BD1".
Rule [7:4172]:
VRF: Prod:VRF1
Action: Redirect
Traffic: IP TCP traffic
Source: Endpoint group "epg-App" (with identifier 32773) for traffic on port 80.
Destination: Redirects to endpoint group "epg-Web" (with identifier 32772).
Contract: uni/tn-Prod/brc-web-to-app.
Usage: This rule has not been hit (hit count = 0).
Additional Info: The destination group labeled "destgrp-28" is associated with VRF Prod:VRF1, has IP 192.168.102.100, MAC 00:50:56:AF:1C:44, and belongs to the bridge domain "uni/tn-Prod/BD-Service-BD2".
Rule [9:4249]:
VRF: Prod:VRF1
Action: Permit
Traffic: Any protocol traffic
Source: Endpoint group "C-provider" (with identifier 49157) from the ASAv-VM1 context.
Destination: Endpoint group "epg-App" (with identifier 32773).
Contract: uni/tn-Prod/brc-web-to-app.
Usage: This rule has been hit 15 times.
Other traffic flow examples
1. Load balancer without SNAT
Example Traffic Flow (Unidirectional PBR with Load Balancer Integration)
Incoming Flow (Consumer → Load Balancer → Provider)
A consumer endpoint (in the Web EPG) sends traffic to the load balancer’s virtual IP (VIP).
Since the VIP is reachable, the traffic does not need PBR to get to the load balancer.
The load balancer changes only the destination IP to an App EPG endpoint (it does not change the source IP).
Traffic then proceeds to the provider endpoint in the App EPG.
Return Flow (Provider → Load Balancer → Consumer)
The provider endpoint responds to the original source IP (the consumer’s IP).
Without PBR, the provider endpoint’s real IP would appear as the traffic’s source. The consumer would drop this traffic because it never initiated a session with that IP.
With unidirectional PBR, the return traffic is redirected first to the load balancer.
The load balancer changes the source IP to its VIP, preserving session consistency.
Traffic is then forwarded back to the consumer endpoint, which sees the source as the VIP and accepts the response.


The diagram and the "show zoning-rule" output explain the zoning rules after the Service Graph is set up.
In this example:
Traffic from the Web (pcTag 32772) to the Service Load Balancer (pcTag 16389) is allowed.
Traffic from the Service Load Balancer (pcTag 16389) to the App (pcTag 32773) is allowed.
Traffic from the App (pcTag 32773) to the Web (pcTag 32772) is sent to the load balancer (destgrp-31).

By default, there is no rule that allows traffic from the provider App (pcTag 32773) to the Service Load Balancer (pcTag 16389).
To allow two-way communication between them (needed, for example, for load balancer health checks), set the "Direct Connect" option to True. You can find this setting at:
Tenant > Services > L4-L7 > Service Graph Templates > Policy
The default setting for "Direct Connect" is False.
This adds a permit rule from the provider EPG (32773) to Service-LB (16389), as shown below.
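A quick, hedged way to confirm the extra permit rule on the leaf is to filter the zoning rules for the Service-LB pcTag (scope ID reused from the earlier example):
Pod1-Leaf1# show zoning-rule scope 2752513 | grep 16389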

2. Traffic flow example - Firewall and load balancer without SNAT
Service Graph Setup:
A firewall is the first node and a load balancer is the second node.
This setup uses Policy Based Redirect (PBR) without using Source NAT (SNAT).
Incoming Traffic Flow:
Traffic originates from a consumer endpoint in the Web group and is destined for the provider App group.
The traffic has two legs:
First Leg:
The traffic is sent to a Virtual IP (VIP) on the load balancer.
Before reaching the load balancer, the traffic is redirected to the firewall.
Second Leg:
After passing through the firewall, the traffic goes to the load balancer.
The load balancer then changes the destination IP to one of the endpoints in the App group without altering the source IP.
Finally, the traffic reaches the selected provider endpoint.
Return Traffic Flow:
The reply from the provider endpoint is addressed to the original source IP from the consumer endpoint.
PBR is applied to ensure the return traffic is directed to the load balancer.
The load balancer then changes the source IP to the VIP.
The traffic is sent back to the firewall, which finally forwards it to the consumer endpoint.
-----------------------------------------------------------------------------------------------------
Troubleshooting Cisco ACI Policy-Based Redirect (PBR) Issues
Section 1: General PBR Troubleshooting Steps (Applicable to All Deployments)
Verify L4-L7 Service Device Health and Configuration:
Reachability: Can the APIC and relevant Leaf switches ping the management and data interfaces of the service device (e.g., firewall, load balancer)? (See the iping sketch after this list.)
Health: Check the service device's health status (CPU, memory, interface status).
Interface Configuration: Ensure interfaces on the service device connected to ACI Leafs are correctly configured (IP addresses, VLANs/VXLANs, VRF membership if applicable, static routes if needed).
Routing: Confirm the service device has appropriate routes back to the source and destination subnets it's servicing. Does it know how to return traffic to the ACI fabric?
Device Policies: Check policies on the service device itself (e.g., firewall rules allowing traffic, load balancer virtual server status).
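For the reachability item above, a simple data-plane check from the service leaf toward the service device interface can be done with iping (the VRF and IP reuse the earlier example values):
Pod1-Leaf1# iping -V Prod:VRF1 192.168.101.100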
Verify ACI L4-L7 Device Configuration:
(APIC GUI: Tenant > Services > L4-L7 > Devices)
Health: Check the health status of the L4-L7 device object in APIC. Are health checks passing?
Credentials: Verify device manager credentials if used.
Interfaces: Ensure the interfaces defined in ACI match the physical connections and configurations on the service device (Path, VLAN/VXLAN encapsulation).
Function Profile (Go-To/Go-Through): Verify the mode matches the device's role (Routed/Bridged).
Verify Service Graph Template Configuration:
(APIC GUI: Tenant > Services > L4-L7 > Service Graph Templates)
Device Selection: Ensure the correct L4-L7 device is selected within the graph node.
Function Node Connectors: Verify connector configuration (BDs, Subnets for PBR policy). Are the correct consumer and provider connectors linked?
PBR Policy:
(APIC GUI: Tenant > Policies > Protocol > L4-L7 Policy Based Redirect)
Destination Check: Double-check the IP address, MAC address, and L3 Destination configuration for the redirect destinations (service device interfaces). Ensure the MAC is correct for the service IP. (A moquery sketch follows this list.)
Thresholds: Check health check thresholds if configured.
Resiliency: Verify settings for PBR resiliency (e.g., backup destinations).
Symmetric PBR: Is Symmetric PBR required and enabled/disabled correctly based on traffic flow needs?
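The configured PBR destinations can also be dumped from the APIC CLI to verify the IP/MAC pairs in one place (a sketch using the standard PBR object classes):
apic1# moquery -c vnsSvcRedirectPol
apic1# moquery -c vnsRedirectDest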
Verify Service Graph Application and Contracts:
(APIC GUI: Tenant > Application Profiles > Application EPGs > Provided/Consumed Contracts)
Deployment: Is the Service Graph Template correctly applied to the Contract between the relevant Consumer and Provider EPGs?
Contract Scope: Ensure the Contract scope (VRF, Tenant, Application Profile, Global) is appropriate.
Filters: Verify the Contract Filters match the traffic intended for redirection. Mismatched filters will prevent traffic from matching the contract and thus the service graph.
EPGs: Confirm the source and destination endpoints are members of the correct Consumer and Provider EPGs associated with the contract.
Verify Fabric Forwarding and Basic Connectivity:
EPG Communication (No PBR): Temporarily remove the Service Graph from the contract. Can the Consumer and Provider EPGs communicate directly? This isolates the issue to the PBR configuration itself.
BD Configuration: Check the Bridge Domain configuration associated with the EPGs and the service device connectors. Is the subnet defined? Is unicast routing enabled? Are ARP flooding/GARP enabled if needed?
VRF Configuration: Verify the VRF configuration, including policy control enforcement direction/preference.
Check Operational State and Health Scores:
(APIC GUI: Tenant > Services > L4-L7 > Deployed Graph Instances)
Check the status and health of the deployed graph instance. Look for faults.
(APIC GUI: Tenant > Services > L4-L7 > Deployed Devices)
Verify the operational state of the deployed service device cluster/node.
EPG Operational Tab: Check for faults or misconfigurations related to the EPGs involved.
Endpoint Learning: Are the source/destination endpoints learned correctly in the fabric within their respective EPGs/BDs? Use the Endpoint Tracker tool.
(APIC GUI: Fabric > Inventory > Select Leaf > Operational > Endpoint Tracker)
Utilize ACI Troubleshooting Tools:
Visore (Managed Object Browser): Query specific Managed Objects (MOs) related to the PBR policy (e.g., vnsSvcRedirectPol and its vnsRedirectDest children), the service graph (vnsAbsGraph, vnsCDev, vnsRsRedirectHealthGroup), and contracts to check configuration and operational state details.
Access it at https://<APIC_IP>/visore.html.
CLI Commands (APIC & Switches):
Check for faults related to the tenant, service graph, or L4-L7 device:
moquery -c faultInst (on APIC)
Check PBR policies programmed in hardware:
show service redir info (on relevant Leaf switches)
Check contract rules programmed:
show zoning-rule (on relevant Leaf switches)
Verify endpoint location and EPG:
show endpoint ip <ip> or show endpoint mac <mac> (on relevant Leaf switches)
Check VRF routing tables:
show ip route vrf <tenant:vrf> (on relevant Leaf switches)
Check the COOP database for endpoint info:
show coop internal info ip-db detail <ip> (on Spines)
Traffic Map/Endpoint Tracker: Visualize traffic flow and endpoint locations.
SPAN/ERSPAN: Configure SPAN sessions on Leaf switches connected to the service device or endpoints to capture traffic and analyze if it's being correctly redirected and returned.
Log Analysis:
APIC Events/Faults: Review logs on the APIC for relevant events or faults.
Switch Logs: Check logs on the relevant Leaf switches (show logging logfile).
L4-L7 Device Logs: Crucial for understanding how the service device is processing (or dropping) the redirected traffic.
Section 2: Single-Pod PBR Troubleshooting
In addition to the general steps:
Focus: Issues are typically localized to the configuration within the single Pod, the Leafs connected to the service device, or the service device itself.
Leaf Programming: Pay close attention to the show service redir info output on the consumer and provider Leaf switches. Is the policy correctly programmed with the right destination MAC/IP?
VLAN/VXLAN Encapsulation: Verify the VLAN or VXLAN Network Identifier (VNID) programmed on the Leaf ports connected to the service device matches the L4-L7 Device configuration in APIC and the configuration on the service device itself.
Spine Proxy: If the consumer and provider are on different Leafs, traffic goes via the Spine. Ensure the Spine proxy behaviour for the BD/VRF is correct.
Section 3: Multi-Pod PBR Troubleshooting
In addition to the general steps:
Inter-Pod Network (IPN):
Connectivity: Verify full reachability between Spine switches across all Pods over the IPN.
MTU: Ensure the MTU is configured consistently and sufficiently large across the IPN devices to accommodate ACI's VXLAN overhead. MTU mismatches are a common cause of failures.
QoS: Check QoS policies on the IPN to ensure control plane traffic (COOP, MP-BGP) isn't being dropped.
Configuration Consistency: Ensure the Tenant, VRF, BDs, EPGs, Contracts, and Service Graph configurations are consistent across the Pods where redirection is expected.
COOP Synchronization: Verify that the PBR destination IP/MAC information is correctly synchronized via the Council of Oracle Protocol (COOP) database across the Spines in different Pods. Use show coop internal info ip-db detail <service_ip> on Spines in multiple Pods.
L4-L7 Device Location & Reachability:
If the service device is in one Pod (e.g., Pod1), endpoints in another Pod (e.g., Pod2) needing redirection must have a path to it.
Ensure the service device's BD/VRF is extended across Pods or that appropriate L3Outs and route leaking are configured.
PBR Policy Locality: Understand if the PBR policy is intended to redirect traffic locally within a Pod or if cross-Pod redirection is expected. Configuration might differ slightly (e.g., use of Preferred Group for BDs).
MP-BGP EVPN: Verify the MP-BGP EVPN sessions between Spine switches across Pods are established and exchanging VRF/BD route information correctly.
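For the MP-BGP EVPN check above, the session state can be verified on a spine like this (prompt illustrative):
Spine1# show bgp l2vpn evpn summary vrf overlay-1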
Section 4: Multi-Site PBR Troubleshooting
In addition to the general and Multi-Pod steps:
Inter-Site Network (ISN):
Connectivity & MTU: Similar to IPN, verify robust connectivity and consistent, sufficient MTU across the ISN connecting the different ACI Sites.
BGP EVPN: Ensure BGP EVPN sessions are up between the Spine switches involved in Multi-Site connectivity (often dedicated border Spines).
Multi-Site Orchestrator (MSO):
Configuration Consistency: Verify the Schema and Templates defining the Tenant, VRF, BDs, Service Graph, Contracts, etc., are correctly configured in MSO and successfully deployed to all relevant Sites. Check deployment status in MSO.
Stretched Objects: Ensure the VRF, BDs, and EPGs involved are correctly stretched across Sites as intended in the MSO Schema.
Service VRF/BD Stretching: How is the VRF/BD associated with the service device handled across sites? Is it stretched? Is route leaking configured correctly between sites?
L4-L7 Device Location & Reachability:
Where is the service device located (Site1, Site2, dedicated service site)?
Endpoints in one site needing redirection to a service device in another site require proper inter-site L3 connectivity for the service VRF/BD.
Verify routing information exchange (BGP EVPN) between sites for the service subnets.
Traffic Symmetry: Multi-Site PBR often requires careful consideration of traffic symmetry. Is Symmetric PBR enabled? How is return traffic routed back across the ISN to the originating site/endpoint? Asymmetric paths can break PBR.
Endpoint Learning Across Sites: Verify endpoints are learned correctly in their local site and that this information is propagated via BGP EVPN to remote sites. Check show bgp l2vpn evpn <route> on Spines.
COOP/BGP Interaction: Understand how COOP within a site interacts with BGP EVPN for inter-site communication regarding endpoint and service reachability.
Section 5: Common Issues and Resolutions
Incorrect PBR Destination: MAC/IP address in the PBR policy doesn't match the actual service device interface. -> Resolution: Correct the PBR policy destination details.
Health Check Failures: The APIC marks the PBR destination down. -> Resolution: Troubleshoot the L4-L7 device itself or the health check configuration (method, interval, timeouts). Check connectivity between Leaf and service device health check IP.
Incorrect VRF/BD: Service device interfaces placed in the wrong BD or VRF, or the BD/VRF lacks the necessary configuration (subnet, routing enabled). -> Resolution: Correct the L4-L7 Device/Service Graph connector configuration.
Contract/Filter Issues: Traffic doesn't match the contract filter, or the contract isn't applied correctly. -> Resolution: Verify filter entries and contract application between EPGs.
Asymmetric Routing: Traffic is redirected correctly, but return traffic bypasses the service device. -> Resolution: Enable Symmetric PBR in the Service Graph or fix the underlying routing issue causing asymmetry.
MTU Mismatches: Especially common in Multi-Pod/Multi-Site over IPN/ISN. -> Resolution: Ensure consistent MTU end-to-end, accounting for VXLAN overhead.
Software Defects: Check Cisco ACI release notes for known PBR-related bugs. -> Resolution: Upgrade to a fixed software version if applicable.
Licensing: Ensure necessary L4-L7 Service licenses are installed and valid on the APIC.