Manual Troubleshooting (Common Issues)

**Please first confirm the features supported by each multi-nic-cni release version from here.**

Issues

Issues commonly occur at three stages: pod creation, simple ICMP (ping) communication, and TCP/UDP communication. The most complicated stage is pod creation.

Before starting to troubleshoot, set the following common variables for simpler reference.

export FAILED_POD= # pod that fails to run
export FAILED_POD_NAMESPACE= # namespace where the failed pod is supposed to run
export FAILED_NODE= # node where pod is deployed
export FAILED_NODE_IP= # IP of FAILED_NODE
export MULTI_NIC_NAMESPACE= # namespace where multi-nic cni operator is deployed, default=multi-nic-cni-operator

Multi-NIC CNI Controller gets OOMKilled

This is an expected issue in a large cluster, where the controller requires a large amount of memory to operate. Please adjust the resource limit in the controller deployment. If the operator was installed via Operator Hub or an operator bundle, please check the steps to modify the deployment in Customize Multi-NIC CNI controller of operator.
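
A minimal sketch for a non-OLM install (the deployment and container names are taken from commands later in this document; the 2Gi limit is illustrative). For Operator Hub or bundle installs, this change will be overridden by OLM, so edit the CSV instead (see Customize Multi-NIC CNI controller of operator):

# raise the memory limit of the manager container via a strategic merge patch
kubectl -n $MULTI_NIC_NAMESPACE patch deployment multi-nic-cni-operator-controller-manager \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"2Gi"}}}]}}}}'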

Pod failed to start

Issue: The pod stays pending in ContainerCreating status. Get more information with kubectl describe:

kubectl describe po $FAILED_POD -n $FAILED_POD_NAMESPACE

Find the keywords from the FailedCreatePodSandBox events; they appear as the column headers of the summary table below.
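
To pull just those events directly, a convenience sketch using standard kubectl event field selectors:

kubectl get events -n $FAILED_POD_NAMESPACE \
  --field-selector involvedObject.name=$FAILED_POD,reason=FailedCreatePodSandBox \
  -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'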

Pod failed to start (Summary Table)

For those who are familiar with the action commands (e.g., list multinic CRs, list daemon pods), you may troubleshoot with the summary table:

  • Investigate source of issue from top to bottom
  • X refers to no relevance
  • If the issue cannot be solved by configuration (multinicnetwork, annotation, host network, config.multinic) or by the latest patch of the controller and multi-nicd, please report the issue with the corresponding log.
  • * The fixed bugs in the CNI binary require a node restart to take effect.
| Potential source of Issue | Network not found | CNI binary not found | IPAM ExecAdd: failed /<br>IPAM plugin returned missing IP config | zero config | Fail execPlugin |
|---|---|---|---|---|---|
| multinicnetwork definition/annotation | - annotation missing/mismatch<br>- multinicnetwork wrongly configured | X | - IPAM wrongly configured<br>- masters missing in multinicnetwork spec (> 1 multinicnetwork) | non-IP host:<br>- no master name provided via multi-config or annotation | X |
| host network | X | X | X | L3:<br>- daemon communication blocked<br>All:<br>- interface missing | X |
| controller | - net-attach-def not created | - daemon not created due to wrong configuration (config.multinic) | L3:<br>- daemon/hostinterface not created<br>- CIDR/IPPool not created/unsynced | X | X |
| daemon (multi-nicd) | X | X | L3:<br>- failed to discover hostinterface<br>- IP limit reached<br>All cases:<br>- hang on non-responding API server (should be fixed by #172) | X | X |
| main CNI binary (multi-nic) | X | X | - *failed to clean up previous pod network (should be fixed by #165) | host-device:<br>- *failed to clean up previous pod network (should be fixed by #152) | X |
| ipam CNI binary (multi-nic-ipam) | X | X | - *failed to clean up previous IP allocation (should be fixed by #104) | X | X |
| 3rd-party CNI binary | X | - binary missing | - 3rd-party IPAM failure | X | - 3rd-party main plugin failure |

Network not found

kubectl get multinicnetwork # multinicnetwork resource created
kubectl get po $FAILED_POD -n $FAILED_POD_NAMESPACE -oyaml|grep "k8s.v1.cni.cncf.io/networks" # pod annotation matched
kubectl get net-attach-def # network-attachment-definition created

If net-attach-def is missing (No resources found in default namespace), check the controller log to see whether the failure comes from a misconfiguration in multinicnetwork (Marshal failure) or from the network-attachment-definition creation request to the API server.
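
A quick filter on the controller log (a sketch; the Marshal keyword comes from the failure mentioned above, and other error messages may differ):

kubectl logs --selector control-plane=controller-manager \
  -n $MULTI_NIC_NAMESPACE -c manager | grep -iE "marshal|error"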

CNI binary not found

The CNI binary file is not in the expected location read by Multus. The expected location can be found in the Multus daemonset as below.

kubectl get ds $(kubectl get ds -A\
|grep multus|head -n 1|awk '{printf "%s -n %s", $2, $1}')  -ojson\
|jq .spec.template.spec.volumes

Example output:

[
...
  {
    "hostPath": {
      "path": "/var/lib/cni/bin",
      "type": ""
    },
    "name": "cnibin"
  },
...
]

The expected location is in hostPath of cnibin.

  • missing multi-nic/multi-nic-ipam CNI

The CNI directory is probably mounted to a wrong location in the configs.multinic.fms.io CR. Modify the mount path (hostpath attribute) of cnibin in spec.daemon.mounts to the target location above (see the sketch after this list).

kubectl edit config.multinic multi-nicd -n $MULTI_NIC_NAMESPACE

  • missing other CNI such as ipvlan

    The missing CNI may not be supported.
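
A sketch of the relevant part of config.multinic/multi-nicd; only the hostpath of the cnibin mount normally needs to change (the podpath shown is illustrative, keep your existing value):

# kubectl edit config.multinic multi-nicd -n $MULTI_NIC_NAMESPACE
spec:
  daemon:
    mounts:
    - name: cnibin
      podpath: /host/cni/bin      # illustrative; keep the existing podpath
      hostpath: /var/lib/cni/bin  # set to the hostPath reported by the Multus daemonset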

IPAM ExecAdd: failed

This error occurs when the CNI cannot execute the Multi-NIC IPAM plugin, which can be caused by several reasons as follows.

  • failed to load netconf

    The configuration cannot be loaded. This is a delegated CNI (such as IPVLAN) issue. Find more details in the CNI log.

  • "netx": address already in use

    There are a couple of reasons for this issue, such as the IPPool being unsynced due to unexpected removal (e.g., from operator reinstallation) or modification of the IPPool resource while some assigned pods are still running, so the IP address was previously assigned to other pods.

    This should be handled by this commit, which tries assigning the next available address to prevent repeatedly failing on the same already-in-use IP address. Try updating to the latest daemon image.

  • failed to request ip Response nothing

  • other CNI plugin (such as aws-vpc-cni, sr-iov) failure: check each CNI's log.

    • aws-vpc-cni: /host/var/log/aws-routed-eni

HostInterface not created

There are a couple of reasons why the HostInterface is not created. First, check the multi-nicd DaemonSet.

kubectl get ds multi-nicd -n $MULTI_NIC_NAMESPACE -oyaml

  • daemonsets.apps "multi-nicd" not found

    • Check whether a config.multinic.fms.io is deployed in the cluster.

      kubectl get config.multinic multi-nicd -n $MULTI_NIC_NAMESPACE
      

      If no config.multinic.fms.io is deployed, see Deploy multi-nicd config.

    • The node has a taint that the daemon does not tolerate.

      kubectl get nodes $FAILED_NODE -o json|jq -r .spec.taints
      

      To tolerate the taint, add the toleration manually to the multi-nicd DaemonSet (see the sketch after this list).

      kubectl edit ds multi-nicd -n $MULTI_NIC_NAMESPACE
      
  • For other cases, check the controller log.
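
For the taint case, a toleration sketch to add under .spec.template.spec of the multi-nicd DaemonSet; the key and effect below are placeholders, copy them from the taint reported on $FAILED_NODE:

tolerations:
- key: "example-taint-key"    # placeholder: use the taint key from the node
  operator: "Exists"
  effect: "NoSchedule"        # match the taint's effect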

No secondary interfaces in HostInterface

The HostInterface is created but there is no interface listed in the custom resource.

Check whether the controller can communicate with multi-nicd:

kubectl logs --selector control-plane=controller-manager \
  -n $MULTI_NIC_NAMESPACE -c manager| grep Join| grep $FAILED_NODE_IP

No available IP address

List the corresponding pod CIDR from the HostInterface.

kubectl get HostInterface $FAILED_NODE -oyaml

Check the ippools.multinic.fms.io of the corresponding pod CIDR to see whether the IP addresses actually reach the limit. If so, consider changing the host block and interface block in multinicnetworks.multinic.fms.io.
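
A quick way to count how many addresses are already allocated in each pool (a sketch; it assumes the IPPool CR exposes an allocations list in its spec, which may differ across versions):

kubectl get ippools.multinic.fms.io -o json \
  | jq -r '.items[] | "\(.metadata.name) allocated=\(.spec.allocations | length)"'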

IPAM plugin returned missing IP config

No IP address was set by the multi-nic-type IPAM plugin, and no error was thrown. To troubleshoot, we need additional information from the IPAM CNI log (see Get CNI log).

Zero config

Zero config occurs when the CNI cannot generate configurations from the network-attachment-definition. To troubleshoot, we need additional information from the CNI log (see Get CNI log).

Ping failed

Issue: Pods cannot ping each other.

Check route status in multinicnetworks.multinic.fms.io.

kubectl get multinicnetwork.multinic.fms.io multinic-ipvlanl3 -o json \
| jq -r .status.routeStatus

  • WaitForRoutes: the new cidr is just recomputed and waiting for route update.
  • Failed: some routes cannot be applied and need attention. Check the multi-nicd log.
  • Unknown: some daemons cannot be connected.
  • N/A: there is no L3 configuration applied. Check whether multinicnetwork.multinic.fms.io is defined with L3 mode and cidrs.multinic.fms.io is created.

    kubectl get cidrs.multinic.fms.io

  • Success: check that the required security group rules are set (see Set security groups).

TCP/UDP communication failed

Issue: Pods can ping each other but do not get response from TCP/UDP communication such as iPerf.

Check whether the multi-nicd detects the other host interfaces.

kubectl get po $(kubectl get po -owide -A|grep multi-nicd\
   |grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}') -o json\
   |jq -r .metadata.labels

The number in the multi-nicd-join label should be equal to the accumulated number of interfaces from all hosts in the same zone.

Check whether the host secondary interfaces between hosts are connected (see Check host secondary interfaces; a minimal connectivity sketch follows). If they are, try restarting the multi-nic-cni controller to forcefully synchronize the host interfaces (see Restart controller).
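
A minimal host-to-host TCP connectivity sketch over the secondary interfaces (assumes a debug shell on each node and that nc is available; nc option syntax may vary by variant, and port 5201 is arbitrary):

# on host A (debug shell): listen on an arbitrary TCP port
nc -l 5201
# on host B: connect to host A's secondary-interface IP (taken from its HostInterface CR)
nc -zv <hostA-secondary-ip> 5201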

Actions

Available configurations on config.multinic/multi-nicd:

Controller configuration

The following controller configuration values are applied on the fly (no need to restart the controller pod).

| Configuration | Description | Default Value |
|---|---|---|
| .spec.logLevel | controller's verbose log level | 4 |
| .spec.urgentReconcileSeconds | time (in seconds) to requeue reconcile after an instant failure | 5 seconds |
| .spec.normalReconcileMinutes | time (in minutes) to requeue reconcile while waiting for initial configuration | 1 minute |
| .spec.longReconcileMinutes | time (in minutes) to requeue reconcile when sensing control traffic failure | 10 minutes |
| .spec.contextTimeoutMinutes | timeout (in minutes) for API server call context | 2 minutes |
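
For example, a sketch of overriding a couple of these values in the config.multinic/multi-nicd CR (field names come from the table above; the values are illustrative):

# kubectl edit config.multinic multi-nicd -n $MULTI_NIC_NAMESPACE
spec:
  logLevel: 7                  # most verbose, see Log Levels below
  urgentReconcileSeconds: 5
  contextTimeoutMinutes: 2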

Log Levels

| Verbose Level | Information |
|---|---|
| 1 | - critical error (cannot create/update resource by k8s API)<br>- "Set Config" key<br>- set up log<br>- config error |
| 2 | - significant events/failures of multinicnetwork |
| 3 | - significant events/failures of cidr |
| 4 (default) | - significant events/failures of hostinterface |
| 5 | - significant events/failures of ippools |
| 6 | - significant events/failures of route configurations |
| 7 | - requeue<br>- get deleted resource<br>- debug pointers (e.g., start point of function call) |

Daemon configuration

| Configuration | Description | Type | Default Value |
|---|---|---|---|
| .spec.daemon.port | multi-nicd serving port | int | 11000 |
| .spec.daemon.mounts | additional host-path mount | HostPathMount (see below) | - |
# HostPathMount
mounts:
- name: mountName
  podpath: path/on/pod
  hostpath: path/on/host

Additionally, the following common apps/DaemonSet configurations are also available under .spec.daemon: nodeSelector, image, imagePullSecret, imagePullPolicy, securityContext, env, envFrom, resources, and tolerations.
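
For instance, a sketch combining a few of these fields in config.multinic/multi-nicd (all values are illustrative):

spec:
  daemon:
    port: 11000
    imagePullPolicy: Always
    resources:
      limits:
        memory: 512Mi
    tolerations:
    - operator: Exists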

List in-use pods

Replace '< MULTINICNETWORK NAME HERE >' in the following command with your target multinicnetwork name.

kubectl get po -A -ojson| jq -r '.items[]|select(.metadata.annotations."k8s.v1.cni.cncf.io/networks"=="< MULTINICNETWORK NAME HERE >")|.metadata.namespace + " " + .metadata.name'

Get CNI log (available after v1.0.3)

To make the CNI log available on the daemon pod, you may mount the host log path to the daemon pod:

  • Run
kubectl edit config.multinic multi-nicd -n $MULTI_NIC_NAMESPACE
  • Add the following mount items
# config/multi-nicd
spec:
  daemon:
    mounts:
    ...
    - hostpath: /var/log/multi-nic-cni.log
      name: cni-log
      podpath: /host/var/log/multi-nic-cni.log
    - hostpath: /var/log/multi-nic-ipam.log
      name: ipam-log
      podpath: /host/var/log/multi-nic-ipam.log
    # For the AWS-IPVLAN main plugin log, also add the following lines:
    # - hostpath: /var/log/multi-nic-aws-ipvlan.log
    #   name: aws-ipvlan-log      # use a distinct mount name
    #   podpath: /host/var/log/multi-nic-aws-ipvlan.log

Then, you can get CNI log from the following commands:

# default main plugin
kubectl exec $(kubectl get po -owide -A|grep multi-nicd\
|grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}')\
-- cat /host/var/log/multi-nic-cni.log

# multi-nic on aws main plugin
kubectl exec $(kubectl get po -owide -A|grep multi-nicd\
|grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}')\
-- cat /host/var/log/multi-nic-aws-ipvlan.log

# IPAM plugin
kubectl exec $(kubectl get po -owide -A|grep multi-nicd\
|grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}')\
-- cat /host/var/log/multi-nic-ipam.log

Get Controller log

kubectl logs --selector control-plane=controller-manager \
-n $MULTI_NIC_NAMESPACE -c manager

Get multi-nicd log

kubectl logs $(kubectl get po -owide -A|grep multi-nicd\
|grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}')

Deploy multi-nicd config

Restarting the controller pod should create the multi-nicd config automatically.

kubectl delete po --selector control-plane=controller-manager \
-n $MULTI_NIC_NAMESPACE

If not, update the controller to the latest image and restart it (recommended). Otherwise, deploy the config manually:

kubectl create -f https://raw.githubusercontent.com/foundation-model-stack/multi-nic-cni/main/config/samples/config.yaml \
-n $MULTI_NIC_NAMESPACE

Set security groups

The following security group rules must be opened for Multi-NIC CNI.

  1. outbound/inbound communication within the same security group
  2. outbound/inbound communication of Pod networks
  3. inbound multi-nicd serving TCP port (default: 11000)

Add secondary interfaces

  • Prepare secondary subnets with the required security group rules and enable multiple source IPs from a single vNIC (e.g., enable IP spoofing on IBM Cloud)
  • Attach the secondary subnets to the instance
    • manual attachment: follow the cloud provider's instructions
    • by machine-api-operator: update the image of the provider's machine API controller to support secondary interfaces in the provider spec.

      Check the example commit in the modified controller.

Restart controller

kubectl delete po --selector control-plane=controller-manager \
-n $MULTI_NIC_NAMESPACE

Restart multi-nicd

kubectl delete po $(kubectl get po -owide -A|grep multi-nicd\
|grep $FAILED_NODE|awk '{printf "%s -n %s", $2, $1}')

Check host secondary interfaces

Log in to FAILED_NODE with oc debug node/$FAILED_NODE or using nettools with hostNetwork: true. If secondary interfaces do not exist on the host network, add the secondary interfaces (see Add secondary interfaces).
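
For example, to list the node's interfaces (a sketch using the oc debug approach mentioned above):

oc debug node/$FAILED_NODE -- chroot /host ip -br addr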

Update daemon pod to use latest version

  1. Check whether the image in use is set to the latest version tag with imagePullPolicy: Always

    kubectl get daemonset multi-nicd -o yaml -n $MULTI_NIC_NAMESPACE|grep image
    

    If not, modify the image with the latest version tag and change imagePullPolicy to Always

    kubectl edit daemonset multi-nicd -n $MULTI_NIC_NAMESPACE
    
  2. Delete the current multi-nicd pods with selector

    kubectl delete po --selector app=multi-nicd -n $MULTI_NIC_NAMESPACE
    
  3. Check readiness

    kubectl get po --selector app=multi-nicd -n $MULTI_NIC_NAMESPACE
    

Update controller to use latest version

  1. Check whether the image in use is set to the latest version tag with imagePullPolicy: Always

    kubectl get deploy multi-nic-cni-operator-controller-manager -o yaml -n $MULTI_NIC_NAMESPACE|grep multi-nic-cni-controller -A 2|grep image
    

    If not, modify the image with the latest version tag and change imagePullPolicy to Always

    kubectl edit deploy multi-nic-cni-operator-controller-manager -n $MULTI_NIC_NAMESPACE
    
  2. Delete the current controller pod with the selector

    kubectl delete po --selector control-plane=controller-manager -n $MULTI_NIC_NAMESPACE
    
  3. Check readiness

    kubectl get po --selector control-plane=controller-manager -n $MULTI_NIC_NAMESPACE
    

Safe upgrade Multi-NIC CNI operator

  • For bundle versions on Operator Hub before v1.0.2

    There are three significant changes:

    • Change the API group from net.cogadvisor.io to multinic.fms.io. To check the API group:

      kubectl get crd|grep multinicnetworks
      multinicnetworks.multinic.fms.io                                  2022-09-27T08:47:35Z
      
    • Change the route configuration logic to handle fault tolerance issues. To check the route configuration logic, run ip rule on any worker host via oc debug node or using nettools with hostNetwork: true.

      > ip rule
      0:  from all lookup local
      32765:  from 192.168.0.0/16 lookup multinic-ipvlanl3
      32766:  from all lookup main
      32767:  from all lookup default
      

      If it shows rules similar to the above, the route configuration logic is up to date.

    • Change the multinicnetwork CR to show routeStatus. To check the routeStatus key in the multinicnetwork CR:

      kubectl get multinicnetwork -o yaml|grep routeStatus
        routeStatus: Success
      

    If all changes are applied (up to date) in your current version, there is no need to stop the running workload to reinstall the operator. See Update daemon pod to use latest version and Update controller to use latest version to get the image with the latest minor updates and bug fixes.

    Otherwise, check live migration

Customize Multi-NIC CNI controller of operator

If the multi-nic-cni operator is managed by the Operator Lifecycle Manager (OLM), i.e., installed by operator-sdk run bundle or via Operator Hub, any modification to the controller deployment (multi-nic-cni controller pod) will be overridden by the OLM.

To modify values such as the resource request/limit of the controller pod, you need to edit the .spec.install.spec.deployments section in the ClusterServiceVersion (CSV) resource of the multi-nic-cni operator.

You can locate the CSV resource of the multi-nic-cni operator in your cluster with the following command.

kubectl get csv -l operators.coreos.com/multi-nic-cni-operator.multi-nic-cni-operator -A

Before v1.0.5, the CSV is created in all namespaces. You need to edit the CSV in the namespace where the controller is deployed; modifications to the CSV in other namespaces will not be applied.
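
For example, to raise the controller's memory limit, edit the CSV and adjust the resources of the manager container (a sketch; the surrounding structure is abbreviated and the 2Gi limit is illustrative):

# kubectl edit csv <csv name from the command above> -n <controller namespace>
spec:
  install:
    spec:
      deployments:
      - name: multi-nic-cni-operator-controller-manager
        spec:
          template:
            spec:
              containers:
              - name: manager
                resources:
                  limits:
                    memory: 2Gi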