ChatGPT

I have 2 ASUS GX10s and a 0.4m 400G QSFP112 DAC cable. Please walk me through clustering the GX10s to run larger models.

I'll lay this out as a practical build path: first the physical/network layer over the 400G DAC, then OS/NVIDIA checks, then the realistic software options for multi-node inference. I'll also flag where "clustering" helps versus where current local-LLM tooling still won't automatically split one model across two machines.

Yes. With 2 ASUS Ascent GX10s and the QSFP112 DAC, the sensible path is:

- Use the 10G Ethernet port or Wi-Fi for normal internet/LAN access
- Use the QSFP112 DAC only as the private high-speed cluster link
- Verify the direct 200GbE/ConnectX link
- Set up passwordless SSH
- Then run a multi-node inference stack: vLLM/SGLang/TensorRT-LLM for serious model sharding, or llama.cpp RPC for simpler experiments

A single GX10 has 128GB unified memory and is advertised for models up to around 200B parameters; NVIDIA says two Spark/GX10-class systems can be linked over high-performance ConnectX networking for models up to roughly 405B parameters. ASUS's GX10 support page gives an official "connect two GX10s" process using the QSFP cable and a netplan file.

1. Name the two machines

On each GX10, set a clear hostname.

On the first GX10:

    sudo hostnamectl set-hostname gx10a

On the second GX10:

    sudo hostnamectl set-hostname gx10b

Reboot both, or log out and back in:

    sudo reboot

After reboot:

    hostname

2. Connect the cable

Plug the QSFP112 DAC directly between the ConnectX ports on the two GX10s. Keep both machines also connected to the internet through their normal LAN/Wi-Fi/10G network, because the QSFP link is just for private node-to-node traffic.

3. Apply ASUS/NVIDIA's official GX10 direct-link netplan

ASUS's official instructions currently use the netplan file from NVIDIA's DGX Spark playbooks and note that their example interface is enp1s0f1np1.
Run this on both GX10s:

    sudo wget -O /etc/netplan/40-cx7.yaml \
      https://github.com/NVIDIA/dgx-spark-playbooks/raw/main/nvidia/connect-two-sparks/assets/cx7-netplan.yaml
    sudo chmod 600 /etc/netplan/40-cx7.yaml
    sudo netplan apply

Then check the high-speed interface:

    ip addr

You are looking for a 169.254.x.x address on the ConnectX interface. ASUS's example shows one node as 169.254.42.38 and the other as 169.254.213.227, but yours will likely be different.

On each machine, also run:

    ip link

Look for the interface that is UP and associated with the ConnectX/QSFP connection.

4. Confirm the two GX10s can see each other

Suppose you find:

    gx10a QSFP IP: 169.254.42.38
    gx10b QSFP IP: 169.254.213.227

From gx10a:

    ping 169.254.213.227

From gx10b:

    ping 169.254.42.38

If ping works, the direct cluster link is alive.

For convenience, add both private IPs to /etc/hosts on both machines:

    sudo nano /etc/hosts

Add something like:

    169.254.42.38 gx10a-cluster
    169.254.213.227 gx10b-cluster

Then test:

    ping gx10a-cluster
    ping gx10b-cluster

5. Set up passwordless SSH

NVIDIA's two-Spark playbook says passwordless SSH between nodes is part of the functional distributed setup. Use the same Linux username on both GX10s if possible.

On gx10a:

    ssh-keygen -t ed25519
    ssh-copy-id gx10b-cluster

On gx10b:

    ssh-keygen -t ed25519
    ssh-copy-id gx10a-cluster

Test both directions:

    ssh gx10b-cluster hostname
    ssh gx10a-cluster hostname

You want this to work without asking for a password.

6. Verify NVIDIA/GPU/driver stack on both

On each GX10:

    nvidia-smi

Also check Python, CUDA, and basic GPU visibility:

    python3 --version
    nvcc --version || true

If nvidia-smi is missing or broken, stop there and fix the NVIDIA/DGX OS stack first.

7. Decide what kind of "clustering" you actually want

This is the important part.

Option A: easiest useful cluster, two independent model servers

This does not let one huge model span both GX10s, but it is often the most useful production setup.
Run one model server on each GX10, then put a simple load balancer or router in front of them.

This is great for:

- many users
- agent workloads
- batch jobs
- multiple models
- RAG pipelines
- parallel experiments

For example, run Ollama, llama.cpp server, LM Studio server, vLLM, or SGLang independently on each GX10.

Option B: true larger-model sharding across both GX10s

This is what you asked for: one model split across two machines. For that, the strongest candidates are:

    Stack                    Best use
    vLLM + Ray               Good general-purpose OpenAI-compatible serving with multi-node support
    SGLang                   High-performance serving, structured generation, multi-node tensor parallelism
    TensorRT-LLM / Triton    NVIDIA-optimized, serious production path, more setup
    llama.cpp RPC            Simpler experimental distributed inference, especially GGUF models

vLLM's current docs describe Ray as the default runtime for multi-node inference and tensor_parallel_size as the way to shard across GPUs. SGLang's multi-node docs describe multi-node deployment for models that exceed a single node's GPU memory and call out high-speed interconnects such as InfiniBand/RoCE/high-bandwidth Ethernet. TensorRT-LLM also supports single-GPU, multi-GPU, and multi-node deployments.

For your first successful "real cluster" test, I would probably start with vLLM + Ray or llama.cpp RPC, then move to TensorRT-LLM later.

8. Install Ray + vLLM on both GX10s

On both machines, create a clean environment:

    python3 -m venv ~/vllm-cluster
    source ~/vllm-cluster/bin/activate
    pip install --upgrade pip wheel setuptools
    pip install vllm ray

Because the GX10 is ARM/Grace Blackwell/DGX OS territory, some packages may already be NVIDIA-optimized or container-based. If pip install vllm fails, the better path may be NVIDIA's recommended container for Spark/GX10 rather than a plain pip install.

9. Start a Ray cluster over the QSFP link

On gx10a, using its QSFP/private IP:

    ray start --head \
      --node-ip-address=169.254.42.38 \
      --port=6379

It will print a join command.
On gx10b, join using the first node's private QSFP address:

    ray start --address='169.254.42.38:6379' \
      --node-ip-address=169.254.213.227

Then on gx10a:

    ray status

You should see both nodes.

10. Try a smaller distributed vLLM test first

Do not start with a 405B model. First prove that multi-node serving works with a smaller model.

Example shape:

    source ~/vllm-cluster/bin/activate
    vllm serve meta-llama/Llama-3.1-70B-Instruct \
      --tensor-parallel-size 2 \
      --distributed-executor-backend ray \
      --host 0.0.0.0 \
      --port 8000

That command is the conceptual structure; the exact model name, quantization format, and flags may need adjustment depending on what models you have access to and what vLLM currently supports cleanly on the GX10.

Then test from another terminal:

    curl http://localhost:8000/v1/models

And a completion/chat request:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Llama-3.1-70B-Instruct",
        "messages": [
          {"role": "user", "content": "Say hello from the two-node GX10 cluster."}
        ],
        "max_tokens": 80
      }'

11. When that works, move up to larger models

For two GX10s, realistic target experiments are:

- Llama 3.1 405B, heavily quantized / FP8 / FP4 depending on stack support
- DeepSeek/Qwen MoE models, if supported by the serving stack
- Large Nemotron / NVIDIA-optimized models from NGC

The "up to 405B" claim depends heavily on model format, quantization, KV cache size, context length, serving engine, and batch size. Long context can eat a shocking amount of memory even if the weights fit.

Start with conservative settings:

    --max-model-len 4096
    --gpu-memory-utilization 0.85

Then increase context after it works.

12. Alternative: llama.cpp RPC path

If you want a more "bare metal" test with GGUF models, llama.cpp has an RPC backend; its RPC README says llama.cpp distributes weights and KV cache across available local and remote devices in proportion to available memory by default.

Conceptually:

On gx10b, start the RPC server:

    cd ~/llama.cpp
    ./build/bin/rpc-server -H 169.254.213.227 -p 50052

On gx10a, run the main server using the remote RPC backend:

    cd ~/llama.cpp
    ./build/bin/llama-server \
      -m /models/your-large-model.gguf \
      --rpc 169.254.213.227:50052 \
      --host 0.0.0.0 \
      --port 8080

The exact flags can change as llama.cpp evolves, so treat this as the path to test, not a guaranteed copy/paste final command. llama.cpp's multi-GPU docs also explain the difference between layer and tensor splitting: layer splitting is generally more compatible and less interconnect-hungry, while tensor splitting can improve token latency but is much more dependent on fast GPU interconnects.

13. Recommended first-day success checklist

I would do this in this order:

    # both nodes
    hostname
    ip addr
    nvidia-smi

    # after QSFP netplan
    ping gx10a-cluster
    ping gx10b-cluster

    # SSH
    ssh gx10b-cluster hostname
    ssh gx10a-cluster hostname

    # Ray
    ray status

    # serving
    curl http://gx10a-cluster:8000/v1/models

14. Practical expectation

The cable and hardware make this possible, but software support is still the tricky part. A lot of local tools (LM Studio, Ollama, Open WebUI, etc.) are excellent on one machine but do not automatically turn two GX10s into one giant shared-memory model host.

For true one-model-across-two-GX10s, focus on:

- vLLM + Ray
- SGLang multi-node
- TensorRT-LLM/Triton
- llama.cpp RPC

For most reliable everyday usefulness, also run both GX10s as separate inference servers behind Open WebUI or a small routing layer. That gives you immediate value while you work through the more delicate distributed model-sharding setup.
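On the point in step 11 that the "up to 405B" figure depends on quantization and KV cache size, a rough back-of-envelope calculation can help before downloading anything. The sketch below is only an estimate under stated assumptions: the layer count, KV-head count, and head dimension are illustrative values for a 405B-class dense model with grouped-query attention (read the real ones from the model's config.json), and it ignores activations, runtime overhead, and the memory the OS itself needs.

```python
# Rough memory budget for a sharded model. All model-shape numbers below are
# illustrative assumptions; read the real values from the model's config.json.

GB = 1e9  # decimal gigabytes, to match the "128GB" marketing figure


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Bytes needed for the weights alone at a given quantization width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes for the KV cache: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size


if __name__ == "__main__":
    total_memory = 2 * 128 * GB  # two GX10s, before OS and runtime overhead
    # Assumed shape for a 405B-class dense model with grouped-query attention.
    layers, kv_heads, head_dim = 126, 8, 128
    for bits in (16, 8, 4):
        w = weight_bytes(405e9, bits)
        for ctx in (4096, 32768):
            kv = kv_cache_bytes(layers, kv_heads, head_dim, ctx)
            verdict = "fits" if w + kv < total_memory else "does not fit"
            print(f"{bits:>2}-bit, ctx {ctx:>6}: weights {w / GB:6.1f} GB, "
                  f"KV {kv / GB:5.2f} GB -> {verdict}")
```

At 4 bits per weight, 405B parameters come to about 202.5 GB of weights alone, which is why only heavily quantized variants are plausible on 2 x 128GB, and why the conservative --max-model-len setting is a sensible starting point.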
Pasted text(71).txt Document

    gb10a@gx10a:~$ ssh gx10b@gx10b-cluster hostname
    gx10b
    gx10b@gx10b:~$ ssh-copy-id -i ~/.ssh/id_ed25519.pub gb10a@gx10a-cluster
    /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/gx10b/.ssh/id_ed25519.pub"
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    gb10a@gx10a-cluster's password:

    Number of key(s) added: 1

    Now try logging into the machine, with: "ssh 'gb10a@gx10a-cluster'"
    and check to make sure that only the key(s) you wanted were added.

    gx10b@gx10b:~$ ssh gb10a@gx10a-cluster hostname
    gx10a
    gb10a@gx10a:~$ ssh gx10b@gx10b-cluster "ip route get 169.254.10.1"
    169.254.10.1 dev enp1s0f0np0 src 169.254.10.2 uid 1000
        cache
    gx10b@gx10b:~$ ssh gb10a@gx10a-cluster "ip route get 169.254.10.2"
    169.254.10.2 dev enp1s0f0np0 src 169.254.10.1 uid 1000
        cache
    (ray-test) gx10b@gx10b:~$ source ~/ray-test/bin/activate
    ray start --address='169.254.10.1:6379' --node-ip-address=169.254.10.2
    Local node IP: 169.254.10.2

    --------------------
    Ray runtime started.
    --------------------

    To terminate the Ray runtime, run
      ray stop
    (ray-test) gb10a@gx10a:~$ ray status
    ======== Autoscaler status: 2026-05-27 15:08:06.687999 ========
    Node status
    ---------------------------------------------------------------
    Active:
     1 node_d3b5024ec56577baf039205710339ea8d2dbc3ae8c7f4497992f3215
     1 node_f99375afb1ca83db4ac9423c69338200fba5c658abf28ff272ca4f1a
    Pending:
     (no pending nodes)
    Recent failures:
     (no failures)

    Resources
    ---------------------------------------------------------------
    Total Usage:
     0.0/40.0 CPU
     0.0/2.0 GPU
     0B/163.27GiB memory
     0B/69.97GiB object_store_memory

    From request_resources:
     (none)
    Pending Demands:
     (no resource demands)
    (ray-test) gb10a@gx10a:~$

Fantastic! Both machines are now linked, and Ray is running across your two GX10s. The cluster shows both nodes and no issues. Now you're ready to try distributed workloads! If you need help running a model or scaling tasks across both machines, I'm here. Otherwise, congrats on getting the cluster set up!

Pasted text(72).txt Document

    (ray-test) gb10a@gx10a:~$ source ~/ray-test/bin/activate
    (ray-test) gb10a@gx10a:~$ python - <<'PY'
    import ray
    import socket
    import os

    ray.init(address="auto")

    @ray.remote(num_gpus=1)
    def gpu_task(i):
        return {
            "task": i,
            "hostname": socket.gethostname(),
            "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
            "ray_gpu_ids": ray.get_gpu_ids(),
        }

    results = ray.get([gpu_task.remote(i) for i in range(2)])
    for r in results:
        print(r)
    PY
    2026-05-27 18:40:58,271 INFO worker.py:1814 -- Connecting to existing Ray cluster at address: 169.254.10.1:6379...
    2026-05-27 18:40:58,287 INFO worker.py:2012 -- Connected to Ray cluster.
    /home/gb10a/ray-test/lib/python3.12/site-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default).
To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 warnings.warn( {'task': 0, 'hostname': 'gx10a', 'CUDA_VISIBLE_DEVICES': '0', 'ray_gpu_ids': [0]} {'task': 1, 'hostname': 'gx10b', 'CUDA_VISIBLE_DEVICES': '0', 'ray_gpu_ids': [0]} Pasted text(73).txt Document (ray-test) gb10a@gx10a:~$ source ~/ray-test/bin/activate vllm --version python -c "import vllm; print(vllm.__version__)" 0.21.0 0.21.0 (ray-test) gb10a@gx10a:~$ In another terminal window: gb10a@gx10a:~$ curl http://127.0.0.1:8000/v1/models curl: (7) Failed to connect to 127.0.0.1 port 8000 after 0 ms: Couldn't connect to server gb10a@gx10a:~$ curl http://127.0.0.1:8000/v1/models curl: (7) Failed to connect to 127.0.0.1 port 8000 after 0 ms: Couldn't connect to server gb10a@gx10a:~$ Pasted text(74).txt Document gb10a@gx10a:~$ curl http://127.0.0.1:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "Qwen/Qwen2.5-7B-Instruct", "messages": [ {"role": "user", "content": "Write one paragraph explaining that this response is coming from a two-node GX10 Ray/vLLM cluster."} ], "max_tokens": 200, "temperature": 0.2 }' curl: (7) Failed to connect to 127.0.0.1 port 8000 after 0 ms: Couldn't connect to server Pasted text(75).txt Document It still doesn't seem to have finished Pasted text(76).txt Document Pasted text(77).txt Document Stopped thinking Quick answer Please continue Thought for 5s Great — this latest log is progress. The shared /opt/models/... path fix worked: both workers found and loaded the model files this time. The failure moved later, during/after CUDA graph profiling / KV cache initialization. The key lines are: Worker_TP1 ... Model loading took 7.16 GiB memory and 2.839555 seconds Worker_TP0 ... Model loading took 7.16 GiB memory and 25.972658 seconds world_size=2 ... TP rank 0 / TP rank 1 So the model is now loading on both GX10s. 
Then the remote Ray worker died during the cache/config phase:

    RayWorkerProc rank=[1] died unexpectedly
    RuntimeError: cancelled

The log also shows vLLM was doing CUDA graph memory profiling and warned about shared-memory broadcast timing while compilation/profiling was underway.

Try a more conservative vLLM startup

Stop any leftover vLLM/Ray worker processes first.

On both machines:

    source ~/ray-test/bin/activate
    ray stop
    pkill -f vllm || true

Then restart Ray.

On gx10a:

    source ~/ray-test/bin/activate
    ray start --head --node-ip-address=169.254.10.1 --port=6379

On gx10b:

    source ~/ray-test/bin/activate
    ray start --address='169.254.10.1:6379' --node-ip-address=169.254.10.2

On gx10a, verify:

    ray status

Now try this safer server command on gx10a:

    source ~/ray-test/bin/activate
    VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
    vllm serve /opt/models/Qwen2.5-7B-Instruct \
      --served-model-name Qwen/Qwen2.5-7B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --distributed-executor-backend ray \
      --tensor-parallel-size 2 \
      --max-model-len 4096 \
      --gpu-memory-utilization 0.80 \
      --enforce-eager

Why this version:

- --enforce-eager disables CUDA graph execution. vLLM's engine-argument docs describe enforce_eager as always using eager-mode PyTorch and disabling CUDA graph execution.
- VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 turns off the CUDA graph memory profiling behavior that your log says is enabled by default in vLLM 0.21.0. Your log explicitly says CUDA graph memory profiling is enabled and suggests this environment variable as the way to disable it.
- --gpu-memory-utilization 0.80 leaves more headroom while we stabilize distributed startup.
If that starts, test it

In a second terminal on gx10a:

    curl http://127.0.0.1:8000/v1/models

Then:

    curl http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "user", "content": "Write one paragraph explaining that this response is coming from a two-node GX10 Ray/vLLM cluster."}
        ],
        "max_tokens": 200,
        "temperature": 0.2
      }'

If it still fails, try pipeline parallel instead of tensor parallel

With one GPU per node, tensor_parallel_size=2 does work conceptually, but it forces a tensor-parallel all-reduce across machines for every layer. The warnings say vLLM is spreading TP across nodes. Since you have 200G networking that may be okay, but a more natural one-GPU-per-node layout can be pipeline parallelism, where each node holds a contiguous block of layers and only the activations at the stage boundary cross the link:

    source ~/ray-test/bin/activate
    VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 \
    vllm serve /opt/models/Qwen2.5-7B-Instruct \
      --served-model-name Qwen/Qwen2.5-7B-Instruct \
      --host 0.0.0.0 \
      --port 8000 \
      --distributed-executor-backend ray \
      --tensor-parallel-size 1 \
      --pipeline-parallel-size 2 \
      --max-model-len 4096 \
      --gpu-memory-utilization 0.80 \
      --enforce-eager

What I think is happening

You solved the first real blocker: the remote worker could not see /home/gb10a/models/...; /opt/models/... fixed that.

Now the model loads on both GPUs, then the remote worker dies during vLLM's V1 startup/profiling/cache initialization. The quickest practical workaround is to remove CUDA graph/profiling complexity with --enforce-eager and reduced memory utilization.

If that works, we can later remove one flag at a time to regain performance.
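Because a two-node startup can take a while, curl will report "Couldn't connect to server" until vLLM has finished loading, which is easy to mistake for a crash. A small poller can wait for the endpoint instead of retrying by hand. This is a minimal sketch, not part of vLLM: the function names, the 900-second deadline, and the 5-second interval are assumptions chosen for illustration, and it only relies on the OpenAI-compatible /v1/models route used above.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, List


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_for_models(base_url: str, deadline_s: float = 900.0,
                    interval_s: float = 5.0,
                    fetch: Callable[[str], dict] = fetch_json,
                    sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll /v1/models until the server answers, then return the model IDs."""
    url = base_url.rstrip("/") + "/v1/models"
    waited = 0.0
    while True:
        try:
            return [m["id"] for m in fetch(url)["data"]]
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Connection refused usually means startup is still in progress.
            if waited >= deadline_s:
                raise TimeoutError(f"{url} not ready after {waited:.0f}s")
            sleep(interval_s)
            waited += interval_s


if __name__ == "__main__":
    # Assumes the server from the command above is starting on this host.
    print(wait_for_models("http://127.0.0.1:8000"))
```

Run it in the second terminal right after launching vllm serve. If it times out, the server log in the first terminal is the place to look, since the poller cannot tell a slow start from a dead worker.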
    gb10a@gx10a:~$ curl http://127.0.0.1:8000/v1/models
    {"object":"list","data":[{"id":"Qwen/Qwen2.5-7B-Instruct","object":"model","created":1779927940,"owned_by":"vllm","root":"/opt/models/Qwen2.5-7B-Instruct","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-8b062bed74594768","object":"model_permission","created":1779927940,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
    gb10a@gx10a:~$ curl http://127.0.0.1:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
          {"role": "user", "content": "Write one paragraph explaining that this response is coming from a two-node GX10 Ray/vLLM cluster."}
        ],
        "max_tokens": 200,
        "temperature": 0.2
      }'
    {"id":"chatcmpl-b68f1806856177f0","object":"chat.completion","created":1779927965,"prompt_routed_experts":null,"model":"Qwen/Qwen2.5-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"This response is generated by a two-node GX10 Ray/vLLM cluster, which consists of two powerful GX10 nodes optimized for efficient distributed inference using the Ray framework and the vLLM library.
This setup allows for parallel processing and significantly enhances the speed and scalability of generating text responses, ensuring both efficiency and performance in handling complex queries.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.21.0-tp2-nohash","usage":{"prompt_tokens":51,"total_tokens":122,"completion_tokens":71,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"prompt_text":null,"kv_transfer_params":null}gb10a@gx10a:~$ gb10a@gx10a:~$ Screenshot (51).png Screenshot (52).png I'm now getting these connection error messages popping up, which weren't there before setting all this up Pasted text(78).txt Document gb10a@gx10a:~$ nmcli device status DEVICE TYPE STATE CONNECTION wlP9s9 wifi connected TMOBILE-793A enp1s0f0np0 ethernet connected netplan-enp1s0f0np0 lo loopback connected (externally) lo docker0 bridge connected (externally) docker0 enP2p1s0f0np0 ethernet disconnected -- p2p-dev-wlP9s9 wifi-p2p disconnected -- enP2p1s0f1np1 ethernet unavailable -- enP7s7 ethernet unavailable -- enp1s0f1np1 ethernet unavailable -- NAME UUID TYPE DEVICE AUTOCONNECT TMOBILE-793A 82340a3c-655d-436d-b226-451ca41f26bb wifi wlP9s9 yes netplan-enp1s0f0np0 05ec54af-9f53-3227-867e-c7e879b25c37 ethernet enp1s0f0np0 yes lo c8b8418e-bc32-4319-a98e-9ded46ac125c loopback lo no docker0 c2e774f1-5975-4545-b2a0-11bd131323a4 bridge docker0 no OobeHotspot ea258393-ed46-4721-89df-ddfcec79d083 wifi -- no OobeHotspot a93a979d-6206-4007-8def-36dba55e74b2 wifi -- no OobeHotspot 8d5e070b-9add-40de-aa90-c56bb32c1381 wifi -- no Wired connection 1 ddae69d0-8873-3a7b-811d-b1163cdf287e ethernet -- yes Wired connection 2 03b2a174-ba20-3a3e-ad59-669f4436d481 ethernet -- yes Wired connection 3 abad5efa-e2b6-3964-ac3f-9d6e8705b43a ethernet -- yes Wired connection 4 
5a9bbdaf-5b23-34fc-9313-709e5b02e0c3 ethernet -- yes
Wired connection 5 7d41b90e-0d5b-3c67-b465-fe987bfcc23f ethernet -- yes

Perfect. The logs identify the exact culprit.

The popup is coming from:

    Wired connection 4
    device: enP2p1s0f0np0
    reason: ip-config-unavailable

NetworkManager is repeatedly trying DHCP on enP2p1s0f0np0, timing out, and showing the desktop notification. Your actual working cluster link is not that interface. Your working cluster interface is:

    enp1s0f0np0 -> netplan-enp1s0f0np0

The log repeatedly shows NetworkManager auto-activating Wired connection 4 on enP2p1s0f0np0, then failing DHCP.

Run this on gx10a:

    sudo nmcli connection modify "Wired connection 4" connection.autoconnect no
    sudo nmcli connection down "Wired connection 4" || true

Then verify:

    nmcli -f NAME,UUID,TYPE,DEVICE,AUTOCONNECT con show
    nmcli device status

You want to see:

    Wired connection 4 ... ethernet -- no

And you still want:

    wlP9s9 wifi connected TMOBILE-793A
    enp1s0f0np0 ethernet connected netplan-enp1s0f0np0

Then confirm the cluster link still works:

    ping -I enp1s0f0np0 169.254.10.2

If you get the same popup on gx10b, run the same diagnostic there:

    nmcli device status
    nmcli -f NAME,UUID,TYPE,DEVICE,AUTOCONNECT con show
    journalctl -u NetworkManager -b --since "30 minutes ago" | grep -Ei "failed|activation|enp|enP|wlP|connection"

Then disable whichever unused Wired connection X is trying DHCP and failing.

    gb10a@gx10a:~$ sudo nmcli connection modify "Wired connection 4" connection.autoconnect no
    sudo nmcli connection down "Wired connection 4" || true
    [sudo] password for gb10a:
    Error: 'Wired connection 4' is not an active connection.
    Error: no active connection provided.

That is okay. The important first command probably worked.

This part:

    Error: 'Wired connection 4' is not an active connection.
    Error: no active connection provided.
just means Wired connection 4 was not active at that exact moment, so NetworkManager could not bring it "down." That is harmless.

Now verify that autoconnect is disabled:

    nmcli -f NAME,DEVICE,AUTOCONNECT con show | grep -E "Wired connection 4|netplan-enp1s0f0np0|TMOBILE"

You want to see something like:

    TMOBILE-793A wlP9s9 yes
    netplan-enp1s0f0np0 enp1s0f0np0 yes
    Wired connection 4 -- no

Then wait a minute or two and check whether NetworkManager is still trying it:

    journalctl -u NetworkManager -b --since "2 minutes ago" | grep -Ei "Wired connection 4|enP2p1s0f0np0|failed|activation"

If no new Wired connection 4 activation attempts appear, the popup should stop.

Also confirm the cluster link still works:

    ping -I enp1s0f0np0 169.254.10.2

If the popup persists, run this stronger version:

    sudo nmcli connection delete "Wired connection 4"

That only deletes the stale NetworkManager profile. It should not affect your working cluster link, which is:

    netplan-enp1s0f0np0 on enp1s0f0np0

    gb10a@gx10a:~$ nmcli -f NAME,DEVICE,AUTOCONNECT con show | grep -E "Wired connection 4|netplan-enp1s0f0np0|TMOBILE"
    TMOBILE-793A wlP9s9 yes
    netplan-enp1s0f0np0 enp1s0f0np0 yes
    Wired connection 4 -- no
    gb10a@gx10a:~$

Perfect. That fixed the offending autoconnect profile on gx10a.

You now have:

    TMOBILE-793A wlP9s9 yes
    netplan-enp1s0f0np0 enp1s0f0np0 yes
    Wired connection 4 -- no

That is exactly what we wanted.

Now check whether new errors stopped:

    journalctl -u NetworkManager -b --since "2 minutes ago" | grep -Ei "Wired connection 4|enP2p1s0f0np0|failed|activation"

If that returns no new activation failures, the popups should stop on gx10a.

Then confirm the 200G cluster link is still okay:

    ping -I enp1s0f0np0 169.254.10.2

If the same popup appears on gx10b, repeat the same diagnostic there and disable the corresponding stale Wired connection X profile.

    gb10a@gx10a:~$ ping -I enp1s0f0np0 169.254.10.2
    PING 169.254.10.2 (169.254.10.2) from 169.254.10.1 enp1s0f0np0: 56(84) bytes of data.
    64 bytes from 169.254.10.2: icmp_seq=1 ttl=64 time=0.216 ms
    64 bytes from 169.254.10.2: icmp_seq=2 ttl=64 time=0.788 ms
    64 bytes from 169.254.10.2: icmp_seq=3 ttl=64 time=0.861 ms
    ^C
    --- 169.254.10.2 ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2057ms
    rtt min/avg/max/mdev = 0.216/0.621/0.861/0.288 ms

Please provide 1 concise copy-paste block for gx10b

Run this on gx10b:

    nmcli device status
    nmcli -f NAME,DEVICE,AUTOCONNECT con show
    journalctl -u NetworkManager -b --since "30 minutes ago" | grep -Ei "failed|activation|enp|enP|wlP|connection" | tail -80
    sudo nmcli connection modify "Wired connection 4" connection.autoconnect no
    sudo nmcli connection down "Wired connection 4" || true
    nmcli -f NAME,DEVICE,AUTOCONNECT con show | grep -E "Wired connection 4|netplan-enp1s0f0np0|TMOBILE"
    ping -I enp1s0f0np0 169.254.10.1

If the log shows a different failing profile name than "Wired connection 4", replace "Wired connection 4" with that exact name.

    nmcli device status
    nmcli -f NAME,DEVICE,AUTOCONNECT con show
    journalctl -u NetworkManager -b --since "30 minutes ago" | grep -Ei "failed|activation|enp|enP|wlP|connection" | tail -80
    sudo nmcli connection modify "Wired connection 4" connection.autoconnect no
    sudo nmcli connection down "Wired connection 4" || true
    nmcli -f NAME,DEVICE,AUTOCONNECT con show | grep -E "Wired connection 4|netplan-enp1s0f0np0|TMOBILE"
    ping -I enp1s0f0np0 169.254.10.1

That block is okay only if the failing profile on gx10b is also Wired connection 4. A safer copy-paste block for gx10b is this:

    nmcli device status
    nmcli -f NAME,DEVICE,AUTOCONNECT con show
    echo
    echo "Recent failing NetworkManager activations:"
    journalctl -u NetworkManager -b --since "30 minutes ago" | grep -Ei "failed|activation|enp|enP|wlP|connection" | tail -80
    echo
    echo "Disabling likely stale wired autoconnect profiles, preserving Wi-Fi and netplan cluster link..."
    for c in "Wired connection 1" "Wired connection 2" "Wired connection 3" "Wired connection 4" "Wired connection 5"; do
      if nmcli -t -f NAME con show | grep -Fxq "$c"; then
        sudo nmcli connection modify "$c" connection.autoconnect no
        sudo nmcli connection down "$c" || true
      fi
    done
    echo
    echo "Current important connection autoconnect status:"
    nmcli -f NAME,DEVICE,AUTOCONNECT con show | grep -E "Wired connection|netplan-enp1s0f0np0|TMOBILE|wlP9s9"
    echo
    echo "Testing cluster link back to gx10a:"
    ping -c 3 -I enp1s0f0np0 169.254.10.1

That turns off autoconnect for every Wired connection 1 to 5 profile that exists, whether or not it was the one failing, and leaves the important ones alone:

    TMOBILE / wlP9s9
    netplan-enp1s0f0np0 / enp1s0f0np0

Pasted text(79).txt Document

That looks good. On gx10b, the stale wired profiles are now disabled and the cluster link still works.

You now have the right state on gx10b:

    TMOBILE-793A wlP9s9 yes
    netplan-enp1s0f0np0 enp1s0f0np0 yes
    Wired connection 1 -- no
    Wired connection 2 -- no
    Wired connection 3 -- no
    Wired connection 4 -- no
    Wired connection 5 -- no

And the 200G link is still working:

    169.254.10.2 -> 169.254.10.1
    0% packet loss

The earlier errors are harmless because the down command was attempted on profiles that were not active at that moment. The key part is that connection.autoconnect is now no. The log also confirms NetworkManager successfully updated and deactivated Wired connection 4.

The connection popups should stop now on both machines. If one appears again, check only the last few minutes:

    journalctl -u NetworkManager -b --since "5 minutes ago" | grep -Ei "failed|activation|Wired connection|enP2p1s0f0np0"
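If the machines pick up more auto-generated profiles later, hard-coding "Wired connection 1" through "Wired connection 5" will miss them. A more general approach is to parse nmcli's terse output and list only the auto-named wired profiles that still autoconnect but have no device attached. The sketch below is an illustration under assumptions: the helper name is hypothetical, it assumes terse mode reports wired profiles as 802-3-ethernet (it also accepts ethernet), and it prints the nmcli commands instead of running them so you can review the list first.

```python
import re
import subprocess
from typing import List

# Terse mode is assumed to report the long type name; accept both spellings.
WIRED_TYPES = {"ethernet", "802-3-ethernet"}


def stale_wired_profiles(nmcli_terse: str) -> List[str]:
    """Return auto-named wired profiles that autoconnect but have no device.

    Expects the output of:
        nmcli -t -f NAME,TYPE,DEVICE,AUTOCONNECT connection show
    """
    stale = []
    for line in nmcli_terse.splitlines():
        # Terse mode separates fields with ':' and escapes literal colons.
        fields = re.split(r"(?<!\\):", line)
        if len(fields) != 4:
            continue
        name, ctype, device, autoconnect = fields
        if (name.startswith("Wired connection") and ctype in WIRED_TYPES
                and device in ("", "--") and autoconnect == "yes"):
            stale.append(name)
    return stale


if __name__ == "__main__":
    out = subprocess.run(
        ["nmcli", "-t", "-f", "NAME,TYPE,DEVICE,AUTOCONNECT",
         "connection", "show"],
        capture_output=True, text=True, check=True).stdout
    for name in stale_wired_profiles(out):
        # Print rather than execute, so the list can be reviewed first.
        print(f'sudo nmcli connection modify "{name}" connection.autoconnect no')
```

The netplan-enp1s0f0np0 and Wi-Fi profiles are skipped because they do not start with "Wired connection". One caveat: a profile that happens to be mid-activation when the listing runs will show a device and be missed, so rerun it if a popup comes back.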