Bug: Behavior Freezes On Dual Nodes - Troubleshooting Guide
Freezes during behavior execution on dual nodes can be a frustrating problem to track down. This guide digs into a specific case where the program stalls at the rollout phase, holding VRAM while showing zero GPU power usage. We will analyze the configuration, examine the logs, and walk through potential solutions to get the simulation running smoothly again. If you're encountering similar problems with distributed training or OmniGibson environments, this article is for you; understanding the interplay between multi-GPU placement and inter-node communication is key to resolving such issues.
Problem Description
The core problem lies in the program freezing at the "Generating Rollout Epochs" stage when the environment (env) is placed on the 4090 node, while the rollout and actor processes reside on the A100 node. This peculiar behavior, where VRAM is consumed without corresponding GPU activity, suggests a potential bottleneck or communication breakdown between the nodes. Identifying the root cause requires a meticulous examination of the configuration and log files. This is a common issue when dealing with robotics simulation and reinforcement learning, especially in complex environments.
Configuration Analysis
To understand the potential causes, let's dissect the provided configuration. The configuration outlines a setup using two nodes with specific component placements: actor processes on cores 0-7, environment on cores 8-11, and rollout processes on cores 0-3. Key aspects of the configuration include:
- Environment Settings: The environment is configured for a "behavior" simulator type, utilizing the R1Pro robot in an InteractiveTraversableScene, specifically a house with living room and kitchen areas. The tasks involve activities like "picking_up_trash," with a maximum episode length of 2000 steps.
- Cluster Configuration: The cluster consists of two nodes, with the actor, environment, and rollout components distributed across them. This distributed setup is intended to leverage the computational power of both nodes, but it also introduces complexities in inter-node communication.
- Algorithm Settings: The algorithm employs several advanced techniques, including GAE for advantage estimation, actor-critic loss, and token-level entropy and log probability calculations. These settings are designed to optimize the learning process but may also contribute to instability if not configured correctly.
- Rollout and Actor Settings: The rollout process uses a Hugging Face backend and the OpenVLA-OFT-Behavior model, with bfloat16 precision and pipeline stage parallelism. The actor process uses FSDP for distributed training, with specific configurations for sharding strategy, mixed precision, and optimization parameters. The use of Hugging Face models and FSDP highlights the advanced nature of the setup and the potential for complex interactions between components.
Analyzing these settings, we can identify several areas that might be contributing to the freeze. For instance, the communication overhead between nodes, the FSDP configuration, and the memory management settings could all be potential culprits. A deep dive into each of these aspects is essential for pinpointing the exact cause.
Log File Examination
The provided log file snippets offer valuable clues. Several warnings and errors indicate potential issues:
- File System Full: The repeated "/tmp/ray/session… is over 95% full" errors suggest a disk space issue on the node hosting the Ray session. This can impede object creation and potentially lead to hangs.
- Gloo ConnectFullMesh Failure: The "Gloo connectFullMesh failed" error indicates a problem with the Gloo communication backend, which is used for distributed training in PyTorch. This error is critical and likely contributes to the program freeze.
- OmniGibson Warnings: Several warnings related to OmniGibson, such as "SdRenderVarPtr missing valid input renderVar LdrColorSDhost" and "Failed to startup plugin carb.windowing-glfw.plugin," might indicate issues with the environment setup or rendering configurations.
- FSDP Warnings: The UserWarnings about `full_state_dict` being returned when using `NO_SHARD` for `ShardingStrategy` suggest that the FSDP sharding strategy might not be optimal for this setup. This could lead to memory inefficiencies and performance bottlenecks.
These log entries collectively paint a picture of a system under stress, with potential issues ranging from disk space limitations to communication failures and suboptimal configurations. Addressing these warnings and errors is crucial for resolving the freeze.
Potential Solutions and Troubleshooting Steps
Based on the analysis of the configuration and log files, here are several potential solutions and troubleshooting steps to address the freezing issue:
1. Disk Space Management
The "file system full" errors are a clear indicator of a problem. Insufficient disk space can prevent the creation of temporary files and hinder inter-process communication. To resolve this:
- Clear Temporary Files: Delete unnecessary files from the `/tmp` directory or other temporary storage locations.
- Increase Disk Space: If possible, increase the disk space allocated to the affected node.
- Adjust Ray Spilling: Configure Ray's spilling mechanism to use a different storage location with more space.
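As a minimal sketch of the last two options, assuming a hypothetical large volume mounted at `/mnt/large_disk`, Ray's temp directory and object-spilling location can be redirected as shown below; the keys follow Ray's documented object-spilling configuration, but exact behavior can vary by Ray version:

```python
import json
import ray

# Hypothetical spill directory on a volume with plenty of free space.
SPILL_DIR = "/mnt/large_disk/ray_spill"

ray.init(
    # Move Ray's session/temp files off the nearly full /tmp partition.
    _temp_dir="/mnt/large_disk/ray_tmp",
    # Spill objects to the larger volume instead of the default under /tmp.
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": SPILL_DIR}}
        )
    },
)
```

Note that on a multi-node cluster launched with `ray start`, these settings generally need to be applied when the head node starts, not in a driver script that merely connects to an already running cluster.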
2. Gloo Communication Issues
The "Gloo connectFullMesh failed" error points to a failure in the distributed communication backend. Gloo is a collective communications library, and its failure can disrupt the entire distributed training process. To troubleshoot this:
- Network Configuration: Ensure that the nodes can communicate with each other over the network. Check firewall settings and network configurations.
- Gloo Backend Settings: Verify that the Gloo backend is correctly configured. Ensure that the environment variables related to Gloo (e.g., `GLOO_SOCKET_IFNAME`) are set appropriately.
- NCCL as an Alternative: Consider using NCCL (NVIDIA Collective Communications Library) as an alternative backend, as it is often more performant for GPU-based communication. This involves changing the distributed backend settings in your configuration.
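The snippet below is a hedged sketch (not the project's actual launcher) of pinning Gloo and NCCL to a specific network interface and running a quick collective sanity check; `eth0`, the master address/port, and the `RANK`/`WORLD_SIZE` environment variables are placeholders that your launcher (e.g., torchrun) would normally provide:

```python
import os
import torch
import torch.distributed as dist

# Placeholder values -- replace with the interface and rendezvous address
# actually reachable between the 4090 and A100 nodes.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Prefer NCCL for GPU tensors; keep Gloo as the CPU-only fallback.
backend = "nccl" if torch.cuda.is_available() else "gloo"
if backend == "nccl":
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

dist.init_process_group(
    backend=backend,
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)

# A quick all_reduce sanity check: if this hangs, the nodes still cannot
# reach each other over the chosen interface.
t = torch.ones(1, device="cuda" if backend == "nccl" else "cpu")
dist.all_reduce(t)
print(f"rank {dist.get_rank()} all_reduce ok: {t.item()}")
```

If the all_reduce hangs even with the interface pinned, the problem is almost certainly network reachability (firewall, wrong interface, or mismatched master address) rather than the training code itself.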
3. FSDP Configuration Optimization
The warnings related to FSDP's NO_SHARD sharding strategy suggest that this configuration might not be optimal. FSDP is designed to shard model parameters, gradients, and optimizer states across multiple GPUs, but NO_SHARD disables this sharding. To improve FSDP performance:
- Experiment with Sharding Strategies: Try different sharding strategies, such as `FULL_SHARD` or `SHARD_GRAD_OP`. These strategies shard the model parameters and gradients, reducing the memory footprint on each GPU.
- Adjust FSDP Size: If using hybrid sharding, ensure that the `fsdp_size` parameter is correctly set. This parameter determines the number of GPUs per FSDP group.
- Enable Gradient Accumulation: Gradient accumulation can improve training stability, especially with sharded models. Ensure that `enable_gradient_accumulation` is set to `True` in the FSDP configuration.
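To make the sharding options concrete, here is a minimal, hedged PyTorch sketch of wrapping a model with `FULL_SHARD` instead of `NO_SHARD`. `MyPolicy` is a stand-in for the actual actor model, and the training framework in use may expose these options through its YAML rather than through direct FSDP calls:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

# Assumes torch.distributed is already initialized (see the Gloo/NCCL section).

# Stand-in for the real actor network (e.g., the OpenVLA-OFT-Behavior policy).
class MyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 7))

    def forward(self, x):
        return self.net(x)

model = MyPolicy().cuda()

fsdp_model = FSDP(
    model,
    # FULL_SHARD shards parameters, gradients, and optimizer state;
    # SHARD_GRAD_OP keeps full parameters but shards gradients/optimizer state.
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    # Match the bfloat16 precision used by the rollout backend.
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=torch.cuda.current_device(),
)
```

Switching away from `NO_SHARD` also makes the `full_state_dict` warnings from the logs disappear, since state-dict gathering then behaves as FSDP expects.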
4. OmniGibson Environment Issues
The OmniGibson warnings might indicate problems with the environment setup or rendering configurations. These warnings, while not directly causing the freeze, can lead to performance bottlenecks or unexpected behavior. To address these:
- Plugin Dependencies: Ensure that all required OmniGibson plugins are correctly installed and loaded. The "Failed to startup plugin carb.windowing-glfw.plugin" warning suggests a potential plugin loading issue.
- Rendering Settings: Review the rendering settings in the OmniGibson configuration. Adjust parameters such as viewer width and height, and ensure that the rendering device is correctly configured.
- Scene Configuration: Verify that the scene is correctly loaded and that all assets are present. Issues with scene loading can lead to rendering errors and performance degradation.
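The sketch below is a heavily hedged illustration of running OmniGibson headless, which sidesteps the GLFW windowing plugin entirely on display-less compute nodes; the config keys and the `Rs_int` scene model are placeholders based on the commonly documented OmniGibson layout and may differ between versions, so check them against your installed release:

```python
# Hedged sketch: key names follow the commonly documented OmniGibson config
# layout and may differ between versions -- verify against your release.
import omnigibson as og
from omnigibson.macros import gm

# Headless mode avoids the GLFW windowing plugin, which cannot start on a
# node without a display (the "carb.windowing-glfw" warning above).
gm.HEADLESS = True

cfg = {
    "scene": {
        "type": "InteractiveTraversableScene",
        # Placeholder scene model -- substitute the house scene from your task config.
        "scene_model": "Rs_int",
    },
    "robots": [{"type": "R1Pro", "obs_modalities": ["rgb"]}],
}

env = og.Environment(configs=cfg)
obs = env.reset()  # return signature varies across OmniGibson versions
```

If scene or asset loading fails in this stripped-down form, the problem lies in the OmniGibson installation or asset paths rather than in the distributed training stack.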
5. Memory Management and Offloading
Given that the program consumes VRAM without corresponding GPU activity, memory management is a critical area to investigate. Consider the following:
- Enable Offloading: Ensure that offloading is enabled for both the rollout and actor processes. Offloading moves model parameters and optimizer states to CPU memory, reducing GPU memory usage.
- Adjust Micro-Batch Size: Experiment with different micro-batch sizes. Smaller batch sizes reduce memory consumption but can increase communication overhead.
- Gradient Checkpointing: Gradient checkpointing reduces memory usage by recomputing activations during the backward pass. Enable gradient checkpointing in the model configuration.
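As a concrete but hedged illustration using plain PyTorch FSDP (the actual framework may expose these knobs through its YAML instead), CPU parameter offloading and activation checkpointing can be combined like this; the `nn.Sequential` model is a placeholder for the real actor network:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

# Assumes torch.distributed is already initialized; the model is a placeholder.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

fsdp_model = FSDP(
    model,
    # Park parameters in CPU RAM between uses to keep VRAM headroom.
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
)

# Recompute activations during the backward pass instead of storing them,
# trading extra compute for lower memory usage.
apply_activation_checkpointing(
    fsdp_model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda module: isinstance(module, nn.Linear),
)
```

The micro-batch size itself is usually a training-framework setting rather than an FSDP argument, so halving it in the YAML is a separate, cheap experiment to run alongside offloading and checkpointing.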
6. Debugging and Logging
Effective debugging requires detailed logs and monitoring. Consider the following:
- Increase Logging Verbosity: Increase the logging verbosity to capture more detailed information about the program's execution.
- Monitor GPU Usage: Use tools like `nvidia-smi` to monitor GPU utilization and memory consumption in real time.
- Profiling: Use profiling tools to identify performance bottlenecks in the code.
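For instance, a small self-contained monitor (a sketch that only assumes `nvidia-smi` is on the PATH) can log utilization and memory side by side, making the "VRAM allocated but zero utilization" pattern easy to spot over time:

```python
import subprocess
import time

def log_gpu_stats(interval_s: float = 5.0) -> None:
    """Periodically print per-GPU utilization and memory via nvidia-smi."""
    query = "--query-gpu=index,utilization.gpu,memory.used,memory.total"
    fmt = "--format=csv,noheader,nounits"
    while True:
        out = subprocess.run(
            ["nvidia-smi", query, fmt],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.strip().splitlines():
            idx, util, used, total = [s.strip() for s in line.split(",")]
            print(f"gpu{idx}: util={util}% mem={used}/{total} MiB", flush=True)
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu_stats()
```

On the PyTorch side, setting `TORCH_DISTRIBUTED_DEBUG=DETAIL` and `NCCL_DEBUG=INFO` in the environment is another low-effort way to get more detail out of the collective-communication layer when a rank appears to hang.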
YAML Configuration Review
Let's examine specific parts of the provided YAML configuration to identify potential areas for optimization.
The FSDP configuration is a key area to scrutinize. The sharding_strategy is set to `