vLLM-ascend: Using `cudagraph_mode: FULL_DECODE_ONLY`
Are you encountering issues while trying to use `cudagraph_mode: FULL_DECODE_ONLY` in your vLLM-ascend setup? This guide walks through the likely problems, their solutions, and best practices for using this mode effectively. We'll examine a real-world scenario, dissect the error logs, and provide actionable steps to resolve the `TimeoutError` you might be facing.
Understanding the Issue
The core problem revolves around a `TimeoutError` that occurs when `cudagraph_mode` is set to `FULL_DECODE_ONLY` within the compilation configuration of your vLLM-ascend server. This issue seems to be triggered specifically when running with a high `--max-num-seqs` value (e.g., 128). However, the error disappears when either `cudagraph_mode` is removed or `--max-num-seqs` is reduced to 1. Let's dive deeper into the specifics.
To use `cudagraph_mode: FULL_DECODE_ONLY` effectively in vLLM-ascend, it's important to understand its intended behavior and the mechanisms that can lead to errors. This mode optimizes the decoding phase of large language models (LLMs) by leveraging CUDA graphs, a feature that captures a sequence of GPU operations once and replays it on later steps (on Ascend hardware, vLLM-ascend provides an equivalent graph-capture mechanism behind the same configuration knob). By capturing the decoding step as a graph, subsequent executions bypass the usual overhead of launching individual operations, which can yield significant performance gains. However, the intricacies of graph capture and replay can lead to unexpected issues, especially in distributed environments or with complex model configurations.
Key Components and Configuration
Before we delve into the error analysis, let's recap the key components and configuration settings involved:
- vLLM-ascend: This is the optimized version of vLLM (a fast and easy-to-use library for LLM inference) tailored for Ascend AI processors.
- `cudagraph_mode`: This configuration parameter dictates how CUDA graphs are used. Setting it to `FULL_DECODE_ONLY` means that only the decoding phase is captured as a graph.
- `cudagraph_capture_sizes`: This list specifies the batch sizes at which graphs should be captured, allowing the system to handle varying request loads efficiently.
- `--max-num-seqs`: This parameter controls the maximum number of sequences that can be processed concurrently. A higher value typically means greater throughput but also higher memory consumption and a greater likelihood of resource-related issues (a minimal launch sketch follows this list).
- Distributed Setup: The provided setup (`-dp 2 -tp 4`) uses data parallelism of degree 2 and tensor parallelism of degree 4, i.e., eight devices in total. This adds complexity to the system and requires careful coordination between processes.
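To make the relationship between these options concrete, here is a minimal launch sketch. The model path is a placeholder, and the remaining values simply mirror the scenario discussed in this guide rather than recommended settings.

```bash
# Minimal sketch (model path is a placeholder) showing how the pieces above
# fit together on the command line.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/your/model \
  --max-num-seqs 128 \
  -dp 2 \
  -tp 4 \
  --port 8005 \
  --compilation-config '{"cudagraph_capture_sizes": [1,4,8,16,32,64,128], "cudagraph_mode": "FULL_DECODE_ONLY"}'
```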
Error Manifestation
The error logs paint a clear picture of what's going wrong. The core issue is a `TimeoutError` originating from the `EngineCore` process. This process is responsible for the high-level coordination of the inference pipeline. The error message “RPC call to execute_model timed out” suggests that the `EngineCore` is unable to communicate with the worker processes within the expected timeframe. This timeout is further linked to the shared memory broadcast mechanism (`shm_broadcast.py`), which is used for inter-process communication.
Specifically, the logs repeatedly show the message “No available shared memory broadcast block found in 60 seconds”. This indicates that the EngineCore is waiting for data from the workers, but the workers are not responding in time. This could be due to various reasons, including:
- Worker Hang: The worker processes might be hanging due to an unhandled exception, deadlock, or other internal issue.
- Resource Contention: The workers might be hitting memory or compute bottlenecks, preventing them from processing data and responding to the `EngineCore` in time.
- CUDA Graph Issues: The graph capture or replay mechanism might be failing, causing the workers to stall.
- Communication Problems: The shared memory broadcast mechanism itself might be misbehaving, for example due to insufficient buffer sizes or synchronization problems.
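If you want to confirm that your failure matches this pattern, a quick scan of the server log for the two messages quoted above is usually enough. The log file name below is a placeholder for wherever you redirect the server's output.

```bash
# Placeholder: point LOG at the file (or redirected stdout) of your vLLM server.
LOG=server.log

# The RPC timeout raised on the EngineCore side:
grep -n "RPC call to execute_model timed out" "$LOG"

# How many times the shared-memory broadcast warning appeared:
grep -c "No available shared memory broadcast block found" "$LOG"
```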
Analyzing the Stack Trace
The stack trace provides valuable clues about the exact point of failure. It shows that the `TimeoutError` occurs during the `dequeue` operation in the `shm_broadcast.py` module, which is part of receiving data from the workers. The `acquire_read` function, invoked inside `dequeue`, raises the `TimeoutError` because it cannot obtain a readable block in the shared memory buffer within the configured timeout period.
This suggests that the workers are either not writing data to the shared memory buffer or are taking too long to do so. The fact that the error disappears when `cudagraph_mode` is disabled or `--max-num-seqs` is reduced points toward an interaction between CUDA graphs and the distributed execution environment.
Diagnosing the Root Cause
Based on the error messages and the system configuration, several factors could be contributing to the problem. Let's explore some of the most likely causes and how to investigate them:
- CUDA Graph Compatibility: The Ascend AI processors might have specific requirements or limitations regarding graph capture. The error could stem from an incompatibility between the CUDA graph implementation in vLLM and the Ascend hardware or drivers. It's crucial to ensure that the graph features used by vLLM are fully supported on the Ascend platform.
  - Action: Consult the vLLM-ascend documentation and Ascend hardware specifications to verify CUDA graph compatibility. Check for any known issues or limitations related to CUDA graphs on Ascend processors.
- Resource Constraints: Running with a high `--max-num-seqs` value (128) places significant demands on device memory and compute resources. When `cudagraph_mode` is enabled, graph capture can further increase memory usage. If the system runs out of memory or compute capacity, worker processes might stall, leading to timeouts.
  - Action: Monitor NPU memory usage and CPU utilization during inference. Try reducing `--max-num-seqs` to see if it alleviates the problem, and experiment with different values for `--gpu-memory-utilization` to fine-tune memory allocation.
- Distributed Synchronization: In a distributed environment, proper synchronization between processes is critical. Graph capture and replay involve coordination between the `EngineCore` and worker processes. If there are synchronization issues, such as mismatched graph versions or incorrect data transfers, timeouts can occur.
  - Action: Review the distributed execution logic in vLLM-ascend. Ensure that all processes are correctly synchronized during graph capture and replay, and check for potential race conditions or deadlocks.
- Driver Issues: The warning messages in the logs indicate potential issues with the Ascend driver version. An outdated or incompatible driver can lead to a variety of problems, including graph-capture failures. The warning “Driver Version: is invalid or not supported yet” is a clear indicator of a driver-related problem.
  - Action: Update the Ascend drivers to the latest recommended version, and ensure the driver is compatible with both the vLLM-ascend version and the Ascend hardware.
- Shared Memory Broadcast Bottleneck: The repeated “No available shared memory broadcast block found” messages suggest that the shared memory broadcast mechanism might be a bottleneck. The buffer might be too small for the data volume, or there might be contention for access to the shared memory.
  - Action: Investigate the shared memory broadcast implementation in vLLM-ascend. Try increasing the buffer size if possible, and check for locking or synchronization issues that might be causing contention.
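When a worker stalls silently, the host-side vLLM log often shows only the timeout; the Ascend runtime's per-process logs (“plog”) may contain the device-side error that preceded it. The directory below is a common default and should be treated as an assumption for your installation.

```bash
# Common default plog location (assumption -- adjust for your installation).
PLOG_DIR="$HOME/ascend/log"

# Show plog files written in the last 30 minutes, then scan them for error entries.
find "$PLOG_DIR" -name "*.log" -mmin -30 2>/dev/null | head -n 20
grep -rl "ERROR" "$PLOG_DIR" 2>/dev/null | tail -n 5
```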
Resolving the Issue: Step-by-Step Guide
Based on the potential causes identified above, here’s a structured approach to troubleshoot and resolve the TimeoutError:
Step 1: Verify CUDA Graph Compatibility
- Consult Documentation: Refer to the official vLLM-ascend documentation and Ascend hardware specifications to confirm that CUDA graphs are fully supported on your hardware and software configuration.
- Check Known Issues: Look for any known issues or limitations related to CUDA graph usage on Ascend processors. Online forums, issue trackers, and community discussions can be valuable resources; the version-check sketch below gathers the details you will typically need when searching or filing an issue.
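The snippet below collects version information for the installed Python packages and the CANN toolkit. The CANN version file path is a typical default install location, not a guarantee, so adjust it to your system.

```bash
# Installed Python package versions for vLLM, vLLM-ascend, and torch_npu.
pip show vllm vllm-ascend | grep -E "^(Name|Version)"
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__)"

# CANN toolkit version -- this path is a typical default and may differ on your system.
cat /usr/local/Ascend/ascend-toolkit/latest/version.cfg 2>/dev/null
```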
Step 2: Update Ascend Drivers
- Identify Driver Version: Determine the current version of the Ascend drivers installed on your system.
- Download Latest Drivers: Visit the Ascend support website or repository to download the latest recommended drivers for your hardware and operating system.
- Install Drivers: Follow the installation instructions provided with the driver package. Ensure that the installation process completes successfully.
- Verify Installation: After installation, verify that the driver version has been updated correctly, for example with `npu-smi info` (as sketched below).
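On Ascend hosts, `npu-smi info` reports the installed driver version in its header, so running it before and after the upgrade is a simple way to confirm the update actually took effect.

```bash
# Record the driver version before upgrading ...
npu-smi info | grep -i version

# ... install the new driver package, then re-run the same command and
# confirm the reported version matches the release you installed.
npu-smi info | grep -i version
```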
Step 3: Monitor Resource Usage
- GPU Memory: Use the Ascend monitoring tool (e.g., `npu-smi info`) to track device memory usage during inference.
- CPU Utilization: Monitor CPU utilization using system tools (e.g., `top` or `htop`).
- Identify Bottlenecks: Look for any spikes or sustained high usage that might indicate resource contention (a simple setup is sketched below).
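A minimal monitoring setup is to keep `npu-smi` refreshing in one terminal and a CPU view in another while you drive load against the server; the one-second interval is arbitrary.

```bash
# Terminal 1: refresh device utilization and memory every second (interval is arbitrary).
watch -n 1 npu-smi info

# Terminal 2: watch host CPU usage, with full command lines to spot the vLLM workers.
top -c
```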
Step 4: Adjust `--max-num-seqs` and `--gpu-memory-utilization`
- Reduce `--max-num-seqs`: Start by reducing `--max-num-seqs` to a lower value (e.g., 16 or 32) to see if it resolves the timeout issue.
- Tune `--gpu-memory-utilization`: Experiment with different values (e.g., 0.8 or 0.7) to optimize memory allocation.
- Iterate and Test: Gradually increase `--max-num-seqs` while monitoring resource usage, and find the best balance between throughput and stability.
Step 5: Investigate Shared Memory Broadcast
- Review Implementation: Examine the code related to shared memory broadcast in vLLM-ascend (the sketch after this list shows how to locate the installed module).
- Increase Buffer Size: If possible, try increasing the buffer size used for shared memory communication.
- Check Synchronization: Look for any potential synchronization issues, such as race conditions or deadlocks.
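To read the broadcast code that appears in the stack trace for your installed version, you can print the module's on-disk location. The module path below matches recent vLLM releases but may differ between versions.

```bash
# Print where the shm_broadcast module referenced in the stack trace lives on disk,
# so you can review its buffer-size and timeout logic for your installed version.
python -c "import vllm.distributed.device_communicators.shm_broadcast as m; print(m.__file__)"
```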
Step 6: Debug CUDA Graph Capture
- Disable `cudagraph_mode`: Temporarily disable `cudagraph_mode` to confirm whether it's the root cause of the issue (two ways to do this are sketched below).
- Capture Graph Manually: If possible, try capturing the graph manually using the Ascend graph APIs to isolate capture-time failures.
- Inspect Graph Execution: Analyze the execution of the captured graph to identify any errors or performance bottlenecks.
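For the first bullet, the quickest A/B test is to relaunch with graph capture disabled and compare behaviour. Both variants below are sketches that assume the rest of your launch flags stay unchanged, and `NONE` as the disabled mode value should be verified against your vLLM version.

```bash
# Option A: drop the compilation config entirely so no graphs are captured.
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/your/model \
  --max-num-seqs 32 \
  --port 8005

# Option B: keep the compilation config but switch graph capture off explicitly
# ("NONE" is the usual disabled value -- verify against your vLLM version).
python -m vllm.entrypoints.openai.api_server \
  --model /path/to/your/model \
  --max-num-seqs 32 \
  --port 8005 \
  --compilation-config '{"cudagraph_mode": "NONE"}'
```

Passing `--enforce-eager` has a similar effect, forcing eager execution throughout, though it also disables graph optimizations for every phase rather than only decode.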
Applying the Solution to Your Scenario
Based on the provided information, the most likely culprits in your scenario are the Ascend driver version and resource constraints due to the high --max-num-seqs value. Here’s how you can apply the troubleshooting steps:
- Update Ascend Drivers: Ensure that you are using the latest recommended Ascend drivers compatible with your hardware and vLLM-ascend version. The warning messages in the logs strongly suggest a driver-related issue.
- Reduce `--max-num-seqs`: Try reducing `--max-num-seqs` from 128 to a smaller value (e.g., 32 or 16). This should relieve memory pressure and reduce the likelihood of timeouts.
- Monitor NPU Memory Usage: Use `npu-smi info` or a similar tool to monitor device memory while running inference. This will help you identify whether you are hitting memory limits.
- Tune `--gpu-memory-utilization`: Experiment with different values; 0.8 or 0.7 may provide a better balance between memory usage and performance.
- Test CUDA Graph Compatibility: If the issue persists after updating drivers and reducing `--max-num-seqs`, temporarily disable `cudagraph_mode` to rule out graph compatibility issues.
Modified Server Script for Testing
Here’s a modified version of your server script incorporating the recommended changes:
export HCCL_IF_IP=$(ifconfig | grep '10.127.' | awk '{print $2}')
export HCCL_BUFFSIZE=512
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_USE_V1=1
export VLLM_ENABLE_MC2=1
export HCCL_DETERMINISTIC=true
export CLOSE_MATMUL_K_SHIFT=1
weight_path=/data01/huawei-2025/lfk/convert/1106/tele-105b_hf/
python -m vllm.entrypoints.openai.api_server \
--model $weight_path \
--served-model-name deepseekv3 \
--trust-remote-code \
--max-num-seqs 32 \
-dp 2 \
-tp 4 \
--enable-expert-parallel \
--port 8005 \
--max-model-len 4096 \
--max-num-batched-tokens 4096 \
--gpu-memory-utilization 0.8 \
--no-enable-prefix-caching \
--compilation-config '{"cudagraph_capture_sizes": [1,4,8,16,32,64,128],"cudagraph_mode": "FULL_DECODE_ONLY"}'
This script reduces `--max-num-seqs` to 32 and sets `--gpu-memory-utilization` to 0.8. After applying these changes, restart the server and rerun your client workload to check whether the issue is resolved.
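Once the server starts cleanly, a quick request against the OpenAI-compatible endpoint confirms that the decode path covered by `FULL_DECODE_ONLY` actually runs. The prompt and token count below are arbitrary, and the model name matches the `--served-model-name` in the script above.

```bash
# Minimal smoke test against the relaunched server (port and model name from the script above).
curl -s http://localhost:8005/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseekv3", "prompt": "Hello, my name is", "max_tokens": 32}' \
  | python -m json.tool
```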
Additional Tips and Considerations
- Logging: Enable detailed logging in vLLM-ascend to capture more information about the inference process; this can help you pinpoint the exact point of failure (see the sketch after this list).
- Profiling: Use profiling tools to analyze the performance of your model. This can help you identify performance bottlenecks and areas for optimization.
- Community Support: Engage with the vLLM community and Ascend support forums. Other users might have encountered similar issues and can provide valuable insights.
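For the logging tip above, vLLM reads its verbosity from an environment variable, and the Ascend runtime has its own log-level switch; set these before launching the server. The Ascend variable and its value are commonly used defaults rather than guarantees, so check the CANN documentation for your version.

```bash
# Raise vLLM's own log verbosity.
export VLLM_LOGGING_LEVEL=DEBUG

# Optionally trace every function call to locate a hang (extremely verbose -- debugging only).
export VLLM_TRACE_FUNCTION=1

# Raise the Ascend runtime log level as well (0 = debug); commonly used, but
# verify against the CANN documentation for your version.
export ASCEND_GLOBAL_LOG_LEVEL=0
```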
Conclusion
Troubleshooting `cudagraph_mode: FULL_DECODE_ONLY` in vLLM-ascend requires a systematic approach. By understanding the underlying mechanisms, analyzing error logs, and following the steps outlined in this guide, you can effectively diagnose and resolve `TimeoutError` issues. Remember to update drivers, monitor resource usage, adjust configuration parameters, and leverage community support to optimize your vLLM-ascend deployment. For more in-depth information on CUDA graphs and their optimization techniques, refer to NVIDIA's official CUDA Graphs documentation. With careful attention to detail and a methodical approach, you can harness `cudagraph_mode` to achieve significant performance gains in your LLM inference workloads.