Fixing Flaky E2E Tests: Server-Side Timeout Issue

by Alex Johnson

Flaky end-to-end (E2E) tests can be a major headache in software development. They fail intermittently, making it difficult to determine whether a failure is due to a genuine bug or just a transient issue. This article delves into a specific case of a flaky E2E test, focusing on a missing server-side timeout in sandbox execution. Understanding the root cause and implementing a proper fix is crucial for maintaining the reliability of your test suite and ensuring the quality of your software.

Summary: Identifying the Flaky E2E Test

We're addressing a flaky E2E test identified as VM0 agent session: session persists across runs with same config and artifact. The test intermittently fails with a CLI timeout, and the server never sends a vm0_error event: the client expects a response within a specific timeframe, but the server-side process neither completes nor signals an error within that window. This kind of issue highlights the importance of robust timeout mechanisms on both the client and server sides, so that tests do not hang indefinitely and failures are reported clearly.

The core issue is that the test, designed to ensure session persistence across multiple runs with the same configuration and artifact, sometimes times out. The CLI (Command Line Interface), which is part of the testing infrastructure, times out while waiting for events from the server because, in certain scenarios, the server never sends a vm0_error event, leaving the CLI hanging. Pinpointing why this happens requires a detailed look at the system's architecture and at how its components interact during test execution. Addressing the flakiness matters because it erodes confidence in the test suite, makes genuine issues harder to identify, and slows down development; the investigation also yields useful insight into the system's overall resilience and error handling.

Observed Behavior: Decoding the CI Logs

The specific error observed in the CI (Continuous Integration) logs provides valuable clues:

# Step 2: First run (creates session)...
✗ Agent execution timed out after 120 seconds without receiving events

The CLI did receive the vm0_start event, which signifies that a run ID was generated: the initial request reached the server and a process was started. The absence of any subsequent events within the 120-second window, however, points to a problem in server-side execution or event emission. The server-side process might be hanging, encountering an error that is never reported, or losing events to a network issue before they reach the CLI. Determining which requires examining the server-side logic for the agent session, particularly its error reporting and event emission. Beyond disrupting this one test, the timeout also exposes a weakness in how the system handles long-running or stuck processes, which could matter outside the testing environment as well.

Root Cause Analysis: Unraveling the Mystery

To effectively address the flaky test, a thorough root cause analysis is essential. This involves understanding the system's architecture, the flow of execution, and the potential points of failure. The investigation revealed that the core issue lies in the interaction between the CLI, the server, and the sandbox environment where the agent code is executed. The analysis also highlights the critical role of timeout mechanisms and error handling in ensuring the robustness of the system. By dissecting the architecture and execution flow, we can identify the precise location where the failure occurs and the conditions that trigger it, ultimately leading to a more reliable and resilient testing process.

Architecture Flow: Tracing the Execution Path

Understanding the architecture flow is crucial for pinpointing the root cause. Let's break down the steps involved:

  1. CLI creates a run via POST /api/agent/runs: The CLI initiates the process by sending a request to the server's API endpoint to create a new run. This sets the context for the entire execution, including the configuration and artifact the agent will use.
  2. Server inserts run record, sends vm0_start event, returns runId: The server records the run in its database, emits a vm0_start event, and returns a unique runId to the CLI, signalling that the run has been initiated successfully. The runId is the identifier used to track the run's progress and status.
  3. Server async calls sandbox.commands.run() with timeoutMs: 0 (no timeout): The server asynchronously calls sandbox.commands.run(), which executes the agent code inside the sandbox environment. Because timeoutMs is 0, there is no server-side timeout: the server will wait indefinitely for the sandbox execution to complete, so if that execution hangs, the server hangs with it. This is the key point in the root cause analysis.
  4. CLI polls for events every 500ms, times out after 120s: The CLI polls the server for events every 500 milliseconds and, if it receives none within 120 seconds, terminates and reports a timeout error. That client-side timeout is meant to keep the CLI from hanging indefinitely, but because the server has no timeout of its own, the 120-second limit is reached while the server is still waiting on the sandbox, producing the flaky failures. A rough sketch of this polling loop appears after the list.
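
To make the client-side half of this flow concrete, here is a minimal sketch of a polling loop with a 500 ms interval and a 120-second overall deadline. It is an illustration only, not the code in turbo/apps/cli/src/commands/run.ts; the events endpoint path, the event type names, and the response shape are assumptions.

// Hypothetical sketch of the CLI polling loop (not the actual run.ts implementation).
// Assumes an events endpoint at /api/agent/runs/:runId/events returning a JSON array.

interface Vm0Event {
  type: string; // e.g. "vm0_start", "vm0_error" (terminal event names are assumed)
  payload?: unknown;
}

const POLL_INTERVAL_MS = 500;
const CLIENT_TIMEOUT_MS = 120_000;

async function pollForEvents(baseUrl: string, runId: string): Promise<Vm0Event[]> {
  const deadline = Date.now() + CLIENT_TIMEOUT_MS;
  while (Date.now() < deadline) {
    const res = await fetch(`${baseUrl}/api/agent/runs/${runId}/events`);
    if (res.ok) {
      const events = (await res.json()) as Vm0Event[];
      // Stop as soon as a terminal event (error or completion) arrives.
      if (events.some((e) => e.type === "vm0_error" || e.type === "vm0_complete")) {
        return events;
      }
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
  throw new Error("Agent execution timed out after 120 seconds without receiving events");
}

If no terminal event ever arrives, which is exactly the situation a hung sandbox run creates, the final throw above is the code path that produces the error seen in the CI log.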

Why Server Doesn't Send vm0_error: Identifying the Code Gap

The absence of a vm0_error event from the server is a critical piece of the puzzle. This event should be sent whenever an error occurs during the sandbox execution, allowing the CLI to properly handle the failure. However, the analysis reveals that the conditions under which this event is sent are limited, leading to the observed behavior where the CLI times out without receiving an error.

In the code snippet from e2b-service.ts:484-487:

const result = await sandbox.commands.run(scriptPath, {
  envs,
  timeoutMs: 0, // ⚠️ No timeout - waits indefinitely
});

The vm0_error event is only sent under specific circumstances:

  1. sandbox.commands.run() returns with exitCode !== 0: If the sandbox execution completes and returns a non-zero exit code, indicating an error, the vm0_error event will be sent. This is the expected behavior for handling errors that occur during the execution of the agent code.
  2. Exception is caught in the catch block: If an exception is thrown during the execution of sandbox.commands.run(), the catch block will handle the exception and send the vm0_error event. This mechanism is designed to catch unexpected errors that occur during execution. Both paths are sketched just below.
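
Viewed as code, the structure around the snippet above is roughly the following. This is a hedged reconstruction for illustration rather than the actual e2b-service.ts source; the result type, the SandboxLike interface, and the sendVm0Error helper are assumptions.

// Hedged reconstruction of the error-reporting structure around sandbox.commands.run().
// Types and the sendVm0Error helper are illustrative, not the real API.

interface CommandResult {
  exitCode: number;
  stderr: string;
}

interface SandboxLike {
  commands: {
    run(
      cmd: string,
      opts: { envs: Record<string, string>; timeoutMs: number },
    ): Promise<CommandResult>;
  };
}

declare function sendVm0Error(runId: string, message: string): Promise<void>; // hypothetical helper

async function executeAgent(
  sandbox: SandboxLike,
  runId: string,
  scriptPath: string,
  envs: Record<string, string>,
): Promise<void> {
  try {
    const result = await sandbox.commands.run(scriptPath, { envs, timeoutMs: 0 });
    if (result.exitCode !== 0) {
      // Path 1: the command finished with a non-zero exit code.
      await sendVm0Error(runId, `agent exited with code ${result.exitCode}: ${result.stderr}`);
    }
  } catch (error) {
    // Path 2: sandbox.commands.run() threw an exception.
    await sendVm0Error(runId, error instanceof Error ? error.message : String(error));
  }
  // If the command neither returns nor throws (a hang), neither path executes,
  // so no vm0_error is ever sent.
}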

The critical issue is that if the sandbox execution hangs without failing, neither of these conditions is met. This means that if the agent code gets stuck in an infinite loop, encounters a deadlock, or experiences any other issue that prevents it from terminating, the sandbox.commands.run() function will never return, and no exception will be thrown. As a result:

  • sandbox.commands.run() never returns (due to timeoutMs: 0): The absence of a timeout means the server will wait indefinitely for the sandbox execution to complete.
  • No exception is thrown: If the process hangs without explicitly throwing an exception, the catch block will not be triggered.
  • vm0_error is never sent: Because neither of the conditions for sending the vm0_error event is met, the CLI will never receive the error notification.
  • CLI times out waiting: The CLI, which is polling for events, eventually times out after 120 seconds, producing the flaky test failure. This scenario highlights the importance of a server-side timeout that prevents such indefinite hangs and ensures errors are reported back to the CLI; a sketch of one possible approach follows this list.
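
One way to close this gap is to give the sandbox execution its own server-side ceiling and treat expiry as an error, so a hung run falls into the existing error path and emits vm0_error. The sketch below is a minimal illustration of that idea, not the fix as it would appear in e2b-service.ts; the ten-minute limit and the helper names are assumptions. If the E2B SDK raises an error itself when a finite timeoutMs is exceeded, passing one directly may be even simpler, but the wrapper here does not depend on that behavior.

// Minimal sketch: enforce a server-side ceiling on sandbox execution so a hang
// still surfaces as an error. The limit and names are illustrative assumptions.

const SANDBOX_TIMEOUT_MS = 10 * 60 * 1000;

function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Used inside the execution path (names as in the reconstruction above):
//
//   const result = await withTimeout(
//     sandbox.commands.run(scriptPath, { envs, timeoutMs: 0 }),
//     SANDBOX_TIMEOUT_MS,
//     "sandbox execution",
//   );
//
// A timeout now rejects the promise, the existing catch block runs, and the CLI
// receives a vm0_error instead of waiting out its own 120-second limit.

Whatever the exact value, the important property is that a stuck sandbox run eventually produces an error the server can report, rather than leaving the CLI to infer failure from silence.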

Flaky Causes: Identifying the Triggers

Several factors can contribute to the flaky behavior of the test. Understanding these factors is crucial for implementing a robust solution that addresses the underlying issues and prevents future occurrences. These flaky causes underscore the complexity of testing in a distributed environment and the need for comprehensive error handling and resource management strategies.

  1. E2B Sandbox startup delay: CI resource contention can slow sandbox creation/startup. The E2B sandbox provides an isolated environment for running the agent code, but when the CI system is under heavy load, creating and starting a sandbox takes longer. That variability makes the test's timing unpredictable and pushes runs closer to the CLI's 120-second limit.
  2. Network issues: send_event() uses curl to POST to the Vercel preview deployment, and failures only log errors without interrupting execution. If an event is lost to a connectivity problem or a temporary outage, the sandbox process keeps running while the CLI waits for an event that will never arrive and eventually times out, with no clear indication of the underlying cause. A hedged retry sketch follows this list.
  3. Timing: The previous test (28) may leave resources not fully released. Lingering processes or network connections from an earlier test can interfere with subsequent tests; leftover memory or CPU pressure slows sandbox creation and invites timeouts. This underlines the need for proper test isolation, cleanup after each test, and tests that do not depend on the state of previous runs.
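
As an illustration of how the second cause could be mitigated, the sketch below retries a failed event POST a few times before giving up and tells the caller whether delivery ultimately succeeded. It is a hypothetical hardening of the event-sending path, not the current send-event.ts implementation; the endpoint path and payload shape are assumptions.

// Hypothetical hardening of event delivery: retry transient failures instead of
// only logging them. Endpoint path and payload shape are assumptions.

interface Vm0EventPayload {
  runId: string;
  type: string;
  data?: unknown;
}

async function sendEventWithRetry(
  baseUrl: string,
  event: Vm0EventPayload,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch(`${baseUrl}/api/agent/events`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(event),
      });
      if (res.ok) return true;
      console.error(`send_event attempt ${attempt} failed with status ${res.status}`);
    } catch (err) {
      console.error(`send_event attempt ${attempt} failed:`, err);
    }
    // Simple linear backoff between attempts.
    await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
  }
  return false; // the caller can decide whether a lost event should abort the run
}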

Test Details: Identifying the Culprit Test

To effectively address the flaky test, it's crucial to identify the specific test file and test name that are exhibiting the issue. This allows for targeted investigation and debugging, focusing efforts on the problematic areas of the test suite. Understanding the test's purpose and functionality also provides valuable context for understanding the root cause of the flakiness and developing an appropriate solution.

Test file: e2e/tests/02-commands/t06-vm0-agent-session.bats
Test name: VM0 agent session: session persists across runs with same config and artifact
Test purpose: Verify that multiple runs with the same config + artifact reuse the same session (findOrCreate)

The test file e2e/tests/02-commands/t06-vm0-agent-session.bats places the test in the end-to-end suite's commands directory, so it exercises command-line operations against the system. The test name, VM0 agent session: session persists across runs with same config and artifact, narrows the focus to the VM0 agent's session management: when multiple runs use the same configuration and artifact, the agent session should persist and be reused rather than recreated, avoiding the overhead of a new session per run. The test verifies that the findOrCreate mechanism, which either finds an existing session or creates a new one, correctly identifies and reuses the session under those conditions. Flakiness here could in principle indicate problems with the session management logic, the findOrCreate mechanism, or the conditions under which sessions are persisted or terminated, which is why understanding exactly what the test exercises is essential for pinpointing the root cause.
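
To make the behavior under test concrete, a findOrCreate-style lookup typically follows the pattern below. This is an illustrative sketch, not the project's actual session code; the key derivation, the in-memory store, and the field names are assumptions.

// Illustrative findOrCreate pattern: reuse a session when the same config and
// artifact are supplied, create one otherwise. The hashing and in-memory store
// are simplifications for the sketch.

import { createHash, randomUUID } from "node:crypto";

interface AgentSession {
  id: string;
  configHash: string;
  createdAt: Date;
}

const sessions = new Map<string, AgentSession>();

function sessionKey(config: object, artifactId: string): string {
  // Note: JSON.stringify is order-sensitive; a real implementation would use a
  // canonical serialization of the config.
  return createHash("sha256")
    .update(JSON.stringify(config))
    .update(artifactId)
    .digest("hex");
}

function findOrCreateSession(config: object, artifactId: string): AgentSession {
  const key = sessionKey(config, artifactId);
  const existing = sessions.get(key);
  if (existing) return existing; // same config + artifact: reuse the session
  const created: AgentSession = { id: randomUUID(), configHash: key, createdAt: new Date() };
  sessions.set(key, created);
  return created;
}

The test asserts exactly this property: two runs supplying an identical configuration and artifact should resolve to the same session rather than creating a second one.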

Related Files: Mapping the Code Landscape

Identifying the related files is essential for a comprehensive understanding of the issue and its potential solutions. These files provide insights into the different components involved in the execution flow, the timeout logic, the event handling mechanisms, and the sandbox environment. By examining these files, developers can trace the execution path, identify potential bottlenecks, and understand how different parts of the system interact. This comprehensive view is crucial for developing effective fixes and preventing future occurrences of the flaky test.

  • turbo/apps/cli/src/commands/run.ts - CLI timeout logic (120s): Implements the CLI's 120-second timeout, including how execution time is monitored and when the CLI gives up and reports a timeout error. Reviewing it shows whether the timeout is configured appropriately for the expected duration of an agent session.
  • turbo/apps/web/src/lib/e2b/e2b-service.ts - Sandbox execution with timeoutMs: 0: Initiates the sandbox execution of the agent code, handles its errors, and emits events. The timeoutMs: 0 setting means the server waits indefinitely for the sandbox process to finish, which makes this file the primary suspect: a hung process leaves the server blocked and the CLI timing out.
  • turbo/apps/web/src/lib/events/vm0-events.ts - vm0_error event sending: Defines how the vm0_error event is constructed and transmitted, and therefore the conditions under which the CLI is ever told that something went wrong. The missing vm0_error event is the key symptom that prevents the CLI from handling sandbox failures.
  • turbo/apps/web/src/lib/e2b/scripts/send-event.ts - Curl-based event sending in sandbox: Sends events from the sandbox back to the server using curl over the network, covering how events are formatted, how the connection is made, and how send failures are handled. This network dependency is where transient connectivity issues can silently drop events and contribute to the flakiness.

By analyzing these related files, developers can gain a comprehensive understanding of the system's architecture, execution flow, and error handling mechanisms. This understanding is essential for developing effective solutions to the flaky test issue and preventing future occurrences.

Conclusion: Resolving Flaky Tests for Reliable Software

Addressing flaky tests is crucial for maintaining the reliability of your test suite and ensuring the quality of your software. The case study presented here, a missing server-side timeout on sandbox execution, highlights the importance of comprehensive error handling and timeout mechanisms in distributed systems. By understanding the architecture, tracing the execution flow, and identifying the root causes of flakiness, developers can implement targeted solutions that prevent future occurrences. In this specific scenario, the key takeaway is the need for a timeout on the server-side sandbox execution, so that a hung process surfaces as a vm0_error the CLI can handle rather than an opaque 120-second client timeout. Implementing such a timeout, along with robust error handling and reliable event delivery, will significantly improve the reliability of the test suite and contribute to a more robust and stable software product.