Phase 5: Database Hardening, NULL Handling, and a Rollback Plan
This article covers Phase 5 of our production hardening work: the final tasks needed to take the telemetry implementation to A+ (production-ready) status. The three tasks are adding a database index for cancel_reason queries, handling NULL values explicitly in the metrics queries, and documenting a rollback plan, along with notes on an optional TMDB threshold validation and timeline estimation.
Context
The telemetry implementation from #2438, #2439, and #2443 is complete and working excellently (Grade: A-, 92/100). This issue tracks the 3 production-hardening tasks needed to reach A+ (production-ready) status.
Parent Issues: #2438, #2439, #2443
Status: Core telemetry working, needs production optimization
Estimated Time: 2 hours total
Production-Hardening Tasks
1. Add Database GIN Index for cancel_reason Queries
Queries against `job_execution_summaries.results->>'cancel_reason'` currently run without an index. Performance is acceptable at roughly 3,000 records, but it will degrade noticeably once the table grows past 100,000 records, and the job execution dashboard would slow down with it. A GIN index on the `cancel_reason` key of the `results` JSONB column keeps these lookups fast as the dataset grows.
The change is a single Ecto migration. Generate it with `mix ecto.gen.migration add_cancel_reason_index`, define `up` to create `idx_job_summaries_cancel_reason` with `USING GIN ((results -> 'cancel_reason'))`, define `down` to drop the index so the migration can be reverted, and run it with `mix ecto.migrate`.
To verify the change, confirm the index exists with `\d job_execution_summaries` in psql, then compare `EXPLAIN ANALYZE` output for the cancel_reason query before and after the migration. The acceptance criteria below require the migration to run cleanly, the index to be present, and the before/after query times to be documented so the performance gain is on record.
```bash
# Create migration
mix ecto.gen.migration add_cancel_reason_index
```

Add to migration file:

```elixir
defmodule EventasaurusApp.Repo.Migrations.AddCancelReasonIndex do
  use Ecto.Migration

  def up do
    execute """
    CREATE INDEX idx_job_summaries_cancel_reason
    ON job_execution_summaries
    USING GIN ((results -> 'cancel_reason'))
    """
  end

  def down do
    execute "DROP INDEX IF EXISTS idx_job_summaries_cancel_reason"
  end
end
```

Then run:

```bash
mix ecto.migrate
```
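For a large production table, building the index concurrently avoids blocking writes during the build. A hedged variant of the same migration, assuming Ecto's standard options for concurrent index creation (`CREATE INDEX CONCURRENTLY` cannot run inside a transaction, so the migration disables it):

```elixir
defmodule EventasaurusApp.Repo.Migrations.AddCancelReasonIndex do
  use Ecto.Migration

  # CONCURRENTLY cannot run inside a transaction, so both the DDL transaction
  # and the migration lock are disabled for this migration.
  @disable_ddl_transaction true
  @disable_migration_lock true

  def up do
    execute """
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_job_summaries_cancel_reason
    ON job_execution_summaries
    USING GIN ((results -> 'cancel_reason'))
    """
  end

  def down do
    execute "DROP INDEX CONCURRENTLY IF EXISTS idx_job_summaries_cancel_reason"
  end
end
```

Either form satisfies the acceptance criteria below; the concurrent variant simply trades a longer build for no write locking.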
Acceptance Criteria
- [ ] Migration created and run successfully
- [ ] Verify index exists: `\d job_execution_summaries` in psql
- [ ] Test query performance with `EXPLAIN ANALYZE`:
  ```sql
  EXPLAIN ANALYZE
  SELECT *
  FROM job_execution_summaries
  WHERE state = 'cancelled'
    AND results->>'cancel_reason' = 'movie not matched';
  ```
- [ ] Document before/after query times (a sketch for capturing plans from IEx follows this list)
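One way to capture the before/after plans without leaving IEx is `Ecto.Adapters.SQL.explain/4` (available in ecto_sql 3.4+). A minimal sketch, assuming the schema module is named `JobExecutionSummary` and the repo is `EventasaurusApp.Repo` — both names are assumptions, adjust to the actual codebase:

```elixir
import Ecto.Query

query =
  from(s in JobExecutionSummary,
    where: s.state == "cancelled",
    where: fragment("?->>'cancel_reason' = ?", s.results, "movie not matched")
  )

# Runs EXPLAIN (ANALYZE) against the database and returns the plan as text.
# Record this output once before the migration and once after.
IO.puts(Ecto.Adapters.SQL.explain(EventasaurusApp.Repo, :all, query, analyze: true))
```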
2. Update Metrics Queries for NULL cancel_reason Handling
Of all historical cancelled jobs, 479 (19.8%) have a NULL cancel_reason. The current metrics queries do not account for these rows, so those cancellations silently drop out of the breakdown and the reported numbers understate what actually happened. The fix is to make NULL handling explicit in every query that filters on cancel_reason.
Two changes are needed. First, the existing failed_cancellations condition, which matches rows where cancel_reason equals "movie not matched", gains an explicit IS NOT NULL check so the intent is unambiguous. Second, a new unknown_cancellations condition (cancel_reason IS NULL) tracks cancellations with no recorded reason as their own category, so they appear in the metrics instead of being dropped. The functions to update live in `lib/eventasaurus_discovery/job_execution_summaries.ex`, primarily `get_worker_metrics/1` and `get_system_metrics/1`, along with any other function that queries cancel_reason.
To verify the change, call `JobExecutionSummaries.get_system_metrics(24)` in IEx and confirm the breakdown now includes unknown_cancellations, check that the dashboard shows an "Unknown" category for NULL cancel_reason, and compare the counts against a direct SQL query that groups cancellations by reason with NULL mapped to 'unknown' (see the acceptance criteria below).
```elixir
# `query` stands for the base job_execution_summaries queryable already scoped
# to the requested time window inside the metrics functions.

# Before (INCORRECT - implicitly excludes NULL)
failed_cancellations =
  where(query, [s],
    s.state == "cancelled" and
      fragment("?->>'cancel_reason' = ?", s.results, "movie not matched")
  )

# After (CORRECT - explicit NULL handling)
failed_cancellations =
  where(query, [s],
    s.state == "cancelled" and
      fragment(
        "?->>'cancel_reason' IS NOT NULL AND ?->>'cancel_reason' = ?",
        s.results,
        s.results,
        "movie not matched"
      )
  )

# Add tracking for unknown cancellations
unknown_cancellations =
  where(query, [s],
    s.state == "cancelled" and fragment("?->>'cancel_reason' IS NULL", s.results)
  )
```
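An alternative, shown here only as a hedged sketch, is to fetch the whole breakdown in one grouped query and let the database fold NULL into an "unknown" bucket via COALESCE; the module names (`JobExecutionSummary`, `Repo`) are assumptions:

```elixir
import Ecto.Query

# Count cancelled jobs per cancel_reason, treating NULL as "unknown".
breakdown =
  from(s in JobExecutionSummary,
    where: s.state == "cancelled",
    group_by: fragment("COALESCE(?->>'cancel_reason', 'unknown')", s.results),
    select: {
      fragment("COALESCE(?->>'cancel_reason', 'unknown')", s.results),
      count(s.id)
    }
  )
  |> Repo.all()

# Illustrative shape, using the counts referenced in this issue:
# [{"movie not matched", 1935}, {"unknown", 479}, ...]
```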
Files to Update
- `lib/eventasaurus_discovery/job_execution_summaries.ex`
  - `get_worker_metrics/1`
  - `get_system_metrics/1`
- Any other functions querying `cancel_reason`
Acceptance Criteria
- [ ] All metrics queries updated with explicit NULL handling
- [ ] Dashboard shows "Unknown" category for NULL cancel_reason
- [ ] Test queries return correct counts (a regression-test sketch follows this list):
  ```elixir
  # In IEx
  JobExecutionSummaries.get_system_metrics(24)
  # Should show breakdown including unknown_cancellations
  ```
- [ ] Verify against database:
  ```sql
  SELECT
    CASE
      WHEN results->>'cancel_reason' IS NULL THEN 'unknown'
      ELSE results->>'cancel_reason'
    END AS reason,
    COUNT(*)
  FROM job_execution_summaries
  WHERE state = 'cancelled'
  GROUP BY reason;
  ```
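A hedged sketch of what the regression test could look like, assuming an ExUnit `DataCase` and a hypothetical `insert_summary/1` helper for seeding `job_execution_summaries` rows (the metric key names follow the variables used above and are also assumptions):

```elixir
defmodule EventasaurusDiscovery.JobExecutionSummariesTest do
  use EventasaurusApp.DataCase, async: true

  alias EventasaurusDiscovery.JobExecutionSummaries

  test "NULL cancel_reason is counted as an unknown cancellation" do
    # insert_summary/1 is a hypothetical test helper; use the project's
    # factories or fixtures in practice.
    insert_summary(%{state: "cancelled", results: %{"cancel_reason" => "movie not matched"}})
    insert_summary(%{state: "cancelled", results: %{}})

    metrics = JobExecutionSummaries.get_system_metrics(24)

    assert metrics.failed_cancellations == 1
    assert metrics.unknown_cancellations == 1
  end
end
```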
3. Document Rollback Plan
A documented rollback plan is the safety net for this deployment: without one, a production issue after the Phase 5 changes ship can mean prolonged downtime while the team works out how to revert. The plan lives in a new `docs/ROLLBACK_PLAN.md` file and walks through the failure modes these changes could introduce, with concrete commands for each: dashboard performance degradation (check for the GIN index and recreate it if missing), incorrect metrics (verify the NULL handling and the ObanTelemetry cancel_reason extraction), system errors (identify and revert the offending commit, roll back the migration if needed, restart, and verify), and database migration failures (check migration status, PostgreSQL logs, existing indexes, and permissions).
The plan also lists preventive measures: test in staging first, run migrations during a low-traffic window, watch application logs for an hour after deployment, and keep the previous deployment ready for a quick rollback, with ongoing monitoring of dashboard response time, Oban job success rates, query performance, and error rates.
The task is complete when `docs/ROLLBACK_PLAN.md` is created and reviewed by the team, the rollback procedures have been exercised in development, and an emergency contact/escalation path is documented. The full document follows.
# Phase 5 Telemetry Rollback Plan

## Emergency Rollback Procedures

### If Dashboard Performance Degrades

**Symptoms:** Job execution monitor slow to load, timeout errors

**Diagnosis:**

```bash
# Check if index exists
psql -d eventasaurus_dev -c "\d job_execution_summaries"
# Look for: idx_job_summaries_cancel_reason
```

**Fix:**

If the index is missing, create it immediately:

```bash
mix ecto.migrate
```

Or manually in psql:

```sql
CREATE INDEX idx_job_summaries_cancel_reason
ON job_execution_summaries
USING GIN ((results -> 'cancel_reason'));
```

### If Metrics Are Incorrect

**Symptoms:** Dashboard shows unexpected counts, missing data

**Diagnosis:**

```sql
-- Check for NULL cancel_reasons
SELECT COUNT(*)
FROM job_execution_summaries
WHERE state = 'cancelled'
  AND results->>'cancel_reason' IS NULL;
```

**Fix:**

- Verify NULL handling in queries (see Task #2)
- If queries are correct but data is wrong, check ObanTelemetry:

```bash
grep -n "cancel_reason" lib/eventasaurus_app/monitoring/oban_telemetry.ex
# Should see lines 211-259 handling cancel_reason extraction
```

### If System Errors Occur

**Symptoms:** Application crashes, Oban job failures, database errors

**Rollback Steps:**

1. Identify the problem commit:

   ```bash
   git log --oneline --graph --decorate -10
   ```

2. Revert code changes:

   ```bash
   git revert <commit-hash>
   git push
   ```

3. Roll back the database migration (if the index was added):

   ```bash
   mix ecto.rollback --step 1
   ```

4. Restart the application:

   ```bash
   # Development
   mix phx.server
   # Production: follow your deployment process
   ```

5. Verify system stability:

   ```bash
   # Check Oban dashboard
   open http://localhost:4000/admin/oban
   # Check job execution monitor
   open http://localhost:4000/admin/job-executions
   ```

### If Database Migration Fails

**Symptoms:** Migration error during `mix ecto.migrate`

**Diagnosis:**

```bash
# Check migration status
mix ecto.migrations
# Check PostgreSQL logs
tail -f /path/to/postgresql/logs
```

**Fix:**

- If GIN index creation fails:

  ```sql
  -- Check if index already exists
  SELECT indexname FROM pg_indexes WHERE tablename = 'job_execution_summaries';
  -- If it exists but the migration failed, mark the migration as complete manually
  ```

- If permissions issue:

  ```sql
  -- Ensure user has CREATE INDEX privilege
  GRANT CREATE ON DATABASE eventasaurus_dev TO postgres;
  ```

## Prevention

Before deploying Phase 5 changes:

- ✅ Test in staging environment first
- ✅ Run migrations during low-traffic period
- ✅ Monitor application logs for 1 hour post-deployment
- ✅ Keep previous deployment ready for quick rollback

Monitoring:

- Dashboard response time
- Oban job success rates
- Database query performance
- Application error rates
#### Acceptance Criteria
- [ ] `docs/ROLLBACK_PLAN.md` created
- [ ] Document reviewed by team
- [ ] Rollback procedures tested in development
- [ ] Emergency contact/escalation path documented
---
### 4. Phase 5.3 TMDB Threshold Validation (Optional)
This optional task asks whether the TMDB confidence threshold for movie matching can be lowered without introducing too many bad matches. The threshold currently sits at 50% and yields a 70% match rate, while 1,935 jobs were cancelled with "movie not matched", so there is clear room to match more movies; a lower threshold, however, also risks matching the wrong ones and polluting the data.
The proposal is a small analysis task, `lib/mix/tasks/analyze_tmdb_threshold.ex`, that pulls recent matches in the 45-50% confidence band, has a human verify a sample of 50-100 of them, computes the false positive rate from that sample, and documents a recommendation. The decision criteria are fixed in advance: below 5% false positives, lower the threshold; above 10%, keep the current 50%; between 5% and 10%, it is a judgment call based on tolerance for bad matches. Because the current 70% match rate may already be acceptable, this validation is optional and only worth the effort where a higher match rate really matters.
```elixir
# Create test script: lib/mix/tasks/analyze_tmdb_threshold.ex
defmodule Mix.Tasks.AnalyzeTmdbThreshold do
  use Mix.Task

  @shortdoc "Analyze TMDB match quality at different confidence thresholds"

  def run(_) do
    # 1. Find recent movie matches with confidence 45-50%
    # 2. Manually verify if matches are correct (sample of 50-100)
    # 3. Calculate false positive rate
    # 4. Document findings and recommend proceed/skip
  end
end
```
Decision Criteria
- If false positive rate <5%: Proceed with lowering threshold
- If false positive rate >10%: Keep current 50% threshold
- If 5-10%: User decision based on tolerance
This is OPTIONAL - Current 70% match rate may be acceptable for your use case.
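To make the decision criteria above concrete, here is a minimal, purely illustrative sketch of computing a false positive rate from a manually verified sample and mapping it to a recommendation (module and field names are hypothetical):

```elixir
defmodule TmdbThresholdAnalysis do
  @doc """
  `sample` is a list of manually reviewed matches in the 45-50% confidence
  band, e.g. [%{title: "...", correct?: true}, %{title: "...", correct?: false}].
  """
  def recommendation(sample) when sample != [] do
    false_positive_rate =
      Enum.count(sample, &(not &1.correct?)) / length(sample) * 100

    cond do
      false_positive_rate < 5 -> {:lower_threshold, false_positive_rate}
      false_positive_rate > 10 -> {:keep_50_percent_threshold, false_positive_rate}
      true -> {:user_decision, false_positive_rate}
    end
  end
end
```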
5. Timeline Buffer Guidance
Phase 5 was estimated at 20 hours but likely took 25-30 hours to implement. For future planning of comparable features, build the buffer into the estimate up front: add 25% for unknowns, plus roughly 2 hours for code review and 2 hours for documentation, so neither gets squeezed at the end.
Applied to a 20-hour task, that is 20 × 1.25 = 25 hours for implementation, plus 2 hours of review and 2 hours of documentation, for a total of about 29 hours. Estimating this way sets more realistic expectations, makes resource planning easier, and leaves explicit room for the review and documentation work that complex features always need.
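The same arithmetic as a tiny sketch; the helper name is illustrative, and the 25% buffer and 2-hour allowances come from the guidance above:

```elixir
defmodule EstimateBuffer do
  @unknowns_buffer 0.25
  @code_review_hours 2
  @documentation_hours 2

  # Buffered estimate = base * 1.25 + code review + documentation
  def buffered_estimate(base_hours) do
    base_hours * (1 + @unknowns_buffer) + @code_review_hours + @documentation_hours
  end
end

EstimateBuffer.buffered_estimate(20)
# => 29.0
```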
Success Criteria
This issue is complete when:
- ✅ GIN index created and performance validated
- ✅ NULL handling implemented and tested
- ✅ Rollback plan documented and reviewed
- ✅ All changes tested in development environment
- ✅ System metrics show correct counts
- ✅ Dashboard performance is fast (<500ms load time)
Final Grade Target: A+ (97/100) - Production Ready ✨
Implementation Order
- First: Database GIN index (30 min) - Prerequisite for performance
- Second: NULL handling (1 hour) - Ensures data accuracy
- Third: Rollback docs (30 min) - Safety net for production
Total Time: ~2 hours for all must-do tasks
References
- Parent Issues: #2438 (ShowtimeProcessJob Metrics), #2439 (Refinements), #2443 (Rollout Assessment)
- Audit Report: All tasks verified as needed during comprehensive audit
- Current System Status: Working excellently (A-, 92/100), needs production hardening
- Database Evidence: 2,991 job summaries, 479 NULL cancel_reasons, 1,935 "movie not matched"
In conclusion, Phase 5 production hardening is the last step to an A+ grade: the GIN index keeps cancel_reason queries fast as the dataset grows, explicit NULL handling keeps the metrics accurate, and the rollback plan gives the team a clear path back if a deployment goes wrong. Together these changes make the telemetry system robust enough for production. For background on index types and when to use them, see the PostgreSQL documentation on indexing.
This issue represents the final 8% needed to reach production-ready status (A+ grade).