Phase 5: Database Hardening, NULL Handling, and a Rollback Plan
This article covers Phase 5 of our production hardening work: the final tasks needed to take the telemetry implementation to A+ (production-ready) status. The three tasks are adding a database index for cancel_reason queries, handling NULL values explicitly in the metrics queries, and documenting a rollback plan, along with notes on an optional TMDB threshold validation and timeline estimation.
Context
The telemetry implementation from #2438, #2439, and #2443 is complete and working excellently (Grade: A-, 92/100). This issue tracks the 3 production-hardening tasks needed to reach A+ (production-ready) status.
Parent Issues: #2438, #2439, #2443
Status: Core telemetry working, needs production optimization
Estimated Time: 2 hours total
Production-Hardening Tasks
1. Add Database GIN Index for cancel_reason Queries
Queries against `job_execution_summaries.results->>'cancel_reason'` currently run without an index. Performance is acceptable at roughly 3,000 records, but it will degrade noticeably once the table grows past 100,000 records, and the job execution dashboard would slow down with it. A GIN index on the `cancel_reason` key of the `results` JSONB column keeps these lookups fast as the dataset grows.
The change is a single Ecto migration. Generate it with `mix ecto.gen.migration add_cancel_reason_index`, define `up` to create `idx_job_summaries_cancel_reason` with `USING GIN ((results -> 'cancel_reason'))`, define `down` to drop the index so the migration can be reverted, and run it with `mix ecto.migrate`.
To verify the change, confirm the index exists with `\d job_execution_summaries` in psql, then compare `EXPLAIN ANALYZE` output for the cancel_reason query before and after the migration. The acceptance criteria below require the migration to run cleanly, the index to be present, and the before/after query times to be documented so the performance gain is on record.
```bash
# Create migration
mix ecto.gen.migration add_cancel_reason_index
```

Add to migration file:

```elixir
defmodule EventasaurusApp.Repo.Migrations.AddCancelReasonIndex do
  use Ecto.Migration

  def up do
    execute """
    CREATE INDEX idx_job_summaries_cancel_reason
    ON job_execution_summaries
    USING GIN ((results -> 'cancel_reason'))
    """
  end

  def down do
    execute "DROP INDEX IF EXISTS idx_job_summaries_cancel_reason"
  end
end
```

Then run:

```bash
mix ecto.migrate
```
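For a large production table, building the index concurrently avoids blocking writes during the build. A hedged variant of the same migration, assuming Ecto's standard options for concurrent index creation (`CREATE INDEX CONCURRENTLY` cannot run inside a transaction, so the migration disables it):

```elixir
defmodule EventasaurusApp.Repo.Migrations.AddCancelReasonIndex do
  use Ecto.Migration

  # CONCURRENTLY cannot run inside a transaction, so both the DDL transaction
  # and the migration lock are disabled for this migration.
  @disable_ddl_transaction true
  @disable_migration_lock true

  def up do
    execute """
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_job_summaries_cancel_reason
    ON job_execution_summaries
    USING GIN ((results -> 'cancel_reason'))
    """
  end

  def down do
    execute "DROP INDEX CONCURRENTLY IF EXISTS idx_job_summaries_cancel_reason"
  end
end
```

Either form satisfies the acceptance criteria below; the concurrent variant simply trades a longer build for no write locking.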
Acceptance Criteria
- [ ] Migration created and run successfully
- [ ] Verify index exists: `\d job_execution_summaries` in psql
- [ ] Test query performance with `EXPLAIN ANALYZE`:
  ```sql
  EXPLAIN ANALYZE
  SELECT *
  FROM job_execution_summaries
  WHERE state = 'cancelled'
    AND results->>'cancel_reason' = 'movie not matched';
  ```
- [ ] Document before/after query times (a sketch for capturing plans from IEx follows this list)
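One way to capture the before/after plans without leaving IEx is `Ecto.Adapters.SQL.explain/4` (available in ecto_sql 3.4+). A minimal sketch, assuming the schema module is named `JobExecutionSummary` and the repo is `EventasaurusApp.Repo` — both names are assumptions, adjust to the actual codebase:

```elixir
import Ecto.Query

query =
  from(s in JobExecutionSummary,
    where: s.state == "cancelled",
    where: fragment("?->>'cancel_reason' = ?", s.results, "movie not matched")
  )

# Runs EXPLAIN (ANALYZE) against the database and returns the plan as text.
# Record this output once before the migration and once after.
IO.puts(Ecto.Adapters.SQL.explain(EventasaurusApp.Repo, :all, query, analyze: true))
```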
2. Update Metrics Queries for NULL cancel_reason Handling
Of all historical cancelled jobs, 479 (19.8%) have a NULL cancel_reason. The current metrics queries do not account for these rows, so those cancellations silently drop out of the breakdown and the reported numbers understate what actually happened. The fix is to make NULL handling explicit in every query that filters on cancel_reason.
Two changes are needed. First, the existing failed_cancellations condition, which matches rows where cancel_reason equals "movie not matched", gains an explicit IS NOT NULL check so the intent is unambiguous. Second, a new unknown_cancellations condition (cancel_reason IS NULL) tracks cancellations with no recorded reason as their own category, so they appear in the metrics instead of being dropped. The functions to update live in `lib/eventasaurus_discovery/job_execution_summaries.ex`, primarily `get_worker_metrics/1` and `get_system_metrics/1`, along with any other function that queries cancel_reason.
To verify the change, call `JobExecutionSummaries.get_system_metrics(24)` in IEx and confirm the breakdown now includes unknown_cancellations, check that the dashboard shows an "Unknown" category for NULL cancel_reason, and compare the counts against a direct SQL query that groups cancellations by reason with NULL mapped to 'unknown' (see the acceptance criteria below).
```elixir
# `query` stands for the base job_execution_summaries queryable already scoped
# to the requested time window inside the metrics functions.

# Before (INCORRECT - implicitly excludes NULL)
failed_cancellations =
  where(query, [s],
    s.state == "cancelled" and
      fragment("?->>'cancel_reason' = ?", s.results, "movie not matched")
  )

# After (CORRECT - explicit NULL handling)
failed_cancellations =
  where(query, [s],
    s.state == "cancelled" and
      fragment(
        "?->>'cancel_reason' IS NOT NULL AND ?->>'cancel_reason' = ?",
        s.results,
        s.results,
        "movie not matched"
      )
  )

# Add tracking for unknown cancellations
unknown_cancellations =
  where(query, [s],
    s.state == "cancelled" and fragment("?->>'cancel_reason' IS NULL", s.results)
  )
```
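An alternative, shown here only as a hedged sketch, is to fetch the whole breakdown in one grouped query and let the database fold NULL into an "unknown" bucket via COALESCE; the module names (`JobExecutionSummary`, `Repo`) are assumptions:

```elixir
import Ecto.Query

# Count cancelled jobs per cancel_reason, treating NULL as "unknown".
breakdown =
  from(s in JobExecutionSummary,
    where: s.state == "cancelled",
    group_by: fragment("COALESCE(?->>'cancel_reason', 'unknown')", s.results),
    select: {
      fragment("COALESCE(?->>'cancel_reason', 'unknown')", s.results),
      count(s.id)
    }
  )
  |> Repo.all()

# Illustrative shape, using the counts referenced in this issue:
# [{"movie not matched", 1935}, {"unknown", 479}, ...]
```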
Files to Update
- `lib/eventasaurus_discovery/job_execution_summaries.ex`
  - `get_worker_metrics/1`
  - `get_system_metrics/1`
- Any other functions querying `cancel_reason`
Acceptance Criteria
- [ ] All metrics queries updated with explicit NULL handling
- [ ] Dashboard shows "Unknown" category for NULL cancel_reason
- [ ] Test queries return correct counts (a regression-test sketch follows this list):
  ```elixir
  # In IEx
  JobExecutionSummaries.get_system_metrics(24)
  # Should show breakdown including unknown_cancellations
  ```
- [ ] Verify against database:
  ```sql
  SELECT
    CASE
      WHEN results->>'cancel_reason' IS NULL THEN 'unknown'
      ELSE results->>'cancel_reason'
    END AS reason,
    COUNT(*)
  FROM job_execution_summaries
  WHERE state = 'cancelled'
  GROUP BY reason;
  ```
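A hedged sketch of what the regression test could look like, assuming an ExUnit `DataCase` and a hypothetical `insert_summary/1` helper for seeding `job_execution_summaries` rows (the metric key names follow the variables used above and are also assumptions):

```elixir
defmodule EventasaurusDiscovery.JobExecutionSummariesTest do
  use EventasaurusApp.DataCase, async: true

  alias EventasaurusDiscovery.JobExecutionSummaries

  test "NULL cancel_reason is counted as an unknown cancellation" do
    # insert_summary/1 is a hypothetical test helper; use the project's
    # factories or fixtures in practice.
    insert_summary(%{state: "cancelled", results: %{"cancel_reason" => "movie not matched"}})
    insert_summary(%{state: "cancelled", results: %{}})

    metrics = JobExecutionSummaries.get_system_metrics(24)

    assert metrics.failed_cancellations == 1
    assert metrics.unknown_cancellations == 1
  end
end
```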
3. Document Rollback Plan
A documented rollback plan is the safety net for this deployment: without one, a production issue after the Phase 5 changes ship can mean prolonged downtime while the team works out how to revert. The plan lives in a new `docs/ROLLBACK_PLAN.md` file and walks through the failure modes these changes could introduce, with concrete commands for each: dashboard performance degradation (check for the GIN index and recreate it if missing), incorrect metrics (verify the NULL handling and the ObanTelemetry cancel_reason extraction), system errors (identify and revert the offending commit, roll back the migration if needed, restart, and verify), and database migration failures (check migration status, PostgreSQL logs, existing indexes, and permissions).
The plan also lists preventive measures: test in staging first, run migrations during a low-traffic window, watch application logs for an hour after deployment, and keep the previous deployment ready for a quick rollback, with ongoing monitoring of dashboard response time, Oban job success rates, query performance, and error rates.
The task is complete when `docs/ROLLBACK_PLAN.md` is created and reviewed by the team, the rollback procedures have been exercised in development, and an emergency contact/escalation path is documented. The full document follows.
# Phase 5 Telemetry Rollback Plan

## Emergency Rollback Procedures

### If Dashboard Performance Degrades

**Symptoms:** Job execution monitor slow to load, timeout errors

**Diagnosis:**

```bash
# Check if index exists
psql -d eventasaurus_dev -c "\d job_execution_summaries"
# Look for: idx_job_summaries_cancel_reason
```

**Fix:**

If the index is missing, create it immediately:

```bash
mix ecto.migrate
```

Or manually in psql:

```sql
CREATE INDEX idx_job_summaries_cancel_reason
ON job_execution_summaries
USING GIN ((results -> 'cancel_reason'));
```

### If Metrics Are Incorrect

**Symptoms:** Dashboard shows unexpected counts, missing data

**Diagnosis:**

```sql
-- Check for NULL cancel_reasons
SELECT COUNT(*)
FROM job_execution_summaries
WHERE state = 'cancelled'
  AND results->>'cancel_reason' IS NULL;
```

**Fix:**

- Verify NULL handling in queries (see Task #2)
- If queries are correct but data is wrong, check ObanTelemetry:

```bash
grep -n "cancel_reason" lib/eventasaurus_app/monitoring/oban_telemetry.ex
# Should see lines 211-259 handling cancel_reason extraction
```

### If System Errors Occur

**Symptoms:** Application crashes, Oban job failures, database errors

**Rollback Steps:**

1. Identify the problem commit:

   ```bash
   git log --oneline --graph --decorate -10
   ```

2. Revert code changes:

   ```bash
   git revert <commit-hash>
   git push
   ```

3. Roll back the database migration (if the index was added):

   ```bash
   mix ecto.rollback --step 1
   ```

4. Restart the application:

   ```bash
   # Development
   mix phx.server
   # Production: follow your deployment process
   ```

5. Verify system stability:

   ```bash
   # Check Oban dashboard
   open http://localhost:4000/admin/oban
   # Check job execution monitor
   open http://localhost:4000/admin/job-executions
   ```

### If Database Migration Fails

**Symptoms:** Migration error during `mix ecto.migrate`

**Diagnosis:**

```bash
# Check migration status
mix ecto.migrations
# Check PostgreSQL logs
tail -f /path/to/postgresql/logs
```

**Fix:**

- If GIN index creation fails:

  ```sql
  -- Check if index already exists
  SELECT indexname FROM pg_indexes WHERE tablename = 'job_execution_summaries';
  -- If it exists but the migration failed, mark the migration as complete manually
  ```

- If permissions issue:

  ```sql
  -- Ensure user has CREATE INDEX privilege
  GRANT CREATE ON DATABASE eventasaurus_dev TO postgres;
  ```

## Prevention

Before deploying Phase 5 changes:

- ✅ Test in staging environment first
- ✅ Run migrations during low-traffic period
- ✅ Monitor application logs for 1 hour post-deployment
- ✅ Keep previous deployment ready for quick rollback

Monitoring:

- Dashboard response time
- Oban job success rates
- Database query performance
- Application error rates
#### Acceptance Criteria
- [ ] `docs/ROLLBACK_PLAN.md` created
- [ ] Document reviewed by team
- [ ] Rollback procedures tested in development
- [ ] Emergency contact/escalation path documented
---
### 4. Phase 5.3 TMDB Threshold Validation (Optional)
This optional task asks whether the TMDB confidence threshold for movie matching can be lowered without introducing too many bad matches. The threshold currently sits at 50% and yields a 70% match rate, while 1,935 jobs were cancelled with "movie not matched", so there is clear room to match more movies; a lower threshold, however, also risks matching the wrong ones and polluting the data.
The proposal is a small analysis task, `lib/mix/tasks/analyze_tmdb_threshold.ex`, that pulls recent matches in the 45-50% confidence band, has a human verify a sample of 50-100 of them, computes the false positive rate from that sample, and documents a recommendation. The decision criteria are fixed in advance: below 5% false positives, lower the threshold; above 10%, keep the current 50%; between 5% and 10%, it is a judgment call based on tolerance for bad matches. Because the current 70% match rate may already be acceptable, this validation is optional and only worth the effort where a higher match rate really matters.
```elixir
# Create test script: lib/mix/tasks/analyze_tmdb_threshold.ex
defmodule Mix.Tasks.AnalyzeTmdbThreshold do
  use Mix.Task

  @shortdoc "Analyze TMDB match quality at different confidence thresholds"

  def run(_) do
    # 1. Find recent movie matches with confidence 45-50%
    # 2. Manually verify if matches are correct (sample of 50-100)
    # 3. Calculate false positive rate
    # 4. Document findings and recommend proceed/skip
  end
end
```
Decision Criteria
- If false positive rate <5%: Proceed with lowering threshold
- If false positive rate >10%: Keep current 50% threshold
- If 5-10%: User decision based on tolerance
This is OPTIONAL - Current 70% match rate may be acceptable for your use case.
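To make the decision criteria above concrete, here is a minimal, purely illustrative sketch of computing a false positive rate from a manually verified sample and mapping it to a recommendation (module and field names are hypothetical):

```elixir
defmodule TmdbThresholdAnalysis do
  @doc """
  `sample` is a list of manually reviewed matches in the 45-50% confidence
  band, e.g. [%{title: "...", correct?: true}, %{title: "...", correct?: false}].
  """
  def recommendation(sample) when sample != [] do
    false_positive_rate =
      Enum.count(sample, &(not &1.correct?)) / length(sample) * 100

    cond do
      false_positive_rate < 5 -> {:lower_threshold, false_positive_rate}
      false_positive_rate > 10 -> {:keep_50_percent_threshold, false_positive_rate}
      true -> {:user_decision, false_positive_rate}
    end
  end
end
```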
5. Timeline Buffer Guidance
Phase 5 was estimated at 20 hours but likely took 25-30 hours to implement. For future planning of comparable features, build the buffer into the estimate up front: add 25% for unknowns, plus roughly 2 hours for code review and 2 hours for documentation, so neither gets squeezed at the end.
Applied to a 20-hour task, that is 20 × 1.25 = 25 hours for implementation, plus 2 hours of review and 2 hours of documentation, for a total of about 29 hours. Estimating this way sets more realistic expectations, makes resource planning easier, and leaves explicit room for the review and documentation work that complex features always need.
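The same arithmetic as a tiny sketch; the helper name is illustrative, and the 25% buffer and 2-hour allowances come from the guidance above:

```elixir
defmodule EstimateBuffer do
  @unknowns_buffer 0.25
  @code_review_hours 2
  @documentation_hours 2

  # Buffered estimate = base * 1.25 + code review + documentation
  def buffered_estimate(base_hours) do
    base_hours * (1 + @unknowns_buffer) + @code_review_hours + @documentation_hours
  end
end

EstimateBuffer.buffered_estimate(20)
# => 29.0
```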
Success Criteria
This issue is complete when:
- ✅ GIN index created and performance validated
- ✅ NULL handling implemented and tested
- ✅ Rollback plan documented and reviewed
- ✅ All changes tested in development environment
- ✅ System metrics show correct counts
- ✅ Dashboard performance is fast (<500ms load time)
Final Grade Target: A+ (97/100) - Production Ready ✨
Implementation Order
- First: Database GIN index (30 min) - Prerequisite for performance
- Second: NULL handling (1 hour) - Ensures data accuracy
- Third: Rollback docs (30 min) - Safety net for production
Total Time: ~2 hours for all must-do tasks
References
- Parent Issues: #2438 (ShowtimeProcessJob Metrics), #2439 (Refinements), #2443 (Rollout Assessment)
- Audit Report: All tasks verified as needed during comprehensive audit
- Current System Status: Working excellently (A-, 92/100), needs production hardening
- Database Evidence: 2,991 job summaries, 479 NULL cancel_reasons, 1,935 "movie not matched"
In conclusion, Phase 5 production hardening is the last step to an A+ grade: the GIN index keeps cancel_reason queries fast as the dataset grows, explicit NULL handling keeps the metrics accurate, and the rollback plan gives the team a clear path back if a deployment goes wrong. Together these changes make the telemetry system robust enough for production. For background on index types and when to use them, see the PostgreSQL documentation on indexing.
This issue represents the final 8% needed to reach production-ready status (A+ grade).