-
Notifications
You must be signed in to change notification settings - Fork 25
feat: Add transactional queue system to prevent data loss #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: fix/handle-flushing-events-with-32kb-limit
Are you sure you want to change the base?
feat: Add transactional queue system to prevent data loss #83
Conversation
da3b21c to
bd6abce
Compare
- Replace non-transactional queue with safe message handling - Add immediate retry logic with exponential backoff (3 attempts) - Implement failed queue system for long-term retry management - Add comprehensive error logging and observability features - Include safety mechanisms to prevent infinite loops - Enhance MockedHttpClient and MockErrorHandler for testing - Add ReliableDeliveryTest suite with 11 comprehensive tests This addresses the critical customer issue where HTTP failures caused permanent message loss due to premature queue removal. Now messages are only removed after confirmed successful delivery. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
bd6abce to
39d9ce5
Compare
rafaeelaudibert
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels somewhat complicated and not well abstracted. I would recommend you check how we approach this both on posthog-node and posthog-python.
|
|
||
| $userAgent = sprintf('%s/%s', | ||
| $sampleMessage['library'] ?? 'PostHog-PHP', | ||
| $sampleMessage['library_version'] ?? 'Unknown' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get the version from a version.php file somehow?
| protected $maximum_backoff_duration = 10000; // Set maximum waiting limit to 10s | ||
| protected $max_retry_attempts = 3; | ||
| protected $max_failed_queue_size = 1000; | ||
| protected $initial_retry_delay = 60; // Initial retry delay in seconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very high, retrying in a couple seconds usually makes sense if you make them exponentially grow
|
|
||
| for ($attempt = 0; $attempt < $this->max_retry_attempts; $attempt++) { | ||
| if ($attempt > 0) { | ||
| usleep($backoff * 1000); // Wait with exponential backoff |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we really sleep? Is PHP multithreaded or will this block the whole server? I dont like this, we should always yield back control when possible
| 'Failed queue size limit reached. Dropping oldest failed batch.'); | ||
| } | ||
|
|
||
| $this->failed_queue[] = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an append? PHP is crazy lol
|
@haacked FYI I'm still waiting on changes on this PR, I believe it still looks exactly the same as it did when I reviewed last week? I told Lucas on a DM that we should probably aim towards making this more similar as to how we handle retries on Python/NodeJS |
|
@rafaeelaudibert it's on my list, I'll probably focus on this tomorrow once I have time to look how the python sdk works |
rafaeelaudibert
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes to make this leave my todo list :)
Summary
This PR adds a transactional queue system that prevents data loss when HTTP requests fail. This is a separate improvement that builds on the 32KB payload fix to provide comprehensive reliability.
Base Branch:
fix/handle-flushing-events-with-32kb-limit(stacked PR)Target for final merge: This will be part of the overall reliability improvements
🚨 Problem Addressed
The customer reported a critical architectural issue beyond the 32KB limit:
✅ Solution
Implement transactional queue behavior where messages are only removed AFTER confirmed delivery.
Key Changes
🔄 Transactional Queue Logic (
QueueConsumer.php)array_splice()before delivery witharray_slice()to peek at messages🛡️ Multi-Level Retry System
📊 Enhanced Observability
getFailedQueueStats()for real-time failure monitoringclearFailedQueue()for manual recovery🧪 Comprehensive Testing
ReliableDeliveryTest.phpwith 11 comprehensive testsConfiguration Options
Test Results
✅ All reliability tests pass (11/11)
✅ All existing tests pass (backward compatibility confirmed)
✅ No infinite loops (safety mechanisms validated)
✅ Memory bounded (failed queue overflow protection)
Customer Impact
This addresses the customer's core reliability concerns while maintaining full backward compatibility.
🤖 Generated with Claude Code