Because Gutterball is driven by events dispatched from Candlepin, reliable delivery of those events is critical to the accuracy of its reports.
Candlepin events are dispatched by first submitting them to an embedded HornetQ server using its core API. We have a small number of listeners configured to store a reduced version of each event in our database, write it to /var/log/candlepin/audit.log, and optionally publish it onto the AMQP bus. HornetQ gives us essentially asynchronous sending of events (the API call can return while events are sent in the background), as well as reliable delivery of the messages.
It is also worth noting that events to be dispatched are gathered up during a REST API request and only sent after successful completion. If an error is encountered in Candlepin code after some event has been gathered, that message will never make it to HornetQ or the message bus.
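The gather-then-dispatch pattern above can be sketched roughly as follows. This is a hypothetical illustration, not Candlepin's actual implementation; the class and method names (EventSink, queueEvent, sendEvents, rollback) are assumptions for the sake of the example, and an in-memory list stands in for the real dispatch machinery.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: events are buffered during a REST request and only
// dispatched once the request completes successfully.
public class EventSink {

    private final List<String> pendingEvents = new ArrayList<>();

    // Called throughout a request as state changes occur.
    public void queueEvent(String event) {
        pendingEvents.add(event);
    }

    // Called only after the request completes successfully. In the real
    // application each event would be submitted to the embedded HornetQ
    // server at this point.
    public List<String> sendEvents() {
        List<String> sent = new ArrayList<>(pendingEvents);
        pendingEvents.clear();
        return sent;
    }

    // Called when the request fails: gathered events are discarded and
    // never reach HornetQ or the message bus.
    public void rollback() {
        pendingEvents.clear();
    }

    public static void main(String[] args) {
        EventSink sink = new EventSink();
        sink.queueEvent("consumer.created");
        sink.queueEvent("entitlement.created");
        System.out.println(sink.sendEvents().size()); // prints 2

        sink.queueEvent("pool.deleted");
        sink.rollback(); // request failed; the event is dropped
        System.out.println(sink.sendEvents().size()); // prints 0
    }
}
```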
Our HornetQ messages are marked as ‘durable’, as are the queues they are sent to. By default HornetQ stores its journal in /var/lib/candlepin/hornetq/.
To communicate with AMQP we use the Qpid JMS client, so we are essentially working against the JMS API. Gutterball likewise uses the JMS API to receive messages on the other end of the bus.
In this scenario, both the Candlepin and Gutterball applications are live, but the Qpid message bus goes down for some reason. Both applications will log an error when this happens; they lose their connection to qpidd but continue to operate fine. HornetQ queues up messages in the durable queues we configure, holds onto them, and re-attempts delivery the next time the application is restarted and qpidd is up.
This behaviour should hold true for any exception thrown in an EventListener; as such, we should not ignore exceptions in these classes if it is important that the event reach its destination.
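The point about not swallowing exceptions can be illustrated with a small sketch. This is not Candlepin's actual listener code; the class name and the simulated failure are assumptions, and the key idea is simply that a propagated exception leaves the message queued for redelivery, while a caught-and-ignored one effectively acknowledges it.

```java
// Illustrative sketch: an event listener that lets failures propagate so
// the durable queue keeps the message and re-attempts delivery later.
public class LoggingListener {

    public void onEvent(String event) {
        // Do NOT wrap this in a catch-all that only logs: swallowing the
        // exception signals that the message was handled, and it is lost.
        // Letting it propagate keeps the message in the durable queue.
        publish(event);
    }

    private void publish(String event) {
        if (event == null) {
            // Stand-in for a real failure, e.g. the qpidd connection is down.
            throw new IllegalStateException("bus connection is down");
        }
        System.out.println("published: " + event);
    }

    public static void main(String[] args) {
        LoggingListener listener = new LoggingListener();
        listener.onEvent("consumer.created"); // prints "published: consumer.created"
        try {
            listener.onEvent(null); // failure propagates to the caller
        } catch (IllegalStateException e) {
            System.out.println("delivery failed, message stays queued");
        }
    }
}
```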
If Gutterball is down, events sit in the Qpid exchange and will be delivered when Gutterball returns. The default JMS time-to-live is infinite and we do not appear to override it, so the messages will remain there until Gutterball returns.
Event processing in Gutterball is divided into two phases, each in its own transaction.
In the first, we simply attempt to get the event into our database with minimal processing. We parse the JSON into an event, store it in gb_event with a status indicating it was received, then commit the transaction. If an exception is thrown in this phase, the event remains on the Qpid exchange and will be re-tried whenever Gutterball rejoins the bus. Note that this has been known to trigger the dreaded qpidd capacity exceeded error discussed below. However, errors here should be exceedingly rare; no known situation can cause one.
In the second phase, we perform the actual Gutterball event processing. Any exception in this phase should be caught and logged. Because we assigned the event an initial status of received, we know that any event remaining in the database in this state probably failed processing (ignoring the race condition where an event is currently being processed). This phase has been known to fail due to bugs in the code; on application upgrade, or perhaps on demand via an API call, Gutterball could scan for such events and re-try them.
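The two phases above can be sketched as follows. This is a hypothetical illustration, not Gutterball's actual code: an in-memory map stands in for the gb_event table, the status names are assumptions modelled on the description above, and a string check stands in for a real processing bug.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of two-phase event handling: phase 1 stores the event
// as RECEIVED and commits; phase 2 does the real work, catching failures so
// the event keeps its RECEIVED status and can be found and re-tried later.
public class EventProcessor {

    enum Status { RECEIVED, PROCESSED }

    private final Map<String, Status> gbEvent = new HashMap<>();

    // Phase 1: minimal processing. An exception here would leave the
    // message on the Qpid exchange for redelivery.
    public void receive(String json) {
        String event = json; // real code would parse the JSON into an Event
        gbEvent.put(event, Status.RECEIVED);
    }

    // Phase 2: real processing in a second transaction. Exceptions are
    // caught and logged; the event stays RECEIVED for a later retry scan.
    public void process(String event) {
        try {
            if (event.contains("bad")) {
                throw new RuntimeException("processing bug");
            }
            gbEvent.put(event, Status.PROCESSED);
        } catch (RuntimeException e) {
            System.out.println("failed, left as RECEIVED: " + event);
        }
    }

    public static void main(String[] args) {
        EventProcessor p = new EventProcessor();
        p.receive("good-event");
        p.receive("bad-event");
        p.process("good-event");
        p.process("bad-event"); // prints "failed, left as RECEIVED: bad-event"
        System.out.println(p.gbEvent.get("good-event")); // prints PROCESSED
        System.out.println(p.gbEvent.get("bad-event"));  // prints RECEIVED
    }
}
```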
There are really no options here other than modifying Gutterball itself to insert explicit exception throws.
A potential remaining problem is the number of messages failing to import exceeding the amount qpidd is configured to store, which results in an exception: Enqueue capacity threshold exceeded. This exception seems to surface only in high-volume scenarios (parallel spec tests) when events throw an exception while being received by Gutterball. Because this is now only possible in phase one, and we do not know of any situation where it can occur, we are hopeful the two-phase approach to message processing in Gutterball will prevent it.