One of the teams I work in recently discovered that we weren’t paginating through the REST API results we receive in one part of our application. We added that in, but the Java library we use for constructing the URL (including query string params) had an “override” method as well as an “append” method for query string params. The diff for the change looked something like this and no engineers spot the problem during code review:
URLBuilder.new() .path("/rest/api/2/filters") + .addQueryParam("startAt", startAt) .putQueryParams(Map.of("ids", filterIds))
Subsequently we only fetched page 1 everytime we wanted to fetch the next page in the result set. The code change author forgot to add new unit tests because the pagination logic in general had good unit test coverage. The integration tests didn’t fail because the data set used for them only contained 1 page of results. On deployment to production our services fell into an infinite loop and started running out of memory and maxing out the CPU. Thankfully for one of the relevant services this caused the deployment itself to fail and rollback, but for another service with the same logic but a different traffic pattern we didn’t see the effects of the change until later and had to manually rollback.
The second, related, bug was caused by our implementation of what to do during pagination if we receive a non-200 HTTP response code, for example a 429 Too Many Requests. Graceful recovery from intermittent failures like this is par for the course in this codebase, so our implementation just treated that scenario as “no data” instead of throwing an exception. We’re likely to have some data we can process from the requests for the previous pages so let’s do the best we can with that rather than throw away all the data we’ve obtained. What could go wrong?
It turns out that this particular REST API is considered the authoritative source of truth. If some particular result is missing from the response we delete corresponding data from elsewhere to match the fact that it doesn’t exist anymore. At a best guess an unrelated deployment causes a spike in the number of 429s we receive for at least one customer. Rate limiting is implemented per-customer and not per-REST API endpoint. We diligently start deleting the customer’s data as it’s not in the result set from that API. A freaked out customer raises a support ticket telling us their data is going missing. We shortly find the problem, implement a fix and try to deploy it but the service won’t startup because of a whole set of entirely unrelated changes that now require new service configuration that is missing because the engineers didn’t realise the two services shared that area of code. Check our backlog for the tickets that identify the work required to split the code further or rewrite that part of the codebase completely. Once that’s resolved, the build pipelines fail on some flakey test that requires investigation to ensure it really is flakey and not an actual problem, and then a 3 day public holiday weekend starts.
Dear valued customer, please wait until Tuesday for us to stop deleting your valued data at which point we’ll see if we can recover what is lost.