It has been a long time since our last post, and many things have changed for us both: new jobs, new adventures, and new challenges. I will not delve into the details, but let's just say that our business domain changed, and we are still facing architectural and programming problems that we can all enjoy :)
Recently I was asked to incorporate time series forecasting into our product. We have been working closely with an in-house data science team. They developed the algorithm, and then it was up to us to decide how to productize it. If you ask me, one of the biggest challenges facing developers in this Big Data era is taking very complex algorithms and methods (usually written by non-product-facing people) and figuring out the correct approaches and tools to make them work in production. An algorithm can provide great results, but when you look closely at how long it takes to run and at the load you will be facing, you may realize it is practically useless.
So the problem we were facing was building a time series forecasting solution that runs 24/7. We needed to gather utilization values from a 3rd party system and create weekly and daily predictions.
Number of requests
Our system works in peaks. We can be asked to compute a prediction in response to a change in the system, or several times on a daily basis. I will concentrate on the latter case, since that gives us the maximum load: several times a day we needed to serve 4,000 forecasting requests, all arriving within a one-minute interval.
Data Points Length
We needed to forecast about 3 months into the future, at daily and weekly granularity. In numbers, this means we had 180 historical data points and needed to predict ~90 values into the future.
Possible solutions to run the prediction
We really racked our brains over the best way to tackle this. Keep in mind that the data scientists wrote their code in R.
At first there were suggestions to rewrite the R code in Java or Python. Since this is not our field of expertise, we quickly decided to drop this approach.
We then looked at the possibility of running the R code using Rserve. Rserve is a TCP/IP server that allows other programs to use R. This solution would have required us to create another container to run Rserve itself, would have needed maintenance, and would have introduced network latency. We could have installed Rserve on the same machine as the client, but that wasn't possible due to security reasons.
Lastly we looked at rpy2. rpy2 is an interface to R running embedded in a Python process. This gave us the freedom we needed, let us use our existing Python expertise, and could withstand the load we were expecting. It's easy to use and configure (Windows needed some hacking, but we were able to overcome it).
So we decided to keep the R code and run the prediction using rpy2. But that was just half of the way: we still needed a supporting ecosystem to trigger the R code and sustain the required load. We created a dedicated RedHat VM with 8 cores. On it we installed an Apache web server fronting a Flask application that received requests from consumers. Python then used rpy2 to run the R code, and a ProcessPoolExecutor used a pool of processes to execute the calls asynchronously.
The component that actually sent out the prediction requests was another service in our system, written in Java. We decided that it would send asynchronous requests in bulks. It had a retry mechanism and passed a callback address to the Apache web server.
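The retry idea is simple enough to sketch (in Python for consistency, though the real client was Java). The function and parameter names here are illustrative, not the actual service's API:

```python
# Sketch of a retry-with-backoff wrapper like the one the client used.
# send_request is a stand-in for the real HTTP call; names are illustrative.
import time

def with_retries(send_request, payload, attempts=3, base_delay=0.1):
    """Call send_request(payload); retry failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return send_request(payload)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))

# Toy usage: a sender that fails twice before succeeding.
calls = {'n': 0}
def flaky_send(payload):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient failure')
    return 'accepted'

result = with_retries(flaky_send, {'series_id': 42}, base_delay=0.01)
print(result)
```

Combined with the callback address, this keeps the client fire-and-forget: it retries delivery of the request, and the web server pushes results back when the forecast is ready.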
Well, I must say I was really surprised. The system has been running for six months now. The web server is stable and holds the load easily. Moreover, the load keeps growing with no signs of breaking.
As we grow, we expect new challenges. In other words: watch this space :)