It has been a long time since our last post, and many things have changed for us
both. New jobs, new adventures, and new challenges. I will not delve into the
details, but let's say that our business domain has changed, yet we are still facing
architectural and programmatic problems that we can all enjoy :)
Intro
Recently I was asked to incorporate
Time Series Forecasting into our product. We have been working closely with an
in-house data science team. They developed the algorithm, and it was then up to
us to decide how to productize it. If you ask me, one of the biggest challenges
facing developers in this Big Data era is taking very complex algorithms
and methods (usually written by non-product-facing people) and understanding
the correct approaches and tools to make them work in production. An algorithm
can provide great results, but when you look closely at how long it takes to run
and the load you will be facing, you realize it is practically useless.
The problem
So the problem we were facing was creating
a solution for time
series forecasting running 24/7. We needed to gather utilization values
from a 3rd-party system and create weekly and daily predictions.
Expected load
Number of requests
Our system works in peaks. We can be
asked to compute a prediction in response to a change in the system, or several
times on a daily basis. I will concentrate on the latter case, since that
gives us the maximum load. Several times a day we needed to serve 4,000 forecasting requests, all arriving within a one-minute interval.
Data Points Length
We needed to forecast about 3 months
into the future, at a daily and weekly granularity. In numbers, this means we
had 180 data points and needed to predict ~90 values into the future.
Possible solutions to run the prediction
We really cracked our heads on the best way
to tackle this. You need to remember that the data scientists wrote their code in R.
At first there were suggestions to rewrite
their R code in Java or Python. Since this is not our field of expertise, we
quickly decided to drop this approach.
We then looked at the possibility of running
the R code using Rserve. Rserve is
a TCP/IP server that allows other programs to use R. This solution would have required us
to create another container to run Rserve itself, which would need maintenance,
and would also introduce network latency. We could have installed Rserve on
the same machine as the client, but that wasn't possible due to security
reasons.
Lastly we looked at rpy2. rpy2 is an interface to R running
embedded in a Python process. This gave us the freedom we needed, let us use the
Python expertise we already have, and could withstand the load we were
expecting. It's easy to use and configure (Windows needed some hacking,
but we were able to overcome it).
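To give a feel for what this looks like, here is a minimal sketch of calling an R forecasting function from Python through rpy2. The script name (forecast_model.R) and function name (run_forecast) are stand-ins for the data scientists' actual code, not the real names.

    import rpy2.robjects as robjects

    # Load the R script once; its functions land in R's global environment.
    robjects.r.source("forecast_model.R")
    run_forecast = robjects.globalenv["run_forecast"]

    def predict(history, horizon=90):
        # Convert the Python list into an R vector, call the embedded
        # R function, and return plain Python floats.
        r_history = robjects.FloatVector(history)
        return list(run_forecast(r_history, horizon))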
Chosen Architecture
So we decided to keep the R
code and run the prediction using rpy2. But that was only half of the way. We
still needed a supporting ecosystem that would trigger the R code and sustain our required load. We created a dedicated RedHat VM with 8 cores. On it we
installed an Apache web server that used Flask to receive
requests from consumers. Python was then used with rpy2 to run the R code. We
had a ProcessPoolExecutor that uses a pool
of processes to execute calls asynchronously.
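The following is a rough sketch of that serving layer, assuming an illustrative endpoint path and payload rather than our exact production API. Flask runs under Apache, and the ProcessPoolExecutor fans the R computation out to worker processes, each of which gets its own embedded R instance through rpy2.

    from concurrent.futures import ProcessPoolExecutor
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    executor = ProcessPoolExecutor(max_workers=8)  # one worker per core on the VM

    def forecast_worker(history, horizon):
        # rpy2 is imported inside the worker so R initializes once per process.
        import rpy2.robjects as robjects
        robjects.r.source("forecast_model.R")      # hypothetical R script
        run_forecast = robjects.globalenv["run_forecast"]
        return list(run_forecast(robjects.FloatVector(history), horizon))

    @app.route("/forecast", methods=["POST"])
    def forecast():
        payload = request.get_json()
        future = executor.submit(forecast_worker,
                                 payload["history"],
                                 payload.get("horizon", 90))
        # In production this is where the finished forecast is posted back to
        # the consumer's callback address (sketched further down); here we
        # simply log completion and acknowledge the request immediately.
        future.add_done_callback(
            lambda f: print("forecast ready:", len(f.result()), "points"))
        return jsonify({"status": "accepted"}), 202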
The component that actually sent out prediction
requests was another service we had in the system, written in Java. We decided
that it would send asynchronous requests in bulks. It had a retry mechanism and
sent a callback address to the Apache web server.
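On the web-server side, delivering a finished forecast to that callback address can be as simple as an HTTP POST. The JSON shape and the use of the requests library below are assumptions for illustration, not the exact production contract.

    import requests

    def post_result(callback_url, request_id, forecast_values):
        # Push the finished forecast back to the address the consumer supplied.
        body = {"request_id": request_id, "forecast": forecast_values}
        try:
            requests.post(callback_url, json=body, timeout=10)
        except requests.RequestException:
            # A failed delivery is left to the consumer's retry mechanism,
            # which re-sends the original forecasting request.
            pass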
Results
Well, I must say I was really surprised.
The system has been running for six months now. The web server is stable and holds the
load easily. Moreover, the load keeps growing with no signs of breaking.
As we grow, we expect new challenges. In
other words: watch this space :)