A client I worked with had an issue with several of their Spark jobs not running as they should have. As you may know, Spark is a distributed processing framework that lets developers run computations on a big data platform. In this particular environment, each team member was developing in a different programming language (Python, Scala and Java) to achieve the same goal: leveraging Spark's machine learning capabilities. The challenge was that they could not understand each other’s work.
It was evident that we needed to implement a new process involving a code generator. The code generator would allow developers to concentrate their efforts on reusability, transparency and real data science problem solving. The business would also save on costly development time. This led us to introduce Talend to the big data environment.
So how can we use data and machine learning to predict the future? My recommendation is to:
- store the data in any Hadoop cluster,
- leverage the Spark framework for its Machine Learning capability,
- use Talend as the code generator/ETL.
Talend can interact with the Hadoop cluster via Spark (running on YARN) in order to provide a prediction based on the data.
How does Spark deal with Machine Learning?
A picture is worth a thousand words (credit: cakesolutions.net).
In a nutshell, the data is acquired and pushed into a Hadoop cluster. To reduce noise, the data first needs to be normalised (pre-processing). Feature extraction then transforms each signal into a single value (a feature), and finally the model is trained. Once the model is trained, we test it against real values (80% of the data is used for training, 20% for testing) in order to determine how accurate the prediction model is. If it doesn’t produce a satisfactory outcome, we repeat the same process with a different algorithm.
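To make the pre-processing and the 80/20 split concrete, here is a minimal sketch in plain Python. It is not what Talend or Spark generates; the function names, the min-max scaling choice, the seed and the sample values are all assumptions made for illustration.

```python
import random

def min_max_normalise(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly, then cut them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical raw measurements, used only to exercise the two helpers.
raw = [12.0, 15.0, 20.0, 22.0, 30.0, 31.0, 35.0, 40.0, 41.0, 52.0]
features = min_max_normalise(raw)
training, testing = train_test_split(features)
print(len(training), len(testing))  # 8 2
```

Shuffling before the cut matters: if the data is sorted (by date, for example), taking the first 80% as-is would train and test the model on different kinds of rows.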
Everyone loves a recipe – Talend Cookbook
Loading the data into your Hadoop-MapR cluster
Create a new Standard Data Integration job:
- Download the data (tFileFetch) that you want to work with (the Pearson data set is a good choice).
- Unzip it (tUnArchive) if it is an archive.
- Upload to your cluster (tHDFSPut).
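For readers who want to see what these three components amount to, the steps above can be sketched by hand in plain Python. This is only an approximation of the job, not the code Talend generates: the function names, the `data.zip` file name and the use of the `hadoop fs -put` command-line tool (which must be installed and configured for your cluster) are assumptions.

```python
import subprocess
import urllib.request
import zipfile
from pathlib import Path

def fetch(url, target):
    """Download the file at url to target (roughly the tFileFetch step)."""
    urllib.request.urlretrieve(url, target)
    return Path(target)

def unarchive(archive, out_dir):
    """Extract a zip archive (roughly tUnArchive) and return the extracted files."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(out_dir)
        # Skip directory entries so that only files are returned.
        return [Path(out_dir) / name for name in zf.namelist()
                if not name.endswith("/")]

def put_command(local_path, cluster_dir):
    """Build the command line that copies one file to the cluster (roughly tHDFSPut)."""
    return ["hadoop", "fs", "-put", "-f", str(local_path), cluster_dir]

def load(url, work_dir, cluster_dir):
    """Run the three steps in order: download, unzip, upload."""
    archive = fetch(url, Path(work_dir) / "data.zip")
    for path in unarchive(archive, work_dir):
        subprocess.run(put_command(path, cluster_dir), check=True)
```

Calling `load(url, work_dir, cluster_dir)` with your own data set URL, a local working directory and a target directory on the cluster runs the whole chain; `check=True` makes the script stop if an upload fails.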
Training and testing the model using the data
Create a new Big Data Spark job:
- Drag and drop your Hadoop cluster connection information from the Metadata section (in the image below we have used MapR).
- Map the data (tFileInputDelimited).
- Clean / normalise your dataset (tMap, tModelEncoder).
- Split the data into training and testing data sets (not shown in the image).
- Train your model (Linear Regression, tLinearRegressionModel).
- Test / predict from your model (tPredict).
- Repeat the train and test steps with a different model.
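The train, test and "try another model" loop can be illustrated with a small plain-Python sketch. It stands in for what tLinearRegressionModel and tPredict do on Spark rather than reproducing them: the single-feature least-squares fit, the mean-value baseline used as the "different model", the RMSE metric and the data points are all assumptions chosen to keep the example short.

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def rmse(model, xs, ys):
    """Root mean squared error of the model's predictions on (xs, ys)."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical data, already split 80/20 into training and testing rows.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
train_y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]
test_x = [9.0, 10.0]
test_y = [19.0, 21.1]

# Train each candidate on the training rows and score it on the testing rows.
candidates = [("linear", fit_linear), ("mean", fit_mean)]
scores = {name: rmse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates}
best = min(scores, key=scores.get)
print(best)  # linear
```

Scoring every candidate on the same held-out rows is what makes the comparison fair; adding a third algorithm only requires another entry in `candidates`.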
By leveraging the data, we have been able to predict the future using a variety of algorithms without having to code much. Talend did that for us. You can download a sandbox that contains Talend and a Hadoop cluster with Spark and MapReduce already pre-configured. The sandbox also comes with a couple of other hands-on example jobs.
Download the sandbox from here: https://info.talend.com/prodevaltpbdsandbox2-0.html