From the weather to the lottery, it’s human nature to want to predict the future.

We try to change our behaviour or decisions to influence our future in a (hopefully) positive way. Businesses are no different: they want to predict what a customer wants before they ask, the resources they'll need and the revenue they'll make. In the past, we might have resorted to fortune telling or tea leaves, but now we have data and machine learning.

What’s in our toolbox?

A client I worked with had an issue with several of their Spark jobs not running as they should. As you may know, Spark is a framework that lets developers run distributed data processing on a big data platform. In this particular environment, each team member was developing in a different programming language (Python, Scala and Java) to achieve the same goal of leveraging Spark Machine Learning. The challenge was that they could not understand each other's work.


It was evident that we needed to implement a new process involving a code generator. The code generator would allow developers to concentrate their efforts on reusability, transparency and real data science problem solving. The business would also save on costly development time. This led us to introduce Talend to the big data environment.


So how can we use data and machine learning to predict the future? My recommendation is to:

  • store the data in any Hadoop cluster,
  • leverage the Spark framework for its Machine Learning capability,
  • use Talend as the code generator/ETL.

Talend can interact with the Hadoop cluster via Spark (running on YARN) in order to produce predictions from the data.
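Under the hood, the job Talend generates ultimately runs as Spark code submitted to the cluster. If you were writing the entry point by hand, a minimal PySpark sketch might look like the following; the application name is a placeholder, and it assumes HADOOP_CONF_DIR points at your cluster configuration:

    from pyspark.sql import SparkSession

    # Minimal Spark-on-YARN session; assumes the Hadoop/MapR client
    # configuration is available on the machine submitting the job.
    spark = (SparkSession.builder
             .appName("prediction-job")   # hypothetical application name
             .master("yarn")              # execute on the cluster via YARN
             .getOrCreate())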

How does Spark deal with Machine Learning?

A picture is worth a thousand words (credit: cakesolutions.net).

In a nutshell, the data is acquired and pushed into a Hadoop cluster. To reduce noise, the data first needs to be normalised (pre-processing); the raw signals are then transformed into numeric values (features); and finally the model is trained. Once the model has been trained, we test it against real values (80% of the data for training, 20% for testing) in order to determine how accurate the prediction model is. If it doesn't produce a satisfactory outcome, we repeat the same process with a different algorithm.
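To make the 80/20 split and the accuracy check concrete, here is a minimal PySpark sketch. It assumes a DataFrame df that already holds a "features" vector column and a "label" column, and it uses RMSE as the accuracy measure; the column names and choice of second algorithm are assumptions, not part of the original job:

    from pyspark.ml.regression import LinearRegression, RandomForestRegressor
    from pyspark.ml.evaluation import RegressionEvaluator

    # df is assumed to already hold a "features" vector column and a "label" column.
    train, test = df.randomSplit([0.8, 0.2], seed=42)   # 80% training, 20% testing
    evaluator = RegressionEvaluator(labelCol="label", metricName="rmse")

    # Train, predict on the held-out 20%, and score; if the first algorithm is
    # not accurate enough, try another one on the same split.
    for algo in (LinearRegression(labelCol="label"),
                 RandomForestRegressor(labelCol="label")):
        model = algo.fit(train)
        rmse = evaluator.evaluate(model.transform(test))
        print(type(algo).__name__, "RMSE:", rmse)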

Everyone loves a recipe – Talend Cookbook

Loading the data into your Hadoop-MapR cluster

Create a new Standard Data Integration job:

  • Download the data you want to work with (tFileFetch); the Pearson data set is a good option.
  • Unzip it (tUnArchive) if it is an archive.
  • Upload it to your cluster (tHDFSPut). A rough script equivalent of these three steps is sketched below.
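If you want to check these steps outside Talend, a minimal Python sketch of the same sequence could look like this; the URL and paths are placeholders, and it assumes the hadoop command-line client is available on the machine:

    import subprocess
    import urllib.request
    import zipfile

    url = "https://example.com/dataset.zip"    # placeholder for your data set
    local_zip = "/tmp/dataset.zip"
    local_dir = "/tmp/dataset"
    hdfs_dir = "/user/me/dataset"              # placeholder cluster path

    urllib.request.urlretrieve(url, local_zip)    # ~ tFileFetch: download the data
    with zipfile.ZipFile(local_zip) as archive:   # ~ tUnArchive: unzip it
        archive.extractall(local_dir)

    # ~ tHDFSPut: copy the extracted files up to the cluster
    subprocess.run(["hadoop", "fs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hadoop", "fs", "-put", "-f", local_dir, hdfs_dir], check=True)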

Training and testing the model using the data

Create a new Big Data Spark job:

  • Drag and drop your Hadoop cluster connection information from the Metadata section (in the image below we have used MapR).
  • Read and map the data (tFileInputDelimited).
  • Clean / normalise your dataset (tMap, tModelEncoder).
    • Not shown in the image: split the data into training and testing sets.
  • Train your model (Linear Regression, tLinearRegressionModel).
  • Test / predict from your model (tPredict).
  • Repeat the training and prediction steps with a different model if the results are not satisfactory. A rough hand-written Spark equivalent of this job is sketched below.
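For readers without Talend in front of them, here is a rough hand-written PySpark equivalent of the job above. It reuses the Spark session from earlier; the file path and column names are assumptions, and the tMap / tModelEncoder step is approximated with a simple VectorAssembler:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # ~ tFileInputDelimited: read the delimited file from the cluster (path is hypothetical).
    raw = spark.read.csv("hdfs:///user/me/dataset/data.csv", header=True, inferSchema=True)

    # ~ tMap / tModelEncoder: assemble the cleaned numeric columns into one feature vector
    # (column names here are placeholders for your own data).
    assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
    data = assembler.transform(raw).select("features", "label")

    # The step not shown in the image: split into training and testing sets.
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    # ~ tLinearRegressionModel: train the linear regression model.
    model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

    # ~ tPredict: apply the trained model to the held-out test data.
    model.transform(test).select("label", "prediction").show(5)

    # To redo the last two steps with a different model, swap LinearRegression
    # for another regressor and compare the results.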

Conclusion

By leveraging the data, we have been able to predict the future using a variety of algorithms without having to write much code; Talend did that for us. You can download a sandbox that contains Talend and a Hadoop cluster with Spark and MapReduce already pre-configured. The sandbox also comes with a couple of other hands-on examples and jobs.


Download the sandbox from here: https://info.talend.com/prodevaltpbdsandbox2-0.html

About the author

Adrien has over a decade of experience in the IT industry with a focus on data management and digital, across almost every sector (retail, academic, automotive, banking / finance, insurance and telecommunications). Passionate about learning new tools, his latest discoveries are in Big Data and mathematical applications in IT such as Machine Learning. Adrien holds a Master of Science with a Business Intelligence specialisation from EISTI.
Adrien Follin