Apache Spark is currently one of the most popular open-source cluster-computing frameworks. Its Machine Learning Library (MLlib) makes it easy to scale a range of feature extraction and machine learning tasks commonly employed in text mining. Spark can also be used from both Python and R.
The tutorial will first cover the basics of using an Apache Spark cluster for text mining and machine learning. It will then walk through the text classification solution developed for the Hungarian leg of the Comparative Agendas Project, with the support of the MTA SZTAKI Cloud team, as a use case showing what the speed gains of parallel computing make possible.
Among other things, the tutorial will address: a) configuring the Apache Spark cluster; b) using a Hadoop Distributed File System (HDFS) with the cluster; c) operating the cluster via RStudio Server and sparklyr (the Spark interface for R developed by RStudio); and d) the differences in the Machine Learning Library functionality available through sparklyr, SparkR (the R API developed by the Apache Spark project), and PySpark (the Python API for Spark).