Apache Spark is a system for large-scale data processing used for Big Data applications business applications, but also in many scientific applications. Spark uses Java (or Scala) object serialization to transfer data over the network. Especially if data fits in memory, the performance of serialization is the most important bottleneck in Spark applications. Spark currently offers two mechanisms for serialization: Standard Java object serialization and Kryo serialization.
In the Ibis project (www.cs.vu.nl/ibis), we have developed an alternative serialization mechanism for high-performance computing applications that relies on compile-time code generation and zero-copy networking for increased performance. Performance of JVM serialization can also be compared with benchmarks: https://github.com/eishay/jvm-serializers/wiki. However, we also want to evaluate if we can increase Spark performance at the application level by using out improved object serialization system. In addition, our Ibis implementation can use fast local networks such as Infiniband transparently. We also want to investigate if using specialized networks increases application performance. Therefore, this project involves extending Spark with our serialization and networking methods (based on existing libraries), and on analyzing the performance of several real-world Spark applications.