Optimizing Spark Performance Through Intelligent Data Preprocessing | Gadi Goren, Veeva |

DataFlint (@Dataflint)

Published: November 13, 2025

Insights

This video provides an in-depth exploration of optimizing Apache Spark performance through intelligent data preprocessing, presented by Gadi Goren from Veeva. The core purpose of the talk is to share practical solutions for common data engineering challenges encountered in production environments, specifically within the context of processing large volumes of commercial and health-related data for pharmaceutical clients. Goren begins by establishing the business context, explaining that their clients are pharmaceutical companies keen on understanding the effectiveness of their advertising campaigns by combining commercial ad impression data with anonymous health data. The presentation details how an external data preprocessing layer, dubbed "Pioneer," significantly enhances the efficiency and reliability of subsequent Spark-based data processing.

The speaker delves into three primary problems that often plague Spark pipelines: the "small file problem" (many small input files), the "large file problem" (single or few very large input files), and the challenge of "schema evolution" (inconsistent schemas across input files). He explains how these issues lead to Spark driver overload, S3 slowdowns, inefficient task management, "straggler tasks" that leave most of the cluster idle, and job failures due to schema mismatches. The proposed solution introduces a dedicated preprocessing stage before data enters Spark, ensuring that Spark receives uniformly prepared, "Spark-ready" data. This stage handles tasks such as splitting oversized files, consolidating numerous small files into optimally sized ones (e.g., ~100 MB), converting data to efficient formats like Parquet, and managing schema alignment.
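
As a rough illustration of the consolidation step described above, the Python sketch below merges many small gzipped CSV files from S3 into roughly 100 MB Parquet files. It is a minimal sketch under stated assumptions, not the Pioneer code itself; the bucket, key, and prefix parameters, the gzipped-CSV input assumption, and the size heuristic (buffering by compressed input size) are all invented for illustration.

```python
# Minimal sketch of small-file consolidation, assuming gzipped CSV inputs in S3
# and a ~100 MB Parquet target size. Not the actual Pioneer implementation.
import io

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

TARGET_BYTES = 100 * 1024 * 1024  # ~100 MB per output file, per the talk
s3 = boto3.client("s3")


def consolidate(bucket: str, keys: list[str], out_prefix: str) -> None:
    """Merge many small gzipped CSVs into a few ~100 MB Parquet files."""
    frames, buffered, part = [], 0, 0
    for key in keys:
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        frames.append(pd.read_csv(io.BytesIO(body), compression="gzip"))
        buffered += len(body)  # compressed size as a rough proxy for output size
        if buffered >= TARGET_BYTES:
            _flush(bucket, out_prefix, frames, part)
            frames, buffered, part = [], 0, part + 1
    if frames:
        _flush(bucket, out_prefix, frames, part)


def _flush(bucket: str, out_prefix: str, frames: list, part: int) -> None:
    """Write the buffered rows out as a single Parquet object."""
    table = pa.Table.from_pandas(pd.concat(frames, ignore_index=True))
    sink = io.BytesIO()
    pq.write_table(table, sink, compression="snappy")
    s3.put_object(
        Bucket=bucket,
        Key=f"{out_prefix}/part-{part:05d}.parquet",
        Body=sink.getvalue(),
    )
```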

Goren elaborates on their implementation using AWS Step Functions with the Distributed Map feature, leveraging lightweight, I/O-bound containers to process data in parallel. This approach is highlighted as being fast, cost-effective, and highly scalable. The benefits demonstrated include significantly improved Spark efficiency and predictability, eliminating production slowdowns and crashes caused by unpredictable input data. Through a demo, the speaker illustrates the substantial speedup achieved by offloading these preprocessing tasks from Spark, allowing Spark to focus on its core strengths of data transformation and analysis rather than infrastructure management. The discussion also touches upon the evolution of their solution, moving from an all-Spark approach to this hybrid model after encountering severe performance and stability issues, particularly with schema merging.
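
To make the orchestration pattern concrete, here is a hedged sketch of the kind of worker a Distributed Map state might drive, assuming the map batches S3 object keys and delivers them to the worker under an "Items" key (the ItemBatcher convention). The 1 GiB split threshold and the helper functions are placeholders, not details from the talk.

```python
# Hedged sketch of a preprocessing worker driven by Step Functions Distributed
# Map. Assumes the map state uses ItemBatcher and delivers batched S3 keys
# under an "Items" key; thresholds and helpers are illustrative placeholders.
import boto3

s3 = boto3.client("s3")
GIB = 1024 ** 3


def split_large_file(bucket: str, key: str) -> None:
    """Placeholder: re-chunk an oversized file into ~100 MB pieces."""


def stage_for_consolidation(bucket: str, key: str) -> None:
    """Placeholder: record a small file so a later step can merge it."""


def handler(event, context):
    """Entry point for one Distributed Map batch of raw partner files."""
    items = event.get("Items", [])  # batch produced by the map's ItemBatcher
    for item in items:
        bucket, key = item["bucket"], item["key"]
        size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        if size > 1 * GIB:  # oversized input: route it to the splitter
            split_large_file(bucket, key)
        else:  # small input: stage it for consolidation
            stage_for_consolidation(bucket, key)
    return {"processed": len(items)}
```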

Key Takeaways:

  • Business Context for Data Processing: The speaker's company processes vast amounts of commercial advertising data combined with anonymous health data for pharmaceutical clients to assess campaign effectiveness, highlighting a critical use case for robust data pipelines in the life sciences sector.
  • Challenges of Unpredictable Data Input: Data arriving from external partners (DSPs, SSPs, publishers) is often inconsistent in terms of file size (many small files or very large files), format, and schema, leading to significant performance bottlenecks and failures in Spark.
  • The "Small File Problem" in Spark: Numerous small files cause Spark driver overload, excessive S3 API calls leading to slowdowns, and inefficient cluster utilization as Spark spends more time managing metadata and partitions than processing data.
  • The "Large File Problem" in Spark: Single, large, compressed files (e.g., GZIP, which is not splittable) can lead to "straggler tasks" where one executor works intensely while others remain idle, causing severe slowdowns or out-of-memory errors and cluster crashes.
  • Schema Evolution and Inconsistency: Varying schemas across input files can cause Spark to infer incorrect schemas, leading to data corruption or job failures when encountering unexpected data types. Spark's internal schema merge process is often expensive and inefficient; see the PySpark sketch after this list.
  • Solution: External Data Preprocessing Layer: Implement a dedicated preprocessing stage before data enters Spark. This stage prepares "Spark-ready data" by standardizing file sizes, formats, and schemas, allowing Spark to operate more efficiently and predictably.
  • Preprocessing Operations: Key preprocessing tasks include splitting large files into smaller, manageable chunks; consolidating many small files into optimally sized files (e.g., ~100 MB); converting data to efficient columnar formats like Parquet; and harmonizing schemas.
  • AWS-Based Implementation: The specific solution leverages AWS Step Functions with its Distributed Map feature, running parallel processes on lightweight, I/O-bound containers. This approach is described as fast, cost-effective, and scalable for handling massive data volumes.
  • Significant Performance Gains: Demonstrations show substantial speedups (e.g., 5x for the small file problem and 3x for schema merging) when preprocessing is done externally, validating the investment in this additional stage.
  • Predictable Spark Performance: The preprocessing layer ensures consistent input for Spark, leading to predictable job runtimes and stability, regardless of whether processing daily incremental data or large historical backfills, thereby preventing unexpected production issues.
  • Cost-Effectiveness of Preprocessing: Despite being an additional step, the preprocessing stage is very low-cost and quick (minutes to tens of minutes for data volumes that would take hours in Spark), making it a worthwhile investment for overall pipeline efficiency.
  • Handling S3 Limits: The external preprocessing allows for controlled parallelism, preventing S3 slowdowns by staying within read limits (e.g., ~500 files/second), unlike dynamic Spark clusters that might exceed these limits.
  • Evolutionary Solution Development: The current preprocessing approach was developed iteratively after encountering severe limitations and failures when attempting to handle all data quality and formatting issues directly within Spark.
  • Data Lake Technology: The company utilizes Apache Hudi as its data lake format, indicating a focus on transactional data lakes and efficient data management.
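
The PySpark sketch below (referenced from the schema bullet above) contrasts the two read paths: asking Spark to merge inconsistent Parquet schemas at read time versus reading pre-aligned, Spark-ready data with an explicit schema. The column names and S3 paths are invented for illustration, not taken from the talk.

```python
# Illustration of the schema point; column names and S3 paths are invented.
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Costly path: ask Spark to reconcile differing Parquet schemas at read time.
raw = spark.read.option("mergeSchema", "true").parquet("s3://raw-bucket/impressions/")

# Cheap path: preprocessing has already aligned every file to one schema,
# so Spark reads with an explicit schema and skips inference and merging.
aligned_schema = StructType([
    StructField("impression_id", StringType()),
    StructField("campaign_id", StringType()),
    StructField("event_time", LongType()),
])
clean = spark.read.schema(aligned_schema).parquet("s3://spark-ready-bucket/impressions/")
```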

Tools/Resources Mentioned:

  • Apache Spark: The primary big data processing framework being optimized.
  • AWS Step Functions: An AWS service used to coordinate distributed applications and microservices, specifically for orchestrating the preprocessing workflow.
  • AWS Step Functions Distributed Map: A feature within Step Functions that allows for running many parallel iterations of a step, ideal for processing large datasets.
  • AWS S3: Amazon Simple Storage Service, used for storing raw input data from partners and processed data.
  • Parquet: A columnar storage file format optimized for analytics, used for storing "Spark-ready" data.
  • Apache Hudi: A data lake platform that enables transactional data lakes, mentioned as the format for their data lake.

Key Concepts:

  • Data Preprocessing: The process of transforming raw data into a clean and organized format suitable for analysis or further processing. In this context, it specifically refers to preparing data for optimal ingestion by Apache Spark.
  • Small File Problem: A common issue in big data systems where processing many small files leads to high overhead for metadata management, inefficient resource utilization, and performance degradation.
  • Large File Problem: The challenge of processing extremely large files, especially compressed ones, which can lead to single-node bottlenecks ("straggler tasks") and resource contention in distributed systems; see the sketch after this list.
  • Schema Evolution: The process of adapting to changes in the structure (schema) of data over time. Inconsistent schema evolution can break data pipelines if not handled gracefully.
  • Spark Driver Overload: A state where the Spark driver, responsible for coordinating tasks, becomes overwhelmed by the volume of metadata or tasks, leading to slowdowns or crashes.
  • Straggler Tasks: Tasks in a distributed computing environment that take significantly longer to complete than others, often due to data skew or resource contention, slowing down the entire job.
  • I/O Bound: A process or system whose performance is limited by the speed of input/output operations (e.g., reading from or writing to disk/network) rather than CPU processing.
  • Spark-Ready Data: Data that has been preprocessed and formatted in a way that is highly optimized for ingestion and processing by Apache Spark, ensuring efficiency and stability.
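
Finally, a hypothetical PySpark snippet illustrating the large-file concept above: a single gzipped CSV cannot be split, so Spark reads it with one task, while the same data rewritten as multiple Parquet files fans out across the cluster. The paths are made up for illustration.

```python
# Hypothetical paths. A single gzipped CSV is not splittable, so Spark reads it
# with one task; the same data as multiple Parquet files parallelizes naturally.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("straggler-demo").getOrCreate()

one_big_gzip = spark.read.csv("s3://raw-bucket/big_export.csv.gz", header=True)
print(one_big_gzip.rdd.getNumPartitions())  # 1 -> a single straggler task

presplit = spark.read.parquet("s3://spark-ready-bucket/big_export/")
print(presplit.rdd.getNumPartitions())  # multiple tasks, roughly one per file
```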