For some workloads, it is possible to improve performance by either caching data in memory, or by turning on some experimental options.

Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call spark.catalog.uncacheTable("tableName") to remove the table from memory.

Configuration of in-memory caching can be done using the setConf method on SparkSession or by running SET key=value commands using SQL.

spark.sql.inMemoryColumnarStorage.compressed: When set to true, Spark SQL will automatically select a compression codec for each column based on statistics of the data.

spark.sql.inMemoryColumnarStorage.batchSize: Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.

The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in a future release as more optimizations are performed automatically.

spark.sql.files.maxPartitionBytes: The maximum number of bytes to pack into a single partition when reading files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

spark.sql.files.openCostInBytes: The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. This is used when putting multiple files into a partition. It is better to over-estimate; then the partitions with small files will be scheduled faster than partitions with bigger files. This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.
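As the text notes, these options can be set at runtime with SET key=value commands using SQL. A minimal sketch; the key names come from the descriptions above, while the specific values are illustrative examples, not tuned recommendations:

```sql
-- Sketch: configuring in-memory caching and file-read partitioning via SQL.
-- Values are illustrative (134217728 bytes = 128 MB, 4194304 bytes = 4 MB).
SET spark.sql.inMemoryColumnarStorage.compressed=true;
SET spark.sql.inMemoryColumnarStorage.batchSize=10000;
SET spark.sql.files.maxPartitionBytes=134217728;
SET spark.sql.files.openCostInBytes=4194304;
```

The same settings can also be applied programmatically on a SparkSession, for example spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728").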