Controlling log messages in PySpark

Sample log4j.properties

# Set the root logger level to ERROR to suppress WARN, INFO, and DEBUG messages.
log4j.rootCategory=ERROR, console
   
# Define the console appender (where logs will be printed to the console)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
  
# Set specific log levels for Spark components.
# A level set on a named logger overrides the root level for that package.

# Set the log level for the org.apache.spark package to ERROR.
log4j.logger.org.apache.spark=ERROR
  
# You can also set levels for more specific Spark components if needed.  
# For example, to silence excessive DAGScheduler INFO messages, you could try:  
# log4j.logger.org.apache.spark.scheduler.DAGScheduler=WARN  
  
# Similarly, for BlockManager messages:  
# log4j.logger.org.apache.spark.storage.BlockManager=WARN  
  
# If you still see INFO messages from other libraries (e.g., Netty, Hadoop), you can add more specific loggers.  
# For example, to reduce Netty INFO messages:  
# log4j.logger.org.apache.spark.network.netty=WARN  
# log4j.logger.io.netty=WARN  
  
# Example of setting level for Hadoop (if you see Hadoop related INFO messages)  
# log4j.logger.org.apache.hadoop=WARN

Explanation of common settings

log4j.rootCategory

  • Defines the root logger.
  • First part (WARN, ERROR, etc.): Logging level for the root logger. Messages at or above this level will be processed.
  • Second part (console): Appenders to use for the root logger (in this case, ‘console’ which we define below).
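
The "at or above this level" rule can be illustrated with a short Python sketch. This is not log4j code: the LEVELS list and the is_enabled helper are hypothetical names that only mirror the log4j 1.x severity order (TRACE, DEBUG, INFO, WARN, ERROR, FATAL).

```python
# Severity order used by log4j 1.x, lowest to highest.
LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]


def is_enabled(message_level: str, threshold: str) -> bool:
    """Return True if a message at message_level passes the threshold."""
    return LEVELS.index(message_level) >= LEVELS.index(threshold)


# With the root logger at ERROR, only ERROR and FATAL messages get through.
for level in LEVELS:
    status = "printed" if is_enabled(level, "ERROR") else "suppressed"
    print(f"{level:5s} -> {status}")
```

Changing the threshold argument to "WARN" shows how a less strict root level lets WARN messages through as well.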

log4j.appender.console

  • Defines an appender named ‘console’.
    • org.apache.log4j.ConsoleAppender: Specifies that this appender writes to the console.
    • log4j.appender.console.target=System.out: Output stream for the console appender (System.out is standard output).
    • log4j.appender.console.layout: Defines the layout for log messages in this appender.
    • org.apache.log4j.PatternLayout: Uses a pattern to format the log messages.
    • log4j.appender.console.layout.ConversionPattern: The actual pattern for formatting.
      • %d{yy/MM/dd HH:mm:ss}: Date and time format.
      • %p: Log level (WARN, ERROR, etc.).
      • %c{1}: Logger name shortened to its last component (usually the class name).
      • %m: The log message itself.
      • %n: Newline character.
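
To see what a line produced by this pattern looks like, the following Python sketch approximates the conversion pattern with standard-library formatting. The render function, the sample timestamp, and the message text are hypothetical; log4j itself does the real formatting, and %n emits the platform line separator rather than always "\n".

```python
from datetime import datetime


def render(timestamp: datetime, level: str, logger_name: str, message: str) -> str:
    """Approximate '%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n' in Python."""
    date_part = timestamp.strftime("%y/%m/%d %H:%M:%S")  # %d{yy/MM/dd HH:mm:ss}
    short_name = logger_name.rsplit(".", 1)[-1]          # %c{1}: last component only
    return f"{date_part} {level} {short_name}: {message}\n"  # %p, %m, %n


line = render(
    datetime(2024, 1, 15, 9, 30, 0),
    "WARN",
    "org.apache.spark.scheduler.DAGScheduler",
    "hypothetical message",
)
print(line, end="")
# prints: 24/01/15 09:30:00 WARN DAGScheduler: hypothetical message
```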

log4j.logger.package.name

  • Sets the log level for a specific package (or class); loggers beneath it inherit that level unless they set their own.
  • e.g., log4j.logger.org.apache.spark=WARN: Sets the level for all loggers under the org.apache.spark package to WARN.
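
How a package-level setting reaches the loggers beneath it can be sketched in Python. This is a simplified model of the lookup, not log4j's implementation: effective_level and the configured dictionary are hypothetical, and the root level of INFO in the last call is chosen only to make the fallback visible.

```python
from typing import Dict


def effective_level(logger_name: str, configured: Dict[str, str], root_level: str) -> str:
    """Walk up the dotted logger name until a configured level is found."""
    name = logger_name
    while name:
        if name in configured:
            return configured[name]
        # Drop the last dotted component: a.b.c -> a.b
        name = name.rpartition(".")[0]
    # No ancestor is configured, so the root logger's level applies.
    return root_level


configured = {
    "org.apache.spark": "ERROR",
    "org.apache.spark.scheduler.DAGScheduler": "WARN",
}

# Inherits ERROR from org.apache.spark.
print(effective_level("org.apache.spark.storage.BlockManager", configured, "ERROR"))
# Its own, more specific setting wins.
print(effective_level("org.apache.spark.scheduler.DAGScheduler", configured, "ERROR"))
# Nothing configured for io.netty, so the root level is used.
print(effective_level("io.netty.channel.Channel", configured, "INFO"))
```

The most specific matching entry always wins, which is why a single line for org.apache.spark is usually enough and the per-class lines in the sample are left commented out.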

Sample use

log_file=~/software/pyspark-env/log4j.properties

spark-submit \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file://${log_file}" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file://${log_file}" \
    src/test_2.py
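
The same command can be assembled from Python, which avoids shell quoting and line-continuation pitfalls. This is a minimal sketch: build_spark_submit is a hypothetical helper, and it assumes a Spark release that still reads log4j 1.x properties files (releases from 3.3 onward use Log4j 2, which expects a different file format and system property).

```python
from pathlib import Path
from typing import List


def build_spark_submit(log_file: Path, script: str) -> List[str]:
    """Build the spark-submit argument list pointing at a log4j 1.x config file."""
    # as_uri() requires an absolute path and yields a file:///... URI.
    uri = log_file.expanduser().resolve().as_uri()
    option = f"-Dlog4j.configuration={uri}"
    return [
        "spark-submit",
        "--conf", f"spark.driver.extraJavaOptions={option}",
        "--conf", f"spark.executor.extraJavaOptions={option}",
        script,
    ]


cmd = build_spark_submit(Path("~/software/pyspark-env/log4j.properties"), "src/test_2.py")
print(" ".join(cmd))
# To launch it (requires spark-submit on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list means no shell is involved, so the configuration path needs no extra quoting.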