Technical Note on cPyMemTrace Performance

pymemtrace.cPyMemTrace contains Python profilers and tracers written in ‘C’ that logs the runtime information including the Resident Set Size for every Python and C call and return or object allocation and de-allocation as required.

Test Program and Data

This tests the performance of cPyMenTrace with a real world program. The program reads remote sensing structured text files containing static data and dynamic arrays of floating point data and summarised them as HTML pages. The input was 3 files totalling 1.3Mb.

Platform was a Mac Book Pro (late 2018) 2.7 GHz Intel Core i7 running macOS Ventura 13.5.

The command used was:

time tdlastohtml --log-process=0.25  example_data/LAS/data tmp/LASTOHTML_A

Profiling

The pymemtrace.cPyMemTrace.Profile class that logs Python and C code. This ignores Python line, opcode and exception events. pymemtrace provides a decorator pymemtrace.cpymemtrace_decs.profile() that will profile a function. For example:

from pymemtrace import cpymemtrace_decs

@cpymemtrace_decs.profile(
    message="LASToHTML Profile d_rss_trigger=0",
    d_rss_trigger=0,
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

In this exercise there were 4.4m profiling events. When d_rss_trigger=-1 then only events that change the RSS by >= the page size (4096 bytes) are reported. When d_rss_trigger=0 then all events are reported.

Profiling Performance

Here is the cost of profiling given various options:

cPyMemTrace Profiling Performance

Setup

Time (s)

Time Rel.

Log Size

Log Lines

Notes

No Profiling

0.765

1x

0

0

Baseline execution.

Profile

6.81

8.9x

399K

2,095

d_rss_trigger=-1

Profile, all events

32.4

42x

831M

4,428,053

d_rss_trigger=0

As expected reporting every event is about four times as expensive (in time) compared to those events that trigger large changes in the RSS.

Tracing

The pymemtrace.cPyMemTrace.Trace class that logs pure Python code, it ignores C call, and return events. pymemtrace provides a decorator pymemtrace.cpymemtrace_decs.trace() that will profile a function. For example:

from pymemtrace import cpymemtrace_decs

@cpymemtrace_decs.trace(
    message="LASToHTML Trace d_rss_trigger=0",
    d_rss_trigger=0,
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

In this exercise there were 7m profiling events. When d_rss_trigger=-1 then only events that change the RSS by >= the page size (4096 bytes) are reported. When d_rss_trigger=0 then all events are reported.

Tracing Performance

Here is the cost of tracing given various options:

cPyMemTrace Tracing Performance

Setup

Time (s)

Time Rel.

Log Size

Log Lines

Notes

No Tracing

0.765

1x

0

0

Baseline execution.

Trace

13.0

17x

443K

2,326

d_rss_trigger=-1

Trace, all

47.6

62x

1.3G

7,044,826

d_rss_trigger=0

This is similar to Profiling but the event count is much larger and that shows up in the runtime cost.

Reference Tracing

The pymemtrace.cPyMemTrace.ReferenceTracing class that logs every allocation and de-allocation. The log file format is different to the Profile/Traace file format. See cPyMemTrace Log File Format for the details. pymemtrace provides a decorator pymemtrace.cpymemtrace_decs.reference_tracing() that will profile a function.

Here are the code examples, firstly reference tracing all except the builtin types (int, str etc.). This is the default:

@cpymemtrace_decs.reference_tracing(
    message="LASToHTML Reference Tracing include_builtins=False",
    include_builtins=False,
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

Then reference tracing including all the Python builtins:

@cpymemtrace_decs.reference_tracing(
    message="LASToHTML Reference Tracing include_builtins=True",
    include_builtins=True,
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

Now reference tracing but only reporting on a specific type and no builtin types:

@cpymemtrace_decs.reference_tracing(
    message=(
        "LASToHTML Reference Tracing include_builtins=False"
        " include_tp_names=['LASSection',]"
    ),
    include_builtins=False,
    include_tp_names=['LASSection',],
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

Now reference tracing but only reporting on a specific type and no builtin types with no garbage collection run:

@cpymemtrace_decs.reference_tracing(
    message=(
        "LASToHTML Reference Tracing include_builtins=False"
        " include_tp_names=['LASSection',]"
    ),
    include_builtins=False,
    include_tp_names=['LASSection',],
    gc_collect_on_exit=-1,
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

Finally reference tracing but only reporting on a specific type and no builtin types but with a full Garbage Collection run before closing the log file:

@cpymemtrace_decs.reference_tracing(
    message=(
        "LASToHTML Reference Tracing include_builtins=False"
        " include_tp_names=['LASSection',]"
        " gc_collect_on_exit=2"
    ),
    include_builtins=False,
    include_tp_names=['LASSection',],
    gc_collect_on_exit=2, # The default.
)
def process_arguments(args, log_level):
    # Execution code here.
    pass

Reference Tracing Performance

Here are the results compared to the baseline of no tracing:

cPyMemTrace Reference Tracing Performance

Setup

Time (s)

Time Rel.

Log Size

Log Lines

Notes

No Tracing

0.765

1x

0

0

Baseline execution.

Ref Trace, no builtins

27.3

36x

170M

700K

include_builtins=False

Ref Trace, with builtins

59.9

78x

1.0G

4M

include_builtins=True

Ref Trace, single class

1.573

2.0x

3.2K

16

include_tp_names=['LASSection',] gc_collect_on_exit=-1

Ref Trace, single class, with GC

1.841

2.4x

6.1K

28

include_tp_names=['LASSection',]

Reference tracing is extremely invasive as it records every Python object allocation and de-allocation. Including builtins can double the executions times compared to ignoring them.

Reference tracing a single type is remarkably effective and efficient. The Garbage Collection run adds very little to the execution time.