#java #memory #panama-ffm #exeris #jvm-performance #off-heap

ByteBuffer Solves Half the Problem: The LoanedBuffer Pattern

Direct ByteBuffer gives you zero-copy. It does not give you deterministic cleanup. In a runtime where memory ownership is part of the architecture, that gap is structural - not a tuning detail.

This is the seventh article in the Exeris Kernel series.

TL;DR: Direct ByteBuffer gives you zero-copy but defers cleanup to the GC. Arena gives you deterministic cleanup but scopes ownership to a single region. Neither models a buffer shared across threads with a lifetime longer than any single fork. LoanedBuffer is the third option Exeris needed: explicit reference counting, try-with-resources for the boring case, and EX-MEM-1003 when discipline fails. The cost is honest - the compiler will not catch a missed retain() before a fork().

You can move TLS off the heap. You can ban ThreadLocal and replace it with ScopedValue. You can structure your concurrency with StructuredTaskScope. You can push the TLS boundary into Panama FFM so that cipher operations no longer allocate. And you will still find allocation pressure on the hot path - because somewhere, something is moving bytes through a byte[].

This article is about that boundary.

It is also where one specific JVM-era assumption finally breaks: that direct ByteBuffer is “the off-heap one.” Direct ByteBuffer solves zero-copy. It does not solve ownership. In a runtime where memory lifecycle is supposed to be part of the architecture, those are not the same problem.


The Constraint

When I started designing the memory subsystem for Exeris, I had two constraints that had to hold simultaneously on the request hot path:

  1. Zero copy. Bytes coming off a socket, through TLS, through HTTP framing, into a request handler - none of those steps may allocate a new array and copy.
  2. Deterministic cleanup. When work on a buffer is done, native memory must be released now, not whenever the GC notices.

Most JVM memory abstractions give you exactly one of these.

byte[] gives you neither. Heap allocation, GC-managed lifecycle, and a copy every time you cross a native boundary.

ByteBuffer.wrap(byte[]) is a heap buffer with a different name. Same problems.

ByteBuffer.allocateDirect(n) gives you (1) but not (2). The memory is off-heap, so crossing into native code does not require a copy. But that memory is freed only after the GC finds the ByteBuffer object unreachable and its Cleaner runs. Under load, this means buffers can survive arbitrarily long after the work is done. You do not control when. You cannot ask. There is no close().

Arena.allocate(layout) from Panama FFM gives you (2) but with a coarser ownership model. An Arena owns a region of memory; closing the arena releases everything in it. That is fine for a request lifetime. It is less fine when a buffer is shared between threads, or transferred from one task to another and released by the second, or part of an in-flight queue.

I needed both. And I needed them composable.


ByteBuffer Solves Half the Problem

What stopped me from using ByteBuffer directly was not API ergonomics. It was that ByteBuffer does not have an ownership model.

It has an access model - position, limit, capacity, slice, duplicate - but nothing that answers the question “who is responsible for releasing this memory, and when?”. Direct buffers defer that question to the GC. Heap-backed buffers defer it to the GC twice - once for the buffer object, once for the array it wraps.
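A small sketch of what "access model without ownership model" looks like in practice. It uses only standard java.nio calls; the class and method names are illustrative and not part of Exeris.

```java
import java.nio.ByteBuffer;

final class AliasingDemo {
    /**
     * Writes through one view and reads through another.
     * Both views share the same native memory; neither one "owns" it.
     */
    static byte writeThroughDuplicate() {
        ByteBuffer original = ByteBuffer.allocateDirect(16);
        ByteBuffer view = original.duplicate();   // own position/limit, same memory

        view.put(0, (byte) 42);                   // absolute put through the view
        view.position(8);                         // moves only the view's cursor

        if (original.position() != 0) {
            throw new IllegalStateException("position is tracked per view");
        }
        return original.get(0);                   // the write is visible here
    }
}
```

Nothing in the API says which of the two references is responsible for the memory or when it may be freed. Both stay valid until the GC decides neither is reachable.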

That works fine when allocations are infrequent and lifetimes are obvious. It breaks when bytes are flowing through one-virtual-thread-per-stream concurrency at network speed and a Cleaner thread is the only thing standing between you and a slow, silent native memory leak.

The standard JVM workaround is buffer pooling: keep a queue (for example a ConcurrentLinkedQueue) of direct ByteBuffer instances, hand them out, and trust callers to return them. Pooling works in practice, and frameworks like Netty are built on it. It also reintroduces the exact problem the GC was trying to solve: explicit lifecycle management, with the additional twist that forgetting to return is now silent - the pool simply never sees the buffer again, and nothing reports the missed return.
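To make the silent failure concrete, here is a minimal sketch of such a pool. NaiveDirectBufferPool and its outstanding counter are hypothetical; the counter exists only so the example can show the missing return, and a plain queue-based pool would not even have that.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class NaiveDirectBufferPool {
    private final ConcurrentLinkedQueue<ByteBuffer> free = new ConcurrentLinkedQueue<>();
    private final AtomicInteger outstanding = new AtomicInteger();
    private final int bufferSize;

    NaiveDirectBufferPool(int bufferSize, int preallocated) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < preallocated; i++) {
            free.add(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    ByteBuffer acquire() {
        ByteBuffer buffer = free.poll();
        if (buffer == null) {
            // Pool is empty: grow silently instead of failing.
            buffer = ByteBuffer.allocateDirect(bufferSize);
        }
        outstanding.incrementAndGet();
        return buffer;
    }

    void release(ByteBuffer buffer) {
        buffer.clear();
        free.add(buffer);
        outstanding.decrementAndGet();
    }

    int outstanding() {
        return outstanding.get();
    }

    /** Acquires two buffers but returns only one, as a forgetful caller would. */
    static int demoForgottenReturn() {
        NaiveDirectBufferPool pool = new NaiveDirectBufferPool(1024, 2);
        ByteBuffer returned = pool.acquire();
        ByteBuffer forgotten = pool.acquire();
        pool.release(returned);
        // 'forgotten' is never released: no exception, no log line.
        return pool.outstanding();
    }
}
```

The pool ends up one buffer short and the only trace is a counter that nobody is obliged to read.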

What I wanted was the lifecycle discipline of an Arena combined with the flexibility of a pooled buffer - and a way to share ownership across threads without either a lock or a leak.

That is what LoanedBuffer is. The rest of this article is what it cost.

Figure 1: Three ownership models - ByteBuffer (GC-coupled), Arena (scope-bounded), and LoanedBuffer (reference-counted with explicit close).

The Loan Pattern

The basic shape is unsurprising. LoanedBuffer implements AutoCloseable. You allocate via the SPI. You use try-with-resources:

try (LoanedBuffer buffer = allocator.allocate(AllocationHint.MEDIUM)) {
    buffer.writeBytes(payload, 0, payload.length);
    transport.send(buffer);
}

Three things are happening here that ByteBuffer does not give you.

First, the allocator is injected via SPI. The application code does not know whether the underlying allocator is a slab pool, a partitioned arena, or a specialized native pool optimized for a specific transport. Implementation blindness is preserved - Core operates exclusively on the MemoryAllocator contract, resolved at bootstrap via ServiceLoader:

MemoryAllocator allocator = ServiceLoader.load(MemoryAllocator.class)
        .findFirst()
        .orElseThrow(() -> new KernelBootstrapException(KernelErrorCodes.EX_BOOT_0002));

Second, AllocationHint is a typed enum, not a raw byte count. The hint tells the allocator which size class is wanted (SMALL, MEDIUM, LARGE, NETWORK_FRAME). The allocator picks a slab from the matching pool. There is no math at the call site, no rounding, no “did I just trigger a slow path?”.

Third, close() is deterministic and immediate. When the try block exits, the slab returns to its pool now. The watermark manager updates now. There is no Cleaner thread, no PhantomReference, no waiting for GC.
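The single-owner case can be sketched in a few dozen lines. Everything below is hypothetical: SketchHint, SketchLoan and SketchSlabAllocator are stand-ins for illustration, the size-class byte values are invented, and the real Exeris SPI will differ. The sketch is not thread-safe.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.EnumMap;
import java.util.Map;

/** Size classes; the byte values are invented for this sketch. */
enum SketchHint {
    SMALL(256), MEDIUM(4096), LARGE(65536);

    final int bytes;

    SketchHint(int bytes) {
        this.bytes = bytes;
    }
}

/** Single-owner loan: close() hands the slab straight back to its pool. */
final class SketchLoan implements AutoCloseable {
    private final SketchSlabAllocator owner;
    private final SketchHint hint;
    private ByteBuffer slab;

    SketchLoan(SketchSlabAllocator owner, SketchHint hint, ByteBuffer slab) {
        this.owner = owner;
        this.hint = hint;
        this.slab = slab;
    }

    void writeBytes(byte[] src, int offset, int length) {
        slab.put(src, offset, length);
    }

    @Override
    public void close() {
        if (slab != null) {            // idempotent in this sketch
            owner.giveBack(hint, slab);
            slab = null;
        }
    }
}

/** One stack of preallocated direct slabs per size class. */
final class SketchSlabAllocator {
    private final Map<SketchHint, ArrayDeque<ByteBuffer>> pools = new EnumMap<>(SketchHint.class);

    SketchSlabAllocator(int slabsPerClass) {
        for (SketchHint hint : SketchHint.values()) {
            ArrayDeque<ByteBuffer> pool = new ArrayDeque<>();
            for (int i = 0; i < slabsPerClass; i++) {
                pool.push(ByteBuffer.allocateDirect(hint.bytes));
            }
            pools.put(hint, pool);
        }
    }

    SketchLoan allocate(SketchHint hint) {
        ByteBuffer slab = pools.get(hint).poll();
        if (slab == null) {
            throw new IllegalStateException("pool exhausted: " + hint);
        }
        return new SketchLoan(this, hint, slab);
    }

    void giveBack(SketchHint hint, ByteBuffer slab) {
        slab.clear();
        pools.get(hint).push(slab);
    }

    int free(SketchHint hint) {
        return pools.get(hint).size();
    }

    /** Returns the free-slab count seen inside and after a try-with-resources block. */
    static int[] demo() {
        SketchSlabAllocator allocator = new SketchSlabAllocator(2);
        int inside;
        try (SketchLoan loan = allocator.allocate(SketchHint.MEDIUM)) {
            loan.writeBytes(new byte[] {1, 2, 3}, 0, 3);
            inside = allocator.free(SketchHint.MEDIUM);
        }
        return new int[] {inside, allocator.free(SketchHint.MEDIUM)};
    }
}
```

The slab is back in its pool the moment the try block exits, with no dependency on reachability or GC timing.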

That is the boring, single-owner case. The pattern earns its name in the next case - the one ByteBuffer cannot model at all.


Reference Counting with VarHandle

Inside the kernel, a buffer often has more than one logical owner. The transport layer wants to hold it while the request handler is reading. The handler wants to hold it while async work is in flight. The async work might want to retain it for a follow-up operation.

LoanedBuffer solves this with explicit reference counting. Allocation starts the count at one. retain() increments. close() decrements. When the count reaches zero, the slab returns to the pool.

The implementation uses VarHandle for the CAS path. No synchronized, no AtomicInteger allocation per buffer, no monitor inflation. The reference count is a primitive int field on the buffer itself, accessed through a class-level VarHandle:

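public final void retain() {
    // Atomic increment through the class-level VarHandle; no lock, no wrapper object.
    REF_COUNT_HANDLE.getAndAdd(this, 1);
}

public final void close() {
    // getAndAdd returns the value held before the decrement.
    int previous = (int) REF_COUNT_HANDLE.getAndAdd(this, -1);
    if (previous == 1) {
        // This call released the last reference: return the slab to its pool.
        fireCloseActions();
    }
}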

Calling close() more times than the allocation plus its retain() calls allow is a fatal contract violation. Calling retain() on a non-owning view - for example, a peek() slice that exposes a memory region without transferring ownership - is also a contract violation. When a view is retained, the kernel emits EX-MEM-1003 (Peek View Ownership Misuse) as a glass-box telemetry event, with the calling method captured in rawArgs[0]. The call itself is a no-op: it neither increments the count nor returns silently. It is logged and refused.

The point of refusing is not to be punitive. It is that an unobserved retain()-on-view bug becomes a use-after-free somewhere else, on a different thread, at an unpredictable time. Failing fast and loudly at the misuse site makes the bug local instead of distributed.
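The snippet above leaves out how the class-level VarHandle is obtained. The following is a self-contained sketch of that wiring, using the standard MethodHandles.lookup().findVarHandle call. RefCountedSketch is a hypothetical class, and the CAS loops that reject retain() after release and over-close are my own variation for illustration, not the kernel's code, which uses getAndAdd as shown earlier.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class RefCountedSketch implements AutoCloseable {
    private static final VarHandle REF_COUNT;

    static {
        try {
            REF_COUNT = MethodHandles.lookup()
                    .findVarHandle(RefCountedSketch.class, "refCount", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @SuppressWarnings("unused")          // read and written only through REF_COUNT
    private volatile int refCount = 1;   // allocation starts the count at one
    private int releases;                // how many times release() ran

    final void retain() {
        for (;;) {
            int current = (int) REF_COUNT.getVolatile(this);
            if (current <= 0) {
                throw new IllegalStateException("retain() after release");
            }
            if (REF_COUNT.compareAndSet(this, current, current + 1)) {
                return;
            }
        }
    }

    @Override
    public final void close() {
        for (;;) {
            int current = (int) REF_COUNT.getVolatile(this);
            if (current <= 0) {
                throw new IllegalStateException("close() without a matching reference");
            }
            if (REF_COUNT.compareAndSet(this, current, current - 1)) {
                if (current == 1) {
                    release();           // last reference gone
                }
                return;
            }
        }
    }

    /** Stand-in for returning the slab to its pool. */
    protected void release() {
        releases++;
    }

    /** Returns {releases after first close, releases after second close, 1 if over-close was rejected}. */
    static int[] demo() {
        RefCountedSketch buffer = new RefCountedSketch();
        buffer.retain();                 // count 2
        buffer.close();                  // count 1
        int afterFirst = buffer.releases;
        buffer.close();                  // count 0, release() runs
        int afterSecond = buffer.releases;
        int overCloseRejected = 0;
        try {
            buffer.close();
        } catch (IllegalStateException expected) {
            overCloseRejected = 1;
        }
        return new int[] {afterFirst, afterSecond, overCloseRejected};
    }
}
```

The CAS loop costs a retry under contention, but it makes the misuse fail at the call site instead of driving the count negative.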


The Async Transfer Problem

This is the case that motivated the design. I never had to debug it in production - I caught it on paper while sketching the ownership model, and the design followed from there.

A request arrives. The handler reads it into a buffer. The handler then forks two async sub-tasks under a StructuredTaskScope: one to validate, one to enrich. Both sub-tasks need to read the same buffer. The handler joins both, then serializes the response.

In the standard JVM model, this is a sharing problem with no good answer. If you pass a ByteBuffer to two virtual threads, you have just created an aliasing hazard with no concurrency model. If you copy the buffer twice, you have just defeated zero-copy.

In the LoanedBuffer model, sharing is explicit:

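try (var scope = StructuredTaskScope.open(Joiner.awaitAllSuccessfulOrThrow())) {
    try (LoanedBuffer buffer = allocator.allocate(AllocationHint.NETWORK_FRAME)) {
        // Take the sub-task's reference before forking: refCount 1 -> 2.
        buffer.retain();

        scope.fork(() -> {
            try {
                return processAsync(buffer);
            } finally {
                // Release the sub-task's reference: refCount 2 -> 1.
                buffer.close();
            }
        });

        scope.join();
    }   // Parent's reference released here: refCount 1 -> 0, slab returns to the pool.
}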

The pattern is:

  1. The allocator returns the buffer with refCount = 1.
  2. Before forking, the parent calls retain(). Count is now 2.
  3. The parent forks the sub-task. The sub-task runs concurrently.
  4. The sub-task closes the buffer when done. Count drops to 1.
  5. The parent’s try-with-resources closes when the outer block exits. Count drops to 0. Slab returns to the pool.

This is dependency-safe because retain() happened before the fork. If a caller forgets the retain(), the parent’s close() can race the sub-task’s read, and the sub-task observes a slab that has already been recycled. The kernel catches this in its TCK suite, but the contract is the caller’s to honor - there is no automatic retain on fork. I considered making scope.fork() automatically retain the buffer if a special wrapper type was passed in, but the cost was introducing a parallel API surface for what is fundamentally a discipline issue. The current design keeps the rule visible at the call site: if you fork it, you retain it first.
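The missed-retain failure is easy to reproduce in miniature. The sketch below swaps in a plain ExecutorService for StructuredTaskScope (which is still a preview API) and an AtomicInteger for the VarHandle field, and it uses a latch to force the worst-case ordering where the parent closes before the child reads. All names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class RetainBeforeForkSketch {

    /** Toy refcounted buffer; "recycled" stands in for the slab returning to the pool. */
    static final class Buf implements AutoCloseable {
        private final AtomicInteger refs = new AtomicInteger(1);
        private volatile boolean recycled;

        void retain() {
            refs.incrementAndGet();
        }

        boolean isRecycled() {
            return recycled;
        }

        @Override
        public void close() {
            if (refs.decrementAndGet() == 0) {
                recycled = true;
            }
        }
    }

    /** Returns true if the child task observed a live (not yet recycled) buffer. */
    static boolean childSeesLiveBuffer(boolean retainBeforeFork) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch parentClosed = new CountDownLatch(1);
        try {
            Buf buffer = new Buf();
            if (retainBeforeFork) {
                buffer.retain();                   // the child's reference
            }
            Future<Boolean> child = executor.submit(() -> {
                try {
                    parentClosed.await();          // read only after the parent has closed
                    return !buffer.isRecycled();
                } finally {
                    if (retainBeforeFork) {
                        buffer.close();            // release the child's reference
                    }
                }
            });
            buffer.close();                        // parent releases its own reference
            parentClosed.countDown();
            return child.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

With the retain in place the child reads a live buffer regardless of when the parent closes. Without it, the parent's close drops the count to zero and the child reads memory that has already been handed back.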

This is also the place where the STS-bootstrap article’s pattern pays off again. There, the structured scope owned a startup round - a bounded unit of work with a clear lifetime. Here, the structured scope owns a fan-out unit with the same clarity, but with an additional resource - the buffer - whose lifetime is longer than any single fork and shorter than the enclosing scope. The ownership model has to support that.

StructuredTaskScope does not. LoanedBuffer does.

Figure 2: Async ownership transfer between two threads. Reference count tracks live ownership; close actions fire only at zero.

The JMM contract underneath this is worth stating directly because it is easy to get wrong. There are no explicit memory fences in the close-action path. The visibility of close-action slots written by the allocating thread to the releasing thread is guaranteed by safe publication of the buffer reference itself. When the parent passes the buffer into scope.fork(), the JDK’s structured-scope implementation publishes the reference safely - that publication is what makes all the buffer’s fields visible to the sub-task, including the close-action chain. If you bypassed scope.fork() and handed the buffer to another thread through, say, a non-volatile field, the model breaks.

This is also why the Community transport’s allocator uses shared-arena semantics for all allocations rather than Arena.ofConfined(). The carrier thread allocates, but the per-stream Virtual Thread closes - different threads, same buffer. A confined arena would refuse the cross-thread close(). Shared arena allows it, with the JMM safely-published buffer reference carrying visibility.
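The confined-versus-shared difference can be checked directly against the JDK. This sketch assumes JDK 22 or later, where java.lang.foreign is a final API; the class and helper names are illustrative only.

```java
import java.lang.foreign.Arena;
import java.util.concurrent.atomic.AtomicReference;

final class CrossThreadCloseSketch {

    /** Closes the arena on a fresh thread and returns whatever close() threw there, or null. */
    static Throwable closeOnOtherThread(Arena arena) {
        AtomicReference<Throwable> failure = new AtomicReference<>();
        Thread closer = new Thread(() -> {
            try {
                arena.close();
            } catch (RuntimeException e) {
                failure.set(e);
            }
        });
        closer.start();
        try {
            closer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return failure.get();
    }

    static boolean confinedRefusesCrossThreadClose() {
        Arena arena = Arena.ofConfined();
        arena.allocate(64);
        boolean refused = closeOnOtherThread(arena) instanceof WrongThreadException;
        arena.close();                   // the owner thread can still close it
        return refused;
    }

    static boolean sharedAllowsCrossThreadClose() {
        Arena arena = Arena.ofShared();
        arena.allocate(64);
        return closeOnOtherThread(arena) == null && !arena.scope().isAlive();
    }
}
```

A confined arena throws WrongThreadException when closed off its owner thread, which is exactly the allocate-on-carrier, close-on-virtual-thread case described above.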


Watermarks and the Pressure Boundary

LoanedBuffer solves per-buffer ownership. It does not solve aggregate pressure.

When the slab pools start running low, the kernel needs to know - and decide what to do about it - before an EX-MEM-1001 (Off-heap Exhausted) gets thrown on a request hot path. That is the job of WatermarkManager.

The manager exposes four threshold levels:

| Level | Off-heap utilization | ResourceArbiter decision |
|---|---|---|
| NORMAL | < 70% | ALLOW - allocations proceed |
| WARNING | 70–85% | THROTTLE - large allocations rejected |
| CRITICAL | 85–95% | REJECT - only essential traffic |
| SHEDDING | ≥ 95% | SHED_LOAD - EX-MEM-1001 thrown |

The ResourceArbiter reads the current level on each allocation request:

public LoanedBuffer tryAllocate(AllocationHint hint) {
    if (watermark.isHighWatermarkBreached()) {
        throw new MemoryExhaustedException(hint.bytes(), watermark.availableBytes());
    }
    return allocator.allocate(hint);
}
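The snippet above checks a single high-watermark flag. As a sketch of how the four levels could map to arbiter decisions, the following classifies utilization by the thresholds listed earlier. The type names are hypothetical, and treating each lower bound as inclusive is my assumption, since the ranges as written overlap at 85%.

```java
enum WatermarkLevelSketch {
    NORMAL, WARNING, CRITICAL, SHEDDING;

    /** Maps a whole-number utilization percentage to a level; lower bounds are inclusive. */
    static WatermarkLevelSketch ofPercent(long percent) {
        if (percent >= 95) {
            return SHEDDING;
        }
        if (percent >= 85) {
            return CRITICAL;
        }
        if (percent >= 70) {
            return WARNING;
        }
        return NORMAL;
    }
}

enum DecisionSketch { ALLOW, THROTTLE, REJECT, SHED_LOAD }

final class ArbiterSketch {
    /** Assumes usedBytes * 100 fits in a long, which holds for realistic off-heap sizes. */
    static DecisionSketch decide(long usedBytes, long capacityBytes) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacityBytes must be positive");
        }
        long percent = usedBytes * 100 / capacityBytes;   // integer floor
        switch (WatermarkLevelSketch.ofPercent(percent)) {
            case WARNING:
                return DecisionSketch.THROTTLE;
            case CRITICAL:
                return DecisionSketch.REJECT;
            case SHEDDING:
                return DecisionSketch.SHED_LOAD;
            default:
                return DecisionSketch.ALLOW;
        }
    }
}
```

Integer percentages keep the boundaries exact; a floating-point ratio would work too, at the cost of reasoning about rounding at 70, 85 and 95.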

This is where LoanedBuffer connects forward to the next architectural layer. The watermark levels are not just internal accounting - they expose pressure as a typed signal (WatermarkLevel) that the rest of the kernel - scheduling, admission, business logic - can react to without inspecting GC counters or parsing JFR events at runtime. How the transport edge uses that signal to shed load before work hits the kernel is a separate decision and belongs to its own article.


Leak Detection: When Discipline Fails

The Loan pattern relies on discipline. Every allocation must be paired with a close(). Every retain() must be paired with another close(). There is no GC fallback.

In production, that discipline is enforced by the API surface - try-with-resources, sealed types, the Glass Box telemetry of EX-MEM-1003. In development and testing, it is enforced by LeakTracker, which integrates java.lang.ref.Cleaner to detect buffers that became unreachable without being closed.

When LeakTracker runs in PARANOID mode and observes a LoanedBuffer whose reference count is non-zero at GC time, it emits EX-MEM-1002 (Arena Leak Detected):

| Code | Meaning | Glass-Box Payload (rawArgs) |
|---|---|---|
| EX-MEM-1001 | Off-heap Exhausted | [0] long requestedBytes, [1] long availableBytes |
| EX-MEM-1002 | Arena Leak Detected | [0] long segmentAddress, [1] long segmentByteSize |
| EX-MEM-1003 | Peek View Ownership Misuse | [0] String callerMethod |

The leak is logged with the segment address and size, and a JFR event is emitted. In a long-running test, this turns “I forgot a close somewhere” into a specific, actionable signal with a stack trace.
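The mechanism behind this kind of tracker can be sketched with the standard java.lang.ref.Cleaner API. This is not the Exeris LeakTracker: the names are hypothetical, a counter stands in for emitting EX-MEM-1002, and simulateCollection() exists only so the example is deterministic instead of waiting for a real GC cycle.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class LeakTrackerSketch {
    private static final Cleaner CLEANER = Cleaner.create();
    static final AtomicInteger LEAKS_REPORTED = new AtomicInteger();
    static volatile long lastLeakedSize;

    /** Cleaning state. It must not reference the tracked buffer, or the buffer never becomes unreachable. */
    private static final class State implements Runnable {
        final AtomicBoolean closed = new AtomicBoolean();
        final long sizeBytes;

        State(long sizeBytes) {
            this.sizeBytes = sizeBytes;
        }

        @Override
        public void run() {
            // Runs at most once: after the buffer becomes phantom reachable, or on explicit clean().
            if (!closed.get()) {
                lastLeakedSize = sizeBytes;
                LEAKS_REPORTED.incrementAndGet();   // stand-in for the leak event
            }
        }
    }

    static final class TrackedBuffer implements AutoCloseable {
        private final State state;
        private final Cleaner.Cleanable cleanable;

        TrackedBuffer(long sizeBytes) {
            this.state = new State(sizeBytes);
            this.cleanable = CLEANER.register(this, state);
        }

        @Override
        public void close() {
            state.closed.set(true);
            cleanable.clean();                      // run() sees closed == true and stays silent
        }

        /** Example-only hook: runs the cleaning action as the Cleaner eventually would. */
        void simulateCollection() {
            cleanable.clean();
        }
    }

    /** Returns how many leaks were reported for one closed and one unclosed buffer. */
    static int demo() {
        int before = LEAKS_REPORTED.get();
        try (TrackedBuffer closedProperly = new TrackedBuffer(4096)) {
            // closed by try-with-resources
        }
        TrackedBuffer leaked = new TrackedBuffer(4096);
        leaked.simulateCollection();                // never closed
        return LEAKS_REPORTED.get() - before;
    }
}
```

In a real run the report fires on the Cleaner's thread at some point after the buffer becomes unreachable, which is why this is a diagnostic for development rather than a release mechanism.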

This does not catch all leaks. A reference held by a long-lived data structure will not be GC’d, and LeakTracker will not fire. The terminalStateCatalog discipline I described in the Flow article applies here too: long-lived in-memory caches need their own bounded retention policy. The pattern catches forgotten references, not intentionally retained ones.

The trade-off is honest. LeakTracker is a development and staging tool, not a production safety net. In production, the API surface and code review are the primary defense. In development, PARANOID mode is the difference between “there is a leak somewhere in 50k LOC” and “the leak is in OrderHandler.java line 142, allocated from NetworkFrameSlabPool, 4096 bytes.”


What Still Remains True

A few things stay true even after this model is in place. Some of them are the reasons not to adopt it.

ByteBuffer is still the right answer for most Java applications. If you are building a normal HTTP service and your bottleneck is not allocation pressure on the request hot path, the Loan Pattern is over-engineering. It costs cognitive load on every read path, every fork, every cross-thread handoff. That cost is justified by the constraint, not by aesthetics.

Arena is still useful for request-lifetime allocations that do not need sharing. Inside a single Virtual Thread, with a clear scope boundary, an Arena.ofConfined() is simpler than a refcounted buffer. The kernel uses both patterns where each fits.

GC is still your friend for object graphs. Nothing in the Loan Pattern says “never allocate on the heap.” The pattern is specific to off-heap memory on the request hot path. Domain objects, plan instances, log records - all of those still live where Java has always put them.

The pattern does not solve cross-process IPC. If a buffer leaves the process - shipped over a network, written to a shared memory file, handed to a different JVM - the reference count stops being meaningful. The LoanedBuffer model is correct only for in-process lifetime. The handoff to a different process is its own boundary problem with its own ownership semantics.

Finally, the Loan Pattern does not soften the discipline cost. Every fork must retain. Every share must retain/close. Every async path must close in finally. The compiler will not catch the omissions for you. The code review will. The TCK will. LeakTracker will, in development. The runtime will not.

I considered making this less explicit - a wrapper type that auto-retained on escape from a method, an annotation that made the compiler enforce paired close(). Each of those would have either added runtime overhead, added a parallel API, or relied on a static analyzer that did not exist. The current design accepts the discipline cost as the price of the architectural model.


The thing I keep coming back to is that this is not a clever data structure. It is a contract - who owns this memory, who is allowed to extend its lifetime, and what is observable when someone gets it wrong. The implementation is unremarkable: a VarHandle, an int, a close-action chain. The work is deciding that ownership belongs in the type system at all, and accepting that the discipline cost is the price of the model.

The next architectural decision in the series is what the kernel does with the pressure signal once it has one - how WatermarkLevel becomes a shed decision at the network edge, and the single place in the kernel where unstructured Virtual Threads are deliberately allowed.


Explore the Exeris Kernel - zero-allocation architecture in running code: 🔗 exeris-systems/exeris-kernel

The Memory subsystem lives in exeris-kernel-spi (MemoryAllocator, LoanedBuffer, AllocationHint) and exeris-kernel-core (AbstractLoanedBuffer, WatermarkManager, ResourceArbiter, LeakTracker). If you want to see how the refcount / retain() / close() contract behaves under cross-thread fork-and-join, the TCK suite in exeris-kernel-tck is the fastest way in.