Prior art on implementing a "print" op on hardware

jkshtj · October 3, 2024, 5:38pm

Hi folks,
Are there existing hardware backends that support a “print” op? If yes, could someone with direct/indirect experience implementing them speak to how such ops work?

Specifically, I’m interested in understanding how a tensor partitioned across the memory hierarchy is printed without significantly changing the compute graph or introducing expensive “collective” operations.

I understand that a lot of this would depend on the hardware features available. For instance, in some hardware backends you cannot print from device and the tensor to be printed must always be sent to the host first. I’m more interested in the scenarios where the device does support printing.

P.S. - Perhaps even CPUs with a cache hierarchy run into similar challenges while printing a value. Any relevant insights here would be appreciated.
