Prior art on implementing a "print" op on hardware

Hi folks,
Are there existing hardware backends that support a “print” op? If yes, could someone with direct/indirect experience implementing them speak to how such ops work?

Specifically, I’m interested in understanding how a tensor partitioned across the memory hierarchy can be printed without significantly changing the compute graph or introducing expensive “collective” operations.

I understand that a lot of this depends on the hardware features available. For instance, on some hardware backends you cannot print from the device, and the tensor to be printed must always be sent to the host first. I’m more interested in scenarios where the device does support printing.

P.S. - Perhaps even CPUs with a cache hierarchy run into similar challenges while printing a value. Any relevant insights here would be appreciated.