Depthwise conv2d: An NNC Case Study

I spent the last month experimenting with using NNC to generate fast convolutions for Skylake-AVX512 CPUs. In the process, I found a nice opportunity to optimize depthwise convolutions, which are at least somewhat common in our TorchBench workloads, and could be easily landed with a relatively straightforward heuristic.

Methodology: Finding Candidates

We wanted to focus on convolutions that aren’t currently handled well by MKL-DNN. Jason Ansel added a nice FX-based profiler to TorchBench (pytorch/benchmark on GitHub), which was extended to report FLOPs and memory bandwidth for each conv2d layer. We dumped that data into a big spreadsheet and colorized it to look for places where we were far from roofline. This turned up a bunch of candidates, many of which were depthwise convolutions (i.e., convolutions where the number of groups equals the number of channels).

A screenshot of said spreadsheet, since I can’t easily link my Google doc:

Depthwise convolutions, especially with 3x3 kernels, are basically memory bound: they don’t have enough FLOPs to cover the latency of reading the image. MKL-DNN loooooves to transform convolutions into NCHW16c format (basically, moving vectorizable blocks of channels into consecutive memory), so it wastes a bunch of memory bandwidth converting to and from PyTorch’s native NCHW format (more on this later). That left an opportunity to build a fast NCHW convolution with NNC, which can dynamically generate code to take advantage of the fixed parameters of the convolution (e.g., image size, stride, pad).

Methodology: Benchmarking

I like to iterate on the smallest possible unit of work to keep my development loop quick. I wrote a benchmark harness to compare the 3x3 depthwise convolutions from a range of models (I slurped mobilenet, shufflenet and detectron2) using ATen and NNC. This setup lets me quickly compare each layer both to roofline and to alternate implementations. From this sample, I was able to find some heuristics that work reasonably well across all the layers I found. With one exception (from detectron2), NNC is a good bit faster with NCHW than MKL-DNN:

Implementation: Compute and Schedule

NNC’s design is pretty heavily influenced by TVM. Like TVM, NNC separates a kernel into a “compute” definition, which describes the mathematical function being computed, and a “schedule,” which optimizes the loop nest. Where NNC differs from TVM is that its scheduling language eagerly mutates the loop nest, rather than deferring the transformations to code generation time. It’s a subtle difference, but it makes the optimizer a bit easier to reason about for us compiler writers.

NNC’s C++ API isn’t as concise as TVM’s, but you can probably figure out the compute expression easily enough:

  Tensor* conv = Reduce(
      {{N, "n"}, {K, "k"}, {OH, "oh"}, {OW, "ow"}},
      // Initializer: start each output's accumulation from the bias.
      [&](const std::vector<VarHandle>& v) {
        auto const& k = v[1];
        return bias.load(k);
      },
      // Body: accumulate input * weight, masking out the padding region.
      [&](const std::vector<VarHandle>& v) {
        auto const& n = v[0];
        auto const& k = v[1];
        auto const& oh = v[2];
        auto const& ow = v[3];
        auto const& c = v[4];
        auto const& r = v[5];
        auto const& s = v[6];
        // cond is 1 whenever the tap falls outside the input image.
        auto cond = CompareSelect::make(oh * stride - pad + r, 0, 1, 0, kLT);
        cond = CompareSelect::make(ow * stride - pad + s, 0, 1, cond, kLT);
        cond = CompareSelect::make(oh * stride - pad + r, H, 1, cond, kGE);
        cond = CompareSelect::make(ow * stride - pad + s, W, 1, cond, kGE);
        // In the padding region, substitute zero for the input load.
        auto in = ifThenElse(
            cond,
            0.f,
            input.load(n, k, oh * stride - pad + r, ow * stride - pad + s));
        return in * weight.load(k, c, r, s);
      },
      {{C / groups, "c"}, {R, "r"}, {S, "s"}});

Most of the verbosity really comes from giving the axes friendly names. If only I could std::tie a vector… Anyways, the scheduling for these probably needs a bit more explaining:

  constexpr int kLoopH = 2, kLoopW = 3;
  if (R == 3 && stride == 2 && pad == 1) {
    // Peel the first couple of iterations off the W and H loops; those are
    // the iterations that can read from the padding region.
    For *head, *tail;
    auto loops = nest.getLoopStmtsFor(conv);
    nest.sliceHead(loops[kLoopW], 2, &head, &tail);
    loops = nest.getLoopStmtsFor(conv);
    nest.sliceHead(loops[kLoopH], 2, &head, &tail);
  } else if (R == 3 && stride == 1 && pad == 1) {
    // Peel one iteration off each end of the W loop, then of its parent H
    // loop, leaving a branch-free main loop over the interior.
    For *main, *peeled;
    auto loops = nest.getAllLoopNestsWritingToBuf(conv->buf());
    main = loops[1][kLoopW];
    nest.sliceHead(main, 1, &peeled, &main);
    nest.sliceTail(main, 1, &main, &peeled);
    main = LoopNest::getParentLoop(main);
    nest.sliceHead(main, 1, &peeled, &main);
    nest.sliceTail(main, 1, &main, &peeled);
  }

Here I’m specializing on the stride, so that we can use different heuristics for stride=2 and stride=1 convolutions. The basic idea is the same: we want to peel (which we, for some reason, named “slice”) a few iterations off the front or back of the loop to handle the special case where we’re in the padding region. With that done, the LLVM backend actually generates fairly reasonable code.

Results: End to End

From there it’s a simple matter to wire this conv2d implementation into the fuser. While the NNC-based fuser mostly handles elementwise operations, its mechanisms are quite general, so it’s a small diff to select good convolution candidates for fusion. By querying the profiling executor for input shapes, we can ensure that we only hand NNC convolutions that it will handle well. I’ve only enabled NNC when running single-threaded; thread-level parallelism just landed in NNC, so I haven’t had time to incorporate it into the schedule yet.

I’ve benchmarked the four models from TorchBench that have depthwise convolutions (mobilenet v2/v3, mnasnet, and shufflenet). Other models’ performance will remain unchanged. While I was experimenting with NNC, Elias Ellison and Nick Korovaiko did some outstanding work using model freezing to lower models to MKL-DNN and pre-transpose weights into the desired format. So in addition to benchmarking “pure” TorchScript, I benchmarked the models when frozen and converted to MKL-DNN format as well. N.b.: the NNC approach currently requires freezing as well; this is simply because the bias tensor is an “optional” value, which isn’t handled by the profiling executor at the moment (nor by the NNC backend).

There are some interesting bits here. NNC improves on the JIT baseline on all four models, by up to 40%, and even yields faster results than MKL-DNN on the MobileNet models; MKL-DNN actually slows down MobileNetV2. A deeper analysis shows that the use of custom activation functions is forcing the activations to be converted in and out of MKL-DNN’s preferred format. I expect that if we add support for those custom activations, we’ll see MKL-DNN improve beyond NNC’s performance (Nick is currently working on this optimization).

Postscript: Where Do We Go From Here?

While I’m certainly happy to see some end-to-end speedups coming from NNC, I think the overall perf picture here is ambiguous:

  • MKL-DNN seems like the better choice, if we can cover its operator gaps
    • Although I never tested NNC with NCHW16c, so maybe things will get competitive again
    • I’ll probably land this anyways, since it looks like pure win if and when it kicks in
  • NNC’s scheduling library leaves a lot to be desired
    • For “dense” convolutions I was only able to outperform MKL-DNN with very friendly shapes (e.g., sizes perfectly aligned to vector widths)
    • I hit a lot of bugs in the process (although to their credit, Mikhail and Raghavan have been super-duper fast at fixing many of these)
    • compute_at nominally exists but isn’t well supported, and Andrew Tulloch’s prior work on ResNet-50 with TVM suggests those are critical to getting good perf, by scheduling the padding and packing well
  • NNC (and TVM, for that matter) are pretty much always going to have a “ninja gap” on consumer platforms
    • Using LLVM makes many things easy, but can make it hard to reach peak perf
    • Code generators like libxsmm or loop tool seem like the most promising approaches to hitting peak
  • Auto-tuning is still underexplored in the “main branch”
    • NNC doesn’t have an autotuner, because the scheduling library isn’t very robust
    • We don’t really have a story for “native” autotuner support (although of course cuDNN does this “under the hood” with cudnn.benchmark)
    • It’s possible (and maybe safer?) to port tuned heuristics by hand, but autotuning is a pretty easy way to explore the perf space