Eliminating Framework Overhead by Compiling a Model Directly

ZolotukhinM · January 22, 2021, 9:26pm

Compiling a model with NNC

We recently introduced Python bindings for the tensor expressions (TE) used in NNC. While the bindings are still far from complete, they already make it faster to iterate on NNC experiments. Motivated by this, I decided to try manually implementing the entire Deep-and-Wide model in tensor expressions; fortunately, the model is small.

After about an hour I had a working implementation (code here) and was eager to see how it compares to the JIT and to the Static Runtime. It turned out that even without any of NNC’s loop optimizations, the naive TE implementation is about 3x faster than the Static Runtime and about 6x faster than the JIT version:

TorchScript: 1.750s
Static runtime: 0.763s
Compiled with NNC: 0.282s
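
As a quick sanity check on the ratios, the speedups can be recomputed from the three timings listed above. This is only a small sketch; the function name `speedup_over_nnc` is made up for illustration and is not part of the test script.

```python
# Wall-clock timings reported above, in seconds.
timings = {
    "TorchScript": 1.750,
    "Static runtime": 0.763,
    "Compiled with NNC": 0.282,
}


def speedup_over_nnc(timings):
    """Return how many times slower each configuration is than the NNC build."""
    baseline = timings["Compiled with NNC"]
    return {name: seconds / baseline for name, seconds in timings.items()}


for name, ratio in speedup_over_nnc(timings).items():
    # Prints 6.2x for TorchScript, 2.7x for Static runtime, 1.0x for NNC.
    print(f"{name}: {ratio:.1f}x")
```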

This is less surprising once we look at the code NNC generates for the model:

{
  for (int i0 = 0; i0 < 51; i0++) {
    fc_w_t[i0, 0] = FC_W[0, i0];
  }
  for (int i1 = 0; i1 < 32; i1++) {
    user_emb_t[0, i1, 0] = USER_EMB[0, 0, i1];
  }
  for (int i2 = 0; i2 < 32; i2++) {
    dp_terms[0, 0, i2] = (AD_EMB[0, 0, i2]) * (user_emb_t(0, i2, 0));
  }
  dp[0, 0] = 0.f;
  for (int i0_1 = 0; i0_1 < 32; i0_1++) {
    dp[0, 0] = ReduceOp((dp[0, 0]) + (dp_terms(0, 0, i0_1)), out_args={0, 0}, reduce_args={i0});
  }
  for (int i1_1 = 0; i1_1 < 50; i1_1++) {
    wide_offset[0, i1_1] = (WIDE[0, i1_1]) + (MU[0, i1_1]);
  }
  for (int i1_2 = 0; i1_2 < 50; i1_2++) {
    wide_norm[0, i1_2] = (wide_offset(0, i1_2)) * (SIGMA[0, i1_2]);
  }
  for (int i1_3 = 0; i1_3 < 50; i1_3++) {
    wide_preproc[0, i1_3] = IfThenElse((wide_norm(0, i1_3))<0.f ? 1 : 0, 0.f, IfThenElse((wide_norm(0, i1_3))>10.f ? 1 : 0, 10.f, wide_norm(0, i1_3)));
  }
  for (int i1_4 = 0; i1_4 < 51; i1_4++) {
    inp[0, i1_4] = IfThenElse(i1_4<1 ? 1 : 0, dp(0, i1_4), wide_preproc(0, i1_4 - 1));
  }
  for (int i2_1 = 0; i2_1 < 51; i2_1++) {
    mm_terms[0, 0, i2_1] = (inp(0, i2_1)) * (fc_w_t(i2_1, 0));
  }
  mm[0, 0] = 0.f;
  for (int i0_2 = 0; i0_2 < 51; i0_2++) {
    mm[0, 0] = ReduceOp((mm[0, 0]) + (mm_terms(0, 0, i0_2)), out_args={0, 0}, reduce_args={i0});
  }
  fc1[0, 0] = (mm(0, 0)) + (FC_B[0]);
  sigmoid[0, 0] = sigmoid(fc1(0, 0));
}
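
To make the loop nest easier to follow, here is a rough pure-Python mirror of the same computation for batch_size=1. It is a reading aid reconstructed from the generated code above, not the TE implementation itself. The function name `deep_and_wide` and the flat-list argument layout are assumptions made for illustration.

```python
import math


def deep_and_wide(ad_emb, user_emb, wide, mu, sigma, fc_w, fc_b):
    """Pure-Python mirror of the generated loop nest for batch_size=1.

    ad_emb, user_emb: 32 floats; wide, mu, sigma: 50 floats;
    fc_w: 51 floats; fc_b: a single float.
    """
    # dp_terms / dp: dot product of the ad and user embeddings.
    dp = sum(a * u for a, u in zip(ad_emb, user_emb))
    # wide_offset / wide_norm: shift by MU, then scale by SIGMA.
    wide_norm = [(w + m) * s for w, m, s in zip(wide, mu, sigma)]
    # wide_preproc: clamp to [0, 10], matching the nested IfThenElse.
    wide_preproc = [min(max(x, 0.0), 10.0) for x in wide_norm]
    # inp: dp in slot 0, followed by the 50 preprocessed wide features.
    inp = [dp] + wide_preproc
    # mm_terms / mm / fc1: 51-element dot product with the FC weights, plus bias.
    fc1 = sum(x * w for x, w in zip(inp, fc_w)) + fc_b
    # Final sigmoid.
    return 1.0 / (1.0 + math.exp(-fc1))
```

Every intermediate here is at most 51 elements long, which is why the loops in the generated code are so small.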

While there are ten loops, they are very small and LLVM manages to optimize them quite well (in fact, for batch_size=1 it unrolls everything and turns the model into a straight-line block of code). If the model were bigger, it would probably be better to apply NNC’s optimizations (schedules) to make sure that LLVM does not choke while optimizing it, but in this case that was not even necessary!

Next Steps

Deep-and-Wide is a small model, and it would be interesting to see whether a similar approach works for a bigger one. Most probably, such work would also require improvements across the entire NNC stack: support for a wider range of operators (directly or via “external” calls), dynamic shapes, and a more robust and convenient API for TE construction and transformation.

Reproducing Results

The test script used in this note can be found in this gist. The command line is:

MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -C 3 python test.py

The measurements were performed on a Skylake server.
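
The gist remains the authoritative script. For readers who just want the shape of such a measurement, a minimal timing harness might look like the sketch below. The `benchmark` helper, the warmup count, and the iteration count are all assumptions for illustration and are not the values used for the numbers above.

```python
import time


def benchmark(fn, warmup=10, iters=1000):
    """Return total wall-clock seconds for `iters` calls of `fn` after warmup."""
    # Warmup calls are not timed; they let caches and any JIT settle first.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder workload; swap in the TorchScript, Static Runtime, or NNC callable.
    elapsed = benchmark(lambda: sum(range(100)))
    print(f"{elapsed:.3f}s")
```

Pinning the process to one core and one thread, as the command line above does, keeps runs like this comparable across configurations.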

salexspb · January 25, 2021, 8:20pm

Very exciting! Do you eventually want to generate TEs at a similar level automatically, or is the plan for now to make TEs easier to define by hand?

“the naive TE implementation is 3x faster than the Static Runtime”
Do you guys have any references for Static Runtime as well, by any chance? What is the overall project status?

ZolotukhinM · January 25, 2021, 8:47pm

For TE we plan to do both. It’s not realistic to expect everyone to convert their models to TE by hand, so we definitely need an automated way to do that. At the same time, we do want to make the API easier to use, so that it’s not that painful for those who do want to go down that path.

bwasti · January 25, 2021, 10:23pm

I’ve just posted a note on Static Runtime design as it stands today: Reducing Framework Overhead with Static Runtime
