Eliminating Framework Overhead by Compiling a Model Directly

Compiling a model with NNC

We recently introduced Python bindings for the tensor expressions (TE) used in NNC. The bindings are still far from complete, but they already make it much faster to iterate on NNC experiments. Motivated by this, I decided to try manually implementing the entire Deep-and-Wide model in tensor expressions; fortunately, the model is small.

After about an hour I had a working implementation (code here) and was eager to see how it compared to the JIT and to the Static Runtime. It turned out that even without any of NNC's loop optimizations, the naive TE implementation is about 3x faster than the Static Runtime and about 6x faster than the JIT (TorchScript) version:

TorchScript: 1.750s
Static runtime: 0.763s
Compiled with NNC: 0.282s
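
The ratios can be checked directly from the timings listed above. A quick sketch in Python (the dictionary and the `speedup_over` helper are just illustrative names chosen here):

```python
# Timings reported above, in seconds.
timings = {
    "TorchScript": 1.750,
    "Static runtime": 0.763,
    "Compiled with NNC": 0.282,
}

def speedup_over(baseline: str, candidate: str = "Compiled with NNC") -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return timings[baseline] / timings[candidate]

for name in ("TorchScript", "Static runtime"):
    print(f"NNC vs. {name}: {speedup_over(name):.1f}x")
# NNC vs. TorchScript: 6.2x
# NNC vs. Static runtime: 2.7x
```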

This is less surprising once we look at the code that NNC generates for the model:

for (int i0 = 0; i0 < 51; i0++) {
  fc_w_t[i0, 0] = FC_W[0, i0];
}
for (int i1 = 0; i1 < 32; i1++) {
  user_emb_t[0, i1, 0] = USER_EMB[0, 0, i1];
}
for (int i2 = 0; i2 < 32; i2++) {
  dp_terms[0, 0, i2] = (AD_EMB[0, 0, i2]) * (user_emb_t(0, i2, 0));
}
dp[0, 0] = 0.f;
for (int i0_1 = 0; i0_1 < 32; i0_1++) {
  dp[0, 0] = ReduceOp((dp[0, 0]) + (dp_terms(0, 0, i0_1)), out_args={0, 0}, reduce_args={i0});
}
for (int i1_1 = 0; i1_1 < 50; i1_1++) {
  wide_offset[0, i1_1] = (WIDE[0, i1_1]) + (MU[0, i1_1]);
}
for (int i1_2 = 0; i1_2 < 50; i1_2++) {
  wide_norm[0, i1_2] = (wide_offset(0, i1_2)) * (SIGMA[0, i1_2]);
}
for (int i1_3 = 0; i1_3 < 50; i1_3++) {
  wide_preproc[0, i1_3] = IfThenElse((wide_norm(0, i1_3))<0.f ? 1 : 0, 0.f, IfThenElse((wide_norm(0, i1_3))>10.f ? 1 : 0, 10.f, wide_norm(0, i1_3)));
}
for (int i1_4 = 0; i1_4 < 51; i1_4++) {
  inp[0, i1_4] = IfThenElse(i1_4<1 ? 1 : 0, dp(0, i1_4), wide_preproc(0, i1_4 - 1));
}
for (int i2_1 = 0; i2_1 < 51; i2_1++) {
  mm_terms[0, 0, i2_1] = (inp(0, i2_1)) * (fc_w_t(i2_1, 0));
}
mm[0, 0] = 0.f;
for (int i0_2 = 0; i0_2 < 51; i0_2++) {
  mm[0, 0] = ReduceOp((mm[0, 0]) + (mm_terms(0, 0, i0_2)), out_args={0, 0}, reduce_args={i0});
}
fc1[0, 0] = (mm(0, 0)) + (FC_B[0]);
sigmoid[0, 0] = sigmoid(fc1(0, 0));

While there are a dozen loops, they are very small, and LLVM optimizes them quite well (in fact, for batch_size=1 it unrolls everything and turns the model into a straight-line block of code). If the model were bigger, it would probably be better to apply NNC's optimizations (schedules) to make sure that LLVM does not choke while optimizing it, but in this case that was not even necessary!
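
To make the generated code easier to read, here is a plain-Python sketch of what it computes for batch_size=1. This is my reading of the printed loops, not the TE implementation itself: the function name, the flat-list argument layout, and the random inputs are assumptions made for illustration.

```python
import math
import random

def deep_and_wide(ad_emb, user_emb, wide, mu, sigma, fc_w, fc_b):
    """Batch-size-1 reference for the generated loops, using flat Python lists.

    ad_emb, user_emb: 32 floats; wide, mu, sigma: 50 floats;
    fc_w: 51 floats (the single row of FC_W); fc_b: one float (FC_B[0]).
    """
    # dp_terms / dp: dot product of the ad and user embeddings.
    dp = sum(a * u for a, u in zip(ad_emb, user_emb))
    # wide_offset / wide_norm: add MU, then multiply by SIGMA.
    wide_norm = [(w + m) * s for w, m, s in zip(wide, mu, sigma)]
    # wide_preproc: the nested IfThenElse clamps each value to [0, 10].
    wide_preproc = [min(max(x, 0.0), 10.0) for x in wide_norm]
    # inp: dp in position 0, followed by the 50 preprocessed wide features.
    inp = [dp] + wide_preproc
    # mm_terms / mm / fc1: a 51-element dot product plus the bias.
    fc1 = sum(x * w for x, w in zip(inp, fc_w)) + fc_b
    # Final sigmoid.
    return 1.0 / (1.0 + math.exp(-fc1))

def rand_list(rng, n):
    """Hypothetical helper: n floats drawn uniformly from [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

rng = random.Random(0)
score = deep_and_wide(
    rand_list(rng, 32), rand_list(rng, 32),                      # AD_EMB, USER_EMB
    rand_list(rng, 50), rand_list(rng, 50), rand_list(rng, 50),  # WIDE, MU, SIGMA
    rand_list(rng, 51), 0.1,                                     # FC_W, FC_B
)
print(score)
```

Seen this way, the whole model is two short dot products, an elementwise clamp, and a sigmoid, which is why there is so little left once the framework overhead is gone.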

Next Steps

Deep-and-Wide is a small model, and it would be interesting to see whether a similar approach works for a bigger model. Most probably, such work would also require improvements across the entire NNC stack: support for a wider range of operators (directly or via “external” calls), dynamic shapes, and a more robust and convenient API for TE construction and transformation.

Reproducing Results

The test script used in this note can be found in this gist. The command line is:

MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -C 3 python test.py

The measurements were performed on a Skylake server.
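
In the command above, the two environment variables limit MKL and OpenMP to one thread each, and `numactl -C 3` pins the process to CPU 3, so the numbers reflect single-core latency. The actual measurement code is in the test script; as a rough illustration of how such a timing loop is commonly structured, here is a minimal sketch. The `benchmark` helper, the warmup and iteration counts, and the stand-in workload are all hypothetical and not taken from that script.

```python
import time

def benchmark(fn, warmup=10, iters=1000):
    """Return total wall-clock seconds for `iters` calls of `fn` after a warmup."""
    for _ in range(warmup):
        fn()  # untimed calls so that caches and lazy initialization settle
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - start

def workload():
    # Stand-in for a single model invocation.
    return sum(i * i for i in range(100))

ITERS = 1000
elapsed = benchmark(workload, warmup=10, iters=ITERS)
print(f"{elapsed:.3f}s total, {elapsed / ITERS * 1e6:.2f}us per call")
```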


Very exciting! Do you eventually want to generate TEs at this level automatically, or is the plan for now to make TEs easier to define in the meantime?

“the naive TE implementation is 3x faster than the Static Runtime …”
Do you guys have any references for Static Runtime as well, by any chance? What is the overall project status?

For TE, we plan to do both. It’s not realistic to expect everyone to convert their models to TE by hand, so we definitely need an automated way to do that. At the same time, we do want to make the API easier to use, so that it’s not that painful for those who do want to go down that path :)

I’ve just posted a note on the Static Runtime design as it stands today: Reducing Framework Overhead with Static Runtime.