## Compiling a model with NNC

We recently introduced Python bindings for the tensor expressions used in NNC. While the bindings are still far from complete, they nevertheless speed up iteration on NNC experiments. Motivated by this, I decided to try manually implementing the entire Deep-and-Wide model in tensor expressions - fortunately, the model is small.

After about an hour I ended up with a working implementation (code here) and was eager to see how it compared to the JIT and to the Static Runtime. It turned out that even without any of NNC's loop optimizations, the naive TE implementation is ~3x faster than the Static Runtime and ~7x faster than the JIT version:

```
TorchScript: 1.750s
Static runtime: 0.763s
Compiled with NNC: 0.282s
```
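For context, these numbers are total wall-clock times for a fixed number of single-threaded invocations of each runtime. A minimal harness along these lines could produce them (a sketch with hypothetical names; the actual test script is in the gist linked below):

```python
import timeit

def bench(fn, inputs, iters=10_000):
    # Warm up once so one-time costs (compilation, allocation)
    # do not pollute the measurement.
    fn(*inputs)
    # Total wall-clock time for `iters` invocations.
    return timeit.timeit(lambda: fn(*inputs), number=iters)
```

Each of the three runtimes (TorchScript, Static Runtime, NNC-compiled) is wrapped in a callable and timed on the same inputs.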

It is less surprising if we look at the code NNC generates for the model:

```
{
  for (int i0 = 0; i0 < 51; i0++) {
    fc_w_t[i0, 0] = FC_W[0, i0];
  }
  for (int i1 = 0; i1 < 32; i1++) {
    user_emb_t[0, i1, 0] = USER_EMB[0, 0, i1];
  }
  for (int i2 = 0; i2 < 32; i2++) {
    dp_terms[0, 0, i2] = (AD_EMB[0, 0, i2]) * (user_emb_t(0, i2, 0));
  }
  dp[0, 0] = 0.f;
  for (int i0_1 = 0; i0_1 < 32; i0_1++) {
    dp[0, 0] = ReduceOp((dp[0, 0]) + (dp_terms(0, 0, i0_1)), out_args={0, 0}, reduce_args={i0});
  }
  for (int i1_1 = 0; i1_1 < 50; i1_1++) {
    wide_offset[0, i1_1] = (WIDE[0, i1_1]) + (MU[0, i1_1]);
  }
  for (int i1_2 = 0; i1_2 < 50; i1_2++) {
    wide_norm[0, i1_2] = (wide_offset(0, i1_2)) * (SIGMA[0, i1_2]);
  }
  for (int i1_3 = 0; i1_3 < 50; i1_3++) {
    wide_preproc[0, i1_3] = IfThenElse((wide_norm(0, i1_3))<0.f ? 1 : 0, 0.f, IfThenElse((wide_norm(0, i1_3))>10.f ? 1 : 0, 10.f, wide_norm(0, i1_3)));
  }
  for (int i1_4 = 0; i1_4 < 51; i1_4++) {
    inp[0, i1_4] = IfThenElse(i1_4<1 ? 1 : 0, dp(0, i1_4), wide_preproc(0, i1_4 - 1));
  }
  for (int i2_1 = 0; i2_1 < 51; i2_1++) {
    mm_terms[0, 0, i2_1] = (inp(0, i2_1)) * (fc_w_t(i2_1, 0));
  }
  mm[0, 0] = 0.f;
  for (int i0_2 = 0; i0_2 < 51; i0_2++) {
    mm[0, 0] = ReduceOp((mm[0, 0]) + (mm_terms(0, 0, i0_2)), out_args={0, 0}, reduce_args={i0});
  }
  fc1[0, 0] = (mm(0, 0)) + (FC_B[0]);
  sigmoid[0, 0] = sigmoid(fc1(0, 0));
}
```

While there are a dozen loops, they are very small, and LLVM manages to optimize them quite well (in fact, for batch_size=1 it unrolls everything entirely and converts the model into a straight-line block of code). If the model were bigger, it would probably be better to apply NNC's own optimizations (schedules) to make sure LLVM doesn't choke on it, but in this case that was not even necessary!
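To make the data flow explicit, the generated kernel corresponds to roughly the following pure-Python computation (a sketch for batch_size=1; the function and argument names are mine, chosen to mirror the tensor names in the IR):

```python
import math

def deep_and_wide_ref(ad_emb, user_emb, wide, mu, sigma, fc_w, fc_b):
    # dp: dot product of the ad and user embeddings (lists of 32 floats)
    dp = sum(a * u for a, u in zip(ad_emb, user_emb))
    # Wide preprocessing: offset by MU, scale by SIGMA, clamp to [0, 10]
    # (the nested IfThenElse in the IR)
    wide_preproc = [min(max((w + m) * s, 0.0), 10.0)
                    for w, m, s in zip(wide, mu, sigma)]
    # Concatenate [dp] with the 50 wide features (the inp loop),
    # apply the fully-connected layer, then sigmoid
    inp = [dp] + wide_preproc
    fc1 = sum(x * w for x, w in zip(inp, fc_w)) + fc_b
    return 1.0 / (1.0 + math.exp(-fc1))
```

Seen this way, the whole model is a 32-element dot product, 50 fused elementwise ops, and a 51-element dot product - exactly the kind of tiny, statically shaped kernel LLVM flattens into straight-line code.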

## Next Steps

Deep-and-Wide is a small model, and it would be interesting to see whether a similar approach could be applied to a bigger one. Most probably, such work would also require improvements across the entire NNC stack: supporting a wider range of operators (directly or via "external" calls), dynamic shapes, and a more robust and convenient API for TE construction and transformation.

## Reproducing Results

The test script used in this note can be found in this gist. The command line is:

```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -C 3 python test.py
```

The measurements were performed on a Skylake server.