Run Llama3-8b on a Raspberry Pi 5 with ExecuTorch

digantdesai · May 1, 2024, 7:20pm

TL;DR

This post demonstrates running a 4-bit quantized Llama3-8b on a Raspberry Pi 5 with ExecuTorch at ~2 tokens per second. It also covers running Llama2-7b and the smaller Stories110M model on the same device.

Motivation

We want to explore the feasibility of running large language models on low-power devices like the Raspberry Pi 5 using ExecuTorch. This post demonstrates the potential of ExecuTorch to enable efficient inference on edge devices for applications such as chatbots, smart assistants, or other NLP tasks. By running Stories110M, a quantized Llama2-7b, and the cutting-edge Llama3-8b on a Raspberry Pi 5, we highlight ExecuTorch's ability to bring advanced AI technologies to resource-constrained platforms with minimal effort, thereby expanding their utility and accessibility.

Setting Things Up

I started on my personal Linux desktop with the ExecuTorch LLM example README (see [1] for more details). My desktop runs Arch Linux on an x86_64 CPU and has ~16GiB of RAM. This is the host machine.

Check the ExecuTorch documentation for the systems that are currently supported. Also, before we get started: this post is neither self-contained nor a substitute for the ExecuTorch tutorials. Please treat those as the source of truth in case of any discrepancy.

I recently acquired a Raspberry Pi 5, which I intend to use as the target platform here for running the Llama2-7B and the newly released Llama3-8B. The device is equipped with 4 Cortex-A76 CPUs and approximately 8GiB of RAM.

      _,met$$$$$gg.            
    ,g$$$$$$$$$$$$$$$P.          
  ,g$$P"     """Y$$.".          OS: Debian GNU/Linux 12 (bookworm) aarch64 
 ,$$P'              `$$$.       Host: Raspberry Pi 5 Model B Rev 1.0 
',$$P       ,ggs.     `$$b:     Kernel: 6.1.0-rpi7-rpi-2712 
`d$$'     ,$P"'   .    $$$      Uptime: 8 hours, 7 mins 
 $$P      d$'     ,    $$P      Packages: 1553 (dpkg) 
 $$:      $$.   -    ,d$$'      Shell: bash 5.2.15 
 $$;      Y$b._   _,d$P'        Resolution: 3840x2160 
 Y$$.    `.`"Y$$$$P"'           DE: LXDE-pi-wayfire 
 `$$b      "-.__                WM: wayfire 
  `Y$$                          Theme: PiXflat [GTK3] 
   `Y$$.                        Icons: PiXflat [GTK3] 
     `$$b.                      Terminal: lxterminal 
       `Y$$b.                   Terminal Font: Monospace 10 
          `"Y$b._               CPU: (4) @ 2.400GHz 
              `"""              Memory: 817MiB / 8049MiB

I decided to cross-compile the ExecuTorch runtime, since building on the desktop should be faster than developing natively on the Raspberry Pi 5, given the desktop's obvious performance and RAM advantages. A little spoiler, though: we'll later see that even my Linux desktop runs out of RAM when handling the float weights of the larger Llama2-7b and Llama3-8b models.

Cross-compilation requires a toolchain that operates on the host machine and produces code for the target machine. Luckily, for Arch Linux, there’s a handy package available for installing GCC that generates binaries for GNU/Linux running on aarch64 (or Armv8) CPUs.

# Pacman is a package manager for Arch Linux to install the package [4]
# For your distro you have to do something similar but using your 
# package manager.

$ pacman -S aarch64-linux-gnu-gcc

This will install the necessary tools such as the compiler, linker, system libraries, etc., enabling us to cross-compile ExecuTorch runtime executables on the host machine for use on the Raspberry Pi 5 target.
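As a quick sanity check after installing the toolchain, a small helper along these lines can confirm that the cross compiler is on the PATH and report which target it builds for. This is only a sketch: `check_cross_compiler` is a hypothetical helper name, and the default compiler name assumes the Arch package above (other distros may name the binary differently).

```shell
# Hypothetical helper: verify that a cross compiler is installed and print its target.
# $1 = compiler executable name (defaults to the one installed by the Arch package).
check_cross_compiler() {
    cc="${1:-aarch64-linux-gnu-gcc}"
    if ! command -v "$cc" >/dev/null 2>&1; then
        echo "cross compiler not found on PATH: $cc" >&2
        return 1
    fi
    # -dumpmachine asks GCC to print the target triplet it generates code for.
    "$cc" -dumpmachine
}
```

Running `check_cross_compiler` with no arguments should print a triplet that starts with `aarch64` if the toolchain is installed correctly, and return a non-zero status otherwise.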

Next, I followed the steps outlined in the ExecuTorch LLM Readme guide [1]:

Cloned a copy of the ExecuTorch GitHub repository
Set up ExecuTorch as per the “Setting up guide” [2]
Downloaded the Stories110M, Llama2-7B, and Llama3-8B models as directed by the ExecuTorch LLM guide [1]

That’s it! Now, we can proceed with generating the PTE file - a binary artifact that the ExecuTorch runtime will consume on the target device. If you’re interested, there’s a wealth of additional information in the ExecuTorch documentation.

To keep our focus solely on running the models on the Pi, I’ll be skipping the fine-tuning and model-evaluation steps.

Generating PTE files

The PTEs we will be generating here are portable across CPUs, but this is not typical of PTE files in general.

As per the LLM guide [1], generating the PTE was a relatively straightforward process. To check that the workflow is functional and adaptable across different models, we will generate PTE files for three models of increasing size.

Stories110M

Using the already downloaded Stories110M weights and the export_llama script in the repo, as described in the guide [1], I could generate the Stories PTE file relatively easily.

Llama2-7b

This model presented a challenge for my Linux desktop host machine, which has 16GiB of RAM. The export step in the PTE generation process requires more memory for this larger model, so I was unable to export it. Instead, I utilized my M1-MacBook Pro, which has 48GiB of RAM, to generate the PTE files for Llama2-7b and Llama3. The steps to set up the ExecuTorch environment for generating PTEs for LLMs were identical to those on my Linux desktop, so I won’t repeat them here.

Llama3-8b

By following the same steps as for Llama2-7b, but with a different checkpoint, I was also able to obtain the PTE file for Llama3. Please note that since Llama3-8b was just released (at the time of writing this), you’ll need to use the main branch of ExecuTorch for Llama3-8b, not the stable release v0.2 branch. This is because some components related to Llama3-8b, such as the tokenizer, are not available in the stable branch.

Compiling a runner

To cross-compile the llama runner, a runtime executable that will consume the previously generated PTE file(s), you will need to pass -DCMAKE_TOOLCHAIN_FILE=/path/to/file_below [3] to the CMake command that builds the ExecuTorch libs and executables. The following is the CMake toolchain file for the Arm toolchain we installed earlier.

# CMake toolchain file for cross-compiling for ARM (aarch64).
#
# Target operating system name.
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)
set(CMAKE_CROSSCOMPILING TRUE)

# Names of the C and C++ compilers.
set(CMAKE_C_COMPILER "/usr/bin/aarch64-linux-gnu-gcc")
set(CMAKE_CXX_COMPILER "/usr/bin/aarch64-linux-gnu-g++")

# Where to look for the target environment. (More paths can be added here)
set(CMAKE_FIND_ROOT_PATH /usr/aarch64-linux-gnu)
set(CMAKE_INCLUDE_PATH  /usr/aarch64-linux-gnu/include/aarch64-linux-gnu)
set(CMAKE_LIBRARY_PATH  /usr/aarch64-linux-gnu/lib/aarch64-linux-gnu)
set(CMAKE_PROGRAM_PATH  /usr/aarch64-linux-gnu/bin/aarch64-linux-gnu)

# Adjust the default behavior of the FIND_XXX() commands:
# search programs in the host environment only.
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)

# Search headers and libraries in the target environment only.
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)

You may have to tweak this a little based on your OS and toolchain installation path(s).

With this, we should be ready to build both ExecuTorch libraries and ExecuTorch Llama runner.
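The configure-and-build step with the toolchain file can be sketched as below. Only standard CMake options are used here; the ExecuTorch-specific options and the separate llama runner build are described in the README [1] and are deliberately left out. The helper names `run` and `cross_build`, the `DRY_RUN` variable, and the example paths are all hypothetical.

```shell
# Hypothetical helper: print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: configure and build the current source tree for the target.
# $1 = path to the CMake toolchain file, $2 = build directory, $3 = parallel jobs.
cross_build() {
    toolchain_file="$1"
    build_dir="$2"
    jobs="${3:-4}"
    run cmake -S . -B "$build_dir" \
        -DCMAKE_TOOLCHAIN_FILE="$toolchain_file" \
        -DCMAKE_BUILD_TYPE=Release &&
    run cmake --build "$build_dir" --parallel "$jobs"
}
```

For example, `DRY_RUN=1 cross_build /path/to/toolchain.cmake cmake-out-aarch64` prints the two CMake commands without running them, which is a convenient way to review them before adding the options from the README [1].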

GLIBC version issue

The GLIBC version in the toolchain sysroot may differ from the one installed on your Raspberry Pi 5. One way to get around that is to copy the toolchain's system libraries and the ExecuTorch libraries onto the Raspberry Pi 5, and then run the binary through the copied dynamic loader, as shown below.

# ./lib and ./lib64 are copied from the toolchain sysroot (dynamic loader + system libraries),
# ./et_libs holds the ExecuTorch libraries, and ./llama_main is the ExecuTorch Llama runner.
pi@raspberrypi:~/llm $ ./lib/ld-linux-aarch64.so.1 \
    --library-path ./lib/:./lib64:./et_libs/ \
    ./llama_main \
    [... other args for the runner ...]
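To decide whether this workaround is likely to be needed, one rough heuristic is to compare the GLIBC version the toolchain ships with against the one on the Pi (for example, the last field of the first line of `ldd --version` on each side). The sketch below only compares two version strings; `version_gt` is a hypothetical helper, it assumes a `sort` that supports `-V`, and the real requirement depends on which versioned symbols the binary actually uses.

```shell
# Hypothetical helper: succeed when version $1 is strictly newer than version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}
```

With this, `version_gt "$toolchain_glibc" "$target_glibc"` succeeding suggests that binaries built with the toolchain may ask for GLIBC symbols the Pi does not provide, in which case running through the copied loader as above is one option.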

Results

Stories110M

FP32

(+ XNNPACK + KV-cache + SDPA-custom OP + 4 Threads)

pi@raspberrypi:~/llm $ ./lib/ld-linux-aarch64.so.1 \
  --library-path ./lib/:./lib64:./et_libs/ ./llama_main \
  --model_path=./xnnpack_llama2_fp32.pte \
  --tokenizer_path=./tokenizer.bin \
  --cpu_threads=4 \
  --prompt="what is the meaning of life?"
 
I 00:00:00.000164 executorch:main.cpp:64] Resetting threadpool with num threads = 4
I 00:00:00.000641 executorch:runner.cpp:45] Creating LLaMa runner: model_path=./xnnpack_llama2_fp32.pte, tokenizer_path=./tokenizer.bin
I 00:00:00.695386 executorch:runner.cpp:64] Reading metadata from model
I 00:00:00.695556 executorch:runner.cpp:123] get_vocab_size: 32000
I 00:00:00.695572 executorch:runner.cpp:123] get_bos_id: 1
I 00:00:00.695591 executorch:runner.cpp:123] get_eos_id: 2
I 00:00:00.695598 executorch:runner.cpp:123] get_n_bos: 1
I 00:00:00.695604 executorch:runner.cpp:123] get_n_eos: 1
I 00:00:00.695609 executorch:runner.cpp:123] get_max_seq_len: 128
I 00:00:00.695629 executorch:runner.cpp:123] use_kv_cache: 1
I 00:00:00.695640 executorch:runner.cpp:123] use_sdpa_with_kv_cache: 1
I 00:00:00.695652 executorch:runner.cpp:123] append_eos_to_prompt: 0
 
what is the meaning of life? All you need to do is explore and learn about the world around you. But, not all the time is easy. So, when you feel like exploring, take a little break and get a drink of water.
One day, while exploring, the boy discovered a cobweb in the corner of the room. It was very small and full of tiny cobwebs. He was fascinated by it and wanted to learn more about it.
He asked his mom about the cobweb. She told him that it was made by spiders and that spiders use
 
PyTorchObserver {"prompt_tokens":8,"generated_tokens":119,"model_load_start_ms":1712634889738,"model_load_end_ms":1712634890451,"inference_start_ms":1712634890451,"inference_end_ms":1712634903006,"prompt_eval_end_ms":1712634891220,"first_token_ms":1712634891317,"aggregate_sampling_time_ms":140732394210149,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:13.268560 executorch:runner.cpp:407] 	Prompt Tokens: 8    Generated Tokens: 119
I 00:00:13.268573 executorch:runner.cpp:413] 	Model Load Time:		0.713000 (seconds)
I 00:00:13.268587 executorch:runner.cpp:420] 	Total inference time:		12.555000 (seconds)		 Rate: 	9.478295 (tokens/second)
I 00:00:13.268603 executorch:runner.cpp:430] 		Prompt evaluation:	0.769000 (seconds)		 Rate: 	10.403121 (tokens/second)
I 00:00:13.268618 executorch:runner.cpp:439] 		Generated 119 tokens:	11.786000 (seconds)		 Rate: 	10.096725 (tokens/second)
I 00:00:13.268632 executorch:runner.cpp:450] 	Time to first generated token:	0.866000 (seconds)
I 00:00:13.268639 executorch:runner.cpp:456] 	Sampling time over 127 tokens:	140732394210.148987 (seconds)

Int4

(Groupwise Quant + XNNPACK + KV-cache + SDPA-custom OP + 4 Threads)

pi@raspberrypi:~/llm $ ./lib/ld-linux-aarch64.so.1 \
    --library-path ./lib/:./lib64:./et_libs/ \
    ./llama_main  \
    --model_path=./xnnpack_llama2_qb4w.pte \
    --tokenizer_path=./tokenizer.bin \
    --cpu_threads=4 \
    --prompt="what is the meaning of life?"
 
I 00:00:00.000163 executorch:main.cpp:64] Resetting threadpool with num threads = 4
I 00:00:00.000724 executorch:runner.cpp:45] Creating LLaMa runner: model_path=./xnnpack_llama2_qb4w.pte, tokenizer_path=./tokenizer.bin
I 00:00:00.160773 executorch:runner.cpp:64] Reading metadata from model
I 00:00:00.160863 executorch:runner.cpp:123] get_vocab_size: 32000
I 00:00:00.160870 executorch:runner.cpp:123] get_bos_id: 1
I 00:00:00.160876 executorch:runner.cpp:123] get_eos_id: 2
I 00:00:00.160881 executorch:runner.cpp:123] get_n_bos: 1
I 00:00:00.160888 executorch:runner.cpp:123] get_n_eos: 1
I 00:00:00.160893 executorch:runner.cpp:123] get_max_seq_len: 128
I 00:00:00.160901 executorch:runner.cpp:123] use_kv_cache: 1
I 00:00:00.160907 executorch:runner.cpp:123] use_sdpa_with_kv_cache: 1
I 00:00:00.160912 executorch:runner.cpp:123] append_eos_to_prompt: 0
 
what is the meaning of life? You will learn it very slowly. But you can make it through it with your own two hands.
One day, your grandmother asked if she could give you a special gift. She gave her granddaughter a beautiful necklace with a big heart pendant.
The little girl was very happy and thanked her grandmother. She put the necklace on and ran to the garden to show her friends. But when she got there, she saw that all her friends had the same necklace, much prettier than her.
The little girl felt very sad. She
 
PyTorchObserver {"prompt_tokens":8,"generated_tokens":119,"model_load_start_ms":1712635532162,"model_load_end_ms":1712635532333,"inference_start_ms":1712635532333,"inference_end_ms":1712635536358,"prompt_eval_end_ms":1712635532596,"first_token_ms":1712635532620,"aggregate_sampling_time_ms":140732322579298,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:04.197115 executorch:runner.cpp:407] 	Prompt Tokens: 8    Generated Tokens: 119
I 00:00:04.197128 executorch:runner.cpp:413] 	Model Load Time:		0.171000 (seconds)
I 00:00:04.197162 executorch:runner.cpp:420] 	Total inference time:		4.025000 (seconds)		 Rate: 	29.565217 (tokens/second)
I 00:00:04.197197 executorch:runner.cpp:430] 		Prompt evaluation:	0.263000 (seconds)		 Rate: 	30.418251 (tokens/second)
I 00:00:04.197244 executorch:runner.cpp:439] 		Generated 119 tokens:	3.762000 (seconds)		 Rate: 	31.632111 (tokens/second)
I 00:00:04.197268 executorch:runner.cpp:450] 	Time to first generated token:	0.287000 (seconds)
I 00:00:04.197290 executorch:runner.cpp:456] 	Sampling time over 127 tokens:	140732322579.298004 (seconds)

Llama2-7b

Int4

(Groupwise Quant + XNNPACK + KV-cache + SDPA-custom OP + 4 Threads)

pi@raspberrypi:~/llm $ ./lib/ld-linux-aarch64.so.1 \
    --library-path ./lib/:./lib64:./et_libs/ ./llama_main  \
    --model_path=./llama2/llama2_7b_kv_sdpa_8da4w_gs256.pte \
    --tokenizer_path=./llama2/tokenizer.bin \
    --cpu_threads=4 \
    --prompt="what is the meaning of life?"

I 00:00:00.000231 executorch:main.cpp:64] Resetting threadpool with num threads = 4
I 00:00:00.000741 executorch:runner.cpp:45] Creating LLaMa runner: model_path=./llama2/llama2_7b_kv_sdpa_8da4w_gs256.pte, tokenizer_path=./llama2/tokenizer.bin
I 00:00:18.025087 executorch:runner.cpp:64] Reading metadata from model
I 00:00:18.025161 executorch:runner.cpp:123] get_vocab_size: 32000
I 00:00:18.025170 executorch:runner.cpp:123] get_bos_id: 1
I 00:00:18.025177 executorch:runner.cpp:123] get_eos_id: 2
I 00:00:18.025184 executorch:runner.cpp:123] get_n_bos: 1
I 00:00:18.025191 executorch:runner.cpp:123] get_n_eos: 1
I 00:00:18.025197 executorch:runner.cpp:123] get_max_seq_len: 128
I 00:00:18.025205 executorch:runner.cpp:123] use_kv_cache: 1
I 00:00:18.025213 executorch:runner.cpp:123] use_sdpa_with_kv_cache: 1
I 00:00:18.025219 executorch:runner.cpp:123] append_eos_to_prompt: 0

what is the meaning of life? 11/12/2005
 nobody knows...
if you really want to know...
then you will have to ask the author of the book of life.
this is the last time i will answer a question like that.
if you want to know what is the meaning of life then read my blog...
enjoy your life while it lasts
i was a day dreamer
life is like a rainbow
you have to work hard for your dreams
but it will be worth it in the end
and no one can take that

PyTorchObserver {"prompt_tokens":8,"generated_tokens":119,"model_load_start_ms":1712717102880,"model_load_end_ms":1712717120928,"inference_start_ms":1712717120928,"inference_end_ms":1712717174682,"prompt_eval_end_ms":1712717124286,"first_token_ms":1712717124695,"aggregate_sampling_time_ms":140734536254310,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:01:11.803040 executorch:runner.cpp:407]    Prompt Tokens: 8    Generated Tokens: 119
I 00:01:11.803156 executorch:runner.cpp:413]    Model Load Time:                18.048000 (seconds)
I 00:01:11.803297 executorch:runner.cpp:420]    Total inference time:           53.754000 (seconds)              Rate:  2.213789 (tokens/second)
I 00:01:11.803483 executorch:runner.cpp:430]            Prompt evaluation:      3.358000 (seconds)               Rate:  2.382370 (tokens/second)
I 00:01:11.803660 executorch:runner.cpp:439]            Generated 119 tokens:   50.396000 (seconds)              Rate:  2.361299 (tokens/second)
I 00:01:11.803837 executorch:runner.cpp:450]    Time to first generated token:  3.767000 (seconds)
I 00:01:11.803922 executorch:runner.cpp:456]    Sampling time over 127 tokens:  140734536254.309998 (seconds)

Llama3-8b

Int4

(Groupwise Quant + XNNPACK + KV-cache + SDPA-custom OP + 4 Threads)

pi@raspberrypi5:~/llama3 $ ~/llama2/lib/ld-linux-aarch64.so.1 \
    --library-path ./lib/:./lib64/:./et_libs/ ./llama_main  \
    --model_path=./llama3_kv_sdpa_xnn_qe_4_32.pte  \
    --tokenizer_path=./tokenizer.model \
    --cpu_threads=4 \
    --prompt="what is the meaning of life"  
                                                                                                                                                                                                               
I 00:00:00.000242 executorch:main.cpp:64] Resetting threadpool with num threads = 4
I 00:00:00.000715 executorch:runner.cpp:50] Creating LLaMa runner: model_path=./llama3_kv_sdpa_xnn_qe_4_32.pte, tokenizer_path=./tokenizer.model
I 00:00:46.844266 executorch:runner.cpp:69] Reading metadata from model
I 00:00:46.844363 executorch:runner.cpp:134] get_vocab_size: 128256
I 00:00:46.844371 executorch:runner.cpp:134] get_bos_id: 128000
I 00:00:46.844375 executorch:runner.cpp:134] get_eos_id: 128001
I 00:00:46.844378 executorch:runner.cpp:134] get_n_bos: 1
I 00:00:46.844382 executorch:runner.cpp:134] get_n_eos: 1
I 00:00:46.844386 executorch:runner.cpp:134] get_max_seq_len: 128
I 00:00:46.844392 executorch:runner.cpp:134] use_kv_cache: 1
I 00:00:46.844396 executorch:runner.cpp:134] use_sdpa_with_kv_cache: 1
I 00:00:46.844400 executorch:runner.cpp:134] append_eos_to_prompt: 0

what is the meaning of life for you?
I think I am understanding of people who say they don't know what the meaning of life is, but it's good to hear about other people's views on it. So, here's my view on it:
I think life is simply existing and living. If you can't live, you can't really live, so you shouldn't exist at all. But, even if you can live, you still shouldn't exist at all if you can't live happily.
Life is about living and living happily. Life is about enjoying it and living

PyTorchObserver {"prompt_tokens":7,"generated_tokens":120,"model_load_start_ms":1713804343168,"model_load_end_ms":1713804390147,"inference_start_ms":1713804390147,"inference_end_ms":1713804451862,"prompt_eval_end_ms":1713804393539,"first_token_ms":1713804394025,"aggregate_sampling_time_ms":140733463036888,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:01:48.694592 executorch:runner.cpp:415]    Prompt Tokens: 7    Generated Tokens: 120
I 00:01:48.694608 executorch:runner.cpp:421]    Model Load Time:                46.979000 (seconds)
I 00:01:48.694625 executorch:runner.cpp:428]    Total inference time:           61.715000 (seconds)              Rate:  1.944422 (tokens/second)
I 00:01:48.694646 executorch:runner.cpp:438]            Prompt evaluation:      3.392000 (seconds)               Rate:  2.063679 (tokens/second)
I 00:01:48.694667 executorch:runner.cpp:447]            Generated 120 tokens:   58.323000 (seconds)              Rate:  2.057507 (tokens/second)
I 00:01:48.694688 executorch:runner.cpp:458]    Time to first generated token:  3.878000 (seconds)
I 00:01:48.694700 executorch:runner.cpp:464]    Sampling time over 127 tokens:  140733463036.888000 (seconds)
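The rates printed in the logs above can be reproduced from the PyTorchObserver timestamps: the generation rate is generated_tokens divided by (inference_end_ms - prompt_eval_end_ms), and the prompt evaluation rate is prompt_tokens divided by (prompt_eval_end_ms - inference_start_ms), with milliseconds converted to seconds. A minimal sketch of that arithmetic, where `tokens_per_second` is a hypothetical helper name:

```shell
# Hypothetical helper: tokens per second over a millisecond interval.
# $1 = token count, $2 = interval start (ms), $3 = interval end (ms).
tokens_per_second() {
    awk -v n="$1" -v t0="$2" -v t1="$3" \
        'BEGIN { printf "%.2f\n", n / ((t1 - t0) / 1000) }'
}
```

For the Llama3-8b run, `tokens_per_second 120 1713804393539 1713804451862` gives 2.06, matching the "Generated 120 tokens" rate in the log.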

Tokens/Second

Model	FP32	Int4
Stories110M	10.0	31.6
Llama2-7b	N/A*	2.3
Llama3-8b	N/A*	2.0

Notes:

This was not a rigorous benchmarking effort
* The FP32 model is too large to fit in the Raspberry Pi 5’s 8GiB RAM
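A back-of-the-envelope estimate of weight size makes the N/A entries concrete. The sketch below assumes nominal parameter counts (7 billion and 8 billion) and counts only the weights themselves: 32 bits each for FP32 and 4 bits each for Int4, ignoring quantization scales, the KV cache, and activations. `weights_gib` is a hypothetical helper name.

```shell
# Hypothetical helper: approximate weight storage in GiB.
# $1 = parameter count, $2 = bits per weight.
weights_gib() {
    awk -v p="$1" -v b="$2" \
        'BEGIN { printf "%.2f\n", p * b / 8 / (1024 * 1024 * 1024) }'
}
```

Under these assumptions, `weights_gib 8000000000 32` gives 29.80 and `weights_gib 7000000000 32` gives 26.08, both far beyond the roughly 8GiB on the Pi, while `weights_gib 8000000000 4` gives 3.73, which fits.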

Takeaways

We showcase running various LLMs with the Llama architecture on a Raspberry Pi 5, and we highlight the potential of ExecuTorch to enable efficient inference of LLMs on resource-constrained edge devices. Try it yourself!

References

[1] I used the main branch of ExecuTorch LLM / Llama2 README.md for Llama3. Stories110M and Llama2 should also work on the main branch. This is the sha I used from the main branch for Llama3. If you encounter any issues, I recommend trying the latest version, as we are rapidly updating the main branch. If you’re only interested in Stories110M or Llama2 and/or are seeking stability, we do have a v0.2 stable release branch.

[2] Setting up ExecuTorch

[3] CMake Toolchain File

[4] aarch64-linux-gnu-gcc
