
Running Large Language Models on Raspberry Pi at the Edge

Introduction

In the fast-growing area of artificial intelligence, edge computing presents an opportunity to decentralize capabilities traditionally reserved for powerful, centralized servers. This tutorial explores the practical integration of small versions of traditional large language models (LLMs) into a Raspberry Pi 5, transforming this edge device into an AI hub capable of real-time, on-site data processing.

Our goal is to explore the installation and utilization of Ollama, an open-source framework that allows us to run LLMs locally on our machines (which can be desktops or edge devices such as Raspberry Pis or NVIDIA Jetsons). Ollama is designed to be efficient, scalable, and easy to use, making it a good option for deploying AI models such as Llama, Phi, LLaVA (multimodal), and TinyLlama. We will integrate some of those models into projects using Python's ecosystem, exploring their potential in real-world scenarios (or at least pointing in that direction).

Hardware Setup

The Raspberry Pi 5 is a robust platform that substantially upgrades the previous version 4, equipped with the Broadcom BCM2712, a 2.4GHz quad-core 64-bit Arm Cortex-A76 CPU featuring Cryptographic Extension and enhanced caching capabilities. It boasts a VideoCore VII GPU, dual 4Kp60 HDMI® outputs with HDR, and a 4Kp60 HEVC decoder. Memory options include 4GB and 8GB (our choice to run LLMs) of high-speed LPDDR4X SDRAM. Connectivity is comprehensive, offering dual-band 802.11ac Wi-Fi®, Bluetooth 5.0/BLE, Gigabit Ethernet with PoE+ support, and various USB ports. It also features expandable storage via a microSD card slot, a PCIe 2.0 interface for fast peripherals, and a 5V/5A DC power input through USB-C with Power Delivery. Additional features include a real-time clock, power button, and the standard 40-pin header, making it a versatile choice for real-world applications.

By the way, as Alasdair Allan discussed, custom accelerator hardware may no longer be needed for some inferencing tasks at the edge, as inferencing directly on the Raspberry Pi 5 CPU—with no GPU acceleration—is now on par with the performance of the Coral TPU.

For more info, please see the complete article: Benchmarking TensorFlow and TensorFlow Lite on Raspberry Pi 5.

Raspberry Pi Active Cooler

We should install an Active Cooler, a dedicated clip-on cooling solution for Raspberry Pi 5 (Rasp5), for this project. It combines an aluminum heatsink with a temperature-controlled blower fan to keep the Rasp5 operating comfortably under heavy loads, such as running LLMs.

The Active Cooler has pre-applied thermal pads for heat transfer and is mounted directly to the Raspberry Pi 5 board using spring-loaded push pins. The Raspberry Pi firmware actively manages it: at 60°C, the blower’s fan will be turned on; at 67.5°C, the fan speed will be increased; and finally, at 75°C, the fan increases to full speed. The blower’s fan will spin down automatically when the temperature drops below these limits.

To prevent overheating, all Raspberry Pi boards begin to throttle the processor when the temperature reaches 80°C and throttle even further when it reaches the maximum temperature of 85°C. More details here.

Install the Operating System

To use the Raspberry Pi, we will need an operating system. By default, the Raspberry Pi checks for an operating system on any SD card inserted in its slot, so we should install one using the Raspberry Pi Imager.

Raspberry Pi Imager is a tool for downloading and writing images on macOS, Windows, and Linux. It includes many popular operating system images for Raspberry Pi. We will also use the Imager to preconfigure credentials and remote access settings.

After downloading the Imager and installing it on your computer, use a new and empty SD card. Select the device (RASPBERRY PI 5), the Operating System (RASPBERRY PI OS (64-BIT)), and your Storage Device:

To use Ollama, we should use the 64-bit version.

We should also define options such as the hostname, username, password, and LAN configuration (on the GENERAL tab) and, more importantly, enable SSH on the SERVICES tab.

You can access the terminal of a Raspberry Pi remotely from another computer on the same network using the Secure SHell (SSH) protocol.

After burning the OS to the SD card, insert it into the Rasp5's SD card slot and plug in the 5V power source.

Interacting with the Rasp5 via SSH

The easiest way to interact with the Rasp5 is via SSH ("headless"). You can use a terminal (Mac/Linux) or PuTTY (Windows).

On the terminal, type:

ssh mjrovai@rpi-5.local

You should replace mjrovai with your username and rpi-5 with the hostname chosen during setup.

When you see the prompt:

mjrovai@rpi-5:~ $

It means that you are interacting remotely with your Rasp5.

It is a good practice to update the system regularly. For that, you should run:

sudo apt-get update

Pip is a tool for installing external Python modules on a Raspberry Pi. However, in recent OS versions, installing packages system-wide with pip is blocked by default (the Python environment is "externally managed"). To allow it, you should run the following command (only once):

sudo rm /usr/lib/python3.11/EXTERNALLY-MANAGED

Running the Raspberry Pi Desktop version remotely

Since we installed the full Raspberry Pi OS, we can also interact with its desktop remotely. For that, we should enable the VNC server via the terminal.

Run the Raspi5 configurator via the command:

sudo raspi-config

and select 3 Interface Options:

Once there, select I2 VNC:

You will receive a pop-up asking for confirmation. Press <Yes>, then <Ok>. You should return to the main screen and, using the tab key, select <Finish>.

Now, you should install a VNC Viewer on your computer. Once installed, you should confirm the Rasp5 IP address. For example, on the terminal, you can use:

hostname -I

Run the VNC Viewer, start a New Connection, and give a name (for example, RPi-5):

You should receive a new popup asking to confirm your credentials (the Rasp5 username and password):

And that is it! The Raspberry Pi 5 Desktop should appear on your computer monitor.

By default, the desktop's resolution is not high, but you can change it by opening the menu (the Raspberry icon in the upper left corner) and selecting the best screen resolution for your monitor:

Transfer Files between the Rasp5 and the desktop

Transferring files between the Rasp5 and our main computer can be done using a pen drive or an FTP program over the network. For the latter, let's use the FileZilla FTP client.

Follow the instructions and install the program for your Desktop OS.

As before, it is necessary to know the Rasp5's IP address; on the terminal, use:

hostname -I

Use this IP address as the Host in FileZilla:

sftp://192.168.4.209

and enter your Rasp5 Username and Password. Pressing Quickconnect will open two separate windows, one for your desktop (right) and another for the Rasp5 (left).

Ollama

We will not go into detail here about what Ollama is and how it works under the hood, but you should watch this short video from Matt Williams about Ollama:

Installing Ollama

Run the command below to install Ollama:

curl https://ollama.ai/install.sh | sh

As a result, an API will run in the background on 127.0.0.1:11434 (in my case). From now on, you can run Ollama via the terminal.
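Besides the terminal, the same background service can be reached over its local REST API. Below is a minimal sketch (not part of the original tutorial) using the requests library; it assumes the default address 127.0.0.1:11434 and a model that has already been pulled (we install TinyLlama in the next section).

# Sketch: query the local Ollama REST API directly (assumes the default port
# and that the "tinyllama" model has already been pulled).
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "tinyllama",
          "prompt": "What is the capital of France?",
          "stream": False},   # return a single JSON object instead of a token stream
    timeout=300,
)
print(resp.json()["response"])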

Unfortunately, Ollama does not use the Rasp5's GPU to speed up inference. We hope this will change in the future and that new devices, such as the Raspberry Pi AI Kit, can be used.

It is possible to verify that Ollama is correctly installed by running:

ollama list

You should see an empty list since we have not installed any model yet. Let's install and run our first model, TinyLlama, a 1.1B-parameter Llama-style model pre-trained on 3 trillion tokens.

TinyLlama is a tiny but still capable language model that can be useful for many applications, such as deployment on edge devices with restricted memory and computational capacity. Its 4-bit-quantized weights take up only 637 MB, enabling functionalities like real-time machine translation without an internet connection, real-time dialogue generation in video games, and uses in home automation.

Install the Model

ollama pull tinyllama

Now, if you run again:

ollama list

you should see that a new model with a size of 636MB has been installed.

Run the Model

ollama run tinyllama

Running the model with the command above opens the Ollama prompt, where we can input a question and start chatting with the LLM:

Examples:

>>> What is the capital of France?
The capital of France is Paris. The official name of the city is “Paris”, and its former name was “City of Lights”. However, it’s not a common practice to use “City of Lights” as the official name of Paris in French. Instead, it is more commonly referred to by its abbreviated name, “Paris”.

>>> What is the distance between Paris and Santiago, Chile?
The distance between Paris and Santiago, Chile is approximately 7,042 miles or 11,329 kilometers. It takes around 18 hours and 56 minutes to travel by car or plane. The flight distance may vary depending on the airline and other factors.

>>> what is the latitude and longitude of Paris?

The latitude and longitude of Paris are as follows:

Latitude: 48.853° to 49.624°

Longitude: -0.17° to -0.83°

These coordinates can be used to find the distance between Paris and various points on the Earth’s surface, such as cities, landmarks, and other locations.

It is amazing how such a small model works completely offline on an edge device like the Rasp5. The response speed is very good, considering that we are using only a CPU.

We can also get statistics about model inference performance using --verbose when calling the model.

First, we should stop using the model:

>>> /bye

and run it again:

ollama run tinyllama --verbose

For example, we asked again what the capital of France is. This time, the model gave us only a short answer, which was different from before but still correct. Also, we received several statistics about the performance and timing metrics for the language model (LLM) run on a Raspberry Pi.

Each metric gives insights into how the model processes inputs and generates outputs. Here’s a breakdown of what each metric means:

  • Total Duration (3.761616043s): This is the complete time taken from the start of the command to the completion of the response. It encompasses loading the model, processing the input prompt, and generating the response.
  • Load Duration (1.319068ms): This very short duration indicates the time to load the model or necessary components into memory. Given its brevity, it suggests that the model was preloaded or that only a minimal setup was required.
  • Prompt Eval Count (41 tokens): The number of tokens in the input prompt. In NLP, tokens are typically words or subwords, so this count includes all the tokens that the model evaluated to understand and respond to the query.
  • Prompt Eval Duration (3.016907s): This measures the model’s time to evaluate or process the input prompt. It accounts for the bulk of the total duration, implying that understanding the query and preparing a response is the most time-consuming part of the process.
  • Prompt Eval Rate (13.59 tokens/s): This rate indicates how quickly the model processes tokens from the input prompt. It reflects the model’s speed in terms of natural language comprehension.
  • Eval Count (8 tokens): This is the number of tokens in the model’s response, which in this case was, “The capital of France is Paris.”
  • Eval Duration (613.67ms): This is the time taken to generate the output based on the evaluated input. It’s much shorter than the prompt evaluation, suggesting that generating the response is less complex or computationally intensive than understanding the prompt.
  • Eval Rate (13.04 tokens/s): Similar to the prompt eval rate, this indicates the speed at which the model generates output tokens. It’s a crucial metric for understanding the model’s efficiency in output generation.

This detailed breakdown can help understand the computational demands and performance characteristics of running LLMs like TinyLlama on edge devices like the Raspberry Pi 5. It shows that while prompt evaluation is more time-consuming, the actual generation of responses is relatively quicker. This analysis is crucial for optimizing performance and diagnosing potential bottlenecks in real-time applications.
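As a quick sanity check (not in the original text), each reported rate is simply the token count divided by the corresponding duration:

prompt_eval_count = 41            # tokens in the prompt (from the --verbose output above)
prompt_eval_duration = 3.016907   # seconds
eval_count = 8                    # tokens generated
eval_duration = 0.61367           # seconds

print(f"Prompt eval rate: {prompt_eval_count / prompt_eval_duration:.2f} tokens/s")  # ~13.59
print(f"Eval rate:        {eval_count / eval_duration:.2f} tokens/s")                # ~13.04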

Inspecting local resources

Using htop, you can easily monitor the resources running on your device. Enter the command at the terminal:

htop

After a reboot, we can see that the system is using 424MB of its 8GB of RAM, and the four CPU cores are not doing much work. Let's run TinyLlama again.

ollama run tinyllama --verbose

Memory usage is up around 100MB.

Let's start a chat and look at htop again:

We can see that all 4 CPUs are running at 100% of their capacity.

It is also important to monitor the temperature. You can have the temperature shown on the taskbar:

When idle, the temperature is around 48°C with the CPU at 1%. During inference, with the CPU at 100%, the temperature rose to 62°C.

This is OK and means the active cooler is working, keeping the temperature below 80°C/85°C (the throttling limits).
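If you prefer to log the temperature from a script instead of the taskbar widget, a minimal sketch (assuming the standard Raspberry Pi OS sysfs path) could look like this:

import time

def cpu_temp_c() -> float:
    # The thermal zone file reports the CPU temperature in millidegrees Celsius.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0

# Print a reading every two seconds while a model is running in another terminal.
for _ in range(5):
    print(f"CPU temperature: {cpu_temp_c():.1f} °C")
    time.sleep(2)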

Pulling a bigger model

Let's pull a bigger (still small) model, Phi-3, a family of lightweight, state-of-the-art open models by Microsoft, available as 3.8B (Mini) and 14B (Medium) versions. Let's run the Mini model:

ollama run phi3 --verbose

Note that instead of first using the ollama pull <model> command, we use run directly. Ollama keeps track of the pulled models, and since phi3 does not yet exist locally, Ollama pulls it before running it.

Let’s enter with the same prompt used before:

>>> What is the capital of France?

The capital of France is Paris. It is not only the country’s largest city but also a global center for art, fashion, gastronomy, and culture. Its 19th-century metropolitan population within its administrative limits stands at 2, 175, 601 (at urban level in 2018). Paris is located in northern central France and is the most populous city in the country. The city proper has an area of 105 square kilometers (41 sq mi) and a total population of about 2,273,961 residents as of January 2019.

The answer was a little "verbose," with an "eval rate" of 3.51 tokens/s (3 to 4 times slower than TinyLlama), but we can specify a better prompt:

>>> What is the capital of France? Give me a short answer with less than 10 tokens.

Paris.

In this case, the eval rate was 4.99 tokens/s, but the overall time was almost the same: 44 s.

Choosing the most appropriate prompt is one of the most important skills to be used with LLMs, no matter its size.

We can use such a model as an important assistant since its speed is decent.

Multimodal Models

Multimodal models are artificial intelligence (AI) systems that can process and understand information from multiple sources, such as images, text, audio, and video. In our context, multimodal LLMs can process various inputs, including text, images, and audio, as prompts and convert those prompts into various outputs, not just the source type.

We will work here with LLaVA-Phi-3-mini, a fine-tuned LLaVA model from Phi 3 Mini 4k. It has strong performance benchmarks that are on par with the original LLaVA (Large Language and Vision Assistant) model.

The LLaVA-Phi-3-mini is an end-to-end trained large multimodal model designed to understand and generate content based on visual inputs (images) and textual instructions. It combines the capabilities of a visual encoder and a language model to process and respond to multimodal inputs.

Let’s install the model:

ollama run llava-phi3 --verbose

Let’s start with a text input:

>>> You are a helpful AI assistant. What is the capital of France?

The capital of France is Paris.

The response took 6.4s, with an eval rate of 4.5 tokens/s! Not bad!

But now let's enter an image as input. For that, let's create a working directory:

cd Documents/
mkdir OLLAMA
cd OLLAMA

Let’s download an image from the internet, for example (Wikipedia: Paris, France):

A VGA version, named image_test_1.jpg, was downloaded into the OLLAMA directory. We should copy the full image path (using Copy Path(s) in the file manager):

/home/mjrovai/Documents/OLLAMA/image_test_1.jpg

Let’s enter with this prompt:

>>> Describe the image /home/mjrovai/Documents/OLLAMA/image_test_1.jpg

The result was great, but the overall latency was significant: almost 4 minutes to perform the inference.

Let's try to fool the model. Download an image of the Eiffel Tower in Las Vegas, Nevada, US, and name it image_test_2.jpg:

Let’s enter with this prompt:

>>> Describe the image /home/mjrovai/Documents/OLLAMA/image_test_2.jpg

The image description is good, but unfortunately, the model was fooled.

It is possible that the water plus the tower made the model describe it as Paris. Let’s try another image for the same place (Las Vegas) but using another angle and without water:

Great! The model understood the image correctly.

Regarding resources, we are using 2.7GB of RAM, with all four CPU cores at full speed and a temperature of 74°C.

Ollama Python Library

So far, we have explored LLMs’ chat capability using the command line on a terminal. However, we want to integrate those models into our projects, so Python seems to be the right path. The good news is that Ollama has such a library.

The Ollama Python library simplifies interaction with advanced LLM models, enabling more sophisticated responses and capabilities, besides providing the easiest way to integrate Python 3.8+ projects with Ollama.

For a better understanding of how to create apps using Ollama with Python, I strongly recommend Matt Williams's videos, such as the one below:

Installation:

pip install ollama

We will need a text editor or an IDE to create a Python script. The Raspberry Pi OS Desktop already includes Thonny and Geany by default (accessible via [Menu][Programming]). If you prefer another IDE, such as Visual Studio Code, you can download it from [Menu][Preferences][Recommended Software]. When the window pops up, go to [Programming], select the option of your choice, and press [Apply].

Let's start with a very simple script:

import ollama

MODEL = 'llava-phi3'
PROMPT = 'What is the capital of France?'

res = ollama.generate(model=MODEL, prompt=PROMPT)
print(res)

Save it as, for example, test_ollama.py. We can use the IDE to run it or do it directly on the terminal:

python test_ollama.py

As a result, we will have the model response in JSON format:

{
'model': 'llava-phi3',
'created_at': '2024-06-13T15:50:13.496874744Z',
'response': ' The capital of France is Paris. It is also the largest city in the country and serves as a major center for culture, politics, and economics. The French government, including both the executive branch led by the President and the legislative branch with the National Assembly and Senate, are located in Paris. The official residence of the president is at the Élysée Palace, while the French Parliament meets in the Palais Bourbon.',
'done': True,
'done_reason': 'stop',
'context': [32010, 1724, 338, 278, 7483, 310, 3444, 29973, 32007, 32001, 450, 7483, 310, 3444, 338, 3681, 29889, 739, 338, 884, 278, 10150, 4272, 297, 278, 4234, 322, 19700, 408, 263, 4655, 4818, 363, 9257, 29892, 22661, 29892, 322, 7766, 1199, 29889, 450, 5176, 5874, 29892, 3704, 1716, 278, 22760, 5443, 5331, 491, 278, 7178, 322, 278, 13332, 1230, 5443, 411, 278, 3086, 13266, 322, 18148, 29892, 526, 5982, 297, 3681, 29889, 450, 6221, 25488, 310, 278, 6673, 338, 472, 278, 3067, 368, 29879, 1318, 24537, 29892, 1550, 278, 5176, 13212, 28103, 297, 278, 3793, 1759, 12340, 6718, 29889, 32007],
'total_duration': 23629377147,
'load_duration': 1122936,
'prompt_eval_duration': 244490000,
'eval_count': 89,
'eval_duration': 23254762000
}

As we can see, several pieces of information are generated, such as:

  • 'context': the token IDs representing the input and context used by the model. Tokens are numerical representations of text used for processing by the language model.
  • 'response': the main output text generated by the model in response to our prompt.

The Performance Metrics:

  • total_duration: The total time taken for the operation in nanoseconds. In this case, approximately 23.63 seconds.
  • load_duration: The time taken to load the model or components in nanoseconds. About 1.12 milliseconds.
  • prompt_eval_duration: The time taken to evaluate the prompt in nanoseconds. Around 0.24 seconds.
  • eval_count: The number of tokens evaluated during the generation. Here, 89 tokens.
  • eval_duration: The time taken for the model to generate the response in nanoseconds. Approximately 23.25 seconds.

But what we want is the plain 'response' and, perhaps for analysis, the total duration of the inference, so let's change the code to extract them from the JSON:

import ollama

MODEL = 'llava-phi3'
PROMPT = 'What is the capital of France?'

res = ollama.generate(model=MODEL, prompt=PROMPT)
print(f"\n{res['response']}")
print(f"\n [INFO] Total Duration: {(res['total_duration']/1e9):.2f} seconds")

Now, we got:

The capital of France is Paris. It is also the largest city in the country and serves as a major cultural, economic, and political center. Paris is known for its rich history, beautiful architecture, and world-renowned landmarks such as the Eiffel Tower, Notre Dame Cathedral, and Louvre Museum. The city’s population is around 2.14 million people, with an additional 8.6 million in the surrounding metropolitan area. Paris attracts millions of tourists each year, who come to explore its famous museums, art galleries, restaurants, and vibrant nightlife scene.

[INFO] Total Duration: 35.34 seconds
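The response above took over 30 seconds to arrive all at once. The ollama library also supports streaming, which prints tokens as they are produced and makes the wait feel much shorter; here is a minimal sketch (not from the original article) using the stream flag:

import ollama

MODEL = 'llava-phi3'
PROMPT = 'What is the capital of France?'

# With stream=True, the call returns an iterator of partial results.
for chunk in ollama.generate(model=MODEL, prompt=PROMPT, stream=True):
    print(chunk['response'], end='', flush=True)
print()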

Using Ollama.chat()

Another way to get our response is to use ollama.chat():

import ollama

MODEL = 'llava-phi3'
PROMPT_1 = 'What is the capital of France?'

response = ollama.chat(model=MODEL, messages=[
    {'role': 'user', 'content': PROMPT_1},
])
resp_1 = response['message']['content']
print(f"\n{resp_1}")
print(f"\n [INFO] Total Duration: {(response['total_duration']/1e9):.2f} seconds")

The capital of France is Paris. It is a major European city known for its culture, history, and architecture.

[INFO] Total Duration: 6.27 seconds

An important consideration is that with ollama.generate(), the response is cleared from the model's "memory" after the inference ends (it is used only once), but if we want to keep a conversation going, we must use ollama.chat(). Let's see it in action:

import ollama

MODEL = 'llava-phi3'

PROMPT_1 = 'What is the capital of France?'
response = ollama.chat(model=MODEL, messages=[
    {'role': 'user', 'content': PROMPT_1},
])
resp_1 = response['message']['content']
print(f"\n{resp_1}")
print(f"\n [INFO] Total Duration: {(response['total_duration']/1e9):.2f} seconds")

PROMPT_2 = 'and of Italy?'
response = ollama.chat(model=MODEL, messages=[
    {'role': 'user', 'content': PROMPT_1},
    {'role': 'assistant', 'content': resp_1},
    {'role': 'user', 'content': PROMPT_2},
])
resp_2 = response['message']['content']
print(f"\n{resp_2}")
print(f"\n [INFO] Total Duration: {(response['total_duration']/1e9):.2f} seconds")

In the above code, we are running two queries, and the second prompt considers the result of the first one.

Here is how the model responded:
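The same pattern scales naturally to a full conversation: keep appending the user and assistant turns to the messages list so that every new prompt sees the whole history. A minimal interactive loop (a sketch, not part of the original article) could look like this:

import ollama

MODEL = 'llava-phi3'
messages = []  # accumulated conversation history

while True:
    prompt = input(">>> ")
    if prompt.strip() in ('/bye', 'exit', ''):
        break
    messages.append({'role': 'user', 'content': prompt})
    response = ollama.chat(model=MODEL, messages=messages)
    answer = response['message']['content']
    messages.append({'role': 'assistant', 'content': answer})
    print(f"\n{answer}\n")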

Getting an image description:

In the same way that we used the LLaVA-Phi-3 model on the command line to analyze an image, we can do it with Python. Let's use the same image of Paris, but now with ollama.generate():

import ollama

MODEL = 'llava-phi3'
PROMPT = "Describe this picture"

with open('image_test_1.jpg', 'rb') as image_file:
    img = image_file.read()

response = ollama.generate(
    model=MODEL,
    prompt=PROMPT,
    images=[img]
)
print(f"\n{response['response']}")
print(f"\n [INFO] Total Duration: {(response['total_duration']/1e9):.2f} seconds")

Here is the result:

The model took almost 4 minutes to return with a detailed image description.

Function Calling

So far, we have seen that with the model's answer ("response") assigned to a variable, we can easily work with it and integrate it into real-world projects. OR NOT??

It works in theory, but a big problem is that the model responds differently each time to the same prompt. Let's say that what we want as the model's response in the last examples is only the name of the capital of a given country, nothing more. For that, we can use Ollama's function calling, which is fully compatible with the OpenAI API.

But what exactly is “function calling”?

In modern artificial intelligence, Function Calling with Large Language Models (LLMs) allows these models to perform actions beyond generating text. By integrating with external functions or APIs, LLMs can access real-time data, automate tasks, and interact with various systems.

For instance, instead of merely responding to a query about the weather, an LLM can call a weather API to fetch the current conditions and provide accurate, up-to-date information. This capability enhances the relevance and accuracy of the model’s responses and makes it a powerful tool for driving workflows and automating processes, transforming it into an active participant in real-world applications.
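To make the idea concrete, here is a deliberately simplified sketch (the get_weather helper and the JSON prompt format are hypothetical, and this is not the approach used in the project below): the model is asked to emit a structured "call", and ordinary Python code decides which function to execute.

import json
import ollama

def get_weather(city: str) -> str:
    # Placeholder: a real application would call a weather API here.
    return f"22°C and clear in {city}"

MODEL = 'llava-phi3'
PROMPT = ('Answer ONLY with a JSON object like {"function": "get_weather", "city": "<name>"} '
          'for this question: What is the weather in Paris?')

res = ollama.generate(model=MODEL, prompt=PROMPT)
try:
    call = json.loads(res['response'])   # may fail if the model does not return clean JSON
    if call.get('function') == 'get_weather':
        print(get_weather(call['city']))
except (json.JSONDecodeError, KeyError):
    print("The model did not return a valid function call:", res['response'])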

For more details, please see this great video made by Marvin Prison:

Let’s create a project.

We want to create an app where the user enters a country's name and gets, as output, the distance in km between that country's capital city and the app's location (for simplicity, I will use my location, Santiago, Chile, as the reference point).

Once the user enters a country name, the model will return the name of its capital city (as a string) and the latitude and longitude of that city (as floats). Using those coordinates, we can use a simple Python library (haversine) to calculate the distance between those two points.
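As a quick illustration of the haversine library (installed below), the call takes two (latitude, longitude) tuples; with approximate coordinates for Paris, it returns a figure in the same range as the results shown later in this section:

from haversine import haversine

santiago = (-33.33, -70.51)   # the reference point used in this project
paris = (48.86, 2.35)         # approximate coordinates of Paris
print(f"{haversine(santiago, paris, unit='km'):.0f} km")  # on the order of 11,600 km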

The idea of this project is to demonstrate a combination of language model interaction (IA), structured data handling with Pydantic, and geospatial calculations using the Haversine formula (traditional computing).

First, let's install some libraries. Besides Haversine, the main one is the OpenAI Python library, which provides convenient access to the OpenAI REST API from any Python 3.7+ application. The other is Pydantic (together with Instructor), a powerful data validation and settings management library for Python, engineered to enhance the robustness and reliability of your codebase. In short, Pydantic will help ensure that our model's response is always consistent.

pip install haversine
pip install openai
pip install pydantic
pip install instructor

Now, we should create a Python script designed to interact with our model (LLM) to determine the coordinates of a country’s capital city and calculate the distance from Santiago de Chile to that capital.

Let’s go over the code:

1. Importing Libraries

import sys
from haversine import haversine
from openai import OpenAI
from pydantic import BaseModel, Field
import instructor
  • sys: Provides access to system-specific parameters and functions. It’s used to get command-line arguments.
  • haversine: A function from the haversine library that calculates the distance between two geographic points using the Haversine formula.
  • OpenAI: A module for interacting with the OpenAI API (although it’s used in conjunction with a local setup, Ollama). Everything is off-line here.
  • pydantic: Provides data validation and settings management using Python-type annotations. It’s used to define the structure of expected response data.
  • instructor: A module used to patch the OpenAI client so that it returns structured output matching a Pydantic model.

2. Defining Input and Model

country = sys.argv[1]  # Get the country from command-line arguments
MODEL = 'llava-phi3' # The name of the model to be used
mylat = -33.33 # Latitude of Santiago de Chile
mylon = -70.51 # Longitude of Santiago de Chile
  • country: The script expects a country name as a command-line argument.
  • MODEL: Specifies the model being used, which is llava-phi3.
  • mylat and mylon: Coordinates of Santiago de Chile, used as the starting point for the distance calculation.

3. Defining the Response Data Structure

class CityCoord(BaseModel):
    city: str = Field(..., description="Name of the city")
    lat: float = Field(..., description="Decimal Latitude of the city")
    lon: float = Field(..., description="Decimal Longitude of the city")
  • CityCoord: A Pydantic model that defines the expected structure of the response from the LLM. It expects three fields: city (name of the city), lat (latitude), and lon (longitude).
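To see what this model enforces, here is a small standalone check (not part of the original script) that reuses the same class definition; invalid data raises a ValidationError, which is what allows instructor to retry the LLM call:

from pydantic import BaseModel, Field, ValidationError

class CityCoord(BaseModel):
    city: str = Field(..., description="Name of the city")
    lat: float = Field(..., description="Decimal Latitude of the city")
    lon: float = Field(..., description="Decimal Longitude of the city")

print(CityCoord(city="Paris", lat=48.86, lon=2.35))   # valid: types are checked and coerced

try:
    CityCoord(city="Paris", lat="not a number", lon=2.35)
except ValidationError as err:
    print("Rejected:", err.errors()[0]['msg'])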

4. Setting Up the OpenAI Client

client = instructor.patch(
    OpenAI(
        base_url="http://localhost:11434/v1",  # Local API base URL (Ollama)
        api_key="ollama",                      # API key (required but not used)
    ),
    mode=instructor.Mode.JSON,                 # Mode for structured JSON output
)
  • OpenAI: This setup initializes an OpenAI client with a local base URL and an API key (ollama). It uses a local server.
  • instructor.patch: Patches the OpenAI client to work in JSON mode, enabling structured output that matches the Pydantic model.

5. Generating the Response

resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": f"return the decimal latitude and decimal longitude of the capital of the {country}."
        }
    ],
    response_model=CityCoord,
    max_retries=10
)
  • client.chat.completions.create: Calls the LLM to generate a response.
  • model: Specifies the model to use (llava-phi3).
  • messages: Contains the prompt for the LLM, asking for the latitude and longitude of the capital city of the specified country.
  • response_model: Indicates that the response should conform to the CityCoord model.
  • max_retries: The maximum number of retry attempts if the request fails.

6. Calculating the Distance

distance = haversine((mylat, mylon), (resp.lat, resp.lon), unit="km")
print(f"Santiago de Chile is about {int(round(distance, -1)):,} kilometers away from {resp.city}.")
  • haversine: Calculates the distance between Santiago de Chile and the capital city returned by the LLM using their respective coordinates.
  • (mylat, mylon): Coordinates of Santiago de Chile.
  • (resp.lat, resp.lon): Coordinates of the capital city, as provided by the LLM response.
  • unit="km": Specifies that the distance should be calculated in kilometers.
  • print: Outputs the distance, rounded to the nearest 10 kilometers, with a thousands separator for readability.
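A quick check of that rounding and formatting (with a hypothetical raw distance):

distance = 8057.3                       # hypothetical raw haversine result in km
print(f"{int(round(distance, -1)):,}")  # prints 8,060 (nearest 10 km, thousands separator)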

Running the code

We can note that we always receive the same structured information:

Santiago de Chile is about 8,060 kilometers away from Washington, D.C.

Santiago de Chile is about 4,250 kilometers away from Bogotá.

Santiago de Chile is about 11,630 kilometers away from Paris.

And the calculations are pretty good!

Adding images

Now it is time to wrap up everything so far! Let's modify the "app" so that instead of entering the country name (as text), the user enters an image, and the LLM returns the city shown in the image and its geographic location. With that data, we can calculate the distance as before.

For simplicity, we will implement this new code in two steps. First, the LLM will analyze the image and create a description (text). This text will then be passed to a second call, where the model will extract the structured information we need (city, country, and coordinates).

Here is the code:

import sys
from haversine import haversine
import ollama
from openai import OpenAI
from pydantic import BaseModel, Field
import instructor
import time

start_time = time.perf_counter()  # Start timing

img = sys.argv[1]
MODEL = 'llava-phi3'
mylat = -33.33
mylon = -70.51

def image_description(img):
    with open(img, 'rb') as file:
        response = ollama.chat(
            model=MODEL,
            messages=[
                {
                    'role': 'user',
                    'content': 'return the decimal latitude and decimal longitude of the city in the image, its name, and what country it is located',
                    'images': [file.read()],
                },
            ],
            options={
                'temperature': 0,
            }
        )
    # print(response['message']['content'])
    return response['message']['content']

class CityCoord(BaseModel):
    city: str = Field(..., description="Name of the city in the image")
    country: str = Field(..., description="Name of the country where the city in the image is located")
    lat: float = Field(..., description="Decimal Latitude of the city in the image")
    lon: float = Field(..., description="Decimal Longitude of the city in the image")

# enables `response_model` in create call
client = instructor.patch(
    OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",
    ),
    mode=instructor.Mode.JSON,
)

image_description = image_description(img)

# Send this description to the model
resp = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "user",
            "content": image_description,
        }
    ],
    response_model=CityCoord,
    max_retries=10,
    temperature=0,
)

distance = haversine((mylat, mylon), (resp.lat, resp.lon), unit="km")

print(f"\n The image shows {resp.city}, with lat:{round(resp.lat, 2)} and long: {round(resp.lon, 2)}, located in {resp.country} and about {int(round(distance, -1)):,} kilometers away from Santiago, Chile.\n")

end_time = time.perf_counter()  # End timing
elapsed_time = end_time - start_time  # Calculate elapsed time
print(f" [INFO] ==> The code (running {MODEL}), took {elapsed_time} seconds to execute.\n")

Let’s enter with an image:

The result was great!

The image shows Machu Picchu, with lat:-13.16 and long: -72.54, located in Peru and about 2,250 kilometers away from Santiago, Chile.

How about Paris?

The image shows Paris, with lat:48.86 and long: 2.35, located in France and about 11,630 kilometers away from Santiago, Chile.

Of course, there are many ways to optimize the code used here, but the whole idea of this tutorial was to call attention to the huge potential of LLMs running at the edge.

Going Further

The small LLM models tested worked well at the edge, both with text and with images, though with high latency in the latter case. A combination of specific, dedicated models can lead to better results; for example, in real use cases, an object detection model (such as YOLO) can produce a general description and count of the objects in an image, which, once passed to an LLM, can help extract important insights and actions.

According to Avi Baum, CTO at Hailo,

In the vast landscape of artificial intelligence (AI), one of the most intriguing journeys has been the evolution of AI on the edge. This journey has taken us from classic machine vision to the realms of discriminative AI, enhancive AI, and now, the groundbreaking frontier of generative AI. Each step has brought us closer to a future where intelligent systems seamlessly integrate with our daily lives, offering an immersive experience of not just perception but also creation at the palm of our hand.

Speaking of Hailo, we should soon see a new product, the Hailo-10H AI processor. This device promises to deliver up to 40 tera-operations per second (TOPS), significantly outperforming other edge AI processors. It enables edge devices such as the RPi5 to run deep learning applications at full scale more efficiently and effectively than traditional solutions while significantly lowering costs. So, we should see this device processing generative AI models, including LLMs, with minimal CPU/GPU load and consuming less than 3.5W. The Hailo-10H will use a standard M.2 form factor compatible with the Raspberry Pi 5, like the newly announced Raspberry Pi AI Kit (which uses the Hailo-8L co-processor):

Conclusion

This tutorial has demonstrated how a Raspberry Pi 5 can be transformed into a potent AI hub capable of running large language models (LLMs) for real-time, on-site data analysis and insights using Ollama and Python. The Raspberry Pi’s versatility and power, coupled with the capabilities of lightweight LLMs like TinyLlama and LLaVa-Phi-3-mini, make it an excellent platform for edge computing applications.

The potential of running LLMs on the edge extends far beyond simple data processing, as in this tutorial’s examples. Here are some innovative suggestions for using this project:

1. Smart Home Automation:

  • Integrate LLMs to interpret voice commands or analyze sensor data for intelligent home automation. This could include real-time monitoring and control of home devices, security systems, and energy management, all processed locally without relying on cloud services.

2. Field Data Collection and Analysis:

  • Deploy LLMs on Raspberry Pi in remote or mobile setups for real-time data collection and analysis. This can be used in agriculture for monitoring crop health, in environmental studies for wildlife tracking, or in disaster response for situational awareness and resource management.

3. Educational Tools:

  • Create interactive educational tools that leverage LLMs to provide instant feedback, language translation, and tutoring. This can be particularly useful in developing regions with limited access to advanced technology and internet connectivity.

4. Healthcare Applications:

  • Use LLMs for medical diagnostics and patient monitoring, providing real-time analysis of symptoms and suggesting potential treatments. This can be integrated into telemedicine platforms or portable health devices.

5. Local Business Intelligence:

  • Implement LLMs in retail or small business environments to analyze customer behavior, manage inventory, and optimize operations. The ability to process data locally ensures privacy and reduces dependency on external services.

6. Industrial IoT:

  • Integrate LLMs into industrial IoT systems for predictive maintenance, quality control, and process optimization. The Raspberry Pi can serve as a localized data processing unit, reducing latency and improving the reliability of automated systems.

7. Autonomous Vehicles:

  • Use LLMs to process sensory data from autonomous vehicles, enabling real-time decision-making and navigation. This can be applied to drones, robots, and self-driving cars for enhanced autonomy and safety.

8. Cultural Heritage and Tourism:

  • Implement LLMs to provide interactive and informative guides for cultural heritage sites and museums. Visitors can use these systems to get real-time information and insights, enhancing their experience without internet connectivity.

9. Artistic and Creative Projects:

  • Use LLMs to analyze and generate creative content, such as music, art, and literature. This can foster innovative projects in the creative industries and allow for unique interactive experiences in exhibitions and performances.

10. Customized Assistive Technologies:

  • Develop assistive technologies for individuals with disabilities, providing personalized and adaptive support through real-time text-to-speech, language translation, and other accessible tools.

Knowing more

If you want to learn more about AI, LLMs, and Embedded Machine Learning (TinyML), please see these references:

On the TinyML4D website, you can find lots of educational materials on TinyML. They are all free and open-source for educational use; we ask that if you use the material, please cite it! TinyML4D is an initiative to make TinyML education available to everyone globally.

That’s all, folks!

As always, I hope this project can help others find their way into the exciting world of AI!
