Large language models (LLMs) are artificial intelligence systems capable of understanding and generating human-like text at scale. These models are trained on massive amounts of text data, typically billions of words, enabling them to learn complex patterns and relationships between words, sentences, and concepts.
Hugging Face Inference Endpoints are lightweight, self-contained deployments designed for serving machine learning models with low latency and high throughput. They let developers easily deploy and manage models built with the open-source Hugging Face Transformers library, without requiring expertise in cloud computing or DevOps.
This blog walks you through the process of deploying a large language model as a Hugging Face Inference Endpoint, using the Falcon-40B Instruct model as an example.
To start, we first need to create an account on Hugging Face (huggingface.co). All it requires is an email address and a password.
Once that is done, you can log in with the credentials you provided at registration. From the home page, you can search for the model you need and select it to view its details.
On the right-hand side of the model page, there is a "Deploy" button that lists multiple deployment options. Since we are focusing on Inference Endpoints, we will select that option and proceed.
Selecting Inference Endpoints takes you to a page where you can create an endpoint for the chosen model. Several fields need to be filled in.
The next step is to enter the Hugging Face repository ID and a name for the endpoint.
Next, select the cloud provider and region.
Next, define the security level for the endpoint.
Create your endpoint by clicking "Create Endpoint". By default, the endpoint is created with a large GPU instance. The estimated cost assumes the endpoint stays up for an entire month and does not take autoscaling into account.
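The choices from the steps above can be summarized in a small configuration sketch. The repository ID is Falcon-40B Instruct's actual Hugging Face repo; every other value (endpoint name, provider, region, security level) is an illustrative assumption, not output copied from the Hugging Face UI:

```python
# Illustrative summary of the endpoint configuration chosen in the UI.
# Values other than the repository ID are example assumptions.
endpoint_config = {
    "repository": "tiiuae/falcon-40b-instruct",  # Hugging Face repo ID
    "name": "falcon-40b-instruct-demo",          # endpoint name (hypothetical)
    "vendor": "aws",                             # cloud provider
    "region": "us-east-1",                       # deployment region
    "security": "protected",                     # requires an HF token to call
    "accelerator": "gpu",                        # default is a large GPU
}

def describe(config: dict) -> str:
    """Render the configuration as a one-line summary for review."""
    return ", ".join(f"{k}={v}" for k, v in sorted(config.items()))
```

Keeping the configuration in one place like this makes it easy to review the cost-relevant choices (instance type, region, security) before clicking "Create Endpoint".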
We then wait for the endpoint to build, initialize, and run, which may take up to 10 minutes.
Finally, we can test the endpoint by calling the API URL generated at deployment, authenticating with a Hugging Face access token.
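A minimal sketch of that test from Python, using only the standard library. The endpoint URL and token below are placeholders you must replace with the values from your own deployment; the `{"inputs": ...}` payload shape follows the standard text-generation request format:

```python
import json
import urllib.request

def build_request(endpoint_url: str, token: str, prompt: str) -> urllib.request.Request:
    """Build an authenticated POST request for a text-generation endpoint."""
    payload = json.dumps({"inputs": prompt}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",  # Hugging Face access token
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Placeholders: substitute your endpoint URL and access token.
    req = build_request(
        "https://YOUR-ENDPOINT.endpoints.huggingface.cloud",
        "hf_xxx",
        "Explain Inference Endpoints in one sentence.",
    )
    with urllib.request.urlopen(req) as resp:  # performs the network call
        print(json.load(resp))
```

A protected endpoint rejects requests without the `Authorization` header, so sending the token as a Bearer credential is the part most worth double-checking when a test call fails.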
Cheers!! Happy reading!! Keep learning!!
Please upvote if you liked this!! thanks!!
You can connect with me on LinkedIn (Jyoti Dabass, Ph.D) and GitHub (jyotidabass) for more related content. Thanks!!
Source link: https://python.plainenglish.io/deploying-large-language-model-as-hugging-face-inference-points-159b9a3bf4a2?source=rss——large_language_models-5