Deploy a HuggingFace model

Create a scalable serverless endpoint for running inference on your HuggingFace model

HuggingFace (HF) provides a wonderfully simple way to use some of the best models from the open-source ML sphere. In this guide we'll look at uploading an HF pipeline and an HF model to demonstrate how almost any of the ~100,000 models available on HuggingFace can be quickly deployed to a serverless inference endpoint via Pipeline Cloud. Buzzwords out of the way, let's go.

📘

Install transformers or diffusers by running pip install transformers or pip install diffusers

Currently we support models from the transformers and diffusers libraries, and since they both use a very similar design, the below code works for either!

NOTE: This is a walkthrough, so many of the below code snippets are mere chunks of a larger script. If you're skimming or just want to see code, then skip to the conclusion where you'll find the complete script.

Getting started with HuggingFace transformers

Once you've installed transformers (or diffusers – in this tutorial we'll use the former but both will work), it's really simple to initialise a model and start running inference on it.

We'll use bert-base-uncased in this guide as it's one of the most popular models on the platform. It's a fill-mask model, which means it will take in an input sentence like 'a [MASK] is worth a thousand words' and it will output tokens like 'picture', such that the output 'fills in' the input sentence in an accurate way (therefore creating the completed sentence 'a picture is worth a thousand words').

Using a pipeline from transformers

🚧

A HuggingFace pipeline is not the same as a pipeline-ai pipeline

Just a quick note on terminology: both HuggingFace and Pipeline.ai use the same word 'pipeline' to mean 'a set of processing steps which convert an input to an output'. But the actual underlying representations in code are very different. Later in this guide we're going to embed a HuggingFace 'pipeline' within a pipeline-ai 'pipeline'.

HuggingFace makes it so easy to get started with bert-base-uncased:

from transformers import pipeline as hf_pipeline  # alias to make the code clearer

unmasker = hf_pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("a [MASK] is worth a thousand words")

print(result)

And when we run this Python script, we see, printed out in the terminal, a list of 5 contender tokens to replace the [MASK] token. Surprisingly, none of them are 'picture' but now isn't the time for assessing model quality...

[
  {'score': 0.07544363290071487, 'token': 2158, 'token_str': 'man', 'sequence': 'a man is worth a thousand words'},
  {'score': 0.036318931728601456, 'token': 2166, 'token_str': 'life', 'sequence': 'a life is worth a thousand words'},
  {'score': 0.03240768611431122, 'token': 2450, 'token_str': 'woman', 'sequence': 'a woman is worth a thousand words'},
  {'score': 0.02168961986899376, 'token': 2611, 'token_str': 'girl', 'sequence': 'a girl is worth a thousand words'},
  {'score': 0.018244296312332153, 'token': 2773, 'token_str': 'word', 'sequence': 'a word is worth a thousand words'}
]

What just happened here? We instantiated a fill-mask HF pipeline, set the model to bert-base-uncased, and just by passing a string containing a mask to that class's call function we made a prediction. Internally, the HF pipeline assembles the model on CPU, downloads the bert-base-uncased weights, and then loads them into the model.
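Since the result is an ordinary Python list of dictionaries, post-processing it needs no extra tooling. As a small sketch (the scores are copied, rounded, from the output above, and the 0.03 cut-off is an arbitrary value chosen purely for illustration), here is one way to pick the top candidate and drop the weak ones:

```python
# Candidates copied from the printed output above (scores rounded, token ids omitted).
result = [
    {"score": 0.0754, "token_str": "man", "sequence": "a man is worth a thousand words"},
    {"score": 0.0363, "token_str": "life", "sequence": "a life is worth a thousand words"},
    {"score": 0.0324, "token_str": "woman", "sequence": "a woman is worth a thousand words"},
    {"score": 0.0217, "token_str": "girl", "sequence": "a girl is worth a thousand words"},
    {"score": 0.0182, "token_str": "word", "sequence": "a word is worth a thousand words"},
]

# The printed list is already sorted by score, but max() makes the intent explicit.
best = max(result, key=lambda candidate: candidate["score"])

# Keep only candidates above an illustrative threshold (0.03 is an assumption).
MIN_SCORE = 0.03
confident_tokens = [c["token_str"] for c in result if c["score"] >= MIN_SCORE]

print(best["sequence"])
print(confident_tokens)
```

In a real script you would apply the same two lines directly to the list returned by unmasker(...).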

If you have a GPU attached, you can ensure the prediction takes place on your GPU instead, by passing the device keyword argument (see line 6):

from transformers import pipeline as hf_pipeline # alias to make the code clearer
import torch

gpu = torch.device("cuda")

unmasker = hf_pipeline("fill-mask", model="bert-base-uncased", device=gpu)
...

Using a model from transformers

The HF pipeline class is a convenience tool for beginners. Let's see how to recreate its functionality and get better control over the underlying bert-base-uncased model. FYI, these code snippets are descended from the tutorial found on the HuggingFace website.

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")

text = "a [MASK] is worth a thousand words"
encoded_input = tokenizer(text, return_tensors='pt')

output = model(**encoded_input)

print(output)

Structurally this is very similar to the HF pipeline example shown above: we instantiate the model and make a prediction. The only difference now is we've removed the initial layer of abstraction to see what's going on underneath.

First, we instantiate the tokeniser and the model from the bert-base-uncased weights (which get downloaded from HuggingFace). Then we tokenise the user's input (containing a masked token) and get PyTorch tensors in return (that's the pt on line 7). Finally we pass this encoded representation of the input to the model's call function. This is the procedure for CPU inference; now let's consider how to run this on GPU.

from transformers import BertTokenizer, BertModel
import torch

gpu = torch.device("cuda")

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # tokenizers run on CPU and have no .to()
model = BertModel.from_pretrained("bert-base-uncased").to(gpu)

text = "a [MASK] is worth a thousand words"
encoded_input = tokenizer(text, return_tensors='pt').to(gpu)

output = model(**encoded_input)

print(output)

Here we're taking advantage of the lovely .to() interface provided by PyTorch to send models, inputs, and other data to the GPU. Notice how the model and the encoded input must both be sent to the GPU, otherwise you'll receive an error about tensors being on the wrong device.
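The snippets in this guide hard-code torch.device("cuda"), which only works when a GPU is present. A possible way to make the same script usable on a CPU-only machine is to choose the device name at runtime. This is a sketch rather than part of the original walkthrough: pick_device_name is a hypothetical helper, and the find_spec guard is only there so the snippet also runs where PyTorch is not installed.

```python
import importlib.util


def pick_device_name() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed here, so fall back to the CPU name.
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device_name = pick_device_name()
print(device_name)
```

You could then build the device with torch.device(device_name) and pass it to .to() exactly as gpu is passed above.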

Building a pipeline around the transformers model

Now that we have the model in a purer form, it's time to re-wrap it in a pipeline, but this time we're going to use the pipeline-ai library to package the load and inference steps into one deployable API. First, install the library by running pip install pipeline-ai. Its PyPI name is pipeline-ai, although when we import modules from it, we'll write from pipeline import X.

Let's start by making a pipeline 'blueprint'. A blueprint is essentially the set of instructions for what should happen when a request is made to your inference endpoint. We need to specify the incoming data, what happens to the data, and then the outputs. To build the blueprint you must instantiate a Pipeline and use a context manager, as below.

from pipeline import Pipeline, Variable

with Pipeline("HuggingFace demo") as pipeline_blueprint:
  masked_sentence = Variable(str, is_input=True)

  pipeline_blueprint.add_variable(masked_sentence)

  model = CustomBertModel()
  model.load()

  output = model.predict(
    masked_sentence
  )

  pipeline_blueprint.output(output)

On lines 4 and 6, we tell the blueprint to expect an input string. We instantiate the model on line 8 and load it on line 9 (more on this in the next section). We then call our model's predict function to do the actual inference step, and finally we notify the blueprint of the output variable.

📘

Why is the syntax so strict?

If you're unfamiliar with building computational graphs, this syntax can be a bit alien and tricky to parse. The point is to create a deterministic flow from input/s to output/s so that Pipeline Cloud servers can find optimisations and handle scaling correctly. In the end you'll achieve better performance.

You can add as many inputs and as many outputs to the pipeline as you like, so as your model grows, you can introduce a host of different arguments, data points, and return values. One thing you can't do within a blueprint, however, is use a 'raw' runtime value such as 42 or True. All runtime values should either be within a Variable or further down within the model class.
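To make the idea of a blueprint more concrete, here is a toy sketch of a deferred computational graph in plain Python. It is not how pipeline-ai is implemented; ToyVariable and ToyBlueprint are hypothetical names invented for this illustration. Steps are recorded once at definition time, then replayed for every request with real values substituted for the placeholders.

```python
class ToyVariable:
    """Placeholder for a value that only exists when a request arrives."""

    def __init__(self, name: str):
        self.name = name


class ToyBlueprint:
    """Records steps at definition time and replays them per request."""

    def __init__(self):
        # Each step is (function, input placeholder, output placeholder).
        self.steps = []

    def call(self, function, argument: ToyVariable) -> ToyVariable:
        output = ToyVariable(f"step_{len(self.steps)}")
        self.steps.append((function, argument, output))
        return output  # nothing has been computed yet

    def run(self, inputs: dict, output: ToyVariable):
        values = dict(inputs)
        for function, argument, step_output in self.steps:
            values[step_output.name] = function(values[argument.name])
        return values[output.name]


blueprint = ToyBlueprint()
sentence = ToyVariable("sentence")
lowered = blueprint.call(str.lower, sentence)
mask_count = blueprint.call(lambda text: text.count("[mask]"), lowered)

# The same recorded graph serves two different "requests".
first = blueprint.run({"sentence": "a [MASK] is worth a thousand words"}, mask_count)
second = blueprint.run({"sentence": "[MASK] [MASK]"}, mask_count)
print(first, second)
```

The toy also hints at why a raw value such as 42 does not fit inside a blueprint: the graph can only look values up by placeholder, so everything that flows through it has to be wrapped.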

Creating the core pipeline_model

So inside the blueprint we've instantiated a CustomBertModel class. Now it's time to build that, using the transformers code we wrote earlier. Here's the class if you're using the HF pipeline:

from transformers import pipeline as hf_pipeline
from pipeline import pipeline_model, pipeline_function
import torch

@pipeline_model
class CustomBertModel:
  bert = None
  gpu = torch.device("cuda")

  @pipeline_function
  def load(self) -> bool:
    # HF pipelines take the device as a keyword argument; they have no .to() method
    self.bert = hf_pipeline("fill-mask", model="bert-base-uncased", device=self.gpu)
    return True

  @pipeline_function
  def predict(self, input: str) -> list:
    candidate_tokens = self.bert(input)
    return candidate_tokens

And here's the class if you're using the HF model:

from transformers import BertTokenizer, BertModel
from pipeline import pipeline_model, pipeline_function
import torch

@pipeline_model
class CustomBertModel:
  bert_tokenizer = None
  bert_model = None
  gpu = torch.device("cuda")

  @pipeline_function
  def load(self) -> bool:
    self.bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    self.bert_model = BertModel.from_pretrained("bert-base-uncased").to(self.gpu)
    return True

  @pipeline_function
  def predict(self, input: str) -> list:
    # The encoded input must live on the same device as the model
    encoded_input = self.bert_tokenizer(input, return_tensors="pt").to(self.gpu)
    last_hidden_states = self.bert_model(**encoded_input).last_hidden_state
    return last_hidden_states

Most of the code here should be familiar, but let's review how it's changed. Firstly we've made two functions: load which handles the instantiation of the model, and sending it to GPU; and predict which handles the tokenisation and prediction of the user's input. Two functions – one for loading and one for inference – is a very common pattern in pipelines deployed to Pipeline Cloud.

We also made the CustomBertModel class which hosts these two functions. Because we need to access the loaded model in the predict function (and we want the two functions to be separate), we use the properties of this class to store the loaded modules. You'll also see some type annotations: these are sometimes necessary for the pipeline-ai library to be able to read your computational graph.

📘

We return the hidden states when using BertModel

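You may want to do some processing on the outputs of BertModel in order to convert the hidden states to actual text tokens. BertModel has no language-modelling head, so this means projecting the hidden states onto the vocabulary (transformers' BertForMaskedLM bundles such a head) and then taking an argmax at the masked position. Because Pipeline Cloud handles any arbitrary compute, any steps like this are totally compatible with the rest of the pipeline. They are just beyond the scope of this tutorial, so they won't be discussed here.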

The final thing to comment on is the use of decorators. Any function called within a pipeline blueprint must be decorated as a @pipeline_function. Since we're calling both load and predict from within the blueprint, both of them are tagged as being pipeline functions. The same is true for models which get instantiated within a pipeline blueprint: these must be tagged @pipeline_model.

These decorators instruct their respective functions to convert references (such as the Variable object) to their actual runtime values so that the underlying transformers inference code can access the user's inputs. Otherwise, it would try to operate on a Variable object rather than its raw value (which is a str).
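The unwrapping behaviour described above can be imitated with an ordinary Python decorator. The sketch below is a simplified illustration, not pipeline-ai's actual code: ToyReference and resolves_references are hypothetical names, and the real library defers execution rather than unwrapping eagerly as this toy does.

```python
import functools


class ToyReference:
    """Stand-in for a graph variable that wraps a runtime value."""

    def __init__(self, value):
        self.value = value


def resolves_references(function):
    """Unwrap ToyReference arguments before calling the wrapped function."""

    def unwrap(argument):
        return argument.value if isinstance(argument, ToyReference) else argument

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        plain_args = [unwrap(a) for a in args]
        plain_kwargs = {key: unwrap(v) for key, v in kwargs.items()}
        return function(*plain_args, **plain_kwargs)

    return wrapper


@resolves_references
def count_masks(sentence: str) -> int:
    # The function body only ever sees a plain str.
    return sentence.count("[MASK]")


masked = ToyReference("a [MASK] is worth a thousand words")
print(count_masks(masked))
```

Without the decorator, count_masks(masked) would fail because a ToyReference has no count method, which mirrors the problem of operating on a Variable instead of its raw value.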

Set the load function to run only on startup

Remember how every request to your pipeline's endpoint will follow the blueprint from top to bottom? If that were to happen now, the model.load() function would be called on every single request. One of the great features of a platform like Pipeline Cloud is that it can cache your models on GPU so that you don't have to experience cold starts on every request. If we repeatedly called load then we would be throwing away time with pointless loading.

📘

Your model stays cached until another replaces it

Pipeline Cloud automatically stores your model in GPU cache after it has been loaded, so that future inference requests can skip the cold start. It will remain cached until another pipeline 'kicks it off' so that the platform can serve all users fairly. However, if your traffic is regular and sufficiently high-volume then your pipeline will remain ~permanently cached while you only pay for inference time.

Thus we need to tell the blueprint to only call the load method once when the pipeline loads, and not again for the duration of the pipeline's time within GPU cache. Fortunately, there's a really easy way to do exactly that, and unlock all the performance benefits that it entails. Just tag the pipeline_function decorator on the load method with the following two arguments:

...
@pipeline_function(run_once=True, on_startup=True)
def load(self) -> bool:
  ...

Now, even though we call model.load() within the pipeline blueprint, we can be sure it will only run when the pipeline caches, and not again. Inference should be even faster as a result!
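To get an intuition for what a run-once flag buys you, here is a toy decorator that executes a function on its first call and reuses the result afterwards. This is only an analogy under simplifying assumptions: toy_run_once and ToyModel are hypothetical names, and Pipeline Cloud's caching happens on its servers rather than in a Python closure.

```python
import functools


def toy_run_once(function):
    """Call the function on first use, then return the stored result."""
    not_run = object()  # sentinel, so a cached None or False still counts as "run"
    cached = not_run

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        nonlocal cached
        if cached is not_run:
            cached = function(*args, **kwargs)
        return cached

    return wrapper


class ToyModel:
    load_calls = 0  # counts how often the expensive body really executes

    @toy_run_once
    def load(self) -> bool:
        ToyModel.load_calls += 1
        return True


model = ToyModel()
results = [model.load() for _ in range(3)]  # three "requests"
print(results, ToyModel.load_calls)
```

Note that this toy caches per function rather than per instance, which is enough to show the effect: three requests, one load.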

Running the model locally

As we've seen, pipeline-ai is a library for building a computational flow. It can also be used locally to handle execution of the pipeline, called a 'run'. So, a great way of debugging your pipeline before uploading it to Pipeline Cloud is to run it locally!

Of course, if you don't have a GPU attached then in some cases local runs will be too slow to be practical.

huggingface_pipeline = Pipeline.get_pipeline("HuggingFace demo")

example_input = "a [MASK] is worth a thousand words"
result = huggingface_pipeline.run(example_input)

print(result)

First we 'get' the pipeline by using the name which we set when defining the pipeline blueprint. Then, very simply, we call the .run() method on the pipeline object, passing in our input. Finally we print the result, so in the terminal we see this:

[tensor([[[ 0.0936,  0.2309,  0.0579,  ..., -0.0105,  0.0384, -0.0012],
         [-0.1750,  0.2057,  0.0515,  ..., -0.3433,  0.2265,  0.5729],
         [-0.1380, -0.0084, -0.0826,  ..., -0.2994,  0.4941, -0.1115],
         ...,
         [ 0.8105, -0.0103,  0.0128,  ..., -0.4513, -0.3229, -0.3055],
         [-0.5017, -0.1092, -0.2608,  ...,  0.5027,  0.2166, -0.0970],
         [ 1.1945,  0.4546, -0.1763,  ..., -0.1278, -0.8650, -0.2725]]],
       grad_fn=<NativeLayerNormBackward0>)]

Because a pipeline can have multiple outputs, the returned result is a list – even though we only have one output in this case. Great, we've just validated that the pipeline works by running it locally!

Running the model on Pipeline Cloud

Before we can run the model on Pipeline Cloud, we need to upload it to the servers. Again we 'get' the pipeline, before instantiating a connection to Pipeline Cloud and uploading our pipeline. If you don't yet have a token, make sure to create one in the dashboard.

from pipeline import PipelineCloud

huggingface_pipeline = Pipeline.get_pipeline("HuggingFace demo")

api = PipelineCloud(token="PIPELINE_API_TOKEN")
uploaded_pipeline = api.upload_pipeline(huggingface_pipeline)

print(f"Uploaded pipeline id: {uploaded_pipeline.id}")

🚧

You can't modify uploaded pipelines

Once a pipeline has been uploaded to Pipeline Cloud, it's considered immutable, which is to say it can't be updated or modified in any way, even if it's a buggy pipeline. This means you have to upload a new pipeline every time you make a change.

During this stage, the pipeline-ai library will parse (serialise) all your code so that it's formatted as a JSON object. The library then POSTs your pipeline to the endpoint for creating pipelines, because underneath pipeline-ai is just an SDK around our HTTP API.

And now we run the pipeline using a slightly different syntax to the earlier local run. Internally this is a POST request to the /v2/runs endpoint, so if you're building an app in a different language you don't need to worry about dropping the pipeline-ai library.

example_input = "a [MASK] is worth a thousand words"

run = api.run_pipeline(
  uploaded_pipeline.id,
  [example_input],
)

print(f"Run id: {run.id}")
print(run.result_preview)

Notice how the inputs to the .run_pipeline() method are kept within a list. Additionally, the return of run_pipeline() is a RunGet object, not the raw result. So we need to access the result_preview property in order to see the actual output of our pipeline. What happens if we run this?

Well, as it stands, this pipeline won't work, because tensors are not JSON-serialisable. Remember when we ran the pipeline locally, it returned a tensor of the last hidden states. We need to convert this tensor to a JSON-serialisable format such as a list so that the API can output it. Fortunately there's a helpful utility function in pipeline-ai which does exactly that.

from pipeline.util.torch_utils import tensor_to_list

with Pipeline("HuggingFace demo") as pipeline_blueprint:
  ...

  raw_output = model.predict(masked_sentence)
  output = tensor_to_list(raw_output)

  pipeline_blueprint.output(output)

We also need to change the pipeline blueprint, as shown, so that it converts the output tensor into a list, in order that we receive our output correctly. Now let's upload and run the pipeline, and see what happens.

[[[[0.09360659867525101, 0.23093537986278534, 0.05791196972131729...

Whoo! A very long list of numbers (aka the hidden states tensor) is returned. We submitted a run to Pipeline Cloud, it executed it on GPU, and then the server returned the result. To repeat, we assembled an API endpoint for serverless inference, and submitted a run to it. And we only get billed for the compute time!
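To see why the conversion step was needed, the failure is easy to reproduce without a GPU or even PyTorch. In the sketch below, FakeTensor is a hypothetical stand-in that mimics only the tolist() method of a real tensor, and to_jsonable is an illustrative helper, not the library's tensor_to_list.

```python
import json


class FakeTensor:
    """Stand-in for a tensor: holds rows of floats and exposes tolist()."""

    def __init__(self, rows):
        self._rows = rows

    def tolist(self):
        return [list(row) for row in self._rows]


def to_jsonable(value):
    """Recursively replace tensor-like objects with plain nested lists."""
    if hasattr(value, "tolist"):
        return value.tolist()
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


hidden_states = FakeTensor([[0.09, 0.23], [-0.17, 0.20]])

# json cannot encode arbitrary objects, so the raw pipeline output fails.
try:
    json.dumps([hidden_states])
    raw_output_serialised = True
except TypeError:
    raw_output_serialised = False

# After conversion, the same data encodes cleanly.
encoded = json.dumps(to_jsonable([hidden_states]))
print(raw_output_serialised, encoded)
```

The outer list mirrors the fact that a pipeline returns a list of outputs, which is why the result shown above has one more level of nesting than the tensor itself.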

Speed up model loading

Previously we used the run_once flags on our load function in order to instruct Pipeline Cloud servers to only run that function once, when it saves the model into GPU cache. There are some other things we can set to make remote loading of the model even faster.

Fixing the minimum GPU demand

First, we can explicitly tell Pipeline Cloud how much GPU VRAM the model requires. By setting a min_gpu_vram_mb value, we skip the stage during model loading when Pipeline Cloud identifies what size GPU to employ.

...
with Pipeline("HuggingFace demo", min_gpu_vram_mb=1605) as pipeline_blueprint:
...

Don't worry if you don't know how much VRAM your model takes up (while doing inference): this kwarg is completely optional. To figure out the number, you can run the model locally and in another terminal tab run watch -n1 nvidia-smi. Note down the highest amount of VRAM that your process occupies.
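If you would rather not watch the terminal by hand, the same measurement can be scripted. The sketch below is one possible approach, not an official tool: the helper names are hypothetical, the canned readings are made up for illustration, and nvidia-smi reports memory used on the whole GPU (in MiB), so other processes on the card will inflate the number.

```python
import subprocess

# Real nvidia-smi flags: print only the used memory, in MiB, one line per GPU.
SMI_COMMAND = [
    "nvidia-smi",
    "--query-gpu=memory.used",
    "--format=csv,noheader,nounits",
]


def parse_used_mib(smi_output: str) -> list:
    """Turn nvidia-smi's output into one integer (MiB) per GPU."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def read_used_mib() -> list:
    """Query the driver once; needs an NVIDIA GPU and nvidia-smi on PATH."""
    completed = subprocess.run(
        SMI_COMMAND, capture_output=True, text=True, check=True
    )
    return parse_used_mib(completed.stdout)


def peak_used_mib(samples: list, gpu_index: int = 0) -> int:
    """Highest reading seen for one GPU across all samples."""
    return max(sample[gpu_index] for sample in samples)


# Made-up readings standing in for three polls taken during inference.
canned_samples = [parse_used_mib(text) for text in ("1201\n", "1605\n", "1430\n")]
peak = peak_used_mib(canned_samples)
print(peak)
```

On a machine with a GPU you would call read_used_mib() in a loop while the model runs, collect the samples, and use the peak as a starting point for min_gpu_vram_mb.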

Uploading weights as PipelineFiles

We can also speed up how the model weights are fetched: if we upload them as PipelineFiles, the Pipeline Cloud servers can take advantage of a rapid local cache to make cold starts even shorter. This is quite an advanced process, so only follow the steps below if you're comfortable with HuggingFace's libraries and your local filesystem.

We'll start by saving the HuggingFace model to an obvious location locally.

from transformers import BertModel

bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_model.save_pretrained("saved-bert-base-uncased-model")

Using save_pretrained() we can store the model's config.json (config) and pytorch_model.bin (weights) locally. Next we modify the pipeline blueprint to include two new PipelineFiles (for these two files).

with Pipeline("HuggingFace demo") as pipeline_blueprint:
  masked_sentence = Variable(str, is_input=True)
  bert_model_config = PipelineFile(path="saved-bert-base-uncased-model/config.json")
  bert_model_weights = PipelineFile(path="saved-bert-base-uncased-model/pytorch_model.bin")

  pipeline_blueprint.add_variables(
    masked_sentence, bert_model_config, bert_model_weights
  )

  model = CustomBertModel()
  model.load(bert_model_config, bert_model_weights)

  raw_output = model.predict(masked_sentence)
  output = tensor_to_list(raw_output)

  pipeline_blueprint.output(output)

We must pass each PipelineFile to our CustomBertModel class's load method, because we're going to reference them inside that function, as below:

import torch

...
@pipeline_function(run_once=True, on_startup=True)
def load(
  self, bert_model_config: PipelineFile, bert_model_weights: PipelineFile
) -> bool:
  self.bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  self.bert_model = BertModel.from_pretrained(
    None,
    state_dict=torch.load(bert_model_weights.path),
    config=bert_model_config.path,
  ).to(self.gpu)
  return True
...

The important lines to look at here are lines 9-13, where we pass in None for the model identifier, instead loading the state_dict and config manually. To access the 'path' of a PipelineFile, use the .path property, as shown above.

It's okay to pass just a path to the config kwarg, but the state_dict must be an actual loaded dictionary of tensors; so we're using torch.load() to import this from the PipelineFile. The tokenizer is so small that using PipelineFile probably won't accelerate loading by a noticeable amount, so for this tutorial we'll skip it.

Conclusion

Although bert-base-uncased is a simple model, the above pattern can be used to deploy pretty much any model available from transformers or diffusers!

Complete script

from transformers import BertTokenizer, BertModel
from pipeline import pipeline_model, pipeline_function, PipelineFile, Pipeline, Variable, PipelineCloud
import torch
from pipeline.util.torch_utils import tensor_to_list


@pipeline_model
class CustomBertModel:
  bert_tokenizer = None
  bert_model = None
  gpu = torch.device("cuda")

  @pipeline_function(run_once=True, on_startup=True)
  def load(
    self, bert_model_config: PipelineFile, bert_model_weights: PipelineFile
  ) -> bool:
    self.bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    self.bert_model = BertModel.from_pretrained(
      None,
      state_dict=torch.load(bert_model_weights.path),
      config=bert_model_config.path,
    ).to(self.gpu)
    return True

  @pipeline_function
  def predict(self, input: str) -> list:
    encoded_input = self.bert_tokenizer(input, return_tensors="pt").to(self.gpu)
    last_hidden_states = self.bert_model(**encoded_input).last_hidden_state
    return last_hidden_states

with Pipeline("HuggingFace demo") as pipeline_blueprint:
  masked_sentence = Variable(str, is_input=True)
  bert_model_config = PipelineFile(path="saved-bert-base-uncased-model/config.json")
  bert_model_weights = PipelineFile(
    path="saved-bert-base-uncased-model/pytorch_model.bin"
  )

  pipeline_blueprint.add_variables(
    masked_sentence, bert_model_config, bert_model_weights
  )

  model = CustomBertModel()
  model.load(bert_model_config, bert_model_weights)

  raw_output = model.predict(masked_sentence)
  output = tensor_to_list(raw_output)

  pipeline_blueprint.output(output)


huggingface_pipeline = Pipeline.get_pipeline("HuggingFace demo")
api = PipelineCloud(token="PIPELINE_API_TOKEN")
uploaded_pipeline = api.upload_pipeline(huggingface_pipeline)

print(f"Uploaded pipeline id: {uploaded_pipeline.id}")

example_input = "a [MASK] is worth a thousand words"
run = api.run_pipeline(
  uploaded_pipeline.id,
  [example_input],
)

print(f"Run id: {run.id}")
print(run.result_preview)