Deploy a HuggingFace model
Create a scalable serverless endpoint for running inference on your HuggingFace model
HuggingFace (HF) provides a wonderfully simple way to use some of the best models from the open-source ML sphere. In this guide we'll look at uploading an HF pipeline and an HF model to demonstrate how almost any of the ~100,000 models available on HuggingFace can be quickly deployed to a serverless inference endpoint via Pipeline Cloud. Buzzwords out of the way, let's go.
Install transformers or diffusers by running pip install transformers or pip install diffusers
Currently we support models from the transformers and diffusers libraries, and since they both use a very similar design, the code below works for either!
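The snippets in this guide also import torch directly, so make sure PyTorch is installed in your environment too:
pip install torch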
NOTE: This is a walkthrough, so many of the below code snippets are mere chunks of a larger script. If you're skimming or just want to see code, then skip to the conclusion where you'll find the complete script.
Getting started with HuggingFace transformers
Once you've installed transformers (or diffusers; in this tutorial we'll use the former, but both will work), it's really simple to initialise a model and start running inference on it.
We'll use bert-base-uncased in this guide as it's one of the most popular models on the platform. It's a fill-mask model, which means it will take in an input sentence like 'a [MASK] is worth a thousand words' and output tokens like 'picture', such that the output 'fills in' the input sentence in an accurate way (creating the completed sentence 'a picture is worth a thousand words').
Using a pipeline from transformers
A HuggingFace pipeline is not the same as a pipeline-ai pipeline
Just a quick note on terminology: both HuggingFace and Pipeline.ai use the same word 'pipeline' to mean 'a set of processing steps which convert an input to an output'. But the actual underlying representations in code are very different. Later in this guide we're going to embed a HuggingFace 'pipeline' within a pipeline-ai 'pipeline'.
HuggingFace makes it so easy to get started with bert-base-uncased:
from transformers import pipeline as hf_pipeline # alias to make the code clearer
unmasker = hf_pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("a [MASK] is worth a thousand words")
print(result)
And when we run this Python script, we see, printed in the terminal, a list of 5 contender tokens to replace the [MASK] token. Surprisingly, none of them are 'picture', but now isn't the time for assessing model quality...
[
{'score': 0.07544363290071487, 'token': 2158, 'token_str': 'man', 'sequence': 'a man is worth a thousand words'},
{'score': 0.036318931728601456, 'token': 2166, 'token_str': 'life', 'sequence': 'a life is worth a thousand words'},
{'score': 0.03240768611431122, 'token': 2450, 'token_str': 'woman', 'sequence': 'a woman is worth a thousand words'},
{'score': 0.02168961986899376, 'token': 2611, 'token_str': 'girl', 'sequence': 'a girl is worth a thousand words'},
{'score': 0.018244296312332153, 'token': 2773, 'token_str': 'word', 'sequence': 'a word is worth a thousand words'}
]
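As an aside from us (not in the original walkthrough): the fill-mask pipeline returns 5 candidates by default, and you can ask for a different number by passing top_k when calling it, reusing the unmasker from the snippet above:
result = unmasker("a [MASK] is worth a thousand words", top_k=10)  # 10 candidate tokens instead of the default 5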
What just happened here? We instantiated a fill-mask HF pipeline, set the model to bert-base-uncased, and just by passing a string containing a mask to that class's call function we made a prediction. Internally, the HF pipeline assembles the model on CPU, downloads the bert-base-uncased weights, and then loads them into the model.
If you have a GPU attached, you can ensure the prediction takes place on your GPU instead by passing the device keyword argument:
from transformers import pipeline as hf_pipeline # alias to make the code clearer
import torch
gpu = torch.device("cuda")
unmasker = hf_pipeline("fill-mask", model="bert-base-uncased", device=gpu)
...
Using a model from transformers
The HF pipeline class is a convenience tool for beginners. Let's see how to recreate its functionality and get better control over the underlying bert-base-uncased model. FYI, these code snippets are adapted from the tutorial found on the HuggingFace website.
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "a [MASK] is worth a thousand words"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
Structurally this is very similar to the HF pipeline example shown above: we instantiate the model and make a prediction. The only difference now is we've removed the initial layer of abstraction to see what's going on underneath.
First, we instantiate the tokeniser and the model from the bert-base-uncased weights (which get downloaded from HuggingFace). Then we tokenize the user's input (containing a masked token) and get PyTorch tensors in return (that's the return_tensors='pt' argument). Finally we pass this encoded representation of the input to the model's call function. This is the procedure for CPU inference; now let's consider how to run this on GPU.
from transformers import BertTokenizer, BertModel
import torch

gpu = torch.device("cuda")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # tokenizers run on CPU and have no .to() method
model = BertModel.from_pretrained("bert-base-uncased").to(gpu)

text = "a [MASK] is worth a thousand words"
encoded_input = tokenizer(text, return_tensors="pt").to(gpu)
output = model(**encoded_input)
print(output)
Here we're taking advantage of the lovely .to() interface provided by PyTorch to send models, inputs, and other data to the GPU. Notice how the model and the encoded input must both be sent to the GPU, otherwise you'll receive an error about tensors being on the wrong device.
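If you're following along on a machine that may or may not have a GPU, a common pattern (our suggestion, not from the original guide) is to pick the device at runtime so the same script works either way:
import torch
from transformers import BertTokenizer, BertModel

# Use the GPU when one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(device)

encoded_input = tokenizer("a [MASK] is worth a thousand words", return_tensors="pt").to(device)
output = model(**encoded_input)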
Building a pipeline around the transformers model
Now that we have the model in a purer form, it's time to re-wrap it in a pipeline, but this time we're going to use the pipeline-ai library to package the load and inference steps into one deployable API. First, install the library by running pip install pipeline-ai. Its PyPI name is pipeline-ai, although when we import modules from it, we'll write from pipeline import X.
Let's start by making a pipeline 'blueprint'. A blueprint is essentially the set of instructions for what should happen when a request is made to your inference endpoint. We need to specify the incoming data, what happens to the data, and then the outputs. To build the blueprint you must instantiate a Pipeline and use a context manager, as below.
from pipeline import Pipeline, Variable

with Pipeline("huggingface-demo") as pipeline_blueprint:
    masked_sentence = Variable(str, is_input=True)
    pipeline_blueprint.add_variable(masked_sentence)

    model = CustomBertModel()
    model.load()

    output = model.predict(
        masked_sentence
    )

    pipeline_blueprint.output(output)
First, we tell the blueprint to expect an input string and register it as a variable. We then instantiate the model and call its load function (more on this in the next section). We call the model's predict function to do the actual inference step, and finally we notify the blueprint of the output variable.
Why is the syntax so strict?
If you're unfamiliar with building computational graphs, this syntax can be a bit alien and tricky to parse. The point is to create a deterministic flow from input(s) to output(s) so that Pipeline Cloud servers can find optimisations and handle scaling correctly. In the end you'll achieve better performance.
You can add as many inputs and as many outputs to the pipeline as you like, so as your model grows, you can introduce a host of different arguments, data points, and return values. One thing you can't do within a blueprint, however, is use a 'raw' runtime value such as 42 or True. All runtime values should either be within a Variable or further down within the model class.
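Purely for illustration, here's a rough sketch of what a two-input blueprint could look like, reusing the same constructs shown above. The second Variable and the two-argument predict are hypothetical additions of ours, not part of this tutorial's model:
from pipeline import Pipeline, Variable

# Hypothetical sketch only: a second input variable and a two-argument predict
with Pipeline("huggingface-demo-multi-input") as blueprint:
    masked_sentence = Variable(str, is_input=True)
    top_k = Variable(int, is_input=True)
    blueprint.add_variable(masked_sentence)
    blueprint.add_variable(top_k)

    model = CustomBertModel()
    model.load()
    output = model.predict(masked_sentence, top_k)

    blueprint.output(output)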
Creating the core pipeline_model
So inside the blueprint we've instantiated a CustomBertModel class. Now it's time to build that, using the transformers code we wrote earlier. Here's the class if you're using the HF pipeline:
from transformers import pipeline as hf_pipeline
from pipeline import pipeline_model, pipeline_function
import torch

@pipeline_model
class CustomBertModel:
    bert = None
    gpu = torch.device("cuda")

    @pipeline_function
    def load(self) -> bool:
        # HF pipelines take the target device as a keyword argument rather than a .to() call
        self.bert = hf_pipeline("fill-mask", model="bert-base-uncased", device=self.gpu)
        return True

    @pipeline_function
    def predict(self, input: str) -> list:
        candidate_tokens = self.bert(input)
        return candidate_tokens
And here's the class if you're using the HF model:
from transformers import BertTokenizer, BertModel
from pipeline import pipeline_model, pipeline_function
import torch

@pipeline_model
class CustomBertModel:
    bert_tokenizer = None
    bert_model = None
    gpu = torch.device("cuda")

    @pipeline_function
    def load(self) -> bool:
        self.bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert_model = BertModel.from_pretrained("bert-base-uncased").to(self.gpu)
        return True

    @pipeline_function
    def predict(self, input: str) -> list:
        encoded_input = self.bert_tokenizer(input, return_tensors="pt").to(self.gpu)
        last_hidden_states = self.bert_model(**encoded_input).last_hidden_state
        return last_hidden_states
Most of the code here should be familiar, but let's review how it's changed. Firstly, we've made two functions: load, which handles instantiating the model and sending it to the GPU; and predict, which handles the tokenisation and prediction of the user's input. Two functions, one for loading and one for inference, is a very common pattern in pipelines deployed to Pipeline Cloud.
We also made the CustomBertModel class which hosts these two functions. Because we need to access the loaded model in the predict function (and we want the two functions to be separate), we use the properties of this class to store the loaded modules. You'll also see some type annotations: these are sometimes necessary for the pipeline-ai library to be able to read your computational graph.
We return the hidden states when using BertModel
You may want to do some processing (such as argmax) on the outputs of BertModel in order to convert the hidden states to actual text tokens. Because Pipeline Cloud handles any arbitrary compute, any steps like this are totally compatible with the rest of the pipeline. They are just beyond the scope of this tutorial, so they won't be discussed here.
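If you're curious what that post-processing could look like, here's a rough sketch from us (outside the scope of this tutorial) using BertForMaskedLM, whose language-modelling head produces per-token vocabulary logits that you can argmax over:
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

encoded = tokenizer("a [MASK] is worth a thousand words", return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded).logits

# Locate the [MASK] position and take the highest-scoring vocabulary id there
mask_index = (encoded["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))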
The final thing to comment on is the use of decorators. Any function called within a pipeline blueprint must be decorated as a @pipeline_function. Since we're calling both load and predict from within the blueprint, both of them are tagged as being pipeline functions. The same is true for models which get instantiated within a pipeline blueprint: these must be tagged @pipeline_model.
These decorators instruct their respective functions to convert references (such as the Variable object) to their actual runtime values so that the underlying transformers inference code can access the user's inputs. Otherwise, it would try to operate on a Variable object rather than its raw value (which is a str).
Set the load function to run only on startup
Remember how every request to your pipeline's endpoint will follow the blueprint from top to bottom? If that were to happen now, the model.load() function would be called on every single request. One of the great features of a platform like Pipeline Cloud is that it can cache your models on GPU so that you don't have to experience cold starts on every request. If we repeatedly called load then we would be throwing away time with pointless loading.
Your model stays cached until another replaces it
Pipeline Cloud automatically stores your model in GPU cache after it has been loaded, so that future inference requests can skip the cold start. It will remain cached until another pipeline 'kicks it off' so that the platform can serve all users fairly. However, if your traffic is regular and sufficiently high-volume then your pipeline will remain ~permanently cached while you only pay for inference time.
Thus we need to tell the blueprint to only call the load method once when the pipeline loads, and not again for the duration of the pipeline's time within GPU cache. Fortunately, there's a really easy way to do exactly that, and unlock all the performance benefits that it entails. Just tag the pipeline_function decorator on the load method with the following two arguments:
...
@pipeline_function(run_once=True, on_startup=True)
def load(self) -> bool:
    ...
Now, even though we call model.load() within the pipeline blueprint, we can be sure it will only run when the pipeline caches, and not again. Inference should be even faster as a result!
Running the model locally
As we've seen, pipeline-ai
is a library for building a computational flow. It can also be used locally to handle execution of the pipeline, called a 'run'. So, a great way of debugging your pipeline before uploading it to Pipeline Cloud is to run it locally!
Of course, if you don't have a GPU attached then in some cases local runs will be too slow to be practical.
huggingface_pipeline = Pipeline.get_pipeline("huggingface-demo")
example_input = "a [MASK] is worth a thousand words"
result = huggingface_pipeline.run(example_input)
print(result)
First we 'get' the pipeline by using the name which we set when defining the pipeline blueprint. Then, very simply, we call the .run() method on the pipeline object, passing in our input. Finally we print the result, so in the terminal we see this:
[tensor([[[ 0.0936, 0.2309, 0.0579, ..., -0.0105, 0.0384, -0.0012],
[-0.1750, 0.2057, 0.0515, ..., -0.3433, 0.2265, 0.5729],
[-0.1380, -0.0084, -0.0826, ..., -0.2994, 0.4941, -0.1115],
...,
[ 0.8105, -0.0103, 0.0128, ..., -0.4513, -0.3229, -0.3055],
[-0.5017, -0.1092, -0.2608, ..., 0.5027, 0.2166, -0.0970],
[ 1.1945, 0.4546, -0.1763, ..., -0.1278, -0.8650, -0.2725]]],
grad_fn=<NativeLayerNormBackward0>)]
Because a pipeline can have multiple outputs, the returned result is a list, even though we only have one output in this case. Great, we've just validated that the pipeline works by running it locally!
Running the model on Pipeline Cloud
Before we can run the model on Pipeline Cloud, we need to upload it to the servers. Again we 'get' the pipeline, before instantiating a connection to Pipeline Cloud and uploading our pipeline. If you don't yet have a token, make sure to create one in the dashboard.
from pipeline import PipelineCloud
huggingface_pipeline = Pipeline.get_pipeline("huggingface-demo")
api = PipelineCloud(token="PIPELINE_API_TOKEN")
uploaded_pipeline = api.upload_pipeline(huggingface_pipeline)
print(f"Uploaded pipeline id: {uploaded_pipeline.id}")
You can't modify uploaded pipelines
Once a pipeline has been uploaded to Pipeline Cloud, it's considered immutable, which is to say it can't be updated or modified in any way, even if it's a buggy pipeline. This means you have to upload a new pipeline every time you make a change.
During this stage, the pipeline-ai library will parse (serialise) all your code so that it's formatted as a JSON object. The library then POSTs your pipeline to the endpoint for creating pipelines, because underneath, pipeline-ai is just an SDK around our HTTP API.
And now we run the pipeline using a slightly different syntax to the earlier local run. Internally this is a POST request to the /v2/runs endpoint, so if you're building an app in a different language you can drop the pipeline-ai library and call the HTTP API directly.
example_input = "a [MASK] is worth a thousand words"
run = api.run_pipeline(
    uploaded_pipeline.id,
    [example_input],
)
print(f"Run id: {run.id}")
print(run.result_preview)
Notice how the inputs to the .run_pipeline() method are kept within a list. Additionally, the return of run_pipeline() is a RunGet object, not the raw result, so we need to access the result_preview property in order to see the actual output of our pipeline. What happens if we run this?
Well, as it stands, this pipeline won't work, because tensors are not JSON-serialisable. Remember, when we ran the pipeline locally, it returned a tensor of the last hidden states. We need to convert this tensor to a JSON-serialisable format such as a list so that the API can output it. Fortunately there's a helpful utility function in pipeline-ai which does exactly that.
from pipeline.util.torch_utils import tensor_to_list

with Pipeline("huggingface-demo") as pipeline_blueprint:
    ...
    raw_output = model.predict(masked_sentence)
    output = tensor_to_list(raw_output)

    pipeline_blueprint.output(output)
We also need to change the pipeline blueprint, as shown, so that it converts the output tensor into a list, in order that we receive our output correctly. Now let's upload and run the pipeline, and see what happens.
[[[[0.09360659867525101, 0.23093537986278534, 0.05791196972131729...
Whoo! A very long list of numbers (aka the hidden states tensor) is returned. We submitted a run to Pipeline Cloud, it executed it on GPU, and then the server returned the result. To repeat, we assembled an API endpoint for serverless inference, and submitted a run to it. And we only get billed for the compute time!
Speed up model loading
Previously we used the run_once
flags on our load
function in order to instruct Pipeline Cloud servers to only run that function once, when it saves the model into GPU cache. There are some other things we can set to make remote loading of the model even faster.
Fixing the minimum GPU demand
First, we can explicitly tell Pipeline Cloud how much GPU VRAM the model requires. By setting a min_gpu_vram_mb value, we skip the stage during model loading when Pipeline Cloud identifies what size GPU to employ.
...
with Pipeline("huggingface-demo", min_gpu_vram_mb=1605) as pipeline_blueprint:
...
Don't worry if you don't know how much VRAM your model takes up (while doing inference): this kwarg is completely optional. To figure out the number, you can run the model locally and, in another terminal tab, run watch -n1 nvidia-smi. Note down the highest amount of VRAM that your process occupies.
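Alternatively, a rough programmatic estimate (our suggestion, not from the original guide) is to use PyTorch's peak-memory counters. They only count tensors allocated by your process, so treat the result as a lower bound for min_gpu_vram_mb (the CUDA context adds extra overhead on top):
import torch
from transformers import BertTokenizer, BertModel

device = torch.device("cuda")
torch.cuda.reset_peak_memory_stats(device)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(device)
encoded = tokenizer("a [MASK] is worth a thousand words", return_tensors="pt").to(device)
with torch.no_grad():
    model(**encoded)

# Peak VRAM allocated by tensors in this process, in MB
print(torch.cuda.max_memory_allocated(device) / 1024 ** 2)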
Uploading weights as PipelineFiles
We can also take care of downloading the model weights ourselves: if we upload them as PipelineFiles, the Pipeline Cloud servers can take advantage of a rapid local cache, making cold starts even shorter. This is quite an advanced process, so only follow the steps below if you're comfortable with HuggingFace's libraries and your local filesystem.
We'll start by saving the HuggingFace model to an obvious location locally.
from transformers import BertModel
bert_model = BertModel.from_pretrained("bert-base-uncased")
bert_model.save_pretrained("saved-bert-base-uncased-model")
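As a quick sanity check (our aside, not part of the original walkthrough), you can list the directory to confirm the two files the next step expects were written:
import os

# Expect to see config.json and the weights file (pytorch_model.bin in this guide)
print(os.listdir("saved-bert-base-uncased-model"))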
Using save_pretrained() we can store the model's config.json (config) and pytorch_model.bin (weights) locally. Next we modify the pipeline blueprint to include two new PipelineFiles (one for each of these files).
with Pipeline("huggingface-demo") as pipeline_blueprint:
    masked_sentence = Variable(str, is_input=True)
    bert_model_config = PipelineFile(path="saved-bert-base-uncased-model/config.json")
    bert_model_weights = PipelineFile(path="saved-bert-base-uncased-model/pytorch_model.bin")
    pipeline_blueprint.add_variables(
        masked_sentence, bert_model_config, bert_model_weights
    )

    model = CustomBertModel()
    model.load(bert_model_config, bert_model_weights)

    raw_output = model.predict(masked_sentence)
    output = tensor_to_list(raw_output)

    pipeline_blueprint.output(output)
We must pass each PipelineFile to our CustomBertModel class's load method, because we're going to reference them inside that function, as below:
import torch
...
@pipeline_function
def load(
    self, bert_model_config: PipelineFile, bert_model_weights: PipelineFile
) -> bool:
    self.bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    self.bert_model = BertModel.from_pretrained(
        None,
        state_dict=torch.load(bert_model_weights.path),
        config=bert_model_config.path,
    )
    return True
...
The important part to look at here is the BertModel.from_pretrained() call, where we pass in None for the model identifier, instead loading the state_dict and config manually. To access the 'path' of a PipelineFile, use the .path property, as shown above.
It's okay to pass just a path to the config kwarg, but the state_dict must be an actual loaded dictionary of tensors, so we're using torch.load() to import this from the PipelineFile. The tokenizer is so small that using a PipelineFile probably won't accelerate loading by a noticeable amount, so for this tutorial we'll skip it.
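If you did want to bundle the tokenizer files as well (a hypothetical extra step, not covered by this tutorial), the same save_pretrained/from_pretrained pattern works for tokenizers:
from transformers import BertTokenizer

# Hypothetical: store the tokenizer files locally, next to the saved model files
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("saved-bert-base-uncased-tokenizer")

# Later, load from the local directory instead of downloading from the Hub
tokenizer = BertTokenizer.from_pretrained("saved-bert-base-uncased-tokenizer")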
Conclusion
Although bert-base-uncased is a simple model, the above pattern can be used to deploy pretty much any model available from transformers or diffusers!
Complete script
from transformers import BertTokenizer, BertModel
from pipeline import pipeline_model, pipeline_function, PipelineFile, Pipeline, Variable, PipelineCloud
import torch
from pipeline.util.torch_utils import tensor_to_list

@pipeline_model
class CustomBertModel:
    bert_tokenizer = None
    bert_model = None
    gpu = torch.device("cuda")

    @pipeline_function(run_once=True, on_startup=True)
    def load(
        self, bert_model_config: PipelineFile, bert_model_weights: PipelineFile
    ) -> bool:
        self.bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert_model = BertModel.from_pretrained(
            None,
            state_dict=torch.load(bert_model_weights.path),
            config=bert_model_config.path,
        ).to(self.gpu)
        return True

    @pipeline_function
    def predict(self, input: str) -> list:
        encoded_input = self.bert_tokenizer(input, return_tensors="pt").to(self.gpu)
        last_hidden_states = self.bert_model(**encoded_input).last_hidden_state
        return last_hidden_states

with Pipeline("huggingface-demo") as pipeline_blueprint:
    masked_sentence = Variable(str, is_input=True)
    bert_model_config = PipelineFile(path="saved-bert-base-uncased-model/config.json")
    bert_model_weights = PipelineFile(
        path="saved-bert-base-uncased-model/pytorch_model.bin"
    )
    pipeline_blueprint.add_variables(
        masked_sentence, bert_model_config, bert_model_weights
    )

    model = CustomBertModel()
    model.load(bert_model_config, bert_model_weights)

    raw_output = model.predict(masked_sentence)
    output = tensor_to_list(raw_output)

    pipeline_blueprint.output(output)

huggingface_pipeline = Pipeline.get_pipeline("huggingface-demo")

api = PipelineCloud(token="PIPELINE_API_TOKEN")
uploaded_pipeline = api.upload_pipeline(huggingface_pipeline)
print(f"Uploaded pipeline id: {uploaded_pipeline.id}")

example_input = "a [MASK] is worth a thousand words"
run = api.run_pipeline(
    uploaded_pipeline.id,
    [example_input],
)
print(f"Run id: {run.id}")
print(run.result_preview)