
    Two-Phase Predictions—Hybrid mode of model deployment

    Have you ever wondered how Alexa, Google Assistant, or any other always-listening voice assistant manages to respond to every user query without packing complex AI hardware into the device itself?

    We already know that, because of device constraints, models deployed on edge devices need to balance accuracy against model size, complexity, update frequency, and latency.

    • Cloud-deployed models often have high latency, which makes for a poor experience for voice assistant users.
    • Privacy is also a concern, since raw audio has to leave the device.

    Problems like this are where the Two-Phase Predictions pattern can help resolve the conflict.


    The idea is to split the use case into two phases: the simpler phase runs on the edge device, and the more complex one runs in the cloud only when required.

    For the use case we talked about earlier,

    • We'll have one edge-optimized model deployed on the device, listening to its surroundings for wake-up words (like "Alexa" or "Hey, Google") to determine whether the user wants to begin a conversation.
    • Upon successful detection, the device records the audio and, once it detects that the conversation has ended, sends the recording to the cloud for the complex-phase prediction that determines the user's intent.

    This implies the two phases are split as:

    1. Smaller, cheaper model deployed on edge device for the simpler task.
    2. Larger, complex model deployed on cloud and triggered only when needed.

    Let's try it out!

    Phase 1: Building the offline model

    We'll need to convert a trained model into one suitable to store and run on edge devices. This can be done via a process known as quantization, where the learned model weights are represented with fewer bytes.

    TensorFlow, for example, uses a format called TensorFlow Lite to convert saved models into a smaller format optimized for serving at the edge.

    This approach is termed post-training quantization. The idea is to find the maximum absolute weight value, \(m\), and then map the floating-point (often float32) range \(-m\) to \(+m\) onto the fixed-point (integer) range \(-127\) to \(+127\). This also requires the inputs to be quantized at inference time, which TFLite does for us automatically.

    That is, weights go from 32-bit floating-point values to 8-bit signed integers, shrinking the model to roughly a quarter of its original size.
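
    To make that mapping concrete, here is a minimal NumPy sketch of the symmetric scheme described above (the weight values are made up, and this is only an illustration, not TF Lite's actual implementation):

    import numpy as np

    weights = np.array([0.8, -2.5, 0.03, 1.7], dtype=np.float32)  # made-up float32 weights

    m = np.abs(weights).max()        # maximum absolute weight value
    scale = m / 127.0                # maps [-m, +m] onto [-127, +127]
    quantized = np.round(weights / scale).astype(np.int8)   # 8-bit signed integers
    dequantized = quantized.astype(np.float32) * scale      # approximate values used at inference

    The dequantized values are close to, but not exactly, the originals; that small error is the accuracy cost we'll come back to below.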

    To prepare the trained model for edge serving, we use TF Lite to export it in an optimized format:

    import tensorflow as tf

    # Convert the trained Keras model to TF Lite, applying default optimizations (quantization)
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    open('converted_model.tflite', 'wb').write(tflite_model)

    To generate a prediction with a TF Lite model, we use the TF Lite interpreter, which is optimized for low latency. On edge devices, the platform-specific TF Lite libraries provide APIs to load the model and run inference.

    For this, we create an instance of TF Lite's interpreter and get details on the input and output format it's expecting:

    # Load the converted model and allocate memory for its input/output tensors
    interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
    interpreter.allocate_tensors()

    # Inspect the input and output specs the model expects
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    Both input_details and output_details are lists containing a single dictionary that specifies the input/output spec of the converted TF Lite model. For example, input_details looks like the following:

    [{'name': 'serving_default_digits:0',
      'index': 0,
      'shape': array([  1, 784], dtype=int32),
      'shape_signature': array([ -1, 784], dtype=int32),
      'dtype': numpy.float32,
      'quantization': (0.0, 0),
      'quantization_parameters': {'scales': array([], dtype=float32),
                                  'zero_points': array([], dtype=int32),
                                  'quantized_dimension': 0},
      'sparsity_parameters': {}}]

    We'll then feed an example from our validation batch to the loaded TF Lite model and get a prediction as follows:

    import numpy as np

    # Prepare a single example from the validation batch in the dtype the model expects
    input_data = np.array([test_batch[42]], dtype=np.float32)
    interpreter.set_tensor(input_details[0]['index'], input_data)

    # Run inference and read back the output tensor
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])

    It's worth noting that, depending on how costly it is to call the second-phase (cloud) model, you can change which metric you optimize for when training the on-device model. For example, you might favor precision over recall if false negatives aren't a big concern.
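
    For instance, here is a rough sketch of how an on-device score might gate the call to the cloud model; the threshold, the score, and the helpers record_conversation and send_audio_to_cloud are all hypothetical:

    # Suppose the first-phase (on-device) model produced this score for the wake word
    wake_word_probability = 0.85    # hypothetical output of the edge model

    WAKE_WORD_THRESHOLD = 0.7       # hypothetical; tune it based on how costly a cloud call is

    if wake_word_probability >= WAKE_WORD_THRESHOLD:
        # Only now do we record the conversation and trigger the second phase
        audio = record_conversation()   # hypothetical helper
        send_audio_to_cloud(audio)      # hypothetical helper that calls the cloud model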

    The main problem with quantization is that it costs a bit of accuracy: it is roughly equivalent to adding noise to the weights and activations. If the accuracy drop is too severe, we may need to use quantization-aware training. This means adding fake quantization operations to the model during training so it learns to ignore the quantization noise, making the final weights more robust to quantization.
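
    If we do need it, the TensorFlow Model Optimization Toolkit provides a quantize_model wrapper. Here is a minimal sketch, assuming a Keras model and hypothetical training data (the compile settings are assumptions as well):

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Wrap the Keras model with fake-quantization ops so training sees quantization noise
    q_aware_model = tfmot.quantization.keras.quantize_model(model)

    q_aware_model.compile(optimizer='adam',
                          loss='sparse_categorical_crossentropy',  # assumed loss
                          metrics=['accuracy'])
    q_aware_model.fit(train_images, train_labels, epochs=1)  # hypothetical training data

    # The quantization-aware model is then converted with TF Lite exactly as before
    converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_qat_model = converter.convert()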

    Phase 2: Building the cloud model

    Our cloud model doesn't need to be bound by the constraints we faced with the edge-optimized model. We can follow a more traditional approach to training, exporting, and deploying it. This also means we can combine several other design patterns, such as Transfer Learning, a Cascade of models, or multiple different models, depending on what the second phase requires.

    After training, we can deploy this model to a cloud AI service provider (AWS, GCP, etc.), or set up a complete pipeline-based training and deployment workflow using libraries like TFX.

    To demonstrate, we'll pretend a model is already trained and then deploy it on Google Cloud AI Platform.

    First, we'll directly save our model to our GCP project storage bucket:

    cloud_model.save('gs://your_storage_bucket/path')

    This will export our model in the TF SavedModel format and upload it to our Cloud Storage bucket.

    On Google Cloud AI Platform, a model resource contains different versions of your model. Each model can have hundreds of versions. We'll create the model resource using gcloud, the Google Cloud CLI.

    gcloud ai-platform models create second-phase-predictor

    Then to deploy our model, we'll use gcloud and point AI Platform at the storage subdirectory that contains our saved model assets:

    gcloud ai-platform versions create v1 \
      --model second-phase-predictor \
      --origin 'gs://your_storage_bucket/path/model_timestamp' \
      --runtime-version=2.1 \
      --framework='tensorflow' \
      --python-version=3.7
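
    Once the version is deployed, the first phase can trigger it whenever needed. Below is a minimal sketch of an online prediction call using the Google API client; the project name and the instance payload are placeholders and must match your deployed model's input signature:

    from googleapiclient import discovery

    service = discovery.build('ml', 'v1')
    name = 'projects/your_project/models/second-phase-predictor/versions/v1'

    # `instances` must match the input signature of the deployed SavedModel
    request_body = {'instances': [recorded_audio_features]}  # placeholder payload

    response = service.projects().predict(name=name, body=request_body).execute()
    predictions = response['predictions']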

    Trade-Offs and Alternatives

    There might be situations where our end users have very little or no internet connectivity, making the second-phase, cloud-hosted model impossible to reach. How can we mitigate this issue? Beyond that, how are we supposed to perform continuous evaluation of the edge-deployed model, checking whether its metrics have degraded over time and whether its accuracy is suffering?

    Standalone single-phase model

    In situations where end users of our model may have little or no internet connectivity, instead of relying on a two-phase prediction flow, we can make our first model robust enough that it can be self-sufficient.

    To do this, we can create a smaller version of our complex model and give users the option to download this simpler, smaller model for use when they are offline. These offline models may not be quite as accurate as their larger online counterparts, but this solution is infinitely better than having no offline support at all.

    To build more complex models designed for offline inference, it's best to use quantization-aware training, whereby we quantize the model's weights and other math operations both during and after training.

    Offline support for specific use cases

    Another solution for making our application work for users with minimal internet connectivity is to make only certain parts of the app available offline. This could mean making a few common features work offline, or caching the results of an ML model's predictions for later offline use.

    This way, the app works sufficiently offline but provides full functionality when it regains connectivity.
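
    As a rough illustration of the caching idea (the cache file name and the keying scheme here are assumptions, not part of any particular library):

    import hashlib, json, os

    CACHE_PATH = 'prediction_cache.json'   # assumed on-device cache file

    def cached_predict(features, predict_fn):
        # Key the cache on a hash of the input features
        key = hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()
        cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
        if key not in cache:
            # Only possible while online; the result is reused later when offline
            cache[key] = predict_fn(features)
            with open(CACHE_PATH, 'w') as f:
                json.dump(cache, f)
        return cache[key]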

    Handling many predictions in near real time

    In other cases, end users of our ML model may have reliable connectivity but might need to make hundreds or even thousands of predictions against our model at once. This is the case with streaming sensor data, where we might, for example, be trying to detect anomalies.

    Getting prediction responses for thousands of examples at once would take too long because of the sheer number of requests and the network bandwidth involved.

    Instead of constantly sending requests over the network for anomaly detection, we can have a model deployed directly on the sensors to identify possible anomaly candidates from incoming data and then send only potential anomalies to our cloud model for verification.

    The main difference here is that both the offline and cloud models perform the same prediction task, but with different inputs.
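
    A rough sketch of that flow is shown below; the score threshold, the on-sensor model's predict call, and the send_batch_to_cloud helper are all hypothetical:

    CANDIDATE_THRESHOLD = 0.5   # hypothetical score above which a reading looks suspicious

    def filter_candidates(readings, on_sensor_model):
        # Run the small on-sensor model over incoming readings and keep only the
        # ones it flags, so just those are sent to the cloud model for verification.
        candidates = []
        for reading in readings:
            score = on_sensor_model.predict(reading)   # hypothetical on-sensor API
            if score >= CANDIDATE_THRESHOLD:
                candidates.append(reading)
        return candidates

    # send_batch_to_cloud(filter_candidates(sensor_readings, edge_model))  # hypothetical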

    Continuous evaluation for offline models

    We can save a subset of the predictions made on-device, then periodically evaluate the model's performance on these examples and determine whether it needs retraining.
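
    As a small sketch of what that sampling might look like (the sample rate and the log file are assumptions):

    import json, random

    SAMPLE_RATE = 0.01                  # assumed: keep roughly 1% of on-device predictions
    EVAL_LOG = 'eval_samples.jsonl'     # assumed local log, uploaded when connectivity allows

    def maybe_log_prediction(features, prediction):
        # Store a random subset of (input, prediction) pairs for later evaluation
        # against ground truth, to decide whether the edge model needs retraining.
        if random.random() < SAMPLE_RATE:
            with open(EVAL_LOG, 'a') as f:
                f.write(json.dumps({'features': features, 'prediction': prediction}) + '\n')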

    Another option is to create a replica of our on-device model to run online, only for continuous evaluation purposes. This solution is preferred if our offline and cloud models are running similar prediction tasks, like in Neural Machine Translation.

    That's all for today. Hope you learned something new.

    This is Anurag Dhadse, signing off.