ML System Design - Feature Store


This post is part of my ML System Design series. This installment covers the Feature Store.

Feature Store

What is a Feature Store

There are many different definitions of a feature store, each focusing on a different perspective.

Narrow definitions usually focus on storage, which is the first problem you face when deciding where to keep features. In general, however, we agree on a definition with a larger scope: a Feature Store is equivalent to a holistic feature engineering platform.

For example, in [1], a Feature Store is defined as a service that ingests data, computes features, stores them, and provides access to the stored features.

Why do we need a Feature Store

Features are used at many stages of the machine learning lifecycle, e.g., experimentation, training, and inference. We don't want to maintain separate data pipelines for each stage. A shared pipeline also reduces the chance of model drift (inconsistencies between training and serving data) and reduces data pipeline costs (deduplication).

Enable feature sharing across the organization. A feature store helps data scientists and machine learning engineers find and share features. This is usually achieved by providing a management plane that exposes this information to DS/MLEs.

Handle privacy. This is usually achieved by building a lineage system and a policy system.

Why building and operating a Feature Store is challenging

  • A fragmented landscape: different teams adopt different compute engines, storage systems, and serving stacks, so no single design fits everyone.

Operational Challenges

MLOps faces unique challenges compared to DevOps.

To give an example: a broken pipeline will certainly hurt feature quality, but even when the pipeline doesn't break, feature quality can still go south. In DevOps, we build monitoring to make sure the pipelines are running. In MLOps, we need to build monitoring that peeks into the feature data itself to make sure feature quality is satisfactory.
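To make this concrete, here is a minimal sketch of what such data-peeking monitoring could look like. The column names, thresholds, and checks (null rate and value range) are illustrative assumptions, not a specific tool's API.

```python
import pandas as pd

# A hypothetical feature-quality check: the pipeline may run "green"
# while the data inside it silently degrades, so we inspect the values
# themselves. Thresholds and column names are assumptions.
def check_feature_quality(df: pd.DataFrame, column: str,
                          max_null_rate: float = 0.01,
                          expected_range: tuple = (0.0, 1.0)) -> list:
    alerts = []

    # Null-rate check: a spike in missing values often signals an
    # upstream schema change or a silently failing join.
    null_rate = df[column].isna().mean()
    if null_rate > max_null_rate:
        alerts.append(f"{column}: null rate {null_rate:.2%} > {max_null_rate:.2%}")

    # Range check: values drifting outside the expected domain.
    lo, hi = expected_range
    bad = ((df[column] < lo) | (df[column] > hi)).mean()
    if bad > 0:
        alerts.append(f"{column}: {bad:.2%} of values outside [{lo}, {hi}]")

    return alerts
```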

Component Deep Dive

It is not too hard to categorize every component into one of three aspects of the machine learning lifecycle: feature generation, feature storage, and feature serving.

Feature Generation

Batch Processing

Streaming Processing

The process of creating a feature generation pipeline is part of what we call feature authoring.

In an ideal world, we want this authoring experience to be unified and effortless. However, this is very challenging to achieve.
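As an illustration only, a unified authoring layer might let an author declare a feature once and have the platform compile it into both a batch backfill job and a streaming update job. Every name below is hypothetical; no real library's API is implied.

```python
from dataclasses import dataclass

# A hypothetical declarative feature definition: authored once, then
# compiled by the platform into batch and streaming pipelines.
@dataclass
class FeatureDefinition:
    name: str         # feature name in the store
    source: str       # events table (batch) or topic (streaming)
    entity_key: str   # join key, e.g. a user id
    aggregation: str  # "sum", "count", "avg", ...
    window: str       # aggregation window, e.g. "7d"

purchase_amount_7d = FeatureDefinition(
    name="purchase_amount_7d",
    source="events.purchases",
    entity_key="user_id",
    aggregation="sum",
    window="7d",
)
```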

Feature Storage

When it comes to storage, we care most about the use cases, since they drive the design decisions.

First of all, we should acknowledge that there may not be a single storage system that works for all use cases. That is not to say we wouldn't want a unified batch and streaming storage layer for features; it just may not be achievable.

Offline Feature Storage

Batch storage is mostly optimized for batch processing.

At a higher level, these are data warehouse-like solutions with different variants:

  • Hive, Snowflake, Delta Lake, BigQuery

At a lower level, the feature data itself can be stored on:

  • a local file system (not preferred, since feature files can be very large)
  • a distributed file system like HDFS
  • object storage like S3 or MinIO (see the sketch after this list)
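As a concrete (and hypothetical) example of the object-storage option, a daily feature snapshot could be written as date-partitioned Parquet. The bucket path is made up, and writing to S3 from pandas assumes the s3fs package is installed.

```python
import pandas as pd

# A minimal sketch: persist a daily snapshot of offline features as
# date-partitioned Parquet on object storage. The bucket/path is a
# hypothetical placeholder; pandas needs s3fs to write to s3:// URLs.
snapshot = pd.DataFrame({
    "user_id": [1, 2, 3],
    "purchase_amount_7d": [120.0, 35.5, 0.0],
    "event_date": ["2024-01-01", "2024-01-01", "2024-01-01"],
})
snapshot.to_parquet(
    "s3://feature-store/offline/purchase_features/",
    partition_cols=["event_date"],  # enables point-in-time reads by day
)
```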

Online Feature Storage

Ideal World

In an ideal world, we want a single storage system that serves as both the online and the offline store.

However, even without such a storage system, there is an opportunity to provide an abstraction layer on top to hide the complexity. This abstraction should be considered on both the 1) write and 2) read paths.

On the write path, the unification challenge is being attacked at different stages:

  • pipeline authoring stage: we have seen efforts to unify the programming model (e.g., Apache Beam) to achieve write-once, run-anywhere, as sketched below.
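For instance, a Beam pipeline like the sketch below keeps one set of transforms and only swaps the I/O between a bounded (batch) and an unbounded (streaming) source. The input path and event schema are assumptions.

```python
import json
import apache_beam as beam

# A minimal Apache Beam sketch of write-once, run-anywhere: the same
# transforms run in batch or streaming depending on the source. The
# input path and event fields below are hypothetical.
with beam.Pipeline() as p:
    (
        p
        # For streaming, this source could be swapped for a Pub/Sub or
        # Kafka read (plus a windowing step) without rewriting the
        # feature logic below.
        | "ReadEvents" >> beam.io.ReadFromText("events.jsonl")
        | "ParseJson" >> beam.Map(json.loads)
        | "ToKV" >> beam.Map(lambda e: (e["user_id"], float(e["purchase_amount"])))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("purchase_features")
    )
```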

Feature Serving

Feature serving allows on-demand transforms, i.e., computation happens on the fly when fetching features from feature storage.

Why? Precomputation isn't always possible or efficient [6]; a sketch of the freshness case follows this list.

  • freshness constraints
    • the feature is extracted from an online transaction
    • the feature depends on the current time
    • the feature must be extracted from click-stream events
  • space constraints
    • crossed features can explode in size when precomputed; it is more reasonable to calculate them on the fly
  • cost constraints
    • there is a cost to spinning up feature generation pipelines; for a very simple computation (e.g., type casting), doing it on the fly is the most cost-efficient option
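Here is a minimal sketch of the freshness case. The store interface and key layout are hypothetical: the raw signup timestamp is precomputed and stored, while the time-dependent feature is derived at request time.

```python
import time

# A hypothetical on-demand transform: "account age" depends on the
# current time, so precomputing it would make it stale immediately.
# `online_store` is any key-value lookup; the key format is made up.
def get_account_age_days(online_store, user_id: str) -> float:
    signup_ts = online_store.get(f"user:{user_id}:signup_ts")
    # Derive the time-dependent value on the fly at serving time.
    return (time.time() - signup_ts) / 86400.0
```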

This is the part of the system that needs low latency, high throughput, and high availability.

Challenges

  • serving the needs of large-model training
  • serving the needs of model inference

Management Plane

Control Plane

Topic Deep Dive

Industrial Solutions

DoorDash

FEAST

https://docs.feast.dev/
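A minimal online-read sketch with Feast's Python SDK, adapted from its quickstart; the repo path, feature view, and entity names are placeholders for whatever a concrete feature repo defines.

```python
from feast import FeatureStore

# A minimal Feast sketch (see https://docs.feast.dev/): read fresh
# feature values for inference. The repo path and feature/entity
# names are quickstart-style placeholders.
store = FeatureStore(repo_path=".")

online = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(online)
```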

References