ML System Design - Feature Store
This is part of my ML System Design series: Feature Store.
Feature Store
What is a Feature Store
There are many different feature store definitions, each focusing on a different perspective.
A narrow definition usually focuses on storage, which is the first problem to solve when you think about where to store the features. In general, however, we agree on a definition with a larger scope: a Feature Store is equivalent to a holistic feature engineering platform solution.
For example, in [1], a Feature Store is defined as a service that ingests data, computes features, stores them, and provides access to the stored features.
Why do we need a Feature Store
Features are used in many stages of the machine learning lifecycle, e.g., experimentation, training, and inference. We don’t want to maintain a separate data pipeline for each stage. Consolidating pipelines also reduces the chance of model drift (inconsistencies between training and serving data) and reduces data pipeline costs (deduplication).
Enables feature sharing across the organization. A feature store helps data scientists and machine learning engineers find and share features. This is usually achieved by providing a management plane that exposes this information to DS/MLEs.
Handles privacy. This is usually achieved by building a lineage system and a policy system.
Why building and operating a Feature Store is challenging
- Fragmented landscape.
Operational Challenges
MLOps faces unique challenges beyond those of DevOps.
To give an example: a broken pipeline will certainly hurt feature quality, but even when the pipeline doesn’t break, feature quality can still go south. In DevOps, we build monitoring to make sure the pipelines are running. In MLOps, we need to build monitoring that peeks into the feature data itself to make sure feature quality is satisfactory.
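To make this concrete, below is a minimal sketch of such a data-aware check. The pandas-based implementation, thresholds, and alert format are my own illustration, not a prescribed design.

```python
import pandas as pd

def check_feature_quality(df: pd.DataFrame, column: str,
                          max_null_rate: float = 0.01,
                          expected_range: tuple = (0.0, 1e6)) -> list:
    """Return a list of quality alerts for one feature column."""
    alerts = []
    # A pipeline can "succeed" while silently producing mostly-null output.
    null_rate = df[column].isna().mean()
    if null_rate > max_null_rate:
        alerts.append(f"{column}: null rate {null_rate:.2%} > {max_null_rate:.2%}")
    # Out-of-range values often signal an upstream schema or unit change.
    lo, hi = expected_range
    bad_rate = ((df[column] < lo) | (df[column] > hi)).mean()
    if bad_rate > 0:
        alerts.append(f"{column}: {bad_rate:.2%} of values outside [{lo}, {hi}]")
    return alerts
```

Checks like this run after (or alongside) the pipeline, so they catch silent quality regressions that pipeline health metrics alone would miss.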
Component Deep Dive
It is not too hard to categorize every component into three aspects of the machine learning lifecycle: feature generation, feature storage, and feature serving.
Feature Generation
Batch Processing
Streaming Processing
The process of creating a feature generation pipeline is part of what we call feature authoring.
In an ideal world, we want this authoring experience to be unified and effortless. However, this is very challenging to achieve.
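For a taste of what declarative authoring looks like today, below is a sketch in the style of Feast's FeatureView API (see also [6]). The exact API differs across Feast versions, and the entity, source path, and field names here are invented.

```python
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity is the join key that connects feature rows to training examples.
user = Entity(name="user", join_keys=["user_id"])

# Batch source backing the feature view (path is illustrative).
user_stats_source = FileSource(
    path="data/user_stats.parquet",
    timestamp_field="event_timestamp",
)

# A declarative feature definition: the platform, not the author,
# decides how to materialize it to offline and online storage.
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="total_orders", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
    ],
    source=user_stats_source,
)
```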
Feature Storage
When it comes to storage, we care most about the use cases, since they drive the design decisions.
First of all, we should acknowledge that there might not be a single storage that works for every use case. That is not to say we don’t want a unified batch and streaming storage layer for features.
Offline Feature Storage
Batch storage is mostly optimized for batch processing.
At a higher level, these are data-warehouse-like solutions with different variants:
- Hive, Snowflake, Delta Lake, BigQuery
At a lower level, the feature data can be stored on:
- a local file system (not preferred, since feature files can be very large)
- a distributed file system like HDFS
- object storage like S3 or MinIO
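As a concrete illustration of the lower level, an offline feature snapshot is often just a partitioned Parquet dataset on object storage. The sketch below uses pandas (assuming pyarrow and s3fs are installed); the bucket path and columns are invented.

```python
import pandas as pd

# A daily feature snapshot, partitioned by date for efficient batch scans.
df = pd.DataFrame({
    "user_id": [1, 2],
    "total_orders": [10, 3],
    "event_date": ["2024-01-01", "2024-01-01"],
})

# Write partitioned Parquet directly to object storage (S3).
df.to_parquet(
    "s3://feature-store/offline/user_stats/",
    partition_cols=["event_date"],
)
```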
Online Feature Storage
Ideal World
In an ideal world, we want a single storage that can serve as both the online and the offline store.
However, even if we don’t have such a storage, there is an opportunity to provide an abstraction layer on top that hides the complexity. Such an abstraction should be considered on both the 1) write and 2) read paths.
On the write path, the unification challenge is being attacked at different stages.
- pipeline authoring stage: we have seen efforts to unify the programming model (e.g., Apache Beam) to achieve write once, run anywhere, as sketched below.
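A minimal sketch of that model with Beam's Python SDK: the transform logic is identical whether the runner executes it as a bounded batch job (as here, reading a file) or as a streaming job over an unbounded source such as Pub/Sub. The input path and event schema are invented for illustration.

```python
import json
import apache_beam as beam

# The same transforms run on batch or streaming runners; only the
# source/sink and runner configuration change.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromText("events.jsonl")
        | "ParseJson" >> beam.Map(json.loads)
        | "KeyByUser" >> beam.Map(lambda e: (e["user_id"], e["amount"]))
        | "SumPerUser" >> beam.CombinePerKey(sum)
        | "FormatCsv" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
        | "WriteFeatures" >> beam.io.WriteToText("user_spend")
    )
```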
Feature Serving
Feature serving allows on-demand transforms, i.e., computation happens on the fly when fetching the feature from feature storage.
Why? Precomputation isn’t always possible or efficient [6] (see the sketch after this list).
- freshness constraints
  - the feature is extracted from an online transaction
  - the feature depends on the current time
  - the feature needs to be extracted from click-stream events
- space constraints
  - crossed features might explode in size if precomputed; it is more reasonable to calculate them on the fly
- pipeline cost
  - there is a cost to spinning up feature generation pipelines; for very simple computations (e.g., type casting), doing it on the fly is the most cost-efficient option
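Putting those constraints together, an on-demand transform might look like the sketch below. It assumes a hypothetical online store client exposing a get(entity_key) API; the feature names are invented.

```python
from datetime import datetime, timezone

def serve_features(store, user_id: str) -> dict:
    # Fetch the precomputed features from the online store.
    row = store.get(f"user:{user_id}")
    now = datetime.now(timezone.utc)
    # Time-dependent feature: impossible to precompute, so derive it per request.
    row["account_age_days"] = (now - row["signup_ts"]).days
    # Trivial transform: cheaper to cast on the fly than to run a pipeline for it.
    row["total_orders"] = int(row["total_orders"])
    return row
```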
This is the part where we need low latency, high throughput, and high availability.
Challenges
- serve the needs of training large models
- serve the needs of model inference
Management Plane
Control Plane
Topic Deep Dive
Industrial Solutions
DoorDash
FEAST
References
- [1] https://neptune.ai/blog/feature-stores-components-of-a-data-science-factory-guide
- [2] Machine Learning Design Patterns - Book by Michael Munn, Sara Robinson, and Valliappa Lakshmanan
- [3] https://towardsdatascience.com/mlops-building-a-feature-store-here-are-the-top-things-to-keep-in-mind-d0f68d9794c6
- [4] https://www.tecton.ai/blog/how-to-build-a-feature-store/
- [5] https://www.informatica.com/blogs/adopt-a-kappa-architecture-for-streaming-and-ingesting-data.html
- [6] Feast RFC-021: On-Demand Transformations