Important

This feature is currently in public preview. This preview release is provided without a service level agreement and is not recommended for production workloads. Some features may not be supported or may have limited functionality. For more information, see Microsoft Azure Preview Supplemental Terms of Use.
This tutorial series shows how features can seamlessly integrate all phases of the ML lifecycle: prototyping, training, and operationalization.
Part 1 of this tutorial showed how to create a feature set specification using a custom transformation. Part 2 of this tutorial showed how to enable materialization and perform backfilling. This tutorial shows how to experiment with features to improve model performance. By the end of the tutorial, you'll see how feature stores can increase the agility of your experimentation and training pipelines.
Part 3 of this tutorial shows how to:

- Prototype a new accounts feature set specification, using existing precomputed values as features. You then register the local feature set specification as a feature set in the feature store. This differs from Part 1 of the tutorial, where you created a feature set with a custom transformation.
- Select features for the model from the transactions and accounts feature sets, and save them as a feature retrieval specification.
- Run a training pipeline that uses the feature retrieval specification to train a new model. The pipeline uses the built-in feature retrieval component to generate the training data.
Prerequisites
- Make sure you've done Part 1 and Part 2 of this tutorial.
Set up

Configure the Azure Machine Learning Spark notebook

In the Compute dropdown on the top navigation bar, select Azure Machine Learning Spark compute. Wait for the status bar at the top to display Configure session.
Configure the session:

- Select Configure session in the bottom navigation bar
- Select Upload conda file
- Select the file azureml-examples/sdk/python/featurestore-sample/project/env/conda.yml from your local device
- (Optional) Increase the session timeout (idle time) to avoid frequent re-runs of the prerequisites
Start the Spark session
```python
# Run this cell to start the Spark session (any code block will start the session). This may take about 10 minutes.
print("start spark session")
```
Set example root directory
```python
import os

# Please update the directory to ./Users/{your-alias} (or any custom directory you uploaded the samples to).
# You can find the name from the directory structure in the left navigation.
root_dir = "./Users//featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or modify the path")
```
Initialize project workspace CRUD client
```python
### Initialize the MLClient of this project workspace
import os
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

# Connect to the project workspace
ws_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    project_ws_sub_id,
    project_ws_rg,
    project_ws_name,
)
```
Initialize feature store CRUD client
Make sure to update featurestore_name to reflect what you created in Part 1 of this tutorial.
```python
from azure.ai.ml import MLClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

# Feature store
featurestore_name = "my-featurestore"  # Use the same name from Part 1 of this tutorial
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

# Feature store ML client
fs_client = MLClient(
    AzureMLOnBehalfOfCredential(),
    featurestore_subscription_id,
    featurestore_resource_group_name,
    featurestore_name,
)
```
Initialize the feature store SDK client

```python
# Feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)
```
In the project workspace, create a compute cluster called cpu-cluster
Here we run training/batch inference jobs that depend on this compute cluster
```python
from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="STANDARD_F4S_V2",  # You can replace this with another supported VM SKU
    location=ws_client.workspaces.get(ws_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)
ws_client.begin_create_or_update(cluster_basic).result()
```
Step 1: Create account feature set locally based on precomputed data
In Part 1 of this tutorial, we created a transactions feature set with a custom transformation. Here, we create an accounts feature set that uses precomputed values.

To load precomputed features, you can create a feature set specification without writing any transformation code. A feature set specification is a specification you use to develop and test a feature set in a fully local development environment, without connecting to a feature store. In this step, you create the feature set specification locally and sample values from it. To get managed feature store capabilities, you must register the feature set specification with a feature store by using a feature asset definition. Later steps in this tutorial provide more details.
Step 1a: Explore the source data for accounts
Note

The sample data used in this notebook is hosted in a publicly accessible blob container. It can only be read in Spark via the wasbs driver. When you create feature sets by using your own source data, host the data in an Azure Data Lake Storage Gen2 account, and use the abfss driver in the data path.
```python
accounts_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_df = spark.read.parquet(accounts_data_path)

display(accounts_df.head(5))
```
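For comparison, a data path for source data hosted in your own Azure Data Lake Storage Gen2 account would use the abfss driver. The container, storage account, and folder names below are placeholders, not resources created in this tutorial:

```python
# Illustrative only: an abfss data path for source data hosted in your own ADLS Gen2 account.
# Replace the container, storage account, and folder names with your own values.
own_accounts_data_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/accounts/*.parquet"
# own_accounts_df = spark.read.parquet(own_accounts_data_path)
```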
Step 1b: Create an accounts feature set specification locally, from these precomputed features

We don't need any transformation code here, because we reference precomputed features.
```python
from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    FeatureSource,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)

accounts_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path="wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    # The account profiles in the source are updated once a year. Set temporal_join_lookback to 365 days
    temporal_join_lookback=DateTimeOffset(days=365, hours=0, minutes=0),
    infer_schema=True,
)

# Generate a Spark dataframe from the feature set specification
accounts_fset_df = accounts_featureset_spec.to_spark_dataframe()

# Display a few records
display(accounts_fset_df.head(5))
```
Step 1c: Export as Feature Set Specification
To register a feature set specification with the feature store, the specification must be saved in a specific format. Action: After you run the next cell, inspect the generated accounts FeatureSetSpec: open this file from the file tree to see the specification: featurestore/featuresets/accounts/spec/FeatureSetSpec.yaml
The specification contains these elements:

- source: A reference to a storage resource. In this case, it's a parquet file in a blob storage resource.
- features: A list of features and their datatypes. If you provide transformation code (as in Part 1 of this tutorial), the code must return a dataframe that maps to the features and datatypes. If you don't provide transformation code (as for accounts, because the values are precomputed), the system builds the query to map the features to the source.
- index_columns: The join keys required to access values from the feature set.

For more information, see Understand the top-level entities in the managed feature store and the CLI (v2) feature set specification YAML schema.
```python
import os

# Create a new folder to dump the feature set spec
accounts_featureset_spec_folder = root_dir + "/featurestore/featuresets/accounts/spec"

# Check if the folder exists; if not, create it
if not os.path.exists(accounts_featureset_spec_folder):
    os.makedirs(accounts_featureset_spec_folder)

accounts_featureset_spec.dump(accounts_featureset_spec_folder)
```
Preserving the specification in this way means it can be under source control.
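As a rough sketch only (the exact field names and layout come from the generated file, so treat this as illustrative rather than authoritative), the dumped FeatureSetSpec.yaml mirrors the arguments passed to create_feature_set_spec above:

```yaml
# Illustrative sketch; open the generated FeatureSetSpec.yaml for the authoritative contents.
source:
  type: parquet
  path: wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet
  timestamp_column:
    name: timestamp
index_columns:
  - name: accountID
    type: string
features:
  # Feature names and types are inferred from the source schema (infer_schema=True)
  - name: accountAge
    type: integer
  - name: numPaymentRejects1dPerUser
    type: integer
temporal_join_lookback:
  days: 365
```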
Step 2: Experiment with unregistered features locally and register to feature store when ready
In feature development, you may want to test and validate locally before proceeding with feature store registration or executing cloud training pipelines. In this step, you will generate training data for your ML model from feature combinations. These features include a locally unregistered set of features (accounts) and a set of features registered in the feature store (transactions).
Step 2a: Select features for the model
```python
# Get the registered transactions feature set, version 1
transactions_featureset = featurestore.feature_sets.get("transactions", "1")

# Note that the accounts feature set spec is in your local development environment (this notebook): it isn't yet registered with the feature store
features = [
    accounts_featureset_spec.get_feature("accountAge"),
    accounts_featureset_spec.get_feature("numPaymentRejects1dPerUser"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]
```
Step 2b: Generate training data locally
This step generates training data for illustration purposes. You can optionally use this data to train a model locally; a minimal local-training sketch follows the next cell. Later in this tutorial, you'll see how to train a model in the cloud.
```python
from azureml.featurestore import get_offline_features

# Load the observation data. For a description of observation data, see Part 1 of this tutorial
observation_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

# Use the feature data and the observation data to generate the training dataframe
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that the feature set is not materialized (materialization is optional). We enable materialization in the next part of the tutorial.
display(training_df)

# Note: display(training_df.head(5)) shows the timestamp column in a different format. You can call training_df.show() to see the values in the correct format
```
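If you do want to try local training on this dataframe, the sketch below is one minimal way to do it. It assumes the observation data contains a binary label column, named is_fraud here purely for illustration (check the real column names with training_df.columns), and it uses scikit-learn, which this tutorial doesn't otherwise require:

```python
# Minimal local-training sketch (illustrative only; not part of the original tutorial steps).
# Assumes a binary label column named "is_fraud" exists in the observation data; adjust to your schema.
from sklearn.linear_model import LogisticRegression

pdf = training_df.toPandas().dropna()
feature_columns = [
    "accountAge",
    "numPaymentRejects1dPerUser",
    "transaction_amount_7d_sum",
    "transaction_amount_3d_sum",
    "transaction_amount_7d_avg",
]
X, y = pdf[feature_columns], pdf["is_fraud"]

local_model = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", local_model.score(X, y))
```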
Step 2c: Register the accounts feature set with the feature store

After you experiment with different feature definitions locally and sanity test them, you can register them with the feature store. In this step, you register the feature set asset definition with the feature store.
```python
from azure.ai.ml.entities import FeatureSet, FeatureSetSpecification

accounts_fset_config = FeatureSet(
    name="accounts",
    version="1",
    description="accounts featureset",
    entities=["azureml:account:1"],
    stage="Development",
    specification=FeatureSetSpecification(path=accounts_featureset_spec_folder),
    tags={"data_type": "nonPII"},
)

poller = fs_client.feature_sets.begin_create_or_update(accounts_fset_config)
print(poller.result())
```
Step 2d: Get the registered feature set and sanity test it
```python
# Look up the feature set by providing a name and a version
accounts_featureset = featurestore.feature_sets.get("accounts", "1")

# Access the feature data
accounts_feature_df = accounts_featureset.to_spark_dataframe()
display(accounts_feature_df.head(5))

# Note: Please ignore this warning: Failed loading azureml_run_type_providers. Could not load entry point azureml.scriptrun
```
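A couple of lightweight checks you might add to the sanity test (optional, not part of the original steps; note that count() triggers a full read of the source, so use it sparingly on large data):

```python
# Optional sanity checks on the registered accounts feature set
accounts_feature_df.printSchema()                 # confirm expected feature names and datatypes
print("row count:", accounts_feature_df.count())  # confirm the source can be read end to end
```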
Step 3: Run the training experiment
Here, you select the list of features, run the training pipeline, and register the model. You can repeat this step until you are satisfied with the model performance.
(Optional) Step 3a: Discover features from the feature store UI
Part 1 of this tutorial covered this, after you registered the transactions feature set. Since you also have an accounts feature set, you can browse the available features:
- Go to the Azure Machine Learning global landing page
- In the left navigation, select Feature stores
- You see the list of feature stores that you can access. Select the feature store that you created earlier in this tutorial.

You can see the feature sets and entities that you created. Select a feature set to browse its feature definitions. You can also use the global search box to search for feature sets across feature stores.
(Optional) Step 3b: Discover features from the SDK
```python
# List the available feature sets
all_featuresets = featurestore.feature_sets.list()
for fs in all_featuresets:
    print(fs)

# List the versions of the transactions feature set
all_transactions_fs = featurestore.feature_sets.list(name="transactions")
for fs in all_transactions_fs:
    print(fs)

# See the properties of the transactions feature set, including its list of features
featurestore.feature_sets.get(name="transactions", version="1").features
```
Step 3c: Select features for the model and export them as a feature retrieval specification
In the previous steps, you selected features from both registered and unregistered feature sets for local experimentation and testing. Now you can experiment in the cloud. Saving the selected features as a feature retrieval specification, and using that specification in MLOps/CI/CD pipelines for training and inference, increases your agility when shipping models.
Select features for the model
```python
# You can select features in a pythonic way
features = [
    accounts_featureset.get_feature("accountAge"),
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_3d_sum"),
]

# You can also specify features in string form: featureset:version:feature
more_features = [
    "accounts:1:numPaymentRejects1dPerUser",
    "transactions:1:transaction_amount_7d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)
```
Export selected features as feature retrieval specification
Note

A feature retrieval specification is a portable definition of the feature list associated with a model. It helps streamline the development and operationalization of an ML model. It's the input to the training pipeline that generates the training data, it's packaged with the model, and it's used to look up features during inference. It becomes the glue that integrates all phases of the ML lifecycle. Changes to training and inference pipelines can be kept to a minimum as you experiment and deploy.

Using the feature retrieval specification and the built-in feature retrieval component is optional. You can directly use the get_offline_features() API, as shown earlier in this tutorial.

The specification should be named feature_retrieval_spec.yaml so that the system can recognize it when it's packaged with the model.
```python
# Create the feature retrieval spec
feature_retrieval_spec_folder = root_dir + "/project/fraud_model/feature_retrieval_spec"

# Check if the folder exists; if not, create it
if not os.path.exists(feature_retrieval_spec_folder):
    os.makedirs(feature_retrieval_spec_folder)

featurestore.generate_feature_retrieval_spec(feature_retrieval_spec_folder, features)
```
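As an optional quick check (not part of the original steps), you can list the output folder to confirm that the generated file carries the expected name:

```python
# Optional: confirm the generated spec file has the expected name
import os
print(os.listdir(feature_retrieval_spec_folder))  # expected to include 'feature_retrieval_spec.yaml'
```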
Step 4: Use pipelines to train in the cloud, and register the model if you are satisfied
In this step, you manually trigger the training pipeline. In a production scenario, a CI/CD pipeline could trigger it, based on changes to the feature retrieval specification in the source repository.
Step 4a: Run the training pipeline
The training pipeline has the following steps:
- Feature retrieval step: Here, the built-in component takes as input the feature retrieval specification, observation data, and timestamp column names. Then, it generates training data as output. It runs the feature retrieval step as a managed spark job.
- Training step: This step trains the model based on the training data and generates the model (not yet registered)
- Evaluation Step: This step verifies that the model performance/quality is within thresholds (here, it is used as a placeholder/dummy step for illustration purposes)
- Register model step: This step registers the model
In Part 2 of this tutorial, you ran a backfill job to materialize data for the transactions feature set. The feature retrieval step reads the feature values for this feature set from its offline store. The behavior is the same even if you use the get_offline_features() API.
```python
from azure.ai.ml import load_job

# Load the pipeline definition and submit it as a job
training_pipeline_path = (
    root_dir + "/project/fraud_model/pipelines/training_pipeline.yaml"
)
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ws_client.jobs.create_or_update(training_pipeline_definition)

ws_client.jobs.stream(training_pipeline_job.name)

# NOTE: Each step in the pipeline may take ~15 minutes the first time you run it. However, subsequent runs can be faster (assuming the Spark pool is warm - the default timeout is 30 minutes)
```
Open the pipeline run "web view" in a new window to inspect the steps in the training pipeline.
Step 4b: Check the feature retrieval specification in the model artifact
- In the left navigation of the current workspace, select Models, and open it in a new tab or window
- Select fraud model
- In the top navigation, select Artifacts

Notice that the feature retrieval specification is packaged with the model, in the model registration step of the training pipeline. You created a feature retrieval specification during experimentation, and it has become part of the model definition. The next tutorial shows how inferencing uses the feature retrieval specification.
Step 5: View feature sets and model dependencies
Step 5a: Review the list of feature sets associated with the model
On the same model page, select the Feature sets tab. This tab shows both the transactions and accounts feature sets that this model depends on.
Step 5b: Review the list of models that use the feature set
- Open the feature store UI (described earlier in this tutorial)
- In the left navigation, select Feature sets
- Select any feature set
- Select the Models tab

You can see the list of models that use the feature set (determined from the feature retrieval specification when the model was registered).
Clean up

Part 4 of the tutorial describes how to delete the resources.
Next steps

- Understand concepts: feature store concepts, top-level entities in managed feature store
- Understand identity and access control for feature store
- View the feature store troubleshooting guide
- Reference: YAML reference