A tutorial on how to deploy one of the key pieces of the MLOps-enabling modern data platform: the Feature Store on Azure Databricks with Terraform as IaC.
Machine Learning Operations (MLOps): that's a trendy buzzword nowadays, isn't it? Well, today we won't be debating the merits of why this is what you need for your business, or which of its different aspects you should consider. Instead, we want to share what we've learnt during the process of setting up one of the pieces of a data platform that truly empowers data scientists and machine learning engineers: the Feature Store on Azure Databricks. Owing to its popularity, cloud-agnostic approach and great user experience thanks to its declarative syntax, we went with Terraform as the Infrastructure as Code (IaC) framework of choice, and it has paid dividends generously. Without further ado, let's get to it.
Within this blog post we'll feature the following services deployed on the Azure cloud:
- Azure Databricks (the workspace plus a cluster, a secret scope, notebooks and a job)
- Azure Data Lake Storage Gen2 (a storage account with containers)
- Azure Key Vault
- Azure Cosmos DB (an account with a SQL database and a container)
Fully fledged data platforms may obviously feature more resources, but the ones used herein will get you the full functionality of the Feature Store, including its offline and online versions.
Within the code repository featured in this blog-post, we’ve included two slightly modified notebooks originally created by Databricks. Our modifications allow you to run the notebooks right after the infrastructure is deployed, so you can familiarize yourself with Feature Store functionalities straightaway. Since they’re extremely well documented, we’ll just let you explore them on your own. You can find them here.
Herein, we assume that you run the Terraform code from your own computer, which means the state information will be kept locally. If you need to share the state with other team members, or you already have pipelines in place for deploying infrastructure, you should store it in a more accessible place, such as Blob storage. On how to do so, please refer to this official Azure tutorial.
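For reference, a minimal sketch of such a remote backend block (the resource group, storage account and container names below are placeholders for pre-existing state storage that you would create separately):
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"        # placeholder: resource group holding the state storage
    storage_account_name = "tfstatestorage"    # placeholder: pre-existing storage account
    container_name       = "tfstate"           # placeholder: pre-existing blob container
    key                  = "feature-store.terraform.tfstate"
  }
}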
In order to deploy the Databricks Feature Store on Azure, we'll leverage two Terraform files: one containing the resource definitions and variables.tf holding the input variables.
The ready-to-use Terraform code is available in the GitHub repo. While you can have a look at it anytime, we'll carefully guide you through all of it so you can understand both the code and the services that will be a part of the Databricks Feature Store. The full code can be found here:
GitHub - getindata/azure-databricks-feature-store-poc
Prior to running any Terraform code, you must authenticate to Azure cloud. This is done via Azure CLI, which must be installed on the computer that you’ll use to deploy the infrastructure. Instructions on how to do so can be found here.
Once the CLI is installed, log in by executing the following command and following the on-screen prompts:
az login
The first step in any Terraform workflow is the initialization of the working directory. This also makes sure that all of the provider plugins specified in the code are properly installed and tracked in a local lock file. You can do this by calling this simple command:
terraform init
If you want to leverage an existing resource group for this tutorial, you should first import its definition into the local state file. In order to do so, you need to comment out the sections of the code that would (at this point) get evaluated prematurely. Those specifications are as follows:
- provider "databricks"
- data "local_sensitive_file" "aad_token_file"
- data "databricks_current_user" "me"
So the code for those pieces should be as follows:
# provider "databricks" {
# host = azurerm_databricks_workspace.dbx.workspace_url
# token = trimsuffix(data.local_file.aad_token_file.content, "\r\n")
# }
# data "local_sensitive_file" "aad_token_file" {
# depends_on = [null_resource.get_aad_token_for_dbx]
# filename = var.aad_token_file
# }
# data "databricks_current_user" "me" {
# depends_on = [azurerm_databricks_workspace.dbx]
# }
After you’ve commented those sections out, copy the ID of your Azure subscription (you can check it via the Azure portal or via Azure CLI). Then, run the following command:
terraform import azurerm_resource_group.rg /subscriptions/<your-subscription-id>/resourceGroups/<your-resource-group-name>
A successful run of the import command will create a state file, in which the resources managed by Terraform are meticulously tracked. Depending on your needs, you may sometimes have to modify the state, for instance when you want to delete all of the resources created by a given Terraform configuration without deleting the whole resource group. In that case you can remove the resource group (azurerm_resource_group.rg) from the state file and destroy all of the remaining resources by running terraform destroy. If you later want to deploy into this resource group again, just run the above terraform import command once more. Please be aware that you should never edit the state file by hand; state modifications should always be done via the terraform state command.
If there are resources in this resource group that are not imported into Terraform, they will remain intact no matter what you do in the code (unless you decide to destroy the whole resource group from Terraform; in such a case, removing the resource group from the state file is the better option).
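As a quick sketch of that flow: remove the resource group from the state, destroy what Terraform still manages, and later re-import the resource group if you want to deploy into it again:
terraform state rm azurerm_resource_group.rg
terraform destroy
terraform import azurerm_resource_group.rg /subscriptions/<your-subscription-id>/resourceGroups/<your-resource-group-name>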
The following code is used for the providers declaration:
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.26.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = ">= 1.4.0"
    }
  }
  required_version = ">= 1.1.0"
}

provider "azurerm" {
  features {
    key_vault {
      purge_soft_delete_on_destroy = true
    }
  }
}

provider "databricks" {
  host  = azurerm_databricks_workspace.dbx.workspace_url
  token = trimsuffix(data.local_sensitive_file.aad_token_file.content, "\r\n")
}

data "local_sensitive_file" "aad_token_file" {
  depends_on = [null_resource.get_aad_token_for_dbx]
  filename   = var.aad_token_file
}

resource "null_resource" "get_aad_token_for_dbx" {
  triggers = { always_run = "${timestamp()}" }

  provisioner "local-exec" {
    command = format("az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | jq -r .accessToken > %s", var.aad_token_file)
  }
}
Throughout this tutorial, we use two different Terraform providers: azurerm and databricks. As you can see, there is an additional option specified for key_vault within the azurerm provider, and that is purge_soft_delete_on_destroy. Since the infrastructure showcased here was never intended to be used in real production workflows, by leveraging this option we make sure that the Key Vault content is purged along with its deletion.
The provider for Databricks is more interesting though. Since the goal here was to set up everything required for Feature Store usage from one codebase, it was necessary to create a secret scope within Databricks that is backed by Azure Key Vault. In order to do so, you must authenticate to the Databricks provider using either the Azure CLI or an AAD token. You can't use a Databricks Personal Access Token (the standard authentication method) for this purpose, because it does not authorize you to access native Azure resources such as Key Vault in any way. Authentication via a Service Principal is not supported as of now. For more info, please refer to this link.
The authentication via AAD token is handled as follows: the "null_resource" "get_aad_token_for_dbx" resource is used for an Azure CLI call that retrieves a token, including the flag --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d, which is Azure's programmatic ID for the Databricks workspace. The output of the Azure CLI call is piped to the jq command (jq -r .accessToken > %s), which writes the token value to %s. The whole CLI call and the pipe are wrapped within the native Terraform format() function, which maps %s to the name of the file that is used for keeping the AAD token. As specified in the variables.tf file, it is aad_token.txt.

Once the token is written to a local file, it can be referenced via a data block, specified here as "local_sensitive_file" "aad_token_file". This data block is referenced in the provider "databricks" block, where we use its content as the token, while also taking care of any redundant line endings that might have got into the file, depending on the operating system of choice. This trimming is done via the trimsuffix function.
Prior to handling the actual resources, let’s make our life a little easier later on and declare some meta-data.
data "azurerm_client_config" "current" {}
data "databricks_current_user" "me" {
depends_on = [azurerm_databricks_workspace.dbx]
}
By specifying those two data blocks, we'll be able to reference a few attributes throughout the code, such as the tenant_id and object_id connected to our user account on Azure, as well as our user's home path in the Databricks Workspace. By specifying depends_on = [azurerm_databricks_workspace.dbx] in databricks_current_user, we ensure that the Databricks user is not evaluated until the Databricks Workspace is actually deployed.
resource "azurerm_resource_group" "rg" {
name = var.resource_group
location = "West Europe"
tags = {
"tag1" = "tag1value"
"tag2" = "tag2value"
}
}
As already mentioned, you can leverage an existing resource group for deploying your Feature Store, but you can also create a brand new one directly from Terraform. If you import an existing resource group and don't want to change any of its specification, make sure that its definition in the Terraform code exactly matches the existing resource group, including any tags. You can check them either on the portal or when you run the terraform plan or terraform apply command. If the resource group is not modified in any way in the output of one of those commands, your definition completely matches the already existing resource group. Otherwise, you can either modify your code or let Terraform overwrite the definition of the resource group.
resource "azurerm_storage_account" "adls2" {
name = var.stg_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
account_tier = "Standard"
account_replication_type = "LRS"
account_kind = "StorageV2"
is_hns_enabled = "true"
}
resource "azurerm_storage_data_lake_gen2_filesystem" "stg1" {
name = "offline-feature-store"
storage_account_id = azurerm_storage_account.adls2.id
}
resource "azurerm_storage_data_lake_gen2_filesystem" "stg2" {
name = "temp"
storage_account_id = azurerm_storage_account.adls2.id
}
Although a dedicated storage solution is not necessary for spinning up Databricks with its Feature Store, it is generally a good idea to keep your data separate from the Databricks Workspace. It does not limit your capabilities at all, but rather provides much more flexibility to the architecture you’re building. In such a case it’s a great idea to leverage unmanaged Delta tables — this means that Databricks would only handle the metadata (such as the path on the underlying storage) of the Delta table and not the data itself. For more info on this, please refer to this link.
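As an aside, here is a minimal sketch of what writing an unmanaged Delta table to that dedicated storage can look like from a notebook. The container name comes from the Terraform code above, while the storage account, database and table names are placeholders, and the cluster needs the storage key in its Spark config (which we set up later in this post):
# 'df' is assumed to be an existing Spark DataFrame prepared earlier in the notebook.
path = "abfss://offline-feature-store@<stg_name_from_variables.tf>.dfs.core.windows.net/wine_features"
(
    df.write.format("delta")
    .option("path", path)                           # the data itself stays on ADLS2...
    .saveAsTable("feature_store_db.wine_features")  # ...only the metadata is registered in the metastore
)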
Due to the inherent advantages of Azure Data Lake Storage Gen2 over Azure Blob Storage, it's a good idea to leverage it as the underlying solution for storing large amounts of data. In order to achieve this, account_kind and is_hns_enabled in the azurerm_storage_account resource must be set to StorageV2 and true, respectively.

Please notice how the attributes of resources are reused here. Instead of manually inputting resource_group_name and its location, we take advantage of the fact that those attributes are already tracked in the state and simply reference them here.

Containers provide the hierarchy for the data that is kept on Azure storage solutions. A container can be treated as a parent directory in a given storage account, so the path <storage-name>/<container-name> maps to the top level of the container. From within the Terraform code, containers can be created using the azurerm_storage_data_lake_gen2_filesystem resource.
resource "azurerm_databricks_workspace" "dbx" {
name = var.dbx_workspace_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
sku = "premium"
}
Deploying the Databricks workspace is done within the azurerm provider and not the databricks provider; the latter is for interacting with the workspace itself and requires the workspace to be created first. The sku parameter controls whether you'll be using a standard or premium workspace. The comparison of their functionalities and pricing can be found here.
Please note that this only deploys the workspace. Deploying a workspace automatically creates an additional resource group that is managed by the Databricks workspace. Resources provisioned by Databricks will appear in that resource group whenever they are needed, so when you start a new cluster, a new set of resources for the given VM type will get deployed there.
resource "azurerm_key_vault" "kv" {
name = var.kv_name
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
enabled_for_disk_encryption = true
tenant_id = data.azurerm_client_config.current.tenant_id
soft_delete_retention_days = 7
purge_protection_enabled = false
sku_name = "standard"
}
resource "azurerm_key_vault_access_policy" "kv_ap" {
key_vault_id = azurerm_key_vault.kv.id
tenant_id = data.azurerm_client_config.current.tenant_id
object_id = data.azurerm_client_config.current.object_id
secret_permissions = [
"Set",
"List",
"Get",
"Delete",
"Recover",
"Restore",
"Purge"
]
depends_on = [azurerm_key_vault.kv]
}
Since Databricks provided a way to use secrets from the Azure Key Vault directly within the Databricks Runtime, let’s leverage this for storing the keys required for providing connection credentials. In order to do so, we need to provision a Key Vault first.
Access to a Key Vault may be handled in two ways: either via an access policy or via Role Based Access Control. We chose the former due to its simplicity. The Terraform code as specified here creates a full set of secret permissions for the user authenticated to the azurerm provider, since the object_id parameter is set to data.azurerm_client_config.current.object_id. In order to create an access policy for a different user, group or service principal, change this value.

Please also note that the depends_on option in azurerm_key_vault_access_policy is set to [azurerm_key_vault.kv]. This is done to make sure that the policy is created only after the underlying Key Vault has already been provisioned.
resource "databricks_secret_scope" "kv" {
name = azurerm_key_vault.kv.name
keyvault_metadata {
resource_id = azurerm_key_vault.kv.id
dns_name = azurerm_key_vault.kv.vault_uri
}
}
Making Key Vault secrets accessible from the Databricks Workspace is very easy when it's done via Terraform. All that needs to be done is to hook up the resource_id and dns_name as keyvault_metadata in the databricks_secret_scope resource. But please remember that in order for this setup to work, you need to authenticate to the databricks provider in a way that provides access to both the Databricks Workspace and the Azure Key Vault, as described in the Providers section. Registering a Key Vault inside the Databricks Workspace as a secret scope is exactly the reason why the authentication is done via an AAD token.
Please also note that in such a case, access to secrets is not managed via Databricks but via the access control on a Key Vault. The way of controlling this access is described in the section above (Azure Key Vault).
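Once the scope exists, reading a Key Vault secret from a notebook is a one-liner. A small sketch (the scope name equals the Key Vault name set in var.kv_name, and the secret shown here is one we create later in this post):
# The returned value is redacted in notebook output, but usable in code.
cosmos_write_key = dbutils.secrets.get(
    scope="<your-key-vault-name>",
    key="cosmosdb-primary-key-write-authorization-key",
)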
resource "azurerm_cosmosdb_account" "cdbacc" {
name = var.cosmos_db_name
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
offer_type = "Standard"
kind = "GlobalDocumentDB"
enable_automatic_failover = true
tags = {
defaultExperience = "Core (SQL)"
hidden-cosmos-mmspecial = ""
}
consistency_policy {
consistency_level = "BoundedStaleness"
max_interval_in_seconds = 300
max_staleness_prefix = 100000
}
geo_location {
location = azurerm_resource_group.rg.location
failover_priority = 0
}
}
resource "azurerm_cosmosdb_sql_database" "db" {
name = "tf-db"
resource_group_name = azurerm_resource_group.rg.name
account_name = azurerm_cosmosdb_account.cdbacc.name
}
resource "azurerm_cosmosdb_sql_container" "cnt" {
name = "tf-container"
resource_group_name = azurerm_resource_group.rg.name
account_name = azurerm_cosmosdb_account.cdbacc.name
database_name = azurerm_cosmosdb_sql_database.db.name
partition_key_path = "/definition/id"
partition_key_version = 1
throughput = 400
indexing_policy {
indexing_mode = "consistent"
included_path {
path = "/*"
}
included_path {
path = "/included/?"
}
excluded_path {
path = "/excluded/?"
}
}
unique_key {
paths = ["/definition/idlong", "/definition/idshort"]
}
}
There are three different backends available for the Online Feature Store on Azure Databricks:
- Azure Cosmos DB
- Azure MySQL (Single Server)
- Azure SQL Server (SQL Database)
Of those three options, Cosmos DB provides the most features, and that's why it's used here.
There are some limitations when it comes to using Cosmos DB as a Feature Store backend though, namely:
Apart from that, the connection between the Databricks Runtime and Cosmos DB is handled by a separate OLTP connector that needs to be installed on the cluster; we'll revisit this later on. If all of those downsides haven't discouraged you from using Cosmos DB as a Feature Store backend, let's continue.
The standard Cosmos DB SQL hierarchy consists of the following entities:
Cosmos DB account > database > container > items
So at first, the account must be created. Cosmos DB with the Core (SQL) API can be provisioned from within Terraform by specifying kind = "GlobalDocumentDB" and setting the tags to:
tags = {
  defaultExperience       = "Core (SQL)"
  hidden-cosmos-mmspecial = ""
}
Once the Cosmos DB account is created, we can create databases and containers. Please note that throughput is set at the container level; by changing this value, you control how much computing power is reserved for your workloads, and therefore also the cost of the resource.
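If you would rather not pin a fixed throughput, the azurerm_cosmosdb_sql_container resource also accepts an autoscale block instead of the throughput argument. A sketch of the alternative (the maximum value here is just an example):
autoscale_settings {
  max_throughput = 4000   # Cosmos DB then scales the container between 10% and 100% of this value
}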
Once the Cosmos DB basic setup is done, we can go forward with establishing the connection between Databricks Workspace and the database. For this purpose, we’ll need a cluster that will serve as our computing engine on Databricks.
resource "databricks_cluster" "dbx_cluster" {
cluster_name = var.cluster_name
spark_version = "11.2.x-cpu-ml-scala2.12"
node_type_id = "Standard_F4"
autotermination_minutes = 30
num_workers = 1
spark_conf = {
format("fs.azure.account.key.%s.dfs.core.windows.net", azurerm_storage_account.adls2.name) = azurerm_storage_account.adls2.primary_access_key
}
}
The tweaking of the cluster specification is something that you shouldn't (initially) spend too much time on. It can always be changed and adjusted as you go on with your workflows. You can also create as many clusters as you want; creation itself does not cost anything, you only pay while the clusters are running. It is therefore a very good idea to set an auto-termination option, so you won't pay for clusters sitting idle overnight were you to forget to shut them down. You can also choose which VM types to utilize and the number of workers that will be spun up. If you need to, you can also specify autoscaling for your clusters, but as far as POC purposes go, the cheapest VMs available will be sufficient. Here you can also choose the Spark version that will be available to your workloads. In order to have the Feature Store library preinstalled on the cluster, you should refer to this one to check which version of the Databricks Machine Learning runtime you should use. As displayed in the code, Databricks Runtime 11.2 for Machine Learning was used while preparing this blog post.
Apart from the cluster hardware and runtime specification, with the databricks_cluster resource you can also add Spark configuration properties, which can be leveraged for customizing the runtime experience. Herein, spark_conf is used to provide the access key to the ADLS2 account. Please notice that a format function is used first, in order to reference azurerm_storage_account.adls2.name within the string, and then the key value itself is assigned to it. Since this assignment is done by referencing properties of objects created earlier on, the storage key is never exposed in the code itself: it is evaluated at runtime, whenever you run terraform apply.
Owing to adding the ADLS2 key to the cluster's Spark configuration, you can access data sitting on this storage via the Azure Blob Filesystem driver. It works in such a way that storage paths are mapped to an abfss path. For instance, you can list objects sitting in one of the containers we created earlier on by executing the following command in the Databricks workspace (but remember that you must use a cluster with the storage key added to the Spark config):
dbutils.fs.ls("abfss://offline-feature-store@<stg_name_from_variables.tf>.dfs.core.windows.net/external-location/path/to/data")
We find using a dedicated storage account an optimal way to work with Delta tables on Databricks. Full control over all of the storage functionalities provides great flexibility to the platform users.
resource "databricks_library" "cosmos" {
cluster_id = databricks_cluster.dbx_cluster.id
maven {
coordinates = "com.azure.cosmos.spark:azure-cosmos-spark_3-2_2-12:4.14.1"
}
}
Now that the cluster and a dedicated Cosmos DB account for our Feature Store are both ready, we need to set up the connection between those two resources. The connection is handled by the Azure Cosmos DB OLTP Spark connector shipped on Maven. This library is not preinstalled and requires a manual installation on the cluster. However, owing to the databricks_library resource, this is very easy to do from the script. We just need to specify the source of the library (maven) and its coordinates, along with the appropriate cluster id.
Now the connection between the resources of the Feature Store is ready to be established. For this purpose, we need to provide the Cosmos DB credentials to the Databricks cluster; they will be used by the OLTP connector that we just installed. Those credentials can live in any secret scope in the Databricks workspace, but since we earlier created a secret scope backed by Azure Key Vault, let's just leverage it.
resource "azurerm_key_vault_secret" "cdb-primary-key-write" {
name = "cosmosdb-primary-key-write-authorization-key"
value = azurerm_cosmosdb_account.cdbacc.primary_key
key_vault_id = azurerm_key_vault.kv.id
depends_on = [azurerm_key_vault_access_policy.kv_ap]
}
resource "azurerm_key_vault_secret" "cdb-primary-key-read" {
name = "cosmosdb-primary-key-read-authorization-key"
value = azurerm_cosmosdb_account.cdbacc.primary_readonly_key
key_vault_id = azurerm_key_vault.kv.id
depends_on = [azurerm_key_vault_access_policy.kv_ap]
}
Once the Cosmos DB account is created, its access keys are available as attributes that can be reused throughout the Terraform script, similarly to what we've done with ADLS2. Therefore, let's just put the required access keys in our Key Vault.
Please note the naming convention for the secrets: both of the access keys (read and write) have a suffix of authorization-key. This is not done by accident. In fact, the Feature Store library expects the access keys to have this particular suffix, regardless of what else is in the name. We find this peculiar, especially since in the code you always reference the secret name without this suffix; the Feature Store library just hides the suffix addition from the user. For more info, please refer to the original docs. You can also have a look at the Python code for this connection in the notebooks uploaded to the repo along with the Terraform code.
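For illustration, here is a minimal sketch of how these secrets are typically referenced when publishing a feature table to the Cosmos DB online store, based on the Databricks Feature Store docs. The account URI, scope and table names are placeholders, and the secret prefixes deliberately omit the authorization-key suffix:
from databricks.feature_store import FeatureStoreClient
from databricks.feature_store.online_store_spec import AzureCosmosDBSpec

fs = FeatureStoreClient()

# The library appends "-authorization-key" to these prefixes when looking up the secrets.
online_store = AzureCosmosDBSpec(
    account_uri="https://<cosmos_db_name>.documents.azure.com:443/",
    write_secret_prefix="<your-key-vault-name>/cosmosdb-primary-key-write",
    read_secret_prefix="<your-key-vault-name>/cosmosdb-primary-key-read",
)

fs.publish_table(name="feature_store_db.wine_features", online_store=online_store)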
resource "databricks_token" "pat" {
comment = "Terraform Provisioning"
// 100 day token
lifetime_seconds = 8640000
}
resource "azurerm_key_vault_secret" "databricks-token" {
name = "databricks-token"
value = databricks_token.pat.token_value
key_vault_id = azurerm_key_vault.kv.id
depends_on = [azurerm_key_vault_access_policy.kv_ap, databricks_token.pat]
}
One of the notebooks available in the repo features testing real-time inference by calling an endpoint that serves your MLflow model. In order to authenticate such a connection, we'll use a Personal Access Token for the Databricks workspace. It can be created via the databricks_token resource. Once it's provisioned, let's insert it into the Key Vault, just as we've done for the other credentials.
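As a rough sketch of how that token might then be used from outside the workspace: the workspace URL, model name and version are placeholders, and the exact request body format depends on the MLflow version backing the endpoint, so treat the payload below as illustrative only:
import requests

# Hypothetical values: replace with your workspace URL, registered model name/version and real input data.
url = "https://<your-workspace-url>/model/<model-name>/<model-version>/invocations"
token = "<databricks-token-retrieved-from-key-vault>"

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    json={"dataframe_records": [{"feature_1": 1.0, "feature_2": 2.0}]},  # payload shape depends on MLflow version
)
print(response.json())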
resource "databricks_notebook" "online-fs-wine-example" {
source = format("${path.module}/databricks_notebooks/%s", var.notebook_1)
path = format("${data.databricks_current_user.me.home}/terraform/%s", var.notebook_1)
}
resource "databricks_notebook" "notebook-2" {
source = format("${path.module}/databricks_notebooks/%s", var.notebook_2)
path = format("${data.databricks_current_user.me.home}/terraform/%s", var.notebook_2)
}
Once the resources are properly set up, we just need a working example of how the Feature Store operates. To achieve this, we upload the git-versioned notebooks to our Databricks workspace. This is done with the databricks_notebook resource. For dealing with paths, we'll leverage the powerful format function available in Terraform. The source argument specifies the local path to the file, and path is the target path in the Databricks workspace. We can use ${path.module} to get the filesystem path of the directory where our code sits. For building the path in the Databricks workspace, we can leverage data.databricks_current_user.me.home, the attribute of the data block we specified earlier on. By leveraging it, the beginning of the path will be equal to our user's home path in the Databricks workspace.
resource "databricks_job" "run-online-fs-wine-example" {
name = "run-wine-example"
new_cluster {
num_workers = 1
spark_version = "11.2.x-cpu-ml-scala2.12"
node_type_id = "Standard_DS3_v2"
}
spark_conf = {
format("fs.azure.account.key.%s.dfs.core.windows.net", azurerm_storage_account.adls2.name) = azurerm_storage_account.adls2.primary_access_key
}
notebook_task {
notebook_path = databricks_notebook.online-fs-wine-example.path
}
library {
maven {
coordinates = "com.azure.cosmos.spark:azure-cosmos-spark_3-2_2-12:4.14.1"
}
}
}
So far, what we've done has focused on setting up an interactive experience with Databricks: the default mode, where you write and execute code as you go. However, Databricks was never intended to be only an interactive platform; it also supports scheduled jobs, which are necessary for any production workloads. Those run on job clusters, which are distinct from interactive ones and are in fact cheaper to use. They exist only for the duration of a given piece of code. They also require their full setup prior to the job start, and if your job fails, the cluster will need to be provisioned again.
Workflows are created with the databricks_job resource. Basically, it's the same setup as was done for the interactive cluster, just within one resource. The only real difference in the code is the addition of the notebook_task block, which specifies the path to the notebook containing the Python code that will be run within the workflow. Once the Terraform code is successfully run, you can inspect the task specification for this workflow in the Databricks UI.
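If you wanted the workflow to run on a schedule rather than only on demand, the databricks_job resource also accepts a schedule block. A small sketch (the cron expression is just an example):
schedule {
  quartz_cron_expression = "0 0 6 * * ?"   # every day at 06:00
  timezone_id            = "UTC"
}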
So, that's it for the description of the whole Terraform code. If you're deploying into an existing resource group, you'll need to uncomment the code that was commented out in the section "Deploying into a resource group that isn't empty", which involves the following entities:
- provider "databricks"
- data "local_sensitive_file" "aad_token_file"
- data "databricks_current_user" "me"
Now we can investigate what resources Terraform will attempt to create by calling:
terraform plan
You should see a plan detailing what will happen in the resource group if you go forward with the deployment. If there are any changes that you want to make to your code, go ahead and do them now.
If you’re ready, it’s time to deploy all of the services. Please go ahead and run the command:
terraform apply
At first you will see the same plan that was shown after running terraform plan. However, now you're also asked whether you want to go forward with the deployment. If you do, type yes, hit enter, and wait a bit for Terraform to do its job and provision all of the resources.
Once the deployment is done, you can go ahead and explore the notebooks that got uploaded to the workspace. They highlight the functionalities of the Feature Store using the setup created by the Terraform code. You can also run the workflow that was created; it will appear in the Workflows tab in the UI.
And there you have it! We hope that this blog post not only showed you how to deploy the Databricks Feature Store, but also highlighted that working with Terraform is effective and simple, mostly owing to its well-thought-out declarative syntax, especially when there are quite a few pieces in the cloud that need connections established between them. Of course, since state management in Terraform is up to the end user, it can sometimes be troublesome to get right, but we still find that a very low price for what we get in return.
We hope this tutorial helped you start off your Feature Store on the Databricks platform. If you would like to read about how the Feature Store is leveraged within MLOps workflows, what the use cases of the Databricks platform are in the modern data world, or why we like Terraform so much, please let us know.
In the meantime we encourage you to check out our other blog posts about the Feature Store:
Feature Store comparison: 4 Feature Stores - explained and compared
Feature store - managing multiple data sources with Feast
And our free step-by-step guide covering all you need to know about the Feature Store.
Cheers!