Foundation for your Data Science and Machine Learning modelling
Can you imagine Machine Learning modelling without processes and tools to support it? In the long term, proper automation around developing models and running them in production environments is a must-have: it lets you focus on data, experimentation and business goals. It also minimises the possibility of human error and boosts the productivity of your Data Science team.
How does the Machine Learning Platform work?
Data that you are going to use for modelling and Feature Engineering can be loaded from offline and online sources.
For offline data, the primary source is the data lake. If you need data that is not available there, a proper SQL query federation engine lets you combine data from multiple systems, such as external databases, files and data stores, within a single query. You do not need to copy data from different sources to use it in your analysis.
An online data source is usually a data stream provided by your messaging solution or real-time streaming platform.
The biggest value of a unified way of loading data is that you can combine different data sets, offline and online, and build a consistent view on top of them.
In simple words, a feature is a measurable property of an entity. This can be an attribute of a customer, an impression on the website or a computed value such as the average order amount of a certain user.
Feature engineering is the process of combining and transforming online and offline data into reusable datasets of versioned and curated (i.e. passing quality checks) features that serve as inputs to the machine learning training process.
This can be done offline, in batch mode, to prepare a whole data set of features for further modelling or model execution. Features can also be calculated online, based on events, to automatically adjust the model to be more accurate, e.g. propensity to purchase based on current behaviour on the website.
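To make the two modes concrete, here is a minimal sketch that computes the same feature, the average order amount per user, once in batch and once incrementally from events. The event layout and the names `batch_average_order_amount` and `OnlineAverageOrderAmount` are assumptions made for illustration, not part of any specific product.

```python
from collections import defaultdict

# Hypothetical order events: (user_id, order_amount).
orders = [("u1", 20.0), ("u2", 50.0), ("u1", 40.0)]


def batch_average_order_amount(events):
    """Batch mode: scan the whole data set and compute one value per user."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_id, amount in events:
        totals[user_id] += amount
        counts[user_id] += 1
    return {user_id: totals[user_id] / counts[user_id] for user_id in totals}


class OnlineAverageOrderAmount:
    """Online mode: update the same feature incrementally, one event at a time."""

    def __init__(self):
        self._count = defaultdict(int)
        self._mean = defaultdict(float)

    def update(self, user_id, amount):
        self._count[user_id] += 1
        # Running mean: move the old mean towards the new value.
        self._mean[user_id] += (amount - self._mean[user_id]) / self._count[user_id]
        return self._mean[user_id]

    def value(self, user_id):
        return self._mean[user_id]


batch_features = batch_average_order_amount(orders)

online = OnlineAverageOrderAmount()
for user_id, amount in orders:
    online.update(user_id, amount)
```

Both paths should agree on the final value for each user; the online variant simply never needs to re-read the full history.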
An important part of the feature engineering process is testing: data quality checks automatically verify that the data has no flaws, and unit tests verify that our code behaves the way it should.
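As a rough illustration of such checks, the sketch below validates a batch of feature rows and then exercises the check itself with plain assertions, the way a unit test would. The column names `customer_id` and `avg_order_amount` and the rules applied to them are assumptions chosen for the example.

```python
def check_feature_rows(rows):
    """Return a list of problems found; an empty list means the batch passed."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        customer_id = row.get("customer_id")
        amount = row.get("avg_order_amount")
        if customer_id is None:
            problems.append(f"row {index}: missing customer_id")
        elif customer_id in seen_ids:
            problems.append(f"row {index}: duplicate customer_id {customer_id}")
        else:
            seen_ids.add(customer_id)
        if amount is None:
            problems.append(f"row {index}: missing avg_order_amount")
        elif amount < 0:
            problems.append(f"row {index}: negative avg_order_amount {amount}")
    return problems


# Hypothetical batches used to test the check itself.
good_rows = [
    {"customer_id": "c1", "avg_order_amount": 30.0},
    {"customer_id": "c2", "avg_order_amount": 50.0},
]
bad_rows = [
    {"customer_id": "c1", "avg_order_amount": -5.0},
    {"customer_id": "c1", "avg_order_amount": None},
]

# Unit-test style assertions: a clean batch passes, a flawed one is reported.
assert check_feature_rows(good_rows) == []
assert len(check_feature_rows(bad_rows)) == 3
```

In a pipeline, a non-empty result would typically stop the batch from being published as a curated feature set.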
A feature store is a component providing centralized access to calculated and version-controlled features for data analytics and machine learning modelling. It is a single source of truth for data scientists, who can reuse and share their work and easily collaborate while ensuring data accuracy. A feature store improves data consistency between model training and serving, and contributes to data democratization in the organisation.
The offline part of the feature store is mainly used for model scoring in batch mode and for model validation at a certain point in time. A very important capability of the feature store is presenting historical features, such as a view of the customer 6 months ago, including demographics and segmentation but also the number of purchased services at that moment.
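A point-in-time view like this can be sketched as an "as of" lookup over versioned feature records. The class `FeatureHistory`, its methods and the sample customer data below are hypothetical and only illustrate the idea; a real feature store would do this over partitioned tables rather than in-memory lists.

```python
from bisect import bisect_right
from datetime import date


class FeatureHistory:
    """Keeps every version of an entity's features, ordered by effective date."""

    def __init__(self):
        self._dates = {}   # entity_id -> sorted list of effective dates
        self._values = {}  # entity_id -> feature dicts aligned with _dates

    def write(self, entity_id, effective_date, features):
        dates = self._dates.setdefault(entity_id, [])
        values = self._values.setdefault(entity_id, [])
        position = bisect_right(dates, effective_date)
        dates.insert(position, effective_date)
        values.insert(position, features)

    def as_of(self, entity_id, when):
        """Return the features valid on `when`, or None if none existed yet."""
        dates = self._dates.get(entity_id, [])
        position = bisect_right(dates, when)
        if position == 0:
            return None
        return self._values[entity_id][position - 1]


history = FeatureHistory()
history.write("customer-1", date(2023, 1, 1), {"segment": "standard", "services": 1})
history.write("customer-1", date(2023, 9, 1), {"segment": "premium", "services": 3})

# The customer as seen in June versus December of the same year.
earlier_view = history.as_of("customer-1", date(2023, 6, 1))
latest_view = history.as_of("customer-1", date(2023, 12, 1))
```

Validating a model against `earlier_view` rather than `latest_view` avoids leaking information that was not yet available at that moment.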
The online component is used for serving the latest version of the features needed by real-time models to compute the score. It should provide very fast random access to the features of a single entity.
The ML Platform is a module that manages the modelling lifecycle, with an emphasis on experimentation, reproducibility and deployment. During research, a data scientist tests multiple hypotheses based on different sets of features to achieve the best results. Proper experiment tracking is key to boosting productivity and achieving reproducibility: it is very easy to get lost while working on hundreds of sets of features.
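The core of experiment tracking can be sketched in a few lines: every run records which features and parameters it used and which score it achieved, so the best combination can be found and reproduced later. `ExperimentLog`, `Run` and the metric values below are illustrative placeholders, not results from a real model or the API of a particular tracking tool.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: int
    feature_set: tuple
    params: dict
    auc: float


class ExperimentLog:
    """Minimal in-memory experiment tracker (a sketch, not a real tool)."""

    def __init__(self):
        self.runs = []

    def log_run(self, feature_set, params, auc):
        # Sort feature names so the same set is always recorded the same way.
        run = Run(len(self.runs) + 1, tuple(sorted(feature_set)), dict(params), auc)
        self.runs.append(run)
        return run

    def best_run(self):
        return max(self.runs, key=lambda run: run.auc)


log = ExperimentLog()
# Placeholder scores, chosen only to show how runs are compared.
log.log_run(["age", "avg_order_amount"], {"max_depth": 3}, auc=0.71)
log.log_run(["segment", "age", "avg_order_amount"], {"max_depth": 5}, auc=0.78)
log.log_run(["segment"], {"max_depth": 5}, auc=0.64)

best = log.best_run()
```

With hundreds of feature sets, the recorded `feature_set` and `params` are what make the winning run reproducible.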
ML model training is a multi-step and repetitive process with many optional preprocessing steps. Implementing an ML Platform is a way to automate and measure this process. A proper toolset increases the productivity of Data Scientists and helps to keep the quality of the process under control.
A model registry stores information about model lineage (which model was produced by which experiment), versioning and staging (which model is in production). This is a must-have component in a collaborative environment.
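The three responsibilities named here (lineage, versioning and staging) can be illustrated with a small in-memory sketch. The `ModelRegistry` class, its stage names and the `churn` model are hypothetical; production registries add persistent storage, permissions and artifact handling.

```python
class ModelRegistry:
    """Minimal sketch of a registry tracking lineage, versions and stages."""

    def __init__(self):
        self._versions = {}  # model name -> list of version records

    def register(self, name, experiment_run_id):
        """Add a new version and remember which experiment run produced it."""
        versions = self._versions.setdefault(name, [])
        record = {
            "version": len(versions) + 1,
            "run_id": experiment_run_id,  # lineage
            "stage": "staging",
        }
        versions.append(record)
        return record["version"]

    def promote(self, name, version):
        """Move one version to production and archive the previous one."""
        records = {record["version"]: record for record in self._versions[name]}
        target = records[version]  # raises KeyError for an unknown version
        for record in records.values():
            if record["stage"] == "production":
                record["stage"] = "archived"
        target["stage"] = "production"

    def production(self, name):
        for record in self._versions.get(name, []):
            if record["stage"] == "production":
                return record
        return None

    def versions(self, name):
        return list(self._versions.get(name, []))


registry = ModelRegistry()
v1 = registry.register("churn", experiment_run_id=2)
v2 = registry.register("churn", experiment_run_id=7)
registry.promote("churn", v1)
registry.promote("churn", v2)
```

After the second promotion, only one version is in production and the earlier one remains available for rollback or audit.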
Model monitoring records model metrics to assess the business performance of the model (e.g. efficiency), which can feed back into the feature engineering process.
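One simple way to turn recorded metrics into a feedback signal is to compare a rolling average with a baseline. The sketch below assumes a single numeric metric per period; the `MetricMonitor` class, the baseline, the tolerance and the sample values are all illustrative assumptions.

```python
from statistics import mean


class MetricMonitor:
    """Records a model metric and flags when its recent average degrades."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline
        self.tolerance = tolerance
        self.window = window
        self.history = []

    def record(self, value):
        """Store a new value; return True if the rolling average is too low."""
        self.history.append(value)
        recent = self.history[-self.window:]
        return mean(recent) < self.baseline - self.tolerance


monitor = MetricMonitor(baseline=0.80, tolerance=0.05, window=3)
# Placeholder weekly metric values, not measurements from a real model.
alerts = [monitor.record(value) for value in [0.81, 0.79, 0.78, 0.70, 0.66]]
```

An alert from such a monitor is the kind of signal that would prompt a review of the features feeding the model.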
The model deployment component ensures that all models are deployed in a standard and automated way: as a microservice running on top of an orchestrator for online models, or as a SQL statement for offline scoring.
A security and access management tool lets you control user access to data and components of the environment. It provides audit capabilities for verifying who has access to specific resources.
Deployment automation with proper configuration management is key to ensuring high-quality software delivery and reducing the risk of production deployments. All our code is stored in a version control system. We design tests to be part of the Continuous Integration and Continuous Deployment pipelines.
A comprehensive monitoring and observability solution gives detailed information on the state and performance of the components. You can also deploy metrics to observe application processing behaviour. Monitoring also includes alerting capabilities, needed for reliability and supportability.
Originally, all of the components of the Hadoop ecosystem were installed with YARN as an orchestrator to achieve scalability and manage infrastructure resources. Nowadays Kubernetes is becoming the new standard for managing resources in distributed computing environments. We design our applications and workloads to work directly on Kubernetes.
The adoption of Machine Learning modelling is increasing in many industries. The most popular use cases involve predicting anomalies or fraud to improve the efficiency of business processes and risk management. On the marketing and sales side there are many flavours of customer segmentation models, recommendation models, churn prediction and sophisticated dynamic pricing or customer elasticity models. Another domain is social media, where Machine Learning is used for sentiment analysis that supports marketing and PR but also Product Management.
There are almost endless possibilities to employ Machine Learning to improve processes. It all depends on the scenario we would like to work on. What use case would you like to discuss?
How do we work with customers?
We have a different way of working with clients that allows us to build deep, trust-based partnerships, which often endure over years. It is based on a few powerful and pragmatic principles tested and refined over many years of our consulting and project delivery experience.
Big Data is a process
Big Data is not about technologies, but about fostering a culture of collecting, analyzing and using data in a structured way, in an innovation-friendly environment. We can help you start this journey.
Our code is versioned, unit tested and deployed using CI/CD. We also design unit tests for data to measure its quality in large data sets.
Open source or native cloud services
We build our solutions with openness in mind, so we extensively use open source software. However, in some cases we suggest using managed services offered by public cloud providers.
On-premise or in public cloud
Our solutions are designed to be deployed on your local infrastructure, in hybrid cloud or fully in the public cloud.
Our solutions are designed to accommodate best practices and our vast experience in Big Data and are not based on specific technologies. This gives us the flexibility to adjust the design to the project specifics and the current state of the art to better serve the goal.
For our customers who want to stick to an Open Source and free version of Hadoop, we have prepared our own distribution built from the latest packages.
Ready to build a Machine Learning Platform?
Please fill out the form and we will come back to you as soon as possible to schedule a meeting to discuss your event processing needs.