Power of Big Data: Marketing
In the "Power of Big Data" series, I will talk about the possibilities that Big Data solutions give to individual business sectors. It should be noted…
Read moreCICD can be also described as the methodology focused on automating tasks related to running tests, build phases, deploying to development or production and also carrying out some A/B tests in a fully automated fashion. It became a must in the rapidly changing world in which we need to deploy changes frequently. The old-fashioned maintenance means would be tough, especially for a complex data streaming platform, and CICD tools are the only way to make this manageable.
CI regards the automation process for developers while the CD is about delivering new releases to the target environment
Everything started in 2006 when Martin Fowler wrote an article about the need of having Continuous Integration to reduce the number of bugs and detecting them quickly. The first two important players were CruiseControl (released in 2001) and Hudson CI (later Jenkins) released in 2005. 2008 was the year when Hudson became more and more important, in 2010 Sun Microsystems, in which Hudson’s author, Kohsuke Kawaguchi worked, was purchased by Oracle which wanted to commercialize the open source Hudson. Jenkins has been with us in its open source form since 2011. During this time there were a lot of other competitors but we can chose what we felt was best for us.
Currently, we can see a trend causing more and more developers to learn how to deploy their applications, run tests using GitOps and follow the best practices. At the beginning of the CICD store, setting up the environment, writing scripts or deployment files used to be be a bit complex, but now it has become pretty easy and GitLab CI, the latest version of Jenkins or GitHub Action are the best examples of tools used to achieve this. Due to the popularity of DevOps principles (which, in fact, contain a lot of definitions related to Ops phrases), CICD tools must become easier to use and faster in daily use.
Jenkins needs to be installed on the dedicated infrastructure, writing new pipelines in, for example, Groovywhich in fact can be quite complex to manage. . Jenkins is one of the authors used to implement CICD, but it’s been a player since the first generation of CICD tools, talking about the first CICD phase.
The next ones are GitLab CI, Circle CI and Travis CI,which deliver Pipeline as Code. It is required to write some simple scripts but thanks to managed runners, building CICD pipelines doesn’t take too much time and they are great examples of progress in this market.
The last generation is about building pipelines from prepared blocks the same way as in GitHub Actions. Here we have modules managed by the community (that can be also a disadvantage for some developers) which we can easily mix with each other and build a CICD pipeline even faster and in a simpler way. We can predict that the next step is topopularize more drag&drop solutions (with Web UI or in a similar way to the current GitHub Actions).
When talking about a real-time data streaming platform processing millions of events per second, it is a must to have a smoothly working CICD platform.
Managing rules are a must in the case of CICD tools and fortunately all of them support it. Thismeans that we can set up the following rule: when merging from staging to prod, run tests, build app or build model, run integration tests and deploy to production.
For some use cases, it is important to have the chance to install the CICD tool in the “local” infrastructure within on-premise, or in the public cloud. Here Jenkins and GitLab (with its CI) exhibit extensive capabilities. Even during deployment and when adding all components around each service, you can discover the difference between the old Jenkins and much newer, robust GitLab. Here we also need to mention ease of deployment. In the case of GitLab CI and GitHub Actions, we need to create a repo and start writing the CICD pipeline by creating the files (.gitab-ci.yml or YAML within .github/workflows, for example) while in the case of Jenkins we would need to set up the whole platform which would obviously be quite time consuming.
When talking about mixed deployment (cloud repository with own runners used for running the CICD pipeline), we have multiple configurations for GitLab CI (we can use bare-metal server, Virtual Machine, Docker or Kubernetes) and a bit more limited options for GitHub Actions. Why is it important to mention this ? First of all, it may be more cost-effective in comparison to buying more minutes for managed runners. Secondly, it’s more secure to use its own infrastructure to run tests, build and push images with applied network security groups directly to the runners.
The next thing is the size of the Operational team that will handle CICD pipelines. It’s also worth checking how we can integrate the CICD tool into our target environment - tools like Flux CD are useful nowadays when we need to handle managing multiple platforms.
In the case of making complex pipelines, we should also check how we can manage all secrets, variables, and tokens which are required within our pipelines.
When talking about CICD tools, it’s worth mentioning about the CICD pipelines themselves.
The first is about creating separate stages of the pipeline. Imagine we have a Spark application written in Scala, that we want to deploy to our Kubernetes cluster and we have three environments: development, QA and production.
The second one is about automating running unit tests or validation tests (in case of YAML scripts, for example).
The third one is about the building process, where we can store artifacts from the job and how we want to build it (once per environment or once for all environments?).
The fourth is focused on deployment to development, staging, QA and the different environments for testing the code and verifying if it works as expected.
Deployment to the production environment comes as the fifth component and here it’s important to define some A/B testing scenarios, running Flink jobs in the incubation mode (and here verify results once again), performing blue-green deployment and making it fully automated or not. In some cases, it makes sense to only use manual triggers to update the release in the production environment.
The last part is about managing all secrets used within the pipeline.
Jenkins is the oldest one and it’s not worth using if we want to use something modern and we do not have a technological debt with Jenkins used in all pipelines in the company.
GitLab with its CI is a reasonable solution when we want to have a tool that's easy to use with a lot of advanced features and great integration with all that public cloud provides and also on-premises.
GitHub Actions is the right choice if you don’t have too much time to spend on writing and defining CICD pipelines. You want to have a complex CICD pipeline which can be written in a few moments. Here GitHub Actions appears to be the hero, especially for some open source projects hosted there.
Modern Data Streaming and/or AI platforms contain multiple microservices and delivering new releases in a smooth, efficient way requires using the right CICD tools and well built CICD pipelines. Based on our experience in GetInData projects, we can highly recommend using GitLab CI as the flexible solution for on-premise and SaaS infrastructure.
In the "Power of Big Data" series, I will talk about the possibilities that Big Data solutions give to individual business sectors. It should be noted…
Read moreData processing in real-time has become crucial for businesses, and Apache Flink, with its powerful stream processing capabilities, is at the…
Read moreFlink is an open-source stream processing framework that supports both batch processing and data streaming programs. Streaming happens as data flows…
Read moreIt has been almost a month since the 9th edition of the Big Data Technology Warsaw Summit. We were thrilled to have the opportunity to organize an…
Read moreOne of the core features of an MLOps platform is the capability of tracking and recording experiments, which can then be shared and compared. It also…
Read moreFounded by former Spotify data engineers in 2014, GetInData consists of a team of experienced and passionate Big Data veterans with proven track of…
Read moreTogether, we will select the best Big Data solutions for your organization and build a project that will have a real impact on your organization.
What did you find most impressive about GetInData?