5 questions you need to answer before starting a big data project

For project managers, development teams and whole organizations, taking the first step into the Big Data world can be a big challenge. You need to handle a large amount of data without over-engineering the system, build a performant solution without over-investing in infrastructure, and choose the right technology to process and store your data. These are just a few of the challenges you may face.

Over the last 7 years of working on distributed data platforms, I have seen many common mistakes made by organizations in various industries, such as spending a fortune on cluster infrastructure unnecessarily, starting big data projects without people with real experience, or building a ‘data lake’ with no use case. Please check my colleague’s post, How to avoid wasting money with Big Data technologies and get some ROI. Although it was written a few years ago, many of these mistakes are still being made.

In this post, I will show you that answering a few crucial questions before you start your first big data project may help you avoid failure.

1. What’s the expected value to be delivered by your use case(s)?

How large are the data sets that could be collected and stored? How could data sources be integrated? In most cases, technical questions like these are not the right ones to ask at the beginning. Instead, for your project, (...) it is always important to focus on the value it will provide to its users and other stakeholders. Building a platform with no valid use cases, in the hope of finding some in the future (a potentially valid strategy that only big corporations may be able to afford), frequently leads to underestimated investments that don’t pay off.

Big Data and distributed processing usually come with increased complexity and higher infrastructure costs, and experts are harder to find. That’s why defining proper use cases from a business value perspective is crucial. What is the final value the big data project is going to deliver? How would it improve business processes? Would it generate income? Are all data sources defined, or is an initial research phase required? These are very important questions to answer honestly before you start investing.

Make sure that all stakeholders involved understand the answers to the above questions. They need to know what the expected outcomes and project goals are, especially if cross-department coordination is needed.

2. What is the nature of your data?

Data is collected at different frequencies: daily snapshot files, monthly aggregates or events arriving in real time. The nature of the data source will determine the design of your system. That said, data sources that are real-time by nature (e.g. button clicks or song plays) can also be accessed less frequently (e.g. by downloading daily snapshots). Keep in mind that lower latency is usually harder to achieve, which may mean higher infrastructure and development costs.

Nevertheless, in reality, most data sources are streams, because data points appear over time, so it seems natural to collect and store them in real time. Even if your application shows only aggregated values now, streaming can be a worthwhile trade-off that keeps your architecture open to future real-time use cases.
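To make the point concrete, here is a minimal sketch, assuming a made-up list of song-play events (the event data and the function names `daily_counts` and `running_counts` are hypothetical, not taken from any real system). It shows that once raw events are kept, the same data can feed both a daily aggregate and a running, per-event view.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (ISO-8601 timestamp, song id).
EVENTS = [
    ("2021-02-16T23:59:10+00:00", "song-a"),
    ("2021-02-17T00:00:05+00:00", "song-a"),
    ("2021-02-17T08:30:00+00:00", "song-b"),
    ("2021-02-17T08:31:00+00:00", "song-a"),
]


def daily_counts(events):
    """Batch-style view: number of plays per (day, song)."""
    counts = Counter()
    for timestamp, song in events:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(day, song)] += 1
    return counts


def running_counts(events):
    """Streaming-style view: yield the updated play count after each event."""
    counts = Counter()
    for _, song in events:
        counts[song] += 1
        yield song, counts[song]


if __name__ == "__main__":
    print(dict(daily_counts(EVENTS)))
    for song, total in running_counts(EVENTS):
        print(song, total)
```

If only the daily aggregate had been stored, the running view could not be reconstructed later, which is the trade-off described above.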

How should you deliver your data? Focus on your use case and how data is presented to the end user, but also consider potential future extensions to your architecture. The value delivered to the end user should drive the definition of non-functional requirements.

3. What are the key factors for your big data project? 

A first look at the list of available Big Data tools, processing and ingestion frameworks, search engines, libraries etc. can give you a huge headache. Many of them are developed in the open-source community and are available for free. On the other hand, more and more companies compete with commercial products and enterprise support. Many tools are quite new on the market and lack maturity, good-quality documentation or reference implementations. Development resources may not be easy to acquire, as the technology is fresh and not many experienced people are available on the market.

Selecting key project factors may help you make a decision:

  • performance
  • maturity
  • support (community or commercial)
  • portability 
  • flexibility
  • licensing

Before you select concrete tools and technologies, you may want to select a strategy: open source or commercial. Working with open source may require more development work and building a knowledge base within the organization. It also gives more flexibility, but sometimes lacks maturity and support (for some open-source projects, vendor support is available, e.g. Spark, Kafka, Hadoop). By paying for commercial tool licences, you get the promise of more straightforward onboarding and high-quality support.

Decisions should always be made with reference to the selected key factors and the organization’s resources.
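One simple way to tie a decision to the key factors is a weighted scoring matrix. The sketch below is only an illustration: the weights, the 1-5 scores and the two candidate names are invented placeholders, and in practice they would come from your own requirements and evaluation.

```python
# Hypothetical weights (summing to 1.0) expressing how much each key factor matters.
WEIGHTS = {
    "performance": 0.25,
    "maturity": 0.20,
    "support": 0.20,
    "portability": 0.10,
    "flexibility": 0.15,
    "licensing": 0.10,
}

# Hypothetical 1-5 scores for two placeholder candidates.
CANDIDATES = {
    "open_source_tool": {
        "performance": 4, "maturity": 3, "support": 3,
        "portability": 5, "flexibility": 5, "licensing": 5,
    },
    "commercial_tool": {
        "performance": 4, "maturity": 5, "support": 5,
        "portability": 2, "flexibility": 3, "licensing": 2,
    },
}


def weighted_score(scores, weights):
    """Sum of score * weight over all key factors."""
    return sum(weights[factor] * scores[factor] for factor in weights)


def rank(candidates, weights):
    """Candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for name in rank(CANDIDATES, WEIGHTS):
        print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

The outcome depends entirely on the weights: shifting weight from flexibility to support, for example, can reverse the ranking, so the weights themselves should be agreed with stakeholders first.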

4. What is your data model?

Without experience in Big Data, people usually have an RDBMS in mind when thinking about databases. The relational data model is something everyone is familiar with. With large data volumes, it is usually necessary to leave that comfort zone and consider NoSQL solutions. It is important to keep in mind that nothing comes for free: you can achieve better performance, store more data and solve problems that are harder or even impossible to solve with an RDBMS, but you pay with more development work, harder maintenance and a less generic model. Different types of NoSQL databases are dedicated to solving specific ranges of problems. You have a choice of:

  • document-based databases (MongoDB, Couchbase, Elasticsearch)
  • graph databases (JanusGraph, ArangoDB)
  • columnar (wide-column) databases (HBase, Cassandra, Google Bigtable)
  • distributed OLAP solutions (Druid, ClickHouse)
  • key-value stores (Aerospike, Redis)

Additionally, the characteristics of different implementations within the same category may vary. That means two document databases might be suited to different problems (e.g. MongoDB vs Elasticsearch).

How do you select the right one? Again, focus on your use case and business needs: how does the data need to be served to the end user? What questions (queries) will you answer with your data? Once you have collected those facts, you will be able to model your data, decide on the level of denormalization etc. Your model will let you select the right data store. For instance, in online content-based systems (e.g. e-commerce), the majority of queries are single page or content retrievals. That content could be kept in denormalized form in a key-value store or in a document database (for better search capabilities). For reports or aggregated data, OLAP solutions (e.g. Druid) could apply, where you can perform ad-hoc analytical queries.
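The difference between the two access patterns can be sketched in a few lines. This is a toy illustration only: plain Python dictionaries stand in for a key-value store and for an analytical engine, and the product, category and order data are invented.

```python
from collections import defaultdict

# Hypothetical normalized source data, as it might sit in relational tables.
PRODUCTS = {
    1: {"name": "Laptop", "category_id": 10},
    2: {"name": "Mouse", "category_id": 20},
}
CATEGORIES = {10: "Computers", 20: "Accessories"}
ORDERS = [(1, 2), (2, 5), (1, 1)]  # (product_id, quantity)


def build_product_pages(products, categories):
    """Denormalize: one self-contained document per product, so that
    serving a product page is a single lookup by key with no joins."""
    return {
        f"product:{product_id}": {
            "name": product["name"],
            "category": categories[product["category_id"]],
        }
        for product_id, product in products.items()
    }


def units_sold_by_category(orders, products, categories):
    """Analytical query: scan all orders and aggregate by category,
    the kind of work an OLAP store is designed for."""
    totals = defaultdict(int)
    for product_id, quantity in orders:
        category = categories[products[product_id]["category_id"]]
        totals[category] += quantity
    return dict(totals)


if __name__ == "__main__":
    pages = build_product_pages(PRODUCTS, CATEGORIES)
    print(pages["product:1"])
    print(units_sold_by_category(ORDERS, PRODUCTS, CATEGORIES))
```

The first function answers "show me this product" cheaply but duplicates the category name in every document; the second answers "how much did we sell per category" and has to touch every order. Knowing which of these queries dominates is what drives the choice of data store.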

5. Where is your solution going to run?

Finally, the system you build requires infrastructure to run. Nowadays, you have a choice of highly developed cloud platforms such as Google Cloud Platform, Amazon Web Services, Microsoft Azure or Alibaba Cloud, to name a few. At the same time, running your own on-premise environment is still a valid strategy. It’s important to understand the differences and take the approach suitable for your project and organization. You need to consider different pricing models, as well as initial costs. On-premise, the hardware cost of a Hadoop cluster might be tremendous, while in the cloud you can start with a cost close to zero and pay as you go (usage-based pricing). Cloud providers offer dedicated, fully managed services with reduced maintenance costs. On the other hand, these lock you in to a specific vendor, which could be avoided with open-source technologies. Cloud and on-premise also differ in terms of security, deployments, upgrades and the way you control your system.
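A rough way to compare the two pricing models is to look at cumulative cost over time. The sketch below is deliberately simplistic and rests on stated assumptions: a one-off hardware purchase plus a flat monthly operating cost for on-premise, a flat monthly usage bill for the cloud, and no growth, discounts or staffing differences. All figures in the usage example are made up.

```python
def cumulative_on_prem(months, upfront, monthly_ops):
    """Total on-premise spend after `months`: hardware up front plus operations."""
    return upfront + monthly_ops * months


def cumulative_cloud(months, monthly_usage):
    """Total cloud spend after `months` with flat pay-as-you-go usage."""
    return monthly_usage * months


def break_even_month(upfront, monthly_ops, monthly_usage, horizon=120):
    """First month in which on-premise becomes cheaper in total than cloud,
    or None if that does not happen within `horizon` months."""
    for month in range(1, horizon + 1):
        on_prem = cumulative_on_prem(month, upfront, monthly_ops)
        cloud = cumulative_cloud(month, monthly_usage)
        if on_prem < cloud:
            return month
    return None


if __name__ == "__main__":
    # Hypothetical figures: 200,000 hardware, 5,000/month operations,
    # versus 15,000/month of cloud usage.
    print(break_even_month(200_000, 5_000, 15_000))
```

If the monthly cloud bill is lower than the on-premise operating cost, the function returns None, meaning the upfront investment is never recovered under these assumptions.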

How to start a big data project?

Start simple 

Keep your architecture simple and open to extension. Setting up a large infrastructure and installing lots of tools at the beginning is a common mistake. A complicated architecture, especially if you have just entered the Big Data area, may lead to unnecessary over-engineering and delivery delays. Select tools and frameworks that are good enough rather than the most performant or efficient. Focus on what meets your requirements, at a potentially lower cost.

Focus on the right data

Let the goal of the project drive your decisions, not the way it could be achieved. That will let you focus on the right data, the data that gives value to the business. It’s easy to be trapped by trendy buzzwords.

Check a few reference architectures

Even if your big data project is very innovative, there is quite a high chance that something similar has already been built. You should not necessarily copy anything as it is, but rather treat it as a reference architecture. Many companies are happy to publish their success stories with a good level of technical detail. As these are already released and working solutions, you have the advantage of knowing their findings and lessons learned, and you can even use them to improve on the original.

You can find much more on the tech blogs of data-driven companies such as Spotify, Allegro, ING, GetInData and more.

Find big data experts

Years of experience exploring technology and developing production data-driven systems can be priceless. The advice you can get from industry experts may change your perspective and put your project on track for success. Find and hire architects and data engineers with a broad range of experience in Big Data projects, so that their expertise covers every aspect of your system.

Deliver a first successful project fast

Complex systems, especially in the Big Data area, require huge effort. It may be very hard to find promoters for long-running and expensive projects. At GetInData we always try to isolate use cases where we can deliver fast. When a client sees good results delivered in a short time, they're more eager to invest in further projects. Please read our story with KCell, where relatively small incremental steps led to a bigger cooperation.

At GetInData you have access to 50+ distributed systems and cloud experts who have been working with big data systems from the very beginning, since Apache Hadoop was released. Do not hesitate to contact us; our team will be happy to discuss your first big data project.

17 February 2021
