
5 questions you need to answer before starting a big data project


For project managers, development teams and whole organizations, taking the first step into the Big Data world can be a big challenge - and in most cases, it really is. How do you handle a large amount of data in your system without over-engineering it? How do you build a performant solution without over-investing in infrastructure? How do you choose the right technology to process and store your data? These are just a few of the challenges you could imagine.

Over the last 7 years working on distributed data platforms, I have seen many common mistakes made by organizations across various industries: unnecessarily spending a fortune on cluster infrastructure, starting big data projects without people with real experience, or building a ‘data lake’ with no use case. Please check my colleague’s post, How to avoid wasting money with Big Data technologies and get some ROI - while it was written a few years ago, many of these mistakes are still being made.

In this post, I will show how answering a few crucial questions before you start your first big data project may help you avoid failure.

1. What’s the expected value to be delivered by your use case(s)?

How large a data set can be collected and stored? How can data sources be integrated? In most cases, technical questions are not the right ones to ask at the beginning. Instead, it is always important to focus on the value your project will provide to its users and other stakeholders. Building a platform with no valid use cases, in the hope of finding some in the future (a strategy that big corporations may be able to afford), frequently leads to underestimated investments that never pay back.

Big Data and distributed processing usually come with increased complexity and infrastructure costs, and experts are harder to find. That’s why defining proper use cases from a business-value perspective is crucial. What is the final value the big data project is going to deliver? How will it improve business processes? Will it generate income? Do we have all data sources defined, or is an initial research phase required? These are very important questions to answer honestly before starting your investment.

Make sure that all involved stakeholders understand the answers to the above questions. They need to know what the expected outcomes and project goals are, especially if cross-department coordination is needed.

2. What is the nature of your data?

Data is collected at different frequencies. That could be daily snapshot files, monthly aggregates or events arriving in real time. The nature of the data source will determine the design of your system. That said, data sources that are real-time in nature (e.g. button clicks, song plays) can also be accessed in a less frequent manner (e.g. by downloading daily snapshots). It’s important to keep in mind that lower latency is usually harder to achieve, and that may mean higher infrastructure and development costs.

Nevertheless, in reality, most data sources are streams, as data points appear over time, so it seems natural to collect and store them in a real-time manner. Even if your application shows only aggregated values now, streaming can be a worthwhile trade-off that keeps your architecture open to future real-time use cases.
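To make the batch-vs-streaming distinction concrete, here is a minimal sketch in plain Python (the event data and function names are purely illustrative, not tied to any specific framework): the same "song play" events can be aggregated from a daily snapshot in one pass, or incrementally as each event arrives.

```python
from collections import Counter

# Hypothetical event data - illustrative only.
events = [
    {"user": "a", "song": "s1"},
    {"user": "b", "song": "s1"},
    {"user": "a", "song": "s2"},
]

def daily_play_counts(snapshot):
    """Batch view: process a full daily snapshot in one go."""
    return Counter(e["song"] for e in snapshot)

def streaming_play_counts(stream):
    """Streaming view: update the aggregate as each event arrives."""
    counts = Counter()
    for event in stream:
        counts[event["song"]] += 1
        yield dict(counts)  # current aggregate after every event

batch_result = daily_play_counts(events)
final_stream_result = list(streaming_play_counts(events))[-1]
# Both views converge on the same totals; the streaming view also
# exposes every intermediate state, keeping real-time use cases open.
```

The point of the sketch is the trade-off discussed above: the batch function is simpler, but only the streaming one can serve a low-latency dashboard later without re-architecting ingestion.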

How should you deliver your data? Focus on your use case and how data is presented to the end user, but also consider potential future extensions in your architecture. The value delivered to the end user should drive the definition of non-functional requirements.

3. What are the key factors for your big data project? 

A first look at the list of available Big Data tools, processing and ingestion frameworks, search engines, libraries etc. can give you a huge headache. Many of them were developed in the open-source community and are available for free. On the other hand, more and more companies compete with commercial products and enterprise support. Many tools are quite new to the market, lacking maturity, good-quality documentation or reference implementations. Development resources may not be easy to acquire, as the technology is fresh and few experienced people are available on the market.

Selecting key project factors may help you make a decision:

  • performance
  • maturity
  • support (community or commercial)
  • portability 
  • flexibility
  • licensing

Before you select concrete tools and technologies, you may first select a strategy: open source or commercial. Working with open source may require more development work and building a knowledge base within the organization. It also gives more flexibility, but sometimes lacks maturity and support (although vendor support is available for some open-source projects, e.g. Spark, Kafka, Hadoop). By paying for commercial tool licenses, you get the promise of more straightforward onboarding and high-quality support.

Decisions should always be made with reference to the selected key factors and the organization’s resources.


4. What is your data model?

People with no Big Data experience usually have an RDBMS in mind when they think about databases. The relational data model is something everyone is familiar with. With large data volumes, it’s usually necessary to leave that comfort zone and consider NoSQL solutions. It’s important to keep in mind that nothing comes for free: you can achieve better performance, store more data and solve problems that are hard or even impossible to solve with an RDBMS, but you pay with more development, harder maintenance and a less generic model. Different types of NoSQL databases are dedicated to solving specific ranges of problems. You have a choice of:

  • document-based databases (MongoDB, Couchbase, ElasticSearch)
  • graph databases (JanusGraph, ArangoDB)
  • columnar databases (HBase, Cassandra, Google Bigtable)
  • distributed OLAP solutions (Druid, ClickHouse)
  • key-value stores (AeroSpike, Redis)

Additionally, the characteristics of different implementations within the same category may vary. That means two document databases might be suited to solving quite different problems (e.g. MongoDB vs ElasticSearch).

How do you select the right one? Again, focus on your use case and business needs - how does the data need to be served to the end user? What are the questions (queries) you will answer with your data? Once you collect those facts, you will be able to model your data, decide on the level of denormalization etc. Your model will let you select the right data store. For instance, in online content-based systems (e.g. e-commerce), the majority of queries are single page/content retrievals. Such data can be kept in a key-value store, or in a document database (for better search capabilities), in denormalized form. For reports or aggregated data, OLAP solutions (e.g. Druid) may apply, letting you perform ad-hoc analytical queries.
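The e-commerce example above can be sketched in a few lines of Python. This is purely illustrative: a plain dict stands in for a key-value store, and the key scheme and field names are hypothetical, not any specific product’s API. The point is that a denormalized document serves a whole product page from a single-key read, with no joins.

```python
# A plain dict standing in for a key-value store (illustrative only).
kv_store = {}

def put_product(product_id, doc):
    kv_store[f"product:{product_id}"] = doc

def get_product(product_id):
    # Single-key lookup: one round trip, no joins at read time.
    return kv_store[f"product:{product_id}"]

# The document is denormalized: price and reviews are embedded in the
# product document instead of living in separate, joined tables.
put_product("42", {
    "name": "Espresso machine",
    "price": {"amount": 199.0, "currency": "EUR"},
    "reviews": [{"user": "a", "stars": 5}],
})

page = get_product("42")  # everything needed to render the page
```

The cost of this design is exactly the trade-off described earlier: writes must keep the embedded copies up to date, and the model is tailored to one query pattern rather than being generic.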

5. Where is your solution going to run?

Finally, the system you build requires infrastructure to run on. Nowadays, you have a choice of highly developed cloud platforms such as Google Cloud Platform, Amazon Web Services, Microsoft Azure or Alibaba Cloud, to name a few. At the same time, running your own on-premise environment is still a valid strategy. It’s important to understand the differences and take the approach suitable for your project and organization. You need to consider different pricing models, as well as initial costs. On-premise, the hardware cost for a Hadoop cluster might be tremendous, while in the cloud you can start with costs close to zero and pay as you go (usage-based pricing). Cloud providers offer dedicated, fully managed services with reduced maintenance costs. On the other hand, that locks you in with a concrete vendor - something that can be avoided with open-source technologies. Cloud and on-premise also differ in terms of security, deployments, upgrades and the way you control your system.


How to start a big data project?

Start simple 

Keep your architecture simple and open to extension. Setting up large infrastructure and installing lots of tools at the beginning is a common mistake. A complicated architecture, especially if you have just entered the Big Data area, may lead to unnecessary over-engineering and delivery delays. Select tools and frameworks that are good enough rather than the most performant. Focus on what meets your requirements, at a potentially lower cost.

Focus on the right data

Drive your decisions by the goal of the project, not by the way it could be achieved. That will let you focus on the right data - the data that gives value to the business. It’s easy to be trapped by trendy buzzwords.

Check a few reference architectures

Even if your big data project is very innovative, there is a high chance that something similar has already been built. You should not necessarily copy anything as-is, but rather treat such solutions as reference architectures. Many companies are happy to publish their success stories with a good level of technical detail. As these are released and working solutions, you have the advantage of knowing their findings and learnings, and you can even use them to improve on the original.

You can definitely find much more on the tech blogs of data-driven companies such as Spotify, Allegro, ING, GetInData and others.

Find big data experts

Years of experience exploring technology and developing production data-driven systems can be priceless. The advice you can get from industry experts may change your perspective and put your project on track for success. Find, recruit and hire architects and data engineers with a broad range of experience in Big Data projects, so that their expertise covers every aspect of your system.

Deliver a successful first project fast

Complex systems, especially in the Big Data area, require huge effort. It may be very hard to find sponsors for long-running and expensive projects. At GetInData we always try to isolate use cases where we can deliver fast. When a client is able to see tangible results delivered in a short time, they’re more eager to invest in further projects. Please see our story with KCell, where relatively small incremental steps grew into a bigger cooperation.

At GetInData you have access to 50+ distributed systems and cloud experts who have been working with big data systems from the very beginning, since Apache Hadoop was released. Do not hesitate to contact us - our team will be happy to discuss your first big data project.

17 February 2021
