
5 questions you need to answer before starting a big data project


For project managers, development teams and whole organizations, taking the first step into the Big Data world can be a big challenge - and in most cases, it really is. How do you handle a large amount of data in your system without over-engineering it? How do you build a performant solution without over-investing in infrastructure? How do you choose the right technology to process and store your data? These are just a few of the challenges you could imagine.

Over the last 7 years working on distributed data platforms, I have seen many common mistakes made by organizations across various industries: unnecessarily spending a fortune on cluster infrastructure, starting big data projects without people with real experience, or building a ‘data lake’ with no use case. Please check my colleague’s post, How to avoid wasting money with Big Data technologies and get some ROI - while it was written a few years ago, many of these mistakes are still being made.

In this post, I will show how answering a few crucial questions before you start your first big data project may help you avoid failure.

1. What’s the expected value to be delivered by your use case(s)?

How large a data set can be collected and stored? How can data sources be integrated? In most cases, technical questions are not the right ones to ask at the beginning. Instead, it is always important to focus on the value your project will provide to its users and other stakeholders. Building a platform with no valid use cases, in the hope of finding some in the future (a strategy that big corporations may be able to afford), frequently leads to underestimated investments that never pay back.

Big Data and distributed processing usually come with increased complexity and infrastructure costs, and experts are harder to find. That’s why defining proper use cases from a business-value perspective is crucial. What is the final value the big data project is going to deliver? How will it improve business processes? Will it generate income? Do we have all data sources defined, or is an initial research phase required? These are very important questions to answer honestly before starting your investment.

Make sure that all involved stakeholders understand the answers to the above questions. They need to know what the expected outcomes and project goals are, especially if cross-department coordination is needed.

2. What is the nature of your data?

Data is collected at different frequencies. That could be daily snapshot files, monthly aggregates or events arriving in real time. The nature of the data source will determine the design of your system. That said, data sources that are real-time in nature (e.g. button clicks, song plays) can also be accessed in a less frequent manner (e.g. by downloading daily snapshots). It’s important to keep in mind that lower latency is usually harder to achieve, and that may mean higher infrastructure and development costs.

Nevertheless, in reality, most data sources are streams, as data points appear over time, so it seems natural to collect and store them in a real-time manner. Even if your application shows only aggregated values now, streaming can be a worthwhile trade-off that keeps your architecture open to future real-time use cases.
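To make the batch-vs-streaming distinction concrete, here is a minimal sketch in plain Python (the event data and function names are purely illustrative, not tied to any specific framework): the same "song play" events can be aggregated from a daily snapshot in one pass, or incrementally as each event arrives.

```python
from collections import Counter

# Hypothetical event data - illustrative only.
events = [
    {"user": "a", "song": "s1"},
    {"user": "b", "song": "s1"},
    {"user": "a", "song": "s2"},
]

def daily_play_counts(snapshot):
    """Batch view: process a full daily snapshot in one go."""
    return Counter(e["song"] for e in snapshot)

def streaming_play_counts(stream):
    """Streaming view: update the aggregate as each event arrives."""
    counts = Counter()
    for event in stream:
        counts[event["song"]] += 1
        yield dict(counts)  # current aggregate after every event

batch_result = daily_play_counts(events)
final_stream_result = list(streaming_play_counts(events))[-1]
# Both views converge on the same totals; the streaming view also
# exposes every intermediate state, keeping real-time use cases open.
```

The point of the sketch is the trade-off discussed above: the batch function is simpler, but only the streaming one can serve a low-latency dashboard later without re-architecting ingestion.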

How should you deliver your data? Focus on your use case and how data is presented to the end user, but also consider potential future extensions in your architecture. The value delivered to the end user should drive the definition of non-functional requirements.

3. What are the key factors for your big data project? 

A first look at the list of available Big Data tools, processing and ingestion frameworks, search engines, libraries etc. can give you a huge headache. Many of them were developed in the open-source community and are available for free. On the other hand, more and more companies compete with commercial products and enterprise support. Many tools are quite new to the market, lacking maturity, good-quality documentation or reference implementations. Development resources may not be easy to acquire, as the technology is fresh and few experienced people are available on the market.

Selecting key project factors may help you make a decision:

  • performance
  • maturity
  • support (community or commercial)
  • portability 
  • flexibility
  • licensing

Before you select concrete tools and technologies, you may first select a strategy: open source or commercial. Working with open source may require more development work and building a knowledge base within the organization. It also gives more flexibility, but sometimes lacks maturity and support (although vendor support is available for some open-source projects, e.g. Spark, Kafka, Hadoop). By paying for commercial tool licenses, you get the promise of more straightforward onboarding and high-quality support.

Decisions should always be made with reference to the selected key factors and the organization’s resources.


4. What is your data model?

People with no Big Data experience usually have an RDBMS in mind when they think about databases. The relational data model is something everyone is familiar with. With large data volumes, it’s usually necessary to leave that comfort zone and consider NoSQL solutions. It’s important to keep in mind that nothing comes for free: you can achieve better performance, store more data and solve problems that are hard or even impossible to solve with an RDBMS, but you pay with more development, harder maintenance and a less generic model. Different types of NoSQL databases are dedicated to solving specific ranges of problems. You have a choice of:

  • document-based databases (MongoDB, Couchbase, ElasticSearch)
  • graph databases (JanusGraph, ArangoDB)
  • columnar databases (HBase, Cassandra, Google Bigtable)
  • distributed OLAP solutions (Druid, ClickHouse)
  • key-value stores (AeroSpike, Redis)

Additionally, the characteristics of different implementations within the same category may vary. That means two document databases might be suited to solving quite different problems (e.g. MongoDB vs ElasticSearch).

How do you select the right one? Again, focus on your use case and business needs - how does the data need to be served to the end user? What are the questions (queries) you will answer with your data? Once you collect those facts, you will be able to model your data, decide on the level of denormalization etc. Your model will let you select the right data store. For instance, in online content-based systems (e.g. e-commerce), the majority of queries are single page/content retrievals. Such data can be kept in a key-value store, or in a document database (for better search capabilities), in denormalized form. For reports or aggregated data, OLAP solutions (e.g. Druid) may apply, letting you perform ad-hoc analytical queries.
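The e-commerce example above can be sketched in a few lines of Python. This is purely illustrative: a plain dict stands in for a key-value store, and the key scheme and field names are hypothetical, not any specific product’s API. The point is that a denormalized document serves a whole product page from a single-key read, with no joins.

```python
# A plain dict standing in for a key-value store (illustrative only).
kv_store = {}

def put_product(product_id, doc):
    kv_store[f"product:{product_id}"] = doc

def get_product(product_id):
    # Single-key lookup: one round trip, no joins at read time.
    return kv_store[f"product:{product_id}"]

# The document is denormalized: price and reviews are embedded in the
# product document instead of living in separate, joined tables.
put_product("42", {
    "name": "Espresso machine",
    "price": {"amount": 199.0, "currency": "EUR"},
    "reviews": [{"user": "a", "stars": 5}],
})

page = get_product("42")  # everything needed to render the page
```

The cost of this design is exactly the trade-off described earlier: writes must keep the embedded copies up to date, and the model is tailored to one query pattern rather than being generic.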

5. Where is your solution going to run?

Finally, the system you build requires infrastructure to run on. Nowadays, you have a choice of highly developed cloud platforms such as Google Cloud Platform, Amazon Web Services, Microsoft Azure or Alibaba Cloud, to name a few. At the same time, running your own on-premise environment is still a valid strategy. It’s important to understand the differences and take the approach suitable for your project and organization. You need to consider different pricing models, as well as initial costs. On-premise, the hardware cost for a Hadoop cluster might be tremendous, while in the cloud you can start with costs close to zero and pay as you go (usage-based pricing). Cloud providers offer dedicated, fully managed services with reduced maintenance costs. On the other hand, that locks you in with a concrete vendor - something that can be avoided with open-source technologies. Cloud and on-premise also differ in terms of security, deployments, upgrades and the way you control your system.


How to start a big data project?

Start simple 

Keep your architecture simple and open to extension. Setting up large infrastructure and installing lots of tools at the beginning is a common mistake. A complicated architecture, especially if you have just entered the Big Data area, may lead to unnecessary over-engineering and delivery delays. Select tools and frameworks that are good enough rather than the most performant. Focus on what meets your requirements, at a potentially lower cost.

Focus on the right data

Drive your decisions by the goal of the project, not by the way it could be achieved. That will let you focus on the right data - the data that gives value to the business. It’s easy to be trapped by trendy buzzwords.

Check a few reference architectures

Even if your big data project is very innovative, there is a high chance that something similar has already been built. You should not necessarily copy anything as-is, but rather treat such solutions as reference architectures. Many companies are happy to publish their success stories with a good level of technical detail. As these are released and working solutions, you have the advantage of knowing their findings and learnings, and you can even use them to improve on the original.

You can definitely find much more on the tech blogs of data-driven companies such as Spotify, Allegro, ING, GetInData and others.

Find big data experts

Years of experience exploring technology and developing production data-driven systems can be priceless. The advice you can get from industry experts may change your perspective and put your project on track for success. Find, recruit and hire architects and data engineers with a broad range of experience in Big Data projects, so that their expertise covers every aspect of your system.

Deliver a successful first project fast

Complex systems, especially in the Big Data area, require huge effort. It may be very hard to find sponsors for long-running and expensive projects. At GetInData we always try to isolate use cases where we can deliver fast. When a client is able to see tangible results delivered in a short time, they’re more eager to invest in further projects. Please see our story with KCell, where relatively small incremental steps grew into a bigger cooperation.

At GetInData you have access to 50+ distributed systems and cloud experts who have been working with big data systems from the very beginning, since Apache Hadoop was released. Do not hesitate to contact us - our team will be happy to discuss your first big data project.

17 February 2021
