
5 questions you need to answer before starting a big data project


For project managers, development teams and whole organizations, taking the first step into the Big Data world can be a big challenge. In most cases, it really is. How do you handle a large amount of data in your system without over-engineering it? How do you build a performant solution without over-investing in infrastructure? How do you choose the right technology to process and store your data? These are just a few of the questions you could imagine.

Over the last 7 years of working on distributed data platforms, I have seen many common mistakes made by organizations in various industries, like unnecessarily spending a fortune on cluster infrastructure, starting big data projects without people with real experience or building a ‘data lake’ with no use case. Please check my colleague’s post to find out How to avoid wasting money with Big Data technologies and get some ROI - while it was written a few years ago, many of the mistakes are still repeatedly made.

In this post, I will show you that answering a few crucial questions before starting your first big data project may help you avoid failure.

1. What’s the expected value to be delivered by your use case(s)?

How can large data sets be collected and stored? How can data sources be integrated? In most cases, technical questions are not the right ones to ask at the beginning. Instead, it is always important to focus on the value your project will provide to its users and other stakeholders. Building a platform with no valid use cases, in the hope of finding some in the future (a potentially valid strategy that big corporations may be able to afford), frequently leads to underestimated investments that don’t pay back.

Usually, Big Data and distributed processing come with increased complexity and infrastructure costs, and experts are harder to find. That’s why defining proper use cases from a business value perspective is crucial. What is the final value the big data project is going to deliver? How would it improve business processes? Would it generate income? Do we have all data sources defined, or is an initial research phase required? These are very important questions to answer honestly before starting your investment.

Make sure that all involved stakeholders understand the answers to the above questions. They need to know what the expected outcomes and project goals are, especially if cross-department coordination is needed.

2. What is the nature of your data?

Data is collected at different frequencies. These could be daily snapshot files, monthly aggregates or events arriving in real time. The nature of the data source will determine the design of your system. That said, data sources with a real-time nature (e.g. button clicks, song plays) can also be consumed less frequently (e.g. by downloading daily snapshots). It’s important to keep in mind that lower latency is usually harder to achieve, which may translate into higher infrastructure and development costs.

In reality, though, most data sources are streams, as data points appear over time, so it seems natural to collect and store them in a real-time manner. Even if your application only shows aggregated values now, adopting streaming can be a reasonable trade-off that keeps your architecture open to future real-time use cases.

How should you deliver your data? Focus on your use case and how data is presented to the end user, but also consider potential future extensions in your architecture. The value delivered to the end user should drive the definition of non-functional requirements.
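To make the batch-versus-streaming trade-off concrete, here is a minimal, framework-free Python sketch (the song-play events and all names are made up for illustration). It shows that the same daily aggregate can be produced either by processing a full snapshot in one batch or by updating a counter incrementally per event - the result is identical; what differs is the latency at which it becomes available and the operational complexity of keeping a stream pipeline running:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (ISO timestamp, song_id) play events.
events = [
    ("2021-02-01T09:15:00", "song-1"),
    ("2021-02-01T18:30:00", "song-2"),
    ("2021-02-02T08:05:00", "song-1"),
]

def batch_daily_counts(events):
    """Batch style: process a whole daily snapshot at once.
    Results are only available after the snapshot is complete."""
    counts = defaultdict(int)
    for ts, song_id in events:
        day = datetime.fromisoformat(ts).date().isoformat()
        counts[(day, song_id)] += 1
    return dict(counts)

class StreamingCounter:
    """Streaming style: update the aggregate on every incoming event,
    so the current value is always available with low latency."""
    def __init__(self):
        self.counts = defaultdict(int)

    def on_event(self, ts, song_id):
        day = datetime.fromisoformat(ts).date().isoformat()
        self.counts[(day, song_id)] += 1

streaming = StreamingCounter()
for ts, song in events:
    streaming.on_event(ts, song)

# Both approaches converge to the same aggregate.
assert batch_daily_counts(events) == dict(streaming.counts)
```

In a real system, the streaming side would typically sit behind a message broker and a processing framework rather than a plain loop, which is exactly where the extra infrastructure and development cost comes from.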

3. What are the key factors for your big data project? 

Having a first look at the list of available Big Data tools, processing and ingestion frameworks, search engines, libraries etc. can give you a huge headache. Many of them are developed in the open-source community and are available for free. On the other hand, more and more companies compete with their commercial products and enterprise support. Many tools are quite new on the market, lacking maturity, good-quality documentation or reference implementations. Development resources may not be easy to acquire, as the technology is fresh and not many experienced people are available on the market.

Selecting key project factors may help you make a decision:

  • performance
  • maturity
  • support (community or commercial)
  • portability 
  • flexibility
  • licensing

Before you select concrete tools and technologies, you may first choose a strategy: open source or commercial. Working with open source may require more development work and building a knowledge base in the organization. It also gives more flexibility, but sometimes lacks maturity and support (although for some open-source projects, vendor support is available, e.g. Spark, Kafka, Hadoop). By paying for commercial tool licences, you get the promise of more straightforward onboarding and high-quality support.

Decisions should always be made with reference to the selected key factors and your organization’s resources.


4. What is your data model?

With no experience in Big Data, people thinking about databases usually have an RDBMS in mind. The relational data model is something everyone is familiar with. With large data volumes, it’s usually necessary to leave that comfort zone and consider NoSQL solutions. It’s important to keep in mind that nothing comes for free - you can achieve better performance, store more data and solve problems that are harder or even impossible to solve with an RDBMS, but you pay with more development effort, harder maintenance and a less generic model. Different types of NoSQL databases are dedicated to solving specific ranges of problems. You have a choice of:

  • document-based databases (MongoDB, Couchbase, ElasticSearch)
  • graph databases (JanusGraph, ArangoDB)
  • columnar databases (HBase, Cassandra, Google Bigtable)
  • distributed OLAP solutions (Druid, ClickHouse)
  • key-value stores (AeroSpike, Redis)

Additionally, the characteristics of different implementations from the same category may vary. That means two different document databases might be suited to solving different problems (e.g. MongoDB vs ElasticSearch).

How do you select the right one? Again, focus on your use case and business needs - how does the data need to be served to the end user? What are the questions (queries) you will answer with your data? Once you collect those facts, you will be able to model your data, decide on the level of denormalization etc. Your model will let you select the right data store. For instance, in online content-based systems (e.g. e-commerce), the majority of queries are single page/content retrievals. Such data could be kept in a key-value store, or in a document database (for better search capabilities), in denormalized form. For reports or aggregated data, OLAP solutions (e.g. Druid) could apply, where you can perform ad-hoc analytical queries.
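To illustrate the denormalization trade-off in the e-commerce case, here is a minimal Python sketch (all entities and data are hypothetical). The normalized, RDBMS-style model needs several lookups (joins) to render one product page, while the denormalized document - as you might store it in a key-value or document store - serves the whole page in a single read, at the cost of duplicated data that must be kept in sync on writes:

```python
# Normalized, RDBMS-style model: three "tables", joins at read time.
products = {"p-1": {"name": "Laptop", "category_id": "c-1"}}
categories = {"c-1": {"name": "Electronics"}}
reviews = [{"product_id": "p-1", "rating": 5, "text": "Great!"}]

def product_page_normalized(product_id):
    """Assemble a product page by joining three collections."""
    product = products[product_id]
    return {
        "name": product["name"],
        "category": categories[product["category_id"]]["name"],
        "reviews": [
            {"rating": r["rating"], "text": r["text"]}
            for r in reviews
            if r["product_id"] == product_id
        ],
    }

# Denormalized document: one key, one read, no joins - but the category
# name and reviews are duplicated into the document and must be updated
# there whenever the source data changes.
product_documents = {
    "p-1": {
        "name": "Laptop",
        "category": "Electronics",
        "reviews": [{"rating": 5, "text": "Great!"}],
    }
}

def product_page_denormalized(product_id):
    return product_documents[product_id]

# Both paths render the same page.
assert product_page_normalized("p-1") == product_page_denormalized("p-1")
```

The query pattern (single-page retrieval vs ad-hoc analytics) is what decides how far down this denormalization path it makes sense to go.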

5. Where is your solution going to run?

Finally, the system you build requires infrastructure to run. Nowadays, you have a choice of highly developed cloud platforms such as Google Cloud Platform, Amazon Web Services, Microsoft Azure or Alibaba Cloud, to name a few. At the same time, running your own on-premise environment is still a valid strategy. It’s important to understand the differences and take the approach suitable for your project and organization. You need to consider different pricing models, as well as initial costs. On-premise, the hardware cost of a Hadoop cluster might be tremendous, while in the cloud you can start with costs close to zero and pay as you go (usage-based pricing). Cloud providers offer dedicated, fully managed services with reduced maintenance costs. On the other hand, that locks you in with a concrete vendor, which can be avoided with open-source technologies. Cloud and on-premise also vary in terms of security, deployments, upgrades and the way you control your system.


How to start a big data project?

Start simple 

Keep your architecture simple and open to extension. Setting up large infrastructure and installing lots of tools at the beginning is a common mistake. A complicated architecture, especially if you have just entered the Big Data area, may lead to unnecessary over-engineering and delivery delays. Select tools and frameworks that are good enough rather than the most performant or efficient. Focus on what meets your requirements, at a potentially lower cost.

Focus on the right data

Drive your decisions by the goal of the project, not by the way it could be achieved. That will let you focus on the right data, which gives value to the business. It’s easy to be trapped by trendy buzzwords.

Check a few reference architectures

Even if your big data project is very innovative, there is quite a high chance that something similar has already been built. You should not necessarily copy anything as-is, but rather treat these as reference architectures. Many companies are happy to publish their success stories with a good level of technical detail. As these are already released and working solutions, you have the advantage of knowing their findings and learnings, and you can even use that to improve.

You can definitely find much more on the tech blogs of data-driven companies such as Spotify, Allegro, ING, GetInData and more.

Find big data experts

Years of experience exploring technology and developing production data-driven systems may be priceless. The advice you can get from industry experts may change your perspective and put your project on the track to success. Find, recruit and hire architects and data engineers with a range of experience in Big Data projects, so that their expertise covers every aspect of your system.

Deliver a first successful project fast

Complex systems, especially in the Big Data area, require huge effort. It may be very hard to find promoters for long-running and expensive projects. At GetInData we always try to isolate use cases where we can deliver fast. When a client is able to see good results delivered in a short time, they’re more eager to invest in further projects. See our story with KCell, where relatively small incremental steps grew into a bigger cooperation.

At GetInData you have access to 50+ distributed systems and cloud experts who have been working with big data systems since Apache Hadoop was first released. Do not hesitate to contact us; our team will be happy to discuss your first big data project.

17 February 2021
