NiFi Scripted Components - the missing link between scripts and fully custom stuff
Custom components
As we probably know, the biggest strength of Apache NiFi is the large number of ready-to-use components. There are, of course, situations where we need something custom that we can’t achieve with the available processors, or doing so would make our flow unreadable, messy, ugly, the kind that you wouldn’t like to introduce to your parents. If you have any experience with NiFi, then you have probably found some ways to handle this particular problem (some of them are presented in the article here). The solution depends largely on the complexity of the processing. Available external components are also an important factor, but today we will focus on the internal features of NiFi.
If it’s something simple that you could do with one or two bash commands, then the ExecuteStreamCommand processor will do the rest of the job for you. It’s very simple, it usually only requires one line of code and you’re good to go. If that’s not enough, it’s time to get out the big guns. ExecuteGroovyScript and ExecuteScript both provide a way to insert code directly into the processor and make it do whatever we want it to. It’s almost the perfect solution… Almost.
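As an illustration (a toy example of our own, not from any particular flow), the entire "script" behind an ExecuteStreamCommand that upper-cases the flowfile content can be a single shell command:

```shell
# ExecuteStreamCommand feeds the flowfile content to the command's stdin
# and replaces it with whatever the command writes to stdout.
printf 'hello nifi' | tr '[:lower:]' '[:upper:]'
# → HELLO NIFI
```

In the processor you would put `tr` into the Command Path property and the two character classes into Command Arguments; the piped `printf` above only simulates the incoming flowfile content.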
So what’s wrong with the scripts?
Even though script executors are great tools, they do have some downsides. We can roughly separate them into two kinds, based on their nature, how explicit they are, and the phase of development in which they appear.
Development phase, functional limitations
These are limitations you can encounter while designing or implementing your flow, to name a few:
only two outgoing relationships
dependency update issues
dynamic properties only
working in processor-like manner
As we can see, these are strictly related to the development phase. If we want more relationships, we need to set some attribute with a status and later route the flowfile with a RouteOnAttribute processor, while as a rule we would like to keep the size of our flow to a minimum. If we want to update dependency jars, we also need to invalidate the processors that use them, otherwise they will keep using a cached (old) version of the jar. Dynamic properties make it impossible for us to use sensitive properties. Finally, the processor-like manner of working is problematic if we want to achieve the functionality of e.g. a controller service.
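The relationship workaround from the first point might look like this in an ExecuteScript (Groovy) body. This is only a sketch: `session` and `REL_SUCCESS` are bindings injected by ExecuteScript, and the `status` attribute name is our own invention:

```groovy
// Runs only inside ExecuteScript: 'session' and REL_SUCCESS are
// bindings provided by the processor, not regular Groovy variables.
def flowFile = session.get()
if (flowFile == null) return

try {
    // ... actual processing would go here ...
    flowFile = session.putAttribute(flowFile, 'status', 'ok')
} catch (Exception e) {
    flowFile = session.putAttribute(flowFile, 'status', 'error')
}
// Everything goes to success; a downstream RouteOnAttribute
// processor has to split the flow on the 'status' attribute.
session.transfer(flowFile, REL_SUCCESS)
```
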
It’s possible to work around these, but the workarounds are more like hacks than solutions.
Maintenance phase, good practice hurdles
Here, instead of a clear and explicit lack of functionality, we can see secondary consequences of the design. While they may not seem that important to the developer who creates the first PoC version of the flow, they can definitely make someone trying to change the flow a year later question the qualifications and sanity of its creator. Among others:
lack of testing frameworks or good practices
incentive for using implicit arguments
no way of describing the properties
The first one is quite obvious: we want to test our changes and run regression tests. Frameworks for automated tests are a blessing that we often don’t appreciate enough in normal programming. The other two require more explanation.
The devil lies in the implicit
As we know, NiFi is full of implicit values that we pass throughout the flow in the form of attributes. Using attributes is convenient, but we need to remember that if we create a custom component, we need to make it visible somehow that we are in fact using them.
Let’s imagine a situation where you are the maintainer of a flow and you want to make a change that will modify the value of one attribute. You need to check that it’s not used anywhere else in the flow. Fortunately, each processor has documentation that states which attributes it uses… except for those scripts. You need to find all the scripts that use the attribute inside the script body. What’s more, they might call some method that takes a flowfile reference as an argument, in which case you need to find the dependency code and check there too.
This can happen with any custom component in NiFi, but in the case of scripts there is no incentive to make values explicit right away. What’s more, taking a value from attributes is in fact the most convenient way to get it.
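To illustrate, a read like the following, buried somewhere in a script body (the attribute name is invented for the example), is completely invisible at the flow level:

```groovy
// Inside ExecuteScript: nothing in the processor's configuration
// reveals that the script depends on this attribute being set upstream.
def flowFile = session.get()
def customerId = flowFile?.getAttribute('customer.id')
```
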
Documentation matters
Imagine a different scenario: you want to use a script someone else created in your flow. The creator was reasonable enough to take all the values from dynamic properties, so you can see right away which ones are used, but what are they for? The script is big and the properties have generic names. Guess it’s time to study the code and figure out the author’s intentions. That’s usually not a pleasant experience.
Custom components to the rescue
Apache NiFi provides an API for all types of components, so the user can easily create custom processors, controller services and so on. To put it bluntly, they solve all the problems mentioned above. The question is then: why don’t we use them by default?
So what’s the problem?
We can divide this question into two separate questions that are easier to answer: first, why use scripts in the first place, and second, why not migrate to custom components?
Why scripts?
The reason is simple: they’re faster to implement. To write a custom processor you need to create a project, compile it and make sure all the libraries are added correctly. It’s quite a lot of hassle. On the other hand, if it’s something simple, a script will be more than enough. The problem is, we start with scripts because the logic they implement is simple, but later it gets more and more complicated, and by then we have already made the decision to use scripts. The only way out, then, is migration.
Hurdles of making non-functional improvements
Migrations to better solutions and other non-functional improvements are something we would all like to have done in our projects. I could say that we would all like to do them, but let’s face it, we don’t necessarily want to… They will not improve the functionality of the solution and will take a lot of time; the only people who will see the difference are the developers, so the business will not notice. What’s more, the business wants new functionalities, and we have a thousand items in the backlog. If that wasn’t enough, a new solution means changes in deployment, so it’s not only a job for developers but for the DevOps team as well… and we don’t even know whether it will be worth it.
Some of the factors mentioned above will not change, but if we could create a quick PoC that works, we would have more arguments to push for such migration.
Salvation for migration - scripted components
NiFi provides several components that form a half-way solution, able to solve a few of the migration issues:
InvokeScriptedProcessor
ScriptedTransformRecord
SimpleScriptedLookupService
ScriptedActionHandler
ScriptedLookupService
ScriptedReader
ScriptedRecordSetWriter
ScriptedRecordSink
ScriptedRulesEngine
ScriptedReportingTask
These components work in a straightforward way: you provide a custom implementation in the component body, and the component behaves accordingly.
How do you use it?
Let's go with the example of a processor that has two properties and two relationships.
import org.apache.nifi.components.PropertyDescriptor
import org.apache.nifi.components.ValidationContext
import org.apache.nifi.components.ValidationResult
import org.apache.nifi.expression.ExpressionLanguageScope
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSessionFactory
import org.apache.nifi.processor.Processor
import org.apache.nifi.processor.ProcessorInitializationContext
import org.apache.nifi.processor.Relationship
import org.apache.nifi.processor.exception.ProcessException
import org.apache.nifi.processor.util.StandardValidators

import java.util.stream.Collectors

class ExampleProc implements Processor {

    public static final PropertyDescriptor REQUIRED_PROPERTY = new PropertyDescriptor.Builder()
        .name("required property")
        .displayName("Required Property")
        .description("Description of the required property, can be as detailed as we want it to be")
        .required(true)
        .addValidator(StandardValidators.NON_BLANK_VALIDATOR)
        .expressionLanguageSupported(ExpressionLanguageScope.VARIABLE_REGISTRY)
        .build()

    public static final PropertyDescriptor OPTIONAL_PROPERTY = new PropertyDescriptor.Builder()
        .name("optional property")
        .displayName("Optional Property")
        .description("Description of the optional property, can be as detailed as we want it to be")
        .required(false)
        .addValidator(StandardValidators.NON_BLANK_VALIDATOR)
        .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
        .build()

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
        .name("success")
        .description("Description of the success relationship, can be as detailed as we want it to be")
        .build()

    public static final Relationship REL_FAILURE = new Relationship.Builder()
        .name("failure")
        .description("Description of the failure relationship, can be as detailed as we want it to be")
        .build()

    @Override
    void initialize(ProcessorInitializationContext processorInitializationContext) {}

    @Override
    Set<Relationship> getRelationships() {
        return new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE))
    }

    @Override
    void onTrigger(ProcessContext processContext, ProcessSessionFactory processSessionFactory) throws ProcessException {
        // intentionally empty - this example processor does nothing
    }

    @Override
    Collection<ValidationResult> validate(ValidationContext validationContext) {
        return validationContext.getProperties().entrySet().stream()
            .map { e -> e.getKey().validate(e.getValue(), validationContext) }
            .collect(Collectors.toSet())
    }

    @Override
    PropertyDescriptor getPropertyDescriptor(String s) {
        return getPropertyDescriptors().find { p -> p.getName().equalsIgnoreCase(s) }
    }

    @Override
    void onPropertyModified(PropertyDescriptor propertyDescriptor, String s, String s1) {}

    @Override
    List<PropertyDescriptor> getPropertyDescriptors() {
        return Arrays.asList(REQUIRED_PROPERTY, OPTIONAL_PROPERTY)
    }

    @Override
    String getIdentifier() {
        return "Example Processor"
    }
}

// InvokeScriptedProcessor requires the script to assign an instance
// of the implementation to a variable named 'processor'
processor = new ExampleProc()
This is the code of a processor that does literally nothing, but if we paste it into the Script Body property of InvokeScriptedProcessor, the processor’s configuration changes.
So what are the changes? All the properties defined in the code are visible in the processor; they are not dynamic and have documentation in the NiFi UI. We can also see the relationships defined in the code, also with documentation.
What about testing?
If we assume that scripted processors are placed somewhere between scripts and custom components, then their testing sits between that of scripts and custom components: you can’t write tests in the processor’s body, you have to set up a project with unit tests.
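A sketch of what such a test could look like, assuming the ExampleProc class shown above is compiled into a test project with the nifi-mock dependency on the classpath (an illustration of the nifi-mock TestRunner, not a complete test class):

```groovy
import org.apache.nifi.util.TestRunner
import org.apache.nifi.util.TestRunners

// Requires nifi-mock; ExampleProc is the class from the listing above
TestRunner runner = TestRunners.newTestRunner(new ExampleProc())

// Properties defined in code can be set and validated like in any processor
runner.setProperty('required property', 'some value')
runner.assertValid()

// Enqueue a flowfile and trigger the processor once
runner.enqueue('flowfile content'.bytes)
runner.run()
```

From here, assertions such as `runner.assertAllFlowFilesTransferred(ExampleProc.REL_SUCCESS)` would verify routing once the processor actually does something.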
Ok, looks good, but now what? As stated previously, this is a step towards custom components; for it to be a good one, it has to have two features:
make migration to custom components easier (the end goal would be to have all things we want as custom)
be in some way better than the scripts (show the benefits of the migration)
Migration aids
There are a few benefits on the migration side. Let’s go through them.
We don't have to create new deployment pipelines - we can either paste the code into the script body or use any jar on the worker, just like in the case of the scripts.
Less overhead on first development - normally we would have to implement the component class, create a Maven project, configure dependencies, deploy it on NiFi and handle all kinds of issues. With this solution we can skip the project creation and deployment, which can sometimes be the most time-consuming part.
Let’s also have a look at what improvements can be made compared to using scripts.
Not only dynamic properties - we can set up validators for property values, make a property optional or required, and everything else that comes with creating custom components.
More types of components - although we can't have all of them, we're no longer bound to processors only.
Easier to follow good practices, easier to keep things explicit - documentation and clear requirements for properties help to keep the flow structure clean.
And that’s where the good parts end
The moment you think that maybe InvokeScripted* could replace custom components, well… here is your bucket of cold water.
Dependency update issue - still present and going strong
No sensitive values - because of the way it's implemented, it cannot really store sensitive values; worse, if we try to use sensitive parameters, we will probably break the whole flow.
Testing - Custom components are still much better for unit testing.
So what do we do now?
In conclusion, if you feel like your scripts could use an upgrade, you now have a few arguments to back it up. Remember to take into consideration that scripts are in general faster to create and have their place in the NiFi ecosystem. In the end, every project has its own specifics, and the decision rests on your and your teammates’ shoulders. Cheers!
Would you like to read something more about Apache NiFi? Check out our blog series NiFi Ingestion Blog Series
7 September 2021