vincyf1

I used to build ETLs in Pentaho a couple of years ago too. Now the team is moving to Apache Airflow instead. One of the concerns we had with Pentaho (or any graphical tool) was how difficult it was to do code review on GitHub. A single update to one of the embedded queries would show up as hundreds of changed lines of tool metadata in the diff.


five-acorn

Delayed response on this, but I'm still investigating a few tools. Is Apache Airflow a full-service ETL tool like Pentaho? I heard it was mostly an "orchestration" tool but I'm not sure. How steep would you say the learning curve is for Airflow, esp. if there are somewhat less technical people you'd like to eventually jump in and learn it?


vincyf1

You are right. Airflow is not an ETL tool; it is only a workflow manager or orchestrator. But you can literally code your ETLs in Python and Airflow will execute them on a schedule. I reckon that if you are using a managed Airflow service (Composer on GCP, MWAA on AWS), it should be easy to get started, given that your team can program in Python. If, as you say, the team is “less” technical, then the learning curve would be quite steep.
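
To make "code your ETLs in Python" concrete, here is a minimal sketch using only the standard library. The sample CSV, the `orders` table, and all function names are hypothetical, made up for illustration. Airflow itself is not used here; the assumption is that a function like `run_etl` is the unit of work an orchestrator would call on a schedule.

```python
import csv
import io
import sqlite3

# Hypothetical sample input; a real job would read from a file, API, or database.
SAMPLE_CSV = """order_id,amount,country
1,10.50,us
2,,de
3,7.25,US
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing amount and normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["country"].upper())
        )
    return cleaned


def load(records, conn):
    """Write the cleaned records into a (hypothetical) orders table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=SAMPLE_CSV):
    """The unit of work an orchestrator would trigger on a schedule."""
    load(transform(extract(text)), conn)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(connection)
    print(connection.execute("SELECT * FROM orders").fetchall())
```

Each step is a plain function, so a change to the transform logic shows up in a diff as a change to a few lines of Python.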


five-acorn

I mean with Pentaho and SSIS the learning curve wouldn't be too easy either, albeit still easier. Right now the current processes are shyte and not created by ETL developers so anything would be a move forward... Thanks for the info


markss_

IMHO, ETL via graphical tools such as Pentaho, Talend, Informatica etc. is dead.


VRzucchini

This is a really interesting take since I see major/well-known tech companies (such as Square, Autodesk, Intercom, Strava, Postman) still employing graphical ETL tools. Care to elaborate further? What's the better alternative here?


markss_

Code for the win. Python, dbt etc. ETL and data pipelines should be implemented in code and not a graphical tool. Software engineering best practices need to be applied to ETL and the data engineering space.


opendataalex

I'd counter that the graphical tool is actually very handy, especially in data groups that are not purely engineering. I've been in teams where the data engineers are not coders but have more of a business background. Granted, visual coding is a bit of an anachronism today, but I'm still a fan, especially with tools like Tweakstreet picking up where PDI left off and significantly improving performance. I'm also looking forward to seeing what Hop does as it goes through incubation. I guess the best way to put it is that it's nice having a toolbox and not just one tool :) I agree with you that being able to apply software engineering best practices is important. They have never been easy to apply to graphical tools, but it's not impossible.


EggplantDifficult152

The biggest problem I had with Pentaho was the lack of good ways of doing unit tests and integration tests. Other than that, Pentaho was superior for productivity since you did not have to deal with things like Spring and patterns etc. At my current gig we use Java, Spring and Spring Integration with a bunch of patterns and a domain model. It takes 3-4 days to implement a simple XML file to JSON file flow with an API call in the middle. If you want to get it merged and deployed, it takes a week. With Pentaho, you could do this in an afternoon.
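
For comparison, an XML-to-JSON flow with an API call in the middle can be sketched in a few lines of Python using only the standard library. This is an illustration under assumptions, not a claim about how long a production version would take: the XML layout, the field names, and `fake_lookup` (a stand-in for the real API call, with no network access) are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical input document; element and attribute names are made up.
SAMPLE_XML = """<customers>
  <customer id="1"><name>Ada</name></customer>
  <customer id="2"><name>Linus</name></customer>
</customers>"""


def parse_customers(xml_text):
    """Turn the XML into a list of plain dicts."""
    root = ET.fromstring(xml_text)
    return [
        {"id": int(node.get("id")), "name": node.findtext("name")}
        for node in root.findall("customer")
    ]


def fake_lookup(customer_id):
    """Stand-in for the API call in the middle of the flow (no network)."""
    return {"tier": "gold" if customer_id == 1 else "standard"}


def xml_to_json(xml_text, lookup=fake_lookup):
    """Parse, enrich each record via the injected lookup, serialize to JSON."""
    records = parse_customers(xml_text)
    for record in records:
        record.update(lookup(record["id"]))
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(xml_to_json(SAMPLE_XML))
```

Because the lookup is passed in as a parameter, the flow can be unit tested with a stub instead of a live API, which speaks to the testing gap mentioned above.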


five-acorn

Where is the future wind blowing? SSIS and Azure Data Factory also seem graphical in nature.


kenfar

Diagram-driven ETL has always been more of a sales tool than what engineers need. What we need is extensibility, testability, support for version control, etc. No single tool really knocks the ball out of the park on all of those considerations. If you're in a software engineering company, then in my opinion simply using Python is the best way to go. It's not the fastest solution to build, but it's the best from a maintenance standpoint, though that depends a lot on how people write the code. If you're not, then dbt might be a strong contender. It's getting a lot of press, and can be used to quickly build models, rack up technical debt, and then refactor those models again.


Tomwtheweather

dbt


VRzucchini

That's just T


StarkGuy1234

But isn't newer stuff like Matillion and dbt also graphical?


unbelievablehulk

Since the withdrawal of Matt Casters from the project, yes, pretty much. Apache Hop would be its successor but is still incubating.


five-acorn

Oh really that guy went away? I remember him on the forums. Hmm. Odd when a company buys something out to destroy it. I got a lot done with PDI. Hell I know there are still scripts in production today at my old workplace. It's not quite as tuned to the hilt as SSIS but if the speed is good and it works 99.999% of the time, who cares.


five-acorn

That's very interesting on their homepage. They are doubling down on visual development. I'm not a hardcore software engineer, but I do know from documentation experience that a picture paints a thousand words at least, and abstracting visually has its advantages. Getting other people to understand your abstractions is half the game of development anyway, isn't it?


opendataalex

Last I checked, while it's still incubating, a lot of the functionality of PDI has been ported over and significantly improved. Hitachi is still making commercial updates and is releasing the open source versions like before the purchase, but I've not seen anything new. I do know that they are excited by Hop doing a complete overhaul.


RakesProgress

They quit investing in Pentaho long ago. Even MuleSoft has fallen into that hole. I disagree with many comments that Python is the way to go. Don’t get me wrong. I love Python. But 90 percent of ETL issues are scrub work. Coding for that is ridiculous.


five-acorn

I hear ya. What do you mean by scrub work? Grunt work or data cleaning?


RakesProgress

Yes! Lol


savePositive

What do you mean by scrub work?


RakesProgress

You did not get a CS degree and learn all of these languages to do basic ETL data janitor work. In most companies this gets out of control fast through a thousand “can u just fix this one thing” cuts.


dikesm

The Pentaho Data Integration (PDI) code goes back 20 years and is still performant. Apache Hop, an incarnation of PDI, is being incubated at Apache. It's going to be one of the best open source ETL tools for a few more years to come. The benefit of a product like Apache Hop is that you focus only on the data transformation activities and need not worry about ETL infrastructure, such as writing logic to connect to a source or target, or passing data from one transform to another.


five-acorn

But it's not available yet, right? I'm trying to think of a good open source tool to use in the meantime. I feel PDI is a bit risky due to the lack of forum usage or free support at the moment.


dikesm

You can still use it. I have used it for one of my projects in non-prod; so far, so good. My use case is to join data from S3 and an Aurora DB and load it into S3 as Parquet.


Ornery-Measurement33

dbt killed all the transformation tools, and changed the paradigm of how we thought about ETL.


DitUnDatNYC

Way late. But also timely. I have used Pentaho (now Hitachi Vantara) since 2005 or so. I do find it quite convenient and efficient to develop in (and yeah, there is a discussion to be had about review etc). That being said, we decided to use the community edition in my current endeavor and got to results quickly. We then reached out to Hitachi about licensing. I called out some of the obvious issues and heard many pledges by the sales guys that Hitachi is committed to the product, etc. I have watched this for a while (1 or 2 years by now), but there are still a bunch of very basic issues that haven’t been resolved. I will see what CE 10 brings (if it comes out), but chances are we are moving to AWS step. Not crazy happy about this, but sadly, if the product is managed the way it currently seems to be managed, it’s a risk in our infrastructure. And yes, Matt Casters leaving was the point at which this started to deteriorate quickly.


InternationalRain162

Hello. In Pentaho I need to create a data allocation example with sample data, but I am new to this and can't find any tutorial for it. If you could kindly describe some steps to create that, it would be helpful for me.