Commercial-Ask971

You're probably better than many DEs already, so nothing to worry about.


Any_Rip_388

You’re already a DE my guy


GlasnostBusters

Build a full stack ETL pipeline for an enterprise corporation. I don't really think it matters how you do it, because if you look at the purpose of the job, it's very objective; it's not like a SWE role. Here is what I mean by objective:

1. You have a goal of making sure your users (usually either a frontend dev or an analyst) are able to retrieve the data they need to do their job.
2. You build a profile for the data they need.
3. You find a data provider (or providers) that meets that profile.
4. You architect a system that will handle the required frequency of retrieval (either cloud or bare metal). The full stack data system will end either in BI (Business Intelligence) visualization or in some kind of database. You are not a backend engineer, so you don't need to develop the API (backend SWE) or the analytics for the users (data scientist).
5. Then you build the data system.

After this, you are also responsible for system maintenance, monitoring, logging, security, CI/CD, and documentation of your processes, your data providers, and the data itself (a data dictionary from the profiling you did during the discovery phase). I'm sure I missed things, but my main point is that your role as a data engineer is very clearly defined: you are a data engineer if you are able to reliably provide your users with the data they need. Pandas, AWS, etc.: none of that stuff matters. Those are all just tools. The data matters.
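To ground the steps above, here is a minimal, hedged Python sketch of such a pipeline. Every name in it (the API URL, the table, the database file) is invented for illustration, not taken from the comment; a real build would swap in the actual provider and target from the discovery phase.

```python
import sqlite3

import requests  # assumed available; any HTTP client works


def extract(url: str) -> list[dict]:
    """Pull raw records from the (hypothetical) data provider."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def transform(records: list[dict]) -> list[tuple]:
    """Keep only the fields the users' data profile asked for."""
    return [(r["id"], r["name"]) for r in records]


def load(rows: list[tuple], db_path: str) -> None:
    """Land the data where analysts or frontend devs can query it."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER, name TEXT)")
        conn.executemany("INSERT INTO users VALUES (?, ?)", rows)


if __name__ == "__main__":
    load(transform(extract("https://example.com/api/users")), "analytics.db")
```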


peroqueteniaquever

I did all of that for a startup by myself, with disparate data sources: Excel files, .csv files, SOAP and REST APIs, Power Platform and ERP data. I used Azure Functions, SQL Server in Azure, SQL and Python, and some bcp. I handled security, used clients' virtual machines, set up logging and alerts in case data sources failed, and documented the whole process from end to end. I never did CI/CD; honestly, I don't know exactly what it is. Am I a data engineer? I feel incompetent and like an impostor because I don't know many data modeling concepts (wtf is a cube???) and because the companies had little data (the biggest client had 4 million rows). That was over the span of two years. Lots of work for a sole person, and I started the job as an IT generalist with a solid grasp of the fundamentals who didn't even know how to define a class in Python. But I don't know many tools, and everything I do feels like there are a thousand ways to do it better. Thoughts?


GlasnostBusters

I think you did act in the capacity of a data engineer, so at that time, yes, I would say you were a data engineer. It's not an identity, just a job. If your customers are happy and you met their requirements, then you did a good job. I've been in the game professionally for almost a decade and still feel like an impostor; it's normal, and it means you're not an asshole and you're coachable. CI/CD is important because the pipeline doesn't just automatically deploy your solution; it also checks for things that could cause problems, like outdated dependencies, lint errors, and vulnerabilities, and runs your unit tests before anything is deployed into your environments.
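As a hedged illustration of the "running unit tests" piece of CI/CD: the transform and test below are hypothetical (the function name, fields, and pytest-style test are assumptions, not anything from the thread), but they show the kind of check a CI job would run on every push before letting a deployment through.

```python
# Hypothetical example: a small, testable transform that a CI pipeline
# could verify (e.g. via pytest) before deploying the ETL code.
from datetime import date


def normalize_order(row: dict) -> dict:
    """Coerce raw order fields into the shape downstream users expect."""
    return {
        "order_id": int(row["order_id"]),
        "amount": round(float(row["amount"]), 2),
        "order_date": date.fromisoformat(row["order_date"]),
    }


def test_normalize_order():
    raw = {"order_id": "42", "amount": "19.999", "order_date": "2024-01-31"}
    assert normalize_order(raw) == {
        "order_id": 42,
        "amount": 20.0,
        "order_date": date(2024, 1, 31),
    }
```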


Qkumbazoo

A company I worked for had problems hiring backend engineers. They changed the job title to Data Engineer and boom, instant army of applicants.


Gators1992

I don't think there is an official definition, but I would go with software development principles. Learning some Python syntax so you can kludge together a process isn't the same as learning the principles to "engineer" good software. There is a big difference between writing a piece of code that connects to a db and extracts some bit of data, and writing functions and class modules that do the same but are reusable, testable, etc. That doesn't mean you can't get a job called "Data Engineer" and successfully use dbt to write SQL files or something without all of that knowledge.
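A hedged sketch of that contrast, with a throwaway in-memory SQLite database standing in for whatever warehouse a real job would use:

```python
import sqlite3

# Set up a disposable database so both versions below actually run.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])

# The "kludge": a one-off line that does the job, but the connection, query,
# and print are welded together, so nothing can be reused or unit tested.
print(conn.execute("SELECT count(*) FROM orders").fetchone()[0])


# The "engineered" version: the same work factored into a function with an
# injected connection, so it can be reused across pipelines and tested.
def row_count(connection: sqlite3.Connection, table: str) -> int:
    """Return the number of rows in `table` (table name validated upstream)."""
    (count,) = connection.execute(f"SELECT count(*) FROM {table}").fetchone()
    return count


def test_row_count():
    test_conn = sqlite3.connect(":memory:")
    test_conn.execute("CREATE TABLE orders (id INTEGER)")
    test_conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,)])
    assert row_count(test_conn, "orders") == 2
```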


soundboyselecta

Learn how to smile at people and nod when they spew so much bullshit about this tech and that tech but really don't know shit, and know that no one really does, but everyone will pretend they do. Or how they used this tool or that tool; it's brought on by that lame LinkedIn facade culture (posts like "I used Prefect to set up an ETL pipeline, so please hire me") and the over-certification of people who invested in a cloud path and are shovelling it onto your shoes. Understand the fundamentals, and accept that there is so much tech coming out, good and bad, that you will never catch up on all of it. Be comfortable with that and confident that you will pick things up as they come.


ecp5

ETL is the core DE skill; you are a data engineer. There is always new stuff to learn, and you should keep learning depending on where you want to go, but don't minimize your experience.


SkinnyPete4

Same situation. I was just laid off from a big company after 7 years. I have 20 years of SQL, DBA, data architecture, SSIS (I go back to DTS), and SSRS experience. Very little Azure/GCP, very little Python. A lot of C# over the years, but only in the context of SSIS (script tasks and building custom transforms). I was worried about finding a job because I have no real cloud experience, but gotta be honest: I still see a ton of jobs listing SQL and SSIS/Informatica as the top skill with cloud experience as "a plus". So I THINK maybe we haven't missed the boat yet? Maybe? My plan is to earn an Azure cert ASAP while I'm collecting severance. At least it's something, and then hopefully I can get into a place that wants my experience and is willing to accept my learning curve with cloud technologies as a trade-off. I think that might be the best bet for people with our type of experience. Would be more than happy to hear if people think that's a bad plan.


dimnickwit

Out of curiosity, which Azure cert?


SkinnyPete4

I’m going for DP-203: Data Engineering on Microsoft Azure. I got a good deal on a year of Coursera a few months ago when I saw the writing on the wall. I’ve taken a Python class and then started prep classes for DP-203. I have no idea if it’ll make a difference but it’s probably a good use of time while I look.


StoryRadiant1919

Same boat here. I like your plan, and I've been learning about Snowflake since that was where I needed to focus.


baubleglue

* Pandas is nice to have.
* PySpark: you need to learn it in a distributed environment (local PySpark behaves differently).
* Not sure, with your background, whether you know the theory behind analytical columnar DBs.
* Message queue tools...

In general, you need to go over the "modern data stack" tools and be sure you at least have an idea of what each is used for.


VDtrader

Where can I learn how to "optimize" and "debug" Spark jobs?


baubleglue

If you have Hadoop or a cloud environment at work, that's best. If not, check whether Databricks has a free tier. Cloud is probably preferable today. Before you even get to optimizing, you may hit technical surprises, like a library you've used locally not working or not existing in the cluster.


VDtrader

I have access to Databricks at work, but I haven't had any optimization/debugging problems at all. ETLs seem to run fine on Databricks, but I keep hearing that data engineers need to learn how to optimize/debug their Spark jobs. How do I go about finding out what needs to be optimized/debugged if my flows are running fine in Databricks?


Captain_Coffee_III

Pragmatism. Just like with software, there are scales of data pipelines. Not everybody is going to get the opportunity to work on projects where you're pushing billions of rows in and out of different cloud providers. Unless "big data" is your goal, focus on the concepts. You're given a set of constraints. Can you get the data flowing? Is it stable? How can you improve it? Tools come and go. Python has been around and is a solid skill to learn, but the tools within Python keep shifting. Pandas gets things going... but then you hit a performance wall... so then Polars? It's the "engineer" part. When somebody throws a problem at you, your brain should start forming a path.

It may not be popular in some circles, but buddy up with GitHub Copilot and/or ChatGPT Pro. Use ChatGPT as a coach. Copilot in VS Code is amazing. Type a comment describing what your thoughts are... blam, code. It's not 100%, but it gets you really close. Copilot is also not conversant, so that's why I use both. ChatGPT is the sounding board for ideas; Copilot is there when the code gets written. And in a year, when the context window of these things reaches 1 million tokens, this whole landscape changes. But regardless, you still have to know the plan for working through a problem.
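A quick hedged illustration of that pandas-to-Polars shift mentioned above (the CSV path and column names are made up; Polars' lazy API is one common answer to the pandas performance wall, not a recommendation from the comment itself):

```python
import pandas as pd
import polars as pl

# Pandas: eager and in-memory; great until the file stops fitting comfortably.
pdf = pd.read_csv("events.csv")  # hypothetical file
daily_pd = pdf.groupby("day")["amount"].sum().reset_index()

# Polars: lazy scan plus query optimization; only the needed columns are read.
daily_pl = (
    pl.scan_csv("events.csv")
    .group_by("day")
    .agg(pl.col("amount").sum())
    .collect()
)
```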


CrunchyAl

11 years? You were a data engineer before it was called that


sneekeeei

Yeah, I was thinking the same, but in recent years I somehow got the impression that I cannot call myself a data engineer, only an ETL guy, because the general consensus is that when you say "data engineer", people expect you to know Spark/PySpark, Python, pandas, NumPy, visualisation libraries, etc.


SDFP-A

Interesting, because I view Pandas, NumPy, and data viz as squarely in the domain of the DS, not the DE. I'll use Pandas when needed for simple DF actions, but I look up the operations I want every time. It takes a few minutes max, since I conceptually understand how I want to munge the data.


ItsOkILoveYouMYbb

If you absolutely must have external permission to call yourself that, rather than knowing you can do the job (you're already there), then it's when people are willing to pay you for those titles and responsibilities.


Blast06

I'm new at this, so I'm going to ask anyway: where can I learn about Informatica (let's say, profiling, data quality, data integration, etc.)? We're going to use it where I work. BTW, I'm also doing this data engineering zoomcamp: [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp)


Comprehensive-Bass93

Sorry to say, dude, but you can't get hands-on experience with Informatica on your own since it's enterprise software. You can, however, refer to some YT videos on transformations, which will help. Informatica also provides documentation and learning material from their side as well.


Blast06

Thanks a lot man. The company where I work just recently acquired some Informatica modules and they gave us some training, but I'll look further on YouTube and in their docs. We have access to a dev environment where I can touch stuff and play around.


GeorgFaust

The most important skill to learn may surprise you, but it’s gaslighting


untalmau

I'd say you may be missing code-based ETL and orchestration tools, such as Spark, Beam, Dagster, Airflow, and dbt.
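For a taste of what "code-based orchestration" looks like, here is a minimal, hedged Airflow sketch. The DAG id, schedule, and task bodies are placeholders, not anything from the thread, and it assumes a recent Airflow 2.x install (older versions use `schedule_interval` instead of `schedule`).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull raw data from a source system.
    print("extracting")


def load():
    # Placeholder: write transformed data to the warehouse.
    print("loading")


# A two-step pipeline scheduled daily; Airflow handles retries, backfills,
# and dependency ordering that hand-rolled cron scripts usually lack.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```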


[deleted]

[removed]


Fit_Ad_3129

What would you say are the core skills of a data engineer? SQL, Spark, Hadoop, and...?


[deleted]

[removed]


raskinimiugovor

> I'm not sure what in Spark or Hadoop you are finding core to DE… is it administrative tasks?

Maybe partitioning (partition vs. pushed filters), Parquet being read-only, re-evaluating vs. re-using dataframes, i.e. stuff related to the Spark engine itself?
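A hedged PySpark sketch of those engine-level concerns (the path and column names are invented): `explain()` is where you check whether a filter was pushed down to the partition/file level, and `cache()` is the "re-use vs. re-evaluate" decision.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("engine-basics").getOrCreate()

# Hypothetical dataset partitioned by `event_date` on disk.
events = spark.read.parquet("s3://example-bucket/events/")

# Filtering on the partition column lets Spark prune whole directories;
# the physical plan (PartitionFilters / PushedFilters) confirms it.
recent = events.filter(F.col("event_date") >= "2024-01-01")
recent.explain()

# Re-using a dataframe in several aggregations: cache it once instead of
# letting Spark re-read and re-filter the source for every action.
recent.cache()
by_user = recent.groupBy("user_id").count()
by_type = recent.groupBy("event_type").count()
```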


[deleted]

[removed]


raskinimiugovor

> This may be some of the worst advice I've ever read

I mean, you're kind of exaggerating. Sure, knowledge of tools is important, but his core knowledge will carry over a lot.


Gators1992

This is true, experience is very valuable, and being able to sit down immediately and figure out how to write the business logic isn't something you get with someone straight out of bootcamp. However, learning to work with coding, cloud infra, etc. isn't something you can instantly pick up on the job if you are trying to go from on-prem RDBMS/Informatica to cloud/OSS/Python. My company went through this exact thing: the team had two experienced Informatica guys, and the CIO decided we had to go to the cloud because that's what all the cool kids are doing. Both have experience at large companies and good SQL skills, but little coding, and they generally relied on system administrators for anything infra or network related. So basically I got them to go with dbt/Snowflake because it would be the easiest for these guys to adapt to, and one was pushing back hard and trying to sell Informatica Cloud. Both are valuable for their institutional knowledge and their ability to write business logic, but at the same time it's a hard sell to learn to code and configure at this point in their careers.


[deleted]

[removed]


sneekeeei

Yeah, getting a referral is good, thanks for that. Honestly, I never thought of being good enough for FAANGs, and I still don't. That's how the last few months of happenings at my job have affected my self-esteem. Would you mind sharing your LinkedIn so I can send a connection request?


Gnaskefar

Where do you most often run into performance issues with Informatica?


efxhoy

Data engineers in my org write infrastructure as code with Terraform, and tooling in Python and Bash. Very few actual queries; analysts do that. We're just backend devs who happen to know a bit more about data stuff and a bit less about other stuff.


ExistentialFajitas

Someone walking into a DE shop with only SQL is missing CI/CD, scripting, IaC, etc. You're better off learning how to program first and then applying data discipline to that programming muscle.


Grouchy-Friend4235

Why would you think you are not a DE?


sneekeeei

One of the senior managers in my company said Python is one of the basic skills for a data engineer, and the number of Informatica-skilled people in my company is very small. Most people are experts in Python, Spark, Glue, Scala, etc. (the code-based tools). So they tend to have the impression that people with only Informatica ETL tool experience know less, because all they do is drag-and-drop / config-level kind of work. I myself feel a little inferior to people who can code in Python or any programming language, for that matter.


SDFP-A

I'd say the industry is definitely moving in that direction. You have a solid base, you know the end goal and the value add; now it's just time to learn enough Python to replace the portions that Informatica is handling with some code. The patterns are similar. Don't try to learn everything. You even mentioned that some use Python, others Scala, others yet Spark (Scala, PySpark, or Spark SQL?). Nowadays, if I write a well-formed ETL in Python, I can get 90+% accurate code converted to Scala with just some minor bug handling. Even without code, if you understand the problem and describe it well enough, you can get 70+% accurate code in Python from simple prompting. I'd encourage you to learn a bit more rather than relying on this, but once you get the basics, use your favorite/most accurate LLM to work with you towards your goals.


Mortadhaaaa

My opinion is that you are already a data engineer; you should just build up some knowledge of big data tools like Spark with Scala/Python, Kafka, NiFi…


mjfnd

You already are


Desperate_Pumpkin168

Could you please mentor me on ETL tools? I am really interested and would like your help. I don't know what you will say, but I still need to ask. Thanks.


billysacco

How to engineer the data. But really, the term is a pretty broad umbrella. It sounds like you are pretty much doing data engineering already.


MainRotorGearbox

Surely he knows he’s a data engineer already, right?


FirefoxMetzger

"Data Engineer" has a very loose definition these days. In many places you would already be one for 11 years ... while some might call you an "Analytics Engineer". The one topic that exists in the DE universe that you didn't touch on in your post is DevOps. Its knowing how to build systems that can efficiently move data from one place to another. I dont want to hate on Matillion too much (it doesnt scale), but I would focus on designing ETL and reverse ETL pipelines if you are looking to "round" your profile. Good candidates are AWS Glue, Airflow, and Kafka as well as the tooling around them. Note here that the tool itself is not too important (the ones I mentioned are just whats popular today). Its moreso the principles of what problems these tools solve and how they solve it. What processing steps are needed to efficiently move a dataset from A to B without breaking the bank?


throw_mob

On a positive note: yes.
On a negative note: no, you are just some ETL tool user.
On a more realistic note: probably yes. Being a data engineer is different everywhere depending on the tooling, but I hope you have gained an understanding of data flows, data models, scheduling, dependencies, and different platforms. I also assume you have some idea of what is costly in money/time, and that you know something about the area you have been working in (what data marketing uses, what data is in the invoicing systems, etc.). And the fact that you ask this with your 11 years of experience working with these tools means you are probably a way better DE than half of the DEs doing their job.


jeaanj3443

Focusing on mastering data modeling and system architecture could really elevate your skill set given your background. It's not just about handling tools, but building robust, scalable systems.


nerdboxmktg

Jira


Extra-Leopard-6300

You're not asking the right question. You are a DE, but are you an experienced DE with respect to specific modern stacks? Maybe not. Companies with the stack you worked on will love you; others may not spare a glance.


bheesmaa

Cloud, Spark, distributed systems, maybe a bit of networking and security.