I'm currently in my first job that involves Spark. Code is written in Java.
My current task is to continue what was left off by a previous dev.
The task in itself isn't very hard but context:
The previous Dev wrote the Rules, the joins in only one specific class (which are getting really big, like thousand liners) with a huge main class too.
I can easily follow the same process and make those classes even bigger but I find it extremely irritating to read and maintain for the future.
For now, I restructured the project in packages like:
But I also doubt my own methodology and I have no Senior to rely on in the company.
My classes are just utility classes with many (but organized ?) static methods that contain the Rules and I call from a helper class to chain all the processes. I feel like it is kind off "dirty" code but I'm unable to come up with any other solution
I do have a few classes in which I inject a dataset as a parameter so that this dataset serves as a reference in all the methods inside the class. But all the methods are really "specific" for example:
public class CurrencyService {
private final Dataset<Row> currencyReference;
// constructor
public Dataset<Row> applyCurrencyConversion(Dataset<Row> targetDataset){
// target joins reference and apply conversion
}
private Column conversion(Column targetColumn){
// conversion rules
}
}
How should I handle Join Keys? Every dataset has a different name for its ID. I was thinking about forcing a key String from a constant for each ID at preprocessing step. Because otherwise, I have to join every dataset manually by knowing their Column's name each time.
With preprocessed join keys, I may do something like dsA.join(dsB, Sequence of (joinKeys)).
In the end, I feel like I'm spaghettying the code even more while the previous "single class" style was easier to understand but not to maintain...
I have a background in POO, so am I taking this project too Object-Oriented?
This is a really very broad and impressive question :) - I wish developers have this level of maturity when they start a project.
You are building a pipeline, I recommend you follow:

If you follow this diagram, you should not apply rules after the Gold zone, unless you work on specific sub-projects.
For organizing the transformations, static functions are perfectly fine, especially as you are inheriting an existing (working?) project. I would think a little differently for future rules or a project from scratch (you could build a small rules engines where you add the rules to a list and execute at once for example).
For joining, if you have complex joins you reused, you can imagine isolating them, but your idea of isolating the id columns is good: it is more metadata driven. In some projects, I created a "SuperDataframe" object where I could add some of the missing intelligence to the dataframe. You can also add custom metadata to any columns and have your rules use that: check out https://github.com/jgperrin/net.jgp.labs.spark/blob/master/src/main/java/net/jgp/labs/spark/l090_metadata/l000_add_metadata/AddMetadataApp.java. It looks like:
Metadata metadata = new MetadataBuilder()
.putString("x-source", filename)
.putString("x-format", format)
.putLong("x-order", i++)
.build();
df = df.withColumn(colName, col, metadata);
Dataset<Row>) not Datasets/RDDs for performance optimization.You're on a good track!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With