WILT – Amazon Web Services Journey

Pluralsight was generous enough to make all of its courses free during the month of April (#FREEApril). It took me less than a minute to jump on the site and create an account. I’ve used Pluralsight before through employer accounts, but I never had a personal one. The courses I most wanted to dive into were on big data and Amazon Web Services. Here is what I learned.

Making Sense of AWS

Everyone has a general idea that Amazon Web Services (AWS) has a ton of services; it’s literally in the name. I was originally overwhelmed by the number of services and by which one did what, until one of the courses laid out some of the major ones:

AWS Big Data Major Services

Warehouse: Redshift
Data Lake: S3, Athena
Relational: RDS
NoSQL: DynamoDB
Batch: EMR
Streaming: Kinesis, MSK (Kafka)
Data Integration: Glue, Data Pipeline

Splitting these major services into processing and storage layers helped me understand things even more.

Once that clicked in my mind, the connections between the services made sense to me. After that, I moved on to another course that explained how stacks of services are created with another service called CloudFormation.

CloudFormation

CloudFormation is an AWS service that lets you create a stack of services that integrate with one another. For instance, you could create a stack that consists of Relational Database Service (RDS) and Simple Storage Service (S3) to move data back and forth. These stacks can be built manually in a designer or generated from templates written in JSON or YAML.
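To make the template idea a bit more concrete, here is a minimal sketch that builds a one-resource template as a Python dictionary and serializes it to JSON. This is not the course's template (those were YAML and much larger); the logical ID `DemoBucket` and the description text are placeholders I made up for illustration.

```python
import json


def build_template() -> dict:
    """Return a minimal CloudFormation template that declares one S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Minimal illustrative stack: one S3 bucket",
        "Resources": {
            # "DemoBucket" is a hypothetical logical ID chosen for this sketch.
            "DemoBucket": {"Type": "AWS::S3::Bucket"},
        },
    }


# Serialize to the JSON form that CloudFormation accepts as a template body.
template_json = json.dumps(build_template(), indent=2)
print(template_json)
```

A real stack would list more resources under `Resources` (for example a database and the roles it needs), but the overall shape stays the same.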

Coming back to the course, I was able to follow the demos to create an Aurora stack and a Glue stack.

Creating Aurora Stack

The course already prepped a YAML file for me, so all I had to do was upload it to create this stack that includes an RDS database.
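I did the upload through the console, but the same step can be scripted. The sketch below assumes a boto3 CloudFormation client is passed in (for example `boto3.client("cloudformation")`, with credentials already configured); the function name, stack name, and file path are hypothetical.

```python
from pathlib import Path


def create_stack_from_file(cfn_client, stack_name: str, template_path: str) -> str:
    """Read a local template file and ask CloudFormation to create a stack.

    cfn_client is expected to behave like a boto3 CloudFormation client.
    Returns the StackId reported by the service.
    """
    template_body = Path(template_path).read_text(encoding="utf-8")
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        # Assumption: the template may create named IAM resources. If it does
        # not, this capability acknowledgement is simply unnecessary.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]
```

Stack creation is asynchronous, so a script like this would normally wait for the stack to finish before moving on, much as I waited for the console to show the stack as complete.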

Here are a couple snapshots of the stack information when it is completed:

Here is a snapshot of the RDS dashboard:

Once I validated RDS was up, I downloaded MySQL Workbench, connected to RDS, and ran some scripts to generate a database and tables.

Creating Glue Stack

As with the Aurora stack, most of the heavy lifting was already in the YAML file given to me. Before creating the Glue stack, I had to create an S3 bucket to store a Python file that would ultimately be an ETL job.

Once that was completed, I uploaded the YAML template and the following items were automatically created: an S3 bucket to store Parquet data, a Glue Crawler to crawl sakiladb, and a Glue ETL job.

Side note: Before we go any further, I want to define a Glue Crawler. A Glue Crawler is simply a metadata tool that crawls data stores and generates metadata from what it finds. That metadata is stored in the Glue Data Catalog, which serves as a central metadata repository.
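In the course the crawler came from the YAML template, but for a rough picture of what a crawler definition contains, here is a sketch of the arguments that Glue's `create_crawler` call takes for an S3 target. Every value below (crawler name, role ARN, catalog database, bucket path) is a made-up placeholder, and the helper function is my own.

```python
def crawler_request(name: str, role_arn: str, catalog_db: str, s3_path: str) -> dict:
    """Build keyword arguments for a Glue create_crawler call with one S3 target."""
    return {
        "Name": name,                # crawler name
        "Role": role_arn,            # IAM role the crawler runs as
        "DatabaseName": catalog_db,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


# All values are hypothetical placeholders, not the ones used in the course.
request = crawler_request(
    name="parquet-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueDemoRole",
    catalog_db="demo_catalog",
    s3_path="s3://example-destination-bucket/parquet/",
)
print(request)
```

With boto3 this dictionary could be passed along as `glue_client.create_crawler(**request)`. A crawler pointed at a database such as sakiladb would use a JDBC target and a Glue connection instead of an S3 path.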

Once everything was created, we ran the crawler against the sakiladb database created in the Aurora Stack phase. We then ran the ETL job to move data from sakiladb to the destination bucket generated by the YAML file. We then created another Glue Crawler to crawl against the destination bucket. Then finally, we used Athena to run queries against the destination bucket. Here are the snapshots of all that:

AWS Glue Crawler to generate S3 destination data

Using AWS Athena to query the S3 destination data
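The whole sequence (crawl the source, run the ETL job, crawl the output, query with Athena) can also be sketched in code. This is only an outline of the order of operations, assuming boto3-style Glue and Athena clients are passed in; the function names and all arguments are hypothetical, and the polling is deliberately simple.

```python
import time


def run_crawler(glue_client, crawler_name: str, poll_seconds: float = 10.0) -> None:
    """Start a Glue crawler and wait until it reports the READY state again."""
    glue_client.start_crawler(Name=crawler_name)
    while glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)


def run_job(glue_client, job_name: str, poll_seconds: float = 10.0) -> None:
    """Start a Glue ETL job and wait for it to succeed, raising if it does not."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "STOPPED", "TIMEOUT", "ERROR"):
            raise RuntimeError(f"Glue job {job_name} ended in state {state}")
        time.sleep(poll_seconds)


def run_pipeline(glue_client, athena_client, source_crawler: str, job_name: str,
                 dest_crawler: str, database: str, query: str,
                 output_location: str, poll_seconds: float = 10.0) -> str:
    """Crawl the source, run the ETL job, crawl the output, then submit a query."""
    run_crawler(glue_client, source_crawler, poll_seconds)  # catalog the source tables
    run_job(glue_client, job_name, poll_seconds)            # write data to the bucket
    run_crawler(glue_client, dest_crawler, poll_seconds)    # catalog the written data
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        # Athena writes query results to an S3 location of your choosing.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The returned ID identifies the submitted query; fetching the actual rows is a separate step that I left out to keep the sketch short.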

After running through these courses, I have a much better understanding of what AWS offers and how to use it. I’m looking forward to learning more while I have the time. I hope this post motivates you to learn more about AWS if you haven’t already.

Happy coding!
