CLI, SDK, Boto3, CloudShell

Ways to manage AWS cloud resources without using the web console (GUI)


CLI guides


AWS CloudShell

A browser-based command-line environment, pre-authenticated with your console session and integrated with the AWS platform, that lets you manage cloud resources without installing or configuring local tools.
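CloudShell ships with the AWS CLI, Python, and boto3 preinstalled, and credentials are inherited from the console session, so a quick identity check needs no setup. A minimal sketch:

```python
# Runs as-is inside AWS CloudShell: boto3 is preinstalled and the
# session is already authenticated with your console credentials.
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])
```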


SDKs

Boto3: the AWS SDK for Python. It's widely used for automation and integration in data engineering workflows, giving programmatic access to services like S3, Glue, Redshift, Lambda, and more. Boto3 picks up credentials from AWS CLI profiles or environment variables and supports session management.
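A minimal sketch of that credential and session handling; the profile name "data-eng" is a hypothetical placeholder for one of your own AWS CLI profiles:

```python
import boto3

# Explicit session from a named CLI profile ("data-eng" is hypothetical);
# omit profile_name and boto3 falls back to environment variables and
# the default credential chain.
session = boto3.Session(profile_name="data-eng", region_name="us-east-1")

# High-level resource API: list S3 buckets.
for bucket in session.resource("s3").buckets.all():
    print(bucket.name)

# Low-level client API: list Glue databases.
for db in session.client("glue").get_databases()["DatabaseList"]:
    print(db["Name"])
```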

SDK for Golang


AWS Data Wrangler

AWS Data Wrangler, also known as awswrangler (since renamed the AWS SDK for pandas), is an open-source Python library that simplifies the interaction between the pandas DataFrame library and AWS data services. It extends pandas to the AWS ecosystem, enabling users to perform common ETL (Extract, Transform, Load) tasks with less code.

Key Features and Functionality:

  • Integration with AWS Services: awswrangler provides abstracted functions to connect pandas DataFrames with services like Amazon S3, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Timestream, Amazon EMR, Amazon DynamoDB, Amazon CloudWatch Logs, and Amazon Secrets Manager.

  • Simplified ETL: It streamlines the process of loading and unloading data from data lakes, data warehouses, and databases, allowing users to focus on data transformation logic using familiar pandas commands.

  • Data Formats: The library supports reading and writing various data formats to Amazon S3, including CSV, JSON, Parquet, Excel, and fixed-width files.

  • AWS Glue Catalog Interaction: It offers dedicated functions for interacting with metadata organized within the AWS Glue Catalog.

  • Querying and Data Retrieval: awswrangler can execute SQL queries on Amazon Athena to process large datasets in S3 and return the results as pandas DataFrames (see the sketch after this list). It can also query data from services like Amazon Redshift and Amazon Timestream.

  • Lightweight Workloads: While not designed for petabyte-scale data processing, it is highly efficient for lightweight to medium-sized data workloads (e.g., up to 5GB of plain text data), offering potential cost and speed advantages over distributed big data tools like Spark for these specific use cases.

  • Deployment Flexibility: It can be installed and utilized in various AWS environments, including AWS Lambda (as a layer), AWS Glue (Python Shell jobs), Amazon SageMaker Notebooks, and Amazon EMR.
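A hedged sketch of a typical round trip: write a DataFrame to S3 as Parquet, register it in the Glue Catalog, then query it back through Athena. The bucket, database, and table names are hypothetical, and a Glue database named "analytics" is assumed to exist:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Write the DataFrame to S3 as a Parquet dataset and register the
# table in the AWS Glue Catalog (all names are hypothetical).
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/analytics/events/",  # hypothetical bucket
    dataset=True,
    database="analytics",
    table="events",
)

# Run SQL on Amazon Athena and get the result back as a pandas DataFrame.
result = wr.athena.read_sql_query(
    "SELECT id, value FROM events",
    database="analytics",
)
print(result)
```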

Benefits:

  • Reduced Code Complexity: Abstracts away the complexities of interacting with individual AWS service APIs.

  • Increased Productivity: Allows data engineers and data scientists to leverage their existing pandas knowledge for AWS data operations.

  • Faster Development: Simplifies the development of data pipelines and reduces the time spent on data integration.

  • Cost-Effective for Lightweight Workloads: Offers an efficient solution for smaller-scale data processing needs.


GCP vs. AWS Commands

| Action         | GCP Command (gcloud)          | AWS Command (aws)           |
| -------------- | ----------------------------- | --------------------------- |
| Auth           | gcloud auth login             | aws configure               |
| Check Identity | gcloud config list            | aws sts get-caller-identity |
| List Buckets   | gcloud storage ls             | aws s3 ls                   |
| Copy File      | gcloud storage cp [src] [dst] | aws s3 cp [src] [dst]       |
| List VM Info   | gcloud compute instances list | aws ec2 describe-instances  |
| SSH            | gcloud compute ssh [name]     | ssh -i key.pem user@ip      |
