CLI, SDK, Boto3, CloudShell
Ways to manage AWS cloud resources without using the GUI
AWS CloudShell
A browser-based command-line environment, pre-authenticated with your console credentials and integrated with AWS, that lets users manage cloud resources without installing local tools (the AWS CLI and common runtimes come pre-installed).
SDKs
Boto3: The AWS SDK specifically for Python. It's widely used for automation and integration in data engineering workflows, allowing easy programmatic access to services like S3, Glue, Redshift, Lambda, and more. Boto3 uses credentials from AWS CLI profiles or environment variables and supports session management.
AWS Data Wrangler
AWS Data Wrangler (the awswrangler package, since renamed AWS SDK for pandas) is an open-source Python library that simplifies interaction between pandas DataFrames and AWS data services. It extends pandas to the AWS ecosystem, enabling users to perform common ETL (Extract, Transform, Load) tasks with less code. Key Features and Functionality:
Integration with AWS Services: awswrangler provides abstracted functions to connect pandas DataFrames with services like Amazon S3, AWS Glue, Amazon Athena, Amazon Redshift, Amazon Timestream, Amazon EMR, Amazon DynamoDB, Amazon CloudWatch Logs, and AWS Secrets Manager.
Simplified ETL: It streamlines loading and unloading data from data lakes, data warehouses, and databases, letting users focus on transformation logic using familiar pandas commands.
Data Formats: The library supports reading and writing various data formats to Amazon S3, including CSV, JSON, Parquet, Excel, and fixed-width files.
AWS Glue Catalog Interaction: It offers dedicated functions for interacting with metadata organized within the AWS Glue Catalog.
Querying and Data Retrieval: awswrangler can execute SQL queries on Amazon Athena to process large datasets in S3 and return the results as pandas DataFrames. It can also query data from services like Amazon Redshift and Amazon Timestream.
Lightweight Workloads: While not designed for petabyte-scale processing, it is efficient for lightweight to medium-sized workloads (e.g., up to roughly 5 GB of plain-text data), offering potential cost and speed advantages over distributed big data tools like Spark for these use cases.
Deployment Flexibility: It can be installed and utilized in various AWS environments, including AWS Lambda (as a layer), AWS Glue (Python Shell jobs), Amazon SageMaker Notebooks, and Amazon EMR.
Benefits:
Reduced Code Complexity: Abstracts away the complexities of interacting with individual AWS service APIs.
Increased Productivity: Allows data engineers and data scientists to leverage their existing pandas knowledge for AWS data operations.
Faster Development: Simplifies the development of data pipelines and reduces the time spent on data integration.
Cost-Effective for Lightweight Workloads: Offers an efficient solution for smaller-scale data processing needs.
GCP vs. AWS Commands
| Action | GCP Command (gcloud) | AWS Command (aws) |
| --- | --- | --- |
| Auth | `gcloud auth login` | `aws configure` |
| Check Identity | `gcloud config list` | `aws sts get-caller-identity` |
| List Buckets | `gcloud storage ls` | `aws s3 ls` |
| Copy File | `gcloud storage cp [src] [dst]` | `aws s3 cp [src] [dst]` |
| List VM Info | `gcloud compute instances list` | `aws ec2 describe-instances` |
| SSH | `gcloud compute ssh [name]` | `ssh -i key.pem user@ip` |