GCP Data Engineering Fundamentals: Quick Reference Notes - Part 1
1. What challenges do data engineers face today?
- Migrating existing workloads
- Analyzing large datasets AT SCALE!
- Building scalable data processing pipelines to enable faster insight derivation
- Applying ML to the data to automate insight derivation
2. What are the four fundamental aspects of GCP's core infrastructure?
- Security, the base layer
- Compute, Storage, Networking (the other three layers, which process, store, and deliver business insights, data pipelines, and ML models)
3. What does GCP's top layer consist of?
- In the context of data engineering, the NoOps service layer of big data and ML products
4. What is a GCP project?
- A project is the base-level logical organization of GCP resources and services, used for managing billing, APIs, and permissions.
5. What are Zones and Regions used for?
- Zones and Regions physically contain the resources and services that a project uses.
6. What are folders?
- Folders are used to logically group collections of projects. Folders can be used only within an organization.
7. What is an Organization? Why is it required?
- The Organization is the root node of the GCP hierarchy. It can contain folders that logically group collections of projects. An Organization can also have projects directly under it.
- An Organization lets you create policies at the enterprise level. Those policies apply automatically to all folders and projects under the Organization.
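The inheritance described in answers 6 and 7 can be pictured with a toy model. This is only a sketch: the Node class, the node names, and the policy strings are hypothetical and are not part of any GCP API. It assumes a node's effective policies are the union of its own policies and those of its ancestors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A node in a toy resource hierarchy: organization, folder, or project."""
    name: str
    parent: Optional["Node"] = None
    policies: set = field(default_factory=set)  # policies set directly on this node

    def effective_policies(self) -> set:
        """Union of this node's policies and everything inherited from ancestors."""
        inherited = self.parent.effective_policies() if self.parent else set()
        return inherited | self.policies


# Hypothetical hierarchy: one organization, one folder, two projects.
org = Node("example-org", policies={"require-audit-logging"})
folder = Node("analytics-folder", parent=org, policies={"restrict-regions"})
project_in_folder = Node("pipeline-project", parent=folder)
project_under_org = Node("sandbox-project", parent=org)

print(sorted(project_in_folder.effective_policies()))
print(sorted(project_under_org.effective_policies()))
```

The project inside the folder picks up both the organization-level and the folder-level policy, while the project placed directly under the organization picks up only the organization-level one.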
8. What is IAM?
- IAM is short for Identity and Access Management. It lets you fine-tune access control to the GCP resources that a project uses.
9. What is the name of GCP's network?
- Jupiter network
10. Why is there no need to do processing on a single machine or a cluster of machines with dedicated storage on GCP?
- The GCP network can deliver enough bandwidth for 100,000 machines to communicate with each other in the data center at 10 Gbps full-duplex. This implies that data locality within the cluster is not important.
11. What are Edge points of presence?
- GCP interconnects with the public internet at more than 90 internet exchanges and more than 100 points of presence worldwide.
- When an internet user sends traffic to a GCP resource, GCP responds to the request from an Edge location to achieve the lowest latency.
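As a rough check of what the figures in answer 10 add up to, the arithmetic can be written out. This is a back-of-the-envelope sketch that assumes all 100,000 machines send at the full 10 Gbps at the same time and uses decimal units (1 Pbps = 1,000,000 Gbps).

```python
machines = 100_000          # machine count quoted in the notes
per_machine_gbps = 10       # per-machine rate quoted in the notes

# Aggregate bandwidth if every machine sends at full rate simultaneously.
total_gbps = machines * per_machine_gbps
total_pbps = total_gbps / 1_000_000  # 1 Pbps = 1,000,000 Gbps (decimal units)

print(f"{total_gbps:,} Gbps = {total_pbps} Pbps")  # prints "1,000,000 Gbps = 1.0 Pbps"
```

Aggregate bandwidth on the order of a petabit per second is what makes it reasonable to keep storage separate from compute.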
12. What are the different dimensions of security that one needs to consider?
- Hardware level security
- Storage level security
- Physical Network security
- Audit logging
- OS level security
- Network level security
- Access and Authentication
- Operations
- Identity
- Web application security
- Deployment
- Usage
- Access policies
- Content level
13. Which security dimensions does the customer typically need to take care of?
- Securing the data (content)
- Creating the access policies using IAM
- Choosing type and level of data encryption
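14. Is it possible to limit access to data at the row and column level in BigQuery?
- Yes. BigQuery supports row-level security (row access policies) and column-level security (policy tags).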
15. What are the issues with big data?
- Large data sets
- Fast changing data
- Varied data
16. What is GFS?
- GFS, the Google File System, was created by Google to handle sharding and storing petabytes of data at scale. It is the foundation for Google Cloud Storage and for BigQuery managed storage.
17. What is MapReduce?
- MapReduce is a programming model for processing large data sets in parallel across large clusters of commodity servers.
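To make the map and reduce steps concrete, here is a minimal single-process word count. It is a sketch of the programming model only: the function names and sample documents are made up for illustration, and a real framework would run the map and reduce tasks on many servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits; in a cluster each would go to a different worker.
documents = ["big data at scale", "data pipelines at scale"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call sees only its own document and each reduce call sees only one key, both phases can be spread across machines without the tasks needing to talk to each other.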
18. What were the issues with Hadoop?
- Developers had to write code to manage all of the infrastructure of commodity servers and could not focus on application logic.
19. What was the inspiration for HBase or MongoDB?
- Bigtable, which Google built out of the need to record and retrieve millions of streaming user actions with high throughput.
20. What is Dremel's approach to data processing?
- Dremel breaks data into small chunks called shards and compresses them into a columnar format across distributed storage. It then uses a query optimizer to farm out tasks across the many shards of data and data centers full of commodity hardware, so that the query is processed in parallel and the results are delivered.
21. What was the main differentiator of Dremel?
- The service automatically manages data imbalances and communication between workers, and autoscales to meet query demands.
22. What is BigQuery's query engine?
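The shard-then-aggregate idea in answer 20 can be illustrated with a small sketch. It is not how Dremel is implemented: the sample rows, function names, and shard size are hypothetical, compression and the query optimizer are left out, and a local thread pool stands in for distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Row-oriented input (hypothetical sample data).
rows = [
    {"country": "US", "bytes": 120},
    {"country": "IN", "bytes": 80},
    {"country": "US", "bytes": 200},
    {"country": "DE", "bytes": 50},
    {"country": "IN", "bytes": 70},
]


def to_columnar_shards(rows, shard_size):
    """Split rows into shards and store each shard column by column."""
    shards = []
    for start in range(0, len(rows), shard_size):
        chunk = rows[start:start + shard_size]
        shards.append({col: [r[col] for r in chunk] for col in chunk[0]})
    return shards


def partial_sum(shard):
    """Worker task: sum bytes per country using only the two needed columns."""
    totals = {}
    for country, nbytes in zip(shard["country"], shard["bytes"]):
        totals[country] = totals.get(country, 0) + nbytes
    return totals


def merge(partials):
    """Combine per-shard partial results into the final answer."""
    result = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = result.get(key, 0) + value
    return result


shards = to_columnar_shards(rows, shard_size=2)
with ThreadPoolExecutor() as pool:  # stand-in for many distributed workers
    totals = merge(pool.map(partial_sum, shards))
print(totals)
```

Each worker touches only its own shard and only the columns the query needs, and the partial results are merged at the end, which is the general shape of the parallel processing the answer describes.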
- Dremel
23. What is the planet-scale relational database in GCP?
- Spanner
24. What is the messaging service in GCP?
- Pub/Sub
Source: GCP Documentation, Coursera notes