Set up an ETL Data Pipeline and Workflow Using Python & Google Cloud Platform (COVID-19 Dashboard)
8 September 2020
Figure: Landing page of the COVID-19 Dashboard
Google BigQuery
Google Cloud Storage
Python
Google Cloud Platform
JSON API
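The post doesn't include the pipeline code itself, but a minimal sketch of the extract-and-load flow with the stack above might look like the following. The API URL, bucket name, and table ID are placeholders for illustration, not the project's actual configuration.

```python
import json

import requests
from google.cloud import bigquery, storage

# Placeholder names for this sketch; the real source, bucket, and table may differ.
API_URL = "https://api.covidtracking.com/v1/states/daily.json"  # hypothetical JSON API
BUCKET_NAME = "covid19-dashboard-staging"                        # hypothetical bucket
TABLE_ID = "my-project.covid19.states_daily"                     # hypothetical table


def extract_to_gcs() -> str:
    """Pull the JSON API response and stage it in Cloud Storage as newline-delimited JSON."""
    records = requests.get(API_URL, timeout=60).json()
    ndjson = "\n".join(json.dumps(row) for row in records)
    blob = storage.Client().bucket(BUCKET_NAME).blob("raw/states_daily.json")
    blob.upload_from_string(ndjson, content_type="application/json")
    return f"gs://{BUCKET_NAME}/raw/states_daily.json"


def load_to_bigquery(uri: str) -> None:
    """Load the staged file into BigQuery, replacing the previous snapshot."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the schema for this sketch
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()


if __name__ == "__main__":
    load_to_bigquery(extract_to_gcs())
```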
Lessons Learned:
- Do your best to see what kinds of data are out there, but don't get hung up on trying to incorporate all of them
- Process optimization comes with experience, so don't sweat it if you later find that what used to take half an hour can now take five minutes
- Data visualization should be user-friendly, so revise the back-end data and tables based on user feedback and keep the interface self-explanatory
- Large amounts of data can increase loading time (page 2 of the report), so tables and queries need to be optimized
- Table structures and schemas matter for blending data and need to be designed before they are incorporated into the workflow, with a lot of deleting and recreating of tables along the way (see the schema sketch after this list)
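The last two lessons come together in how the BigQuery tables are defined. A hedged sketch, assuming a simple cases-by-state table: declaring the schema up front and partitioning/clustering the table limits how much data each dashboard query scans, which is one way to attack slow report pages. The project, table, and column names here are assumptions, not the dashboard's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and columns; the real dashboard schema may differ.
table = bigquery.Table(
    "my-project.covid19.cases_by_state",
    schema=[
        bigquery.SchemaField("report_date", "DATE", mode="REQUIRED"),
        bigquery.SchemaField("state", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("confirmed", "INTEGER"),
        bigquery.SchemaField("deaths", "INTEGER"),
    ],
)

# Partitioning by date and clustering by state means dashboard queries only scan
# the slices they need instead of the whole table.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="report_date"
)
table.clustering_fields = ["state"]

client.create_table(table, exists_ok=True)
```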
Next Steps:
- With an understanding of the ETL pipeline in place, keep optimizing it
- Incorporate state-specific data such as government measures and business reopenings
- Look into models other than ARIMA and evaluate their strengths and weaknesses (a baseline ARIMA sketch follows this list)
- Build an ML model that incorporates all of the data relevant to the research
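The post doesn't show the forecasting code, but a baseline ARIMA fit with statsmodels, of the kind the alternative models would be compared against, could look like the sketch below. The function name, model order, and horizon are placeholders rather than the dashboard's tuned model.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA


def forecast_cases(daily_cases: pd.Series, horizon: int = 14) -> pd.Series:
    """Fit a baseline ARIMA model and forecast the next `horizon` days.

    `daily_cases` is assumed to be a date-indexed series of new confirmed cases;
    the (2, 1, 2) order is a placeholder, not a tuned choice.
    """
    fitted = ARIMA(daily_cases, order=(2, 1, 2)).fit()
    return fitted.forecast(steps=horizon)
```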
machine learning
etl
cloud computing
google cloud platform
bigquery