Set up an ETL Data Pipeline and Workflow Using Python & Google Cloud Platform (COVID-19 Dashboard)
8 September 2020
Figure: Landing page of the COVID-19 Dashboard
Google BigQuery
Google Cloud Storage
Python
Google Cloud Platform
JSON API
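The post doesn't include the pipeline code itself, but a minimal sketch of the extract-and-load flow with the stack above might look like the following. The API URL, bucket name, and table ID are placeholders for illustration, not the project's actual configuration.

```python
import json

import requests
from google.cloud import bigquery, storage

# Placeholder names for this sketch; the real source, bucket, and table may differ.
API_URL = "https://api.covidtracking.com/v1/states/daily.json"  # hypothetical JSON API
BUCKET_NAME = "covid19-dashboard-staging"                        # hypothetical bucket
TABLE_ID = "my-project.covid19.states_daily"                     # hypothetical table


def extract_to_gcs() -> str:
    """Pull the JSON API response and stage it in Cloud Storage as newline-delimited JSON."""
    records = requests.get(API_URL, timeout=60).json()
    ndjson = "\n".join(json.dumps(row) for row in records)
    blob = storage.Client().bucket(BUCKET_NAME).blob("raw/states_daily.json")
    blob.upload_from_string(ndjson, content_type="application/json")
    return f"gs://{BUCKET_NAME}/raw/states_daily.json"


def load_to_bigquery(uri: str) -> None:
    """Load the staged file into BigQuery, replacing the previous snapshot."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # let BigQuery infer the schema for this sketch
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()


if __name__ == "__main__":
    load_to_bigquery(extract_to_gcs())
```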
Lessons Learned:
- Do your best to see what kinds of data are out there, but don't get hung up on trying to incorporate all of them
- Process optimization comes with experience, so don't sweat it if you later find that what used to take half an hour can now take five minutes
- Data visualization should be user-friendly, so revise the back-end data and tables based on user feedback and keep the interface self-explanatory
- Large amounts of data can increase loading time (page 2 of the report), so tables and queries need to be optimized
- Table structures and schemas matter for blending data and need to be designed before they are incorporated into the workflow, with a lot of deleting and recreating of tables along the way (see the schema sketch after this list)
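The last two lessons come together in how the BigQuery tables are defined. A hedged sketch, assuming a simple cases-by-state table: declaring the schema up front and partitioning/clustering the table limits how much data each dashboard query scans, which is one way to attack slow report pages. The project, table, and column names here are assumptions, not the dashboard's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and columns; the real dashboard schema may differ.
table = bigquery.Table(
    "my-project.covid19.cases_by_state",
    schema=[
        bigquery.SchemaField("report_date", "DATE", mode="REQUIRED"),
        bigquery.SchemaField("state", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("confirmed", "INTEGER"),
        bigquery.SchemaField("deaths", "INTEGER"),
    ],
)

# Partitioning by date and clustering by state means dashboard queries only scan
# the slices they need instead of the whole table.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="report_date"
)
table.clustering_fields = ["state"]

client.create_table(table, exists_ok=True)
```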
Next Steps:
- With an understanding of the ETL pipeline in place, keep optimizing it
- Incorporate state-specific data such as government measures and business reopenings
- Look into models other than ARIMA and evaluate their strengths and weaknesses (a baseline ARIMA sketch follows this list)
- Build an ML model that incorporates all of the data relevant to the research
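The post doesn't show the forecasting code, but a baseline ARIMA fit with statsmodels, of the kind the alternative models would be compared against, could look like the sketch below. The function name, model order, and horizon are placeholders rather than the dashboard's tuned model.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA


def forecast_cases(daily_cases: pd.Series, horizon: int = 14) -> pd.Series:
    """Fit a baseline ARIMA model and forecast the next `horizon` days.

    `daily_cases` is assumed to be a date-indexed series of new confirmed cases;
    the (2, 1, 2) order is a placeholder, not a tuned choice.
    """
    fitted = ARIMA(daily_cases, order=(2, 1, 2)).fit()
    return fitted.forecast(steps=horizon)
```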
machine learning
etl
cloud computing
google cloud platform
bigquery