BigQuery at the centre
To gather insights across multiple sources, the data obtained from each source had to be consolidated and analysed in one place. We decided to use GCP's enterprise data warehouse, BigQuery, since it is highly scalable, offers strong price/performance, and let us derive insights across these datasets quickly, as the sketch below illustrates.
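For illustration only, assuming hypothetical datasets named source_a and source_b in a project named my-project, a cross-source join of this kind can be run from Python with the official BigQuery client:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials

# Hypothetical tables: contacts from one acquired source joined with
# engagement data from another, keyed on email address.
query = """
SELECT s1.email, s1.company, s2.last_campaign_response
FROM `my-project.source_a.contacts` AS s1
JOIN `my-project.source_b.engagement` AS s2
  ON s1.email = s2.email
WHERE s2.last_campaign_response >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
"""

for row in client.query(query).result():
    print(row.email, row.company)
```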
Ingestion, enrichment, analysis:
An ingestion pipeline was built on Compute Engine, which hosts open-source Talend along with native Python scripts for staging the updated data and merging it with the existing datasets; historical archiving is also provided. After each dataset is analysed, it is standardised and made consistent with the business rules in a BigQuery table. In addition, the country, state and city attributes are enriched and standardised using the Google Geocoding API. Once standardised, the datasets are written back to Cloud SQL for use by the Datalist front-end CRM tool. To deliver faster query results, only the active records are made available in the BigQuery table that the customer queries to fetch the latest datasets for their weekly marketing campaigns. A sketch of the enrichment and merge steps follows.
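As an illustrative sketch rather than the production pipeline itself, the geocoding enrichment and the merge of staged rows into the master table might look like this in Python. The project, dataset and table names, the contact_id key, and the staging/master split are all hypothetical:

```python
import requests
from google.cloud import bigquery

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def standardise_location(raw_address: str, api_key: str) -> dict:
    """Resolve a free-text address into canonical country/state/city names
    using the Google Geocoding API."""
    resp = requests.get(GEOCODE_URL, params={"address": raw_address, "key": api_key})
    resp.raise_for_status()
    results = resp.json().get("results", [])
    parts = {"country": None, "state": None, "city": None}
    if results:
        for comp in results[0]["address_components"]:
            if "country" in comp["types"]:
                parts["country"] = comp["long_name"]
            elif "administrative_area_level_1" in comp["types"]:
                parts["state"] = comp["long_name"]
            elif "locality" in comp["types"]:
                parts["city"] = comp["long_name"]
    return parts

# Merge freshly staged rows into the master table, assuming the staging and
# master tables share a schema (required for the bare INSERT ROW form).
MERGE_SQL = """
MERGE `my-project.crm.contacts_master` AS m
USING `my-project.crm.contacts_staging` AS s
ON m.contact_id = s.contact_id
WHEN MATCHED THEN
  UPDATE SET m.country = s.country, m.state = s.state,
             m.city = s.city, m.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT ROW
"""
bigquery.Client().query(MERGE_SQL).result()
```

Serving only the active records can then be as simple as a filtered table or view (for example, a `WHERE status = 'active'` predicate) refreshed by the same pipeline run.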
The platform design:
The Datalist platform was designed using the following GCP components:
- Compute Engine
- Cloud SQL (MySQL)
- Cloud Storage
- Cloud Source Repository
- Cloud IAM
Security and privacy are paramount when dealing with contact details. Access to GCP resources was restricted through IAM, with clear lines of delineation between roles that grant access to data and roles that grant access to compute resources. A sketch of such a policy follows.
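As a minimal sketch of that delineation, assuming hypothetical groups and a project named my-project, the bindings could be expressed as an IAM policy document. The roles shown are real predefined GCP roles; everything else is illustrative:

```python
import json

# Hypothetical policy: analysts may read and query data but not manage
# infrastructure; pipeline operators may manage VMs but not read data.
policy = {
    "bindings": [
        {
            "role": "roles/bigquery.dataViewer",
            "members": ["group:data-analysts@example.com"],
        },
        {
            "role": "roles/bigquery.jobUser",
            "members": ["group:data-analysts@example.com"],
        },
        {
            "role": "roles/compute.instanceAdmin.v1",
            "members": ["group:pipeline-operators@example.com"],
        },
    ]
}

# Written to disk, this could be applied (replacing the existing policy) with:
#   gcloud projects set-iam-policy my-project policy.json
with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)
```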