At Cobry, we love data. We’re always searching for new ways to use our data, and since Google Cloud Platform is the cloud of choice for everything from data storage to machine-learning analyses, we have the finest tools at our disposal to make our data dreams come to life.
So, how do we manage our data? And where does it even come from? You may not think that you have any data to process, but if you use even a single SaaS tool, chances are you have plenty to start analysing. Do you run adverts? That’s data. Are you selling things? Data! What about your support desk? That will be generating data too!
Nowadays, basically everything your enterprise does will generate data in some way. And if you’re not capitalising on that data, you’re leaving yourself in the dark. Modern businesses make data-driven decisions, and that is what enables them to succeed over their competitors. For example, when Tesco introduced the Clubcard scheme, they “learned more about [their] customers in three months than [they] had ever learned in thirty years”, enabling them to become the UK’s dominant supermarket.
With this in mind, I began working on a data pipeline for Cobry. Specifically, we were aiming to analyse our own support desk so as to better understand our clients’ needs and identify where we could go further to help them succeed. Our support desk tool of choice is Freshdesk, and so this is where our data journey began.
The Technical Considerations
Do I need streaming or batch data?
Data comes in many forms, and the type of data impacts how your data architecture will look. The two general forms of data for which you will create pipelines are streaming and batch data. Streaming data is a constant flow of real-time data, such as data generated by the stock market. Batch data, on the other hand, is retrieved periodically in large amounts, such as ‘every purchase made this month’.
Which mode of data you choose will depend on your ultimate goal. If you’re aiming to make weekly or monthly reports, then batch data is a good choice because it updates your data all at once at a specified frequency. However, if you need fresh or live data, a streaming pipeline is a better choice since the figures will update as soon as new data is generated.
How can I design with cloud principles in mind?
Like everything we do in the cloud, the goal is to make this pipeline secure, reliable and scalable. GCP’s managed services make this relatively easy: they automatically provision more resources as and when they are required, keeping costs to a minimum. This process is known as elastic scaling. These managed services also offer multi-regional redundancy, so the pipeline is sure to be reliable. We can also take advantage of other GCP services - such as Secret Manager - to keep usernames and passwords, API keys and other sensitive information encrypted and out of our code files.
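As a flavour of what that looks like in practice, here is a minimal sketch of reading an API key from Secret Manager rather than hard-coding it. It assumes the google-cloud-secret-manager client library, and the project and secret names are hypothetical, not our real configuration:

```python
# Sketch: fetch a credential from Secret Manager instead of storing it in code.
# Project and secret IDs below are hypothetical.

def secret_version_path(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Build the fully qualified resource name of a secret version."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/{version}"

def get_secret(project_id: str, secret_id: str) -> str:
    """Fetch and decode a secret's payload (makes a network call, so not run here)."""
    from google.cloud import secretmanager  # lazy import: only needed at runtime
    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(name=secret_version_path(project_id, secret_id))
    return response.payload.data.decode("utf-8")
```

The code that needs the credential then calls `get_secret()` at runtime, so nothing sensitive ever lands in a repository.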
For our Freshdesk pipeline, we decided that monthly reports were the goal, and so with these considerations in mind, I started to architect a batch data pipeline on GCP.
There are four main services used in this pipeline: Cloud Functions, Cloud Dataflow, BigQuery, and Looker. There are a few other services at play as well, such as Cloud Storage, which we use as a data lake, and Cloud Scheduler, which is used to set the frequency of the pipeline’s ETL (extract, transform, load) cycle. In fact, thanks to Google's fantastic code libraries and managed services, and Apache Beam’s high compatibility with Cloud Storage and BigQuery, the whole pipeline required as little as 250 lines of code!
Step 1: Retrieving Data with Cloud Functions
Cloud Functions is a fully managed environment on GCP for running short-lived workloads without any operational overhead. You can write scripts in some of the most popular programming languages, such as Python and Node.js, and trigger them in a number of ways: via HTTP(S), cron schedules, message queues, and certain events that take place in GCP. Cloud Functions is ideal for short, event-driven executions, making it perfect for running a small script once a day.
To begin extracting data from Freshdesk, I wrote a short Python script that connects to the Freshdesk API, retrieves all of the latest ticket changes and stores them in Cloud Storage. I then uploaded this script to Cloud Functions and trigger it with a cron job managed by Cloud Scheduler, which runs the code once a day. And just like that, we’ve got our data on the Google Cloud Platform!
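The extraction step can be sketched roughly as below. The Freshdesk domain, bucket name and authentication handling are illustrative stand-ins, not our production setup; Freshdesk’s API uses basic auth with the API key as the username:

```python
# Sketch of the extraction step: pull recently updated tickets from the
# Freshdesk API and drop them into a Cloud Storage bucket as dated JSON.
# Domain, bucket and credentials below are hypothetical.
import datetime
import json

FRESHDESK_DOMAIN = "example.freshdesk.com"  # hypothetical

def tickets_url(domain: str, updated_since: str) -> str:
    """Freshdesk's list-tickets endpoint, filtered to recent changes."""
    return f"https://{domain}/api/v2/tickets?updated_since={updated_since}&order_by=updated_at"

def blob_name(run_date: datetime.date) -> str:
    """One dated JSON object per daily run, e.g. tickets/2021-06-01.json."""
    return f"tickets/{run_date.isoformat()}.json"

def extract(request):
    """Cloud Function entry point (makes network calls, so not run here)."""
    import requests
    from google.cloud import storage
    since = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    resp = requests.get(tickets_url(FRESHDESK_DOMAIN, since), auth=("API_KEY", "X"))
    resp.raise_for_status()
    bucket = storage.Client().bucket("my-ticket-data-lake")  # hypothetical bucket
    bucket.blob(blob_name(datetime.date.today())).upload_from_string(
        json.dumps(resp.json()), content_type="application/json"
    )
    return "ok"
```

Cloud Scheduler then simply hits the function’s HTTP trigger once a day.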
Step 2: Transforming the Data with Cloud Dataflow
Now that our data has arrived in GCP, I have to convert it to a more easily analysed form. Currently, the data is unstructured in Cloud Storage. This is essentially acting as a data lake, where we can store all of our data in one place before moving it selectively to be used elsewhere.
Cloud Dataflow is a managed service for running Apache Beam workloads. You can create templates using Beam’s SDKs in various languages and then execute them on GCP in either streaming or batch mode. GCP also offers many pre-built templates, which means you often don’t need to write any code at all. I wrote my own template in Python, which takes the raw, unstructured JSON data we extracted into Cloud Storage and converts the values into BigQuery-compatible formats. It then appends the new data to a BigQuery table containing all of the previous ticket updates. With Dataflow, we’ve taken care of transforming and loading our data into our data warehouse (that’s BigQuery).
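The heart of a template like this is a pure transform that maps one raw ticket onto a flat, BigQuery-compatible row. The sketch below shows that idea with illustrative field names; the Beam wiring lives in `run()` and requires the apache-beam package:

```python
# Sketch of the Dataflow step: parse raw Freshdesk JSON and keep only the
# columns we load into BigQuery. Field and table names are hypothetical.
import json

def to_bq_row(raw: str) -> dict:
    """Parse one JSON-encoded ticket into a flat BigQuery row."""
    ticket = json.loads(raw)
    return {
        "ticket_id": ticket["id"],
        "status": ticket.get("status"),
        "updated_at": ticket.get("updated_at"),
    }

def run(argv=None):
    """Assemble the Beam pipeline (requires apache-beam; not run here)."""
    import apache_beam as beam
    with beam.Pipeline(argv=argv) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-ticket-data-lake/tickets/*.json")
            | "ToRow" >> beam.Map(to_bq_row)
            | "Load" >> beam.io.WriteToBigQuery(
                "my_project:helpdesk.ticket_updates",  # hypothetical table
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Because the transform is a plain function, it is easy to unit test before the pipeline ever touches Dataflow.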
Step 3: Creating a Data Warehouse with BigQuery
BigQuery is a relational database solution for storing large amounts of data ready to be processed. It’s highly available and can link to many other data analysis tools, from simple (but powerful) tools such as Google Sheets to full-scale data analysis platforms like Looker. BigQuery’s role in this pipeline is to act as a data warehouse. Initially, our data was unstructured and stored in Cloud Storage. This is good for raw data as it means we can do absolutely anything with it, but it also means that we can’t easily access or query it. Once we move the data into BigQuery, it becomes structured, and we can query it using SQL - in other words, our data is now ready to be analysed.
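To make that concrete, here is a hedged sketch of a monthly summary query run through BigQuery’s Python client. The dataset and column names are hypothetical, and the client call needs google-cloud-bigquery:

```python
# Sketch: once the data is structured in BigQuery, a monthly summary is one
# SQL query away. Table and column names below are hypothetical.

MONTHLY_TICKETS_SQL = """
SELECT FORMAT_DATE('%Y-%m', DATE(updated_at)) AS month,
       COUNT(DISTINCT ticket_id) AS tickets_touched
FROM `{table}`
GROUP BY month
ORDER BY month
"""

def monthly_tickets_query(table: str) -> str:
    """Fill in the fully qualified table, e.g. my_project.helpdesk.ticket_updates."""
    return MONTHLY_TICKETS_SQL.format(table=table)

def run_query(table: str):
    """Execute the query on BigQuery (makes a network call, so not run here)."""
    from google.cloud import bigquery
    client = bigquery.Client()
    return list(client.query(monthly_tickets_query(table)).result())
```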
Step 4: Visualising our Data with Looker
Now that we have our data in BigQuery, we can easily visualise it by connecting it to Looker. Looker is a powerful data analysis platform typically used for business intelligence reporting. Its goal is to create a unified data culture for businesses where data is clear and readily available to be analysed in new ways. Once I connected BigQuery to Looker, I was able to define data models using Looker’s intuitive LookML modelling language to refine the data in a manageable way.
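For a feel of what such a model looks like, here is a hypothetical LookML view over the BigQuery table - the field names are illustrative, not our production model:

```lookml
# Hypothetical LookML view over the helpdesk ticket table.
view: ticket_updates {
  sql_table_name: helpdesk.ticket_updates ;;

  dimension: ticket_id {
    primary_key: yes
    type: number
    sql: ${TABLE}.ticket_id ;;
  }

  dimension_group: updated {
    type: time
    timeframes: [date, week, month]
    sql: ${TABLE}.updated_at ;;
  }

  measure: tickets_touched {
    type: count_distinct
    sql: ${TABLE}.ticket_id ;;
  }
}
```

Once a view like this exists, anyone in the business can slice ticket activity by week or month without writing SQL.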
The Results and Business Outcomes
Our managers and directors now have a clear picture of our helpdesk performance, which we can pull up in performance reviews, QBRs and forecasts.
Our helpdesk staff can now generate reports on key statistics through Looker, creating reports themselves using the same data as the rest of the company. Our technical team can also change the definitions of these metrics as business needs evolve.
Potential for additional services
Our customers come first, and we aim to be as transparent as possible. Having this helpdesk information in a system that allows for better reporting means that we can potentially share ticketing dashboards externally with our customers. This way, they can keep tabs on how much help their employees and colleagues get from us.
With the pipeline complete, I can create brilliant Looker dashboards and reports. They can be automatically emailed to stakeholders periodically, so we’ve been able to realise our data dreams using GCP and Looker!
Make your data dreams come true
To find out how you could make your data go further, get in touch with Cobry. Our team of experts will blend their deep knowledge of GCP’s data tools with your organisation’s needs. We’ll get your data infrastructure up and running, and you’ll be able to start making data-driven decisions in no time.
Drop your email and we’ll get back to you ASAP!