Pulling Data Via An API In Python

What Is an API?

An API, or Application Programming Interface, is a powerful tool that enables organizations to access and interact with data stored in various locations in real time. In other words, APIs allow different computer programs to communicate with each other seamlessly. For example, your business may have accounting software, a CRM, a time tracking application, a payroll system, and an inventory tracking app. Likely, you’ll want to pull your organization’s data from all of these sources and bring it together into one place so that you can do more advanced analytics. You may want to answer questions like “What is the revenue per employee per project?” (accounting + time tracking) or “What is my average sale amount by industry?” (accounting + CRM). APIs allow quick, accurate access to the data stored in these various places.

Why APIs Matter

Two key features that distinguish exceptional data strategies from average ones are automation and real-time analytics. APIs deliver both of these crucial features.

Automation: APIs allow programmers to write scripts that automatically retrieve data and integrate it into your organization’s reports, dashboards, applications, and algorithms. This level of automation streamlines processes, eliminating the need for manual data entry, saving time, and ensuring accuracy.

Real-Time Analytics: APIs offer real-time data pulling, ensuring that reports, dashboards, and algorithms always update as new data flows into your data source. This instantaneous access to data empowers organizations to make decisions based on the most current information.

Taking Advantage of APIs for your Organization

Boxplot can pull your data from the various apps that your business uses. From there, we can visualize data, write data science algorithms, or automate business processes. Set up a call with us to chat about your project. 

If you want to follow an example of implementing an API in Python, keep reading below.

Pulling Data from an API in Python

In this blog post, we’ll focus on pulling data from an API using Python, but it’s important to note that various methods are available for data retrieval, including R, VBA (Excel), Google Sheets, Tableau, PowerBI, JavaScript, and more. Additionally, while we’re discussing data pulling in this post, APIs can also be used to push (insert) data into a database, although we won’t cover that here.

Every API is unique, and mastering any specific API takes time. The example in this post focuses on a basic API to illustrate the main concepts.

Getting Started: Pulling Data in Python

Let’s do a simple example of pulling data in Python using an API. In this example, we’ll be pulling unemployment data from the Federal Reserve Bank of St. Louis (FRED) website using their API; if you want to follow along, the URL of the dataset is accessible here, and general information on the API is here.

In order to pull data from an API in Python, you first must obtain credentials. If you’re using a paid product’s API, it’s likely that you’ll be provided with credentials upon purchasing access (usually a username/password combination). But for a free, public API ―such as the one in this tutorial― you’ll usually get an API key, which in this case is a 32-character alphanumeric string. To get an API key from FRED, first create a FRED account for yourself (it’s free) and then generate a key by visiting this page. Then, fire up your preferred Python IDE and copy and paste the API key as a string variable into a new Python program; you will need that key in a second.

Now let’s code. The first thing you’ll need to do is import ‘json’:

As a brief aside, JSON (JavaScript Object Notation) is a data format that stores key-value pairs in a way that is easy for humans to read, similar to a Python dictionary (in fact, one step of using the API is converting JSON data to a Python dictionary, as we’ll cover later on). The data comes back from the API in JSON format, which is why we import the json module here.

We also need to import a package called ‘requests’. When you pull data from an API, what’s going on under the hood is that your program is making a request to the API’s server for the specified data set, and this ‘requests’ package allows us to do just that.
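
A minimal sketch of those two imports looks like this:

    import json      # for working with JSON-formatted data
    import requests  # for sending the HTTP request to the API's server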

Next, you must build the URL that you’ll use to access data series. In the FRED API, the general form of this URL is:
https://api.stlouisfed.org/fred/series?series_id=GNPCA&api_key=abcdefghijklmnopqrstuvwxyz123456&file_type=json
but it will be different for every API. To customize this URL for your usage, you must first replace ‘GNPCA’ with the series ID of the series you’re interested in pulling. In our example, the series ID is ‘UNRATE’, but you can find the series ID of any FRED data series by going to its page and looking at the code in parentheses next to the series name:

Then, replace the API key in the URL (abcdefghijklmnopqrstuvwxyz123456) with your own API key. Note that the final parameter in this URL is file_type=json, which allows us to pull the data series into json formatting. And that’s it; overall, if my API key were ‘abcdefghijklmnopqrstuvwxyz123456’, my completed URL would be:

https://api.stlouisfed.org/fred/series?series_id=UNRATE&api_key=abcdefghijklmnopqrstuvwxyz123456&file_type=json

What do you do with this URL now that it’s completed? This is where the requests package comes in. The requests package has a get() method that sends the request to the API’s server and retrieves the response. Save the output of get() to a variable; calling the json() method on that variable converts the response into JSON-style data (essentially a Python dictionary) that you can display and work with. So, overall, our code to pull data from the API into Python and show the output looks like this:
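
Here is a minimal sketch of that code, using the key and resp variable names referenced below and the URL we built above for the UNRATE series:

    import requests

    # paste your own 32-character FRED API key here
    key = 'abcdefghijklmnopqrstuvwxyz123456'

    # build the request URL for the UNRATE series, asking for JSON output
    url = 'https://api.stlouisfed.org/fred/series?series_id=UNRATE&api_key=' + key + '&file_type=json'

    # make the request to the API's server and display the parsed JSON response
    resp = requests.get(url)
    print(resp.json())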

(of course, remember to swap in your own API key for the ‘key’ variable).

From there, this json object is queryable and iterable just like a regular Python dictionary. For example:
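
Here are a couple of ways you might inspect the parsed response; the exact keys you see will depend on the endpoint you called:

    data = resp.json()

    # list the top-level keys of the response
    print(data.keys())

    # loop over key-value pairs just like a regular dictionary
    for key_name, value in data.items():
        print(key_name, value)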

Inspecting the output of resp.json() in this way shows you which of its keys contains the actual data series.

If you build a report, dashboard, or algorithm on top of the data extraction method covered in this blog post, new unemployment data will flow through automatically the next time the API query you wrote is called.

How B2B organizations can use their data

B2B organizations possess a unique advantage when it comes to leveraging data. Data on their institutional clients is often readily available. Some of the most valuable data-driven outcomes for B2B organizations include sector analytics, both retrospective and predictive, prospect analytics, and automated data cleaning. In this blog post, we’ll explore how B2B organizations can gain a competitive edge through superior data analytics.

Customer Stratification

Customer stratification is a pivotal strategy in business that involves segmenting a customer base into distinct categories based on various criteria, such as purchasing behavior, demographics, or customer value. This approach allows organizations to tailor their marketing and service efforts to different customer groups, ensuring more personalized and targeted interactions. By stratifying customers, businesses can identify high-value clients deserving of special attention and low-value customers who may benefit from re-engagement efforts. This segmentation not only enhances customer satisfaction but also maximizes the effectiveness of marketing and sales initiatives, ultimately driving business growth. Customer stratification serves as a valuable tool for optimizing resource allocation and fostering long-lasting customer relationships. Boxplot and Empirical Consulting Solutions have partnered to offer a robust and proven customer stratification service. You can find more details here.

Product Data

B2B organizations, regardless of their industry, can harness the data they collect about their products or services to make more informed decisions and drive continuous improvement. Take, for example, a business offering phone services to other businesses; by analyzing usage patterns and call data, they can identify peak call hours, call drop rates, and customer preferences, allowing them to optimize service quality and pricing structures. Software companies can leverage customer feedback and usage metrics to refine their software’s features, ensuring it aligns with market demands. Similarly, firms supplying machines to manufacturers can monitor machine performance data to predict maintenance needs, reduce downtime, and enhance product reliability. In each case, data-driven insights empower B2B organizations to adapt and innovate, resulting in better service delivery, enhanced products, and ultimately, more satisfied clients and increased competitiveness in the market.

Sector Analytics

Sector analytics involves analyzing the composition of your client base using known data on these businesses. You’ll address questions like: How much of your revenue came from foreign clients last year? What was the revenue breakdown by industry? To what extent did you engage with large corporate clients versus smaller ones last quarter? These insights are invaluable and often lead to the creation of reports and dashboards that aggregate your organization’s data into concise, visual displays. Properly constructed, these reports can be configured for automated updating, creating a hands-free data ecosystem. Given the unique insights generated by sector analytics, it has become a standard practice among data-savvy B2B organizations.

Sector Predictive Analytics

Sector predictive analytics is an extension of sector analytics, where new or estimated data is generated. For instance, you can use known data on past sales to predict future sales to corporate clients. Data scientists often employ machine learning to build predictive analytics solutions, which optimize the process of extrapolation for accuracy and automation. Predictive analytics isn’t limited to forecasting; it can be used to estimate hypothetical scenarios, retrospectively evaluate decisions, and plan for the future. What sets predictive analytics apart is its ability to generate new, previously unknown information.

Prospect Analytics & Data-Oriented Lead Conversion

Data analytics is instrumental in identifying which of your sales leads are more likely to become clients. It also helps improve lead conversion rates, particularly for prospects who initially seem less likely to convert. B2B organizations often deal with a few large clients at a time, making the conversion of a higher number of clients crucial. Robust, data-driven lead conversion schemes are invaluable. While gathering data from various sources can be challenging, a skilled data analyst can help construct an industry-leading lead conversion strategy.

Automated Data Cleaning

Data quality significantly impacts the effectiveness of your data analytics. B2B organizations serving large corporate clients, with vast databases and numerous users, often struggle to maintain data quality. Modern data analytics tools, such as Excel and Python, allow for partial or full automation of data cleaning. Automated data cleaning not only saves time but also ensures the high quality of the resulting data, reinforcing the quality of the analytics they drive. Small to mid-sized B2B organizations often engage external data analytics experts to set up automated data cleaning processes due to their critical nature.

Superior Data Analytics for Your B2B Organization

Whether you seek to analyze your clients, enhance your lead conversion pipeline, or improve data cleanliness, superior data analytics is your path to a more advantageous future. If you’re unsure where to start, contact Boxplot. We’ve assisted numerous B2B organizations in realizing their data-oriented goals, regardless of their prior experience with data. Discover why leading B2B organizations have adopted a data-oriented strategy. The future is data-driven, and it’s time you experience its benefits. Contact Boxplot to embark on this transformative journey.

A/B Testing Example (Two Mean Hypothesis Test)

A/B testing (sometimes called split testing) is the practice of comparing two versions of a web page, email newsletter, or some other digital content to see which one performs better. A company will compare two web pages by showing the two variants (let’s call them A and B) to similar visitors at the same time. Sometimes, the company wants to see which page leads to a higher average purchase size (the amount the average user spends on your products per site visit); in that case, the page with the higher average purchase size wins.

In many cases, A/B testing is included in the software you are using. But in case it’s not, or in case you want to understand the math behind the scenes, this article goes through how A/B testing works. 

Let’s say you work for an ecommerce company that is trying to improve its average purchase size for online sales. To accomplish this task, your company has built two different improved websites; you’ve now been tasked with determining, in a data-driven way, which of the two websites is superior in terms of average purchase size.

Step 1: Collect Data

You monitor each of the two candidate sites for one month and collect data on the purchase amount of 100 randomly-selected purchases each day for each site. You end up with 3,100 samples from each site; the first site sees $128,000 of total purchases, while the second sees $117,000 of total purchases. This translates to an average purchase of $41.29 for the first site and $37.74 for the second. Furthermore, let’s suppose that the first site sees a standard deviation of $22 within the sample of data you collected for it while the second site sees a standard deviation of $21 within its sample.

Step 2: Choosing A Test

From the measured average purchase of the two sites, you cannot necessarily conclude that the first site is the better option, even though its average purchase price is almost $4 higher than the second site. Instead, you need to use a hypothesis test to determine the level of confidence with which you can conclude that. Choosing a statistical test to determine this level of confidence can sometimes be the most difficult part of a statistical analysis! Different test statistics (T, Z, F, etc.) are used for different types of data. Use the Statistics Cheat Sheet for Dummies chart or other related sites like StatTrek to help you choose the right test based on your sample. In this case, since you are trying to test whether one sample mean is higher than a different sample mean (specifically, whether the mean purchase size of the first site is higher than that of the second site) and you don’t know the population standard deviations (these would only be accessible to you if you had measured the size of every single purchase between the two sites, not just a sample of them), the correct test is the Difference of Two Means test with a T-test statistic.

Step 3: Pick A Confidence Level

Almost everyone chooses 95%. If you choose less than that, people may look at you funny or like you have something to hide! Of course there may be appropriate uses for confidence levels less than 95% but it’s not common. If you’re testing something super important, like the safety of airplane parts, you want a confidence level much higher than 95%! Probably like 99.99999% or more! In this case, we’ll stick with 95%.

Step 4: Null And Alternative Hypotheses

In this case you should use a single-tailed T-test because you believe at this point that specifically the first site is outperforming the second, as opposed to simply believing that one of the sites (you don’t know which) is outperforming the other; a two-tailed T-test would be more appropriate if you were trying to determine whether the average purchase sizes of the two sites are merely different, not that one is strictly greater than the other. Thus, you can define the null hypothesis as H0: the first site’s mean purchase size is no higher than the second’s, and the alternative hypothesis as Ha: the first site’s mean purchase size is higher than the second’s.

Step 5: Calculating The T-Score

You now have the sample sizes (3,100 purchases per site), sample means ($41.29 and $37.74), and sample standard deviations ($22 and $21), which means you can plug these figures into the two-sample T-score formula, as sketched below.
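
A minimal sketch of that calculation in Python, using the sample figures above and the standard unpooled two-sample T-score formula:

    import math

    n1, n2 = 3100, 3100      # sample sizes
    x1, x2 = 41.29, 37.74    # sample means (average purchase size)
    s1, s2 = 22.0, 21.0      # sample standard deviations

    # T = (difference in sample means) / (standard error of that difference)
    t = (x1 - x2) / math.sqrt(s1**2 / n1 + s2**2 / n2)
    print(round(t, 2))       # roughly 6.5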

Step 6: Calculating The P-Value

The T-score is very high, which means it’s highly likely that you have enough evidence to reject the null hypothesis and conclude with an extremely high level of confidence (well above 95%) that the first site is outperforming the second. In fact, you have so many degrees of freedom in this test that most T tables won’t even show the exact confidence level you can have in your conclusion. However, a p-value calculator shows that with as many degrees of freedom as there are in a single-tailed T-test like this one, a T-score of about 1.65 or higher would be sufficient to reject the null hypothesis at the 95% confidence level. In other words, any T-score of 1.65 or higher shows that the first sample mean is far enough above the second sample mean, given how volatile the individual measurements are relative to those means (indicated by the sample standard deviations) and the size of each sample, to conclude with at least 95% certainty that the first site is in fact superior.
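
If you would rather compute the exact p-value in code than consult a T table, here is a minimal sketch using scipy, with the T-score from Step 5 and the standard pooled degrees of freedom for a two-sample test:

    from scipy import stats

    t = 6.5                   # T-score calculated in Step 5
    df = 3100 + 3100 - 2      # degrees of freedom for a two-sample test

    # one-tailed p-value: the chance of a T-score this large if the null hypothesis were true
    print(stats.t.sf(t, df))  # effectively zero, so reject the null hypothesis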

If your organization is struggling to implement or interpret tests like these, contact us.

What Is Python, And Why Is It Awesome?

What is Python?

Python is a relatively intuitive, general-purpose programming language that allows users to do anything from analyzing data to building websites. In the context of data, it allows for complete flexibility and customizability when performing analyses. Here are some good resources that dive deeper into what Python is if you are interested:

https://www.coursera.org/articles/what-is-python-used-for-a-beginners-guide-to-using-python

https://www.geeksforgeeks.org/history-of-python/

Why is it great?

When a client lets us choose our tools, we always choose Python. It’s the best in three main areas:

  1. Flexibility/Customizability. If an analysis is possible in a general sense, it’s possible to conduct that analysis in Python. In other words, in the data analytics world, if you can’t do it in Python, you can’t do it. Python offers countless libraries in data visualization, data processing, statistics, and more, allowing users to optimize any analysis for speed, storage, and accuracy.
  2. Scalability. Much of the world today remains stuck on Excel. That’s not completely a bad thing; Excel is a highly intuitive product that makes otherwise complex analyses quite straightforward for non-technical users. But where Excel falls well short of Python is in scaling. Excel files face a limit of just over one million rows per tab, which is not a limitation for Python; in today’s big data-backed data science world, one million rows of data is not that huge of a dataset, so Python enables data scientists to complete more thorough analyses than Excel. And even if you can fit all of your data onto an Excel sheet, Excel’s performance starts to deteriorate much faster than Python’s when dealing with larger and larger datasets.
  3. Automation. If you want your dashboards, reports, and other analyses to update automatically, Python is the way to do that. Python specializes in compiling data from disparate sources (e.g., SQL databases, APIs, CSV files, and many more) into one place and inserting it into PowerBI, Tableau, or whatever other analytics tool you choose on a completely hands-free basis.

While we are familiar with R and other languages, and are happy to work in those languages as well, we choose Python for the reasons mentioned above. We think it’s the best of the best for data analytics and data science.

Applications of Python: API Integration

APIs (Application Programming Interfaces) are a tool that a developer can use to programmatically extract data from some other location in real time. For example, if you have data stored in Salesforce, Zoho, Monday, or other CRM, you can pull that data into a customized analysis in real time via an API using Python. Because of the increased level of automation that this method provides, it is seen as preferable to downloading data from the program into a CSV or Excel file and then repeating the analysis manually. While Python is not the only programming language that can be used to pull data from an API, it is advantageous to keep your end-to-end data process in one language, especially if the analysis of the data you’re pulling is to be completed in Python. Boxplot specializes in building fully-automated customized dashboards and reports for our clients using data stored in some third-party system; Python is always our preferred language for client projects along those lines.

It is also possible to insert data into a third-party storage system in real time using an API in addition to pulling data. This technique comes in handy when trying to ensure that multiple data storage systems don’t get out of sync; a Python-based program can automatically update data in some storage system A based on updates made in some other system B. Boxplot has also worked on many client projects involving this sort of work.

Applications of Python: Big Data

As aforementioned, Python does not face the same data volume limits that users run into with Excel. Python enables users to process even one million rows of data in a near-instantaneous manner on an average-quality computer. And if you anticipate utilizing a massive billion-row dataset, Python easily integrates with Spark, Hadoop, and other parallel computing frameworks. In short, if you’re going for anything big-data-related, Python is the way to do it.

Applications of Python: Machine Learning

Machine Learning involves teaching a computer how to detect and extrapolate upon data without having to give it specific instructions on how to do so. Python is far and away the strongest programming language for machine learning, as numerous free-to-use machine learning libraries are available. These libraries include packages used in both supervised learning (i.e., when a computer tries to predict a set output) and unsupervised learning (i.e., when a computer creates abstract labels for categories of data observed in the dataset). Boxplot has completed numerous client projects in both supervised and unsupervised learning, and we always turn to Python as our go-to solution.

Github Tutorial

What Is GitHub And Why Is It Useful?

GitHub is the world’s most commonly-used code hosting/storage platform. It is used by coders of all sorts to view, store, write, and collaborate on code, all for free. GitHub uses a tool called Git to enable these abilities (i.e., “GitHub” refers to the website where code gets stored; “Git” is the tool that the website uses to do that). Because GitHub stores code in a centralized location accessible to anyone with the proper credentials, it is an extremely useful resource when it comes to collaboration, update syncing, version history, and so on; these aspects of software development are much more cumbersome if coders are each storing their own copy of your codebases locally. As with any technology, there is a slight learning curve if you are new to GitHub. However, the vast majority of skills you will need to get started using GitHub effectively are covered in this brief tutorial.

GitHub Basics And Initial Setup

The GitHub workflow consists of (1) the directory containing the files that you are editing on your computer, (2) a “staging area” or “index” where files are temporarily stored and verified before being permanently recorded in GitHub, (3) a “local repository” which records your changes but is only visible to you, and (4) a “remote repository” (i.e., GitHub itself) which records your changes so that anyone with access can see them. An important note is that the staging area is temporary storage; adding files to it does not create a traceable version history the way committing to a repository does. If you want a version history ―one of the main benefits of using GitHub in the first place― you will need to commit your changes permanently to (3) or (4).

You can interact with GitHub repositories and staging areas using either the command line or by using GitHub’s desktop app. To use the command line, you need to have Git installed. It can be downloaded here. The desktop app can be downloaded here.

You need to either start a new repository or clone an existing one to begin inserting files into it. We will have a future blog post that covers these options.

Git Pull

“Pulling changes” refers to the process of updating your own copy of each file that you want to view and/or edit so that it is in sync with the up-to-date version stored in GitHub, before making any of your own changes to those files. At a minimum, you should pull each time you edit anything stored in GitHub, to make sure the changes you make are not (a) applied to an out-of-date version and, as a consequence, (b) missing someone else’s progress.

It’s very simple to perform a pull operation. In the command line:
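
For example, assuming your remote is named origin and your branch is named main (older repositories often use master):

    git pull origin main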

and if you are using the desktop app, use the “Pull origin” button at the top to perform this same task:

Git Add

This command takes your locally-saved files and puts them in the staging area. Note that this step does not make any changes to the GitHub repository. It only moves a “rough draft” of the files to the staging area; once you check over each rough draft file and determine they are all set to overwrite the current versions in GitHub, you then commit and push them (we will cover these commands later in this tutorial).

Let’s see how it works in the command line. Navigate to the directory with the files you want to add by dragging the folder onto the terminal. Then, you can check which files are eligible to be uploaded to staging by using the git status command. The eligible file names are shown in red:
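
The command itself is simply:

    git status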

Add the files to staging by using the git add command followed by the names of the files you want to add to staging separated by spaces:
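
For instance, with two hypothetical files named analysis.py and notes.txt:

    git add analysis.py notes.txt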

Alternatively, put a star rather than the file names to indicate you want to add all files of a certain file type:
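
For example, to stage every Python file in the directory:

    git add *.py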

Finally, you can add all files in the directory using this command:
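
A common way to do this from the top level of the repository is:

    git add .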

Confirm that the correct files have been added to staging by using git status again. The file names in green confirm they have been added to staging.

Note that git status will also tell you whether any commits have been made yet.

In the desktop app, checking and unchecking files in the toolbar on the left-hand side is the equivalent of adding and removing those files from staging:

git commit -m “message”

Recording the changes in your staging area is referred to as “committing” your changes. When you commit, you pass in a message along with your changes that describes the changes you are making. This feature is helpful if you ever have to look through the version history of a particular file in GitHub, because these messages remind you of what changed with each version; if a certain change ever has to be reverted, you know right where that change occurred. The message gets passed inside quotation marks at the end of the git commit command. For example, if you wanted to commit the two files from the previous section that are currently sitting in staging, you would do this:
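
Continuing the hypothetical two-file example from earlier:

    git commit -m "Add analysis.py and notes.txt"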

If you prefer to use the GitHub desktop app, put your commit message in the box just above where it says “Description” and then click the Commit button in the bottom left:

git push

This command moves your committed changes from your local repository to GitHub (the remote repository). You can perform this action using this command:
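
Again assuming a remote named origin and a branch named main:

    git push origin main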

You will be prompted to enter your GitHub username and password (or a personal access token). After doing so, the files will be pushed to GitHub.

If you are using the desktop app, go to Repository -> Push to push the files.

Small Business Social Media Metrics

A superior social media analytics practice is one of the best ways for small businesses to gain an edge over larger competitors. And the growing accessibility of advanced analytical tools makes social media analytics a worthwhile pursuit even if on a tight budget. This blog post covers some of the most useful social media-specific metrics ―which cover topics including audience/demographics, brand ecosystem, and engagement/conversion― to help organizations like yours get started with social media analytics.


Audience & Demographics

These metrics involve analyzing those who interact with your social media content rather than analyzing the social media content itself.

  1. Age Distribution. Do your social media pages attract older or younger audiences? Knowing the ages of those who interact with your social media pages, such as the average and median age of those who “like” your posts, can help with everything from content selection to product pricing strategies.
  2. Geographical Distribution. How global is your organization’s reach on social media? Analyzing social media data by geography, such as viewing the number of followers on your Instagram page broken down by country, is one of the best ways to understand your existing and prospective customers. Knowing the geographical distribution of your audience may also help you decide when to post, so that your content aligns with the time of day that people in each time zone are likely to be on social media.
  3. Gender. What is the distribution of your audience among different genders? How does that differ in terms of who views your posts versus who “likes” your posts?
  4. Ethnic Background. Similar to other demographic information, understanding how audiences of different ethnic backgrounds interact with your social media content allows you to infer a lot of insights that can help optimize your organization’s strategy.

Engagement & Conversion

These metrics involve analyzing data about the pages and posts themselves that your organization is creating via social media.

  1. Click-Through Rate. Out of all visitors to your social media pages, what proportion click on links that appear there? This is what click-through rate (CTR) measures. A higher CTR means that your audience is well-engaged with the content that you post.
  2. Clicks. Exactly what it sounds like: how many times has your audience clicked on content that you post on your social media pages? This metric can be measured on an overall basis or for individual posts.
  3. Impressions. Impressions count the number of times one of your pages is shown to someone.
  4. Pageviews. Pageviews count the number of times an audience member actually loads and views one of your pages, whereas an impression is counted any time your page or post is displayed to someone (in a feed, for example), whether or not they engage with it.
  5. Shares/Retweets. Again, this metric can be calculated on an overall or post-by-post basis.
  6. Subscriptions. How many audience members have subscribed to your communications via social media, such as subscribing to your YouTube channel or to a monthly mailing list?
  7. Comments. The number of comments your posts receive, again measurable overall or post-by-post.

Brand Ecosystem

These metrics concern trends occurring in the industry in which your organization operates. One key difference between this group of metrics and the others covered in this article is that these metrics analyze data originating from external social media pages rather than your organization’s.

  1. Industry Trends. There are many separate metrics within this category. For example, what has been the most talked about new product on social media over the past six months in the industry in which your organization operates? 
  2. Comparative Pricing. How are your competitors’ audiences reacting to their prices on social media? What percent of reactions in the comments are positive versus negative? Knowing this information can help your organization optimize pricing strategies to win over potential buyers from competitors, and it is all accessible by analyzing social media data.

None of these lists are exhaustive either! If you’re a very small business that doesn’t have a lot of time to dedicate to this, but wants to get started understanding your social media performance, we recommend our sister company Pulse Metrics. Their app will deliver insights like “Your average number of likes is down by 20% this month” or “Did you know that most of your audience is female between the ages of 18 and 24?” right to your email inbox. No need to learn how to make dashboards, or spend any time configuring settings. Connect your social media platforms in a few clicks and the app automatically starts generating insights. 

If you are a larger business and need more custom analyses to reach your analytical goals through social media data, a bespoke selection of metrics to focus on is almost always the most successful. As always, Boxplot is happy to discuss your strategy when it comes to social media analytics, regardless of your prior level of experience with it.

Line Graphs

What Is A Line Graph?

A line graph is a series of data points plotted along an x-axis and connected by a line. There are two kinds of line graphs: colored-in (also known as area charts) and standard. While the area under the line of a colored-in line graph is shaded in, the area under a standard line chart is not.

Standard chart

Real-world example of a standard line graph

Area chart (colored-in)

Example of the same data as an area chart

Within area charts, there are two types: stacked and proportion of whole. Stacked area charts simply show the series as-is, such as the area chart immediately above; proportion of whole area charts show an entire population ―represented by 100%― split up into the proportion of that population that each group represents. It’s easy to differentiate between the two types of area chart by the fact that the uppermost line on a proportion of whole area chart is always flat at 100%, while that of stacked area charts varies over time. Area charts are not recommended by some data visualization experts because their data points can be hard to compare and misleading.

Real-world example of a proportion of whole area chart.

When To Use Line Graphs

Because line graphs convey the progression of a series of data very well, it is most common to see line graphs being used to display information over time. You’ll occasionally see line graphs represent distributions (what is the frequency of each possible value for a data set) but time series are definitely the most common. In terms of choosing between standard line graphs and area charts, it is more common to use an area chart if you want to compare different groupings of the same series/population ―e.g., if you want to compare the number of people who have gotten at least one dose of the coronavirus vaccine with the number of people who are fully vaccinated, as seen above― but much more common to use a standard line graph if you want to see a single metric, such as the first example graph shown. A standard line graph is also common if you’re looking at multiple different series (not to be confused with different groupings of the same series), e.g., if you want to compare how many people have been vaccinated in the United States vs. the number of people vaccinated in Canada.

Common Programs For Making Line Graphs

Likely the most commonly-used program for making either type of line graph is Excel; beginner and expert data analysts alike love Excel’s ease of use, flexibility, and universality. However, other business intelligence tools that can be used to make any type of line graph include Tableau, Qlik, and Microsoft’s Power BI (each of which is superior to Excel in terms of automation and customizability) as well as Google Sheets. Additionally, programmatic tools such as R and Python each include packages for creating line graphs.

Tutorial: Making Line Graphs In Excel

Although there are a number of different programs in which you can make line graphs, I’ll do a quick tutorial here in Excel since that’s the most popular. To start, organize your data into adjacent rows or columns, where one of those rows/columns contains the data to appear on the x-axis (this is usually a row/column of dates), and the other row(s)/column(s) correspond to the values that you want to appear in your chart. For example:

It would also work just as well to have the three categories of data as rows instead of columns:

While it isn’t absolutely necessary for the rows/columns of data to be adjacent, it will make the process of creating the line chart much easier; you’ll see why in a second. Note that any of the values in these rows/columns can be calculated by Excel formula(s) or hardcoded (raw) values. However, formulas are preferable where possible: when a calculated value updates, the graph showing that value updates automatically, which is not the case with hardcoded values.

To create your line graph, highlight the data, and then go to Insert and click on the line graph icon:

You’ll also see the stacked line and 100% stacked line options alongside the standard line chart option, and creating those two types of chart works the same way. However, in this context, the standard line chart makes the most sense. Using Excel’s Recommended Charts option will also give you a line chart by default as it assumes a line chart is desired for time-series data.

To customize the title of the line chart, simply click into the title box at the top of the chart and type in the desired title:

I can also change the font and formatting of the title in the same way.

There’s a small issue with this chart that we need to adjust: the x-axis starts at 12/1/2018 even though my data doesn’t begin until 12/31/2018. To customize my x-axis ―or any other chart component, for that matter― right click on the axis, and then select Format Axis. Next, set this axis type to a text axis (Excel thinks of dates such as 12/31/2018 as month dates that simply correspond to December 2018, and thus records them on charts as 12/1/2018 by default instead of 12/31/2018):

There are nearly countless other customization options available to you on this same toolbar that you just used to adjust the x-axis. To access the other options, click on the down arrow next to the Axis Options label:

Here, you have the ability to edit the chart area, chart title, horizontal axis, vertical axis, legend, plot area, and series, in terms of size/properties, effects, fill/line, and/or axis. For items corresponding to any sort of text label, the Text Options button also appears:

I won’t go into too much detail on all these options as doing so would take a VERY long time, but know that nearly any customization feature imaginable is available somewhere in these options.

While you’re at it, let’s go through the other customization options to make sure the line graph is exactly as you desire. If you click on your chart, you’ll see three icons on the top right corner:

The plus icon controls which elements you want to show up on your line chart. By clicking on this icon and hovering over each checkbox, you can view what your chart will look like with each option selected or unselected in real time; the arrows you see on the right side point you to additional toggle options:

You can use this feature to add or remove the axes, axes titles, the chart’s title, labels for each individual data point, a data table, error bars (showing standard error, percentage, standard deviation, etc. for each individual data point), gridlines, a legend, a trendline ―although that wouldn’t make very much sense in the context of a line graph― and up/down bars which show differences between the two series.

Next, the paintbrush icon allows you to change the visual styling of the line graph. Take some time to play around with the different options it gives you for style and color, and see which one you like best:

Note that more color options are available; the ones you see on this tab are just a few preset out-of-box options.

Lastly, you have the funnel icon, which lets you filter which data you do and don’t want to appear on your graph. You can even remove an entire series from the graph this way, or remove one of the dates along the x-axis; control these two options by checking/unchecking boxes, and then click Apply to see the results:

When you click on the graph, you may have noticed two extra tabs on the Excel ribbon that aren’t normally there: the Chart Design and Format tabs:

Let’s briefly go over those two tabs as well. On the Chart Design tab, the first three dropdowns (Add Chart Element, Quick Layout, and Change Colors) as well as the scroll menu in the center of the tab, give you some options that are duplicates from the plus and paintbrush icons explored earlier. Hovering over any of these options will show you what the chart will look like in real time. To reiterate, these are preset options, and if you don’t see something within these options that you want, chances are good you’ll be able to access the desired customizations in another way. There is also the option to switch row and column ―although for line graphs time always goes on the x-axis―, to update the selection of data that goes into the graph, to change the type of graph that displays, and to move the chart to another sheet in the Excel file.

Finally, there’s the Format tab, which replicates a lot of the functionality present on the toolbar that you used to adjust the x-axis. Take some time to play around with the options.

Summary

  1. Line graphs are used to show the progression of a series of data, usually over time. An area chart is used if the series is being split up into multiple groups, but if it is not being split up, a standard line graph is more common.
  2. Excel, Google Sheets, Power BI, Tableau, Qlik, Python, and R all include functionality for building line graphs.
  3. There are countless options for customizing your line graphs in Excel. Review the tutorial in this blog post, or spend some time playing around with the functionality to build your optimal line graph.

Tips For Building Likert Surveys

Likert Surveys ―pronounced LICK-ert― are one of the most popular ways of collecting survey data. Their simple design of providing a series of prompts with a discrete scale of responses for each prompt (such as “Strongly Disagree”, “Disagree”, “Neutral”, “Agree”, and “Strongly Agree”; “Poor”, “Fair”, “Average”, “Good”, and “Excellent”, etc.) makes Likert Surveys an extremely popular survey type. But they are also one of the most commonly abused types of survey; this blog post provides a brief tutorial to ensure your Likert Surveys are implemented correctly.

Creating a Likert Survey may seem simple enough, but there are common pitfalls to avoid. When these hazards aren’t avoided, the results are biased at best and downright deceitful at worst. This matters whether you are designing a Likert Survey or analyzing the results of one: only if the tips below are followed can the survey yield truly fair, unbiased results.

  1. Make sure options are evenly spaced. 
    For example, you wouldn’t want the options to be “Poor”, “Good”, “Very Good”, and “Excellent” because moving from “Poor” to “Good” would likely be a much larger improvement than going from “Good” to “Very Good”. To avoid this potential pitfall, it often makes sense to have respondents rank things on a scale from 1 to 10 instead of assigning qualitative descriptors such as “Poor” and “Good” to their responses.
  2. Make sure options are evenly distributed. 
    For example, you wouldn’t want the options to be “Poor”, “Average”, “Good”, and “Excellent” because there would be more above-average options than below-average options in this case. I recently came across a Likert Survey pertaining to the quality of a new software product, distributed by the firm that sells that product. One of the prompts was, “[product name] has enhanced my ability to automate menial tasks”, and the options were “Disagree”, “Agree Somewhat”, “Agree”, and “Strongly Agree”. Obviously, with more options pointing to the product’s success than the product’s failure, respondents are more likely than otherwise to choose an option that makes the product look successful; thus, the firm in question is trying to use their survey to make their product seem more successful than it actually is. This is an especially egregious example, but it clearly points to how Likert Surveys can be biased and/or can be used to deceive, if not implemented correctly.
  3. Provide some positive prompts and some negative prompts. 
    For example, if you are asking respondents’ opinions on pizza and tacos, frame the first prompt as “Pizza is delicious” (with the options being strongly disagree, disagree, etc.) and frame the other prompt as “Tacos are overhyped”. Studies such as this one have shown that the practice of framing some prompts negatively and some positively ―known as balanced keying― can reduce what’s known as acquiescence bias in your survey results. Acquiescence bias describes the human tendency to acquiesce (choose the “agree” option) to a statement if respondents are unsure of their true feelings.

Building A Likert Survey In Google Forms

There are a great number of platforms which can be used to build a Likert Survey, but we’ll go over Google Forms because it’s free and allows for seamless response data conversion to spreadsheets.

Let’s say I’m trying to set up a Likert Survey on pizza toppings. To do this, I would first navigate to www.google.com/forms, and log into my Google account. On the resulting screen, under Start a new form, click on the plus icon labelled Blank:

Then, I’m going to fill out the name of my survey where it says Untitled form and provide a description if I want. Now it’s time to start creating the questions/prompts. First and foremost, on the dropdown menu on the right-hand side, I make sure to select either the linear scale option or multiple choice option; these options allow me to turn this survey into a Likert Survey. Next, I choose how many options to provide my respondents for this question (the default is 1 to 5, with customizable option labels). I fill in all of this information according to my survey’s requirements, but I remember to evenly distribute the labels. When I’m finished, I click the plus icon to move on to my next question.

I create the rest of my questions in succession, paying special attention to the aforementioned tips.

A few quick notes on the functionality of the right-hand side toolbar (the toolbar of icons with the plus icon for adding new questions on it): as previously mentioned, the top/plus icon allows me to add more questions to the survey, the second icon down allows me to import questions from other Google Forms surveys, the third icon down allows me to add additional titles and descriptions to subsections of the survey, the fourth and fifth icons down allow me to add images and videos, respectively, to the survey, and finally the sixth icon down allows me to add a skip logic-enabled subsection. For more on skip logic and other survey technicalities, view our article on survey specifics here.

When I’m finished, I can send out this survey to respondents by clicking the “Send” button. There are three ways of distributing the survey: through email, by copying a link to the survey, or by embedding the survey in another page.

The Responses page will show respondent activity in real time. From that page, I can end the survey by clicking the Accepting responses toggle button; I can also send response data to a Google Sheet by clicking the green Create spreadsheet button, and/or I can send response data to a .csv file by clicking on the three dots in the top right of this page, and then clicking Download responses (.csv).

Analyzing Survey Data

Now that I have data from my respondents, what can I do with it? The answer to this question depends on whether my data are ordinal or interval data. Interval data preserves order and distance; for example, a Likert-like prompt to rate something on a scale from 1 to 10 would be interval data because we know that 2 is greater than 1, that 3 is greater than 2, and so on, but we also know that an improvement from a 1 to a 2 is of the same size as an improvement from a 2 to a 3. Ordinal data, on the other hand, merely preserves order; for example, a scale of (“Poor”, “Fair”, “Good”, and “Excellent”) would yield ordinal data because we cannot be assured that an improvement from “Poor” to “Fair” is of the same size as an improvement from “Fair” to “Good”.

Ordinal data is much more qualitative than interval data, so it makes much more sense to use a bar chart, column chart, or pie chart to analyze such data. Interval data, on the other hand, supports more advanced statistical analyses such as histograms/distribution analysis, means, standard deviations, and hypothesis testing.

If I’m trying to apply Machine Learning to the results of a Likert Survey, my methods for doing so will also depend on whether I have ordinal or interval data. If I’m trying to apply Machine Learning to an ordinal target variable, I’ll be using a classification algorithm such as Decision Trees, Random Forests, Neural Networks etc. because the results of such a survey are discrete; regression analysis would not be a good strategy for ordinal target variables. For an interval target variable, I could opt for a classification model, but regression analysis would also be available to me because I can treat such a variable as continuous.

Interpreting Linear Regression Results

A Brief Introduction To Linear Regression

Linear regressions discover relationships between multiple variables by estimating a line or other function of best fit to a set of data. For example, the orange line is what a linear regression’s result would look like for the data shown in blue:

This function of best fit (shown here in orange) is expressed in the format of y = mx + b, where y is the variable we are trying to predict and x, sometimes referred to as a regressor, is the variable whose effect on y we are examining. Since m represents the slope of the line, it can be thought of as the effect of x on y (mathematically, m tells us the amount we would expect y to increase by for an increase of 1 in x, so m essentially tells us by how much x is affecting y). Thus, the orange line represents our best estimate of the relationship between x and y. Note two things. First, we can use linear regression to discover relationships between more than two variables; we are not limited to just one x variable to explain the y. Second, the relationship we estimate does not have to be a straight line as it is in this example; it can also be a polynomial function, an exponential function, etc.

Interpreting Linear Regression Results In Excel

After adding a linear trendline to an Excel scatter plot of car weight (x) against acceleration (y) and displaying its equation, I can see that the equation it gives for the best-fit linear trend line is y = -0.0014x + 19.606. The main insight embedded in this equation is the -0.0014, which represents m from the previous section; this number tells us that every one-pound increase in the weight of a car results in a reduction of acceleration by 0.0014 (a negative m means that an increase in the x variable is associated with a reduction in y, and vice versa). Additionally, since our relationship is linear, we know that this relationship scales as well (so, for example, this value for m also tells us that every ten-pound increase in the weight of a car results in a reduction of acceleration by 0.014, every one-hundred-pound increase results in a reduction of acceleration by 0.14, and so on).

In some cases, the intercept value b (19.606 in this case) is meaningful, but in this case, it is not. The intercept value tells us what we should expect the value of y to be if the value of x is 0. In this case, x being 0 would translate to a car that weighs 0 pounds, which is obviously not meaningful.

So we’ve established that acceleration decreases by 0.0014 for every one-pound increase in weight, but to understand the full influence of weight on acceleration, we also want to know: how much of the total variability in acceleration does weight account for? This is the question that a special metric called R2 answers. It’s called R2 because it’s the squared value of the correlation, which is represented by R. To get the R2 value on an Excel regression, go back into the formatting options that we used to add the regression equation, and check off the box for Display R2:

The resulting R2 value is 0.1743, which means that an estimated 17.43% of the variation in acceleration can be explained by weight.

Note that while it is possible to include more than one x variable in our regression to understand the effect of multiple variables on our chosen y, it is much better practice to use a programmatic tool such as R or Python rather than Excel for such cases.

Interpreting Linear Regression Results In Python And Other Programmatic Tools

Fire up your preferred Python IDE and load in the .csv or .xlsx file using the following code:
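
A minimal sketch, assuming the data were saved as a CSV file with (hypothetical) columns named weight and acceleration; use pd.read_excel instead if you saved an .xlsx file:

    import pandas as pd

    # replace the path with wherever you saved the file
    df = pd.read_csv('cars.csv')
    print(df.head())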

Obviously, you’ll have to replace the file path inside the read_csv method call with whatever file path and file name you saved the data under. There are two ways to run a linear regression in Python: by using the sklearn package and by using the statsmodels package. Here’s how to use Python’s sklearn package to run the regression (there are of course other ways of doing this, but this blog post is about interpreting results, not the actual coding):
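
A sketch along those lines, again assuming the column names weight and acceleration:

    from sklearn.linear_model import LinearRegression

    X = df[['weight']]       # the regressor(s); must be two-dimensional
    y = df['acceleration']   # the variable we are trying to predict

    reg = LinearRegression()
    reg.fit(X, y)            # estimate m and b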

Next, use reg’s coef_ attribute to retrieve the value of m:
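
Continuing the sketch above:

    print(reg.coef_)         # array of slope value(s); here, the m for weight (about -0.0014)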

Once again, we have arrived at a value of about -0.0014 for m. We can also check on the R2 value by using the score() method within reg:
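
Using the same X and y as above:

    print(reg.score(X, y))   # R-squared of the fitted regression (about 0.1743 here)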

and sure enough, it’s about 0.1743 again.

If we want to use the statsmodels package, the code to be run is:
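
A sketch with the same assumed column names:

    import statsmodels.api as sm

    X_sm = sm.add_constant(df['weight'])            # add the intercept term explicitly
    model = sm.OLS(df['acceleration'], X_sm).fit()  # ordinary least squares fit
    print(model.summary())                          # regression results table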

The summary() method returns us a clean table which confirms the coefficient value of -0.0014, a constant (intercept) value of 19.606, and an R2 value of 0.1743:

This table also gives us a bunch of other information. We won’t go through all of it in this blog post, but here are a few highlights:

  1. Adjusted R2: Tells us the R2 value adjusted for the number of regressors in the regression. It penalizes regressors that do not improve the fit by more than would be expected by chance, and it is always at or below regular R2. Since there is only one regressor in this regression, adjusted R2 is slightly lower than regular R2.
  2. Model: The Model is denoted as OLS, which stands for ordinary least squares. This is the standard form of linear regression that we’ve explored in this blog post.
  3. Df Model: Tells us the number of degrees of freedom in the model, which is the number of regressors in the regression.
  4. std err, t, and P>|t|: These metrics show which results are statistically significant, and which are not. Values in the std err column tell us the accuracy of the coef values, where lower std err values correspond to higher accuracy. Values in the t column tell us the t-values of the coef values, i.e., the number of standard errors away from 0 each coef value is. The t-values are very important ―especially in the row(s) that correspond to regressor(s)― because if a t-value is at or near 0, we cannot distinguish that constant/regressor’s effect on the y variable from 0 (i.e., the regressor may be meaningless). But if a t-value is larger than about 2 in absolute value, the regressor in that row is statistically significant. Finally, values in the P>|t| column are p-values: the probability of seeing a t-value this far from 0 if the constant/regressor in that row truly had no effect. If P>|t| is low enough (commonly below 0.05), we conclude that the constant/regressor is “significant”, i.e., we can be confident that it has a meaningful effect on the y variable.

While the exact command/syntax used for linear regressions varies among Python, R, SPSS, Stata, and other tools, each tool is able to give you similar information to what we looked at in Python and Excel.

How B2C Businesses Can Use Their Data

In order to understand your customers fully and correctly, the data and analytics you maintain about those customers are extremely important. For B2C organizations especially ―which sell directly to customers― customer data and analytics are a necessary component of sales, marketing, and managerial strategies for both pre- and post-market decision making. There is a long list of channels through which a superior data practice can help B2C organizations gain a leg-up on their competition; this blog post will focus on some of the most prominent of those. If you’re a member of a B2C organization interested in hearing about what your business can do to make better use of data, this blog post is for you!

Social Media Analytics

Because they sell directly to end-user customers, in this day and age effective B2C organizations must be aware of the way that their customers engage with the business through social media. There are several different forms of social media analytics that help B2C organizations understand their customer base.

First, these organizations can use social media analytics to improve their social media strategy itself. For example, an organization may build an automatically-updating report to inform them of which sorts of posts (what time of day was it posted, long posts or short posts, which medium was it posted on, etc.) are generating the greatest number of click-throughs, the  greatest number of likes, the greatest number of sales leads, and so on; or alternatively, an organization may build an automatically-updating report to inform them of which sorts of posts are not working so well. Either way, the insights that these reports generate enable this organization to optimize their social media strategy according to what’s working and what’s not.

Second, these organizations can use social media analytics to improve their general marketing strategy. Knowing which sorts of consumers are more likely to engage with your social media content also helps you determine which sorts of products/services those groups of people enjoy buying and, in turn, which product(s)/service(s) you should focus on building to gain ground on competitors. To read more on the benefits of social media analytics, view our separate blog post on the topic here.

Know Your Customers At A More In-Depth Level With Machine Learning

Characterizing your existing customer base by past spending habits, demographic information, and more can be a helpful strategy, but you have the ability to go into much more depth than simply aggregating known data. Data scientists can use a special kind of programming called machine learning ―defined as teaching a computer how to extrapolate on known data to create new data― to generate new, currently unknown information about your customers.

For example, let’s say your organization is planning the launch of a new service and wants to optimize its marketing strategy for that service. Obviously, you won’t have data on which marketing strategies worked and didn’t work before the service actually hits the market, nor will you ever have any data pertaining to strategies that you don’t end up pursuing. As a result, conventional data analytics won’t help you here. But what you can do is use machine learning to extrapolate on the data you do have to make an objective, unbiased prediction of which marketing strategy is the best given the circumstances.

Your organization can also use machine learning to predict future customer behavior, predict supply chain issues, diagnose drivers of cost overruns before they actually occur, personalize marketing by predicting which material(s) a given viewer may be more interested in seeing, and much, much more.

Customer Relationship Management Analytics

What can your organization do to retain more customers? Being able to answer this question is a primary objective of most B2C organizations, as B2Cs almost always feature shorter customer lifecycles than B2Bs. When applied in the proper way, data analytics is the most reliable way to identify customers who are at risk of terminating their business with your organization before they actually do; then, you can apply the proper retention tactics to keep those customers onboard for longer. To identify such customers, data scientists often turn to a technique called survival analysis, which estimates the likelihood that a current customer leaves your customer base over time. The data scientist can then examine which diagnostic factor(s) have the greatest influence on that likelihood; when the factors that push churn risk upward are identified in a certain customer, your organization knows that corrective action is warranted for that customer.

There are a lot of other questions that customer relationship management analytics can answer as well. Here are a few examples:

  1. Which type of prospects and sales leads tend to turn into clients, and what can be done to improve the conversion rate of those who are less likely to become clients?
  2. How can we measure client conflicts and what can be done to prevent them?

Collecting the data needed for customer relationship management analytics is also a straightforward exercise, especially if your organization has CRM software.

Accounting Analytics

What are the principal causes of cost overruns for your organization? What is the size of the average purchase from your business, and how has that changed over time? If a customer buys a certain product/service from you, which of your other products/services are they more/less likely to also buy? The answer to each of these questions is buried somewhere in your accounting records. To uncover insights such as these, a necessary step is to build some sort of data feed from wherever your accounting data is stored into bespoke report(s) and/or dashboard(s); the payoff of engaging an expert data analyst here is an efficient, effective, and ideally automated set of data feeds and reports that give your organization fully-updated, industry-leading accounting analytics whenever they are required.

Your Organization’s Data Strategy

These are just a few examples. Many B2C organizations have all kinds of data that could be valuable to the business. Contact us to chat about your organization’s data-related needs.
