Inferential statistics and descriptive statistics

Statistics are at the heart of data analysis. They help us detect trends and patterns and make plans; in other words, they bring data to life and help us derive meaning from it. Although the individual statistical methods we use in data analysis are too numerous to count, they can be divided into two main fields: descriptive statistics and inferential statistics. In this post, we explore the differences between these concepts and see how they shape the field of data analysis.

Descriptive statistics vs. inferential statistics at a glance:

  • Definition: descriptive statistics describe the characteristics of populations and/or samples; inferential statistics use samples to make generalizations about larger populations.
  • Function: descriptive statistics organize and present data in a purely factual way; inferential statistics help us make estimates and predict future results.
  • Final results: descriptive statistics present results visually, using tables, charts, or graphs; inferential statistics present results in the form of probabilities.
  • Conclusions: descriptive statistics draw conclusions based on known data; inferential statistics draw conclusions that go beyond the available data.
  • Measures and techniques: descriptive statistics use measures such as central tendency, distribution, and variance; inferential statistics use techniques such as hypothesis testing, confidence intervals, and regression and correlation analysis.

What are statistics

It may seem silly to define such a “basic” concept as statistics, yet when we use a term frequently, it is easy to take it for granted. Simply put, statistics is the area of applied mathematics that deals with the collection, organization, analysis, interpretation, and presentation of data. Sound familiar? It should. These are all vital steps in the data analysis process. In fact, in many ways, data analysis is applied statistics: when we use the term ‘data analysis’, what we really mean is ‘the statistical analysis of a given data set or data sets’. But that’s a bit of a mouthful, so we tend to shorten it!

Because they are so critical to data analysis, statistics are also vitally important to whatever field data analysts work in, from science and psychology to marketing and medicine. The wide range of statistical techniques that exist can be divided into two categories: descriptive statistics and inferential statistics. But what is the difference between them?

Simply put, descriptive statistics focus on describing the visible characteristics of a data set (a population or sample). Meanwhile, inferential statistics focus on making predictions or generalizations about a larger data set, based on a sample of that data. Before exploring these two categories further, it helps to understand what population and sample mean. Let’s find out.

What are population and sample in statistics

Two basic but vital concepts in statistics are population and sample. We can define them as follows.

The population is the entire group that you want to extract data from and then draw conclusions about. While in everyday life the word is often used to describe groups of people (such as the population of a country), in statistics it can be applied to any group from which you collect information. These are usually people, but they could also be cities of the world, animals, objects, plants, colors, and so on.

A sample is a representative group drawn from a larger population. Random sampling of representative groups allows us to draw general conclusions about the overall population. This approach is commonly used in surveys: pollsters ask a small group of people for their views on certain topics, and can then use this information to make informed judgments about what the wider population thinks. This saves the time, hassle, and expense of extracting data from an entire population (which, for all practical purposes, is often impossible).

Using random sample measurements from a representative group, we can estimate, predict, or infer characteristics of the larger population. While there are many variations on this technique, they all follow the same underlying principles.

Okay! Now that we understand the concepts of population and sample, we are ready to explore descriptive and inferential statistics in a little more detail.

What is descriptive statistics

Descriptive statistics are used to describe the characteristics of a data set. The term “descriptive statistics” can refer both to individual quantitative observations (also known as “summary statistics”) and to the general process of obtaining knowledge from such data. We can use descriptive statistics to describe either an entire population or an individual sample; because they are purely descriptive, descriptive statistics are not much concerned with the differences between the two. So what measures do descriptive statistics look at? While there are many, the important ones include:

  • Distribution
  • Central tendency
  • Variability

What is distribution?

The distribution shows us the frequency of different results (or data points) in a population or sample . We can display it as numbers in a list or table, or we can represent it graphically. As a basic example, the following list shows the number of people with different hair colors in a data set of 286 people.

  • Brown hair: 130
  • Black hair: 39
  • Blonde hair: 91
  • Red hair: 13
  • Gray hair: 13

We can also represent this information visually, for example, in a pie chart.

Generally, the use of visualizations is a common practice in descriptive statistics . It helps us to detect patterns or trends more easily in a data set.
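As a minimal sketch (in Python, with the hair colors from the example represented as hypothetical raw observations), the standard library can tally such a frequency distribution directly:

```python
from collections import Counter

# Hypothetical raw observations: one hair-color label per person
observations = (["brown"] * 130 + ["black"] * 39 +
                ["blonde"] * 91 + ["red"] * 13 + ["gray"] * 13)

# Counter tallies how often each distinct value occurs
distribution = Counter(observations)

print(distribution["brown"])        # 130
print(sum(distribution.values()))   # 286 people in total
```

The resulting counts are exactly what a table, bar chart, or pie chart of the distribution would display.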

What is the central tendency?

Central tendency is the name given to measures that look at the typical central values within a data set. This doesn’t just mean the central value of a complete data set, which is the median. Rather, it is a general term used to describe a variety of central measures, which could, for example, include central measures taken from different quartiles of a larger data set. Common measures of central tendency include:

  • The mean: the average value of all data points.
  • The median: the middle value of the ordered data set.
  • The mode: the value that occurs most frequently in the data set.

Again, using our hair color example, we can determine that the mean is 57.2 (the sum of all values, divided by the number of values), the median is 39 (the middle value when the counts are ordered), and the mode is 13 (because it appears twice, more than any other value). Although this is a very simplified example, for many areas of data analysis these basic measures underpin how we summarize the characteristics of a data sample or population. Summarizing these statistics is the first step in determining other key characteristics of a data set, for example its variability. This brings us to the next point…
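The three measures above can be checked in a couple of lines with Python’s built-in statistics module, using the counts from the hair color example:

```python
import statistics

# Counts per hair color from the example data set
counts = [130, 39, 91, 13, 13]

print(statistics.mean(counts))    # 57.2
print(statistics.median(counts))  # 39
print(statistics.mode(counts))    # 13
```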

What is variability?

The variability or dispersion of a data set describes how the values are distributed or spread out. Identifying variability relies on understanding the measures of central tendency in a data set. And, like central tendency, variability is not a single measure; it is a term covering a variety of measures. Common measures of variability include:

  • Standard deviation: this shows us the amount of variation or dispersion. A low standard deviation implies that most values are close to the mean; a high standard deviation suggests that the values are more spread out.
  • Minimum and maximum values: these are the highest and lowest values in a data set or quartile. Using our hair color data set again, the minimum and maximum values are 13 and 130 respectively.
  • Range: measures the spread between the extreme values. It can be easily determined by subtracting the smallest value from the largest; in our hair color data set, the range is 117 (130 minus 13).
  • Kurtosis: measures whether the tails of a given distribution contain extreme values (also known as outliers). If a tail lacks outliers, we say the distribution has low kurtosis; if it has many outliers, we say it has high kurtosis.
  • Skewness: a measure of the symmetry of a data set. If you were to draw a bell curve and the tail on the right was longer and fatter, we would call it positive skew; if the tail on the left was longer and fatter, we would call it negative skew.
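Two of these measures, standard deviation and range, can be computed for the hair color counts with the standard library:

```python
import statistics

# Hair color counts from the example data set
counts = [130, 39, 91, 13, 13]

# Population standard deviation of the counts
print(round(statistics.pstdev(counts), 1))  # 46.2

# Range: largest value minus smallest value
print(max(counts) - min(counts))            # 117
```

The large standard deviation relative to the mean (57.2) confirms what the list already suggested: the counts are widely spread out.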

Used together, distribution, central tendency, and variability can give us a surprising amount of detailed information about a data set. Within data analytics, they are very common measures, especially in the area of ​​exploratory data analysis. Once you have summarized the main characteristics of a population or sample, you will be in a much better position to know how to proceed. And this is where inferential statistics come in.

What is inferential statistics

So far, we have established that descriptive statistics focus on summarizing the key characteristics of a data set. Inferential statistics, meanwhile, focus on making generalizations about a larger population based on a representative sample of that population. Because inferential statistics focuses on making predictions (rather than stating facts), its results usually take the form of a probability.

As you might expect, the precision of inferential statistics depends heavily on the sample data being accurate and representative of the overall population. Achieving this involves obtaining a random sample. If you’ve ever read news coverage of scientific studies, you may have come across this term before; the implication is always that random sampling means better results, while results based on biased or non-random samples are generally discarded. Random sampling is very important for performing inferential techniques, but it is not always straightforward!

Let’s quickly summarize how you might get a random sample.

How do we get a random sample?

Random sampling can be a complex process and often depends on the particular characteristics of a population. However, the fundamental principles involve:

1. Definition of a population

This simply means determining the pool from which you will draw your sample. As we explained earlier, a population can be anything; it is not limited to people. So it could be a population of objects, cities, cats, pugs, or anything else we can derive measurements from!

2. Decide on the sample size

The larger your sample size, the more representative it will be of the overall population. However, collecting large samples can be time-consuming, difficult, and expensive; indeed, this is why we sample in the first place, as it is rarely feasible to extract data from an entire population. The sample size should therefore be large enough to give you confidence in your results, without being so large that collecting it becomes impractical, but not so small that the data risks being unrepresentative (and therefore inaccurate). This is where descriptive statistics can help, as they allow us to strike a balance between sample size and precision.

3. Select a random sample

Once you have determined the sample size, you can make a random selection. One way to do this is with a random number generator: assign each value a number, then select numbers at random. You could also use a variety of similar techniques or algorithms (we won’t go into detail here, as this is a topic in itself, but you get the idea).

4. Analyze the data sample

Once you have a random sample, you can use it to infer information about the larger population. It is important to note that while a random sample is representative of a population, it will never be 100% accurate: the mean (or average) of a sample, for example, will rarely match the mean of the entire population exactly, but it will give you a good idea of it. For this reason, it is important to incorporate a margin of error into any analysis (which we’ll cover in a moment). This is why, as explained above, any result of inferential techniques takes the form of a probability.
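The four steps above can be sketched in a few lines of Python. The population here is hypothetical (just ID numbers), chosen only to make the example self-contained:

```python
import random

# 1. Define the population: here, hypothetical member IDs 1..10000
population = list(range(1, 10_001))

# 2. Decide on the sample size
sample_size = 100

# 3. Select a random sample without replacement
random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=sample_size)

# 4. Analyze the sample, e.g. use its mean to estimate the population mean
estimate = sum(sample) / len(sample)
```

`random.sample` draws without replacement, so no member of the population can appear in the sample twice.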

However, assuming that we have obtained a random sample, there are many inferential techniques to analyze and obtain information from these data. The list is long, but some noteworthy techniques include:

  • Hypothesis testing
  • Confidence intervals
  • Regression and correlation analysis

Let’s explore each a little more closely.

What is hypothesis testing

Hypothesis testing involves checking whether your sample data supports your hypothesis (or proposed explanation). The goal is to rule out the possibility that a particular result occurred by chance. A recent example of this is the clinical trials of the COVID-19 vaccines: since it is impossible to conduct trials on an entire population, numerous trials were instead conducted on various random, representative samples.

The hypothesis test, in this case, might ask something like: ‘Does the vaccine reduce severe disease caused by COVID-19?’ By collecting data from different sample groups, we can infer whether the vaccine will be effective. If all samples show similar results, and we know that they are representative and random, we can generalize that the vaccine will have the same effect on the wider population. On the other hand, if one sample shows greater or lesser efficacy than the others, we must investigate why that might be the case. Perhaps there was an error in the sampling process, or perhaps the vaccine was administered differently to that group. In fact, it was due to a dosing error that one group in a COVID vaccine trial showed higher efficacy than the other groups, which shows just how important hypothesis testing can be: if the outlier group had simply been discarded, that finding would have been missed!
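As an illustrative sketch (with entirely made-up trial data, not real figures), a simple permutation test captures the core idea of hypothesis testing: estimating how likely the observed difference between two groups is to occur by chance alone.

```python
import random

def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Estimate the probability that the observed difference in means
    between two groups could have arisen purely by chance."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomly reassign everyone to two groups
        diff = abs(sum(pooled[:n_a]) / n_a -
                   sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            count += 1
    return count / n_iter

# Hypothetical trial outcomes: 1 = infected, 0 = not infected
vaccinated = [0] * 95 + [1] * 5
placebo    = [0] * 80 + [1] * 20

p_value = permutation_test(vaccinated, placebo)
```

A small p-value means the observed difference is very unlikely to be chance, so we can be more confident the vaccine genuinely had an effect.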

What is a confidence interval

Confidence intervals are used to estimate population parameters (such as the mean) based on sample data. Instead of providing a single value, a confidence interval provides a range of values, together with a confidence level usually expressed as a percentage. If you have ever read a scientific research paper, you will have seen conclusions drawn from a sample accompanied by a confidence interval.

For example, suppose you have measured the tails of 40 randomly selected cats and obtained a mean length of 17.5 cm. You also know that the standard deviation of tail lengths is 2 cm. Using a standard formula (mean ± 1.96 × standard deviation ÷ √sample size, for a 95% level), we can say that the mean tail length in the total cat population lies between roughly 16.9 cm and 18.1 cm, with 95% confidence. In other words, we are 95% certain that the population mean (which we cannot know without measuring the entire population) falls within the given range. This technique is very useful for gauging the precision of a sampling method.
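The cat-tail interval can be computed directly. This sketch uses the known-standard-deviation (z-based) formula from the example above:

```python
import math

sample_mean = 17.5   # mean tail length from the 40 cats (cm)
std_dev = 2.0        # known standard deviation of tail lengths (cm)
n = 40               # sample size
z = 1.96             # z-score for a 95% confidence level

# Margin of error: z * (standard deviation / sqrt(sample size))
margin = z * std_dev / math.sqrt(n)

lower, upper = sample_mean - margin, sample_mean + margin
print(f"95% CI: {lower:.2f} cm to {upper:.2f} cm")  # 16.88 cm to 18.12 cm
```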

What are regression and correlation analyses

Regression and correlation analysis are techniques used to observe how two (or more) sets of variables are related to each other .

Regression analysis aims to determine how a dependent (or output) variable is affected by one or more independent (or input) variables . It is often used for hypothesis testing and predictive analysis. For example, to predict future sunscreen sales (an output variable), you can compare last year’s sales to weather data (which is an input variable) to see how much sales increased on sunny days.

Meanwhile, correlation analysis measures the degree of association between two or more data sets . Unlike regression analysis, correlation does not infer cause and effect. For example, ice cream sales and sunburn are likely to be highest on sunny days; we can say that they are correlated. But it would be wrong to say that ice cream causes sunburn!

What we have described here is just a small selection of a large number of inferential techniques that you can use in data analysis. However, they provide a tantalizing taste of the kind of predictive power that inferential statistics can offer.

What is the difference between inferential and descriptive statistics

In this post, we have explored the differences between descriptive and inferential statistics. Let’s recap what we have learned.

Descriptive statistics:

  • Describe the characteristics of populations and/or samples.
  • Organize and present data in a purely factual way.
  • Present final results visually, using tables, charts, or graphs.
  • Draw conclusions based on known data.
  • Use measures such as central tendency, distribution, and variance.

Inferential statistics:

  • Use samples to make generalizations about larger populations.
  • Help us make estimates and predict future results.
  • Present final results in the form of probabilities.
  • Draw conclusions that go beyond the available data.
  • Use techniques such as hypothesis testing, confidence intervals, and regression and correlation analysis.

One last thing to note: although we have presented descriptive and inferential statistics as a binary, in practice they are most often used together. Together, these powerful statistical techniques form the foundation on which data analysis is built.
