"What is the frequency of different categories in our dataset?" was the motivating question behind frequency plots for categorical data. But does the same question make sense for quantitative or numerical data? Let us start with discrete quantitative data and see if it does. Table 1 below shows a few rows of the runs scored by Sachin Tendulkar in each of his 462 ODI innings.
Now, we may be interested in asking questions such as: how many times did he get out on 99 or 49 or 0? We can think of the numbers 0 to 200* as the unique values that the attribute "runs scored" can take, and we are now interested in how many times each of these unique values occurs. We can once again draw a bar chart with runs (from 0 to 200) on the horizontal axis and their frequencies on the vertical axis. This idea looks very similar to the frequency charts that we discussed earlier, with two main differences:
- The values on the horizontal axis are now numbers instead of labels or categories
- There is a natural ordering of the values on the x-axis, and hence, unlike before, we do not sort the values on the x-axis by their frequencies but simply by their natural numeric ordering
Such bar charts are called histograms.
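Counting how often each unique value occurs is a one-liner with Python's `collections.Counter`; a minimal sketch follows, using a small made-up list of scores rather than the actual 462-innings dataset:

```python
from collections import Counter

# Hypothetical sample of scores at dismissal (not the real data).
scores = [0, 4, 99, 0, 143, 18, 99, 67, 0, 36, 18, 4]

# Count how many times each unique score occurs.
freq = Counter(scores)

# Unlike a categorical frequency chart, we list the values in their
# natural numeric order, not sorted by frequency.
for run, count in sorted(freq.items()):
    print(run, count)
```

Plotting `sorted(freq.items())` as a bar chart, with runs on the x-axis and counts on the y-axis, gives exactly the histogram described above.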
The above histogram reveals some interesting patterns. To begin with, being a big fan of the maestro, I was devastated to see the number of times he was dismissed on 0 (called a duck in cricket parlance). Also notice that the combined frequency of all the low scores (<20) is very high. Contrast this with the general expectation that Sachin should score a century every time he goes out to bat! Another interesting pattern is that from 1 to 100, Sachin has been dismissed on every score except 56, 58, 59, 75, 76 and 92 (it looks like these are his only six lucky numbers). Of course, if you play for as long as Sachin did, this is bound to happen.
Grouping values into bins
While the above figure revealed some interesting patterns, it has too many unique values on the x-axis. This not only makes the data hard to visualise or even display but also makes it difficult to answer some other interesting questions, such as: how many times did Sachin Tendulkar get out in the 90s or on a single-digit score? Hence, instead of plotting individual data values on the x-axis, it is common to group the values into bins. For example, as shown in Figure 2, instead of having 201 unique values on the x-axis we could have the following 21 bins: 0-9, 10-19, 20-29, 30-39, ..., 190-199, 200-209. The bar on top of the bin 0-9 contains the sum of the frequencies of every score from 0 to 9. Similarly, the bar on top of the bin 10-19 contains the sum of the frequencies of every score from 10 to 19. These bins are called class intervals. The end points of these class intervals are called class boundaries (e.g., for the interval 0-9, 0 is the left class boundary and 9 is the right class boundary).
With this grouping of values into class intervals, the plot is now much easier to visualise. Of course, it hides certain details (such as the number of times Sachin was dismissed on exactly 0) but is still useful for most practical purposes. For example, we can still ask questions such as: how many times was he dismissed for a low score (say, less than 20)? How many times was he dismissed in the 90s, or immediately after scoring a century?
Figure 2: Histogram of Sachin's scores binned into the following bins: 0-9, 10-19, 20-29, 30-39, ..., 190-199, 200-209
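The grouping into class intervals described above can be sketched in a few lines of Python. This is a minimal illustration using a small made-up sample of scores (the function name `bin_scores` and the sample data are assumptions, not part of the original text):

```python
def bin_scores(scores, width=10, max_score=209):
    """Count scores per class interval of the given width, covering 0..max_score."""
    # Initialise every class interval with a zero count, e.g. (0, 9), (10, 19), ...
    bins = {(lo, lo + width - 1): 0 for lo in range(0, max_score + 1, width)}
    for s in scores:
        lo = (s // width) * width  # left class boundary of the interval s falls in
        bins[(lo, lo + width - 1)] += 1
    return bins

# Hypothetical sample of scores at dismissal (not the real data).
scores = [0, 4, 99, 0, 143, 18, 99, 67, 0, 36, 18, 4]
bins = bin_scores(scores)

print(bins[(0, 9)])    # dismissals on single-digit scores -> 5
print(bins[(90, 99)])  # dismissals in the 90s -> 2
```

The bar on top of each class interval in the histogram is simply the count stored for that key, so questions like "how many times in the 90s?" become a single dictionary lookup.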
While grouping values into bins, one important question is how to select the bin size. For example, in the above figure, instead of selecting bins of size 10, could we have selected bins of size 5? The answer is yes, but once again there will be too many values (41, to be precise), which will compromise the readability of the plot. Figure 3 shows the histogram with bins of size 5, and it is clearly denser than the one with bins of size 10.
Figure 3: Histogram of Sachin's scores with bins of size 5
So then, should we use even larger bin sizes, say of size 20 or 40? As evident from Figures 4 and 5, having larger bin sizes is also bad, as it compromises the granularity of the data. It is now difficult to answer questions about the number of times he was dismissed in the 20s, 30s and so on. In the extreme case, if we use bins of size 200, 100 or 50 (as shown in Figures 6 to 8), then the plots do not reveal any interesting patterns. A lot of details in the data are now hidden due to the larger bin sizes. In summary, for this data, extremely small (1) or extremely large (>50) bin sizes do not make sense. A moderate value of 10 seems to be a reasonable choice.
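The effect of the bin size can be seen directly by computing the bin counts for several widths over the same data. A minimal sketch, again using a made-up sample of scores rather than the real dataset:

```python
def histogram(scores, width):
    """Return the list of bin counts for bins of the given width covering 0..200."""
    counts = [0] * (200 // width + 1)
    for s in scores:
        counts[s // width] += 1
    return counts

# Hypothetical sample of scores at dismissal (not the real data).
scores = [0, 4, 99, 0, 143, 18, 99, 67, 0, 36, 18, 4]

# Small widths spread the data thinly over many bins;
# very large widths lump almost everything into one bin.
for width in (5, 10, 50, 200):
    print(width, histogram(scores, width))
```

With `width=200` every score lands in a single bin and the "histogram" tells us nothing, while `width=5` produces 41 mostly empty bins; intermediate widths trade off these two extremes, which is exactly the trial-and-error discussed above.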
Indeed, selecting the right bin size requires some trial and error. In practice, a statistician would try a few bin sizes and choose the one which works best at revealing meaningful patterns in the data, neither hiding too much detail nor showing too much (both extremes are bad).