Categorical attributes can take on a fixed number of values which keep repeating in the data. Given this, we might be interested in knowing how frequently a given value of the attribute occurs in the data. Consider the following table containing data on all ODI matches played by Sachin Tendulkar.
For example, considering the categorical attribute “Dismissal Type”, we might be interested in the number of times he was dismissed by getting bowled, caught, run out, etc. This count is called the frequency of the value.
One way of visualising such data is to construct a frequency table which has two columns. The first column contains all the unique values that this attribute can take and the second column contains the counts of these values (i.e., the number of times each value appears in the data). For example, Table 1 shows that out of the 452 times that Sachin Tendulkar was dismissed in ODIs, he was bowled 68 times, caught 258 times and so on.
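As a minimal sketch of how such a frequency table could be built in code, the example below counts how often each value occurs in a list. The sample list and the helper name frequency_table are made up for illustration and are not the real dismissal data from Table 1.

```python
from collections import Counter

# Hypothetical sample of dismissal types (invented for illustration only;
# this is NOT the real data behind Table 1)
dismissals = ["caught", "bowled", "caught", "run out", "caught", "lbw", "bowled"]


def frequency_table(values):
    """Return (value, count) pairs, ordered from most to least frequent."""
    return Counter(values).most_common()


# Print the two columns of the frequency table: value and count
for value, count in frequency_table(dismissals):
    print(f"{value:<10}{count:>5}")
```

Each row of the printed output corresponds to one row of a frequency table: the unique value in the first column and its count in the second.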
Similarly, consider data about the matches in which Sachin Tendulkar scored a century. One of the attributes in this data is “Opposition” which can take on values Australia, England, Pakistan and so on. We can now construct a frequency table for this attribute (as shown in Table 2). This quickly allows us to compare the number of centuries against different countries and see if an interesting trend emerges.
A better and more visually appealing way of displaying these counts is to draw frequency bar charts (often just called frequency charts). A frequency bar chart is a plot in which one of the axes (typically, the horizontal axis) contains the different values that a categorical attribute can take and the other axis (typically, the vertical axis) contains the counts or frequencies of these values. Each count is represented by a bar whose height is proportional to the count. For example, the data in Tables 1 and 2 can be displayed using frequency charts as shown in Figures 1 and 2.
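To make the idea of bars proportional to counts concrete, here is a rough text-only sketch. It draws horizontal bars (so bar length, rather than height, is proportional to the count) and uses only the two dismissal counts quoted in the text; the function name text_bar_chart and the width parameter are assumptions for this illustration, and a plotting library would normally be used instead.

```python
def text_bar_chart(counts, width=40):
    """Return one line per category; bar length is proportional to its count."""
    largest = max(counts.values())
    lines = []
    for label, count in counts.items():
        # Scale so that the most frequent category gets a bar of `width` characters
        bar = "#" * round(width * count / largest)
        lines.append(f"{label:<8} {bar} {count}")
    return lines


# Only the two counts quoted in the text; other dismissal types are omitted
counts = {"bowled": 68, "caught": 258}
lines = text_bar_chart(counts)
print("\n".join(lines))
```

Because every bar is scaled by the same factor, the ratio of two bar lengths approximates the ratio of the corresponding counts, which is what lets the eye compare categories quickly.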
Sometimes it is more convenient to visualise the relative frequencies of the values rather than the absolute frequencies. This allows us to answer questions of the following form: What percentage of his total centuries did Sachin Tendulkar score against Australia? or What percentage of the total number of farms in India grow paddy? The relative frequency of a value can be computed by dividing its absolute frequency by the total number of data points. For example, as shown in Table 4, the absolute frequency of the centuries Sachin Tendulkar scored against Australia is 9. If we divide this by the total number of centuries (i.e., the total number of data points), n = 49, then we get the relative frequency 9/49 ≈ 0.183673, or about 18.4%. We can compute the relative frequency for all other values in the same way, as shown in Table 4.
Similar to a relative frequency table, we can also draw a relative frequency chart as shown in Figure 3.
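The relative frequency computation described above can be sketched in a few lines. The example uses only the numbers given in the text (9 centuries against Australia out of n = 49) and groups the remaining 40 centuries into a single bucket, since the per-country counts are not listed here; the function name relative_frequencies is an assumption for this illustration.

```python
def relative_frequencies(counts):
    """Divide each absolute frequency by the total number of data points."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# 9 of the 49 centuries were against Australia (from the text);
# the remaining 40 are grouped together because their breakdown is not given here
centuries = {"Australia": 9, "All other oppositions": 40}

rel = relative_frequencies(centuries)
print(f"{rel['Australia']:.6f}")  # 0.183673
```

The relative frequencies always add up to 1, so multiplying each by 100 gives the percentage share of each value.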
Use of frequency charts in Machine Learning
Frequency charts are often used in Machine Learning to visualise the distribution of different categories in the data. Below we give a couple of examples where frequency charts are useful.
Analysing errors in an ML system: Suppose you are training an ML system to classify images as horses or giraffes. Once the system is trained, you feed it 100 images (say, 50 of horses and 50 of giraffes) to evaluate it. You observe that it makes errors on 30 examples, and you now make a frequency chart of the categories of the erroneous examples. Figure 4 shows one possible frequency chart resulting from this data. This looks a bit troublesome because it suggests that the model makes more mistakes in identifying giraffes than horses. This gives the ML engineer clues for debugging the system: they can either (i) make appropriate adjustments so that the errors are more uniform across categories while still low in number, or (ii) find out why it is harder to detect giraffes than horses.
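The error analysis above amounts to building a frequency table over the true categories of the misclassified examples. The sketch below illustrates this with made-up labels: the split of 8 horse errors and 22 giraffe errors is purely hypothetical (it is not taken from Figure 4), as is the helper name errors_by_category.

```python
from collections import Counter


def errors_by_category(true_labels, predicted_labels):
    """Count misclassified examples, grouped by their true category."""
    return Counter(
        true for true, pred in zip(true_labels, predicted_labels) if true != pred
    )


# Evaluation set from the text: 50 horses followed by 50 giraffes
true_labels = ["horse"] * 50 + ["giraffe"] * 50

# Hypothetical predictions (invented for illustration): 8 horses and
# 22 giraffes are misclassified, giving 30 errors in total
predicted_labels = (
    ["horse"] * 42 + ["giraffe"] * 8      # predictions for the 50 horses
    + ["giraffe"] * 28 + ["horse"] * 22   # predictions for the 50 giraffes
)

errors = errors_by_category(true_labels, predicted_labels)
for category, count in errors.items():
    print(f"{category:<8}{count:>4}")
```

Plotting these counts as a frequency chart would make the imbalance between the two categories visible at a glance.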