Most datasets contain several attributes for a given object. For example, in the cricket dataset, if we consider a player as an object, then we have data about runs scored, balls faced, minutes played, strike rate, type of dismissal, etc. Similarly, in the agriculture dataset, if we consider a farm as an object, then we have data about state, district, type of crop, total area, total yield and so on. We often expect certain relationships to hold between these attributes. For example, we may expect the runs scored to be related to the number of balls faced, or the total yield to be related to the total area of a farm. Hence, instead of describing these attributes in isolation, it is more meaningful to describe them together. For example, Figures 1 and 2 show the histograms of the runs scored and the balls faced by Sachin Tendulkar; each describes a single attribute in isolation.
While these histograms are informative, they cannot answer questions of the following type: How does the scoring rate (i.e., runs scored per ball) of Sachin Tendulkar change as the number of balls faced increases? Is the relationship between runs scored and balls faced linear, or does it become quadratic or exponential as the number of balls faced increases (i.e., does he start scoring very fast after he has faced a certain number of balls)? To answer such questions involving two attributes, we need to draw a scatter plot. A scatter plot has one attribute on the x-axis and the other attribute on the y-axis. Figure 3 shows such a scatter plot for the balls faced and runs scored by Sachin Tendulkar in all the 452 ODIs that he played. To be precise, each point in the plot corresponds to one of the ODIs that Sachin played. The x-coordinate of each point is the number of balls faced by Sachin in that match, and the y-coordinate is the number of runs he scored in that match.
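The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the match records are made-up values (not Tendulkar's actual figures), the function names are hypothetical, and the plotting step assumes the third-party matplotlib library is installed.

```python
# Hypothetical per-match records as (balls faced, runs scored).
# These values are invented for illustration only.
matches = [(10, 8), (25, 21), (48, 40), (73, 63), (102, 88), (130, 111)]


def scatter_points(records):
    """Split (x, y) records into the x list and y list a scatter plot needs."""
    xs = [balls for balls, _ in records]
    ys = [runs for _, runs in records]
    return xs, ys


def plot_scatter(records):
    """Draw one point per match. Assumes matplotlib is available;
    it is imported only when this function is called."""
    import matplotlib.pyplot as plt

    xs, ys = scatter_points(records)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys)               # one dot per match
    ax.set_xlabel("Balls faced")     # attribute on the x-axis
    ax.set_ylabel("Runs scored")     # attribute on the y-axis
    ax.set_title("Runs scored vs. balls faced (illustrative data)")
    return fig


xs, ys = scatter_points(matches)
```

Calling plot_scatter(matches) would produce the figure; the point of the sketch is simply that every match contributes exactly one (x, y) pair.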
In Figure 3, we can see that as the number of balls faced increases, the runs scored also increase, and that the relationship is approximately linear.
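One rough way to back up such a visual impression with numbers is to group the matches by balls faced and compare the scoring rate (runs per ball) across groups. If the rate stays roughly constant, the relationship is close to linear; if it climbs steadily, runs grow faster than linearly. The sketch below uses made-up match records, and the function name and bin width are assumptions chosen for illustration.

```python
# Hypothetical per-match records as (balls faced, runs scored); invented values.
matches = [(10, 8), (25, 21), (48, 40), (73, 63), (102, 88), (130, 111)]


def scoring_rate_by_bin(records, bin_width=50):
    """Return {bin_start: total runs / total balls} for matches grouped
    by the number of balls faced (bins are [0, 50), [50, 100), ...)."""
    totals = {}
    for balls, runs in records:
        if balls == 0:
            continue  # no ball faced, so a scoring rate is undefined
        start = (balls // bin_width) * bin_width
        total_balls, total_runs = totals.get(start, (0, 0))
        totals[start] = (total_balls + balls, total_runs + runs)
    return {start: r / b for start, (b, r) in sorted(totals.items())}


rates = scoring_rate_by_bin(matches)
for start, rate in rates.items():
    print(f"{start}-{start + 49} balls: {rate:.2f} runs per ball")
```

In this invented data the rate stays between 0.8 and 0.9 runs per ball in every bin, which is the pattern one would expect from an approximately linear relationship.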
Notice that the scatter plot above captures the relationship between two discrete variables. We can also use a scatter plot to capture the relationship between two continuous variables, such as total yield and total area (Figure 4), or between one discrete and one continuous variable, such as runs scored (discrete) and strike rate (continuous), as shown in Figure 5.
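To make the discrete/continuous distinction concrete, the sketch below builds the points for a plot of runs scored against strike rate. It assumes the usual cricket definition of strike rate (runs scored per 100 balls faced) and again uses made-up match records; the function name is hypothetical.

```python
# Hypothetical per-match records as (balls faced, runs scored); invented values.
matches = [(10, 8), (25, 21), (48, 40), (73, 63), (102, 88), (130, 111)]


def strike_rate(runs, balls):
    """Runs scored per 100 balls faced (assumed definition of strike rate)."""
    return 100.0 * runs / balls


# x = runs scored (whole numbers only), y = strike rate (any real value).
# Matches with zero balls faced are skipped because the rate is undefined.
points = [(runs, strike_rate(runs, balls)) for balls, runs in matches if balls > 0]

for runs, rate in points:
    print(f"runs={runs:3d}  strike rate={rate:6.2f}")
```

The x-values can only take integer values, while the y-values can fall anywhere on a continuous scale, yet both fit on the same kind of plot.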
Use of scatter plots in Machine Learning
In Machine Learning, scatter plots are often used to identify correlated features. For example, suppose you are trying to predict whether a patient has a certain health risk or not. You want to build an ML system which will base its decision on various parameters or features of the patient such as sugar level, cholesterol level, blood pressure, triglycerides, age, weight, height, BMI, etc. One important consideration while using ML is to ensure that the features that you feed to an ML system are not redundant (i.e., each feature brings in new information which is not captured by the other features). More formally, you are interested in features that are uncorrelated. A formal way of checking this is to compute the correlation between features, as we will see in the next chapter. A more informal, visual way of checking this is to draw the scatter plot for different pairs of variables and see if they are related. If two attributes are strongly related, then once you know the value of one of them, knowing the value of the other adds little new information. In other words, given one attribute, the other attribute is largely redundant. In the above example, we observe that weight and BMI have an almost linear relationship (Figure 6), and hence feeding both of these inputs to the ML system does not make sense. On the other hand, the scatter plot of age and BMI (Figure 7) shows that there is almost no relationship between these two variables, and hence knowing one does not necessarily reveal anything about the other. Hence, it may be important for the ML system to consider the values of both these variables while making decisions.
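As a rough numerical companion to Figures 6 and 7, the sketch below computes the Pearson correlation coefficient for the two feature pairs. It is a preview only, since correlation is treated properly in the next chapter. The patient records are invented, BMI is assumed to be weight in kilograms divided by the square of height in metres, and the function names are hypothetical.

```python
from math import sqrt

# Hypothetical patients as (age in years, weight in kg, height in m); invented values.
patients = [
    (25, 55, 1.65), (62, 62, 1.70), (34, 70, 1.68), (51, 78, 1.72),
    (29, 85, 1.69), (45, 93, 1.71), (58, 100, 1.67), (38, 108, 1.70),
]


def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared (assumed definition)."""
    return weight_kg / height_m ** 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


ages = [age for age, _, _ in patients]
weights = [w for _, w, _ in patients]
bmis = [bmi(w, h) for _, w, h in patients]

r_weight_bmi = pearson(weights, bmis)  # close to 1: the features overlap heavily
r_age_bmi = pearson(ages, bmis)        # close to 0: little linear relationship

print(f"weight vs BMI: r = {r_weight_bmi:.2f}")
print(f"age vs BMI:    r = {r_age_bmi:.2f}")
```

In this invented data, weight and BMI are almost perfectly correlated, so one of them could plausibly be dropped, while age and BMI are nearly uncorrelated, so each may carry information the other does not. Real patient data would need to be checked in the same way before dropping any feature.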