Data description

Data package: SMSdata

Data set: carc

The car data set consists of 13 variables measured for 74 car types.

The variables are:

P - Price, a numeric vector

M - Mileage (in miles per gallon), a numeric vector

R78 - Repair record 1978 (rated on a 5-point scale; 5 best, 1 worst), a factor with levels . 1 2 3 4 5

R77 - Repair record 1977 (scale as before), a factor with levels . 1 2 3 4 5

H - Headroom (in inches), a numeric vector

R - Rear seat clearance (distance from front seat back to rear seat, in inches), a numeric vector

Tr - Trunk space (in cubic feet), a numeric vector

W - Weight (in pounds), a numeric vector

L - Length (in inches), a numeric vector

T - Turning diameter (clearance required to make a U-turn, in feet), a numeric vector

D - Displacement (in cubic inches), a numeric vector

G - Gear ratio for high gear, a numeric vector

C - Company headquarters, a factor with levels US Japan Europe


Commented solution

To keep things from getting too confusing, I want my main output to be a scatter plot with two continuous variables on the x and y axes and two categorical variables distinguished by the shape and color of the points. The data set contains only three categorical variables: R77, R78 and C. I will use R78 and C, since R77 and R78 are likely highly correlated and R78 contains the more recent information.

Let's decide which continuous variables to use for the plot.

Correlation matrix of the data:

Graph with highly correlated variables connected:

We can see that the continuous variables W, L, T, D, “-G” and “-M” are highly correlated (the minus sign marks G and M as negatively correlated with the rest). This is convenient for data description: we can choose one representative from this group, and the other members will likely behave in a similar way.
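The grouping step can be sketched in a few lines. The following is a minimal, standard-library Python illustration and not the code used to produce the graph above: the function names `pearson` and `correlated_groups`, the cut-off of 0.7 and the toy values are all assumptions made for the example. It connects two variables whenever the absolute value of their Pearson correlation reaches the cut-off, so negatively correlated variables such as “-G” and “-M” land in the same group.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_groups(columns, threshold=0.7):
    """Connect variables with |r| >= threshold and return the connected groups.

    `columns` maps a variable name to its list of values.
    The default threshold of 0.7 is an assumed cut-off, not one from the text.
    """
    names = list(columns)
    neighbours = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if abs(pearson(columns[a], columns[b])) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in names:
        if start in seen:
            continue
        # Depth-first search collects every variable reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Toy values for illustration only; these are NOT taken from the carc data.
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # rises with a
    "c": [5, 4, 3, 2, 1],      # falls as a rises (negative correlation)
    "d": [2, -1, 0, -1, 2],    # unrelated to the others
}
print(correlated_groups(toy))  # [['a', 'b', 'c'], ['d']]
```

With the real data, `columns` would hold the numeric variables of carc. Which variables end up in one group depends on the chosen cut-off.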

I choose W (weight) since it seems to be the most strongly correlated with the rest of the group, it is an easy variable to understand, and it is very descriptive.

The remaining continuous variables are P (price), R (rear seat clearance), H (headroom) and Tr (trunk space).

I choose P since it is the “least physically descriptive”: we already have weight as a physical descriptor.

Picture: Scatter plot of the chosen data with a red loess curve
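For readers unfamiliar with loess, the idea behind the smooth curve can be sketched as follows. This is a simplified illustration in standard-library Python and not the routine used to draw the picture: it fits a local straight line with tricube weights and skips the robustness iterations, and both the function name `loess_fit` and the span of 0.75 are assumptions made for the example.

```python
def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_fit(x, y, x0, span=0.75):
    """Value at x0 of a locally weighted straight-line fit to (x, y).

    `span` is the fraction of points used in each local fit. The default of
    0.75 is an assumption for this sketch, not a value taken from the text.
    """
    n = len(x)
    k = max(2, min(n, int(round(span * n))))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    h = sorted(abs(xi - x0) for xi in x)[k - 1]
    if h == 0:
        # All k nearest points sit exactly at x0: average them.
        weights = [1.0 if xi == x0 else 0.0 for xi in x]
    else:
        weights = [tricube((xi - x0) / h) for xi in x]
    sw = sum(weights)
    if sw == 0:
        raise ValueError("span too small for a local fit at x0")

    # Weighted least squares for a straight line, in centred form.
    mx = sum(w * xi for w, xi in zip(weights, x)) / sw
    my = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxx = sum(w * (xi - mx) ** 2 for w, xi in zip(weights, x))
    if sxx == 0:
        return my  # only one distinct x carries weight: use the weighted mean
    sxy = sum(w * (xi - mx) * (yi - my) for w, xi, yi in zip(weights, x, y))
    return my + (sxy / sxx) * (x0 - mx)


def loess_curve(x, y, grid, span=0.75):
    """Evaluate the local fit on every point of `grid`."""
    return [loess_fit(x, y, g, span) for g in grid]
```

Calling `loess_curve` with W as `x` and P as `y` over a grid of weights gives a curve for the joint data. Restricting `x` and `y` to the cars of a single level of C gives a curve for that group alone.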

From the picture above it is clear that American cars behave differently from European and Japanese cars.

Pictures: Partial scatter plots of the chosen data, with a red loess curve for the joint data and a blue loess curve for the “partial” data