Introduction to Statistics: Coursework

**Instructions**

This coursework tests your basic statistical modelling skills using spreadsheet software, as well as your awareness of how probability calculations, estimation and regression work in practice. Your answers are to be presented in an **essay/report format**, for which you will use a word processor. In writing your report, please:

· *state and explain all assumptions on which your answers are based;*

· *clearly indicate your answer/recommendations;*

· *support any answers with the appropriate calculations to arrive at the answer;*

· *note that no evidence of use of Excel will result in a fail mark for this assignment and therefore for the coursework component of the module;*

· *include selected printouts of the formulae underlying computed values. Failure to demonstrate that you have created appropriate formulae in Excel will be severely penalised. Although you will also be submitting the Excel file, your report is a stand-alone document, meaning a reader should not be required to look at the Excel file to understand your analysis, findings and recommendations;*

· *please note that adequate use of the Excel calculations in the report is important. This means that the key data/findings need to be included in the report and appropriately referenced, i.e. the relevant cell/table/range in the relevant tab of the Excel file should be mentioned at the point in the report where it should be consulted.*

The report must not exceed 6 pages, including any appendices. Penalties will be applied for longer submissions; you are required to develop your judgement on what is and isn’t important. Ten percent of the total mark is allocated to the overall quality of the presentation, i.e. organisation of the report, use of visual aids to support your conclusions/answers and clarity of the key messages you want to leave a reader with.

*Notes:*

**This coursework is your own (individual) work. Any student found guilty of plagiarism will be penalised.** **Standard penalties for late submissions are applicable.**

**Question 1: (30%)**

**Tab 1** of the attached Excel file called "Data File IF1202 CW March20" contains wealth data for the UK population. Given the large size of the data set, it is presented in grouped format.

Required:

a) Discuss what would be the best way to present the data graphically and give an illustration.

b) Calculate relevant measures of central tendency (hint: you'll need to make an assumption regarding the size of the final class) and measures of variation for this data.

c) Discuss and justify which of the metrics of central tendency above would be more appropriate to describe the data.

d) Calculate:

a. the proportion of the population with a wealth above £100,000

b. the probability that one individual selected at random will have a wealth of £75,000

c. the probability that one individual selected at random will have a wealth of no more than £50,000

d. the probability that one individual selected at random will have a wealth of at least £20,000

e. the probability that a sample of size 150 taken at random will have a mean wealth of at least £20,000

e) Discuss and justify the difference between the results of the last two probabilities calculated in part d) above.
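As a cross-check on the spreadsheet work (not a substitute for it), the grouped-data calculations in parts b) and d) can be sketched in Python. The class boundaries, frequencies and the closing value for the open-ended final class below are made-up assumptions for illustration; they are not the Tab 1 figures, and the names `classes` and `ASSUMED_TOP` are hypothetical.

```python
import math

# Hypothetical grouped wealth data: (lower bound, upper bound, frequency).
# These figures are made up for illustration and are NOT taken from Tab 1.
# The last class is open-ended, so its upper bound (None) must be assumed.
classes = [
    (0, 10_000, 30),
    (10_000, 50_000, 45),
    (50_000, 100_000, 20),
    (100_000, None, 5),
]
ASSUMED_TOP = 200_000  # assumption: where the open final class is closed


def grouped_mean_sd(classes, assumed_top):
    """Approximate mean and standard deviation from class midpoints."""
    mids, freqs = [], []
    for lower, upper, freq in classes:
        upper = assumed_top if upper is None else upper
        mids.append((lower + upper) / 2)
        freqs.append(freq)
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(mids, freqs)) / n
    # Population variance, treating the table as the whole population.
    var = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / n
    return mean, math.sqrt(var)


def standard_error(sd, sample_size):
    """Standard deviation of the mean of a random sample of this size."""
    return sd / math.sqrt(sample_size)


mean, sd = grouped_mean_sd(classes, ASSUMED_TOP)
se_150 = standard_error(sd, 150)
print(f"mean ~ {mean:,.0f}, sd ~ {sd:,.0f}, se(n=150) ~ {se_150:,.0f}")
```

The standard error is much smaller than the standard deviation, which is why a probability statement about a sample mean can differ sharply from the same statement about a single individual. Changing `ASSUMED_TOP` shows how sensitive the mean and standard deviation are to the assumption made about the final class.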

**Question 2:**

The table below represents data for the profits, size, number of product lines sold and sales for the 20 branches of a retailing company:

| Profit (£000s) | Size (000s sq. ft.) | Lines | Sales (£000s) |
| --- | --- | --- | --- |
| 42.13 | 6.0 | 150 | 748.82 |
| 6.32 | 1.4 | 75 | 140.78 |
| 38.47 | 5.0 | 170 | 702.11 |
| -0.32 | 1.0 | 75 | 41.54 |
| 3.65 | 1.2 | 75 | 96.85 |
| 7.77 | 1.5 | 75 | 166.93 |
| 4.31 | 1.3 | 75 | 109.05 |
| 4.53 | 1.1 | 80 | 263.92 |
| -2.69 | 1.1 | 75 | 50.84 |
| 3.22 | 1.2 | 75 | 90.08 |
| 9.03 | 1.4 | 80 | 190.59 |
| -2.59 | 1.2 | 75 | 91.75 |
| 6.39 | 1.4 | 80 | 141.57 |
| 24.39 | 3.5 | 160 | 377.04 |
| 13.92 | 1.5 | 100 | 198.69 |
| 2.13 | 1.3 | 75 | 62.78 |
| 17.48 | 2.1 | 110 | 265.28 |
| 7.21 | 1.3 | 85 | 91.80 |
| 15.62 | 2.5 | 120 | 231.60 |
| 33.61 | 4.5 | 200 | 548.31 |

Required:

a) Summarise the distribution of profits of the twenty branches and comment on the results.

b) Is there evidence that the average number of lines stocked per store is significantly different from 78?

c) If you divide the branches into two groups, one with sales above £150,000 and the other with sales below that value, is there a significant difference between the profits of the two groups?

d) Based on this sample, provide a 99% confidence interval for the profits of the twenty branches and comment on the outcome.

e) Is there evidence of association between the profit and the other variables?

f) Develop three regression models to predict the profit based upon each of the other factors (variables) individually. Which of these is best? What are the limitations of your best model? How can you improve this analysis?
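The arithmetic behind parts b) and f) can be sketched in Python as a way of checking spreadsheet output; the coursework itself still requires Excel. The sketch below uses the Profit, Size and Lines columns from the branch table, and the helper names (`one_sample_t`, `simple_ols`) are illustrative choices. It computes only the test statistic and the fitted line, so the significance decision still needs a critical value or p-value from the t distribution with 19 degrees of freedom.

```python
import math

# Columns copied from the branch table above.
profit = [42.13, 6.32, 38.47, -0.32, 3.65, 7.77, 4.31, 4.53, -2.69, 3.22,
          9.03, -2.59, 6.39, 24.39, 13.92, 2.13, 17.48, 7.21, 15.62, 33.61]
size = [6.0, 1.4, 5.0, 1.0, 1.2, 1.5, 1.3, 1.1, 1.1, 1.2,
        1.4, 1.2, 1.4, 3.5, 1.5, 1.3, 2.1, 1.3, 2.5, 4.5]
lines = [150, 75, 170, 75, 75, 75, 75, 80, 75, 75,
         80, 75, 80, 160, 100, 75, 110, 85, 120, 200]


def mean(xs):
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def one_sample_t(xs, mu0):
    """t statistic for H0: population mean equals mu0 (df = n - 1)."""
    return (mean(xs) - mu0) / (sample_sd(xs) / math.sqrt(len(xs)))


def simple_ols(x, y):
    """Least-squares intercept, slope and R^2 for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


t_lines = one_sample_t(lines, 78)
a, b, r2 = simple_ols(size, profit)
print(f"t (lines vs 78) = {t_lines:.3f}")
print(f"profit ~ {a:.3f} + {b:.3f} * size, R^2 = {r2:.3f}")
```

Calling `simple_ols(lines, profit)` or passing the Sales column in the same way gives the other two single-variable models, so their R² values can be compared on the same footing.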

**Question 3: (30%)**

**Tab 2** of the attached Excel file called "Data File IF1202 CW March20" contains data for the salaries of employees in a bank, as well as some other variables we believe impact the salary of each employee.

Required:

a) Prepare a summary table with the correlations between all the variables (columns B to G) and discuss which variables are highly correlated and which are not.

b) Construct a multiple regression model with all independent variables and clearly indicate your regression equation.

c) Indicate and justify which variables are significant and non-significant in the regression model and compare with your answer to part b) above.

d) Construct another multiple regression model including only the significant variables from the model in c) above and discuss whether it is a better model or not.

e) Bonus question (not required, but up to 6 additional percentage points can be added to your final assignment mark). Discuss how the variable Gender could be incorporated in your analysis and justify whether you believe there is evidence of gender inequality in banking salaries for the data collected.
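One common way to bring a categorical variable such as Gender into a regression is to code it as a 0/1 dummy variable. The sketch below illustrates the idea in Python on made-up records; the data, the single "years of experience" predictor and the helper names (`solve`, `ols`) are assumptions for illustration and are not taken from Tab 2.

```python
# Hypothetical records: (years of experience, gender label, salary in £000s).
# Made up for illustration; NOT taken from Tab 2.
records = [
    (1, "F", 24.0), (3, "M", 30.5), (5, "F", 33.0), (7, "M", 41.0),
    (9, "F", 42.5), (11, "M", 52.0), (13, "F", 51.0), (15, "M", 61.5),
]


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - known) / aug[i][i]
    return solution


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Design matrix: constant, experience, and a 0/1 dummy (1 = male, 0 = female).
X = [[1.0, float(years), 1.0 if gender == "M" else 0.0]
     for years, gender, _ in records]
y = [salary for _, _, salary in records]

intercept, b_experience, b_male = ols(X, y)
residuals = [yi - (intercept + b_experience * row[1] + b_male * row[2])
             for row, yi in zip(X, y)]
print(f"salary ~ {intercept:.2f} + {b_experience:.2f}*years + {b_male:.2f}*male")
```

In this setup the coefficient on the dummy estimates the average salary gap between the two groups at equal experience. Whether that gap counts as evidence of inequality depends on its significance in the regression output and on which other variables are controlled for.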

