Market Basket Analysis - Real transaction data set

1. In this question you will perform a Market Basket Analysis on a real transaction data set available on Moodle (here). The data set contains transaction data for the year 2009 for a UK-based, registered, non-store online retailer. Only transactions made by UK residents were included. The aim of the analysis is to find association rules that could be of interest to the retailer. To answer this question you will use R. You need to type your answers and include the relevant R output.

(a) Explore the data set to answer: how many transactions and how many items are included in the data set? What is the average number of items sold in a transaction? What are the minimum and maximum numbers of items included in a transaction?

(b) What are the 4 items in the data set with the biggest sales? What are their support count (absolute frequency) and support (relative frequency)?

(c) Make a frequency plot for the items that have a minimum support of 5% and comment on it. What are the 5th and 6th items with the biggest sales?

(d) Find strong rules with a minimum support of 1% and a minimum confidence of 80% by applying the Apriori algorithm. In the list of parameters of the apriori command, add a constraint so that the maximum number of items contained in a rule is 5 (maxlen=5). How many rules were found? Analyse the main descriptive statistics of the support, confidence and lift of the mined rules and comment on your findings.

(e) List one mined rule of your choice and interpret its level of support, confidence and lift.

(f) Inspect and list the 2 rules with the highest support, the 2 rules with the highest confidence and the 2 rules with the highest lift. Comment on your findings.

(g) Make a scatter plot of the mined rules and comment. Use the interactive feature of the plot to investigate further an area of particular interest and comment.

(h) Make a matrix plot for the 20 rules with the highest lift and use the support measure for the shading. Comment on your findings.

(i) Make a graph of the 10 mined rules with the highest support and comment on your findings. List the two rules shown in the plot with the highest support and the two rules with the highest lift.

(j) You are asked to suggest the 3 most interesting rules, justifying your suggestion and explaining how these rules could be used to benefit the online retailer.

80 marks

Consider the market basket transactions shown in the table below:

Transaction ID    Items bought
2                 apple, bread, milk
3                 bread, milk, orange
4                 bread, milk, orange, tomato
5                 apple, orange, tomato
6                 apple, bread, milk
7                 apple, bread, milk, orange
8                 apple, bread
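
As a reference for the terminology used in the parts below, the following is a minimal sketch (in Python, purely for illustration; it is not part of the assignment) of how the support count and support of single items are obtained from a transaction table. It encodes only the rows that appear in the table above (IDs 2 to 8); the names `transactions` and `item_support_counts` are hypothetical, and the figures it prints apply to those rows only.

```python
from collections import Counter

# Rows exactly as listed in the table above (IDs 2-8). Any row that is not
# shown in the table is deliberately left out rather than guessed.
transactions = {
    2: {"apple", "bread", "milk"},
    3: {"bread", "milk", "orange"},
    4: {"bread", "milk", "orange", "tomato"},
    5: {"apple", "orange", "tomato"},
    6: {"apple", "bread", "milk"},
    7: {"apple", "bread", "milk", "orange"},
    8: {"apple", "bread"},
}


def item_support_counts(transactions):
    """Return the support count (absolute frequency) of every single item."""
    counts = Counter()
    for items in transactions.values():
        counts.update(items)  # each item is counted once per transaction
    return counts


counts = item_support_counts(transactions)
n = len(transactions)

# Support (relative frequency) = support count / number of transactions.
# Items are listed by decreasing count, ties broken alphabetically.
for item, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{item:<7} count={count}  support={count / n:.2f}")
```

Sorting items by decreasing support count is also the ordering conventionally used when inserting transactions into an FP-tree.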

(a) Build the FP-tree for this transaction data set, showing all your workings. With the information obtained so far and without any extra computation, answer the following: if you were to apply the Apriori algorithm, how many candidate 2-itemsets would it generate? Explain how you arrive at the answer.

(b) Use the FP-Growth algorithm to extract the frequent itemsets assuming a minimum support of 30%. Show all your workings. Use the information obtained to answer: what percentage of the candidate 2-itemsets generated by the Apriori algorithm survive the pruning step?

(c)     Assume a minimum confidence of 60%. Find all the strong rules that satisfy the minimum support and minimum confidence.
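
For reference, the support, confidence and lift of a single rule can be computed as in the sketch below (Python, for illustration only; not part of the assignment). It assumes the standard definitions: confidence(A → B) = support(A ∪ B) / support(A) and lift(A → B) = confidence(A → B) / support(B). The helper names `support` and `rule_metrics` are hypothetical, the example rule is an arbitrary choice, and the data are only the rows listed in the transaction table (IDs 2 to 8), so the values hold for those rows only.

```python
from fractions import Fraction

# Only the rows listed in the transaction table (IDs 2-8), in order.
transactions = [
    {"apple", "bread", "milk"},
    {"bread", "milk", "orange"},
    {"bread", "milk", "orange", "tomato"},
    {"apple", "orange", "tomato"},
    {"apple", "bread", "milk"},
    {"apple", "bread", "milk", "orange"},
    {"apple", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    Raises ZeroDivisionError if the antecedent or the consequent never
    occurs in the data, since the ratios are undefined in that case.
    """
    a, c = set(antecedent), set(consequent)
    supp_rule = support(a | c, transactions)
    confidence = supp_rule / support(a, transactions)
    lift = confidence / support(c, transactions)
    return supp_rule, confidence, lift


# Example rule chosen arbitrarily for illustration: {milk} -> {bread}.
supp, conf, lift = rule_metrics({"milk"}, {"bread"}, transactions)
print(f"support={supp}, confidence={conf}, lift={lift}")
```

A rule is called strong when its support and confidence both reach the chosen minimum thresholds; a lift above 1 indicates that the antecedent and consequent occur together more often than would be expected if they were independent.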

(d)    Find:

i.     the itemset with largest support

ii.   a pair of items, a and b, such that the rules {a} → {b} and {b} → {a} have the same confidence.

iii.  in your opinion, the best rule. Justify your answer.

20 marks

In this question you will analyse a real data set about countries, available on Moodle (here). The aim is to identify possible factors that are related to the life expectancy of countries. The data set, collected by the World Health Organization and the United Nations in 2014, contains information about demographics, immunisation and mortality rates of 131 countries (note that only countries with complete observations were included). The aim of the analysis is to advise the government of a country about which area should be prioritised in order to efficiently improve the life expectancy of its population. The variables in the data set are:


    Life expectancy: life expectancy in years

    Adult Mortality: probability of dying between 15 and 60 years per 1000 population

    Hepatitis B: hepatitis B immunization coverage among 1-year-olds (%)

    Measles: number of reported cases of measles per 1000 population

    BMI: average Body Mass Index of the entire population

    HIV/AIDS: deaths from HIV/AIDS per 1000 live births (0-4 years)

    GDP: Gross Domestic Product per capita (in USD)

    Population: population of the country

    Schooling: average number of years of schooling

(a)     Make use of appropriate graphs and numerical measures to investigate the relationship between Life expectancy and the other variables in the data set. Based on your analysis choose three variables that could be used to model the Life expectancy of countries. Justify your selection of predictors.

(b) Write the equation of your proposed linear regression model and fit it in R. Attach the R summary output of your fitted model to your answers and interpret it: check whether or not your predictors make significant contributions to your model and interpret all the significant estimated coefficient values; comment on the goodness-of-fit of the model and interpret the R² value.

(c)     List the assumptions of the model and argue whether they are satisfied or not. Make appropriate plots in R to justify your conclusions.

(d)    Assuming the model is valid, write the equation for the fitted regression line and estimate the life expectancy for Argentina and Ukraine, which in 2015 had the following indicators:

Country    Ad. Mort.    Hep. B    Measles    BMI    HIV/AIDS    GDP    Population    Schooling

Knowing that the true life expectancies for these countries were 76.3 and 71.3, respectively, calculate the error in the predictions and comment on the accuracy.

(e) The government of Afghanistan aims to increase life expectancy by 5 years in the following decade. Based on your model, what suggestions can you make to achieve that goal? Quantify the impact of at least two alternative strategies and comment.

50 marks

In this question you will diagnose heart disease using information on several health indicators of 303 patients of the Cleveland Clinic Foundation. The data set is available on Moodle (here) and contains 14 variables:

    age: age in years

    sex: sex (1 = male; 0 = female)

    cp: chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain and 4: asymptomatic)

    trestbps: resting blood pressure (in mm Hg on admission to the hospital)

    chol: serum cholesterol in mg/dl

    fbs: fasting blood sugar > 120 mg/dl (1 = true
