#### Mining Frequent Patterns, Associations, and Correlations

#### Que 1. Explain Frequent Pattern Analysis and its importance, with an example.

**Frequent Pattern Analysis:**

- In data mining, Frequent Pattern Analysis is a core task because it plays a major role in discovering associations and correlations.
- A frequent pattern is a pattern that appears frequently in a data set.
- Examples include itemsets, subsequences, and substructures.
- For example, a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.
- A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping history database, is a (frequent) sequential pattern.
- A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences.
- If a substructure occurs frequently, it is called a (frequent) structured pattern.
- By identifying frequent patterns, we can observe which items are strongly correlated with one another and easily identify shared characteristics and associations among them.
- Frequent pattern mining also serves as a foundation for further analysis such as clustering, classification, and other data mining tasks.

**Importance of Frequent Pattern Analysis:**

- It reveals intrinsic regularities in data, such as which products are frequently bought together.
- It supports applications such as market-basket analysis, cross-marketing, catalog design, and sales campaign analysis.

**Example:**

- Let's look at an example of frequent pattern mining.
- For this example, we will consider only two measures, namely Support and Confidence.

**Support:**

- The support of a rule x -> y (where x and y are itemsets) is defined as the proportion of transactions in the data set that contain both x and y.
- So, Support(x -> y) = (number of transactions containing both x and y) / (total number of transactions).

**Confidence:**

- The confidence of a rule x -> y is defined as: Support (x -> y) / support (x).
- So, it is the ratio of the number of transactions containing all items in both the antecedent (x) and the consequent (y) to the number of transactions containing all items in the antecedent (x).
- For example, Support(milk -> bread) = 0.4 means that milk and bread are purchased together in 40% of all transactions, and Confidence(milk -> bread) = 0.5 means that out of 100 transactions containing milk, 50 also contain bread.
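As a quick check of these two definitions, here is a minimal sketch; the five-transaction list is hypothetical, chosen so that Support(milk -> bread) = 0.4 and Confidence(milk -> bread) = 0.5, matching the numbers above.

```python
# Hypothetical transactions (illustrative item names).
transactions = [
    {"milk", "bread"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "butter"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule x -> y: support(x and y) / support(x)."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"milk", "bread"}, transactions))       # 0.4
print(confidence({"milk"}, {"bread"}, transactions))  # 0.5
```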

#### Que 2. Explain the Apriori algorithm as a frequent itemset mining method, with an example.

**Apriori Algorithm:**

- The first algorithm proposed for frequent itemset mining was later improved by R. Agrawal and R. Srikant, and the improved version was named Apriori.
- This algorithm uses two steps, 'join' and 'prune', to reduce the search space.
- It is an iterative approach to discovering the most frequent itemsets.

**The Core of the Apriori Algorithm**

- Uses the frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Uses database scanning and pattern matching
- Counts the support of the candidate itemsets

**The Apriori algorithm is a sequence of steps, including:**

**Join step:**

- This step generates candidate (k+1)-itemsets by joining the set of frequent k-itemsets with itself.

**Pruning step:**

- This step scans the database to obtain the support count of each candidate itemset.
- If a candidate does not meet the minimum support, it is considered infrequent and is removed.
- Pruning is done to reduce the size of the candidate itemsets.
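The join and prune steps can be sketched as a single candidate-generation function; the name `apriori_gen` follows textbook convention, the frozenset-of-strings data layout is an assumption, and the L2 used in the demo is the one that appears in the worked example below.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join two frequent (k-1)-itemsets whose first k-2 items match,
    then prune candidates that have any infrequent (k-1)-subset."""
    prev = sorted(sorted(s) for s in prev_frequent)
    prev_set = set(map(frozenset, prev_frequent))
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:               # join condition
                cand = frozenset(a) | frozenset(b)
                if len(cand) == k and all(
                    frozenset(sub) in prev_set
                    for sub in combinations(cand, k - 1)
                ):                                    # prune condition
                    candidates.add(cand)
    return candidates

# L2 from the worked example below; joining and pruning it yields C3.
L2 = {frozenset(s) for s in
      [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
       ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
print(apriori_gen(L2, 3))  # only {I1,I2,I3} and {I1,I2,I5} survive pruning
```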

Consider the following dataset; we will find its frequent itemsets and generate association rules from them.

- Minimum sup_count is 2.
- Minimum confidence is 60%.

**Step-1: K=1**

- (I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

- (II) Compare each candidate item's support count with the minimum support count (here min_support = 2); remove items whose support count is below min_support. This gives us itemset L1.

- I4 and I5 each have a support count equal to min_support = 2, so they are retained.

**Step-2: K=2**

- Generate candidate set C2 using L1 (this is the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets must have (k-2) elements in common.
- Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
- Now find the support count of these itemsets by searching the dataset.

- (II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2); remove itemsets whose support count is below min_support. This gives us itemset L2.

**Step-3:**

- Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is (k-2) elements in common, so here, for L2, the first element should match.
- The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.
- Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly, check every itemset.)
- Find the support count of the remaining itemsets by searching the dataset.

- (II) Compare the candidate set (C3) support counts with the minimum support count (here min_support = 2); remove itemsets whose support count is below min_support. This gives us itemset L3.

**Step-4:**

- Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (k = 4) is (k-2) elements in common, so here, for L3, the first two elements (items) should match.
- Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, whose subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
- We stop here because no further frequent itemsets are found.
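The level-wise loop above can be sketched end to end in Python. Since the original transaction table did not survive in these notes, the code assumes the classic 9-transaction dataset whose counts match the walkthrough (e.g. sup_count(I1, I2, I3) = 2); treat it as an illustrative reconstruction, not the original table.

```python
from itertools import combinations

# Assumed dataset, consistent with the support counts quoted in the
# walkthrough above (I1:6, I2:7, I3:6, I4:2, I5:2, {I1,I2,I3}:2, ...).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2

def sup_count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets.
Lk = {s for s in {frozenset([i]) for t in transactions for i in t}
      if sup_count(s) >= MIN_SUP}
frequent = {s: sup_count(s) for s in Lk}
k = 2
while Lk:
    # Join: union pairs of frequent (k-1)-itemsets that form a k-itemset.
    cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune: every (k-1)-subset must itself be frequent.
    cands = {c for c in cands
             if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    # Scan: keep candidates meeting minimum support.
    Lk = {c for c in cands if sup_count(c) >= MIN_SUP}
    frequent.update({s: sup_count(s) for s in Lk})
    k += 1

print(frequent[frozenset({"I1", "I2", "I3"})])  # 2
```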

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule.

**Confidence:**

A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

**Confidence (A -> B) = Sup_count(A ∪ B) / Sup_count(A)**

Here, taking one frequent itemset as an example, we will show the rule generation, using the itemset {I1, I2, I3} from L3.

**So the rules can be:**

- [I1 ^ I2] => [I3] // confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I2) = 2/4 × 100 = 50%
- [I1 ^ I3] => [I2] // confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I3) = 2/4 × 100 = 50%
- [I2 ^ I3] => [I1] // confidence = sup(I1 ^ I2 ^ I3) / sup(I2 ^ I3) = 2/4 × 100 = 50%
- [I1] => [I2 ^ I3] // confidence = sup(I1 ^ I2 ^ I3) / sup(I1) = 2/6 × 100 = 33%
- [I2] => [I1 ^ I3] // confidence = sup(I1 ^ I2 ^ I3) / sup(I2) = 2/7 × 100 = 28%
- [I3] => [I1 ^ I2] // confidence = sup(I1 ^ I2 ^ I3) / sup(I3) = 2/6 × 100 = 33%

So if the minimum confidence is 50%, the first three rules can be considered strong association rules.
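Rule generation can be sketched directly from the support counts quoted in the worked example above; the `rules` helper and the hard-coded count table are illustrative.

```python
from itertools import combinations

# Support counts taken from the worked example above.
sup = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}

def rules(itemset, min_conf):
    """All rules antecedent -> consequent from `itemset` meeting min_conf."""
    items = frozenset(itemset)
    out = []
    for r in range(1, len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            conf = sup[items] / sup[antecedent]
            if conf >= min_conf:
                out.append((set(antecedent), set(items - antecedent), conf))
    return out

# With min_conf = 50%, only the three 2-item-antecedent rules survive.
for a, b, c in rules({"I1", "I2", "I3"}, 0.5):
    print(f"{sorted(a)} -> {sorted(b)}: {c:.0%}")
```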

#### Que 3. Discuss the pattern-growth approach for mining frequent itemsets, with a proper example.

**Frequent Pattern Growth Algorithm:**

- This algorithm is an improvement over the Apriori method.
- Frequent patterns are generated without the need for candidate generation.
- The FP-Growth algorithm represents the database in the form of a tree called a frequent pattern tree, or FP tree.
- This tree structure maintains the associations between the itemsets.


**FP Tree:**

- A Frequent Pattern Tree is a tree-like structure built from the initial itemsets of the database.
- The purpose of the FP tree is to mine the most frequent patterns. Each node of the FP tree represents an item of an itemset.
- The root node represents null, while the lower nodes represent the itemsets.
- The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.

**Example of FP-Growth Algorithm:**

Support threshold=50%, Confidence= 60%

**Build FP Tree:**

- Start with a null root node.
- The first scan, of transaction T1: I1, I2, I3, contains three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as a child to the root, I1 is linked to I2, and I3 is linked to I1.
- T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to the root, I3 is linked to I2, and I4 is linked to I3. But this branch shares the I2 node, since it was already created for T1.
- Increment the count of I2 by 1, link I3 as a child to I2, and link I4 as a child to I3. The counts are {I2:2}, {I3:1}, {I4:1}.
- T3: I4, I5. Similarly, a new branch is created in which I5 is linked to I4 as a child.
- T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, so its count is incremented by 1. Similarly, I1's count is incremented by 1, as it is already linked to I2 from T1. Thus {I2:3}, {I1:2}, {I4:1}.
- T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
- T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
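The transaction-by-transaction construction above can be sketched with a minimal tree structure; the `Node` class and the frequency-descending ordering (ties broken by item name, matching the I2, I1, I3, ... sequences used above) are implementation assumptions.

```python
from collections import Counter

# The six transactions listed in the walkthrough above.
transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]

class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

freq = Counter(i for t in transactions for i in t)
root = Node(None, None)  # the null root
for t in transactions:
    node = root
    # Insert items in descending frequency (ties broken alphabetically),
    # reusing an existing child node where one exists.
    for item in sorted(t, key=lambda i: (-freq[i], i)):
        child = node.children.setdefault(item, Node(item, node))
        child.count += 1
        node = child

def dump(node, depth=0):
    """Print the tree as indented item:count lines."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

dump(root)  # the I2 branch carries counts I2:5, I1:4, I3:3, as above
```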

#### Que 4. Discuss why strong rules are not necessarily interesting, with a proper example.

Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only the user can judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another. However, objective interestingness measures, based on the statistics "behind" the data, can be used as one step toward the goal of weeding out uninteresting rules from presentation to the user.

The support and confidence measures are insufficient for filtering out uninteresting association rules. To address this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to *correlation rules* of the form:

**A => B [support, confidence, correlation]**

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets *A* and *B*. There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.
