Use the 10% data from KDD Cup 1999 Dataset located at… for this exercise.

Create 2 training sets by selecting samples from this data set and evaluate them using decision trees (such as J48 in Weka). You can use random sampling or any other selective sampling technique. Compare the decision trees you find and describe any key changes between the trees. Comment on why these changes may be occurring by looking at the class distribution in your samples or the size of your training samples.

You may use alternate analysis techniques such as clustering and associations to supplement your analysis (although this is not required).

Submit a Word document of your assignment. Please make sure to include the decision tree snapshots and other relevant snapshots in your assignment. You do not need to include snapshots of every intermediate step or analysis.

You can use Weka or any other alternative data mining tool for this assignment.

Sample Solution

  1. Data Statistics
    The data from KDD Cup 99 consists of 494,021 records and 42 attributes.
    The classes in this data are distributed as follows:
    Index Class Count Percentage
    1 smurf 280790 56.83766%
    2 neptune 107201 21.69968%
    3 normal 97278 19.69107%
    4 back 2203 0.44593%
    5 satan 1589 0.32165%
    6 ipsweep 1247 0.25242%
    7 portsweep 1040 0.21052%
    8 warezclient 1020 0.20647%
    9 teardrop 979 0.19817%
    10 pod 264 0.05344%
    11 nmap 231 0.04676%
    12 guess_passwd 53 0.01073%
    13 buffer_overflow 30 0.00607%
    14 land 21 0.00425%
    15 warezmaster 20 0.00405%
    16 imap 12 0.00243%
    17 rootkit 10 0.00202%
    18 loadmodule 9 0.00182%
    19 ftp_write 8 0.00162%
    20 multihop 7 0.00142%
    21 phf 4 0.00081%
    22 perl 3 0.00061%
    23 spy 2 0.00040%
    Total 100%
    Table 1.0: Distribution of the attacks in the KDD Cup 99 dataset
    The attributes can be classified into three categories [1]:
    Feature name        Description                                                     Type
    duration            length (number of seconds) of the connection                    continuous
    protocol_type       type of the protocol, e.g. tcp, udp, etc.                       discrete
    service             network service on the destination, e.g., http, telnet, etc.    discrete
    src_bytes           number of data bytes from source to destination                 continuous
    dst_bytes           number of data bytes from destination to source                 continuous
    flag                normal or error status of the connection                        discrete
    land                1 if connection is from/to the same host/port; 0 otherwise      discrete
    wrong_fragment      number of "wrong" fragments                                     continuous
    urgent              number of urgent packets                                        continuous
    Table 1.1: Basic features of individual TCP connections.
    Feature name        Description                                                     Type
    hot                 number of "hot" indicators                                      continuous
    num_failed_logins   number of failed login attempts                                 continuous
    logged_in           1 if successfully logged in; 0 otherwise                        discrete
    num_compromised     number of "compromised" conditions                              continuous
    root_shell          1 if root shell is obtained; 0 otherwise                        discrete
    su_attempted        1 if "su root" command attempted; 0 otherwise                   discrete
    num_root            number of "root" accesses                                       continuous
    num_file_creations  number of file creation operations                              continuous
    num_shells          number of shell prompts                                         continuous
    num_access_files    number of operations on access control files                    continuous
    num_outbound_cmds   number of outbound commands in an ftp session                   continuous
    is_hot_login        1 if the login belongs to the "hot" list; 0 otherwise           discrete
    is_guest_login      1 if the login is a "guest" login; 0 otherwise                  discrete
    Table 1.2: Content features within a connection.
    Feature name        Description                                                     Type
    count               number of connections to the same host as the                   continuous
                        current connection in the past two seconds
    Note: The following features refer to these same-host connections.
    serror_rate         % of connections that have "SYN" errors                         continuous
    rerror_rate         % of connections that have "REJ" errors                         continuous
    same_srv_rate       % of connections to the same service                            continuous
    diff_srv_rate       % of connections to different services                          continuous
    srv_count           number of connections to the same service as the                continuous
                        current connection in the past two seconds
    Note: The following features refer to these same-service connections.
    srv_serror_rate     % of connections that have "SYN" errors                         continuous
    srv_rerror_rate     % of connections that have "REJ" errors                         continuous
    srv_diff_host_rate  % of connections to different hosts                             continuous
    Table 1.3: Traffic features computed using a two-second time window.
  2. Preprocessing
    We mostly use Excel to preprocess the data set. At this stage, our main goal is to
    make the data consistent and combined. To do so, we merged the header file with the
    dataset so that we could see the nature of the variables and their role in the attacks.
    Moreover, we added a new attribute that classifies the data into two classes (Normal,
    Attack) as a baseline for comparison and to understand the pattern of attacks before mining
    the data. The normal observations are labeled {NORMAL} in the new class,
    and everything else is labeled {ATTACK}. The distribution of this new class is as follows:
    Index Class Count Percentage
    1 Attack 396743 80.31%
    2 Normal 97278 19.69%
    Table 2.1: A binary classification of the population dataset.
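    The binary relabeling can also be checked outside Excel. The following is a minimal sketch, assuming the class counts listed in Table 1.0; the helper name to_binary is hypothetical and not part of the original workflow.

```python
from collections import Counter

# Class counts copied from Table 1.0.
CLASS_COUNTS = {
    "smurf": 280790, "neptune": 107201, "normal": 97278, "back": 2203,
    "satan": 1589, "ipsweep": 1247, "portsweep": 1040, "warezclient": 1020,
    "teardrop": 979, "pod": 264, "nmap": 231, "guess_passwd": 53,
    "buffer_overflow": 30, "land": 21, "warezmaster": 20, "imap": 12,
    "rootkit": 10, "loadmodule": 9, "ftp_write": 8, "multihop": 7,
    "phf": 4, "perl": 3, "spy": 2,
}


def to_binary(label: str) -> str:
    """Map an original label to NORMAL or ATTACK (hypothetical helper)."""
    # Labels in the raw file may carry a trailing period (e.g. "normal."),
    # so it is stripped before comparing.
    cleaned = label.strip().rstrip(".").lower()
    return "NORMAL" if cleaned == "normal" else "ATTACK"


binary_counts = Counter()
for label, count in CLASS_COUNTS.items():
    binary_counts[to_binary(label)] += count

total = sum(binary_counts.values())
for name, count in binary_counts.items():
    print(f"{name}: {count} ({100 * count / total:.2f}%)")
```

    Summing the counts this way gives 396743 attack records and 97278 normal records out of 494021.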
    Since we are classifying with a decision tree, we do not
    normalize the data, as we would for other classification algorithms such as
    NN. At this point we still need a better understanding of the classes in the data.
    Therefore, we create a new class attribute that maps the attacks to their general type, such as
    DoS, U2R, etc. [2,3]. The following table shows our new class.
    Index Type of Attack Count Percentage
    1 DoS 392478 79.44561%
    2 Normal 97278 19.69107%
    3 Probe 4107 0.83134%
    4 R2L 106 0.02146%
    5 U2R 52 0.01053%
    Total 100%
    Table 2.2: Types of the attacks in the dataset
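    The mapping from individual labels to attack types can be sketched as a lookup table. The grouping below is an assumption: it was inferred so that the totals match the counts in Table 2.2 when applied to the label counts of Table 1.0. Note that it places warezclient under DoS, although that label is more commonly listed as an R2L attack.

```python
# Label counts from Table 1.0, grouped by attack type.
# ASSUMPTION: this grouping is inferred so that the totals match Table 2.2;
# in particular, warezclient is placed under DoS here, although it is more
# commonly listed as R2L.
CATEGORIES = {
    "Normal": {"normal": 97278},
    "DoS": {"smurf": 280790, "neptune": 107201, "back": 2203,
            "teardrop": 979, "pod": 264, "land": 21, "warezclient": 1020},
    "Probe": {"satan": 1589, "ipsweep": 1247, "portsweep": 1040, "nmap": 231},
    "R2L": {"guess_passwd": 53, "warezmaster": 20, "imap": 12,
            "ftp_write": 8, "multihop": 7, "phf": 4, "spy": 2},
    "U2R": {"buffer_overflow": 30, "rootkit": 10, "loadmodule": 9, "perl": 3},
}

# Reverse lookup: original label -> attack type.
LABEL_TO_TYPE = {
    label: category
    for category, labels in CATEGORIES.items()
    for label in labels
}

type_counts = {cat: sum(labels.values()) for cat, labels in CATEGORIES.items()}
total = sum(type_counts.values())
for cat, count in sorted(type_counts.items(), key=lambda kv: -kv[1]):
    print(f"{cat:<7}{count:>8}  {100 * count / total:.5f}%")
```

    Moving warezclient to R2L would change the R2L and DoS totals by 1020 records each, so the choice matters for the two rarest classes.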
    According to the given task, we decided to create three samples with different
    class distributions. The size of each sample is n=400.
    1) The first sample is created by random sampling, which yields a
    subset with nearly the same class distribution as the population dataset.
    2) The second sample is chosen manually so that its class
    distribution differs from that of the population dataset.
    3) The third sample is chosen manually in the same way as sample 2.
    4) A test set (n=200) that follows the class distribution of the
    population dataset. The test set is used to assess our model.
    Moreover, we run the classifier on each sample twice: once with 10-fold
    cross validation and once with the supplied test set that we created. The tool we
    use to create our model is RapidMiner. For the sake of simplicity, we run our
    classifiers on the types of the attacks (Normal, DoS, Probe, U2R, R2L), which refer
    respectively to normal flow, Denial of Service attacks, Probe attacks, User to Root attacks,
    and Remote to User attacks.
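    To make the 10-fold cross validation concrete, the sketch below splits the row indices of a 400-row sample into ten folds and yields one train/test pair per fold. It only illustrates the splitting idea, written with the standard library; it is not RapidMiner's implementation, and the function name k_fold_indices and the fixed seed are assumptions.

```python
import random


def k_fold_indices(n_rows: int, k: int = 10, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training rows are all rows outside the current test fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(400, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

    With n=400, each of the ten rounds trains on 360 rows and tests on the remaining 40, and every row is used for testing exactly once. The supplied test set, in contrast, is never seen during training.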
    2.1 Sampling the data.
    As described above, the samples are as follows. In each table, Ratio is the sample size as a percentage of the actual class count.
    1) Sample1 (RandSamp) n =400:
    Type of Attack Actual Count Sample size Ratio Percentage
    Normal 97278 83 0.085322478 20.75
    DoS 392478 314 0.080004484 78.5
    Probe 4107 3 0.073046019 0.75
    Total 400
    Table 2.1.1: Distribution of RandSamp sample
    2) Sample2 (UniformDist1) n=400:
    The distribution we propose for this sample is as follows:
    Type of Attack Actual Count Sample size Ratio Percentage
    Normal 97278 42 0.04317523 10.5
    DoS 392478 100 0.025479135 25
    U2R 52 52 100 13
    R2L 106 106 100 26.5
    Probe 4107 100 2.4348673 25
    Total 400
    Table 2.1.2: Distribution of UniformDist1 sample
    3) Sample3 (UniformDist2) n=400:
    The distribution we propose for this sample is as follows:
    Type of Attack Actual Count Sample size Ratio Percentage
    Normal 97278 100 0.102798166 25
    DoS 392478 170 0.04331453 42.5
    U2R 52 10 19.23076923 2.5
    R2L 106 20 18.86792453 5
    Probe 4107 100 2.4348673 25
    Total 400
    Table 2.1.3: Distribution of UniformDist2 sample
    4) TestSample (TSample) n=200:
    Type of Attack Actual Count Sample size Ratio Percentage
    Normal 97278 37 0.038035321 18.5
    DoS 392478 160 0.040766616 80
    Probe 4107 3 0.073046019 1.5
    Table 2.1.4: Distribution of the TSample test set
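    The manually chosen samples amount to drawing a fixed number of rows per class. A minimal sketch of that idea follows, assuming rows are available as (row_id, attack_type) pairs. The function quota_sample, the toy population, and the quotas are all hypothetical; only the Probe ratio at the end uses figures from the tables above.

```python
import random
from collections import Counter


def quota_sample(rows, quotas, label_of, seed=0):
    """Draw quotas[c] rows of each class c, without replacement."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    sample = []
    for cls, n in quotas.items():
        pool = by_class.get(cls, [])
        if n > len(pool):
            raise ValueError(f"class {cls!r} has only {len(pool)} rows, need {n}")
        sample.extend(rng.sample(pool, n))
    rng.shuffle(sample)  # avoid leaving the rows grouped by class
    return sample


# Toy population of (row_id, attack_type) pairs, NOT the real dataset.
types = ["DoS"] * 800 + ["Normal"] * 150 + ["Probe"] * 40 + ["R2L"] * 10
population = list(enumerate(types))
quotas = {"DoS": 17, "Normal": 10, "Probe": 10, "R2L": 3}
sample = quota_sample(population, quotas, label_of=lambda row: row[1])
print(Counter(t for _, t in sample))

# Ratio column: sample size as a percentage of the class size,
# e.g. Probe in UniformDist1 takes 100 of 4107 rows.
ratio = 100 * 100 / 4107
print(round(ratio, 7))
```

    Raising an error when a quota exceeds the class size makes explicit the limit met with U2R and R2L, where the population holds only 52 and 106 rows.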
    2.2 Feature selection
    We used the information gain algorithm on the first sample (RandSamp) with 5-fold
    cross validation to choose our features. We select the first 16 attributes in the ranking from
    the following table:
    Average Merit  Merit +/-  Average Rank  Attribute
    0.754 -0.011 1 count
    0.707 -0.015 2 src_bytes
    0.604 -0.024 3 dst_bytes
    0.467 -0.028 4.4 dst_host_same_src_port_rate
    0.453 -0.01 4.8 logged_in
    0.42 -0.028 6.2 srv_count
    0.401 -0.026 7.2 dst_host_count
    0.394 -0.01 7.4 protocol_type
    0.299 -0.015 9 dst_host_srv_diff_host_rate
    0.193 -0.024 10.2 srv_diff_host_rate
    0.077 -0.004 12 srv_serror_rate
    0.077 -0.011 13 flag
    0.077 -0.004 13.2 dst_host_srv_serror_rate
    0.077 -0.004 13.6 dst_host_serror_rate
    0.075 -0.004 14.4 serror_rate
    0.059 -0.006 16 same_srv_rate
    0.042 -0.005 17.2 duration
    0.025 -0.012 20.4 diff_srv_rate
    0 0 20.8 hot
    0 0 21.4 num_failed_logins
    0.024 -0.012 21.8 dst_host_diff_srv_rate
    0 0 22.8 root_shell
    0 0 23 wrong_fragment
    0 0 23.4 urgent
    0 0 24 land
    0 0 25 num_compromised
    0.042 -0.085 26.2 dst_host_srv_count
    0 0 26.8 su_attempted
    0 0 27 dst_host_srv_rerror_rate
    0 0 31 srv_rerror_rate
    0 0 31.2 is_guest_login
    0 0 31.8 dst_host_same_srv_rate
    0 0 32.4 num_root
    0 0 32.8 rerror_rate
    0 0 34.6 dst_host_rerror_rate
    0 0 35.8 num_file_creations
    0 0 36.6 num_outbound_cmds
    0 0 37.8 num_shells
    0 0 38.8 num_access_files
    0 0 40 is_host_login
    Table 2.2.1: Ranks of features based on Information Gain algorithm
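    For readers unfamiliar with the ranking criterion, information gain is the entropy of the class labels minus the weighted entropy that remains after splitting on an attribute. The sketch below computes it for a discrete attribute on a small toy example. The data is hypothetical, and the tool used for Table 2.2.1 may discretize continuous attributes and average over folds differently.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Label entropy minus the weighted entropy after splitting on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Toy data (hypothetical, not taken from the KDD sample).
protocol = ["icmp", "icmp", "tcp", "tcp", "tcp", "udp"]
attack = ["DoS", "DoS", "Normal", "DoS", "Normal", "Normal"]
print(round(information_gain(protocol, attack), 3))
```

    An attribute whose values separate the classes perfectly has a gain equal to the label entropy, while an attribute with a single value has a gain of zero, which matches the zero-merit rows in the table.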
  3. Results
    3.1 RandSamp set:
    Applying the decision tree to this dataset produces the following tree:
    Figure 3.1.1: Tree of RandSamp sample classifier
    The performance measures of this classifier differ between the two evaluation models, as shown:
    Performance when tested on external dataset: accuracy: 98.50%
    Figure 3.1.2: Performance of RandSamp sample classifier tested on external test set
    Performance when using 10-fold cross validation
    Figure 3.1.3: Performance of RandSamp sample classifier with 10-fold cross validation
    As can be seen in Figure 3.1.2 and Figure 3.1.3, the cross
    validation model was slightly better than the one tested on the (n=200) test dataset,
    by 0.15%.
    Both classifiers misclassify the three Probe observations because there are too few
    Probe attacks in both sets. Our sampling technique in this task is random
    sampling, which reproduces the distribution of the population dataset, so both
    RandSamp and the population dataset lack Probe attacks compared to the other classes.
    Due to the number of observations and the sampling technique, of the five
    types of attacks we only have three, one of which is underrepresented.
    Even though R2L, U2R, and Probe attacks are rare in the population
    dataset compared to Normal and DoS, we can produce better results
    by increasing the number of observations of those underrepresented attacks, which
    creates a valid representation of all classes. We will try to tackle this lack of
    representation in the next two samples.
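    The effect of a rare class on accuracy can be illustrated with a small confusion matrix. The sketch below uses the class sizes of the TSample test set (37 Normal, 160 DoS, 3 Probe) and assumes, purely for illustration, that every Normal and DoS row is classified correctly and that the three Probe rows are predicted as DoS. The actual confusion matrix is in the figures and may differ in detail.

```python
def accuracy(confusion):
    """Fraction of rows on the diagonal of confusion[actual][predicted]."""
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total


def recall(confusion, cls):
    """Fraction of rows of class cls that were predicted as cls."""
    row = confusion[cls]
    return row.get(cls, 0) / sum(row.values())


# confusion[actual][predicted], class sizes from the TSample test set.
# ASSUMPTION: Normal and DoS are all correct, Probe rows are predicted as DoS.
confusion = {
    "Normal": {"Normal": 37},
    "DoS": {"DoS": 160},
    "Probe": {"DoS": 3},
}

print(f"accuracy     = {accuracy(confusion):.2%}")
print(f"Probe recall = {recall(confusion, 'Probe'):.2%}")
```

    Under these assumptions the accuracy is 197/200 = 98.50% even though Probe recall is 0%, which is why overall accuracy alone hides the failure on the rare class.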
    3.2 UniformDist_1 sample set:
    Applying the decision tree to this dataset produces the following tree:
    Figure 3.2.1: Tree of UniformDist1 sample classifier
    The performance measures of this classifier differ between the two evaluation models, as shown:
    Performance when tested on external dataset: accuracy: 94.50%
    Figure 3.2.2: Performance of UniformDist1 sample classifier tested on external test set
    Performance when using 10-fold cross validation
    Figure 3.2.3: Performance of UniformDist1 sample classifier with 10-fold cross validation
    Here the two models give different accuracy results. The accuracy
    when we tested our model on the external dataset is 94.50%, while the accuracy
    with 10-fold cross validation dropped to 89.25%. The
    model evaluated on the external test set performs better for two reasons. First, the
    test set has n=200 and follows the distribution of the population data, so
    with that number of observations the classes U2R and
    R2L are barely represented in it. Second, even if the classifier misclassified all
    the observations from both classes, U2R and R2L, the correctly classified observations from
    the other classes would outweigh the effect of those misclassifications.
    On the other hand, the performance of the second model, run with 10-fold
    cross validation, is lower than that of the one tested on the test dataset by about 5 percentage points. The significant
    drops in recall occurred for the following classes: Normal 78.57%, U2R
    80.77%, and R2L 88.68%. The precision measure shows a sharp drop
    on the class U2R. Table 2.1.2 shows that we chose all the observations
    in the population dataset for the classes U2R (52 observations) and R2L (106). Yet that is not
    enough to improve the performance. The two classes together total 158
    observations, less than half of the sample. Therefore, because Normal,
    U2R, and R2L are inadequately represented in the distribution, the performance is low.
    3.3 UniformDist_2 sample set:
    Applying the decision tree to UniformDist2 produces the following tree:
    Figure 3.3.1: Tree of UniformDist2 sample classifier
    The classifier performs very well on this sample compared to the other samples we
    created. The following figures show the performance of our classifier on this specific sample.
    Figure 3.3.2: Performance of UniformDist2 sample classifier tested on external test set
    Figure 3.3.3: Performance of UniformDist2 sample classifier with 10-fold cross validation
    On this sample the classifier outperforms its results on all the other samples.
    The accuracy of the classifier tested on the external test set is 98.50%, which is
    better than the accuracy with 10-fold cross validation, 94.50%.
    Looking at the tree in Figure 3.3.1, it is clear that in the branch that starts from the root node where the value
    is more than 52500, all the leaves are for the classes that are not represented
    sufficiently, such as R2L and U2R. However, the classifier (cross validation) achieved decent
    accuracy, misclassifying only 4 observations from both classes out of 30.
  4. Overall Discussion:
    In the trees from the second and third samples,
    the majority of the nodes test the attributes src_bytes and dst_bytes, which indicates
    that both play a large role in identifying the type of attack. The attribute duration also
    shows an association with both src_bytes and dst_bytes, which could imply the importance
    of these three attributes to any feature selection on this dataset.
    The distribution of the classes plays a role in the accuracy of the
    model. Assessing the classifier on an external test set is also important if we need to apply
    the classifier to new, unseen data.
    Moreover, it is important to select the right set of features for this problem, since
    the number of predictors is relatively large.
    To enhance the model, we could remove the redundancy from the
    dataset, which could result in samples with better class distributions, especially the realistic samples
    that are created to represent almost the same distribution as the original data.
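    One simple reading of "removing the redundancy" is dropping exact duplicate records before sampling. The following is a minimal sketch under that assumption; the function name and the toy rows are hypothetical, and other notions of redundancy (for example near-duplicates) would need a different approach.

```python
def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy rows (hypothetical): protocol, service, src_bytes, dst_bytes, label.
rows = [
    ["icmp", "ecr_i", 1032, 0, "smurf"],
    ["icmp", "ecr_i", 1032, 0, "smurf"],
    ["tcp", "http", 215, 45076, "normal"],
    ["icmp", "ecr_i", 1032, 0, "smurf"],
]
unique_rows = drop_duplicates(rows)
print(len(rows), "->", len(unique_rows))
```

    Because duplicates are removed per record, classes dominated by repeated records shrink the most, which changes the class distribution that a random sample would reproduce.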
    [1] Stolfo, S. J., Fan, W., Lee, W., Prodromidis, A., & Chan, P. K. (2000). Cost-based
    modeling for fraud and intrusion detection: Results from the JAM project.
    In DARPA Information Survivability Conference and Exposition, 2000.
    DISCEX'00. Proceedings (Vol. 2, pp. 130-144). IEEE.
    [2] Thomas, C., Sharma, V., & Balakrishnan, N. (2008, March). Usefulness of DARPA
    dataset for intrusion detection system evaluation. In SPIE Defense and Security
    Symposium (pp. 69730G-69730G). International Society for Optics and Photonics.
    [3] Paliwal, S., & Gupta, R. (2012). Denial-of-Service, Probing & Remote to User (R2L)
    Attack Detection using Genetic Algorithm. International Journal of Computer
    Applications, 60.