Data Analysis of Iris Dataset: A Tutorial

The Iris dataset is a classic in the field of machine learning and statistics. It’s often used for practicing classification algorithms. The dataset contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. The samples are evenly distributed across three species of iris flowers (50 each): Setosa, Versicolor, and Virginica. In this article, we will analyze the Iris dataset using Python.

Steps for Data Analysis of Iris Dataset

Here are the steps we’ll follow for the complete data analysis of Iris dataset:

  1. Data Loading: Load the Iris dataset.
  2. Data Exploration: Explore the dataset to understand it better and figure out how to approach the classification problem.
  3. Data Visualization: Visualize the data to understand the relationships between features and species.
  4. Statistical Summary: Summarize the data statistically to extract insights.
  5. Correlation Analysis: Understand how features correlate with each other.

Let’s get started!

Packages required: scikit-learn (imported as sklearn), pandas, matplotlib, seaborn

Step 1: Data Loading

First, let’s load the Iris dataset.

Python
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target_names[iris_data.target]

# Show first few rows of the dataset
iris_df.head()

Output:

sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
              5.1               3.5                1.4               0.2   setosa
              4.9               3.0                1.4               0.2   setosa
              4.7               3.2                1.3               0.2   setosa
              4.6               3.1                1.5               0.2   setosa
              5.0               3.6                1.4               0.2   setosa

The dataset has been successfully loaded. It contains four feature columns:

  1. Sepal length (cm)
  2. Sepal width (cm)
  3. Petal length (cm)
  4. Petal width (cm)

Additionally, it has a ‘species’ column, which is the label we would typically predict in a machine learning task.
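As a quick sanity check on the loaded data, the sketch below confirms the shape of the DataFrame and the class balance mentioned in the introduction. It rebuilds the DataFrame so that it runs on its own; the name `species_counts` is only illustrative.

```python
from sklearn.datasets import load_iris
import pandas as pd

# Rebuild the DataFrame as in Step 1 so this snippet is self-contained
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target_names[iris_data.target]

# Number of rows and columns (four features plus the species label)
print(iris_df.shape)

# Number of samples per species
species_counts = iris_df['species'].value_counts()
print(species_counts)
```

If the three counts differ, something went wrong while building the 'species' column.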

Step 2: Data Exploration

Let’s start by checking for any missing values and understanding the basic statistics of the dataset.

Python
# Check for missing values
missing_values = iris_df.isnull().sum()

# Basic statistics of the dataset
statistics_summary = iris_df.describe()

missing_values, statistics_summary

Output:

(sepal length (cm)    0
 sepal width (cm)     0
 petal length (cm)    0
 petal width (cm)     0
 species              0
 dtype: int64,
        sepal length (cm)  sepal width (cm)  petal length (cm)  \
 count         150.000000        150.000000         150.000000   
 mean            5.843333          3.057333           3.758000   
 std             0.828066          0.435866           1.765298   
 min             4.300000          2.000000           1.000000   
 25%             5.100000          2.800000           1.600000   
 50%             5.800000          3.000000           4.350000   
 75%             6.400000          3.300000           5.100000   
 max             7.900000          4.400000           6.900000   
 
        petal width (cm)  
 count        150.000000  
 mean           1.199333  
 std            0.762238  
 min            0.100000  
 25%            0.300000  
 50%            1.300000  
 75%            1.800000  
 max            2.500000  )

Great! There are no missing values in the dataset, which simplifies things for us.

Here are some basic statistics for each of the feature columns:

  • Sepal length (cm): Ranges from 4.3 to 7.9 with a mean of approximately 5.84.
  • Sepal width (cm): Ranges from 2.0 to 4.4 with a mean of approximately 3.06.
  • Petal length (cm): Ranges from 1.0 to 6.9 with a mean of approximately 3.76.
  • Petal width (cm): Ranges from 0.1 to 2.5 with a mean of approximately 1.20.
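The ranges and means listed above can also be computed directly instead of being read off the describe() output. The following is a minimal sketch of one way to do that; the names `features_df` and `feature_summary` are made up for this example.

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load only the four numeric feature columns
iris_data = load_iris()
features_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

# Min, max and mean per feature; transpose so that each feature is a row
feature_summary = features_df.agg(['min', 'max', 'mean']).T

# Range = max - min, a rough measure of how spread out each feature is
feature_summary['range'] = feature_summary['max'] - feature_summary['min']

print(feature_summary.round(2))
```

Petal length has by far the widest range of the four features, which is one hint that it will be useful for telling the species apart.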

Step 3: Data Visualization

Let’s visualize the data to understand it better. We’ll use various types of plots to view the relationships between different features and species.

Python
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the aesthetics for the plots
sns.set(style="whitegrid")

# Create a pairplot to visualize the pairwise relationships in the dataset
pair_plot = sns.pairplot(iris_df, hue='species', markers=["o", "s", "D"])
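# Use .figure to reach the underlying Figure (.fig is deprecated in recent seaborn releases)
pair_plot.figure.suptitle('Pairplot of Iris Dataset', y=1.02)
plt.show()

Output:

[Figure: pair plot of the Iris dataset]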

The pairplot above shows the pairwise relationships between all numerical variables. Different colors represent different species of the Iris flower.

  • Diagonal: The diagonal shows the distribution of a single variable for each species. We can see that the Setosa species is easily distinguishable from the others based on petal length and petal width.
  • Off-diagonal: The off-diagonal plots show scatter plots between two variables. These can help us understand the correlation between two variables for each species.
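Box plots are another common way to compare a feature across species. The sketch below is one possible layout (a 2x2 grid with one panel per feature); the grid size, figure size, and title are arbitrary choices for this example and not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Rebuild the DataFrame as in Step 1 so this snippet is self-contained
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target_names[iris_data.target]

# One box plot per feature, arranged in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.flat, iris_data.feature_names):
    sns.boxplot(data=iris_df, x='species', y=feature, ax=ax)
    ax.set_title(feature)

fig.suptitle('Feature Distributions by Species')
fig.tight_layout()
plt.show()
```

In the petal panels the Setosa box should sit well below the other two, which matches what the diagonal of the pairplot suggests.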

Step 4: Statistical Summary

Let’s take a closer look at the statistical distribution of the features across the different species.

Python
# Group the data by species and calculate the mean and standard deviation for each group
grouped_mean = iris_df.groupby('species').mean()
grouped_std = iris_df.groupby('species').std()

grouped_mean, grouped_std

Output:

(            sepal length (cm)  sepal width (cm)  petal length (cm)  \
 species                                                              
 setosa                  5.006             3.428              1.462   
 versicolor              5.936             2.770              4.260   
 virginica               6.588             2.974              5.552   
 
             petal width (cm)  
 species                       
 setosa                 0.246  
 versicolor             1.326  
 virginica              2.026  ,
             sepal length (cm)  sepal width (cm)  petal length (cm)  \
 species                                                              
 setosa               0.352490          0.379064           0.173664   
 versicolor           0.516171          0.313798           0.469911   
 virginica            0.635880          0.322497           0.551895   
 
             petal width (cm)  
 species                       
 setosa              0.105386  
 versicolor          0.197753  
 virginica           0.274650  )

The table below summarizes the mean and standard deviation of each feature for the different species:

Feature            Setosa (Mean)  Versicolor (Mean)  Virginica (Mean)  Setosa (Std)  Versicolor (Std)  Virginica (Std)
Sepal length (cm)  5.006          5.936              6.588             0.352         0.516             0.636
Sepal width (cm)   3.428          2.770              2.974             0.379         0.314             0.322
Petal length (cm)  1.462          4.260              5.552             0.174         0.470             0.552
Petal width (cm)   0.246          1.326              2.026             0.105         0.198             0.275

  • Setosa: Has the shortest sepals and the smallest petals on average, but the widest sepals (mean 3.428 cm).
  • Versicolor: Is intermediate in sepal length, petal length, and petal width, but has the narrowest sepals on average (mean 2.770 cm).
  • Virginica: Has the largest mean sepal length, petal length, and petal width.
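The per-species statistics can also be collected into a single table with one agg() call. The sketch below does that and then uses the per-species minimum and maximum to check whether Setosa's petal lengths overlap with those of the other two species. The names `species_summary` and `setosa_separable` are illustrative only.

```python
from sklearn.datasets import load_iris
import pandas as pd

# Rebuild the DataFrame as in Step 1 so this snippet is self-contained
iris_data = load_iris()
iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
iris_df['species'] = iris_data.target_names[iris_data.target]

# Mean, std, min and max of every feature per species, in one table
species_summary = iris_df.groupby('species').agg(['mean', 'std', 'min', 'max'])

# Look at a single feature: one row per species, one column per statistic
petal_length = species_summary['petal length (cm)']
print(petal_length.round(3))

# Is the longest Setosa petal still shorter than the shortest petal of the other species?
other_species_min = petal_length.drop('setosa')['min'].min()
setosa_separable = bool(petal_length.loc['setosa', 'max'] < other_species_min)
print(setosa_separable)
```

A result of True means a single threshold on petal length is enough to separate Setosa from the other two species in this dataset.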

Step 5: Correlation Analysis of Iris Dataset

Finally, let’s check how the features correlate with each other.

Python
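# Calculate the correlation matrix on the numeric feature columns only
# (the text column 'species' makes corr() raise an error in pandas 2.0 and later)
correlation_matrix = iris_df.drop(columns='species').corr()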

# Generate a heatmap to visualize the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Iris Dataset')
plt.show()

Output:

[Figure: correlation heatmap of the Iris dataset]

The heatmap above shows the correlation matrix for the feature variables in the Iris dataset. The values range from -1 to 1, with -1 indicating a perfect negative correlation, 1 indicating a perfect positive correlation, and 0 indicating no linear correlation.
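To make these values less of a black box, here is a small sketch that computes the Pearson correlation coefficient for one pair of features by hand and compares it with the pandas result. It assumes the default Pearson method of pandas; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load only the four numeric feature columns
iris_data = load_iris()
features_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

x = features_df['petal length (cm)'].to_numpy()
y = features_df['petal width (cm)'].to_numpy()

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same pair via pandas, which uses the Pearson method by default
r_pandas = features_df['petal length (cm)'].corr(features_df['petal width (cm)'])

print(round(r_manual, 4), round(r_pandas, 4))
```

Both numbers should agree up to floating-point error and match the corresponding cell of the heatmap.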

Here are some observations:

  • Sepal Length is positively correlated with Petal Length and Petal Width, suggesting that longer sepals are associated with longer and wider petals.
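  • Sepal Width is weakly to moderately negatively correlated with the other features (about -0.12 with Sepal Length, -0.43 with Petal Length, and -0.37 with Petal Width). These values are computed over all three species pooled together, so the relationship within a single species can look different.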
  • Petal Length and Petal Width are strongly positively correlated, indicating that longer petals are generally also wider.
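Reading pairs off a heatmap gets tedious as the number of features grows. The sketch below shows one way to list every feature pair once, ranked by the absolute strength of its correlation. The names `upper_mask` and `ranked_pairs` are illustrative, and masking the upper triangle is just one of several ways to avoid duplicate pairs.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load only the four numeric feature columns and compute their correlations
iris_data = load_iris()
features_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
corr = features_df.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Sort by absolute strength, strongest first
ranked_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked_pairs.round(2))
```

With four features there are six pairs; petal length and petal width should come first, and sepal length with sepal width last.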

That wraps up our complete data analysis and visualization of the Iris dataset.