Importing libraries¶
In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Loading the CSV file used in this analysis¶
In [33]:
df = pd.read_csv('students_score.csv')
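The file seems to carry a saved row index as an unnamed first column, which pandas loads as a data column called "Unnamed: 0". A minimal sketch, assuming that column is only a leftover index: passing `index_col=0` to `read_csv` keeps it out of the data columns. The inline CSV text is a made-up stand-in for students_score.csv.

```python
import io
import pandas as pd

# Made-up stand-in for students_score.csv: the first column has an empty
# header, which is how a saved DataFrame index usually appears in a CSV.
csv_text = """,Gender,MathScore
0,female,71
1,female,69
2,male,45
"""

# Default read: the unnamed first column becomes a column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(indexed.columns.tolist())
```

In this dataset the column's values appear to repeat (the describe() output reports a maximum of 999 for 30641 rows), so it would not make a unique index; dropping the column, as this notebook does, is an equally reasonable choice.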
Printing the first five rows¶
In [34]:
print(df.head())
Unnamed: 0 Gender EthnicGroup ParentEduc LunchType TestPrep \
0 0 female NaN bachelor's degree standard none
1 1 female group C some college standard NaN
2 2 female group B master's degree standard none
3 3 male group A associate's degree free/reduced none
4 4 male group C some college standard none
ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans \
0 married regularly yes 3.0 school_bus
1 married sometimes yes 0.0 NaN
2 single sometimes yes 4.0 school_bus
3 married never no 1.0 NaN
4 married sometimes yes 0.0 school_bus
WklyStudyHours MathScore ReadingScore WritingScore
0 < 5 71 71 74
1 5 - 10 69 90 88
2 < 5 87 93 91
3 5 - 10 45 56 42
4 5 - 10 76 78 75
Printing summary statistics for the numeric columns: count, mean, standard deviation, min, quartiles, and max¶
In [35]:
print(df.describe())
Unnamed: 0 NrSiblings MathScore ReadingScore WritingScore
count 30641.000000 29069.000000 30641.000000 30641.000000 30641.000000
mean 499.556607 2.145894 66.558402 69.377533 68.418622
std 288.747894 1.458242 15.361616 14.758952 15.443525
min 0.000000 0.000000 0.000000 10.000000 4.000000
25% 249.000000 1.000000 56.000000 59.000000 58.000000
50% 500.000000 2.000000 67.000000 70.000000 69.000000
75% 750.000000 3.000000 78.000000 80.000000 79.000000
max 999.000000 7.000000 100.000000 100.000000 100.000000
Printing the first two rows¶
In [36]:
print(df.head(2))
Unnamed: 0 Gender EthnicGroup ParentEduc LunchType TestPrep \
0 0 female NaN bachelor's degree standard none
1 1 female group C some college standard NaN
ParentMaritalStatus PracticeSport IsFirstChild NrSiblings TransportMeans \
0 married regularly yes 3.0 school_bus
1 married sometimes yes 0.0 NaN
WklyStudyHours MathScore ReadingScore WritingScore
0 < 5 71 71 74
1 5 - 10 69 90 88
Displaying the first two rows; without print() the notebook renders them as a table¶
In [37]:
(df.head(2))
Out[37]:
| | Unnamed: 0 | Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | female | NaN | bachelor's degree | standard | none | married | regularly | yes | 3.0 | school_bus | < 5 | 71 | 71 | 74 |
| 1 | 1 | female | group C | some college | standard | NaN | married | sometimes | yes | 0.0 | NaN | 5 - 10 | 69 | 90 | 88 |
Displaying the last two rows as a table¶
In [38]:
df.tail(2)
Out[38]:
| | Unnamed: 0 | Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30639 | 934 | female | group D | associate's degree | standard | completed | married | regularly | no | 3.0 | school_bus | 5 - 10 | 82 | 90 | 93 |
| 30640 | 960 | male | group B | some college | standard | none | married | never | no | 1.0 | school_bus | 5 - 10 | 64 | 60 | 58 |
Printing the last five rows¶
In [39]:
print(df.tail())
Unnamed: 0 Gender EthnicGroup ParentEduc LunchType \
30636 816 female group D high school standard
30637 890 male group E high school standard
30638 911 female NaN high school free/reduced
30639 934 female group D associate's degree standard
30640 960 male group B some college standard
TestPrep ParentMaritalStatus PracticeSport IsFirstChild NrSiblings \
30636 none single sometimes no 2.0
30637 none single regularly no 1.0
30638 completed married sometimes no 1.0
30639 completed married regularly no 3.0
30640 none married never no 1.0
TransportMeans WklyStudyHours MathScore ReadingScore WritingScore
30636 school_bus 5 - 10 59 61 65
30637 private 5 - 10 58 53 51
30638 private 5 - 10 61 70 67
30639 school_bus 5 - 10 82 90 93
30640 school_bus 5 - 10 64 60 58
df.info() - lists each column's dtype and non-null count, which makes columns with missing values easy to spot¶
In [40]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 30641 non-null int64
1 Gender 30641 non-null object
2 EthnicGroup 28801 non-null object
3 ParentEduc 28796 non-null object
4 LunchType 30641 non-null object
5 TestPrep 28811 non-null object
6 ParentMaritalStatus 29451 non-null object
7 PracticeSport 30010 non-null object
8 IsFirstChild 29737 non-null object
9 NrSiblings 29069 non-null float64
10 TransportMeans 27507 non-null object
11 WklyStudyHours 29686 non-null object
12 MathScore 30641 non-null int64
13 ReadingScore 30641 non-null int64
14 WritingScore 30641 non-null int64
dtypes: float64(1), int64(4), object(10)
memory usage: 3.5+ MB
Counting the null values in each column¶
In [41]:
df.isnull().sum()
Out[41]:
Unnamed: 0 0
Gender 0
EthnicGroup 1840
ParentEduc 1845
LunchType 0
TestPrep 1830
ParentMaritalStatus 1190
PracticeSport 631
IsFirstChild 904
NrSiblings 1572
TransportMeans 3134
WklyStudyHours 955
MathScore 0
ReadingScore 0
WritingScore 0
dtype: int64
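Raw null counts are easier to compare as a share of all rows (for example, 3134 missing TransportMeans values out of 30641 rows is roughly 10%). A minimal sketch of that calculation, using a small made-up frame named `sample` in place of df:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for df.
sample = pd.DataFrame({
    "Gender": ["female", "male", "male", "female"],
    "TransportMeans": ["school_bus", np.nan, "private", np.nan],
    "NrSiblings": [3.0, 0.0, np.nan, 1.0],
})

# isnull() gives True/False per cell; the mean of booleans is the
# fraction of True values, so multiplying by 100 gives a percentage.
missing_pct = (sample.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```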
Dropping the "Unnamed: 0" column. drop() returns a new DataFrame, which is assigned back to df here; the CSV file on disk is not changed¶
In [42]:
df = df.drop("Unnamed: 0",axis = 1)
In [43]:
df.head()
Out[43]:
| | Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | NaN | bachelor's degree | standard | none | married | regularly | yes | 3.0 | school_bus | < 5 | 71 | 71 | 74 |
| 1 | female | group C | some college | standard | NaN | married | sometimes | yes | 0.0 | NaN | 5 - 10 | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | single | sometimes | yes | 4.0 | school_bus | < 5 | 87 | 93 | 91 |
| 3 | male | group A | associate's degree | free/reduced | none | married | never | no | 1.0 | NaN | 5 - 10 | 45 | 56 | 42 |
| 4 | male | group C | some college | standard | none | married | sometimes | yes | 0.0 | school_bus | 5 - 10 | 76 | 78 | 75 |
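To make the behaviour of drop() concrete, here is a small sketch on a made-up two-row frame (`sample` and `trimmed` are illustrative names, not part of the notebook). It shows that the frame drop() is called on keeps its column unless the result is assigned back.

```python
import pandas as pd

# Made-up frame with the same leftover index column.
sample = pd.DataFrame({"Unnamed: 0": [0, 1], "MathScore": [71, 69]})

# drop() returns a new DataFrame; `sample` itself still has the column.
trimmed = sample.drop(columns="Unnamed: 0")

print(sample.columns.tolist())
print(trimmed.columns.tolist())
```

`drop(columns=...)` is equivalent to `drop(..., axis=1)` and may read a little more clearly.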
For this analysis I relabel the WklyStudyHours value "5 - 10" as "> 5"¶
In [44]:
df["WklyStudyHours"] = df["WklyStudyHours"].str.replace("5 - 10","> 5")
In [45]:
df.head()
Out[45]:
| | Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | NaN | bachelor's degree | standard | none | married | regularly | yes | 3.0 | school_bus | < 5 | 71 | 71 | 74 |
| 1 | female | group C | some college | standard | NaN | married | sometimes | yes | 0.0 | NaN | > 5 | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | single | sometimes | yes | 4.0 | school_bus | < 5 | 87 | 93 | 91 |
| 3 | male | group A | associate's degree | free/reduced | none | married | never | no | 1.0 | NaN | > 5 | 45 | 56 | 42 |
| 4 | male | group C | some college | standard | none | married | sometimes | yes | 0.0 | school_bus | > 5 | 76 | 78 | 75 |
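An alternative sketch for the relabeling, assuming only whole cell values should change: `Series.replace` with a dictionary swaps exact matches and leaves missing values alone. The `hours` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values covering every WklyStudyHours category plus a missing one.
hours = pd.Series(["< 5", "5 - 10", "> 10", np.nan, "5 - 10"])

# A dict passed to replace() maps whole values: only cells equal to
# "5 - 10" change; "< 5", "> 10" and NaN pass through untouched.
relabeled = hours.replace({"5 - 10": "> 5"})

print(relabeled.tolist())
```

One thing to keep in mind when reading later results: the new label "> 5" still means the 5 to 10 hour bucket, since "> 10" remains a separate category.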
Showing the ParentMaritalStatus counts as a bar chart¶
In [46]:
# sns.countplot(data = df , x = 'Gender' , y = 'ParentMaritalStatus')
a = sns.countplot(data = df , x = 'ParentMaritalStatus')
a.bar_label(a.containers[0])
plt.show()
Showing the TransportMeans counts as a bar chart¶
In [47]:
b = sns.countplot(data = df , x = 'TransportMeans')
b.bar_label(b.containers[0])  # label the bars of this plot (b), not the earlier plot a
plt.show()
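Each Axes keeps its own bar containers, so the labels should come from the plot that was just drawn. A minimal sketch of that pattern on made-up data; looping over `ax.containers` is my own choice here so that every bar gets a label regardless of how many containers seaborn creates.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data: three school_bus rows and two private rows.
sample = pd.DataFrame(
    {"TransportMeans": ["school_bus", "private", "school_bus", "school_bus", "private"]}
)

ax = sns.countplot(data=sample, x="TransportMeans")

# Label every bar container that belongs to this Axes.
labels = []
for container in ax.containers:
    labels.extend(ax.bar_label(container))

print([t.get_text() for t in labels])
plt.close("all")
```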
Showing the MathScore counts as a bar chart¶
In [48]:
plt.figure(figsize = (24,8))
c = sns.countplot(data = df , x = 'MathScore')
c.bar_label(c.containers[0])  # label the bars of this plot (c), not the earlier plot a
plt.show()
Showing the mean MathScore, ReadingScore, and WritingScore by parents' education¶
In [49]:
gb = df.groupby("ParentEduc").agg({"MathScore":'mean',"ReadingScore":'mean',"WritingScore":'mean'})
gb
Out[49]:
| ParentEduc | MathScore | ReadingScore | WritingScore |
|---|---|---|---|
| associate's degree | 68.365586 | 71.124324 | 70.299099 |
| bachelor's degree | 70.466627 | 73.062020 | 73.331069 |
| high school | 64.435731 | 67.213997 | 65.421136 |
| master's degree | 72.336134 | 75.832921 | 76.356896 |
| some college | 66.390472 | 69.179708 | 68.501432 |
| some high school | 62.584013 | 65.510785 | 63.632409 |
In [50]:
print(gb)
MathScore ReadingScore WritingScore
ParentEduc
associate's degree 68.365586 71.124324 70.299099
bachelor's degree 70.466627 73.062020 73.331069
high school 64.435731 67.213997 65.421136
master's degree 72.336134 75.832921 76.356896
some college 66.390472 69.179708 68.501432
some high school 62.584013 65.510785 63.632409
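The same per-group means can be written more compactly by selecting the score columns and calling `mean()`. A small sketch on made-up rows, which also checks that both spellings give the same table (`score_cols`, `by_agg` and `by_mean` are illustrative names):

```python
import pandas as pd

# Made-up rows: two students per education level.
sample = pd.DataFrame({
    "ParentEduc": ["high school", "high school", "master's degree", "master's degree"],
    "MathScore": [60, 70, 80, 90],
    "ReadingScore": [62, 68, 85, 95],
    "WritingScore": [61, 67, 88, 92],
})
score_cols = ["MathScore", "ReadingScore", "WritingScore"]

# The dictionary form used in this notebook.
by_agg = sample.groupby("ParentEduc").agg({c: "mean" for c in score_cols})

# Equivalent shorter form: pick the columns, then take the mean.
by_mean = sample.groupby("ParentEduc")[score_cols].mean()

print(by_mean.round(2))
```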
Showing a heatmap of the mean MathScore, ReadingScore, and WritingScore by parents' education¶
In [51]:
plt.figure(figsize=(8,6))
sns.heatmap(gb,annot= True)
plt.show()
In [52]:
plt.figure(figsize=(4,4))
sns.heatmap(gb,cmap="BuPu",annot= True)
plt.show()
Showing the mean values and a heatmap of MathScore, ReadingScore, and WritingScore by parents' marital status¶
In [53]:
gb1 = df.groupby("ParentMaritalStatus").agg({"MathScore":'mean',"ReadingScore":'mean',"WritingScore":'mean'})
In [54]:
gb1
Out[54]:
| ParentMaritalStatus | MathScore | ReadingScore | WritingScore |
|---|---|---|---|
| divorced | 66.691197 | 69.655011 | 68.799146 |
| married | 66.657326 | 69.389575 | 68.420981 |
| single | 66.165704 | 69.157250 | 68.174440 |
| widowed | 67.368866 | 69.651438 | 68.563452 |
In [55]:
plt.figure(figsize=(4,4))
sns.heatmap(gb1,cmap="viridis",annot= True)
plt.show()
In [77]:
# show the mean values and a heatmap of MathScore, ReadingScore, and WritingScore by number of siblings
In [78]:
gb2 = df.groupby("NrSiblings").agg({"MathScore":'mean',"ReadingScore":'mean',"WritingScore":'mean'})
In [79]:
gb2
Out[79]:
| NrSiblings | MathScore | ReadingScore | WritingScore |
|---|---|---|---|
| 0.0 | 66.819449 | 69.547812 | 68.746515 |
| 1.0 | 66.473896 | 69.259097 | 68.245345 |
| 2.0 | 66.554934 | 69.472018 | 68.522533 |
| 3.0 | 66.719092 | 69.488159 | 68.650498 |
| 4.0 | 66.245495 | 69.144169 | 68.073444 |
| 5.0 | 66.630303 | 69.453788 | 68.282576 |
| 6.0 | 65.917219 | 68.801325 | 67.860927 |
| 7.0 | 67.615120 | 69.828179 | 68.986254 |
In [80]:
plt.figure(figsize=(4,4))
sns.heatmap(gb2,cmap="viridis",annot= True)
plt.show()
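The sibling-count means differ by less than two points, so it may help to see how many students each mean is based on; the notebook does not show those group sizes. A sketch using pandas named aggregation on made-up rows (`mean_math` and `n_students` are names I chose):

```python
import pandas as pd

# Made-up rows: the 7-sibling group has a single student.
sample = pd.DataFrame({
    "NrSiblings": [0.0, 0.0, 1.0, 1.0, 1.0, 7.0],
    "MathScore": [70, 60, 65, 75, 70, 90],
})

# Named aggregation: each keyword becomes an output column built from
# a (source column, function) pair.
summary = sample.groupby("NrSiblings").agg(
    mean_math=("MathScore", "mean"),
    n_students=("MathScore", "size"),
)

print(summary)
```

A mean that rests on only a handful of students, as in the made-up 7-sibling row here, deserves less weight than one based on thousands.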
In [81]:
df.head(2)
Out[81]:
| | Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | female | NaN | bachelor's degree | standard | none | married | regularly | yes | 3.0 | school_bus | < 5 | 71 | 71 | 74 |
| 1 | female | group C | some college | standard | NaN | married | sometimes | yes | 0.0 | NaN | > 5 | 69 | 90 | 88 |
Showing the mean values and heatmaps of MathScore, ReadingScore, and WritingScore by whether the student is a first child and by weekly study hours¶
In [87]:
gb3 = df.groupby("IsFirstChild").agg({"MathScore":'mean',"ReadingScore":'mean',"WritingScore":'mean'})
In [88]:
gb4 = df.groupby("WklyStudyHours").agg({"MathScore":'mean',"ReadingScore":'mean',"WritingScore":'mean'})
gb4
Out[88]:
| WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|
| < 5 | 64.580359 | 68.176135 | 67.090192 |
| > 10 | 68.696655 | 70.365436 | 69.777778 |
| > 5 | 66.870491 | 69.660532 | 68.636280 |
In [89]:
gb3
Out[89]:
| IsFirstChild | MathScore | ReadingScore | WritingScore |
|---|---|---|---|
| no | 66.246832 | 69.132614 | 68.210887 |
| yes | 66.740646 | 69.542553 | 68.558484 |
In [90]:
sns.heatmap(gb3,cmap="viridis",annot=True)
# plt.figure(figsize=(4,4))
# sns.heatmap(gb2,cmap="viridis",annot= True)
# plt.show()
Out[90]:
<Axes: ylabel='IsFirstChild'>
In [64]:
sns.heatmap(gb4,cmap="viridis",annot=True)
# plt.figure(figsize=(4,4))
# sns.heatmap(gb2,cmap="viridis",annot= True)
# plt.show()
Out[64]:
<Axes: ylabel='WklyStudyHours'>
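groupby sorts the WklyStudyHours labels as text, which is why "> 10" lands before "> 5" in the table and the heatmap. A minimal sketch of putting the rows in study-time order with `reindex` before plotting; `gb4_like` is a stand-in holding the MathScore means from the table above, rounded to two decimals.

```python
import pandas as pd

# Stand-in for gb4: MathScore means rounded from the table above,
# in the alphabetical order that groupby produces.
gb4_like = pd.DataFrame(
    {"MathScore": [64.58, 68.70, 66.87]},
    index=pd.Index(["< 5", "> 10", "> 5"], name="WklyStudyHours"),
)

# reindex() returns the rows in the order given, here from least to most study time.
ordered = gb4_like.reindex(["< 5", "> 5", "> 10"])

print(ordered)
```

Passing `ordered` to sns.heatmap would then draw the rows from least to most weekly study time.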
Boxplot Based on weekly study hours¶
In [91]:
sns.boxplot(data = df, x = "WklyStudyHours")
plt.show()
Boxplot Based on Writing Score¶
In [92]:
sns.boxplot(data = df, x = "WritingScore")
plt.show()
Boxplot Based on Math Score¶
In [93]:
sns.boxplot(data = df, x = "MathScore")
plt.show()
Boxplot Based on Reading Score¶
In [68]:
sns.boxplot(data = df, x = "ReadingScore")
plt.show()
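The box in each boxplot spans the first to third quartile, and with the default whisker setting (1.5 times the interquartile range) points beyond the whisker limits are drawn as outliers. A sketch of that arithmetic on a made-up list of scores, so the plotted points can be checked numerically:

```python
import pandas as pd

# Made-up scores with one unusually low value.
scores = pd.Series([45, 56, 58, 61, 64, 69, 71, 76, 82, 87, 10])

# Quartiles mark the edges of the box.
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Whisker limits under the default 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the limits are what the plot shows as individual points.
outliers = scores[(scores < lower) | (scores > upper)]

print(q1, q3, lower, upper)
print(outliers.tolist())
```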
In [69]:
print(df["EthnicGroup"].unique())
[nan 'group C' 'group B' 'group A' 'group D' 'group E']
Counting the non-null values in every column for rows where EthnicGroup == group A¶
In [70]:
groupA = df.loc[(df["EthnicGroup"] == "group A")].count()
print(groupA)
Gender 2219
EthnicGroup 2219
ParentEduc 2078
LunchType 2219
TestPrep 2081
ParentMaritalStatus 2121
PracticeSport 2167
IsFirstChild 2168
NrSiblings 2096
TransportMeans 1999
WklyStudyHours 2146
MathScore 2219
ReadingScore 2219
WritingScore 2219
dtype: int64
Creating a pie chart of the EthnicGroup counts for group A through group E, with percentages shown as integers¶
In [100]:
groupA = df.loc[(df["EthnicGroup"] == "group A")].count()
groupB = df.loc[(df["EthnicGroup"] == "group B")].count()
groupC = df.loc[(df["EthnicGroup"] == "group C")].count()
groupD = df.loc[(df["EthnicGroup"] == "group D")].count()
groupE = df.loc[(df["EthnicGroup"] == "group E")].count()
mlist = [groupA["EthnicGroup"],groupB["EthnicGroup"] ,groupC["EthnicGroup"],groupD["EthnicGroup"] ,groupE["EthnicGroup"]]
l=['groupA','groupB','groupC','groupD','groupE']
# plt.pie(mlist, labels=l, autopct = "%1.2f%%")
print(mlist)
plt.pie(mlist, labels=l, autopct = "%1i%%")
plt.show()
[np.int64(2219), np.int64(5826), np.int64(9212), np.int64(7503), np.int64(4041)]
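The five separate `.loc[...].count()` lookups can also be done in one step. A minimal sketch, assuming only the per-group row counts are needed: `value_counts()` counts each label and skips missing values, and `sort_index()` puts the groups in A to E order. The `sample` frame is made up.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
sample = pd.DataFrame({
    "EthnicGroup": ["group C", "group B", "group C", np.nan,
                    "group A", "group C", "group B"],
})

# One call counts every group; NaN is excluded by default.
counts = sample["EthnicGroup"].value_counts().sort_index()

print(counts)
# counts.values and counts.index could then be passed to plt.pie as the
# sizes and labels.
```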
Creating the same pie chart with percentages shown to two decimal places¶
In [102]:
plt.pie(mlist, labels=l, autopct = "%1.2f%%")
plt.title("Distribution of Ethnic Group \n ")
plt.show()
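The percentages on the two pie charts can be reproduced by hand from the counts printed above. The sketch below applies the same two format strings to each share; note that `%1i` cuts off the decimals rather than rounding, so the integer labels need not add up to 100.

```python
# Counts printed by the notebook for group A through group E.
counts = {"groupA": 2219, "groupB": 5826, "groupC": 9212,
          "groupD": 7503, "groupE": 4041}
total = sum(counts.values())

int_labels = []
for label, n in counts.items():
    pct = 100 * n / total
    # Same format strings as the two autopct arguments.
    int_labels.append("%1i%%" % pct)
    print(label, "%1i%%" % pct, "%1.2f%%" % pct)

# Sum of the integer labels, to compare against 100.
int_total = sum(int(100 * n / total) for n in counts.values())
print(total, int_total)
```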
Creating a bar plot of the EthnicGroup counts, with each bar labeled by its count¶
In [105]:
l = sns.countplot(data = df ,x='EthnicGroup')
l.bar_label(l.containers[0])
Out[105]:
[Text(0, 0, '9212'), Text(0, 0, '5826'), Text(0, 0, '2219'), Text(0, 0, '7503'), Text(0, 0, '4041')]