BREAST CANCER DETECTION USING MACHINE LEARNING
In this blog, we are going to learn the following things:
- How to predict whether a woman's breast tumor is cancerous or not by using machine learning
- Uploading our data using ipywidgets
- Data understanding and visualization
- Plotting various kinds of plots, such as heatmaps and pair plots, using seaborn and matplotlib
- Different machine learning techniques like RandomForestClassifier, KNeighborsClassifier, SVC
The first thing we need is the data on which the machine learning will be done. You can download it from Kaggle.
Our dataset contains the following parameters, listed here with their meanings:
Attribute Information:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
All feature values are recoded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
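The field numbering described above (10 base features, each reported as mean, standard error, and worst) can be sketched in a few lines of plain Python. The column spellings below, such as radius_mean, are an assumption made for illustration, so check them against the header of your own CSV file.

```python
# Base feature names, in the order listed above (a through j).
# The exact spellings are an assumption for illustration only.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]


def build_field_names():
    """Map 1-based field numbers to names: 1 = id, 2 = diagnosis, 3-32 = features."""
    fields = {1: "id", 2: "diagnosis"}
    field_number = 3
    for stat in STATISTICS:
        for feature in BASE_FEATURES:
            fields[field_number] = f"{feature}_{stat}"
            field_number += 1
    return fields


field_names = build_field_names()
print(len(field_names) - 2)  # number of feature columns
print(field_names[3], field_names[13], field_names[23])
```

This reproduces the statement that field 3 is the mean radius, field 13 the radius standard error, and field 23 the worst radius.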
Since we are not doctors, all you need to know is that malignant means the woman has cancer and benign means she does not.
The first thing we have to do is import the important modules that we are going to use [we will import the machine learning modules later on].
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
Upload our data into Jupyter Notebook
import ipywidgets as widgets
from IPython.display import display
uploader = widgets.FileUpload(
    accept='.csv',   # accepted file extension, e.g. '.txt', '.pdf', 'image/*', 'image/*,.pdf'
    multiple=False   # True to accept multiple file uploads, else False
)
display(uploader)
We will see an upload button once we run this code.
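import io
import pandas as pd
# Note: this indexing assumes ipywidgets 7.x, where uploader.value is a dict keyed by file name
input_file = list(uploader.value.values())[0]
content = input_file['content']                      # raw bytes of the uploaded file
content = io.StringIO(content.decode('utf-8'))       # wrap the decoded text in a file-like object
df = pd.read_csv(content)
df.head()                                            # show the first five rows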
The last line above shows our data in the Jupyter notebook, with all the parameters and the values they contain.
Note: We have stored our data in a variable called df, but you can change the variable name as per your requirement.
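The layout of uploader.value depends on the installed ipywidgets version, so the indexing used above may need adjusting. Below is a small sketch of a helper, hypothetically named first_upload_bytes, that accepts either layout. It rests on the assumption that ipywidgets 7.x returns a dict keyed by file name and ipywidgets 8.x returns a tuple of per-file dicts whose 'content' may be a memoryview. The stand-in values only mimic those layouts so the code can be tried without a widget.

```python
import io


def first_upload_bytes(value):
    """Return the raw bytes of the first uploaded file.

    Assumption: ipywidgets 7.x gives a dict keyed by file name, while
    ipywidgets 8.x gives a tuple of dicts with a 'content' entry.
    """
    if isinstance(value, dict):
        files = list(value.values())
    else:
        files = list(value)
    if not files:
        raise ValueError("No file has been uploaded yet")
    # bytes() accepts both bytes and memoryview objects
    return bytes(files[0]["content"])


# Stand-in values that mimic the two layouts (made up for this sketch).
csv_bytes = b"id,diagnosis\n1,M\n2,B\n"
old_style = {"data.csv": {"metadata": {"name": "data.csv"}, "content": csv_bytes}}
new_style = ({"name": "data.csv", "content": memoryview(csv_bytes)},)

text_buffer = io.StringIO(first_upload_bytes(new_style).decode("utf-8"))
print(text_buffer.read())
```

In the notebook, the same buffer could then be passed to pd.read_csv in place of content.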
After that we will type df.describe(). For each parameter it shows the count of values, mean, standard deviation, minimum, 25%, 50% (the median), 75%, and maximum. It tells us how spread out our data is.
df.shape # tells how many rows and columns are present in our data; in our case it is (569, 33)
df.info() # shows the data types, the number of non-null values, the memory used by our data, and the number of rows and columns
df.isnull().sum() # counts the null values in each column. Note: null values are of no use to us here, so we need to remove them
After running the last command, you will observe that the column Unnamed: 32 contains only null values, so we will drop this column.
df =df.drop(columns='Unnamed: 32') # this will drop this column
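Here is a minimal sketch of the same null check on a tiny made-up frame, so you can see what isnull().sum() reports. It also shows dropna(axis=1, how="all"), which is an alternative to naming the column by hand and is not what the code above uses.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; the last column mimics an empty trailing CSV column.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "radius_mean": [17.99, 20.57, 19.69],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

null_counts = sample.isnull().sum()          # number of nulls in each column
cleaned = sample.dropna(axis=1, how="all")   # drop columns where every value is null

print(null_counts)
print(list(cleaned.columns))
```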
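Data correlation (in simple terms, how one parameter changes along with another parameter). We will use a heatmap to visualise it with colors.
corr = df.corr(numeric_only=True) # numeric_only skips the text column diagnosis (needs pandas 1.5 or newer)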
plt.figure(figsize=(20,10))
sns.heatmap(corr, annot=True)
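By default, df.corr() computes the Pearson correlation coefficient for each pair of columns. To make the numbers in the heatmap less mysterious, here is a sketch of that formula in plain Python, applied to made-up circle measurements rather than the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up example: perfect circles of growing radius.
radii = [1.0, 2.0, 3.0, 4.0]
perimeters = [2 * math.pi * r for r in radii]
areas = [math.pi * r ** 2 for r in radii]

print(pearson(radii, perimeters))  # linear relationship
print(pearson(radii, areas))       # strong, but not perfectly linear
```

A value close to 1 means the two parameters rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship. This is why radius, perimeter, and area show up as strongly correlated cells in the heatmap.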
# one hot encoding
df = pd.get_dummies(data=df, drop_first=True) # converts the diagnosis text column into a 1/0 column (diagnosis_M), which machine learning algorithms can use easily
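To see what the encoding does, here is a small sketch on a made-up two-column frame. With drop_first=True the B category is dropped and only diagnosis_M remains, so 1 means malignant and 0 means benign. The cast to int is there because, depending on the pandas version, the new column may be stored as True/False instead of 1/0.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "radius_mean": [17.99, 13.54, 20.57],
    "diagnosis": ["M", "B", "M"],
})

encoded = pd.get_dummies(data=sample, drop_first=True)
# The dummy column may be bool or uint8 depending on the pandas version.
encoded["diagnosis_M"] = encoded["diagnosis_M"].astype(int)

print(encoded)
```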
Just in case you want to plot more graphs and visualise more, just write this simple command. It will do all the work and also create a report for you [my favourite command].
from pandas_profiling import ProfileReport # newer releases of this package are published as ydata-profiling (import name ydata_profiling)
df.profile_report()
We can also draw a pairplot in our Jupyter notebook by using this simple command:
sns.pairplot(df)
We have done enough visualisation. Now it's show time: let's do our machine learning.
Keep in mind that this is a supervised machine learning classification project.
First we have to create the inputs (x) and the results (y) from our data:
x = df.iloc[:,1:-1].values # all rows, and all columns except id (first) and diagnosis (last)
y = df.iloc[:,-1].values # all rows of the last column, diagnosis_M (1 = malignant, 0 = benign)
Splitting the dataset into training data and testing data. Notice that test_size is 0.2, which means 20% of all our data will be used as testing data. random_state=0 means that whenever we run the code we will get the same split.
- x_train: the feature values used to train our model
- y_train: the known results for x_train
- x_test: the feature values held back to test our model
- y_test: the known results for x_test; after training, we compare the model's predictions with these to see how accurate our model is
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
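As a quick check of what the split produces, here is a sketch that uses placeholder arrays with the same shape as our data (569 rows, 30 features). The numbers inside the arrays are made up. Only the shapes and the effect of random_state matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the real data.
x_demo = np.arange(569 * 30).reshape(569, 30)
y_demo = np.arange(569)

first = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)
second = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

x_train_demo, x_test_demo, y_train_demo, y_test_demo = first
print(x_train_demo.shape, x_test_demo.shape)

# The same random_state gives exactly the same rows in the test set.
same_split = np.array_equal(first[3], second[3])
print(same_split)
```

With 569 rows, 20% rounds to 114 test rows, which matches the support of 114 shown in the reports further down.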
The first machine learning model we are going to use is RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rfc=RandomForestClassifier()
rfc.fit(x_train,y_train)
Here we have stored our RandomForestClassifier in a variable called rfc. The call rfc.fit(x_train, y_train) is where all the magic of machine learning happens: the model learns from the training data.
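One thing to be aware of: a random forest builds its trees with some randomness, so RandomForestClassifier() without a random_state can give slightly different scores on each run, even when the train/test split is fixed. The sketch below uses made-up data from make_classification (not our dataset) and an arbitrary n_estimators=50 to show that fixing random_state makes two fits agree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Made-up classification data, used only to demonstrate reproducibility.
x_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

forest_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_toy, y_toy)
forest_b = RandomForestClassifier(n_estimators=50, random_state=0).fit(x_toy, y_toy)

# Same data and same random_state: the predicted probabilities match exactly.
identical = bool((forest_a.predict_proba(x_toy) == forest_b.predict_proba(x_toy)).all())
print(identical)
```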
After training we have to see how good our model is
y_pred=rfc.predict(x_test)
from sklearn.metrics import classification_report,confusion_matrix # the only two metrics functions used below
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print("Training Score: ",rfc.score(x_train,y_train)*100)
              precision    recall  f1-score   support

           0       0.98      0.97      0.98        67
           1       0.96      0.98      0.97        47

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

[[65  2]
 [ 1 46]]
Training Score: 100.0
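We can see different metrics such as precision, recall, f1-score, the confusion matrix, and the accuracy of our model. Here 0 stands for benign and 1 for malignant. The training score of 100 next to a test accuracy of 0.97 tells us the model fits the data it was trained on a little better than data it has not seen.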
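The numbers in the report can be worked out by hand from the confusion matrix printed above. Here is a plain Python sketch of that arithmetic. It assumes scikit-learn's layout, where rows are the true classes and columns are the predicted classes. The function name metrics_from_confusion is made up for this example.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy plus per-class precision, recall and f1.

    Rows are true classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    report = {"accuracy": correct / total}
    for label in range(len(matrix)):
        predicted_as_label = sum(row[label] for row in matrix)  # column sum
        actually_label = sum(matrix[label])                     # row sum
        precision = matrix[label][label] / predicted_as_label
        recall = matrix[label][label] / actually_label
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Confusion matrix of the random forest from the output above.
rfc_report = metrics_from_confusion([[65, 2], [1, 46]])
print(round(rfc_report["accuracy"], 2))
print(round(rfc_report[1]["precision"], 2), round(rfc_report[1]["recall"], 2))
```

For class 1 (malignant), recall is 46 out of 47, which means the model missed one malignant case in the test set.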
Machine Learning With KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)
from sklearn.metrics import classification_report,confusion_matrix # the only two metrics functions used below
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print("Training Score: ",knn.score(x_train,y_train)*100)
print(knn.score(x_test,y_test))
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        67
           1       0.94      0.94      0.94        47

    accuracy                           0.95       114
   macro avg       0.95      0.95      0.95       114
weighted avg       0.95      0.95      0.95       114

[[64  3]
 [ 3 44]]
Training Score: 93.4065934065934
0.9473684210526315
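The introduction also lists SVC, which follows the same fit and predict pattern as the two models above. Here is a minimal sketch, not a tuned model. To keep it runnable on its own, it loads scikit-learn's bundled copy of the Wisconsin diagnostic data instead of our uploaded CSV, and in that copy the labels are reversed (0 = malignant, 1 = benign). The StandardScaler step is my own addition, based on the assumption that SVC works better when all features are on a similar scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: scikit-learn's bundled copy (labels: 0 = malignant, 1 = benign).
x_demo, y_demo = load_breast_cancer(return_X_y=True)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)

# Scale the features, then fit a support vector classifier with default settings.
svc = make_pipeline(StandardScaler(), SVC())
svc.fit(x_tr, y_tr)

y_pred_svc = svc.predict(x_te)
print(classification_report(y_te, y_pred_svc))
print(confusion_matrix(y_te, y_pred_svc))

test_accuracy = svc.score(x_te, y_te)
print("Test Score: ", test_accuracy * 100)
```

With our own data, you would pass x_train, y_train, x_test, and y_test from the split above instead of the stand-in arrays.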
If you are not getting it, then ask. I am glad to help.