[Kaggle] What's Cooking?
Predicting the category of a dish's cuisine given a list of its ingredients.
What’s cooking? - Kaggle Competition
In this project I aim to predict the category of a dish's cuisine given a list of its ingredients.
From the official description:
"If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India's market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see. Some of our strongest geographic and cultural associations are tied to a region's local foods."
The public dataset is from the Kaggle competition What's Cooking?
The data is provided in JSON format. Each example contains a recipe identifier, the type of cuisine and a list of ingredients, with an average of 11 ingredients per recipe. The data consists of 39,774 unique recipes across 20 cuisines, with 428,275 ingredients in total (6,714 unique).
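As a quick sanity check of those figures, here is a minimal, self-contained sketch of the record layout and of how per-recipe ingredient counts can be computed. The two-recipe sample below is hypothetical (the real train.json is not reproduced here); only the field names match the description above.

```python
import io
import pandas as pd

# Hypothetical two-record sample mirroring the train.json layout:
# each record has an id, a cuisine label, and a list of ingredients.
sample = """[
  {"id": 1, "cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
  {"id": 2, "cuisine": "indian", "ingredients": ["paneer", "garam masala"]}
]"""

df = pd.read_json(io.StringIO(sample), orient='records')
print(len(df))                            # number of recipes
print(df['ingredients'].map(len).mean())  # average ingredients per recipe
```

The same two lines, run against the real file, reproduce the 39,774-recipe and ~11-ingredients-per-recipe figures quoted above.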
Importing the required packages.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_json('data/train.json', orient='records')
Number of recipes in the dataset.
print("Number of recipes:", len(data))
Let's see what the data looks like. For this, I'll look at the first 10 rows.
data[:10]
Let's have a look at Matar Paneer (my favourite dish).
matar_paneer = (data['id'] == 29172)
data[matar_paneer][['ingredients']]
cuisine = data['cuisine'].value_counts().index
data['cuisine'].value_counts()
# Kagglers love Italian food, I guess... lol
bargraph = data['cuisine'].value_counts().plot(kind='bar', title="Cuisine Distribution")
Let's collect all the ingredients in a list and print a few examples.
ingredientlist = []
for ingre in data['ingredients']:
    ingredientlist.extend(ingre)
print("Total ingredients used:", len(ingredientlist))
print("Unique ingredients:", len(set(ingredientlist)), '\n')
for i in range(20):
    print(ingredientlist[i])
Most used ingredients:
ingredients_series = pd.Series(ingredientlist)
ax1 = ingredients_series.value_counts().head(15).plot(kind='bar', \
title='15 Most Used Ingredients')
ax1.set_ylabel(" Number of Recipes")
Now, I'm going to see how many cuisines use paneer and how many use tomatoes (random thought lol)
ingredients2 = data['ingredients'].map(";".join)
indices = ingredients2.str.contains('paneer')
data[indices]['cuisine'].value_counts().plot(kind='bar',
title='paneer as found per cuisine')
indices = ingredients2.str.contains('tomatoes')
data[indices]['cuisine'].value_counts().plot(kind='bar',
title='tomatoes as found per cuisine')
Data cleaning
In the data, some ingredients carry words describing attributes of the underlying ingredient, for example "chopped onions" vs "onions" and "canned coconut milk" vs "coconut milk". The words "chopped" and "canned" add no value to learning, so we can remove them from these ingredients.
Another case is "garlic clove" vs "garlic". Here we can't simply remove "clove", since clove is itself the name of an ingredient; instead, we map "garlic clove" to "garlic".
Additionally, I'll convert plurals into singulars (eggs -> egg).
Finally, we collapse runs of spaces into a single space and strip leading and trailing whitespace.
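Before the full implementation, the three steps above can be sketched on a toy input. The helper name `demo_clean` and its tiny word lists are hypothetical, just to illustrate the regex-based approach:

```python
import re

def demo_clean(s):
    s = re.sub(r"\bchopped\b", "", s)      # 1. drop a descriptor word
    s = re.sub(r"\bonions\b", "onion", s)  # 2. map a plural to its singular
    s = re.sub(r" +", " ", s).strip()      # 3. collapse spaces, trim the ends
    return s

print(demo_clean("chopped red onions"))  # -> "red onion"
```

The `\b` word-boundary anchors matter: they ensure whole-word matches only, so removing a short word like "of" cannot corrupt a longer word such as "tofu".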
import re
def clean_ingredients(ingredientlist):
    words_to_remove = ["lowfat", "light", "shredded", "sliced", "all purpose", "all natural", "natural", "original",
                       "gourmet", "traditional", "boneless", "skinless", "fresh", "nonfat", "pitted", "quick cooking",
                       "unbleached", "part skim", "skim", "quickcooking", "oven ready", "homemade", "instant", "small",
                       "extra large", "large", "chopped", "grated", "cooked", "stone ground", "freshly ground",
                       "ground", "pure", "peeled", "deveined", "organic", "cracked", "granulated", "inch thick",
                       "extra firm", "crushed", "flakes", "self rising", "diced", "crumbles", "crumbled",
                       "whole wheat", "whole grain", "baby", "medium", "plain", "of", "thick cut", "cubed", "coarse",
                       "free range", "seasoned", "canned", "multipurpose", "vegan", "thawed", "squeezed",
                       "vegetarian", "fine", "zesty", "halves", "firmly packed", "drain", "drained", "washed", "smoked"]
    map_plural_to_singular = [("steaks", "steak"), ("loins", "loin"), ("inches", "inch"), ("centimeters", "centimeter"),
                              ("ounces", "ounce"), ("liters", "liter"), ("mililiters", "mililiter"), ("grams", "gram"),
                              ("cups", "cup"), ("gallons", "gallon"), ("quarts", "quart"), ("lbs", "lb"),
                              ("pounds", "pound"), ("tablespoons", "tablespoon"), ("teaspoons", "teaspoon"),
                              ("pints", "pint"), ("fluid ounces", "fluid ounce"), ("onions", "onion"),
                              ("cloves", "clove"), ("bulbs", "bulb"), ("peppers", "pepper"), ("breasts", "breast"),
                              ("eggs", "egg"), ("carrots", "carrot"), ("mushrooms", "mushroom"),
                              ("tortillas", "tortilla"), ("sausages", "sausage"), ("wedges", "wedge"),
                              ("tomatoes", "tomato"), ("thighs", "thigh"), ("chilies", "chili"), ("potatoes", "potato"),
                              ("peppercorns", "peppercorn"), ("spices", "spice"), ("chiles", "chile"), ("apples", "apple"),
                              ("legs", "leg"), ("doughs", "dough"), ("drumsticks", "drumstick")]
    phrases_to_map = [
        (("green onion", "red onion", "purple onion", "yellow onion", "yel onion"), "onion"),
        (("collard green leaves", "collards", "collard leaves"), "collard greens"),
        ("black pepper", "pepper"),
        ("yel chives", "chives"),
        ("spinach leaves", "spinach"),
        ("tea leaves", "tea"),
        ("Indian spice", "garam masala"),
        ("catfish fillets", "catfish"),
        ("chile", "chili"),
        (("garlic clove", "garlic bulb"), "garlic"),
        ("uncooked", "raw"),
        ("large eggs", "eggs"),
        (("red chili pepper", "hot chili pepper", "red hot chili pepper"), "chili pepper"),
        (("baking potato", "baked potato"), "baked potato"),
        (("sea salt", "kosher salt", "table salt", "white salt"), "salt"),
        ("scotch whiskey", "scotch"),
        (("i cant believe its not butter spread", "i cant believe its not butter"), "butter"),
        (("extra virgin olive oil", "virgin olive oil"), "olive oil"),
        (("white bread", "wheat bread", "grain bread"), "bread"),
        (("white sugar", "yel sugar"), "sugar"),
        ("confectioners sugar", "powdered sugar")
    ]
    for i in range(len(ingredientlist)):
        for word in words_to_remove:
            ingredientlist[i] = re.sub(r"\b{}\b".format(word), "", ingredientlist[i])
        for plural, singular in map_plural_to_singular:
            ingredientlist[i] = re.sub(r"\b{}\b".format(plural), singular, ingredientlist[i])
        for pattern, replacement in phrases_to_map:
            if isinstance(pattern, tuple):
                for val in pattern:
                    ingredientlist[i] = re.sub(r"\b{}\b".format(val), replacement, ingredientlist[i])
            else:
                ingredientlist[i] = re.sub(r"\b{}\b".format(pattern), replacement, ingredientlist[i])
        # collapse repeated spaces and strip leading/trailing whitespace
        ingredientlist[i] = re.sub(r" +", " ", ingredientlist[i]).strip()
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
ingredients = [' '.join(ingredients).lower() for ingredients in data['ingredients']]
cuisines = list(data['cuisine'])
clean_ingredients(ingredients)
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(ingredients)
X = X.astype('float16')
Y = lbl_enc.fit_transform(cuisines)
x_train, x_test, y_train, y_test = train_test_split(X, Y,
test_size=0.05,
random_state = 8888)
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from matplotlib import pyplot
def cm(y_test, y_pred, cuisines):
    pyplot.figure(figsize=(10, 10))
    cm = confusion_matrix(y_test, y_pred)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    pyplot.imshow(cm_normalized, interpolation='nearest')
    pyplot.title("Confusion matrix")
    pyplot.colorbar(shrink=0.2)
    tick_marks = np.arange(len(cuisines))
    pyplot.xticks(tick_marks, cuisines, rotation=90)
    pyplot.yticks(tick_marks, cuisines)
    pyplot.tight_layout()
    pyplot.ylabel('True label')
    pyplot.xlabel('Predicted label')
    print(classification_report(y_test, y_pred, target_names=cuisines))
Vanilla Neural Network - Case 1
Multi-layer feedforward neural networks extend naturally to multiclass problems. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that applies a nonlinear activation function. An MLP is trained with the supervised backpropagation algorithm. Its multiple layers and nonlinear activations distinguish it from a linear perceptron and let it separate data that is not linearly separable.
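The sklearn cell below handles the actual training; purely for intuition, here is a hypothetical single forward pass through such a network in plain NumPy (one hidden layer, ReLU activation, softmax output; the dimensions and random weights are made up, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 tf-idf features, 4 hidden units, 3 cuisine classes.
x = rng.random(6)                         # one recipe's feature vector
W1, b1 = rng.random((4, 6)), np.zeros(4)  # input -> hidden weights
W2, b2 = rng.random((3, 4)), np.zeros(3)  # hidden -> output weights

h = np.maximum(0, W1 @ x + b1)  # hidden layer: nonlinear (ReLU) activation
z = W2 @ h + b2                 # output layer logits, one per class
p = np.exp(z - z.max())
p = p / p.sum()                 # softmax: a probability per cuisine class

print(p, p.sum())               # probabilities sum to 1
```

Backpropagation would then adjust W1, b1, W2, b2 to push the probability of the true cuisine toward 1; sklearn's MLPClassifier does all of this internally.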
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
vnn = MLPClassifier(solver='lbfgs', alpha=1e-3, hidden_layer_sizes=(128,64), random_state=1)
vnn.fit(x_train, y_train)
predic1 = vnn.predict(x_train)
predic2 = vnn.predict(x_test)
print("Training Accuracy : ", accuracy_score(y_train,predic1))
print("Test Accuracy : ", accuracy_score(y_test, predic2))
#Confusion (Evaluation) Matrix
cm(y_test, predic2, cuisine)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
y_train = y_train.reshape(y_train.size)
purpose = 'run'  # 'optimize' (search hyper-parameters) or 'run' (reproduce the best result)
if purpose == 'optimize':
    lr = LogisticRegression(C=10)
    lr.fit(x_train, y_train)
    # cross-validation
    scores = cross_val_score(LogisticRegression(C=10), x_train, y_train, cv=5)
    print("training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred1 = lr.predict(x_test)
    print("testing accuracy before grid search (without tuned hyper-parameters):", accuracy_score(y_test, y_pred1))
    # grid search
    parameters = {'C': [0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10]}
    lr_clf = GridSearchCV(lr, parameters)
    lr_clf.fit(x_train, y_train)
    # prediction
    y_pred2 = lr_clf.predict(x_test)
    print("testing accuracy after grid search:", accuracy_score(y_test, y_pred2))
    # best parameter in my run: C=4.5 - test accuracy 80.1408%
    print(lr_clf.best_estimator_)
else:
    print("--before optimization--")
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    y_pred1 = lr.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred1))
    print("--after optimization--")
    lr_clf = LogisticRegression(C=4.5)
    lr_clf.fit(x_train, y_train)
    scores = cross_val_score(lr_clf, x_train, y_train, cv=5)
    print("CV training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred2 = lr_clf.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred2))
cm(y_test, y_pred2, cuisine)
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
dt_clf = DecisionTreeClassifier()
dt_clf.fit(x_train, y_train)
y_pred1 = dt_clf.predict(x_train)
y_pred2 = dt_clf.predict(x_test)
print("training accuracy:", accuracy_score(y_train, y_pred1))
print("testing accuracy:", accuracy_score(y_test, y_pred2))
cm(y_test, y_pred2, cuisine)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
y_train = y_train.reshape(y_train.size)
rf_clf = RandomForestClassifier(n_estimators=1000, criterion = 'entropy',
max_depth=20)
rf_clf.fit(x_train, y_train)
y_pred1 = rf_clf.predict(x_train)
y_pred2 = rf_clf.predict(x_test)
print("training accuracy:", accuracy_score(y_train, y_pred1))
print("testing accuracy:", accuracy_score(y_test, y_pred2))
cm(y_test,y_pred2,cuisine)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
y_train = y_train.reshape(y_train.size)
purpose = 'run'  # 'optimize' (search hyper-parameters) or 'run' (reproduce the best result)
if purpose == 'optimize':
    # **** 1. Linear SVM ****
    lsvm = svm.LinearSVC(C=7)
    lsvm.fit(x_train, y_train)
    # cross-validation
    scores = cross_val_score(svm.LinearSVC(C=7), x_train, y_train, cv=5)
    print("training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred1 = lsvm.predict(x_test)
    print("testing accuracy 1 before grid search:", accuracy_score(y_test, y_pred1))
    # grid search
    parameters = {'C': [0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
    clf_svm1 = GridSearchCV(lsvm, parameters)
    clf_svm1.fit(x_train, y_train)
    # predict cuisines
    y_pred2 = clf_svm1.predict(x_test)
    print("testing accuracy 1 after grid search:", accuracy_score(y_test, y_pred2))
    # **** 2. Kernel SVM ****
    # grid search
    param_grid = [{'C': [0.1, 1, 10], 'kernel': ['linear']},
                  {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1], 'kernel': ['rbf']}]
    svm_kern = svm.SVC()
    clf_svm2 = GridSearchCV(svm_kern, param_grid)
    clf_svm2.fit(x_train, y_train)
    # predict cuisines
    y_pred3 = clf_svm2.predict(x_test)
    print("testing accuracy 2 after grid search:", accuracy_score(y_test, y_pred3))
    # **** 3. Kernel SVM, extended parameter grid ****
    param_grid2 = [{'C': [5, 10, 50, 100, 1000], 'gamma': [0.5, 1, 10, 100, 1000], 'kernel': ['rbf']}]
    svm_kern_2 = svm.SVC()
    clf_svm3 = GridSearchCV(svm_kern_2, param_grid2)
    clf_svm3.fit(x_train, y_train)
    # predict cuisines
    y_pred4 = clf_svm3.predict(x_test)
    print("testing accuracy 3 after grid search:", accuracy_score(y_test, y_pred4))
else:
    print("*-before optimization-*")
    lsvm = svm.LinearSVC()
    lsvm.fit(x_train, y_train)
    y_pred1 = lsvm.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred1))
    print("*-after optimization-*")
    # best parameters: C=10, gamma=1, kernel='rbf' - test accuracy 82.1016%
    svm_clf = svm.SVC(C=10, kernel='rbf', gamma=1)
    svm_clf.fit(x_train, y_train)
    scores = cross_val_score(svm_clf, x_train, y_train, cv=5)
    print("CV training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred2 = svm_clf.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred2))
cm(y_test, y_pred2, cuisine)
FIN.