What’s cooking? - Kaggle Competition

Note: Documentation and GitHub repository

In this project I aim to predict the category of a dish's cuisine given a list of its ingredients.

From the official description:

“If you're in Northern California, you'll be walking past the inevitable bushels of leafy greens, spiked with dark purple kale and the bright pinks and yellows of chard. Across the world in South Korea, mounds of bright red kimchi greet you, while the smell of the sea draws your attention to squids squirming nearby. India’s market is perhaps the most colorful, awash in the rich hues and aromas of dozens of spices: turmeric, star anise, poppy seeds, and garam masala as far as the eye can see. Some of our strongest geographic and cultural associations are tied to a region's local foods.”

The public dataset is from the Kaggle competition What’s Cooking?
The data is provided in JSON format. Each example in the dataset contains the recipe ID, the type of cuisine and a list of ingredients. There are about 11 ingredients per recipe on average. The data consists of 39,774 unique recipes from 20 cuisines with 428,275 ingredient mentions (6,714 unique ingredients).
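
As a rough illustration of how the summary figures above can be derived, here is a minimal sketch on three made-up records shaped like the entries in train.json. The helper name `summarize` and the toy values are assumptions for illustration only, not part of the competition data.

```python
# Toy records shaped like train.json entries (values invented for illustration).
recipes = [
    {"id": 1, "cuisine": "indian", "ingredients": ["water", "vegetable oil", "wheat", "salt"]},
    {"id": 2, "cuisine": "italian", "ingredients": ["sugar", "salt", "flour"]},
    {"id": 3, "cuisine": "indian", "ingredients": ["salt", "paneer", "peas"]},
]

def summarize(records):
    """Return (recipe count, ingredient mentions, unique ingredients, mean per recipe)."""
    all_ingredients = [item for rec in records for item in rec["ingredients"]]
    n = len(records)
    return n, len(all_ingredients), len(set(all_ingredients)), len(all_ingredients) / n

n_recipes, total, unique, mean_per_recipe = summarize(recipes)
print(n_recipes, total, unique, round(mean_per_recipe, 2))
```

Applied to the figures quoted above, the same ratio is 428,275 / 39,774, or roughly 10.8 ingredients per recipe.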

Importing the required packages.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Exploratory Data Analysis

data = pd.read_json('data/train.json', orient='records')

Number of recipes in the dataset.

print("Number of recipes:", len(data))
Number of recipes: 39774

Let's see how the data is laid out in the file. For this I'll look at the first 10 rows.

data[:10]
cuisine id ingredients
0 greek 10259 [romaine lettuce, black olives, grape tomatoes...
1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...
2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...
3 indian 22213 [water, vegetable oil, wheat, salt]
4 indian 13162 [black pepper, shallots, cornflour, cayenne pe...
5 jamaican 6602 [plain flour, sugar, butter, eggs, fresh ginge...
6 spanish 42779 [olive oil, salt, medium shrimp, pepper, garli...
7 italian 3735 [sugar, pistachio nuts, white almond bark, flo...
8 mexican 16903 [olive oil, purple onion, fresh pineapple, por...
9 italian 12734 [chopped tomatoes, fresh basil, garlic, extra-...

Let's have a look at Matar Paneer (My favourite dish)

matar_paneer = (data['id'] == 29172)
data[matar_paneer][['ingredients']]
ingredients
1517 [red chili powder, coriander powder, salt, oil...

Cuisine Analysis

cuisine = data['cuisine'].value_counts().index 
data['cuisine'].value_counts()
#Kagglers love Italian food  i guess...lol
italian         7838
mexican         6438
southern_us     4320
indian          3003
chinese         2673
french          2646
cajun_creole    1546
thai            1539
japanese        1423
greek           1175
spanish          989
korean           830
vietnamese       825
moroccan         821
british          804
filipino         755
irish            667
jamaican         526
russian          489
brazilian        467
Name: cuisine, dtype: int64
bargraph = data['cuisine'].value_counts().plot(kind='bar',title="Cuisine Distribution by Countries")

Ingredient Analysis

Let's collect all the ingredients in a list and print a few examples.

ingredientlist = []
for ingre in data['ingredients']:
    ingredientlist.extend(ingre)
print("Total Ingredients used : ", len(ingredientlist))
print("Unique Ingredients : ", len(set(ingredientlist)),'\n')

for i in range(20):
    print(ingredientlist[i])
Total Ingredients used :  428275
Unique Ingredients :  6714 

romaine lettuce
black olives
grape tomatoes
garlic
pepper
purple onion
seasoning
garbanzo beans
feta cheese crumbles
plain flour
ground pepper
salt
tomatoes
ground black pepper
thyme
eggs
green tomatoes
yellow corn meal
milk
vegetable oil

Most used ingredients :

ingredients_series = pd.Series(ingredientlist)
ax1 = ingredients_series.value_counts().head(15).plot(kind='bar', \
                                               title='15 Most Used Ingredients')

ax1.set_ylabel(" Number of Recipes")
Text(0, 0.5, ' Number of Recipes')

Now, I'm going to see how many cuisines use paneer and how many use tomatoes (random thought lol)

ingredients2 = data['ingredients'].map(";".join)
indices = ingredients2.str.contains('paneer')
data[indices]['cuisine'].value_counts().plot(kind='bar',
                                   title='paneer as found per cuisine')
<matplotlib.axes._subplots.AxesSubplot at 0x1b422bed358>
# Reuse the joined ingredient strings (ingredients2) built for the paneer query
indices = ingredients2.str.contains('tomatoes')
data[indices]['cuisine'].value_counts().plot(kind='bar',
                                   title='tomatoes as found per cuisine')
<matplotlib.axes._subplots.AxesSubplot at 0x1b422c6db38>

Data cleaning

In the data, some of the ingredients have words describing some attributes of the underlying ingredient.

For example, "chopped onions" vs "onions" and "canned coconut milk" vs "coconut milk". The words "chopped" and "canned" don't add any value to learning, so we can remove them from these ingredients.

Another case is "garlic clove" and "garlic". Here we can't just remove "clove", since it is also the name of an ingredient. We deal with this case by mapping "garlic clove" to "garlic".

Additionally, I'll convert plurals into singulars (eggs -> egg).

Finally, we replace runs of more than one space with a single space and strip spaces at the start and end.

import re
def clean_ingredients(ingredientlist):
  
  words_to_remove = ["lowfat", "light", "shredded", "sliced", "all purpose", "all natural", "natural", "original", 
                      "gourmet", "traditional", "boneless", "skinless", "fresh", "nonfat", "pitted", "quick cooking", 
                      "unbleached", "part skim", "skim", "quickcooking", "oven ready", "homemade", "instant", "small", 
                      "extra large", "large", "chopped", "grated", "cooked", "stone ground", "freshly ground", 
                      "ground", "pure", "peeled", "deveined", "organic", "cracked", "granulated", "inch thick", 
                      "extra firm", "crushed", "flakes", "self rising", "diced", "crumbles", "crumbled", 
                      "whole wheat", "whole grain", "baby", "medium", "plain", "of", "thick cut", "cubed", "coarse", 
                      "free range", "seasoned", "canned", "multipurpose", "vegan", "thawed", "squeezed", 
                      "vegetarian", "fine", "zesty", "halves", "firmly packed", "drain", "drained","canned", "washed","smoked"]
  
  map_plural_to_singular = [("steaks", "steak"), ("loins", "loin"), ("inches", "inch"), ("centimeters", "centimeter"),
                          ("ounces", "ounce"), ("liters", "liter"), ("mililiters", "mililiter"), ("grams", "gram"),
                          ("cups", "cup"), ("gallons", "gallon"), ("quarts", "quart"), ("lbs", "lb"),
                          ("pounds", "pound"), ("tablespoons", "tablespoon"), ("teaspoons", "teaspoon"), 
                          ("pints", "pint"), ("fluid ounces", "fluid ounce"), ("onions", "onion"), 
                          ("cloves", "clove"), ("bulbs", "bulb"), ("peppers", "pepper"), ("breasts", "breast"),
                          ("eggs", "egg"), ("carrots", "carrot"), ("mushrooms", "mushroom"),
                          ("tortillas", "tortilla"), ("sausages", "sausage"), ("wedges", "wedge"), 
                          ("tomatoes", "tomato"), ("thighs", "thigh"), ("chilies", "chili"), ("potatoes", "potato"), 
                          ("peppercorns", "peppercorn"), ("spices", "spice"), ("chiles", "chile"), ("apples", "apple"),
                          ("legs", "leg"), ("doughs", "dough"), ("drumsticks", "drumstick")]
  
  phrases_to_map = [
    (("green onion", "red onion", "purple onion", "yellow onion", "yel onion"), "onion"),
    (("collard green leaves", "collards", "collard leaves"), "collard greens"),
    ("black pepper", "pepper"),
    ("yel chives", "chives"),
    ("spinach leaves", "spinach"),
    ("tea leaves", "tea"),
    ('Indian spice', 'garam masala'),
    ('catfish fillets','catfish'),
    ("chile", "chili"),
    (("garlic clove", "garlic bulb"), "garlic"),
    ("uncooked", "raw"),
    ('large eggs', 'eggs'),
    (("red chili pepper", "hot chili pepper", "red hot chili pepper"), "chili pepper"),
    (("baking potato", "baked potato"), "baked potato"),
    (("sea salt", "kosher salt", "table salt", "white salt"), "salt"),
    ("scotch whiskey", "scotch"),
    (("i cant believe its not butter spread", "i cant believe its not butter"), "butter"),
    (("extra virgin olive oil", "virgin olive oil"), "olive oil"),
    (("white bread", "wheat bread", "grain bread"), "bread"),
    (("white sugar", "yel sugar"), "sugar"),
    ("confectioners sugar", "powdered sugar")
    ]
  
  for i in range(len(ingredientlist)):
    for word in words_to_remove:
        ingredientlist[i] = re.sub(r"\b{}\b".format(word), "", ingredientlist[i])
    for plural, singular in map_plural_to_singular:
        ingredientlist[i] = re.sub(r"\b{}\b".format(plural), singular, ingredientlist[i])
    for pattern, replacement in phrases_to_map:
        if type(pattern) is tuple:
            for val in pattern:
                ingredientlist[i] = re.sub(r"\b{}\b".format(val), replacement, ingredientlist[i])
        elif type(pattern) is str:
            ingredientlist[i] = re.sub(r"\b{}\b".format(pattern), replacement, ingredientlist[i])
    ingredientlist[i] = re.compile(r" +").sub(" ", ingredientlist[i])
    ingredientlist[i] = ingredientlist[i].strip()
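
To make the three cleaning steps easier to follow, here is a small self-contained sketch of the same idea on a handful of strings. The helper `clean_one` and its tiny word lists are hypothetical stand-ins for the full lists above. Unlike `clean_ingredients`, it returns a new string instead of editing a list in place, and it escapes each pattern with `re.escape` as a precaution.

```python
import re

def clean_one(ingredient, words_to_remove, plural_to_singular, phrase_map):
    """Apply word removal, plural mapping and phrase mapping to one string."""
    text = ingredient.lower()
    # 1. Drop descriptive words; \b keeps "canned" from matching inside longer words.
    for word in words_to_remove:
        text = re.sub(r"\b{}\b".format(re.escape(word)), "", text)
    # 2. Map plurals to singulars.
    for plural, singular in plural_to_singular:
        text = re.sub(r"\b{}\b".format(re.escape(plural)), singular, text)
    # 3. Map whole phrases, e.g. "garlic clove" -> "garlic".
    for phrase, replacement in phrase_map:
        text = re.sub(r"\b{}\b".format(re.escape(phrase)), replacement, text)
    # 4. Collapse repeated spaces and trim the ends.
    return re.sub(r" +", " ", text).strip()

words = ["chopped", "canned"]
plurals = [("onions", "onion"), ("cloves", "clove"), ("eggs", "egg")]
phrases = [("garlic clove", "garlic")]

for raw in ["chopped onions", "canned coconut milk", "garlic cloves", "cloves", "eggs"]:
    print(repr(raw), "->", repr(clean_one(raw, words, plurals, phrases)))
```

Note how the order of the steps matters: "garlic cloves" is first singularized to "garlic clove" and only then mapped to "garlic", while a bare "cloves" survives as "clove".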

Design Matrix and splitting the data

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

ingredients = [' '.join(ingredients).lower() for ingredients in data['ingredients']]
cuisines = [cusine for cusine in data['cuisine']]
clean_ingredients(ingredients)

tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()

X = tfidf_enc.fit_transform(ingredients)
X = X.astype('float16')

Y = lbl_enc.fit_transform(cuisines)

x_train, x_test, y_train, y_test = train_test_split(X, Y,
                                                    test_size=0.05,
                                                    random_state = 8888)

print("x_train", x_train.shape)
print("y_train", y_train.shape)
print("x_test", x_test.shape)
print("y_test", y_test.shape)
x_train (37785, 2970)
y_train (37785,)
x_test (1989, 2970)
y_test (1989,)
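
One detail worth keeping in mind when reading per-class results: LabelEncoder assigns integer codes to the class names in sorted order (exposed as `lbl_enc.classes_`), which is not the frequency order returned by `value_counts()`. The sketch below emulates both orderings in plain Python on a made-up label list. It is an illustration under that assumption, not scikit-learn's own code.

```python
from collections import Counter

# Made-up labels, for illustration only.
labels = ["italian", "mexican", "italian", "indian", "italian", "mexican", "greek"]

# Order by frequency, similar in spirit to value_counts().index.
by_frequency = [name for name, _ in Counter(labels).most_common()]

# Sorted order, which is how the encoder numbers the classes.
alphabetical = sorted(set(labels))
codes = {name: i for i, name in enumerate(alphabetical)}

print("by frequency :", by_frequency)
print("alphabetical :", alphabetical)
print("codes        :", codes)
```

Any list of names passed alongside the encoded labels (for tick labels or report rows) therefore needs to follow the sorted order.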

Evaluation Matrix

This helper will be used to interpret the accuracy of each model in the cases ahead.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from matplotlib import pyplot

def cm(y_test, y_pred, cuisines):         
    pyplot.figure(figsize=(10, 10))
    # Use a distinct local name so it does not shadow the function name cm
    matrix = confusion_matrix(y_test, y_pred)
    cm_normalized = matrix.astype('float') / matrix.sum(axis=1)[:, np.newaxis]
    
    pyplot.imshow(cm_normalized, interpolation='nearest')
    pyplot.title("confusion matrix")
    pyplot.colorbar(shrink=0.2)
    tick_marks = np.arange(len(cuisines))
    pyplot.xticks(tick_marks, cuisines, rotation=90)
    pyplot.yticks(tick_marks, cuisines)
    pyplot.tight_layout()
    pyplot.ylabel('True label')
    pyplot.xlabel('Predicted label')
    
    print(classification_report(y_test, y_pred, target_names = cuisines))
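
The row normalization used in the helper above divides each row of the confusion matrix by the number of true examples of that class, so the diagonal holds the per-class recall. Here is a minimal pure-Python sketch on made-up labels; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Made-up labels for three classes.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 2, 0, 2, 2]

counts = confusion_counts(y_true, y_pred, 3)
rates = normalize_rows(counts)
for row in rates:
    print(row)
```

Normalizing by row keeps rare cuisines visible in the plot, since every row is scaled to the same range regardless of how many test recipes it has.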

Vanilla Neural Network - Case 1

Multi-layer Feedforward Neural Networks provide a natural extension to the multiclass problem. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data that is not linearly separable.
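
To illustrate the last point, here is a tiny forward pass through a network with one hidden layer of two ReLU units that computes XOR, a classic function that no single linear boundary can separate. The weights are set by hand for illustration (an assumption of this sketch), not learned by backpropagation.

```python
def relu(z):
    """Nonlinear activation: pass positive values, clamp negatives to zero."""
    return max(0.0, z)

def tiny_mlp(x1, x2):
    """Input layer (x1, x2) -> hidden layer (h1, h2) -> single output."""
    h1 = relu(x1 + x2)        # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2      # output layer combines the hidden units linearly

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", tiny_mlp(a, b))
```

Without the ReLU in the hidden layer the whole network would collapse into a single linear function, which is why the nonlinear activation matters.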

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

vnn = MLPClassifier(solver='lbfgs', alpha=1e-3, hidden_layer_sizes=(128,64), random_state=1)
vnn.fit(x_train, y_train)
predic1 = vnn.predict(x_train)
predic2 = vnn.predict(x_test)
print("Training Accuracy : ", accuracy_score(y_train,predic1))
print("Test Accuracy : ", accuracy_score(y_test, predic2))
Training Accuracy :  0.8503903665475718
Test Accuracy :  0.7943690296631473
#Confusion (Evaluation) Matrix
# LabelEncoder numbers the classes alphabetically, so label the rows with lbl_enc.classes_
cm(y_test, predic2, lbl_enc.classes_)
              precision    recall  f1-score   support

   brazilian       0.67      0.67      0.67        18
     british       0.67      0.41      0.51        39
cajun_creole       0.77      0.62      0.69        74
     chinese       0.82      0.86      0.84       145
    filipino       0.68      0.68      0.68        31
      french       0.59      0.66      0.62       134
       greek       0.92      0.73      0.81        63
      indian       0.86      0.92      0.89       144
       irish       0.81      0.58      0.68        38
     italian       0.84      0.88      0.86       404
    jamaican       0.77      0.71      0.74        14
    japanese       0.71      0.72      0.72        68
      korean       0.79      0.75      0.77        40
     mexican       0.94      0.91      0.92       345
    moroccan       0.87      0.76      0.81        45
     russian       0.41      0.41      0.41        22
 southern_us       0.73      0.81      0.77       208
     spanish       0.63      0.61      0.62        51
        thai       0.77      0.69      0.73        70
  vietnamese       0.55      0.64      0.59        36

   micro avg       0.79      0.79      0.79      1989
   macro avg       0.74      0.70      0.72      1989
weighted avg       0.80      0.79      0.79      1989

Logistic Regression - Case 2

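Logistic regression is a binary classification algorithm built on the sigmoid function; scikit-learn's LogisticRegression extends it to the 20-class problem here.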
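
As a minimal sketch of the idea, the snippet below shows the sigmoid squashing a linear score into a probability, and one common way to extend a binary model to many classes: train one scorer per class (one-vs-rest) and pick the class with the highest probability. The weights, feature meanings and the helper `predict_ovr` are invented for illustration and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_ovr(features, weights_by_class, bias_by_class):
    """Score each class with its own binary model and return the most probable one."""
    probs = {}
    for label, w in weights_by_class.items():
        z = sum(wi * xi for wi, xi in zip(w, features)) + bias_by_class[label]
        probs[label] = sigmoid(z)
    return max(probs, key=probs.get), probs

# Hypothetical two-feature recipe: [contains paneer, contains soy sauce]
features = [1.0, 0.0]
weights = {"indian": [2.0, -1.0], "chinese": [-1.0, 2.0], "italian": [-0.5, -0.5]}
biases = {"indian": 0.0, "chinese": 0.0, "italian": 0.0}

label, probs = predict_ovr(features, weights, biases)
print(label, {k: round(v, 2) for k, v in probs.items()})
```

In the real model the feature vector is the TF-IDF row for a recipe, and the parameter C used below controls how strongly the weights are regularized (smaller C means stronger regularization).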

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

y_train = y_train.reshape(y_train.size)
purpose = 'run'    #optimize or run(get best results)

if purpose == 'optimize':
    lr = LogisticRegression(C=10)
    lr.fit(x_train, y_train)
  
    #cross-validation  
    scores = cross_val_score(LogisticRegression(C=10), x_train ,y_train, cv=5)
    print("training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred1 = lr.predict(x_test)
    print("testing accuracy before Grid Search:", accuracy_score(y_test, y_pred1))

    #grid search
    parameters = {'C':[0.1,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9,9.5,10]}
    lr_clf = GridSearchCV(lr, parameters)
    lr_clf.fit(x_train, y_train)

    # prediction
    y_pred2 = lr_clf.predict(x_test)
    print("testing accuracy after Grid Search:", accuracy_score(y_test, y_pred2))
    # best parameter in my run :- C=4.5 - test accuracy 80.1408%
    print(lr_clf.best_estimator_)
else:
    print("--before optimization--")
    lr = LogisticRegression()
    lr.fit(x_train, y_train)
    y_pred1 = lr.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred1))
    
    print("--after optimization--")
    lr_clf = LogisticRegression(C=4.5)
    lr_clf.fit(x_train, y_train)
    scores = cross_val_score(lr_clf, x_train ,y_train, cv=5)
    print("CV training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred2 = lr_clf.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred2))
    
--before optimization--
testing accuracy: 0.7943690296631473
--after optimization--
CV training accuracy: 0.7882 (+/- 0.0108)
testing accuracy: 0.8009049773755657
# LabelEncoder numbers the classes alphabetically, so label the rows with lbl_enc.classes_
cm(y_test, y_pred2, lbl_enc.classes_)
              precision    recall  f1-score   support

   brazilian       1.00      0.61      0.76        18
     british       0.60      0.38      0.47        39
cajun_creole       0.77      0.66      0.71        74
     chinese       0.84      0.88      0.86       145
    filipino       0.68      0.61      0.64        31
      french       0.63      0.65      0.64       134
       greek       0.86      0.67      0.75        63
      indian       0.86      0.93      0.90       144
       irish       0.79      0.50      0.61        38
     italian       0.81      0.91      0.86       404
    jamaican       0.77      0.71      0.74        14
    japanese       0.79      0.68      0.73        68
      korean       0.86      0.78      0.82        40
     mexican       0.91      0.92      0.92       345
    moroccan       0.94      0.73      0.83        45
     russian       0.64      0.32      0.42        22
 southern_us       0.70      0.84      0.76       208
     spanish       0.76      0.69      0.72        51
        thai       0.78      0.77      0.78        70
  vietnamese       0.59      0.44      0.51        36

   micro avg       0.80      0.80      0.80      1989
   macro avg       0.78      0.68      0.72      1989
weighted avg       0.80      0.80      0.80      1989

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score

# Named dt_clf because this is a single decision tree, not a random forest
dt_clf = DecisionTreeClassifier()
dt_clf.fit(x_train, y_train)
y_pred1 = dt_clf.predict(x_train)
y_pred2 = dt_clf.predict(x_test)

print("training accuracy:", accuracy_score(y_train, y_pred1))
print("testing accuracy:", accuracy_score(y_test, y_pred2))
# LabelEncoder numbers the classes alphabetically, so label the rows with lbl_enc.classes_
cm(y_test, y_pred2, lbl_enc.classes_)
training accuracy: 0.9993648273124256
testing accuracy: 0.6535947712418301
              precision    recall  f1-score   support

   brazilian       0.39      0.39      0.39        18
     british       0.30      0.33      0.32        39
cajun_creole       0.68      0.51      0.58        74
     chinese       0.72      0.69      0.71       145
    filipino       0.41      0.45      0.43        31
      french       0.43      0.46      0.45       134
       greek       0.62      0.67      0.64        63
      indian       0.82      0.77      0.79       144
       irish       0.39      0.32      0.35        38
     italian       0.74      0.74      0.74       404
    jamaican       0.37      0.50      0.42        14
    japanese       0.57      0.54      0.56        68
      korean       0.58      0.70      0.64        40
     mexican       0.82      0.81      0.82       345
    moroccan       0.56      0.53      0.55        45
     russian       0.21      0.27      0.24        22
 southern_us       0.62      0.63      0.62       208
     spanish       0.45      0.41      0.43        51
        thai       0.64      0.67      0.66        70
  vietnamese       0.54      0.56      0.55        36

   micro avg       0.65      0.65      0.65      1989
   macro avg       0.54      0.55      0.54      1989
weighted avg       0.66      0.65      0.65      1989

Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

y_train = y_train.reshape(y_train.size)

rf_clf = RandomForestClassifier(n_estimators=1000, criterion = 'entropy',
                                max_depth=20)
rf_clf.fit(x_train, y_train)

y_pred1 = rf_clf.predict(x_train)
y_pred2 = rf_clf.predict(x_test) 

print("training accuracy:", accuracy_score(y_train, y_pred1))
print("testing accuracy:", accuracy_score(y_test, y_pred2))
# LabelEncoder numbers the classes alphabetically, so label the rows with lbl_enc.classes_
cm(y_test, y_pred2, lbl_enc.classes_)
training accuracy: 0.7734286092364695
testing accuracy: 0.6601307189542484
              precision    recall  f1-score   support

   brazilian       0.00      0.00      0.00        18
     british       0.00      0.00      0.00        39
cajun_creole       0.92      0.46      0.61        74
     chinese       0.71      0.93      0.80       145
    filipino       1.00      0.26      0.41        31
      french       0.70      0.17      0.28       134
       greek       1.00      0.25      0.41        63
      indian       0.84      0.89      0.86       144
       irish       0.00      0.00      0.00        38
     italian       0.52      0.93      0.67       404
    jamaican       1.00      0.43      0.60        14
    japanese       0.89      0.47      0.62        68
      korean       1.00      0.42      0.60        40
     mexican       0.83      0.89      0.86       345
    moroccan       0.94      0.38      0.54        45
     russian       0.00      0.00      0.00        22
 southern_us       0.48      0.69      0.57       208
     spanish       1.00      0.02      0.04        51
        thai       0.82      0.77      0.79        70
  vietnamese       0.89      0.47      0.62        36

   micro avg       0.66      0.66      0.66      1989
   macro avg       0.68      0.42      0.46      1989
weighted avg       0.69      0.66      0.61      1989

C:\Users\shekh\AppData\Roaming\Python\Python37\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Support Vector Machine

A support vector machine maximizes the minimum distance from the separating hyperplane to the nearest training example (the margin). I try two setups for this multi-class problem: LinearSVC with a one-vs-rest (OvR) scheme, and SVC with a one-vs-one scheme.
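
To make "minimum distance to the nearest example" concrete, the sketch below computes the margin of two candidate hyperplanes w·x + b = 0 on four made-up 2-D points. The function names and data are hypothetical. An SVM would search over all hyperplanes for the one with the largest margin; this snippet only compares two fixed candidates.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest distance to any point; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Made-up, linearly separable data with labels +1 / -1.
points = [(2, 2), (3, 3), (0, 0), (-1, 0)]
labels = [1, 1, -1, -1]

margin_a = margin((1, 1), -2, points, labels)  # hyperplane x1 + x2 = 2
margin_b = margin((1, 0), -1, points, labels)  # hyperplane x1 = 1
print(round(margin_a, 3), round(margin_b, 3))
```

Both hyperplanes classify the four points correctly, but the first leaves more room around the closest points, which is the property the SVM objective rewards.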

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

y_train = y_train.reshape(y_train.size)
purpose = 'run'    #optimize or run(get best results)

if purpose == 'optimize':
    #****1.Linear SVM#****
    lsvm = svm.LinearSVC(C=7)
    lsvm.fit(x_train, y_train)

    # cv  
    scores = cross_val_score(svm.LinearSVC(C=7), x_train ,y_train, cv=5)
    print("training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred1 = lsvm.predict(x_test)
    print("testing accuracy 1 before Grid Search:", accuracy_score(y_test, y_pred1))

    # grid search
    parameters = {'C':[0.1,0.5,1,2,3,4,5,6,7,8,9,10]}
    clf_svm1 = GridSearchCV(lsvm, parameters)
    clf_svm1.fit(x_train, y_train)

    # predict cuisines
    y_pred2 = clf_svm1.predict(x_test)
    print("testing accuracy 1 after Grid Search:", accuracy_score(y_test, y_pred2))
    #clf.get_params()

    #****2.SVM#****
    # grid search
    param_grid = [{'C': [0.1, 1, 10], 'kernel': ['linear']},{'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1, 1], 'kernel': ['rbf']},]
    svm_kern = svm.SVC()
    clf_svm2 = GridSearchCV(svm_kern, param_grid)
    clf_svm2.fit(x_train, y_train)
    svm_kern.fit(x_train, y_train)

    # predict cuisines
    y_pred3= clf_svm2.predict(x_test)
    print("testing accuracy 2 after Grid Search:", accuracy_score(y_test, y_pred3))
    #clf2.get_params()

    #****3.SVM extension#****
    # extend parameter grid
    param_grid2 = [{'C': [5, 10, 50, 100, 1000], 'gamma': [0.5, 1, 10, 100, 1000], 'kernel': ['rbf']}]
    svm_kern_2 = svm.SVC()
    clf_svm3 = GridSearchCV(svm_kern_2, param_grid2)
    clf_svm3.fit(x_train, y_train)

    # predict cuisines
    y_pred4= clf_svm3.predict(x_test)
    print("testing accuracy 3 after Grid Search:", accuracy_score(y_test, y_pred4))
    #clf2.get_params()
else:
    print("*-before optimization-*")
    lsvm = svm.LinearSVC()
    lsvm.fit(x_train, y_train)
    y_pred1 = lsvm.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred1))
    
    print("*-after optimization-*")
    #best parameter C=10, gamma=1, kernel='rbf' -  - test accuracy 82.1016%
    svm_clf = svm.SVC(C = 10, kernel = 'rbf', gamma = 1)
    svm_clf.fit(x_train, y_train)
    scores = cross_val_score(svm_clf, x_train ,y_train, cv=5)
    print("CV training accuracy: %0.4f (+/- %0.4f)" % (scores.mean(), scores.std() * 2))
    y_pred2 = svm_clf.predict(x_test)
    print("testing accuracy:", accuracy_score(y_test, y_pred2))
    # LabelEncoder numbers the classes alphabetically, so label the rows with lbl_enc.classes_
    cm(y_test, y_pred2, lbl_enc.classes_)
*-before optimization-*
testing accuracy: 0.7998994469582705
*-after optimization-*
CV training accuracy: 0.8044 (+/- 0.0085)
testing accuracy: 0.8205128205128205
              precision    recall  f1-score   support

   brazilian       0.92      0.61      0.73        18
     british       0.65      0.56      0.60        39
cajun_creole       0.81      0.70      0.75        74
     chinese       0.85      0.85      0.85       145
    filipino       0.81      0.71      0.76        31
      french       0.66      0.74      0.70       134
       greek       0.83      0.78      0.80        63
      indian       0.88      0.94      0.91       144
       irish       0.88      0.55      0.68        38
     italian       0.83      0.91      0.87       404
    jamaican       0.71      0.71      0.71        14
    japanese       0.80      0.72      0.76        68
      korean       0.86      0.80      0.83        40
     mexican       0.93      0.91      0.92       345
    moroccan       0.89      0.73      0.80        45
     russian       0.83      0.45      0.59        22
 southern_us       0.73      0.84      0.78       208
     spanish       0.82      0.63      0.71        51
        thai       0.78      0.80      0.79        70
  vietnamese       0.68      0.64      0.66        36

   micro avg       0.82      0.82      0.82      1989
   macro avg       0.81      0.73      0.76      1989
weighted avg       0.82      0.82      0.82      1989

FIN.