Have you ever wondered whether you have the skills it takes to participate in Kaggle competitions? We are going to break it down here. Kaggle is much more than a platform for predictive-modeling, data-science, or machine-learning competitions. For those of you who aren’t familiar with what Kaggle is all about, let’s dig deep.
What is Kaggle?
Kaggle is a global platform that hosts competitions, datasets, kernels, and discussions about data science and machine learning for developers, engineers, and researchers, and it also serves as a job board. It’s a platform for doing and sharing data science and a great place to learn machine learning from an amazing community. Earlier, Kaggle was mostly used by people already proficient in data science and machine learning, but it has since done a lot to make the platform accessible to people without such a strong background. If you are really good at it, you can build a solid career, and prominent companies may hire you as well. What’s more, you can also win handsome prize money. The amount varies with the difficulty of the problem, and there’s a steep learning curve in the whole process, from downloading a dataset and building a model to submitting your predictions. It is super fun.
How to Use Kaggle?
So, first of all, create an account on Kaggle. This step is required to use Kaggle, and there are two ways to sign up. We recommend signing up with email, since signing up through Facebook or Google+ can cause issues with the Kaggle command-line tool. That tool lets you download datasets and submit predictions from the command line, and Kaggle has recently released its own API for the same purpose. After that, go to the Datasets page. There are thousands of high-quality datasets available on Kaggle. When you click on any of them, you’ll see a page with several tabs: Data, Discussion, Kernels, Activity, New Kernel, and of course Download.
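If you do want to use the command-line tool, the Kaggle API authenticates with a token file at `~/.kaggle/kaggle.json`, which you can generate from your Kaggle account page. It’s a small JSON config like this (the values below are placeholders, not real credentials):

```json
{
  "username": "your_kaggle_username",
  "key": "0123456789abcdef0123456789abcdef"
}
```

With that file in place, commands such as `kaggle datasets list` or `kaggle competitions download -c <competition-name>` work from your terminal.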
Overview and Data
In the Overview section, you’ll see the top contributors. When you scroll down, you’ll see the kernels with the most votes. We will dive into kernels later; for now, just think of them as scripts, short snippets of code. A kernel is a combination of environment, input, code, and output, all stored together. We encourage you to become active in the discussions, because they are a really nice way to collaborate and learn together. You get to learn about new perspectives and pick up new tricks that may help you in carrying out your own task.
Now when you click on a dataset, you get an insight into what it is all about and what the files contain. Generally, the data is divided into training and test sets. Some datasets include a validation set as well, but if one isn’t included, it’s highly recommended to create one before proceeding; very often you will have to do this. The training set is used to build machine learning models, and for the training set the outcome, also known as the ground truth, is provided. Your model will be based on features, and you can also use feature engineering to create new features. The test set is used to see how well your model performs on unseen data, i.e. data your model hasn’t encountered before. For the test set, labels (the ground truth) are not provided; your model needs to predict these outcomes. There’s also a submission.csv file, which you should pay attention to once you’re done modeling, since you will be submitting your predictions in this format. They provide a sample of how it should look, and normally a data dictionary as well. It’s super convenient.
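Carving a validation set out of the training data can be sketched in plain Python; the 20% hold-out fraction and the toy data below are just assumptions for illustration:

```python
import random

def train_validation_split(rows, valid_fraction=0.2, seed=42):
    """Shuffle the rows and hold out a fraction as a validation set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded shuffle, so the split is reproducible
    n_valid = int(len(rows) * valid_fraction)
    return rows[n_valid:], rows[:n_valid]  # (train, validation)

# Toy example: 100 labeled samples as (feature, label) pairs
data = [(i, i % 2) for i in range(100)]
train, valid = train_validation_split(data)
print(len(train), len(valid))  # prints: 80 20
```

The key point is that the validation rows are never shown to the model during training, so the score you compute on them is an honest estimate of performance on unseen data.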
Discussions
Discussions, as you know, are like forums. Stuck on a problem? Use them to get your query answered. For rankings, there’s a leaderboard, and it’s divided into two sections: public and private. The public leaderboard is computed on a portion of the test set; the private leaderboard is computed on the remainder of the test set, not the whole thing. When someone talks about fitting the leaderboard, they mean tuning models to perform well on the public leaderboard; you have to make sure you are not overfitting to it. The private leaderboard remains secret until the end of the competition, and it determines the final competition winners. The purpose of this division is to prevent people from winning by overfitting the public leaderboard; participants are thus motivated to make sure their models generalize well to the private portion of the test set. Make sure you read the competition rules carefully.
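To make the public/private split concrete, here is a toy sketch: the test set is partitioned once, your submission is scored on the public slice during the competition, and on the private slice at the end. The 30/70 ratio and the accuracy metric are assumptions for illustration; real competitions choose their own split and metric:

```python
import random

def split_leaderboard(test_ids, public_fraction=0.3, seed=0):
    """Partition test ids once into a public slice and a private slice."""
    ids = list(test_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * public_fraction)
    return set(ids[:cut]), set(ids[cut:])  # (public, private)

def accuracy(predictions, truth, ids):
    """Score a submission on one slice of the test set."""
    return sum(predictions[i] == truth[i] for i in ids) / len(ids)

truth = {i: i % 2 for i in range(1000)}
# A hypothetical submission that is right on 90% of the rows
preds = {i: (t if i % 10 else 1 - t) for i, t in truth.items()}

public, private = split_leaderboard(truth)
print(round(accuracy(preds, truth, public), 3))   # public score, visible all along
print(round(accuracy(preds, truth, private), 3))  # private score, revealed at the end
```

If you tune only to push the public number up, the two scores can drift apart, which is exactly the overfitting the split is designed to expose.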
Kaggle Competition
Now, let’s come to the Competitions section. There are many types of competitions, like Playground, Research, and more. There are active ones and there are archived ones. If you play with the dataset of an archived competition, you can still submit your predictions, but you won’t be ranked; there is no prize money, and it’s kept for educational purposes. Active competitions award prize money, as we discussed, and the private leaderboard is kept secret so that participants don’t overfit on the public portion of the test set. You can get initial help from kernels that many participants have made available. If you click on a competition that interests you, you’ll see sections similar to those we’ve discussed. We haven’t yet talked about Evaluation: it tells you how your predictions will be scored. Sometimes they use log loss, sometimes mean average precision, sometimes intersection over union. Check it before submitting, just to make sure. They also give you the format they expect for the submitted file; normally you create it as CSV, comma-separated values. You have to be really careful when preparing your submission, since everything has to be in exactly the format they ask for; otherwise, your submission won’t be scored.
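Writing a submission file with Python’s standard csv module might look like the sketch below. The `id`/`label` header, file name, and toy labels are assumptions; every competition specifies its own columns in a sample submission, so always match that exactly:

```python
import csv

def write_submission(predictions, path="submission.csv"):
    """Write predictions as id,label rows in the expected CSV layout."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])  # header row, copied from the sample submission
        for row_id, label in sorted(predictions.items()):
            writer.writerow([row_id, label])

# Hypothetical predictions keyed by test-row id
write_submission({2: "shirt", 1: "bag", 3: "pants"})
```

Before uploading, compare your file against the provided sample submission: same column names, same number of rows, and one prediction for every test id.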
Kaggle Kernels
Now, let’s talk about Kaggle kernels. They are essentially Jupyter notebooks that run in your browser. It’s a free platform for running Jupyter notebooks in the browser, which means you don’t have to worry about the hassle of setting up a local environment, or an environment on a cloud instance either. When you run a kernel, the processing power comes from servers in the cloud, so you can practice and learn machine learning without heating up your laptop. You can either use already-available kernels or create new ones. The dataset is already loaded into the kernel’s environment, so there is no need to upload it to a cloud instance yourself, and you can still add any additional files you may require.
To demonstrate, let’s work with the Fashion-MNIST dataset. It contains 10 categories of clothing and accessories, such as pants, shirts, and bags, so it’s an example of multiclass classification. There are 50,000 training samples and ten thousand evaluation samples. A Kaggle kernel lets you visualize the data in addition to processing it, and it allows you to work in a fully interactive notebook in the browser with little or no setup. We really want to emphasize that we didn’t have to launch any cloud instance or manage any environment configuration, which is really awesome. So just sign up on Kaggle, play with its kernels, and participate in discussions and competitions as well.
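To show the same idea, multiclass classification, in a self-contained way, here is a hedged sketch using a nearest-centroid classifier on tiny synthetic "images". The three classes and 4-pixel feature vectors are stand-ins for Fashion-MNIST’s ten categories and 28×28 pixels, and the whole setup is illustrative, not how you would actually model that dataset:

```python
import random

def fit_centroids(samples, labels):
    """Average the feature vectors of each class into one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        sums.setdefault(y, [0.0] * len(x))
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))

# Synthetic data: class k has "pixel" values clustered around k * 10
rng = random.Random(0)
train_x = [[k * 10 + rng.gauss(0, 1) for _ in range(4)] for k in range(3) for _ in range(20)]
train_y = [k for k in range(3) for _ in range(20)]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, [20.3, 19.8, 20.1, 20.4]))  # prints: 2
```

The classifier picks one of several classes per sample, which is the essence of multiclass problems like Fashion-MNIST; a real kernel would swap in actual images and a stronger model.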