I have started participating in Kaggle now, and I think I am starting to like it. I began by learning regression, and this is the first ML problem I have attempted in the five months since my resignation. I know, I know. The girl issue was on my mind for the past few months and it was hard for me to concentrate. Now I am finally getting over her and using my time productively. Thank God. Phew!!

So you can check the problem here. At the time of writing I am ranked in the 81st percentile, which is a pretty bad number. I am trying to do better. Anyway, I learned a few good things about regression. Apparently whatever I was doing by reading lectures and MOOCs was not very fruitful. Not for me, at least.

My current submission is on GitHub. Here's what I learnt:-

  • Basically, most of the time you want both your dependent and independent variables to be normally distributed. Read this answer on Quora.
  • You want to select independent variables that are highly correlated with the dependent variable but have low correlation among each other; otherwise a situation of multicollinearity arises. That is, when the correlation between two independent variables is high, they convey the same information.
  • Homoscedasticity - It was an interesting read. It says that your residuals should be evenly spread and should not vary with the predictor variables.
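On the normality point, a skewed target (like house prices) can often be pulled toward normal with a log transform. A minimal sketch with synthetic, price-like numbers (the data here is made up for illustration):

```python
import numpy as np
from scipy.stats import skew

# A right-skewed target, e.g. house prices (synthetic data for illustration)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.4, size=1000)

# np.log1p handles zeros safely; the transformed values are far less skewed
log_prices = np.log1p(prices)

print(f"skew before: {skew(prices):.2f}, after: {skew(log_prices):.2f}")
```

`np.expm1` inverts the transform when you need predictions back on the original scale.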
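On the multicollinearity point, a quick way to spot it is a correlation matrix: check each feature against the target, and the features against each other. A sketch with pandas on synthetic data (the feature names `area`, `rooms`, `age` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
# Two nearly redundant features plus one independent feature (synthetic)
area = rng.normal(1500, 300, n)
rooms = area / 300 + rng.normal(0, 0.3, n)   # strongly tied to area
age = rng.normal(30, 10, n)
price = 100 * area + 5000 * rng.normal(0, 1, n)

df = pd.DataFrame({"area": area, "rooms": rooms, "age": age, "price": price})

# Correlation of each feature with the target...
print(df.corr()["price"].drop("price"))
# ...and among the features themselves: area vs rooms is the red flag here
print(df.drop(columns="price").corr())
```

If two features correlate this strongly, dropping one usually loses little information.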
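On the homoscedasticity point, a rough check is to fit a model and compare the spread of the residuals across the range of the predictor (the usual way is a residuals-vs-fitted plot; this sketch just compares two halves numerically, on synthetic data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(400, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 400)   # constant-variance noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Crude check: residual spread in the lower vs upper half of the predictor
low = residuals[X.ravel() < 5].std()
high = residuals[X.ravel() >= 5].std()
print(f"residual std, low half: {low:.2f}, high half: {high:.2f}")
```

If one half had a much wider spread than the other, that would suggest heteroscedasticity.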

What I have to try:-

  • Learn Cross Validation. At this point I am using RMSE as a metric.
  • Remove Outliers and check if they improve my model
  • Encode Categorical Variables - Haven't used any of them yet.
  • Use Lasso And Ridge Regression.
  • XGBoost - Read about this on Kaggle only. Not sure what it is.
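For the cross-validation item, scikit-learn's `cross_val_score` can report RMSE per fold directly, which gives a more honest number than RMSE on the training data. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

# 5-fold CV; sklearn negates error metrics so that larger is always better
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
rmse_per_fold = -scores
print(f"mean CV RMSE: {rmse_per_fold.mean():.2f}")
```

The mean across folds estimates how the model would do on unseen data.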
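For encoding categorical variables, one-hot encoding with `pd.get_dummies` is the usual first thing to try. A toy sketch (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"neighborhood": ["A", "B", "A", "C"],
                   "area": [1500, 900, 1200, 2000]})

# One-hot encode the categorical column; numeric columns pass through as-is
encoded = pd.get_dummies(df, columns=["neighborhood"])
print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so linear models can use it directly.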
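For Lasso and Ridge, both are just linear regression with a penalty on coefficient size; the practical difference is that Lasso can drive useless coefficients exactly to zero, while Ridge only shrinks them. A sketch on synthetic data where most features are pure noise:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks all coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # can zero out useless ones entirely

print("features zeroed by lasso:", int((lasso.coef_ == 0).sum()))
```

That zeroing is why Lasso doubles as a feature-selection tool; `alpha` controls how aggressive the penalty is.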