I Made a Web Linear Regression Visualizer

Plus, a brief intro to linear regression & machine learning concepts

Joshua Saji
11 min read · Aug 14, 2024
Photo by Bozhin Karaivanov on Unsplash

Preface

I don’t know about you, but sometimes when I try to learn something, especially if it’s a relatively difficult concept or idea, I find myself wrestling with my mind trying to understand what I’m learning. This holds especially true whenever I try to learn how a new machine learning model works mathematically & conceptually. Just a few months ago, I had trouble figuring out how linear regression, the simplest machine learning algorithm, works. What confused me the most wasn’t necessarily the mathematics in play, though. My biggest issue? Understanding how each building block of the model fits together with the others. I figured that the most feasible way to understand how this model worked was to create an interactive visual representation of it.

Thus, I was inspired to create a web visualization depicting this algorithm as best as possible. For this project, I used HTML & CSS to design the interface, and vanilla JavaScript to implement the algorithm from scratch. The project took between 1 and 2 weeks to fully finish & deploy on GitHub Pages. Please click here to access the project if you are interested in seeing it in action. I will post screenshots & explain each component of the project in this article.

Please note that this article is not meant to outline the entire mathematics & inner workings of the linear regression algorithm. However, I will do my best to explain how the algorithm works in conjunction with explaining how this project works & depicts the algorithm.

With that out of the way, I will now briefly explain how this algorithm works.

To understand how this algorithm works in a nutshell, let’s use the example of a man who takes up walking in the hope of losing weight. On any given day, due to metabolism alone, he burns around 2,500 calories. He finds that walking around 1,000 steps burns around 50 calories, 5,000 steps burns around 250 calories, and 10,000 steps burns around 500 calories. Given these three data points, he can conclude that, for every step he walks, he will burn around 0.05 calories. Mathematically, this relationship can be described as follows:

y = 0.05x + 2500

Here, x represents the number of steps the man walks, 0.05 is the number of calories burned per step, and 2500 is the base calorie expenditure from metabolism. The output y is the number of calories the man has expended from walking & metabolism.
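As a quick illustration (this little function is mine, not part of the project), the same relationship can be written directly in JavaScript:

// Total calories burned: 0.05 calories per step plus the 2500-calorie
// metabolic baseline from the example above.
function totalCaloriesBurned(steps) {
  return 0.05 * steps + 2500;
}

console.log(totalCaloriesBurned(10000)); // 3000 (500 from walking + 2500 from metabolism)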

This relationship above is an example of the man conducting linear regression. However, in the context of computing, the programmer does not compute the relationship — the computer does!

There are a few caveats that come with estimating this kind of dataset. The first is that the real-world datasets this algorithm is applied to are almost never perfectly linear. Any dataset collected via empirical methods is bound to have numerous outliers that can affect how a best-fit line is estimated. However, the dataset should still be clearly linear in shape for a linear regression model to work (it’s in the name after all!). In other words, the data should generally follow a consistent & proportionate pattern as the independent variable increases or decreases.

With that being said, the algorithm is not as simple as picking a few points and figuring out the linear relationship. Since the dataset does not have a perfectly linear relationship, the regression can only estimate the relationship; it will never obtain a perfect result. The algorithm instead finds the relationship by trial and error: it first proposes a relationship, tests how accurate it is, and adjusts the relationship’s parameters accordingly, repeating until the relationship is as accurate as it can get. The parameters are adjusted using another algorithm known as “gradient descent”, which will be briefly discussed later in this article.
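To make that loop concrete, here is a minimal sketch of the idea in plain JavaScript. The function name and starting values are mine, and the project’s actual code (linked at the end of the article) may differ in its details; the gradient math inside the loop is explained later in this article.

// Fit a line y = m * x + b to an array of [x, y] pairs by trial and error.
function fitLine(dataset, learningRate = 0.01, epochs = 100) {
  let m = 0, b = 0; // start with an arbitrary guess
  const n = dataset.length;
  for (let epoch = 0; epoch < epochs; epoch++) {
    let dm = 0, db = 0;
    for (const [x, y] of dataset) {
      const error = (m * x + b) - y; // propose a relationship & see how far off it is
      dm += (2 / n) * error * x;     // track how the error reacts to the slope...
      db += (2 / n) * error;         // ...and to the y-intercept
    }
    m -= learningRate * dm;          // adjust the parameters (gradient descent)
    b -= learningRate * db;
  }
  return { m, b };
}

Each pass through the outer loop is one epoch, which is exactly what the project’s dashboard counts.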

Dataset Generation / Basic Metrics for Optimization

Upon entering the project for the first time, the user is prompted with four inputs: m, b, # of points, and noise. These inputs are used to generate a dataset that loosely follows a linear pattern given the parameters m and b, the slope and y-intercept respectively. The input for the number of points is self-explanatory: how many data points do we want this randomly generated dataset to contain? Finally, the noise input adds scatter to the dataset so that it is not perfectly linear.

Let’s set m = 4, b = 3, # points = 1000, and noise = 0.3. The procedure for the generation of the dataset is as follows:

let dataset = [];
let points = 1000;
let slope = 4;
let yint = 3;
let noise = 0.3;

for (let i = 0; i < points; i++) {
  let x = Math.random(); // random float between 0 and 1
  let y = (x * slope + yint) + (Math.random() * noise);
  /* Math.random() * noise adds a small random offset to spread out the
     datapoints while roughly maintaining the specified linear relationship */
  dataset.push([x, y]); // JavaScript arrays use push(), not append()
}

The dataset’s x values lie between 0 and 1, so, before the noise is added, its y values lie between 3 and 7.

Let’s submit these values into the inputs, and see what we get.

On the left is a graph that contains the line of best fit and the generated dataset. The grey dots represent the dataset, and the green line is the line of best fit. The parameters of both the generated dataset and the line of best fit are listed below the graph.

The linear regression algorithm is responsible for adjusting the green line to best estimate the relationship depicted by the dataset. This green line is referred to as the “line of best fit”. Below the predicted line of best fit and the actual trend of the dataset are the metrics that the linear regression model uses. “Epoch” simply refers to the current iteration or “cycle” the algorithm is running on. For example, if it shows “epoch 100”, then the adjustments to the green line have been made over 100 iterations.

The loss metric next to the epoch metric is very important for the model to make adjustments to the line of best fit. Essentially, the loss is a measure of how “incorrect” the model’s estimate is: for each datapoint, it compares the value predicted by the line with the actual value in the dataset and quantifies how far off the line is. There are numerous ways to calculate this value, but this project uses what’s known as the “Mean Squared Error” function, which will be defined later on. There are other popular functions, such as the “Mean Absolute Error”, and others that are better suited to other types of problems (e.g. “Categorical Crossentropy” for classification problems).

Hyperparameters, Visuals & Gradient Descent

On the right side are the hyperparameters, or parameters that affect how accurately the model learns. I will explain the learning rate in a second, but for now, understand that this is a crucial value that can drastically affect how a model learns. For now, we will set the value equal to 0.01. Epochs, as mentioned above, are simply the number of times the algorithm runs. We will set this value to 100; generally, the higher this number is, the more likely we are to get an accurate model.

Immediately after submitting the information, we are directed to a dashboard outlining ostensibly complicated graphs, equations, numbers, and buttons.

Values of parameters & their calculations after 7 epochs

These are the values that we get after running this algorithm for 7 epochs or rounds. Don’t be intimidated by the amount of formulas & numbers here! I’ll break everything down.

The first formula is our Mean Squared Error loss function, which can be written as MSE = (1/n) Σ (predicted - actual)². The formula in writing may seem daunting, but all that the function is doing is the following (sketched in code right after this list):

  1. Taking the difference between the predicted & actual values for any given x value in the dataset
  2. Squaring that difference
  3. Adding the squared difference to a running total across all of the datapoints
  4. Averaging that total over the number of datapoints in the dataset.
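In code, those four steps look roughly like this (a sketch; the function name is mine rather than the project’s):

// Mean Squared Error of a candidate line y = m * x + b over an array of [x, y] pairs.
function meanSquaredError(dataset, m, b) {
  let total = 0;
  for (const [x, y] of dataset) {
    const difference = (m * x + b) - y; // step 1: predicted minus actual
    total += difference * difference;   // steps 2 & 3: square it and add it to the running total
  }
  return total / dataset.length;        // step 4: average over the dataset
}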

Using this function, we see that, after around 7 epochs, we get a loss of 15.78. This means that, on average, the squared difference between the dataset values & the line of best fit was around 15.78. Obviously, this is not a good estimate. We generally want the loss to be as close to 0 as possible.

Below the information pertaining to the loss function is a vector. This vector is the gradient (denoted by the upside-down delta, ∇), which is simply a vector of derivatives of the cost function with respect to the parameters. So, the first value of the gradient is the derivative of the cost with respect to the slope, and the second is the derivative of the cost with respect to the y-intercept.
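For the Mean Squared Error loss above, those two derivatives work out to the sums in the sketch below (again, the names are mine, not the project’s exact code):

// Gradient of the Mean Squared Error with respect to the slope (m) and the y-intercept (b).
function gradient(dataset, m, b) {
  const n = dataset.length;
  let dm = 0, db = 0;
  for (const [x, y] of dataset) {
    const error = (m * x + b) - y;
    dm += (2 / n) * error * x; // derivative of the cost with respect to m
    db += (2 / n) * error;     // derivative of the cost with respect to b
  }
  return { dm, db };
}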

These values are crucial for adjusting our slope and y-intercept to best fit the data. Take a look at the formulas at the very bottom. Those are the formulas for parameter adjustment. We take the parameter that was used to compute the loss and subtract from it the learning rate multiplied by its respective gradient component. The value we get after subtracting is our new adjusted parameter, which should represent the data more accurately than the previous value. This entire procedure, from calculating the loss to updating the parameters, is known as “gradient descent”.
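Using the gradient sketch above, a single update looks something like this, assuming dataset, m, and b hold the current data and parameter values:

const learningRate = 0.01;

// One step of gradient descent: move each parameter against its gradient,
// scaled by the learning rate.
const { dm, db } = gradient(dataset, m, b); // from the sketch above
m = m - learningRate * dm; // new slope
b = b - learningRate * db; // new y-intercept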

The intuition behind this may be confusing. At least, I thought it was. But it’s simpler than one may believe. Let’s use the gradient of the cost with respect to m as an example: -4.316 when m = 0.898. What does this mean? Essentially, at m = 0.898, increasing m decreases the loss at a rate of about 4.316. Our goal is to find a parameter where the loss is as close to zero as possible. So, since the cost decreases as m increases, we want to increase m so that the loss continues to decrease. In other words, we move the parameter m in the direction opposite to its gradient. If the gradient were positive, that would mean the greater the slope, the higher the loss, so we would want to decrease m. Subtracting the learning rate multiplied by the gradient from the parameter achieves exactly this effect: the parameter moves in the direction that pushes the cost down.

Now, you may already have noticed why having a learning rate is important. Say we eliminate the learning rate from the weight update. The new weight is now 0.898 minus -4.316, which gives us 5.214. We just increased the slope of our best-fit line by over 4.3 in a single step! Essentially, without the learning rate, the parameter adjustments are extremely volatile: they can increase or decrease very heavily, which can lead to very wonky & flat-out incorrect predictions.

An intuitive way to understand how the learning rate works is in the context of walking down a very steep cliff. The learning rate, in this scenario, is similar to the length of a stride. If you walk down recklessly, you are bound to slide down and injure yourself when you hit the bottom. Without a learning rate, the parameters adjust very rapidly & ultimately end up as ridiculously bad values. But if you decide to walk down with small strides or baby steps, you will slowly but surely reach the bottom with (hopefully!) no injuries. Similarly, with a learning rate in place, the parameters adjust themselves slowly, but will ultimately lead to an accurate line of best fit. You can test the importance of the learning rate by setting it equal to 1 (effectively removing it) before starting the training process.

The four graphs shown below these calculations show the relationship between the loss and the parameters of the best-fit line. As we can see, as the parameters get closer to the actual dataset parameters, the loss gets closer to zero, which is our ultimate goal. This is a sign that the model is being run correctly. By the end of the process, the loss and the change in loss in all four graphs should be very close to zero.

Results after 100 Epochs

By epoch 100, we see that the line did not fit the data all that well. This doesn’t necessarily mean that the model did badly. In fact, we could say the opposite. Take a look at the loss score: 0.286! That’s not bad after running this algorithm 100 times, but it can certainly be better. And take a look at the graphs: as the parameters increase (which has been the trend since epoch 1 if you run this project), the loss consistently decreases. At the same time, the change in loss gets closer and closer to zero, which makes sense: as the loss approaches zero, we don’t want it to change rapidly; we want to slow down while still trying to minimize it.

What would have allowed the model to give us an accurate fit for the dataset is simply giving the algorithm more time/epochs to run. If we let the algorithm run for, say, 1000 epochs, we would get a much better estimate of the dataset. Or, we could have increased the learning rate to something like 0.05 or even 0.1 if we wanted the parameter adjustments to happen more quickly. Coming up with ways to change these hyperparameters & applying them to optimize the training process is known as “hyperparameter tuning”, which is one of the most time-consuming yet rewarding aspects of any machine learning task.
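As a toy example of what hyperparameter tuning can look like, the sketch below reuses the fitLine and meanSquaredError sketches from earlier in this article to try a few learning rate & epoch combinations and keep whichever one ends with the lowest loss (the candidate values are just illustrative):

// Hypothetical grid search over the two hyperparameters discussed above.
let best = { loss: Infinity };
for (const lr of [0.01, 0.05, 0.1]) {
  for (const epochs of [100, 500, 1000]) {
    const { m, b } = fitLine(dataset, lr, epochs);
    const loss = meanSquaredError(dataset, m, b);
    if (loss < best.loss) best = { lr, epochs, m, b, loss };
  }
}
console.log(best); // the combination that produced the lowest final loss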

Conclusion

Ultimately, this project helped me understand linear regression better, and, naturally, how other machine learning models work. This project is far from perfect, and some tweaks & features could certainly be added, but at the end of the day, I hope this project helps someone else’s understanding, even by a little bit.

Thanks for taking the time to read this article! If you have any suggestions for this project, or any feature ideas for this project, please let me know in a comment.

The GitHub repository for this project can be found by clicking this link.
