Some bits and pieces of Data Science

Mar 08, 2025

I am currently wrapping up a 10-week course on data science. It is one of the best classes I have taken so far in college, perhaps because of the small class size, a caring teacher, well-documented slides, and abundant support. Read this as a summary of a beginner’s understanding of data science and LLMs, and as always, trust but verify :) Constructive corrections are welcome!

Okay, onto the technicals.

I learned there are 4 types of variables: discrete, continuous, ordinal and nominal. Discrete and continuous are quantitative, meaning things you can do math with, like number of siblings (discrete) and salary (continuous). A postal code looks like a number, but averaging postal codes means nothing, so it is really a nominal label. Ordinal and nominal fall under the qualitative category: ordinal means the categories have a natural order, like Yelp stars or low/middle/high income, and nominal means there is no order to your categories, like male/female.
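To make the four types concrete for myself, here is a small Python sketch of which summary makes sense for which type. All the values and variable names are made up for illustration, and the "pick the middle of the sorted list" median is just one simple convention, not the only one.

```python
import statistics

# All values below are made up for illustration.
salaries = [42000.0, 55000.5, 61250.0]               # continuous
siblings = [0, 2, 1, 2]                              # discrete
income = ["low", "high", "middle", "low", "middle"]  # ordinal
eye_colour = ["brown", "blue", "brown", "green"]     # nominal

# Quantitative variables: arithmetic such as a mean is meaningful.
mean_salary = statistics.mean(salaries)
mean_siblings = statistics.mean(siblings)

# Ordinal: no arithmetic, but the categories can be ranked,
# so a median makes sense once we spell out the order ourselves.
INCOME_ORDER = ["low", "middle", "high"]
ranked = sorted(income, key=INCOME_ORDER.index)
median_income = ranked[len(ranked) // 2]

# Nominal: no order either, so counting is about all we can do.
most_common_colour = statistics.mode(eye_colour)

print("mean salary:", mean_salary)
print("mean siblings:", mean_siblings)
print("median income level:", median_income)       # middle
print("most common colour:", most_common_colour)   # brown
```

Note that sorting the income levels alphabetically would put "high" before "low", which is why the order has to be given explicitly.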

When you want to smooth your data to account for error ranges, you can use kernel density estimation, which places a small probability distribution (the kernel) on each data point and adds them up into one smooth curve that estimates how likely you are to sample a given value. You have a smoothing parameter alpha, often called the bandwidth: a larger alpha gives a smoother curve, but you may lose important information, whereas an alpha that is too small leaves noise and outliers in the visualization. Makes me realize you can also lie with statistics :)
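Here is a minimal sketch of that smoothing idea in plain Python, assuming a Gaussian kernel (the course may have used a different one). The sample, the grid, and the function names `gaussian_kernel` and `kde` are my own made-up choices, and `alpha` plays the role of the smoothing parameter.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the bump placed on each data point."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, alpha):
    """Estimated density at x: the average of Gaussian bumps of width alpha,
    one centered on each data point."""
    return sum(gaussian_kernel((x - xi) / alpha) for xi in data) / (len(data) * alpha)

# Made-up sample with one outlier at 5.0.
data = [1.0, 1.2, 1.9, 2.1, 5.0]
grid = [i * 0.5 for i in range(13)]  # 0.0, 0.5, ..., 6.0

for alpha in (0.2, 1.0):
    curve = [round(kde(x, data, alpha), 3) for x in grid]
    print("alpha =", alpha, curve)
```

With the small alpha, the outlier at 5.0 shows up as its own sharp spike. With the large alpha, it melts into a gentle tail, which is exactly the "you may lose information" trade-off.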

We also covered LLMs and glossed over the GPT architecture. Text gets divided into tokens, and each token is mapped to an embedding vector of floating-point numbers. Text is generated one token at a time, based on how likely each token is to come next. You can vary the temperature: a higher temperature means more creativity, and at temperature 0 the model always picks the token with the highest probability. GPT uses positional encoding and attention. Attention is computed from query, key and value matrices: the queries are compared against the keys, a softmax turns those scores into weights, and the weights are used to average the values. There’s another softmax at the very end that turns the model’s output scores into next-token probabilities.

These matrix operations are kinda costly to the environment (or shall I say, data centers use a heck of a lot of electricity, plus water for cooling the GPU racks). Sadly it’s not considered an existential crisis because rich people can always find a way out. AI safety people hate AI ethics people because AI safety people care about catastrophic risks (a small probability that AI will destroy humanity), whereas AI ethics people care about bias and equity. Google had a team of AI ethics researchers, but its co-leads were pushed out after disagreeing with the company. I guess companies are responsible to their shareholders. Now I don’t know which side of the table I want to be on, because I do love money and don’t want to see stocks dip, and when Dave Ramsey says corporate culture and ping pong tables are a joke, I’m not sure which viewpoint I should be taking. I’ll leave it to you to form your opinion. Maybe I like to take the position of presenting as many viewpoints as I can, and run away from making decisions :)
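Circling back to the technical bit for a second: here is a minimal Python sketch of temperature and attention as I understand them, not how GPT is actually implemented. The vocabulary, the scores, and the function names `softmax`, `pick_next_token` and `attention` are made up for illustration, and treating temperature 0 as "take the argmax" is a convention I am assuming, since you cannot literally divide by zero.

```python
import math
import random

def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_next_token(vocab, logits, temperature, rng=random):
    """Sample the next token; temperature 0 is treated as 'always take the best'."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return vocab[best]
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Compare the query against every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Made-up vocabulary and scores, purely for illustration.
vocab = ["cat", "dog", "pizza"]
logits = [2.0, 1.0, 0.1]

print(pick_next_token(vocab, logits, temperature=0))  # always "cat"
print(softmax(logits, temperature=0.5))  # sharper: "cat" dominates more
print(softmax(logits, temperature=2.0))  # flatter: closer to uniform
print(attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

A real model does this with big matrices for every token at once, which is where the electricity bill comes from.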

Chi’s Substack
