
Data Mining in C

“Tsoding Daily”

References:
– Wikipedia – K-means Clustering
– “Less is More: Parameter-Free Text Classification with Gzip”

Chapters:
– 0:00:00 – TBD


33 Comments

  1. Man, I love your videos in C. I am slowly starting to love C; my passion for C is growing because of you. Thanks a lot, Tsoding. I hope one day I will send minor patches to the Linux kernel.

  2. Lloyd’s algorithm can be used to create a mesh called a centroidal Voronoi tessellation. I once used it to generate a mesh on a sphere with non-uniform density. That would be pretty cool to make, and it basically uses the same algorithm as the one you implemented.

  3. Yeah, the samples are actually denser in the center, because a point at a larger magnitude is just as probable as one at a shorter magnitude, so the same number of points lands on a large circumference as on a small one, and on the large one they are spread much more sparsely. It's a generate_cluster() problem, not a rand() problem. You could generate points in a square that is 2 radii in width and height and only keep the points that fall within the radius to get a uniform distribution.
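
A minimal sketch of that rejection-sampling idea (`rand_float()` and `uniform_disc_point()` are hypothetical helper names, not code from the video): draw candidates in the disc's bounding square and keep only those inside the radius.

```c
#include <stdlib.h>
#include <math.h>

// Hypothetical helper: uniform float in [0, 1).
static float rand_float(void)
{
    return (float) rand() / ((float) RAND_MAX + 1.0f);
}

// Rejection sampling: pick points in the 2r-by-2r bounding square
// and retry until one falls inside the disc. The accepted points
// are uniformly distributed over the disc's area.
static void uniform_disc_point(float radius, float *x, float *y)
{
    for (;;) {
        float px = (2.0f * rand_float() - 1.0f) * radius;
        float py = (2.0f * rand_float() - 1.0f) * radius;
        if (px*px + py*py <= radius*radius) {
            *x = px;
            *y = py;
            return;
        }
    }
}
```

On average about pi/4 (~79%) of the candidates are accepted, so the retry loop is cheap in 2D.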

  4. As seen in a video by mathemanic called "The numerical simulation is not as easy as you think": the phenomenon where the clusters are denser at the center can be fixed by setting the magnitude equal to the square root of a uniform random variable between 0 and 1. That is (as others have pointed out), the area of a circle does not grow at a constant rate as you increase the radius; it grows with the square of the radius.

  5. 44:04 Maybe it's denser in the center for the same reason that if you take sticks of equal length and place them with their ends at one point (4 sticks look like +, 5 sticks look like *), the whole thing's center is dense (the biggest wood-to-air ratio by volume).

  6. If you replace the commas with NUL bytes you can use the C string APIs directly, without the temporary buffer. That way the CSV is actually a sequence of null-terminated strings. You do need to keep track of the newline and replace that with a NUL as well.
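
A sketch of that in-place approach (`split_csv_line()` and its signature are illustrative, not from the video), assuming no quoted fields containing commas:

```c
#include <string.h>
#include <stdlib.h>

// Tokenize one CSV line in place by overwriting each comma with
// '\0', so every field becomes a NUL-terminated string usable with
// the standard C string APIs directly, with no temporary buffer.
static size_t split_csv_line(char *line, char **fields, size_t max_fields)
{
    // Replace the trailing newline (if any) with a NUL as well.
    line[strcspn(line, "\r\n")] = '\0';

    size_t n = 0;
    while (n < max_fields) {
        fields[n++] = line;
        char *comma = strchr(line, ',');
        if (comma == NULL) break;
        *comma = '\0';
        line = comma + 1;
    }
    return n;
}
```

Note the buffer must be writable (an array or heap copy), not a string literal, since the separators are overwritten.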

  7. Hi!
    Why aren't all your utils like nob_da_append/... on the nobuild GitHub? Is it your "custom" version? It would be really cool to have them there!

    Happy New Year!!

  8. You could visualize the high-dimensional data by running PCA to reduce the dimensions. In your case you can do PCA down to dimension 2, and what you would obtain is a 2-dimensional vector whose 2 values have the largest “explained variance”; this basically means those 2 directions contribute more to the variance in the data than any other 2.

    You would be able to do the clustering in the high dimension and just display it using the PCA.

  9. The reason the samples are denser in the center is that you were generating them by randomizing the magnitude and then the angle. Randomizing the magnitude uniformly means the probability that a sample lies within distance r*mag of the center is mag itself. However, the area in which those samples can be placed (pi*(r*mag)**2) doesn't grow at the same rate as r*mag; thus, the greater mag is, the lower the density. The Bertrand paradox illustrates this phenomenon really well.

  10. rand() is uniform, but to scatter points uniformly on a disc, you need to use mag = sqrtf(rand_float()).
    Otherwise you will get the same number of points per magnitude value (on average), which means points near the center will be closer to each other.
