Python pandas and space rocks

by: Marcel-Jan, 7 years ago

Last edited: 7 years ago

After watching the pandas and matplotlib videos I've been playing with these libraries and an asteroid database. It was a nice surprise to see how easy it is to create a graph from data in a JSON file. I really had a lot of fun with that.

I've made a video tutorial which explains (hopefully) everything.
https://youtu.be/iXjJNc8zGsM




Awesome stuff! Great presentation and quality. Keep making videos!

I suggest you use Mean Shift on those clusters in your next tutorial :)

https://pythonprogramming.net/hierarchical-clustering-mean-shift-machine-learning-tutorial/

-Harrison 7 years ago

I managed to read the whole asteroid catalog into memory. I ran into many memory errors (each after waiting half an hour for the entire file to be read) and struggled a bit to get only the columns I wanted into a smaller dataframe.

The solution? Picking out only the attributes I needed for my graphs. And now I have a pickle file ready!
The crude code is here:
https://github.com/Marcel-Jan/Fun-with-Python/blob/master/pandas_asteroids_fullset_1a.py
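
A minimal sketch of the "keep only the attributes you need" idea, not the code from the linked script. The helper name keep_fields and the sample rows are made up for illustration; in the real case the rows would come from the JSON parser instead of a hard-coded list.

```python
import pandas as pd


def keep_fields(rows, fields=("a", "e", "i")):
    """Yield one small dict per asteroid, holding only the wanted fields."""
    for row in rows:
        yield {name: row[name] for name in fields}


# Stand-in for the parsed catalog: made-up values, with extra keys
# that the graphs do not need.
sample_rows = [
    {"a": 2.77, "e": 0.08, "i": 10.6, "Name": "example-1", "H": 3.3},
    {"a": 2.36, "e": 0.09, "i": 7.1, "Name": "example-2", "H": 3.2},
]

# Only the selected fields ever reach the dataframe.
small_df = pd.DataFrame(list(keep_fields(sample_rows)))
print(small_df)
```

Saving the result with small_df.to_pickle("asteroids_small.pkl") (a hypothetical file name) would then give a pickle file that loads without re-reading the full JSON.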

Next step is Mean Shift. But not on the whole set probably :)

-Marcel-Jan 7 years ago
Last edited 7 years ago

We have a mean shift going on! :) Getting the algorithm to work after watching the hierarchical clustering video was much, much easier than reading the large file, which I accomplished yesterday.

I used only the first 10,000 discovered asteroids for this, because I .. ah.. didn't want to wait all day.

The MeanShift algorithm found 17 clusters over the whole set. Cool result! I suspect some clusters consist of only one very distant, lonely asteroid. In the next iteration I want to leave those out of the data set.

Here is a graph with the centroids placed by the algorithm.
https://github.com/Marcel-Jan/Fun-with-Python/blob/master/mean_shift_asteroids_10K_zoomedout.png

The coloring of different clusters didn't work. Oh well, you can't win them all on the first try.

And here is the code:

import pandas as pd
import matplotlib.pyplot as plt
import pickle
import numpy as np
import ijson
from sklearn.cluster import MeanShift

filename = "D:Stuurmpcorb_10kasteroids.json"

asteroid_data = []

with open(filename, 'r') as f:
    objects = ijson.items(f, 'item')
    for row in objects:
        # keep only the three orbital elements needed for the graphs
        selected_row = {"a": row["a"], "e": row["e"], "i": row["i"]}
        asteroid_data.append(selected_row)

asteroid_df = pd.DataFrame.from_dict(asteroid_data)
print(asteroid_df.describe())

# cluster on semimajor axis and eccentricity only (axis must be a keyword in recent pandas)
asteroid_df.drop(['i'], axis=1, inplace=True)

ms = MeanShift()
ms.fit(asteroid_df)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

print(cluster_centers)
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)

colors = 10*['r','g','b','c','k','y','m']

for i in range(len(asteroid_df)):
    plt.scatter(asteroid_df.a, asteroid_df.e, label='asteroid data', c=colors[labels[i]], marker='.', s=1)

plt.scatter(cluster_centers[:,0], cluster_centers[:,1], label='centroids', color='b', marker='x', s=150, linewidths = 5, zorder=10)

plt.xlabel('Semimajor axis')
plt.ylabel('Eccentricity')

plt.title('All known asteroids')
plt.show()
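
A possible reason the coloring failed, as far as I can tell from the loop above: every pass plots all of asteroid_df.a against asteroid_df.e in a single color, so each pass paints over the previous one, and the whole data set is drawn once per row. The sketch below builds one color per point and hands them to a single scatter call. The helper names colors_for_labels and scatter_by_cluster are hypothetical, and ax is assumed to be a matplotlib Axes.

```python
PALETTE = ['r', 'g', 'b', 'c', 'k', 'y', 'm']


def colors_for_labels(labels, palette=PALETTE):
    """Return one color per data point, cycling through the palette by cluster label."""
    return [palette[int(label) % len(palette)] for label in labels]


def scatter_by_cluster(ax, x, y, labels):
    """Draw all points with one scatter call, colored per point."""
    return ax.scatter(x, y, c=colors_for_labels(labels), marker='.', s=1)


# Labels 0 and 7 share a color because the palette has seven entries.
print(colors_for_labels([0, 1, 0, 7]))
```

With the variables from the code above this could be used as fig, ax = plt.subplots() followed by scatter_by_cluster(ax, asteroid_df.a, asteroid_df.e, labels). Drawing everything once instead of once per row may also ease the memory pressure, though I have not measured that.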


Actually, when I ran MeanShift against all three columns it came up with 12 clusters. I would really have loved to see that, but I couldn't get the Axes3D stuff to work, even without the clustering algorithm. I got this error:
python axes3d TypeError: unsupported operand type(s) for *: 'float' and 'decimal.Decimal'

I was pretty sure the data already was decimal. All right, there's always more stuff to do.

-Marcel-Jan 7 years ago

Try converting all your data to float: call float(value) on each value before you append it to the lists/arrays you use for graphing.
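
A small sketch of that conversion, under the assumption that the values arrive as decimal.Decimal (which, as far as I know, is what ijson yields for non-integer numbers by default). The helper name floats_only and the numbers are made up for illustration.

```python
from decimal import Decimal


def floats_only(row, keys=("a", "e", "i")):
    """Return a copy of the selected fields with every value converted to float."""
    return {key: float(row[key]) for key in keys}


# Made-up row holding Decimal values.
row = {"a": Decimal("2.77"), "e": Decimal("0.08"), "i": Decimal("10.6")}

# Mixing a float with a Decimal raises the TypeError quoted in the thread.
try:
    0.5 * row["a"]
except TypeError as err:
    mixed_error = str(err)
    print("TypeError:", mixed_error)

# After conversion the same arithmetic works.
clean = floats_only(row)
scaled = 0.5 * clean["a"]
print(scaled)
```

In the reading loop this would replace the plain row["a"] lookups, so the dataframe holds ordinary floats from the start.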

-Harrison 7 years ago

Cool, the 3D graph works as well.
https://youtu.be/mY_LgS4Y_Z0

-Marcel-Jan 7 years ago

Last result: a combination of MeanShift and a colored 3D graph ends with a Memory Error on my laptop. But I have a more powerful computer I haven't tried yet.

-Marcel-Jan 7 years ago

The combination of MeanShift and a colored 3D graph is giving Memory Errors whatever I try. It's just too much, let alone the whole asteroid database. My other computer has 32GB RAM, but I'm guessing there are some restrictions in Python on using all of that?
Another thing I tried was removing all the data I didn't need for my graph. That was only about 10%, so it didn't really solve the issue.
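
One knob that might be worth trying on the clustering side: MeanShift() without arguments estimates its bandwidth from all points and seeds from every point. The sketch below estimates the bandwidth from a subsample and turns on bin_seeding. The data is synthetic (two made-up blobs standing in for the a and e columns), the quantile and n_samples values are arbitrary, and I have not verified that this resolves the Memory Error; it should mainly reduce the amount of work.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in for the (a, e) columns: two tight blobs, made-up numbers.
rng = np.random.default_rng(0)
blob_1 = rng.normal(loc=(2.2, 0.10), scale=0.02, size=(100, 2))
blob_2 = rng.normal(loc=(3.1, 0.20), scale=0.02, size=(100, 2))
points = np.vstack([blob_1, blob_2])

# Estimate the bandwidth from a subsample instead of from every point.
bandwidth = estimate_bandwidth(points, quantile=0.2, n_samples=100, random_state=0)

# bin_seeding=True starts from a coarse grid of occupied bins
# rather than from every single data point.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(points)

labels = ms.labels_
n_clusters = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters)
```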

Also, funnily enough, when I reduced the data to only the limits of the original 2D graph (2 to 3.5 astronomical units), the algorithm sees only one cluster, so it isn't exactly necessary to do anything with colors. :)
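
For reference, restricting the data to that range could look like the sketch below. The function name slice_by_semimajor_axis and the demo values are made up; the column name "a" follows the code earlier in the thread.

```python
import pandas as pd


def slice_by_semimajor_axis(df, low=2.0, high=3.5):
    """Keep only rows whose semimajor axis lies between low and high (inclusive)."""
    return df[df["a"].between(low, high)]


# Made-up values: two rows fall outside the 2 to 3.5 AU window.
demo = pd.DataFrame({
    "a": [1.2, 2.0, 2.8, 3.5, 39.4],
    "e": [0.30, 0.10, 0.15, 0.05, 0.25],
})
belt = slice_by_semimajor_axis(demo)
print(belt)
```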

Still, there are clearly some "clumps" within the data, but probably because they sit inside other "asteroid clouds" they are not recognised. I wonder if there's any other algorithm for that. I doubt it.

-Marcel-Jan 7 years ago
Last edited 7 years ago
