How do we know what we know about stars?

A research project on astronomy and machine learning

Ever heard a scientific claim about those tiny spots in the night sky, something like 'that is a young star fusing hydrogen into helium at its core'? Or that one of the Three Marys (the stars of Orion's Belt) may have died hundreds of years ago? The aim of this research is to learn the basics of how scientists are able to determine whether what we are looking at is a star, a comet, a distant galaxy, a star inside that galaxy, a supernova or any other source of light, even when all we can get is a single pixel in the biggest camera ever built.


Any beam of light can give us a lot of information if we look at it closely. For instance, if we refract it in a prism, as Newton famously did centuries ago, we would be able to recognize the spectral lines of the elements that emitted that light.

Light is a wave. Different elements emit or absorb light at very specific wavelengths (colors), so by analyzing the light that passes through a prism, scientists can draw conclusions about which elements emitted that light, or which elements in an atmosphere absorbed it before it reached our eyes.

We can even tell how far away a star is by the same method of refraction. The thing is, the universe is always expanding; Einstein's equations predicted this about 100 years ago (and he famously dismissed the prediction at first). As space expands, everything is moving away from us, and whatever is farther is moving away faster, because there is more expanding space in between. This creates a Doppler effect, like the difference in pitch of an ambulance siren: higher when it is coming toward us, lower when it is moving away. The same applies to light: light is a wave, and its wavelength gets stretched (redder) as the source moves away from us. This is called redshift.
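
As a small sketch of the idea, redshift can be computed by comparing the observed wavelength of a known spectral line with its laboratory (rest) wavelength; the wavelengths below are illustrative, not taken from the dataset.

```python
# Redshift compares the observed wavelength of a known spectral line
# with its rest (laboratory) wavelength.
C_KM_S = 299_792.458  # speed of light in km/s

def redshift(observed_nm: float, rest_nm: float) -> float:
    """z = (lambda_observed - lambda_rest) / lambda_rest."""
    return (observed_nm - rest_nm) / rest_nm

# Hydrogen-alpha line: rest wavelength ~656.3 nm.
z = redshift(observed_nm=722.0, rest_nm=656.3)

# For small z, the recession velocity is approximately c * z.
velocity_km_s = C_KM_S * z
print(round(z, 3), round(velocity_km_s))  # → 0.1 30011
```

A bigger stretch in wavelength means a bigger z, which means a faster-receding, and therefore more distant, source.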

This is especially important because, if we want to compare stars, we need to focus on their real brightness and colors, not their apparent ones.

Ideally, just by looking really closely at one of these dots, we could learn which elements compose the star, as well as how far away it is. But spectrometry needs a lot of light and a lot of time, and the universe is big [citation needed], so if we want to learn a lot about a lot of tiny, faraway stars, we need a different approach.


The newer technique replaces spectrometry with a less specific one, photometry. Instead of using a prism, it measures the amount of light in each of a few broad ranges of wavelengths, a bit like how the human eye sees colors. Humans can tell apples from oranges without knowing all the elements that emit those colors.

To achieve this, astronomers are building a telescope in Chile, equipped with 6 passbands (filters), so that each picture it takes provides a measurement in only one of those 6 colors. The passbands, named u, g, r, i, z and y, span from the ultraviolet through the visible to the near-infrared.

Since each exposure only gives us one value of brightness, i.e. a measurement through one of the passbands at a certain time, redshift correction has to be done by shifting the measured wavelengths back toward the blue to recover the rest-frame brightness.
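
The blueward correction itself is a one-liner; this is a minimal sketch with an invented example value.

```python
# To compare intrinsic colors, shift observed wavelengths back to the
# rest frame: lambda_rest = lambda_observed / (1 + z).
def to_rest_frame(observed_nm: float, z: float) -> float:
    return observed_nm / (1.0 + z)

# A feature observed at 722 nm in an object with redshift z = 0.1
# was actually emitted at a bluer wavelength:
print(round(to_rest_frame(722.0, 0.1), 1))  # → 656.4
```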

The key factor of this method, given that one cannot get information from all the passbands at once, is repetition of measurements, and time, a lot of time. The telescope is still being built, nobody knows exactly what it will see, nor how to process the expected 15 TB of nightly data. So its scientists created a simulation of three years of received data, to enlist data scientists all over the world in the task of figuring out what we are looking at in each case. The dataset consists of 7,848 stars belonging to 14 different classes, with 1,421,705 measurements in total.

Each time the telescope takes a picture, it produces observations of many different light sources (pixels). The information about each measurement looks like this:

- Day 59750.4 MJD, right ascension 349º, declination -61.9º, object 615, passband 2, measured magnitude 544.8, measurement error 3.6, object detected.
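
In code, one such row can be modeled as a small record; the field names below are illustrative, not the dataset's exact column names.

```python
from dataclasses import dataclass

# One row of the dataset, mirroring the fields in the example above.
@dataclass
class Observation:
    mjd: float        # Modified Julian Date of the exposure
    ra: float         # right ascension, degrees
    dec: float        # declination, degrees
    object_id: int
    passband: int     # 0..5, one of the six filters
    flux: float       # measured brightness
    flux_err: float   # measurement error
    detected: bool

obs = Observation(mjd=59750.4, ra=349.0, dec=-61.9, object_id=615,
                  passband=2, flux=544.8, flux_err=3.6, detected=True)
print(obs.object_id, obs.passband)  # → 615 2
```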

If we put together all the measurements from a star and color them by passband, this is what we get.

MJD stands for Modified Julian Date. In astronomy it is convenient to use Julian dates, a calendar that starts in 4713 BC (named by the scholar Joseph Scaliger after his father Julius) and simply counts decimal days: no years, months, hours or minutes. The Modified Julian Date was introduced in 1957 by the Smithsonian Astrophysical Observatory to track Sputnik, because the computers of the time could not comfortably handle such big numbers, so the count was restarted: MJD = JD − 2400000.5, which puts day zero at midnight on November 17, 1858.
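
Converting an MJD into a familiar calendar date takes only the standard library:

```python
from datetime import datetime, timedelta

# MJD counts days (with decimals) from midnight on 1858-11-17,
# i.e. MJD = Julian Date - 2400000.5.
MJD_EPOCH = datetime(1858, 11, 17)

def mjd_to_datetime(mjd: float) -> datetime:
    return MJD_EPOCH + timedelta(days=mjd)

print(mjd_to_datetime(59750.4))  # → 2022-06-20 09:36:00
```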

At first glance, what we see from this star over these 900 days looks like fluctuating noise. It’s important to mention the data collection strategy: since the universe is so big, the telescope focuses on small patches of sky, a bunch at a time. Astronomers will collect information on a certain patch in all the passbands for one day, then again the following week, and so on for a couple of months; then they leave that patch unattended for a few months, and repeat. So what we are mostly looking at is about three of these bunches of measurements spread over a period of around 900 days.

What can we know so far? Well, from this noise we can get the amplitude in each passband. We know that a supernova can shine more than 10,000 times brighter than ordinary stars, and that galaxies can shine 100 times brighter than supernovae. But the amplitude we get is the measured one; the real brightness can only be obtained after correcting for the calculated distance.

Supernovae shine once and fade away, galaxies shine fairly constantly, but some stars fluctuate regularly by nature. Even our own Sun is variable: its energy output varies by about 0.1 percent, one part in a thousand, over its 11-year solar cycle.

Cepheids, for example, are a kind of star that pulsates, varying in size regularly, with a fixed relationship between period and brightness. Because of that, since this relationship was discovered back in 1908, in a world without bits and pixels, Cepheids have been used as galactic measuring sticks whenever they are spotted inside galaxies or nebulae. By watching a Cepheid's period (which does not change with distance), astronomers could calculate its real brightness, and by comparing that with its apparent brightness they could make a pretty solid estimate of how far away it was.
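
The arithmetic behind that measuring stick fits in a few lines. The period-luminosity coefficients below are approximate published values for the V band, used here purely for illustration:

```python
import math

# Rough sketch of the Cepheid distance ladder.
def absolute_magnitude(period_days: float) -> float:
    # Leavitt law (V band, approximate): M = -2.43*(log10 P - 1) - 4.05
    return -2.43 * (math.log10(period_days) - 1.0) - 4.05

def distance_parsecs(apparent_mag: float, absolute_mag: float) -> float:
    # Distance modulus: m - M = 5*log10(d_pc) - 5
    return 10 ** ((apparent_mag - absolute_mag + 5.0) / 5.0)

# A Cepheid pulsing every 10 days, seen at apparent magnitude 12:
M = absolute_magnitude(10.0)     # → -4.05
d = distance_parsecs(12.0, M)    # distance in parsecs
print(round(M, 2), round(d))
```

The period gives the real brightness; comparing it with the apparent brightness gives the distance, exactly the reasoning described above.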

A pulsating star is not easy to detect, though. The main problem is that our data is collected at irregular times, a result of the telescope's strategy of covering as much sky as possible. Needless to say, cloudy nights do not help either. Even a plain sine wave is hard to spot when all we have are scattered random samples of it.

But if we know the period of this sine wave, we can visualize it much better by plotting the data not against the time of observation, but against the phase within the period:
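
This "phase folding" is a tiny transformation; here is a minimal sketch with invented observation times:

```python
# Phase-folding: map each observation time onto its phase within the
# period, so a periodic signal sampled at random times lines up.
def fold(times, period):
    return [(t % period) / period for t in times]  # phase in [0, 1)

times = [0.1, 5.3, 12.7, 33.4, 90.2]   # days, irregularly sampled
print([round(p, 3) for p in fold(times, period=2.5)])
# → [0.04, 0.12, 0.08, 0.36, 0.08]
```

Two observations taken months apart but at the same phase land on the same point of the folded curve.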

The problem is that, to do this transformation, we first need to know the period. So how do we extract it from the measured noise? For that purpose, the Lomb-Scargle algorithm can be used. For those who know some math, it is similar to a Fourier transform, but it is specifically designed for irregularly sampled measurements. The Lomb-Scargle periodogram displays the main frequencies that best explain the measured data.
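
To make the idea concrete, here is a brute-force stand-in for the Lomb-Scargle approach: for each trial frequency, least-squares fit a sinusoid to the irregular samples and score how much signal it captures. A real analysis should use a proper implementation such as astropy.timeseries.LombScargle; this sketch only illustrates the principle on synthetic data.

```python
import math, random

def sinusoid_power(t, y, freq):
    """Energy captured by a least-squares sinusoid fit at `freq`."""
    w = 2.0 * math.pi * freq
    s = [math.sin(w * ti) for ti in t]
    c = [math.cos(w * ti) for ti in t]
    # Normal equations for y ~ a*sin(wt) + b*cos(wt)
    ss = sum(si * si for si in s); cc = sum(ci * ci for ci in c)
    sc = sum(si * ci for si, ci in zip(s, c))
    sy = sum(si * yi for si, yi in zip(s, y))
    cy = sum(ci * yi for ci, yi in zip(c, y))
    det = ss * cc - sc * sc
    if abs(det) < 1e-12:
        return 0.0
    a = (sy * cc - cy * sc) / det
    b = (cy * ss - sy * sc) / det
    fit = [a * si + b * ci for si, ci in zip(s, c)]
    return sum(f * f for f in fit)

random.seed(0)
true_period = 0.32                                  # days
t = sorted(random.uniform(0, 30) for _ in range(200))
y = [math.sin(2 * math.pi * ti / true_period) for ti in t]

freqs = [2.0 + 0.005 * k for k in range(500)]       # 2.0 .. 4.5 cycles/day
best = max(freqs, key=lambda f: sinusoid_power(t, y, f))
print(round(1.0 / best, 2))                         # recovered period, days
```

Despite the random sampling, the frequency that explains the most signal recovers the underlying period.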

This star in particular has a 0.32-day period. Its short period tells us we are looking at a Cepheid, because other kinds of pulsating stars have periods of weeks or months. By then changing the visualization from days to phase, just like we did with the sine wave, we finally get something nicer than the initial noise we had.

This is called a light curve. Basically, it tells us how a star shines in each of the colors over its cycle. It is not as accurate as live spectrometry over the course of a cycle, but it is just as important for validating the hypotheses astronomers make about the composition and internal reactions of different kinds of stars.

Machine Learning

Combined with photometric data, astronomers are using the power of computing to catalogue the stars we see. The idea is to teach the computer what different star classes look like by giving it examples, then show it a new set of stars and see how well it predicts their labels.

So, what happens if we take all of this raw data and run a machine learning algorithm to try to predict what the stars are? In machine learning, since we work almost blindly, the way to build a classifier is to split the dataset: we take a big part of it to train the computer, and save a small part for later, to test the trained algorithm by asking it to classify that held-out information and comparing the results with the labels we know are correct. The result is a confusion matrix: a comparison between what the computer thought the stars were and what we know for sure they were.
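
A confusion matrix is nothing more than a count of (true label, predicted label) pairs; this minimal sketch uses made-up labels in the style of the dataset's class numbers:

```python
from collections import Counter

# Count (true, predicted) pairs: the cells of a confusion matrix.
def confusion_matrix(true_labels, predicted_labels):
    return Counter(zip(true_labels, predicted_labels))

true = [90, 90, 42, 90, 42, 65]
pred = [90, 90, 90, 90, 42, 90]

cm = confusion_matrix(true, pred)
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
print(cm[(90, 90)], cm[(42, 90)], round(accuracy, 2))  # → 3 1 0.67
```

Cells on the diagonal, where true and predicted labels match, are the correct classifications.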

We can see in the bluest square that the computer predicted 429 stars as class 90 (we don't even know what class 90 means) and that they actually were class 90: win! But in the second bluest, we find 198 stars the computer thought were class 90 that actually were class 42: not so good. This raw-data algorithm reached an accuracy of 30%; one out of three stars was correctly classified. Impressive!

Well, it is not so impressive once we take into account that the most common class holds 29.47% of the dataset: what the smart computer did was say 'they are all class 90', which happened to give the biggest possible accuracy with the information it had.

So let’s process the data. Out of the information received for each star in each passband, we are going to extract the maximum brightness, the minimum, the amplitude, the standard deviation, the median, the mean, the slopes, and the folded light curves we were describing previously.
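
The summary statistics among those features take only the standard library; the flux values below are invented for illustration:

```python
import statistics

# Reduce one star's stream of flux measurements in one passband
# to a handful of summary features.
def extract_features(fluxes):
    return {
        "max": max(fluxes),
        "min": min(fluxes),
        "amplitude": max(fluxes) - min(fluxes),
        "std": statistics.stdev(fluxes),
        "median": statistics.median(fluxes),
        "mean": statistics.mean(fluxes),
    }

fluxes = [544.8, 530.1, 551.2, 498.7, 540.0]   # one star, one passband
features = extract_features(fluxes)
print(round(features["amplitude"], 1))  # → 52.5
```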

What we get is 68% accuracy. We start seeing bluer squares along the diagonal, which is what we want, since the diagonal is where the predicted label and the true label agree.

Still not satisfied with the 68% accuracy, we split the dataset between galactic light sources and extragalactic ones, to see where the problem really was. We can tell a star is inside the Milky Way because its measured redshift is 0: since such stars are relatively close by, the Doppler effect is negligible.

Galactic classes reached 97% accuracy, while extragalactic ones scored 57%. So we will focus on improving the extragalactic classes, and that way improve the overall score. Since their only difference is that the latter have redshift, we can theorize that its effect is confusing the model and making it fail. So let's add redshift correction to each of the light curves and try again.

The result is an improved 70% accuracy, but we still observe the computer over-predicting class 90, the label with the most stars. What would happen if we had a dataset with the same number of stars in each class? To achieve that, we resample the dataset, making copies of the stars in the least populated classes to balance it. By doing this, we expect to reduce the 'most common star' effect.
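
That resampling step can be sketched as naive oversampling: duplicate stars from under-represented classes until every class matches the most common one. Labels and counts here are invented for illustration.

```python
import random
from collections import Counter

random.seed(1)

def oversample(samples, labels):
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for l, group in by_class.items():
        # Pad each class with random copies of its own members.
        picks = group + [random.choice(group) for _ in range(target - len(group))]
        out_samples.extend(picks)
        out_labels.extend([l] * target)
    return out_samples, out_labels

samples = ["s1", "s2", "s3", "s4", "s5"]
labels  = [90, 90, 90, 42, 65]
balanced, bal_labels = oversample(samples, labels)
print(Counter(bal_labels))  # every class now has 3 stars
```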

By this point, hands were raised in the air celebrating a 97% accuracy score and a really healthy-looking confusion matrix. But as usual in science, what seems too good to be true usually isn't. What happened here is called a data leak: copied stars that were supposed to be used for training leaked into the test dataset, so we were asking the computer to predict stars identical to the ones it had trained on. That was not our goal, since the aim of the study was to create a model that extrapolates what it sees in known stars to classify completely new ones.

Once the data leak was corrected (by keeping the copied stars in the training dataset and leaving the test dataset untouched), we built one last model with both galactic and extragalactic classes to get the final result.
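
The leak-free recipe can be sketched in a few lines: split first, then oversample only the training part, so no copy of a training star can end up in the test set. The numbers here are stand-ins, not the real dataset.

```python
import random

random.seed(2)

def split(samples, test_fraction=0.2):
    shuffled = samples[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

stars = list(range(100))                 # stand-ins for labeled stars
train, test = split(stars)
copies = random.choices(train, k=20)     # oversample the training split only
train_balanced = train + copies

# Every copy came from the training split, so none can leak into test:
print(len(train_balanced), len(test), set(copies) & set(test))
```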

We still see the predominance of class 90 among the predicted labels, but the diagonal looks more populated than before: we have reached a global accuracy of 80% with this basic processing. A next step would be to analyze why classes 90, 62 and 42 get so mixed up; that would be the first move toward improving the algorithm.


The developed model may not be the most accurate or the fastest: it took the computer (a Late 2012 MacBook Pro) 3 hours to process 60 MB of data, so it would take the same computer 80 years to process what the telescope gets in a single night. But the objective was not to create the best model; it was to get a glimpse of how it is that we understand stars, even when the information we get is really limited.

The computer model can be found in this Python notebook (the classifier used was Random Forests); datasets and more information about the competition, as well as forums and alternative models, can be found on Kaggle.

Special thanks to Alvaro Ulloa, who teamed up with me for this research, and to PhD Mateo Fernandez Alonso, who guided us through the scientific details.