Dr. Melodie Kao, Part 1: How a PhD astrophysicist thinks about data

Dr. Melodie Kao is a Heising-Simons 51 Pegasi b Fellow at UC Santa Cruz in the Department of Astronomy & Astrophysics, and she formerly held a NASA Hubble Postdoctoral Fellowship at the School of Earth and Space Exploration at Arizona State University.

We spoke about big data, magnetospheres, transparency in science, and much, much more. This interview has been split into two parts, the first covering Melodie's specific research and use of data, and the second focusing on bigger issues with STEM education and equity. This interview has been condensed and edited for clarity.

Barry: I was hoping we could start with you telling us a little bit about your research. Could you explain it to me at a fifth-grade level?

Melodie: I study radio emissions from very low mass stars, and also a type of object called a brown dwarf. These are failed stars: they weren't massive enough to burn hydrogen in their cores the way that normal stars do. I study brown dwarfs as a means of understanding the engines that drive the magnetic fields they produce, which is interesting because they are magnetic analogues with properties similar to gas giant planets.

There’s a whole group of people who are trying to detect and study planetary magnetic fields, but I go about it from the opposite direction: I study brown dwarf magnetic fields so that I can understand planetary magnetic fields and their observational and intrinsic characteristics.

Tell me a little bit about how your research might apply to things here on Earth. Obviously, Earth has a magnetic field. Are there things to learn about the way our magnetic field works based on your research with brown dwarfs?

Yeah, I actually talk a fair amount to some terrestrial dynamo modelers (dynamos are the engines that drive magnetic fields). Earth's magnetic field is driven by the molten iron at its core and the motion of that iron rising, releasing its thermal energy, and then falling back down again.

Instead of molten iron, Jupiter and brown dwarfs have hydrogen, which is under so much pressure that it develops metallic properties. So even though the materials are different, the underlying physics that powers their magnetic fields is still very similar. You're still looking at the transformation of thermal energy into magnetic energy, and the convection of fluids.

Another cool thing is that some of the radio emissions I'm seeing could potentially be caused by volcanically or magnetically active planets orbiting brown dwarfs. If this actually bears out to be true in the future, these emissions could tell us about the magnetospheric or ionospheric properties of those terrestrial planets.

Jupiter from the Hubble Space Telescope — NASA, ESA, A. Simon (NASA-GSFC), and M. H. Wong (UC Berkeley); Image Processing: J. DePasquale (STScI)

When I think about astronomy, my mind naturally goes to people in lab coats peering through big optical telescopes up on a mountain. But the modern field is reliant on analytics and massive data streams from radio telescopes. Tell me a little bit about that, and the workflows and tools that you use to do your job as a modern astronomer.

I would say that astronomy has honestly always been about data analytics, it's just that the nature of the data and the analytics have changed over the years.

Initially, the data was literally just tracking the movement of particularly bright stars, which we now know are planets, or sketching the surface of the moon. Now we have better instruments: instead of telescopes that are maybe a couple of inches across, they are now often 10 meters across. In my case, I use a set of 27 dishes that are 25 meters across, so we have a lot more data.
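To give a rough sense of that jump in scale, here is a quick back-of-envelope comparison of collecting areas in Python. The 2-inch aperture is just a stand-in for an early small telescope, and collecting area is only a proxy; modern data volumes also depend on receivers and correlators.

```python
import math

def dish_area(diameter_m: float) -> float:
    """Collecting area of a circular aperture, in square meters."""
    return math.pi * (diameter_m / 2) ** 2

small_scope = dish_area(2 * 0.0254)   # a ~2-inch telescope
ten_meter = dish_area(10.0)           # a modern 10-meter optical telescope
dish_array = 27 * dish_area(25.0)     # 27 dishes, each 25 meters across

print(f"2-inch telescope: {small_scope:.4f} m^2")
print(f"10-meter telescope: {ten_meter:.0f} m^2")
print(f"27 x 25-meter dishes: {dish_array:.0f} m^2")
# Roughly 0.002 m^2 vs ~79 m^2 vs ~13,000 m^2 of collecting area.
```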

Because the instruments are so much bigger and better now, they're also much more expensive to build and operate. This drives real pressure to get as much information out of the data as possible, as well as to make actual observing as efficient as possible. More and more telescopes and observatories are moving towards the model of what we call “queue” observing, where instead of getting assigned one or two entire nights out of the year, you're only given three or four hours at a time and someone slots it in when it makes the most sense. When I observe, I literally write down an observing script and send it to the observatory. Maybe a couple of weeks later, I get an email telling me that the telescope has made my observations and the data is ready to download.

Tell me about your workflow with that data. How do you go from that raw radio telescope output to developing the insights of your research?

I actually don't really have a standardized workflow. I think that just reflects the nature of the data analysis that I do, which changes depending on the experiment at hand. My experiments range from very deep single object observations that might be time series data, to trying to tease out comparisons of what's happening between different statistical samples.

My workflow begins before I get the data, when I spend time really carefully designing an experiment by writing an observing proposal. After I’ve won my time, I make an observing script, the telescope observes the data, and it goes to the archive. Then the National Radio Astronomy Observatory (NRAO) will send this data through a data analysis and data reduction pipeline that they've written mostly in Python, and out spits an "initially calibrated" dataset.

At this point I go back in and actually open it up using the data reduction software that NRAO has written, and I examine the data more closely to pick up any sources of interference or bad data that the pipeline didn't initially flag. From there, I use Python scripts where the basic steps are more or less the same, but I change them depending on each target and what I know about it.
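For readers curious what that manual inspection step can look like, here is a minimal sketch using tasks from NRAO's CASA package, run inside the CASA shell. The measurement set name and the antenna and scan selections are hypothetical placeholders, not choices from Melodie's actual data.

```python
# Minimal sketch of a manual flagging pass after the pipeline runs.
# All file names and selections below are illustrative placeholders.

# Inspect amplitude versus time to hunt for interference the pipeline missed.
plotms(vis='target.ms', xaxis='time', yaxis='amp', coloraxis='antenna1')

# Flag a misbehaving antenna and a scan contaminated by interference.
flagdata(vis='target.ms', mode='manual', antenna='ea05')
flagdata(vis='target.ms', mode='manual', scan='12')

# Save the flag state so the step is reversible.
flagmanager(vis='target.ms', mode='save', versionname='manual_pass_1')
```

In practice the specific flags are chosen by eye, target by target, which is why this step resists full automation.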

Finally, I'll go ahead and image the data and start to make measurements on the images. If the experiment is more of a statistical setup, then I'll start to compile datasets of all of the different characteristics that I'm interested in. Then I'll feed that through the big Python framework that I've built up over the last two years to get more information out of it. The big thing I want to emphasize is that the quality of my data analysis starts with my experiment design. I’m asking specific questions that drive the rest of my analysis and workflow.
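To make the imaging and measurement step she describes a bit more concrete, here is a sketch along the same lines, again using CASA tasks in the CASA shell. The image size, cell size, cleaning depth, and measurement regions are placeholder values, not her actual settings.

```python
# Sketch of imaging a calibrated measurement set and measuring the result.
# Every parameter value here is an illustrative placeholder.

tclean(vis='target.ms', imagename='target_image',
       imsize=2048, cell='1.0arcsec',    # pixel grid sized to the beam and field of view
       weighting='briggs', robust=0.5,   # trade-off between sensitivity and resolution
       niter=1000, threshold='10uJy')    # deconvolution depth

# Basic image statistics, e.g. the RMS noise in a region away from the source...
stats = imstat(imagename='target_image.image', box='100,100,600,600')

# ...and a Gaussian fit to the source to pull out its flux density.
fit = imfit(imagename='target_image.image',
            region='circle[[1024pix, 1024pix], 20pix]')
```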

A pair of brown dwarfs — Michael Liu, University of Hawaii

How big are these datasets that you’re working with?

The telescope can observe in four different configurations, which give you different noise and spatial resolution properties. Depending on the configuration that you're observing in and how long you're observing for, one object might be anywhere from roughly 50 gigabytes to maybe a terabyte. By the time I'm done reducing that data and imaging it, it'll probably have roughly tripled to quadrupled in size. I tend to have larger surveys, so by the time I'm done I might have several tens of terabytes of data.
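As a back-of-envelope illustration of how those numbers add up (the survey size here is a made-up example, not one of her actual programs):

```python
# Hypothetical survey: how raw data volume grows through reduction and imaging.
n_targets = 30              # made-up survey size
raw_per_target_gb = 500     # within the ~50 GB to ~1 TB per-object range quoted above
growth_factor = 3.5         # data roughly triples to quadruples after reduction and imaging

raw_tb = n_targets * raw_per_target_gb / 1000
final_tb = raw_tb * growth_factor
print(f"~{raw_tb:.0f} TB raw, ~{final_tb:.0f} TB after reduction and imaging")
# -> ~15 TB raw, ~52 TB after reduction and imaging
```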

Where is this computation happening? You’re certainly not doing this on your laptop, right?

No. This is actually a big challenge for radio astronomy in particular — it's very data-intensive. If you talk to optical or UV astronomers, they don't have the same computing concerns as radio astronomers do.

NRAO has a computing cluster that I use pretty regularly, but I'm constantly up against my allotment with them. I also have my own workstation, which has about 20 terabytes of storage and somewhere on the order of 200 gigabytes of RAM. I'm getting another computer that will have 150 terabytes of space on it to do all my computing.

That's just the raw radio data, though. All of the computations that I'm running for statistics are just on my laptop. By the time I have statistical data in a text file, I can just open it up and run calculations on my laptop, thankfully.
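Once everything is distilled into a table like that, the statistical side really can run on a laptop. Here is a minimal illustration of the idea with pandas; the file name and column names are hypothetical, not her actual catalog.

```python
import pandas as pd

# Hypothetical whitespace-delimited table of per-target measurements.
df = pd.read_csv("brown_dwarf_sample.txt", sep=r"\s+")

# Quick summary statistics across the sample.
print(df[["radio_flux_ujy", "rotation_period_hr"]].describe())

# Example comparison between two sub-samples, e.g. detections vs. non-detections.
print(df.groupby("detected")["rotation_period_hr"].median())
```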

How did you learn to code? And what was your first language?

The very first time I ever coded anything, I didn't really think of it as coding. I was writing functions for my calculator, and to me it was just a very handy, cool shortcut button. I remember thinking it was strange when everyone thought it was really cool to write this little program, since I went through a phase where coding wasn't a cool thing to do: it was a thing that nerds did!

The first time I felt like I was legitimately coding was at a high school summer program at MIT called the Women's Technology Program. It was four weeks of learning basic circuits, discrete math, and Java. When I came back to MIT after that, it was for an architecture major, so I didn't actually do any more coding until I switched about a year and a half later to aerospace engineering. That was when I was required to take a Java course.

But I'll admit that as an astronomer, I never use Java. The first time I did any sort of real data analysis with coding was for my Junior Lab, where we recreated famous physics experiments and used MATLAB to analyze that data. Half the battle for that class was just learning how to analyze data with MATLAB because we didn't take a MATLAB class preceding it. Luckily, my labmate taught me everything that I know about MATLAB.

After that, I went to Chile for half a year and worked at the Cerro Tololo Inter-American Observatory for an REU (Research Experience for Undergraduates) program funded by the National Science Foundation. My research mentor there, Craig, sat with me for two or three hours every day to teach me how to read science papers and how to code in Perl and Fortran 77.

I did not learn Python until towards the very tail end of grad school, I would say. And then I didn't really use it in earnest until probably two or three years ago, when my science demanded it in order to move to big statistical analyses. A dear friend who is a machine learning researcher and software engineer taught me nearly everything I know about Python.

Milky Way over NRAO's Karl G. Jansky Very Large Array — NRAO/AUI/NSF, Jeff Hellerman

Wow! That's quite the journey. If I could switch gears a little bit: What are your thoughts about data transparency in science, and what, if any, problems are there for publishing your work?

That is actually a very complex question. I would love to see greater data transparency and reproducible results; I think those are good goals to have. If you talk to scientists, people agree that the ideal is really great, because science is all about reproducibility: it is about the free exchange of data in some ways. We're definitely seeing a push for that in astronomy, including more and more people publishing their code on GitHub.

But reproducible results also include making data freely available, not just the code. For people who use national facilities, typically data automatically becomes public after a certain time period, or even immediately in the case of the Hubble Space Telescope. But this is not necessarily the case for people who build their own smaller instruments or use privately owned facilities. There are also entire nations doing research that don't have a data sharing policy.

So the tricky thing is that the science that comes out of that data is what drives results, and the data basically becomes our version of intellectual property. Right now, our job market, especially for people who are trying to find long-term positions, actually disincentivizes data sharing because there is an omnipresent fear of being "scooped".

Data is a resource that we put a lot of time into winning, by writing and vetting grants and proposals. There's always a low level of fear that we will put in all this work, and then someone will come in and scoop up our data, write up the results, and ultimately get the credit.

The real question to answer is: how do we move science away from its current scarcity mindset to one oriented toward generosity? That gets into the very human need to feel secure enough to take risks and share. We have more than enough data to go around, but not enough jobs or resources to make the most use of that data. Is that what we want as a society?

One of the questions I've had to really grapple with is that I just cannot personally host all the terabytes of data I'm collecting. So the compromise I've arrived at is that my data eventually becomes archivally available on the NRAO website, though it still takes skill to analyze that data. I also plan on publishing the final images that I have so people can go back and re-analyze them, but that means some of the data won't be public until I publish my papers. That's just the best compromise that I've been able to come up with.

Check back soon for part two with Melodie, where we'll cover some of the problems with STEM education, the importance of professional boundaries, and exo-volcanism.

We're not astrophysicists, but we're fascinated by data of all kinds at Hex, where we're creating a platform that makes it easy to build and share interactive data products which can help teams be more impactful.