How Computers Broke Science – and What We Can Do to Fix It

Ben Marwick, University of Washington

Reproducibility is one of the cornerstones of science. Made popular by British scientist Robert Boyle in the 1660s, the idea is that a discovery should be reproducible before being accepted as scientific knowledge.

In essence, you should be able to produce the same results I did if you follow the method I describe when announcing my discovery in a scholarly publication. For example, if researchers can reproduce the effectiveness of a new drug at treating a disease, that’s a good sign it could work for all sufferers of the disease. If not, we’re left wondering what accident or mistake produced the original favorable result, and would doubt the drug’s usefulness.

For most of the history of science, researchers have reported their methods in a way that enabled independent reproduction of their results. But, since the introduction of the personal computer – and the point-and-click software programs that have evolved to make it more user-friendly – reproducibility of much research has become questionable, if not impossible. Too much of the research process is now shrouded by the opaque use of computers that many researchers have come to depend on. This makes it almost impossible for an outsider to recreate their results.

Recently, several groups have proposed similar solutions to this problem. Together they would break scientific data out of the black box of unrecorded computer manipulations so independent readers can again critically assess and reproduce results. Researchers, the public, and science itself would benefit.

Computers wrangle the data, but also obscure it

Statistician Victoria Stodden has described the unique place personal computers hold in the history of science. They’re not just an instrument – like a telescope or microscope – that enables new research. The computer is revolutionary in a different way; it’s a tiny factory for producing all kinds of new “scopes” to see new patterns in scientific data.

It’s hard to find a modern researcher who works without a computer, even in fields that aren’t intensely quantitative. Ecologists use computers to simulate the effect of disasters on animal populations. Biologists use computers to search massive amounts of DNA data. Astronomers use computers to control vast arrays of telescopes, and then process the collected data. Oceanographers use computers to combine data from satellites, ships and buoys to predict global climates. Social scientists use computers to discover and predict the effects of policy or to analyze interview transcripts. Computers help researchers in almost every discipline identify what’s interesting within their data.

Computers also tend to be personal instruments. We typically have exclusive use of our own, and the files and folders it contains are generally considered a private space, hidden from public view. Preparing data, analyzing it, visualizing the results – these are tasks done on the computer, in private. Only at the very end of the pipeline comes a publicly visible journal article summarizing all the private tasks.

The problem is that most modern science is so complicated, and most journal articles so brief, it’s impossible for the article to include details of many important methods and decisions made by the researcher as he analyzed his data on his computer. How, then, can another researcher judge the reliability of the results, or reproduce the analysis?
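One form the proposed fix takes is to replace undocumented point-and-click steps with a script that records every decision, so the script itself can be shared alongside the paper. As a minimal sketch (the file name, column names and cleaning rules below are hypothetical), such a script might look like this:

```python
# A minimal sketch of a shareable, scripted analysis. The file, columns and
# cleaning rules are hypothetical; the point is that every decision is written
# down in code that anyone can rerun against the raw data.

import pandas as pd

# Load the raw data exactly as collected (never edited by hand).
raw = pd.read_csv("data/measurements_raw.csv")

# Decision 1: drop records flagged as instrument errors
# (documented here, not hidden in a spreadsheet edit).
clean = raw[raw["quality_flag"] != "instrument_error"]

# Decision 2: restrict the analysis to the 2014 field season.
clean = clean[clean["year"] == 2014]

# The summary statistic reported in the paper, recomputable by anyone
# with the raw file and this script.
print(clean.groupby("site")["measurement"].mean())
```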


Who Will Autonomous Cars Choose to Kill?

Now that Tesla has updated some car models with a semi-autonomous autopilot mode, and researchers at Stanford are programming a modified DeLorean to perform crazy evasive maneuvers, it looks like our automotive future will eventually be dominated by self-driving cars. Researchers at the Toulouse School of Economics are now applying an old philosophical problem to this new age of automobiles: who will these driverless cars decide to kill when harm is impossible to avoid?

An insightful article on the Popular Science website explains the philosophical dilemma and gets into the possible outcomes:

In philosophy, there’s an ethical question called the trolley problem. If you had to push one large person in front of a moving trolley to save a group of people on the tracks, would you? This abstract idea has taken hold in programming self-driving cars: what happens if it’s impossible to avoid everyone?

Researchers from the Toulouse School of Economics decided to see what the public would decide, and posed a series of questions to online survey-takers, including a situation where a car would either kill 10 people and save the driver, or swerve and kill the driver to save the group.

They found that more than 75 percent supported self-sacrifice of the passenger to save 10 people, and around 50 percent supported self-sacrifice when saving just one person. However, respondents didn’t actually think real cars would end up being programmed this way, and would probably save the passenger at all costs.

The researchers posed several questions in this vein to participants on Amazon Mechanical Turk, and found other interesting results, all detailed in the provocative and excellent article on the Popular Science website.
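To make the programming question concrete, here is a deliberately simplified toy sketch, invented purely for illustration (it is not from the study or from any carmaker): a crude “minimize casualties” rule with a single weighting parameter that decides how much the passenger’s life counts.

```python
# A deliberately simplified toy model of the dilemma, purely for illustration.
# Neither the Toulouse study nor any carmaker has published code like this; the
# passenger_weight parameter is an invented knob showing where the ethical choice
# would hide inside an algorithm.

def choose_action(actions, passenger_weight=1.0):
    """Pick the action with the lowest weighted expected fatalities.

    actions: dicts like {"name": "swerve", "pedestrian_deaths": 0, "passenger_deaths": 1}
    passenger_weight: 1.0 treats all lives equally; larger values privilege the passenger.
    """
    def cost(action):
        return action["pedestrian_deaths"] + passenger_weight * action["passenger_deaths"]
    return min(actions, key=cost)

scenario = [
    {"name": "stay_on_course", "pedestrian_deaths": 10, "passenger_deaths": 0},
    {"name": "swerve", "pedestrian_deaths": 0, "passenger_deaths": 1},
]

print(choose_action(scenario)["name"])                       # swerve: the utilitarian choice
print(choose_action(scenario, passenger_weight=20)["name"])  # stay_on_course: passenger protected
```

The survey is, in effect, asking the public what that weighting should be, and whether buyers would accept a car that does not privilege them.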

Source: PopSci.com – “Who Will Driverless Cars Decide to Kill?”

Featured Image Credit: Toulouse School of Economics

This is How Snowden’s Leaks Affected the EU’s Data Privacy Decision

Love him or hate him, it turns out that Edward Snowden’s leaks were at the root of the EU Court of Justice’s unappealable ruling that, in light of the leaked materials, US law does not provide sufficient safeguards for individual privacy, and that EU customer data therefore cannot be transferred to databases located on servers in the US.

An excellent and detailed article on WIRED’s website provides the details:

A ruling by the European Union’s highest court today may create enormous headaches for US tech companies like Google and Facebook. But it could also provide more robust privacy protections for European citizens. And they all have Edward Snowden to thank—or blame.

Up until now, these companies have been able to transfer data they collect from users in the European Union to servers in the US, a practice made possible by the EU’s executive branch’s so-called “Safe Harbor Decision” in 2000. Today, the Court of Justice of the European Union ruled that the Safe Harbor Decision was invalid. The ruling cannot be appealed.

Now tech companies have to figure out what the ruling means. Facebook and other companies haven’t been found guilty of any wrongdoing. But quashing the Safe Harbor Decision could open the floodgates to privacy investigations and lawsuits.

Where does Snowden fit in?

The Safe Harbor Decision held that the US provided adequate safeguards for personal information and that no company transferring data from the EU to the US would be prosecuted for doing so. That determination was overruled today as a result of a legal complaint filed against Facebook in Ireland by Austrian activist Maximilian Schrems. Schrems argued that, based on information about the National Security Agency’s practices leaked by Edward Snowden in 2013, the US does not actually provide sufficient protection of private data and that Facebook therefore acted illegally by transferring his private data to its servers in the US.

This ruling will undoubtedly create complex headaches for companies like Google, Facebook, LinkedIn, and likely even Amazon, which will now have to keep EU customers’ data on servers located within the EU, a technical challenge if not quite a nightmare. Many companies automatically replicate their data to locations around the world, both for service performance and for redundancy that mitigates downtime and data loss when individual servers fail.
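As a rough illustration of what segregating storage by region involves, here is a hypothetical sketch of a replication policy that pins EU customers’ records to EU data centers; the region names and the policy itself are invented for illustration and are not any company’s actual configuration.

```python
# Hypothetical sketch of region-pinned replication, invented for illustration.
# Real systems are far more involved, but the core idea is the same: the set of
# candidate replica locations is filtered by the customer's legal region before
# any copy of their data is made.

EU_REGIONS = {"eu-central", "eu-north", "eu-west"}
ALL_REGIONS = EU_REGIONS | {"ap-southeast", "us-east", "us-west"}

def replica_targets(customer_region: str, desired_copies: int = 3) -> list[str]:
    """Return the regions allowed to hold copies of this customer's data."""
    allowed = EU_REGIONS if customer_region == "EU" else ALL_REGIONS
    # Keep the redundancy, but only among legally permitted regions.
    return sorted(allowed)[:desired_copies]

print(replica_targets("EU"))   # copies stay in EU data centers
print(replica_targets("US"))   # any region is permitted; picks the first three alphabetically
```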

There are more details on the decision and what it means for online service companies in the full article on the WIRED website.


Source: WIRED.com – “Tech Companies Can Blame Snowden for Data Privacy Decision”

Featured Photo Credit: PLATON

Ever Wondered How Many Lines of Code Google is? Here’s the Startling Answer

It’s probably not something you think about as you type a query into Google’s search box, but it’s an interesting question: how many lines of code is “Google”?

Remarkably, the company has disclosed that the vast majority of its code is stored in a single repository called “Piper,” which all of the company’s coders can access.

An amazing article from WIRED dives in to the details:

Google’s Rachel Potvin came pretty close to an answer Monday at an engineering conference in Silicon Valley. She estimates that the software needed to run all of Google’s Internet services—from Google Search to Gmail to Google Maps—spans some 2 billion lines of code. By comparison, Microsoft’s Windows operating system—one of the most complex software tools ever built for a single computer, a project under development since the 1980s—is likely in the realm of 50 million lines.

So, building Google is roughly the equivalent of building the Windows operating system 40 times over.

The comparison is more apt than you might think. Much like the code that underpins Windows, the 2 billion lines that drive Google are one thing. They drive Google Search, Google Maps, Google Docs, Google+, Google Calendar, Gmail, YouTube, and every other Google Internet service, and yet, all 2 billion lines sit in a single code repository available to all 25,000 Google engineers.

All in all, it may be the single largest code repository in the world, according to Ms. Potvin.

To manage this staggering mass of code, Google has invented its own version control system. The article continues:

Basically, Google has built its own “version control system” for juggling all this code. The system is called Piper, and it runs across the vast online infrastructure Google has built to run all its online services. According to Potvin, the system spans 10 different Google data centers.

It’s not just that all 2 billion lines of code sit inside a single system available to just about every engineer inside the company. It’s that this system gives Google engineers an unusual freedom to use and combine code from across myriad projects. “When you start a new project,” Potvin tells WIRED, “you have a wealth of libraries already available to you. Almost everything has already been done.” What’s more, engineers can make a single code change and instantly deploy it across all Google services. In updating one thing, they can update everything.

Although Piper contains virtually all of Google’s code, there are still some projects that are only accessible to select programmers, Potvin says. These include the PageRank search algorithm, as well as the Android and Chrome projects.
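To make the “update one thing, update everything” idea concrete, here is a toy sketch, invented purely for illustration (it is nothing like Google’s actual code): several “services” share one library function, so a single edit to that function changes all of them at the next build.

```python
# Toy illustration (invented, not Google's code) of why one shared repository helps:
# every "service" imports the same library function, so fixing or improving that
# function once changes the behavior of every service that uses it.

# --- shared library (imagine it living at shared/textutil.py in the repository) ---
def normalize_query(q: str) -> str:
    """Lowercase a query and collapse repeated whitespace."""
    return " ".join(q.lower().split())

# --- two services reusing it (imagine search/frontend.py and mail/filters.py) ---
def search_service(query: str) -> str:
    return f"searching for {normalize_query(query)!r}"

def mail_service(query: str) -> str:
    return f"filtering mail by {normalize_query(query)!r}"

print(search_service("  New   York "))
print(mail_service("  New   York "))
# A single edit to normalize_query changes both services at once, which is the
# "update one thing, update everything" property described above.
```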

Even without those projects, Piper contains a stunning 85 terabytes of data and manages over 45,000 code changes made by Google’s 25,000 programmers every single day. If you’re not impressed by now, then we recommend that you head over to YouTube (yep, it’s a Google product) and type “cat videos” in the search box. You will be entertained, we promise, without having to think about the insane technology you just used.
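Cat videos aside, a quick back-of-envelope calculation, using only the figures quoted in the article, puts those numbers in perspective:

```python
# Back-of-envelope check of the scale figures quoted in the article.

google_loc = 2_000_000_000   # ~2 billion lines of code at Google
windows_loc = 50_000_000     # ~50 million lines of code in Windows
changes_per_day = 45_000     # code changes managed by Piper each day
engineers = 25_000           # Google programmers committing those changes

print(google_loc / windows_loc)      # 40.0 -> roughly 40 Windows-sized codebases
print(changes_per_day / engineers)   # 1.8  -> almost two code changes per engineer per day
```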

For more amazing details about Google’s Piper, and about an open source project now in progress to make a similar code repository available to everyone, head on over to WIRED to read the rest of the story.


Source: WIRED.com – “Google Is 2 Billion Lines of Code—And It’s All in One Place”

Featured Image Credit: Getty Images (from the original article)

Incredible New Technology Scans Your Brain and Blocks Notifications

Ever had a hard time concentrating because your devices keep interrupting you with notifications? According to a very cool article over on Popular Science, researchers from Tufts University are working on an app for that!

Yes, seriously:

In the time between when you start and finish reading this article, you might check various social media notifications, gaze at your texts, maybe read another few paragraphs of that article on potatoes you meant to read last night. You might think you’re multitasking when you do that, but your brain is actually just switching quickly between tasks, and that means that you’re probably doing all of them pretty poorly. Now computer scientists from Tufts University are developing a system that detects your brain waves and, if your mind is busy, the software can quiet the frenetic beeping of your devices so you can actually concentrate, according to the New Scientist.

The project is called “Phylter” and it uses functional near-infrared spectroscopy (fNIRS), a measurement of how the blood flows in the brain. It works by attaching a monitor to the user’s forehead with a band, which shoots beams of light into the brain…

The article goes on to describe how Phylter uses machine learning algorithms to figure out which electronic interruptions are actually important to you, letting those come through while blocking the others. Check out the full article and links to the research over on the Popular Science website.
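As a rough sketch of the idea (this is not Phylter’s actual code; the thresholds, scores and gating rule below are invented), a notification gate of this kind might look something like this:

```python
# Invented sketch of a Phylter-style notification gate; not the actual system.
# Assume some upstream component turns the fNIRS signal into a 0-1 "busy" score,
# and that a learned model assigns each notification a 0-1 importance estimate.

from dataclasses import dataclass

@dataclass
class Notification:
    source: str
    importance: float  # 0 = trivial, 1 = critical (hypothetical learned score)

def should_deliver(busy_score: float, note: Notification, threshold: float = 0.7) -> bool:
    """Deliver now only if the user is idle enough, or the message important enough."""
    if busy_score < threshold:
        return True                      # user isn't deep in a task; let it through
    return note.importance > busy_score  # interrupt deep focus only for something more important

print(should_deliver(0.9, Notification("group chat", 0.2)))    # False: hold it for later
print(should_deliver(0.9, Notification("pager alert", 0.95)))  # True: worth the interruption
print(should_deliver(0.3, Notification("newsletter", 0.1)))    # True: user isn't busy anyway
```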

Source: PopSci.com – “Brain-Scanning Software Blocks Your Notifications While You’re Busy”

Photo Credit: CDC/ Amanda Mills acquired from Public Health Image Library via freestockphotos.biz