Eyetracking research and form design

In recent years, a number of eyetracking studies on form design have captured popular attention in the web community. These include a study by Matteo Penzo on label placement and a more recent study by cxPartners.

Not only have these studies been widely circulated on the Internet but, in the case of Penzo’s study, it has formed the basis of some parts of Luke Wroblewski’s popular book “Web Form Design: Filling in the Blanks”.

We think it’s great that the unique design challenge forms represent is getting more attention from the web and user experience community. However, we are a little unsettled about the increasing use of these articles as the basis for best practice. Our concern stems from what we see as fairly major flaws in the methodology that these and similar eyetracking research studies contain.

Two methodological problems lie behind much past forms-related eyetracking research

In many cases, eyetracking studies that have examined different options for the design of forms have suffered from two main shortcomings:

  • insufficient recognition that seeing is not equivalent to attention; and
  • drawing inferences from an inadequate sample.

Seeing does not equate to attention

It is one thing to know that someone has directed their gaze in a particular place. It is another thing entirely to know what they were attending to — or thinking — at that time.

Zimmerman estimates that at any one time, the eyes take in 10,000,000 bits per second of information, yet we pay conscious attention to only 40 bits per second [1]. That’s 40 bits out of 10 million, or attention going to only 0.0004% of what we see.

If you’re not convinced about this phenomenon, try to remember what colour shirt the person you share an office with was wearing yesterday, or even the colour of their eyes. You probably look at both things many times in a working day, but you don’t necessarily attend to them.

The implication for eyetracking research is that such studies give us only part of the picture of what’s going on when someone interacts with a form. To draw truly informed conclusions, we need to supplement this picture with information from other sources. This might include error and task analysis of the completed forms and/or probing the participant, using protocols such as concurrent “think aloud” or retrospective discussion.

An adequate sample is a prerequisite for drawing inferences

Our second and equally significant concern relates to the design of the samples used in many of these eyetracking studies.

As an example, the cxPartners study involved only 8 participants: 6 female and 2 male, all of whom were in their 20s or 30s and reasonably web savvy. Without considering any other aspects of the study’s design, this is enough to make a statistician break out into a cold sweat.

The statistician reacts this way because the sample used by cxPartners is highly likely to have been skewed. By skewed we mean that the sample probably doesn’t accurately reflect the greater web-form-filling population. At the very least, it would have been preferable to have included both younger and older participants, not to mention more males.

Furthermore, the sample size — 8 people — is so small that it is likely to be highly influenced by the nature of the particular 8 participants that were involved. Pick a different 8 people and there is a good chance that the findings from the research would be very different.
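To make this concrete, here is a minimal sketch in Python. It assumes a made-up population of 10,000 form completion times (the log-normal shape and its parameters are our own illustrative assumptions, not data from any of the studies discussed here). It then compares how much the sample average moves around when we repeatedly pick 8 people versus 30 people.

```python
import random
import statistics


def spread_of_sample_means(population, sample_size, n_samples, rng):
    """Standard deviation of the means of repeated random samples."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: form completion times in seconds (made-up values).
population = [rng.lognormvariate(4.0, 0.5) for _ in range(10_000)]

# Draw 2,000 samples of each size and see how much the averages vary.
spread_8 = spread_of_sample_means(population, 8, 2_000, rng)
spread_30 = spread_of_sample_means(population, 30, 2_000, rng)

print(f"spread of sample means, n=8:  {spread_8:.1f} s")
print(f"spread of sample means, n=30: {spread_30:.1f} s")
```

Under these assumptions, the averages from samples of 8 should vary roughly twice as much as those from samples of 30, because the spread shrinks with the square root of the sample size. A different 8 people really can tell a noticeably different story.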

This is why 30 is the recommended minimum sample size for any study from which inferences for a general population are to be drawn [2]. While there’s a lot more to designing a good sample than having a minimum of 30 participants, this will at least get you into the space where you might be able to calculate statistical significance.

Statistical significance is about knowing which differences are likely to be due to just the particular sample that was selected as opposed to reflecting a true difference in the underlying population. With a small sample size, we cannot calculate statistical significance and thus have no real indication of the reliability of our findings.

(For more on statistical significance and user research, see Caroline Jarrett’s recent article on Usability News titled “Statistically significant usability testing”.)

Being transparent about sample design is important

One thing cxPartners did well in their article was to describe the sample that formed the basis of their research. Providing this information empowers readers to make their own judgement about how to use the findings presented.

Conversely, Matteo Penzo’s article doesn’t give many specifics about the design of his sample. He says that the sample included both expert users—primarily designers and programmers, but also some usability experts—and novice users. But we are not given any more detail nor told how many participants there were. One hopes, given the immense popularity of his article, that Penzo’s sample was both representative and large.

Better not to report at all?

To be fair to the team at cxPartners, their eyetracking forms article did begin with a note about the potential invalidity of the study. Isn’t it enough that readers were duly warned? Unfortunately, we think not.

It is our impression that web designers and developers are hungry for guidelines based on research. This hunger is a great thing: it means we all want to know more and create the best sites we can. However, it also means that readers are likely to latch on to the findings of a study and pay little regard to the caveats regarding methodology that are placed around it. This is just human nature. We can work with a guideline; we need a guideline. The perhaps-flimsy basis behind the guideline is just all too often seen as the spoil-sport at the party and pushed to one side.

So what should researchers do with findings based on an inadequate sample? Perhaps controversially, we suggest that rather than report findings with caveats around them, it may be better to not report such findings at all. That way widespread inappropriate use can be prevented.

This is a hard position for many people to accept. Surely it is better to have some findings than no findings?

The problem is that the “some” findings may be pointing in completely the wrong direction. If we have no data, there’s nothing to suggest one course of action is better than another. But if we have bad data, it can lead us astray, all the while with a false sense of confidence in our decision because, after all, it is based on research findings.

We raise these issues to help progress the field

We did not write this article to embarrass or shame anyone, nor to discourage people from doing forms research. We know from direct experience how unbelievably hard it is to design a sound research study.

Moreover, we think both Matteo Penzo and cxPartners should be congratulated for taking the (not insignificant) time and effort to do some research and share their findings with the community. A lot of people make demands of such individuals (“Why didn’t you do X?”, “What would have happened if you had tested Y?” and so on), but very few people take up the gauntlet and run such studies themselves.

Having said that, what we would like to see in the future is for the web community to have a higher awareness of what makes for quality research, and to approach published studies with a more critical eye. Formulate has been, and will continue to be, on the receiving end of such critique (see, for example, the comments on our recent research article on A List Apart), but as long as it is informed and considered, we believe this can only help to advance the field.

In the end we hope the web industry will recognise the importance of the sort of rigour that has been commonplace for decades in other fields such as psychology and social research. Not only will this lead to better design decisions, but we believe it will help the industry mature, in turn generating respect for the web as a serious vehicle for communication, transaction and information.

Postscript: 1 July 2009

Since writing this article we have been contacted by cxPartners. They have explained that the guidelines presented in their article were based not only on the eyetracking study that forms the basis of the post, but also on a number of other studies that they have conducted for clients in the past (which they can’t share for commercial confidentiality reasons).

The guidelines recommended by practitioners with many years’ experience, like the team at cxPartners, are definitely worth taking into consideration. Indeed, a lot of the content on the Formulate website has just such a source (although we try to provide independent third party verification where possible).

What still holds from our original article, though, is the importance of being clear about what data the recommendations are based on, and of making sure that eyetracking studies use well-designed samples.

[1] Zimmerman, M. (1989) “The Nervous System in the Context of Information Theory”. In Zimmerman, M. Schmidt, R. F. & Thews, G. (eds) Human physiology pp. 166-173.

[2] This minimum of 30 can be found in almost any statistics or sampling textbook, e.g. Howell, D.C. (1982) Statistical Methods for Psychology p. 149. The number comes from the fact that given a large population, the greater the sample size, the closer the distribution of means from samples of that size comes to approximating the normal distribution. This in turn makes various sample estimates — including statistical significance — valid (provided some other conditions also hold, but we won’t go into that here!).
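The convergence described in note [2] can be illustrated with a small simulation. This is only a sketch: the population is hypothetical (deliberately lopsided, exponentially distributed values of our own choosing), and skewness is used here as one rough indicator of how far a distribution is from the symmetric normal shape.

```python
import random
import statistics


def skewness(values):
    """Mean cubed deviation divided by the cubed standard deviation.

    A value near 0 indicates a symmetric distribution, as the normal is.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def sample_means(population, sample_size, n_samples, rng):
    """Means of repeated random samples of a given size."""
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical, strongly skewed population (made-up values).
population = [rng.expovariate(1.0) for _ in range(20_000)]

# Skewness of the distribution of sample means, for three sample sizes.
skew_by_size = {
    n: skewness(sample_means(population, n, 5_000, rng))
    for n in (2, 8, 30)
}

for n, skew in skew_by_size.items():
    print(f"n={n:>2}: skewness of sample means = {skew:.2f}")
```

Under these assumptions, the skewness should fall steadily as the sample size grows, which is the sense in which the distribution of sample means comes to approximate the normal distribution. Note that it does not become exactly normal at 30, so the figure is best read as a rule of thumb.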