Laboratory of Information and Data Analysis
Statistics and Multivariate Analysis
There is a great deal of medical information in journals and books. Statements such as “If one does X, Y will result” or “If one does X, Y will not result” are common, where X is perhaps a kind of food or exercise and Y is an illness or symptom. Of course, some of these statements are likely to be false, and we wish to determine whether the given information is actually correct.
As an extreme example, consider the Japanese saying: “If the wind blows, the bucket makers profit” (that is, any event can have unexpected effects). To test it, we must examine the profits of bucket makers after a windy day. Strictly speaking, the saying implies that some time may pass before the profit appears, but checking this would require a time series analysis and would complicate our example. Of course, a single day does not suffice: we must record profits after every windy day, in every season, for several years (days on which we forget to record them become missing values). Nor is it sufficient to check a single bucket maker; ideally we would check the profits of every bucket maker. This is infeasible, so we select an appropriate sample. Although it would be convenient, selecting only bucket makers that publish their profits on the internet is not sufficient. We need a sample that is not biased in terms of location, size, kind of business, and so on, so we use random (or stratified) sampling. We must also check profits on days that are not windy. After such exhaustive efforts, perhaps we can make a statement such as “It appears that profits increase on windy days compared to non-windy days” or, in some situations, “There appears to be little difference in profits regardless of the wind.”
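The procedure above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a full analysis: the function names `stratified_sample` and `mean_profit_by_wind`, the three regions, and all of the numbers are made up for the example, and a missing value is represented by `None`.

```python
import random
from statistics import mean


def stratified_sample(makers, key, per_stratum, rng):
    """Randomly pick up to `per_stratum` makers from each stratum (e.g. region)."""
    strata = {}
    for maker in makers:
        strata.setdefault(maker[key], []).append(maker)
    sample = []
    for name in sorted(strata):  # sorted so the order is reproducible
        members = strata[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def mean_profit_by_wind(records):
    """Return (mean windy profit, mean non-windy profit), skipping missing values."""
    windy = [r["profit"] for r in records if r["windy"] and r["profit"] is not None]
    calm = [r["profit"] for r in records if not r["windy"] and r["profit"] is not None]
    return mean(windy), mean(calm)


# Made-up population: 30 bucket makers spread over three hypothetical regions.
makers = [{"id": i, "region": ["north", "south", "west"][i % 3]} for i in range(30)]
sample = stratified_sample(makers, "region", per_stratum=2, rng=random.Random(0))

# Made-up daily records; None marks a day on which the profit was not recorded.
records = [
    {"windy": True, "profit": 120},
    {"windy": True, "profit": None},
    {"windy": True, "profit": 100},
    {"windy": False, "profit": 90},
    {"windy": False, "profit": 110},
]
windy_mean, calm_mean = mean_profit_by_wind(records)
print(len(sample), windy_mean, calm_mean)
```

Comparing two means is only a first look. Whether a difference of this size is more than chance, and whether the profit arrives with a delay, are separate questions that this sketch does not address.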
A similar effort is needed to check the medical information described above. In the 20th century, comparatively small amounts of data were gathered, and calculators and charts were the main tools of analysis. We are now surrounded by so-called “big data”: massive amounts of data can be gathered easily, and it is natural to use software for statistics and visualization. The data is no longer only numeric but also includes pictures and natural language. While new techniques are constantly being developed, the underlying principles of data analysis remain unchanged.