Data Comes from Users: Some Thoughts on Data Analysis
Yesterday I saw that caoz wrote "Data analysis this thing". It is well worth a careful read, and it struck a chord with me, so I want to write down my own views on data analysis here.
First, I can't pretend to be an expert in data analysis. There are many analysis algorithms I don't know and I can't use any statistical tools, so mostly I just stare at the numbers. But I really like reading all kinds of data. As an undergraduate I read hardware benchmarks all day long. In graduate school I read countless camera reviews, and then every week I pored over worldwide sales figures for game consoles and games. At work I especially love building statistics systems and looking at the data they produce. I wrote all of the company's current statistics code myself, and I spend nearly 30% of every working day studying the data. I can at least count as a hundred-percent data analysis enthusiast.
On data analysis, caoz has already put it very well. I can only add some of my own experience and impressions.
1, Whether the statistics are your own or someone else's, the first step is always to check how reliably the data was collected. If it is sampled data, look at the sampling method and think about what kinds of error it may introduce. If it is your own data, check whether the collection method itself is sound. For example, user behavior is generally tracked with a JS callback. If you are still computing those statistics from Apache logs, the results will not hold up.
2, Once you have the data, you need to build statistics on top of it. At this point you have to think about which statistics to build so that they best reveal the characteristics of the product and its users. Very often a single metric is not enough to describe what is going on, and you need to look at several together. Web search, for example, often looks at the CTR of the first result, the CTR of the top three results, the last click, and many other factors, and the analysis and judgment come from combining them.
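To make the point about combining several search metrics concrete, here is a minimal Python sketch. The record layout (`SearchRecord` with a list of clicked positions) and the exact metric definitions are my own assumptions for illustration, not how any particular search engine defines them.

```python
from dataclasses import dataclass
from typing import List, Optional, Dict


@dataclass
class SearchRecord:
    """Hypothetical log record for one search: the 1-based result
    positions the user clicked, in the order they were clicked."""
    clicks: List[int]


def search_metrics(records: List[SearchRecord]) -> Dict[str, Optional[float]]:
    """Compute several click metrics side by side for a list of searches."""
    total = len(records)
    if total == 0:
        return {"ctr_first": 0.0, "ctr_top3": 0.0,
                "no_click_rate": 0.0, "avg_last_click_pos": None}

    # Share of searches where the first result was clicked.
    first = sum(1 for r in records if 1 in r.clicks)
    # Share of searches with at least one click in the top three results.
    top3 = sum(1 for r in records if any(p <= 3 for p in r.clicks))
    # Share of searches with no click at all.
    no_click = sum(1 for r in records if not r.clicks)
    # Position of the last click, averaged over searches that had a click.
    last_positions = [r.clicks[-1] for r in records if r.clicks]
    avg_last = (sum(last_positions) / len(last_positions)
                if last_positions else None)

    return {
        "ctr_first": first / total,
        "ctr_top3": top3 / total,
        "no_click_rate": no_click / total,
        "avg_last_click_pos": avg_last,
    }


if __name__ == "__main__":
    sample = [SearchRecord([1]), SearchRecord([2, 5]),
              SearchRecord([]), SearchRecord([4])]
    print(search_metrics(sample))
```

No single number here tells the whole story. A high first-result CTR together with a deep average last click, for instance, would suggest that users click the first result but keep looking.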
3, Stay skeptical of the data, and especially of whether there is really a definite causal relationship between the data and the conclusion you are trying to reach. For example, does a high CTR on web search results mean the experience is good? Does a high RPM for search advertising mean the result is ideal?
4, The same raw data can often be counted in different ways, and if you choose the wrong one, the conclusions can differ widely. For example, to analyze how dependent a website is on search engines, should you count by PV, by session, or by UV? If a user visits several times in one day, sometimes arriving from a search engine and sometimes coming directly, how should that user be counted? There is a lot of depth to this.
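Here is a small Python sketch of how the choice between PV, session, and UV changes the answer. Everything in it is an assumption for illustration: the `PageView` layout, tagging each page view with its session's entry source, and the rule that a user counts as a search user if any of their sessions came from search.

```python
from typing import Dict, Iterable, NamedTuple


class PageView(NamedTuple):
    """Hypothetical page view record."""
    user: str
    session: str
    source: str  # entry source of the session: "search" or "direct"


def search_dependence(views: Iterable[PageView]) -> Dict[str, float]:
    """Share of traffic attributed to search, counted three different ways."""
    views = list(views)
    if not views:
        return {"pv": 0.0, "session": 0.0, "uv": 0.0}

    # By PV: every page view inside a search-sourced session counts.
    pv_share = sum(v.source == "search" for v in views) / len(views)

    # By session: each (user, session) pair counts once.
    sessions = {(v.user, v.session): v.source for v in views}
    session_share = (sum(s == "search" for s in sessions.values())
                     / len(sessions))

    # By UV: a user counts as a search user if ANY session came from search
    # (an assumed rule; other rules give other answers).
    users = {v.user for v in views}
    search_users = {v.user for v in views if v.source == "search"}
    uv_share = len(search_users) / len(users)

    return {"pv": pv_share, "session": session_share, "uv": uv_share}


if __name__ == "__main__":
    sample = (
        [PageView("A", "A-1", "search")]          # A arrives via search, 1 page
        + [PageView("A", "A-2", "direct")] * 5    # A returns directly, 5 pages
        + [PageView("B", "B-1", "direct")] * 2    # B comes directly, 2 pages
    )
    print(search_dependence(sample))
```

With this made-up sample, search accounts for 12.5% of page views, a third of sessions, and half of users, so the same log supports three quite different statements about search dependence.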
5, There is always a lot of noise in the data, and how you filter it out matters a great deal. Just as online polls attract vote-stuffing bots, some spiders will execute your statistics JS, and some user data arrives late. Without good filtering and handling, the reliability of the data drops sharply.
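As a rough illustration of noise filtering, the Python sketch below drops hits whose user agent looks like a crawler and hits that repeat too quickly from the same client. The `Hit` layout, the `BOT_MARKERS` list, and the one-second `min_gap` threshold are all hypothetical; a real filter would need rules tuned to its own traffic.

```python
from typing import Dict, Iterable, List, NamedTuple

# Hypothetical substrings that mark a crawler's user agent.
BOT_MARKERS = ("bot", "spider", "crawler")


class Hit(NamedTuple):
    """Hypothetical statistics hit."""
    client: str        # some client identifier, e.g. a cookie ID
    user_agent: str
    timestamp: float   # seconds


def filter_hits(hits: Iterable[Hit], min_gap: float = 1.0) -> List[Hit]:
    """Return hits in time order, without crawler hits and rapid repeats."""
    kept: List[Hit] = []
    last_seen: Dict[str, float] = {}
    for hit in sorted(hits, key=lambda h: h.timestamp):
        ua = hit.user_agent.lower()
        # Drop anything whose user agent matches a crawler marker.
        if any(marker in ua for marker in BOT_MARKERS):
            continue
        prev = last_seen.get(hit.client)
        # Record the time even for dropped hits, so a continuous burst
        # from one client stays suppressed until it pauses.
        last_seen[hit.client] = hit.timestamp
        if prev is not None and hit.timestamp - prev < min_gap:
            continue
        kept.append(hit)
    return kept


if __name__ == "__main__":
    sample = [
        Hit("a", "Mozilla/5.0", 0.0),
        Hit("a", "Mozilla/5.0", 0.2),     # rapid repeat, dropped
        Hit("b", "Googlebot/2.1", 0.5),   # crawler, dropped
        Hit("c", "Baiduspider", 1.0),     # crawler, dropped
        Hit("a", "Mozilla/5.0", 5.0),
    ]
    for hit in filter_hits(sample):
        print(hit)
```

User-agent matching only catches crawlers that identify themselves, so this is a first pass at best.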