What data do I need to do ML and AI well?

Comments

  • To do machine learning and AI well, you generally need a large and diverse set of high-quality data relevant to the problem you are trying to solve. The quality of your data plays a critical role in determining the performance and accuracy of your machine-learning models.

    The type of data you need depends on the specific problem you are trying to solve. If you upload a data set to Symon.AI for a quarterly financial forecast, you'll need a large set of clean, clearly labelled data to analyze. You can use data prep and cleaning tools to ensure it's ready.

    Here are some key characteristics of good data for machine learning and AI:

    1. High-quality: Data should be accurate, consistent, and free from errors or biases.
    2. Large and diverse: A large and varied data set can help ensure that your models are robust and can handle various scenarios.
    3. Relevant: Data should be relevant to the problem you are trying to solve. Irrelevant data leads to inaccurate or unreliable results.
    4. Labelled: Labelled data (data annotated with the correct answers) is especially valuable for supervised learning tasks, as it allows the model to learn from examples.
    5. Balanced: If your data is imbalanced (e.g., you have many more examples of one class than another), it can lead to biased models that perform poorly on certain types of data (a quick balance check is sketched just after this list).
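
    For the balance point in particular, here is a minimal sketch of a quick check, assuming a pandas DataFrame with a hypothetical label column:

    ```python
    import pandas as pd

    df = pd.DataFrame({"label": ["churn", "stay", "stay", "stay", "stay", "stay"]})

    # Class proportions: a heavily skewed split (e.g. 95%/5%) usually calls for
    # resampling, class weights, or a metric other than plain accuracy.
    print(df["label"].value_counts(normalize=True))
    ```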

    Overall, understanding the data you need and how to collect and prepare it is crucial for success in machine learning and AI.

  • You will need exactly the kind of data you would have had on hand at the time the prediction would have been made. This may sound a bit confusing, but remember that the point of predictive models is to learn from past observations to predict the future.

    The way they do this is that we give them snapshots of what an event looked like, from a data perspective, at the time we would have tried to make the prediction. It is very important to define a few dates in the data collection process:

    1. The date up to which data can be collected. Note this may not be the date we make predictions: depending on the availability of fresh data, we may only know things that happened, say, 3 days before the prediction date. For example, website visit data may need to be cleaned and privacy-sanitized by a third party before your company can use it in analytics work, which takes extra time. This matters because in production, i.e. when the model is being used to inform business decisions, the last 3 days of web visits will not be available, so any patterns your model learned from that very recent data will be missing, which will drastically reduce performance. (It is important here to speak to your data department and understand the data lag in your organization.)
    2. The date window in which you want to predict an event occurring, for example 7-14 days after the prediction date. You may not always be interested in things that happen immediately after the prediction date, since at that point it may be too late to intervene and change the outcome. For example, if you know a cell phone customer will cancel by switching to another provider tomorrow, you do not have time to get them a loyalty offer to prevent it. (It is important to understand the time it takes to execute on a strategy at your company. It helps if the strategy is pre-approved and only the individuals being contacted are decided at a later date.)
    3. Now gather data for the population at a number of different "snapshot" (prediction) dates, keeping the data lag in mind, then attach to each snapshot the truth of what happened in the 7-14 day window afterwards. Having samples from a variety of dates helps you check whether the model makes good predictions across time. (A minimal sketch of this snapshot-building step follows this list.)
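
    To make this concrete, here is a minimal sketch of the snapshot-building step in Python with pandas. The column names (customer_id, event_date, event_type) are hypothetical placeholders, and the 3-day data lag and 7-14 day label window follow the examples above:

    ```python
    import pandas as pd

    DATA_LAG = pd.Timedelta(days=3)      # freshest data actually available at prediction time
    LABEL_START = pd.Timedelta(days=7)   # label window: 7-14 days after the snapshot
    LABEL_END = pd.Timedelta(days=14)

    def build_snapshot(events: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
        """Build one training snapshot: features from data available before the
        snapshot (respecting the data lag), labels from the 7-14 day window after it."""
        cutoff = snapshot_date - DATA_LAG

        # Features: only events we would genuinely have had at prediction time.
        history = events[events["event_date"] <= cutoff]
        features = (history.groupby("customer_id")
                           .agg(n_visits=("event_date", "size"),
                                last_visit=("event_date", "max"))
                           .reset_index())
        features["days_since_last_visit"] = (snapshot_date - features["last_visit"]).dt.days
        features = features.drop(columns="last_visit")
        features["snapshot_date"] = snapshot_date

        # Labels: did the event of interest (e.g. churn) occur 7-14 days after the snapshot?
        window = events[(events["event_date"] >= snapshot_date + LABEL_START) &
                        (events["event_date"] <= snapshot_date + LABEL_END)]
        churners = set(window.loc[window["event_type"] == "churn", "customer_id"])
        features["label"] = features["customer_id"].isin(churners).astype(int)
        return features

    # Several snapshot dates give the model a variety of samples across time.
    snapshot_dates = pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"])
    # training_data = pd.concat([build_snapshot(events, d) for d in snapshot_dates])
    ```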

    It is incredibly important overall that no "future facing data" be allowed into the training data, i.e. information that only became available after the snapshot date. Let's say we were trying to predict survival rates for patients on a given drug. When building our training data we may be looking at events that occurred last year; if we include the satisfaction surveys those patients were asked to fill out 1 month after the treatment, we would have a very strong indicator of survival, since only a successful treatment would allow a patient to fill out a survey. However, if today I want to predict survival rates for a new group of patients, I do not know who will fill out a survey after the treatment, so I would have trained my model on information I do not have.
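
    As a minimal sketch of a guard against this, assuming the raw events used for features keep their timestamps (column names are hypothetical):

    ```python
    import pandas as pd

    def assert_no_future_data(feature_events: pd.DataFrame, snapshot_date: pd.Timestamp) -> None:
        """Fail loudly if any event used to build features happened after the snapshot date."""
        leaked = feature_events[feature_events["event_date"] > snapshot_date]
        if not leaked.empty:
            raise ValueError(
                f"{len(leaked)} feature rows are dated after the snapshot {snapshot_date:%Y-%m-%d}; "
                "they would leak future information into training."
            )
    ```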


    That is the most important aspect of making a good prediction.

    Here are some more things that will help:

    Ensure your data is as clean as possible, filling in null values where you know the cause of the null (e.g. a null sometimes means no interactions, i.e. 0).
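
    As a minimal sketch, assuming a pandas DataFrame where a null visit count really does mean zero visits (column names are hypothetical):

    ```python
    import pandas as pd

    df = pd.DataFrame({"customer_id": [1, 2, 3],
                       "n_visits": [5, None, 2]})

    # Here a missing visit count means the customer never visited, so 0 is the true value.
    # Nulls whose cause you do not understand should not be blindly filled this way.
    df["n_visits"] = df["n_visits"].fillna(0)
    ```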

    Ensure your data does not have any columns with too many unique text values (high-cardinality categorical columns). Most models have a hard time dealing with these, and while we have a number of algorithms on the back end to simplify them and help the models out, domain knowledge is always useful for grouping values sensibly.
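
    A quick way to spot such columns, as a sketch with pandas (the 50-value threshold is just an assumed rule of thumb):

    ```python
    import pandas as pd

    def high_cardinality_columns(df: pd.DataFrame, max_unique: int = 50) -> list[str]:
        """Return text columns whose number of unique values exceeds the threshold."""
        text_cols = df.select_dtypes(include=["object", "string"]).columns
        return [col for col in text_cols if df[col].nunique() > max_unique]
    ```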

    If you have dates in your input data, such as "Date Customer Acquired" or "Date of Last Bill Payment", transform them into relative values, like "Tenure of Customer" or "Days Since Last Bill Payment". These relative values are more general and will allow the model to stay accurate in the future.
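
    As a minimal sketch with pandas (the column names are the hypothetical ones from the examples above):

    ```python
    import pandas as pd

    snapshot_date = pd.Timestamp("2023-03-01")
    df = pd.DataFrame({"date_customer_acquired": pd.to_datetime(["2020-06-15", "2022-11-30"]),
                       "date_of_last_bill_payment": pd.to_datetime(["2023-02-20", "2023-01-05"])})

    # Absolute dates tie the model to a specific calendar period; relative values generalize.
    df["tenure_days"] = (snapshot_date - df["date_customer_acquired"]).dt.days
    df["days_since_last_bill_payment"] = (snapshot_date - df["date_of_last_bill_payment"]).dt.days
    df = df.drop(columns=["date_customer_acquired", "date_of_last_bill_payment"])
    ```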

    Ensure you are not giving the model any sensitive data you do not want to be part of the decision-making process (e.g. age, sex, gender, ethnicity, etc.). Note that it may still be useful to keep this information aside in order to profile the model results by it. A model unbiased by a particular attribute would have roughly the same distribution of predictions across all values of that attribute; if, for example, a model seems to always predict "buy" for younger customers and "not buy" for older ones, then your model has found proxy data from which it can infer age.
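
    As a minimal sketch of that profiling step, assuming you have kept the sensitive attribute in a held-out column and collected the model's predictions (both columns are hypothetical):

    ```python
    import pandas as pd

    results = pd.DataFrame({
        "age_band": ["18-30", "18-30", "31-50", "51+", "51+"],   # kept aside, never fed to the model
        "predicted_buy": [1, 1, 0, 0, 0],                        # model output per customer
    })

    # If the prediction rate differs sharply across age bands, the model has likely
    # found proxy features for age even though age itself was excluded.
    print(results.groupby("age_band")["predicted_buy"].mean())
    ```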