Categorical Feature Encoding in SAS (Bayesian Encoders)
Michael Dixon
Managing Director
Michael is the original platform nerd (his words, not ours). He’s spent the past 30 years immersed in the world of SAS and enterprise analytics — breaking things, fixing them better, and helping organisations do far more with their data than they thought possible. As the founder of Selerity, he brings a rare blend of deep technical knowledge, commercial pragmatism, and a dry sense of humour to every client conversation.
What is Bayesian Encoding?
Bayesian Encoding is a type of encoding that takes into account intra-category variation and the target mean when encoding categorical variables. It is a form of target encoding and comes with several advantages. For example, it requires little effort compared with other encoding methods, because each category is simply replaced by one number computed from the target.
This blog post covers three Bayesian encoding techniques and how they work.
1. Target/Mean Encoding
Target or Mean Encoding is one of the most commonly used encoding techniques in Kaggle competitions.
In target encoding, each class of the categorical variable is replaced by the mean of the target variable for that class, computed on the training dataset.
Hence, we have to specify the target variable in the SAS Mean Encoding Macro, as shown in the code below.
Check out this link to learn more about categorical variable encoding.
SAS Macro for Target/Mean Encoding
%macro mean_encoding(dataset,var,target);
/* Per-category mean of the target, rounded to one decimal place */
proc sql;
create table mean_table as
select &var as gr, round(mean(&target),0.1) as mean_encode
from &dataset
group by &var;
/* Attach the encoded value to every row of the input dataset */
create table new as
select d.* , m.mean_encode
from &dataset as d
left join mean_table as m
on d.&var=m.gr;
quit;
%mend;
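To make the idea concrete, here is a minimal, self-contained sketch of the same computation on a toy dataset. The dataset work.toy and the variables city and bought are hypothetical names invented for this illustration, and the unrounded mean is kept so the numbers are easy to check by hand.

```sas
/* Hypothetical toy data: city is the categorical variable, bought is the 0/1 target */
data work.toy;
    input city $ bought;
    datalines;
A 1
A 1
A 1
A 0
A 0
B 1
B 0
B 0
B 0
B 0
;
run;

/* Compute the per-city mean of the target and join it back onto every row */
proc sql;
    create table work.toy_encoded as
    select d.*, m.mean_encode
    from work.toy as d
    left join
        (select city, mean(bought) as mean_encode
         from work.toy
         group by city) as m
    on d.city = m.city;
quit;

proc print data=work.toy_encoded noobs;
run;
```

City A has three 1s in five rows, so every A row is encoded as 0.6. City B has one 1 in five rows, so every B row is encoded as 0.2.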
2. Weight of Evidence Encoding
Weight of Evidence (WoE) is a measure of the “strength” of a grouping technique that is used to separate good and bad outcomes. The method was developed primarily to build predictive models that evaluate the risk of loan default in the credit and financial industry.
WoE is 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the ratio is less than 1 and the WoE is negative. If, on the other hand, P(Goods) > P(Bads) in a group, the WoE is positive.
WoE is well suited to Logistic Regression because the logit transformation is simply the log of the odds, i.e. ln(P(Goods)/P(Bads)). By using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded on the same scale, so the parameters in the linear logistic regression equation can be directly compared.
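The sign behaviour described above can be checked with a short DATA step. This is only a sketch: the dataset name work.woe_demo and the three P(Goods) values are made up for illustration.

```sas
/* Hypothetical P(Goods) values for three groups */
data work.woe_demo;
    input p_good;
    p_bad = 1 - p_good;          /* P(Bads) is the complement of P(Goods) */
    woe = log(p_good / p_bad);   /* LOG is the natural logarithm in SAS */
    datalines;
0.2
0.5
0.8
;
run;

proc print data=work.woe_demo noobs;
run;
```

A group with P(Goods) = 0.2 gets a WoE of roughly -1.386, a group with P(Goods) = 0.5 gets 0, and a group with P(Goods) = 0.8 gets roughly +1.386.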
SAS Macro for Weight of Evidence Encoding
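%macro woe_encoding(dataset,var,target);
/* Per-category mean of the target, treated as P(Goods) */
proc sql noprint;
create table stats as
select &var as gr, round(mean(&target),0.1) as mean_encode
from &dataset
group by &var;
quit;
/* WoE = log(P(Goods)/P(Bads)); an exact 0 is replaced by 0.0001 */
data stats;
set stats;
bad_prob=1-mean_encode;
if bad_prob=0 then bad_prob=0.0001;
/* Guard against log(0) when a category has no target=1 rows */
if mean_encode=0 then mean_encode=0.0001;
me_by_bp=mean_encode/bad_prob;
woe_encode=log(me_by_bp);
run;
/* Attach the WoE value to every row of the input dataset */
proc sql noprint;
create table new as
select d.* , s.woe_encode
from &dataset as d
left join stats as s
on d.&var=s.gr;
quit;
%mend;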
3. Probability Ratio Encoding
Probability Ratio Encoding is similar to Weight of Evidence; the only difference is that the ratio of the good and bad probabilities is used directly, without taking the log. For each label, we calculate the mean of the target, that is, the probability of the target being 1 ( P(1) ), and also the probability of the target being 0 ( P(0) ). Then we calculate the ratio P(1)/P(0) and replace the labels with that ratio.
We need to add a small value to P(0) to avoid divide-by-zero scenarios in which a category has no rows with target=0. Check out this link for more information.
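The effect of the zero guard can be seen in a small DATA step. This is a sketch under stated assumptions: the dataset name work.ratio_demo and the P(1) values are hypothetical, and the 0.0001 substitute mirrors the value used in the macro that follows.

```sas
/* Hypothetical P(1) values; the last group has no target=0 rows */
data work.ratio_demo;
    input p1;
    p0 = 1 - p1;
    if p0 = 0 then p0 = 0.0001;  /* avoid dividing by zero */
    prob_ratio = p1 / p0;
    datalines;
0.2
0.5
1
;
run;

proc print data=work.ratio_demo noobs;
run;
```

The first two groups get ratios of 0.25 and 1. The third group, where P(0) would otherwise be 0, gets a ratio of 10000, so the size of the substitute value directly determines how extreme the encoding is for such categories.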
SAS Macro for Probability Ratio Encoding
%macro probability_encoding(dataset,var,target);
/* Per-category mean of the target, treated as P(1) */
proc sql noprint;
create table stats as
select &var as gr, round(mean(&target),0.1) as mean_encode
from &dataset
group by &var;
quit;
/* Ratio P(1)/P(0); an exact 0 in P(0) is replaced by 0.0001 */
data stats;
set stats;
bad_prob=1-mean_encode;
if bad_prob=0 then bad_prob=0.0001;
prob_encode=mean_encode/bad_prob;
run;
/* Attach the ratio to every row of the input dataset */
proc sql noprint;
create table new as
select d.* , s.prob_encode
from &dataset as d
left join stats as s
on d.&var=s.gr;
quit;
%mend;
Wrapping Up
Categorical Feature Encoding is an important part of preparing data for machine learning models. Each method suits different circumstances, so it is worth knowing the different techniques that fall under the Bayesian category.
If you want to take a look at how the coding operates in a SAS environment, you can find all the SAS Macro Definition code on my GitHub page here.