2/22/2026 · 3 min read

Building Predictive Lead Scoring Via Logistic Regression in Google BigQuery ML

What is Predictive Lead Scoring in BigQuery?

Predictive lead scoring uses BigQuery ML (BQML) to analyze historical customer data and assign a probability score to new leads, indicating their likelihood to convert. By using SQL-based machine learning, businesses can build, evaluate, and deploy models like Logistic Regression or Boosted Trees directly within their data warehouse, eliminating the need for complex data pipelines.

Most marketing teams struggle with "Lead Fatigue": spending too much time on cold prospects while hot leads go ignored. By leveraging BigQuery ML, you can turn your SQL workspace into a predictive powerhouse.

For this example, we will load the Bank Marketing dataset from Kaggle and use past interactions to predict whether, and how likely, each customer is to say "Yes" to a specific bank product.

https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset

Step 1: Create the "Showroom" (Data Transformation)

Once you have created a project, a dataset, and a table holding the raw data, it's time to create a view in BigQuery. Why do that?

Abstraction Layer: If you change your underlying table (like adding a new column), you don't have to rebuild your AI model. You just update the View to point the right data to the model.

On-the-Fly Formatting: It allows you to fix data types, like turning a numeric day into a category or making month names consistent without actually changing or duplicating your original database.

Security & Governance: You can hide sensitive columns (like personal phone numbers) from the AI model while still giving it the behavioral data it needs to make a prediction.

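-- Expose the raw table through a view that adds two model-friendly columns.
CREATE OR REPLACE VIEW `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2` AS
SELECT *,
  LOWER(month) AS month_clean,     -- make month names consistent (lowercase)
  CAST(day AS STRING) AS day_str   -- STRING, so BQML treats day as a category
FROM `bank-predictive-lead-scoring.bank_data_full.bank-lead-scoring`;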
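Before training on the view, it can help to confirm that the on-the-fly formatting did what you expect. The query below is a minimal sanity check, not part of the original walkthrough: it only uses the view and the two derived columns defined above, and the output aliases (n_rows, distinct_days) are names chosen here for illustration.

```sql
-- Sanity check on the view: one row per cleaned month name.
-- If the same month shows up twice (e.g. with stray spaces), the cleanup
-- in the view needs more work. distinct_days should never exceed 31.
SELECT
  month_clean,
  COUNT(*) AS n_rows,
  COUNT(DISTINCT day_str) AS distinct_days
FROM `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2`
GROUP BY month_clean
ORDER BY n_rows DESC;
```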

Step 2: Train the "AI Brain" (Model Creation)

We trained a Logistic Regression model with auto_class_weights=TRUE, which weights each class in inverse proportion to its frequency so that the model pays attention to the rare "Yes" responses in a sea of "No" data.

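CREATE OR REPLACE MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`
OPTIONS(
  model_type='logistic_reg',   -- logistic regression classifier
  input_label_cols=['y'],      -- y is the column the model learns to predict
  auto_class_weights=TRUE      -- rebalance the rare "Yes" class
) AS
SELECT * EXCEPT(customer_id)   -- an ID carries no predictive signal
FROM `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2`
WHERE y IS NOT NULL;           -- train only on rows with a known outcome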
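To see why class weighting matters for your own data, you can measure how rare the "Yes" label actually is. This is a small sketch under the same assumptions as the training query (label column y, same view); the aliases n_leads and share are illustrative names, not part of the original setup.

```sql
-- Label distribution: how many leads fall in each class, and what share
-- of all labeled rows each class represents.
SELECT
  y,
  COUNT(*) AS n_leads,
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER (), 4) AS share
FROM `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2`
WHERE y IS NOT NULL
GROUP BY y
ORDER BY n_leads DESC;
```

If one class holds only a small share of the rows, an unweighted model can score well on accuracy while missing most of the buyers, which is the situation auto_class_weights is meant to counter.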

Step 3: Evaluate the Performance (Quality Check)

We checked the ROC AUC and Recall to make sure the model was actually useful. Our second model hit a 0.90 AUC, which indicates that it ranks leads well.

The first model scored lower because it treated day as a numeric field rather than a categorical variable.
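One way to confirm that day is now handled as a category is to inspect the learned coefficients with ML.WEIGHTS. The sketch below assumes the model was trained on the view above, so the feature appears under the name day_str; for categorical features, BigQuery ML reports one weight per category in the category_weights array.

```sql
-- Per-category weights for the day-of-month feature.
-- A numeric feature would instead have a single value in the weight column
-- and an empty category_weights array.
SELECT
  processed_input,
  cw.category AS day_of_month,
  ROUND(cw.weight, 4) AS weight
FROM
  ML.WEIGHTS(MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`)
  CROSS JOIN UNNEST(category_weights) AS cw
WHERE processed_input = 'day_str'
ORDER BY weight DESC;
```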

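Think of ROC AUC as the AI's overall IQ score for this task: it is the probability that a randomly chosen buyer receives a higher score than a randomly chosen non-buyer. Recall is the share of actual buyers the model flags, so it tells us if our net is wide enough to catch all the fish in the pond.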

SELECT * FROM ML.EVALUATE(MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`);
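ML.EVALUATE reports precision and recall at a single classification threshold. To see how the two trade off as the threshold moves, ML.ROC_CURVE can be queried the same way. This is a rough sketch rather than part of the original workflow; the precision column is derived here from the true and false positive counts that the function returns, and score_threshold is an alias chosen for illustration.

```sql
-- Recall, precision, and false positive rate at each candidate threshold.
-- Lower thresholds catch more buyers (higher recall) at the cost of
-- more false alarms (lower precision).
SELECT
  ROUND(threshold, 2) AS score_threshold,
  ROUND(recall, 3) AS recall,
  ROUND(SAFE_DIVIDE(true_positives, true_positives + false_positives), 3) AS precision,
  ROUND(false_positive_rate, 3) AS false_positive_rate
FROM ML.ROC_CURVE(MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`)
ORDER BY score_threshold DESC;
```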

Model 1

Precision: 0.76 (great: when it said "yes," it was usually right). Recall: 0.12 (terrible: it missed 88% of the actual buyers). ROC AUC: 0.75.

[Image: Model 1 evaluation metrics]

Model 2

[Image: Model 2 evaluation metrics]

Step 4: Generate the "Hot Leads" (Inference)

Finally, ML.PREDICT returns the probability for each class in an array column, predicted_y_probs. We "unpacked" the AI's hidden probability scores using UNNEST so we could see exactly how likely each individual customer is to buy.

SELECT
  customer_id,
  ROUND(p.prob, 4) AS probability_score,
  * EXCEPT(customer_id, predicted_y, predicted_y_probs)
FROM
  ML.PREDICT(MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`,
    (SELECT * FROM `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2`))
  -- predicted_y_probs holds one (label, prob) pair per class; expand it to rows
  CROSS JOIN UNNEST(predicted_y_probs) AS p
-- keep only the probability of the positive ("Yes") class
WHERE CAST(p.label AS STRING) = 'true'
ORDER BY probability_score DESC;

[Images: query results showing the probability score for each lead]
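A ranked list of probabilities is easier to act on once it is grouped into tiers. The query below is one possible sketch: it reuses the same ML.PREDICT call and adds a lead_tier column. The tier names and the 0.8 and 0.5 cut-offs are hypothetical values chosen for illustration, and should be tuned against your own precision and recall targets.

```sql
-- Bucket each lead into a tier based on its probability of saying "Yes".
SELECT
  customer_id,
  ROUND(p.prob, 4) AS probability_score,
  CASE
    WHEN p.prob >= 0.8 THEN 'hot'    -- hypothetical cut-off
    WHEN p.prob >= 0.5 THEN 'warm'   -- hypothetical cut-off
    ELSE 'cold'
  END AS lead_tier
FROM
  ML.PREDICT(MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`,
    (SELECT * FROM `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2`))
  CROSS JOIN UNNEST(predicted_y_probs) AS p
WHERE CAST(p.label AS STRING) = 'true'
ORDER BY probability_score DESC;
```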

Marketing Spend Efficiency: When you know which leads are most likely to convert, you can stop wasting your ad budget on "lookalike" audiences that don't actually perform. You can bid higher on the leads that BQML identifies as high-intent.

Low-Latency Execution: Because this lives in BigQuery, there is no "hand-off" to a data science team that takes three weeks to return a CSV. As soon as a lead lands in your warehouse, an on-demand or scheduled ML.PREDICT query can score it, and that score can then trigger an automated email or a salesperson's task.
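For downstream tools to react to a score, the scores need to live somewhere they can be read. A minimal sketch, assuming you are free to create a new table in the same dataset: the table name lead_scores and the scored_at column are hypothetical, and the statement could be run on demand or saved as a scheduled query.

```sql
-- Materialize the latest score for every lead into a table that
-- email or CRM automations can read.
CREATE OR REPLACE TABLE `bank-predictive-lead-scoring.bank_data_full.lead_scores` AS
SELECT
  customer_id,
  ROUND(p.prob, 4) AS probability_score,
  CURRENT_TIMESTAMP() AS scored_at   -- when this batch was scored
FROM
  ML.PREDICT(MODEL `bank-predictive-lead-scoring.bank_data_full.lead_scoring_model2`,
    (SELECT * FROM `bank-predictive-lead-scoring.bank_data_full.v_bank_marketing_transformed2`))
  CROSS JOIN UNNEST(predicted_y_probs) AS p
WHERE CAST(p.label AS STRING) = 'true';
```

Rebuilding the whole table on each run keeps the sketch simple; on larger data you would more likely score only the leads that arrived since the previous run.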
