Skip to content
HomeWorkProcessAboutBlogContact[email protected]
← Back to blog

Building Product Recommendation Engines for E-Commerce

Building Product Recommendation Engines for E-Commerce. How recommendation systems work and how to implement them for your store. Insights from The Proper Motion Company.

The “Customers who bought this also bought…” section might be the most profitable piece of real estate on the internet. Amazon attributes roughly 35% of its revenue to recommendation-driven purchases, and even modest e-commerce operations see measurable lifts when they move beyond hand-curated “featured products” lists. But recommendation systems occupy an odd position in the software landscape: conceptually simple, technically nuanced, and littered with implementation choices that can mean the difference between a 12% increase in average order value and an engine that confidently suggests winter coats to someone who just bought a swimsuit in July.

This post breaks down how recommendation engines actually work, the architectural decisions that matter for e-commerce specifically, and the practical path from “we have no personalization” to “our recommendations demonstrably drive revenue.”

The Three Families of Recommendation Approaches

Every recommendation system falls into one of three categories — or, more commonly, a hybrid of them.

Collaborative filtering looks at behavioral patterns across users. The core idea: if User A and User B both purchased items X, Y, and Z, and User A also purchased item W, then User B might be interested in W. You never need to understand what the products actually are. The signal comes entirely from co-occurrence patterns in purchase, view, and cart-add events.

The classic implementation is matrix factorization. You construct a sparse user-item interaction matrix (users as rows, products as columns, interaction scores as values), then decompose it into two lower-rank matrices whose product approximates the original. The latent factors in those matrices capture abstract “taste dimensions” — without anyone labeling what those dimensions mean. Alternating Least Squares (ALS) is the workhorse algorithm here because it handles implicit feedback (views, clicks) naturally and parallelizes well for large catalogs.

Content-based filtering uses product attributes — category, brand, color, price range, description text — to find items similar to ones a user has already engaged with. If someone bought a pair of running shoes, the system surfaces other running shoes with similar attributes. This approach does not require other users’ behavior and works well for new products with zero purchase history (the “item cold start” problem).

The modern implementation generates vector embeddings from product descriptions and images using pre-trained transformer models (sentence-transformers for text, CLIP for images), then performs approximate nearest-neighbor searches in embedding space using a library like FAISS or a managed service like Pinecone or Weaviate.

Knowledge-based and rule-driven recommendations apply explicit business logic: “always recommend a phone case after a phone purchase,” “never recommend a competing brand,” “surface warranty upsells on items over $200.” Every e-commerce operation has these rules, and they often encode domain expertise that no algorithm will discover on its own.

A production recommendation engine layers all three. Collaborative filtering captures emergent behavioral patterns, content-based filtering fills gaps where behavioral data is sparse, and business rules enforce strategic constraints.

Related: How AI Changes Software Architecture

Data Collection and the Event Pipeline

Recommendations are only as good as the behavioral signal feeding them. The minimum viable event set for e-commerce:

  • Product view (with duration if possible)
  • Add to cart
  • Remove from cart (a negative signal often ignored)
  • Purchase (the strongest positive signal)
  • Search query (tells you intent even when it does not result in a click)
  • Category/filter navigation (reveals preference dimensions)

Each event should include a user identifier (authenticated user ID or anonymous session ID), a timestamp, the product ID, and contextual metadata (device type, referral source, session ID). Events are captured client-side via a JavaScript tracker (a lightweight library similar to Segment’s analytics.js) and sent to an event ingestion endpoint.

The backend architecture follows a standard event-streaming pattern: events land in a message queue (Kafka or Amazon Kinesis), a stream processor enriches them with product metadata (joining against the product catalog), and the enriched events are written to both a real-time feature store and a batch data warehouse. The real-time store (Redis or DynamoDB) powers “trending now” and session-based recommendations. The batch warehouse (BigQuery, Snowflake, or a Parquet-based data lake) feeds nightly model retraining.

A critical implementation detail: anonymous-to-authenticated identity stitching. When a user browses anonymously and then logs in, their anonymous session history must be merged with their authenticated profile. Without this, you lose all pre-login behavioral signal, which is often the majority of browsing activity.

The Recommendation Architecture

A well-structured recommendation system separates three concerns:

Candidate generation narrows the full catalog (potentially millions of items) to a few hundred candidates. This is the collaborative-filtering or embedding-similarity step. Speed matters here — you cannot score every item in the catalog on every request. Pre-computed candidate lists (stored in Redis or a vector index) and approximate nearest-neighbor algorithms keep latency under control.

Ranking takes the candidate set and scores each item based on a richer set of features: user affinity scores, product margin, inventory level, recency of the product, and contextual factors like time of day or device type. A gradient-boosted tree model (XGBoost or LightGBM) trained on historical click-through and conversion data is the standard approach. The model outputs a relevance score, and the candidate set is sorted accordingly.

Post-processing and business rules apply final adjustments: deduplicate items already in the cart, enforce diversity (do not show five black t-shirts in a row), suppress out-of-stock items, apply promotional boosts, and enforce any brand-exclusion rules. This layer is deliberately procedural — it is the escape valve where business stakeholders can influence output without touching the model.

The entire pipeline should execute in under 200 milliseconds for synchronous page loads. Pre-computation helps: for logged-in users, batch jobs can generate personalized recommendation lists nightly, with the real-time layer adjusting based on within-session behavior.

See also: AI Chatbots vs AI Assistants: Choosing the Right Approach

Handling Cold Start Problems

New users and new products both lack the behavioral data that collaborative filtering depends on. Concrete strategies for each:

New user cold start: Start with non-personalized recommendations — bestsellers, trending items, editorially curated lists. As the user interacts (even a single product view), shift to content-based recommendations derived from that interaction. After 5-10 interactions, collaborative filtering begins to contribute meaningful signal. The transition should be seamless; the ranking model’s feature weights naturally shift as more behavioral data becomes available.

New product cold start: Use content-based similarity to place the new product into the recommendation ecosystem immediately. Generate embeddings from the product’s title, description, and images, and find its nearest neighbors in embedding space. Additionally, give new products a temporary “exploration boost” in the ranking model so they receive enough exposure to accumulate behavioral data. Monitor click-through rate during this exploration phase; if the product underperforms after sufficient exposure, let the boost decay.

Seasonal and contextual adjustment: A recommendation engine trained on twelve months of data will overweight winter products entering summer. Time-decay weighting in the training data (exponential decay favoring recent interactions) and explicit seasonal features in the ranking model address this. Some teams maintain separate candidate-generation models for different seasonal contexts.

Measuring Impact and Running Experiments

Recommendation quality is measured at two levels: offline metrics during model development and online metrics in production.

Offline metrics evaluate model accuracy on held-out historical data. Mean Average Precision at K (MAP@K), Normalized Discounted Cumulative Gain (NDCG), and recall at K are standard. These metrics tell you whether the model can recover known-good recommendations from historical data, but they do not tell you whether the recommendations will change user behavior.

Online metrics require A/B testing. The key business metrics:

  • Click-through rate (CTR) on recommendation widgets
  • Add-to-cart rate from recommendation clicks
  • Revenue per session for users exposed to recommendations vs. a control group
  • Average order value uplift
  • Catalog coverage — what percentage of your catalog appears in recommendations? Low coverage means the system is reinforcing popularity bias.

A/B tests should run for at least two full purchase cycles (typically 2-4 weeks for most e-commerce) to account for delayed conversions. Use user-level randomization (not session-level) to avoid contamination from users appearing in both groups across sessions.

One underappreciated metric: discovery rate — how often users purchase items they had not previously viewed or searched for. High discovery rate means the engine is genuinely expanding user awareness of the catalog, not just confirming existing intent.

Practical Implementation Path

For most e-commerce operations, the path from zero to effective recommendations follows this sequence:

Phase 1 (weeks 1-3): Implement event tracking and build the event pipeline. This is foundational and provides value even before model development, as the raw event data enables analytics.

Phase 2 (weeks 3-6): Deploy rule-based and popularity-based recommendations. “Trending in this category,” “frequently bought together” (computed via simple co-purchase frequency), and “similar products” (computed via content embeddings). These baselines are surprisingly effective and establish the measurement framework.

Phase 3 (weeks 6-12): Train collaborative-filtering models on accumulated behavioral data, build the ranking layer, and A/B test against the Phase 2 baselines. This is where the largest lift typically occurs.

Phase 4 (ongoing): Iterate. Add contextual features, experiment with different model architectures, refine business rules based on merchandising feedback, and expand recommendation placements (email, search results, checkout page).

The critical mistake is trying to jump straight to Phase 3 without the data infrastructure and measurement framework of Phases 1 and 2. A sophisticated model trained on bad data and measured by gut feeling will underperform a simple co-purchase table backed by clean event data and rigorous A/B tests.


If you are ready to move beyond generic product grids and build a recommendation engine tailored to your catalog and customers, get in touch with our team. We build these systems from the data layer up, ensuring every recommendation is measurable and every model decision is traceable.

// Related posts