How to Interpret and Understand DataHawk’s Sales Estimator Tool
Take an insider’s look at the new and improved Amazon sales estimator, from its inception to its latest version, released in September 2021. After reading this article, you will understand how feature building works at DataHawk and how to use the application-based sales estimator tool. Last but not least, you will see how our sales estimator tool leaves all our competitors in the dust. There is no product on the market more accurate!
- What is the DataHawk Sales Estimator?
- Why DataHawk Created the Sales Estimator Tool
- November 2020 to January 2021: The research phase
- February and March 2021: EMS 2.0 into production
- August and September 2021: EMS 3.0
- Using the Sales Estimator on the DataHawk Dashboard
- Error Rate & Competitors
- What’s next?
What is the DataHawk Sales Estimator?
The Sales Estimator displays the latest estimated monthly sales and units sold for a product, along with a historical view of daily sales estimations. Both of these features are available as soon as you start tracking a product on the DataHawk platform.
Why DataHawk Created the Sales Estimator Tool
At DataHawk, we excel at large-scale product data retrieval. Data, pulled daily from Amazon’s product pages, include the name of the product, seller, best seller rank (referred to as BSR), reviews, ratings, and so forth. You name it, we extract it.
Producing beautiful display charts of historical data is another of our many talents. To support our growth, we’re continuing to bring new members of the data team on board. This includes analysts, scientists, and machine learning experts. Our capacity to scale and leverage Big Data tables for your benefit is also growing exponentially. One of the ways to leverage big data is by creating homemade KPIs (Key Performance Indicators) based on the data we fetch and process. This is how sales estimates are determined.
Since our first sales estimator tool release in 2018, the goal has remained the same: deliver estimated product sales figures using only public data.
The 2018 DataHawk Sales Estimator was our first iteration. At that time, we pulled data from historical customer reviews. If a product had 20 additional reviews each day, we would consider these reviews as a percentage of all the sales made. This simple model worked well for some marketplaces and categories, but less so for others. Our sales estimates weren’t as trustworthy as we wanted. In order to improve the accuracy of sales estimate results, we launched a renewal mission: EMS 2.0 (Estimated Monthly Sales 2.0) in November 2020.
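The review-based idea can be sketched in a few lines of Python. This is only an illustration: the function name and the 2% review rate are hypothetical, and the real model varied by marketplace and category.

```python
def estimate_units_from_reviews(new_reviews: int, review_rate: float) -> float:
    """Estimate units sold, assuming a fixed share of buyers leaves a review.

    review_rate is a hypothetical parameter: the fraction of sales that
    result in a review (for example 0.02 for 2%).
    """
    if not 0 < review_rate <= 1:
        raise ValueError("review_rate must be in (0, 1]")
    return new_reviews / review_rate


# 20 new reviews in a day, with an assumed 2% review rate
daily_units = estimate_units_from_reviews(20, 0.02)
print(round(daily_units))
```

The weakness described above is visible here: the estimate depends entirely on one assumed rate, which differs between marketplaces and categories.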
November 2020 to January 2021: The research phase
To improve on reviews, we looked for a public data point that tracks sales more closely. The obvious answer was the Best Seller Rank.
The vast majority of product listings on Amazon display two different bestseller ranks:
- the rank of a product in its root category (e.g. Toys & Games – #15,710)
- the rank of a product in the category for which it ranks the best (e.g. Toys & Games > Kids’ Electronics > MP3 Players – #13)
As one can expect, the best seller rank of a product in a category is highly correlated with its sales. DataHawk has an extensive repository of historized BSR data because of the hundreds of thousands of products tracked using our platform.
The research phase was documented using Google Sheets. We used a root category in the US marketplace for which we have a lot of data and tried to create a model linking each quantity sold within a month to the median rank of the products selling that quantity. These first models taught us the following:
- Product sales estimates have to be on a Parent level, not on a Child level. For the vast majority of products, sales ranks are given at a parent level (every child has the same sales rank) and we have no solution (right now) to divide the estimated sales between the different child ASINs of a product. Example: if a product has $1,000 of estimated sales, the sales of all its variants sum to $1,000, and we can’t divide that estimate between variants for now.
- Product sales estimates are product level, not seller level. Sales ranks are also agnostic of the number of active sellers for a product. Thus, the estimate provided is for all of the sellers combined. We have no solution (as of right now) to divide estimated sales between sellers.
- Our model has to follow a logarithmic scale for it to follow the rankings’ evolution.
- We need marketplace-wide models to cover products that were in categories with too few data points.
- Our models should be based on root categories ranks. At least at that time, creating a limited number of models helped us maintain their accuracy.
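To make the logarithmic-scale point concrete, here is a minimal sketch of fitting a straight line in log-log space with plain Python. The power-law form and the sample data are assumptions made for illustration; they are not the models or figures from the research phase.

```python
import math


def fit_log_log(ranks, units):
    """Least-squares fit of ln(units) = intercept + slope * ln(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(u) for u in units]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_units(rank, intercept, slope):
    """Convert the log-space prediction back to units sold."""
    return math.exp(intercept + slope * math.log(rank))


# Made-up data: median root-category rank observed for each monthly quantity sold
ranks = [100, 1_000, 10_000, 100_000]
units = [3_000, 300, 30, 3]

intercept, slope = fit_log_log(ranks, units)
print(slope, predict_units(5_000, intercept, slope))
```

Working in log space keeps a rank of 100 and a rank of 100,000 on a comparable footing, which a linear fit on raw ranks would not.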
After reproducing our Google Sheets models for different root categories and marketplaces, we figured that this fairly straightforward version worked well enough to go further. We were estimating sales in the same ranges as JungleScout or FBAtoolkit. It was encouraging enough that we decided to transfer our knowledge into automated Python notebooks.
February and March 2021: EMS 2.0 into production
After mapping out a strategy, we started implementing data intelligence in a Python notebook on Databricks. Its job was to query the data from a server and create all the possible data science models linking units sold to ranks (following the discoveries we made in the previous months) using NumPy and Pandas functions.
For those interested in mathematical equations, a model is defined by 3 variables A, B & C and it looks as follows:
$EstimatedUnitsSold = Model (A,B,C,Rank)$
With the data science automation step, 3 variables were defined for all the root categories across marketplaces with enough data points. For every product having a root category BSR or being in a covered marketplace, we could then compute the sales of the last 30 days.
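The lookup described here (a category model when one exists, a marketplace-wide model otherwise) could look like the sketch below. Everything specific is hypothetical: the parameter values, the `MODELS` table, and above all the functional form inside `model`, which is a placeholder and not the actual equation behind `Model(A, B, C, Rank)`.

```python
import math
from typing import Optional

# Hypothetical (A, B, C) parameters keyed by (marketplace, root category).
# A category of None stands for the marketplace-wide fallback model.
MODELS = {
    ("US", "Toys & Games"): (12.0, -1.0, 0.0),
    ("US", None): (11.0, -0.9, 0.0),
}


def model(a: float, b: float, c: float, rank: int) -> float:
    """Placeholder log-scale form; an assumption, not DataHawk's equation."""
    log_rank = math.log(rank)
    return math.exp(a + b * log_rank + c * log_rank ** 2)


def estimate_monthly_units(marketplace: str,
                           root_category: Optional[str],
                           rank: int) -> Optional[float]:
    """Use the category model if present, else the marketplace-wide one."""
    params = MODELS.get((marketplace, root_category))
    if params is None:
        params = MODELS.get((marketplace, None))
    if params is None:
        return None  # marketplace not covered
    return model(*params, rank)


print(estimate_monthly_units("US", "Toys & Games", 15_710))
print(estimate_monthly_units("US", "Books", 15_710))   # falls back to marketplace model
print(estimate_monthly_units("DE", "Books", 15_710))   # no model available
```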
We released the new DataHawk sales estimator in April 2021, and the former review-based estimator became legacy. This was a big moment because it showed us two important things:
- We have the capacity to build data science models and put them into production within a few months.
- Our new estimator worked well and with only a few engineers on the project and several months of work, we were surpassing competition with a much more accurate tool.
Each month, we would refresh our EMS models by launching the notebook and the data pipeline.
Problems of EMS 2.0
After several months of using this new version of the DataHawk sales estimator some problems surfaced:
- We had recently started using Snowflake as a warehouse and dbt as a warehouse manager and we didn’t take advantage of these amazing data tools.
- Our category referential had deteriorated, and because of wrong category mappings, our estimate coverage (across all the tracked products on our platform) went from 75% down to 45%.
- When tracking a product, we sometimes miss out on the price because of the BuyBox being unavailable. We can also miss out on a BSR because of layout changes on the product pages. Even if these glitches are super rare, they caused holes in the sales estimates data.
- Last but not least, we had to launch our modeling and data pipeline each month and we needed an automated system.
August and September 2021: EMS 3.0
In August 2021, intending to resolve all of version 2’s problems, we launched a new project: the third version of our sales estimator.
Browse node referential renewal
We are currently using the Amazon categories referential in several features of the DataHawk application: Sales Rank Tracker, BSR Browser, Best Selling Categories Report, Estimated Monthly Sales Chart, etc.
The way we used to create and update categories within our database wasn’t working anymore. We started working on rebuilding this process. The transition to Snowflake and dbt was a blessing for this project and using views, tables, and procedures, we managed to create a new Amazon Browse Node referential pipeline taking its source in the BSR top 100 data and the product pages data.
With ranks mapped to correct categories, the number of data points per category grew, thus the number of created models grew, and ultimately, our coverage went from a sloppy 45% up to 83% of estimations available for all the tracked products in DataHawk. As you can expect, this renewal mission of our browse node referential has also positively impacted all the features using it, previously cited.
83% of coverage still wasn’t up to the standards we wanted so we started working on filling out data holes, more commonly known as data smoothing.
First of all, as previously stated, we cannot always retrieve the price of a product page. Even if we estimate units sold, we need the price to compute estimated sales. We added a price wherever it was possible (referencing every tracked product), meaning we filled every null price with the last one we saw for that product. Price coverage, for each product, is either 0% or 100%. Overall 90% of products have a 100% price coverage.
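Filling a null price with the last one seen is a forward fill. A minimal sketch in plain Python follows; the function name is hypothetical, and leaving leading gaps empty is an assumption of this sketch, since the text does not say how days before the first observed price are handled.

```python
def fill_prices(prices):
    """Replace None with the last price seen for the product.

    Days before the first observed price stay None in this sketch.
    """
    filled = []
    last_seen = None
    for price in prices:
        if price is not None:
            last_seen = price
        filled.append(last_seen)
    return filled


# One product's daily prices, with None where the BuyBox was unavailable
daily_prices = [None, 10.0, None, None, 12.5, None]
print(fill_prices(daily_prices))
```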
Secondly, we had some data holes in the estimations as well (e.g. a missing rank for a day meant no estimation for that day). The data smoothing we conducted was a weighted average that gives more importance to the data points closest to the missing date. Thanks to this data smoothing step, coverage of our estimates increased by about 6 percentage points (from 83% to 89%).
Next, to reach our initial goal of 100% coverage for version 3.0 of the DataHawk Sales Estimator, we used the first version of estimates (v1.0), based on reviews, for the remaining 11% of uncovered products. Because of the high error margin of this version, we first decided to shrink the error by comparing estimated to real data and by having different review-based models per marketplace. This process worked quite efficiently and we were able to reduce the error substantially (by a factor of 10 in the US marketplace).
Eventually, we merged both estimation strategies (BSR-based EMS and review-based EMS), prioritizing the BSR-based one over the review-based one, and the coverage went from 89% to 99%.
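One way to implement a weighted average that favors nearby days is inverse-distance weighting. The sketch below is an assumption about the weighting scheme (weights of 1 / distance in days, within a hypothetical 3-day window); the text only states that closer data counts more.

```python
def smooth_missing(values, window=3):
    """Fill None entries with an inverse-distance weighted average
    of the known values within `window` days on either side."""
    result = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        weighted_sum = 0.0
        weight_total = 0.0
        start = max(0, i - window)
        stop = min(len(values), i + window + 1)
        for j in range(start, stop):
            if values[j] is None:
                continue  # also skips j == i, so the distance is never 0
            weight = 1.0 / abs(i - j)
            weighted_sum += weight * values[j]
            weight_total += weight
        if weight_total > 0:
            result[i] = weighted_sum / weight_total
    return result


# Daily unit estimates with two missing days
print(smooth_missing([10.0, None, None, 40.0]))
```

Only original observations feed the average, so a filled value never influences another filled value.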
To sum up, 89% of our estimations are based on the ranks, 10% of our estimations are based on the reviews of that product and the missing percent of estimates comes from the fact that we need a bit of historical data to have a review-based estimate.
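The merge rule amounts to a simple priority order. A minimal sketch, with hypothetical function and label names:

```python
def merge_estimates(bsr_estimate, review_estimate):
    """Return (value, source), preferring the BSR-based estimate.

    Falls back to the review-based estimate, and to (None, None)
    when neither is available.
    """
    if bsr_estimate is not None:
        return bsr_estimate, "bsr"
    if review_estimate is not None:
        return review_estimate, "reviews"
    return None, None


print(merge_estimates(320.0, 410.0))  # BSR-based wins when both exist
print(merge_estimates(None, 410.0))   # review-based fallback
print(merge_estimates(None, None))    # not enough history yet
```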
Implementation and release of DataHawk’s newest sales estimator
Using the Sales Estimator on the DataHawk Dashboard
- Project > Products > Tracked Product List > Sales | Units https://demo.datahawk.co/app/project/23522/product
- Project > Product Details > Metrics > Estimated Sales Card https://demo.datahawk.co/app/project/23522/product/1036355/t/product-metrics
On the right of the two metrics, you will see a colored dot representing the confidence score (green, orange, or red).
Product metrics chart
The two metrics on the top left are the same as the one shown in the product list. These are the latest estimated monthly sales and the latest estimated monthly unit sold.
Even though we compute estimated monthly sales and estimated monthly units sold (an estimation of how much a product sold over the last 30 days), we can switch to a daily view just by dividing estimated sales and estimated units sold by 30.
This chart is a visualization of estimated daily sales (center blue line & scatter – left axis) & units sold (green bars – right axis).
The top & bottom blue lines are the maximum estimated sales and the minimum estimated sales.
The last metric on the bottom left of the chart is the aggregated estimations. We compile what’s shown on the chart above and display it here. This aggregation changes when the date range input (at the top of the page) is changed.
Confidence score proportions
- Green: High Trust Index ⇒ ModelErrorRate ≤ 50%
- Orange: Moderate Trust Index ⇒ 50% < ModelErrorRate ≤ 100%
- Red: Low Trust Index ⇒ ModelErrorRate > 100%
Using these parameters to determine confidence score, there are approximately 1/3 of high, 1/3 of moderate, and 1/3 of low confidence estimates throughout all our products. These parameters won’t change and we’re always working to better our algorithms to increase the high and moderate ratios.
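The thresholds above map directly to a small function. The function name is hypothetical, and the error rate is expressed as a fraction (0.5 for 50%):

```python
def confidence_color(model_error_rate: float) -> str:
    """Map a model error rate (as a fraction) to the dot color."""
    if model_error_rate <= 0.5:
        return "green"   # High Trust Index
    if model_error_rate <= 1.0:
        return "orange"  # Moderate Trust Index
    return "red"         # Low Trust Index


for rate in (0.35, 0.5, 0.75, 1.0, 1.2):
    print(rate, confidence_color(rate))
```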
The main reason your product could have a low confidence score is the lack of data points within that product’s category. As a DataHawk customer, you should aim to track as many products as possible to provide our algorithm with more data to work with.
Error Rate & Competitors
We asked customers for real sales data to compute an error rate and to compare our estimator with our competitors’. We will not yet discuss the confidence score (computed upstream, depending on the model used) but instead, the error rate computed afterward: the difference between real sales and estimated sales figures.
We follow this strict process to compute our error rate:
- We sum the real quantity sold for all the ASINs for which we have sales data and group this sum by parent ASIN and by month. As said earlier, the estimates are on a parent level.
- We sum the estimated quantity sold and group it by parent ASIN and by month.
- We compare the estimate and real sales for each parent ASIN for each month as follows: $ErrorRate = Abs(Estimate-Real)/Real$ We use the absolute value so that the error rate is always positive, which eases the analysis.
- We take the median error rate of all parents within a marketplace within a month. We believe using the median is the most accurate way to understand our accuracy. The smaller the error the better.
- Another way to look at this is by computing an accuracy rather than an error rate. $Accuracy = 1-Median(ErrorRate)$
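The steps above can be sketched in plain Python. The row layout `(parent_asin, month, units)` and the sample figures are made up for illustration; they are not customer data.

```python
from collections import defaultdict
from statistics import median


def median_error_by_month(real_rows, estimated_rows):
    """Median error rate per month across parent ASINs.

    Each row is (parent_asin, month, units). Real rows may contain
    several child ASINs per parent; they are summed to the parent level.
    """
    real = defaultdict(float)
    estimated = defaultdict(float)
    for parent, month, units in real_rows:
        real[(parent, month)] += units
    for parent, month, units in estimated_rows:
        estimated[(parent, month)] += units

    errors = defaultdict(list)
    for (parent, month), real_units in real.items():
        if real_units > 0 and (parent, month) in estimated:
            diff = abs(estimated[(parent, month)] - real_units)
            errors[month].append(diff / real_units)
    return {month: median(rates) for month, rates in errors.items()}


real_rows = [
    ("P1", "2021-09", 60), ("P1", "2021-09", 40),  # two children of P1
    ("P2", "2021-09", 200),
    ("P3", "2021-09", 50),
]
estimated_rows = [
    ("P1", "2021-09", 120),
    ("P2", "2021-09", 150),
    ("P3", "2021-09", 80),
]

error = median_error_by_month(real_rows, estimated_rows)["2021-09"]
accuracy = 1 - error
print(error, accuracy)
```

Using the median rather than the mean keeps a single badly estimated parent from dominating the figure.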
We computed the error rate of 20 products on the Amazon US marketplace across 7 different root categories and compared the results with estimates from our competitors.
Median: The median error stays around 35% (accuracy of 65%), which means 50% of the estimations we give have an accuracy of 65% or more.
3rd Quartile: The third-quartile error rate is 52%. This means that 75% of the estimations we give have an accuracy of 48% or more.
Average: The average error for these estimations is 42%.
- DataHawk’s estimator is by far the most stable out there (see average error differences). Even our worst estimates (3rd quartile of error rates) have a 52% error rate, which is two times better than JungleScout’s and four times better than Helium 10’s. In other words, DataHawk’s Sales Estimator keeps even its weaker estimates from drifting far off. This chart shows that you can trust the vast majority of the estimates we provide.
- When looking at our best estimations (1st quartile and median figures), we can say that in addition to being stable, DataHawk’s estimator has one of the best accuracy rates, especially among other top-selling products.
Of course, we understand our analysis’ sample is rather small and the ranking between competitors could be different using a different one.
The output of this analysis shows an important thing: most Amazon third-party tools, like DataHawk, that provide sales estimates are using similar logarithmic, root category best seller rank-based models to compute these estimates and these estimates are tightly bunched between competitors.
What’s next?
- Statistical differences of listings (reviews, keyword ranks, ratings) between children’s variations of products. With additional analytics and intelligence, we aim to provide a trustworthy way to divide the sales estimates between different variants of the same product.
- Looking at data such as BuyBox winners, and the number of sellers for a product to enable seller-level sales estimates.
- Possibly include historical estimates for a maximum of products by using their ranks in sub-nodes categories thanks to the BSR top 100 data we already have. This would also enable estimates summed up for a whole category and the evolution of categories compared to each other.
- We are aiming to increase our accuracy even more. You can help us achieve this by continuing to track as many products as you can.
Hopefully, you have a better understanding of what it’s like to create a data-science feature from scratch at DataHawk. Those of you who made it this far are honorary DataHawk data analysts now! Thanks to everyone who took the time to read this article and thanks to everyone on the team who worked on this great feature. As an Amazon Vendor or Seller, the DataHawk Sales Estimator is a simple way for you to spy on your competitors’ sales, unveil how competing products are doing and also analyze niches & improve your strategies. Enjoy, we cannot wait for your feedback!