Apratim Biswas

Tampa Bay Area, Florida · (352) 872-2490 · apratimbiswas2003@yahoo.com

Data Scientist and experienced R&D Engineer with a background in Materials Engineering. Well versed in best practices for research, data collection, and data analysis, with hands-on experience in each. Adept at leading diverse teams, engaging internal and external stakeholders, and implementing process innovations.

Portfolio: Selected Projects

App to predict value of a single transaction in a retail marketing campaign of liquid milk products

app preview
Problem Statement

In a marketing campaign that offers discounts, several factors other than the discount rate affect the final revenue per transaction. These factors also influence each other, further complicating the revenue dynamics. My goal in this project was to develop and deploy an app that estimates sales value per transaction for liquid milk products, given a set of parameters. The deployed model uses multiple linear regression to capture the dynamics of sales value per transaction. Some factors I took into account are:

  1. amount and types of discount
  2. number of days since start of campaign
  3. unit volume
  4. brand
  5. type of milk
  6. manufacturer

I used 'The Complete Journey' dataset, which is publicly available from Dunnhumby, to develop this model. Please note that the app was deployed using Heroku's free tier and may take up to a minute to load the first time. [app link] [project link]
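As an illustration of the multiple linear regression idea described above, here is a minimal sketch that fits ordinary least squares to a handful of synthetic transactions. It is not the deployed model: the feature columns (discount_rate, days_since_start, unit_volume, is_whole_milk), the coefficient values, and the helper names `fit_linear_regression` and `predict` are all assumptions made for this example, and no data from 'The Complete Journey' is used.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Estimated sales value for one transaction's feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic rows (hypothetical):
# (discount_rate, days_since_start, unit_volume, is_whole_milk)
rows = [
    (0.0, 1, 1.0, 0),
    (0.1, 5, 2.0, 1),
    (0.2, 10, 1.0, 1),
    (0.0, 20, 0.5, 0),
    (0.3, 15, 2.0, 0),
    (0.1, 30, 1.0, 1),
]
# Targets generated from a made-up linear rule, so the fit should recover it.
targets = [2.0 - 3.0 * d + 0.1 * t + 1.5 * v + 0.5 * w for d, t, v, w in rows]

coefs = fit_linear_regression(rows, targets)
estimate = predict(coefs, (0.2, 7, 1.0, 1))
```

Categorical factors such as brand, type of milk, and manufacturer would enter a model like this as one-hot indicator columns, as the `is_whole_milk` column does here.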

Information Retrieval From Customer Reviews on Amazon

clustering customers based on reviews
Problem Statement

Popular businesses, especially ones that offer multiple products and services, receive a large volume of customer feedback in the form of text reviews. Listening to such feedback is essential for a business to succeed. The aim of this project is to use Natural Language Processing (NLP) to extract and aggregate information at the topic level from a large volume of customer reviews. Specifically, text reviews from Amazon were used in this project. The model I developed does the following:

  1. Cluster customer reviews based on their content. This helps us learn about customers.
  2. Extract high-level information at the topic level for each cluster. Such information can help us improve and/or enhance products/services.

Techniques and algorithms that went into this work include DistilBERT embeddings, sentence tokenization, HDBSCAN clustering, UMAP, TF-IDF, and Latent Dirichlet Allocation. As the name suggests, DistilBERT is a distilled version of BERT (Bidirectional Encoder Representations from Transformers). It uses 40% fewer parameters and runs 60% faster than BERT, while preserving over 95% of BERT's performance. project link
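To make the topic-level extraction step concrete, the following minimal sketch ranks the most distinctive terms in each cluster using a plain TF-IDF score, treating each cluster as one document. It assumes the reviews have already been grouped. The DistilBERT, UMAP, HDBSCAN, and LDA stages of the project are not reproduced here, and the cluster names, example reviews, and the function `top_terms_per_cluster` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms_per_cluster(clusters, k=3):
    """Return the k highest TF-IDF terms for each cluster.

    clusters: dict mapping a cluster id to a list of review strings.
    Each cluster is treated as a single document when computing IDF.
    """
    term_counts = {
        cid: Counter(tok for review in reviews for tok in tokenize(review))
        for cid, reviews in clusters.items()
    }
    n_clusters = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    result = {}
    for cid, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_clusters / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[cid] = ranked[:k]
    return result


# Made-up reviews, already assigned to two hypothetical clusters.
clusters = {
    "shipping": ["Package arrived late", "Late delivery and the package was damaged"],
    "battery": ["Battery died fast", "The battery life was short"],
}
top = top_terms_per_cluster(clusters, k=2)
```

Terms that occur in every cluster (here "the" and "was") score zero, so the ranking surfaces the words that set a cluster apart from the others.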

Classify Reddit Comment to Identify Originating Subreddit

fingerprint words with high signal to noise ratio
gradient boost: auc-roc curve
Problem Statement

The problem can be broken down as follows:

  1. we have a comment;
  2. we have a selection of two closely related subreddits;
  3. we need to identify which of the two subreddits the comment belongs to.

The goal of the project is to build a model that automates this classification. The two subreddits selected are /r/buildapc and /r/buildapcforme, both of which advise people who want to build their own PC, for example by recommending compatible parts. project link
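The two-way classification task can be sketched with a small bag-of-words model. The example below uses multinomial Naive Bayes with add-one smoothing as a simple stand-in. It is not the gradient boosting model referenced in the project figures, and the training comments, the class `NaiveBayesTextClassifier`, and its methods are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)  # documents per class
        self.word_counts = {}                # class -> token counts
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            total = sum(counts.values())
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / n_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training comments (not taken from either subreddit).
texts = [
    "Is this GPU compatible with my motherboard",
    "My PC will not boot after installing new RAM",
    "Please build me a gaming PC with a budget of 1000 dollars",
    "Looking for a parts list for a 1500 dollar budget workstation",
]
labels = ["buildapc", "buildapc", "buildapcforme", "buildapcforme"]

model = NaiveBayesTextClassifier().fit(texts, labels)
guess_a = model.predict("what parts fit a 1200 dollar budget")
guess_b = model.predict("my gpu will not boot")
```

Because the two subreddits share most of their vocabulary, a real model depends on a few high-signal "fingerprint" words. In this toy data, words such as "budget" and "parts" play that role.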

Identifying Potential Toronto Neighborhood(s) To Open A Profitable Coffee Shop

preferred locations in toronto to open a coffee shop

This is a hypothetical case. Our client, an entrepreneur, wants to open an independent coffee shop in Toronto. The goal of this project is to use exploratory data analysis and machine learning to recommend a list of neighborhoods best suited for the purpose. The following factors were considered in making the recommendations:

  1. Geographical Coordinates
  2. Population
  3. Average income
  4. Walkability
  5. Businesses
  6. Debt Risk Score
  7. Venues including Coffee shops, Parks and Playgrounds
project link

Motor Vehicle Collisions in NYC: An analysis of risk factors

Project Team: Apratim Biswas, Reggie DePiero, and Christopher Villafuerte

contributing factors for crashes in Lincoln Tunnel

Time-series analysis

Geographical distribution of crashes


In 2018 alone, there were 228,047 car accidents in New York City. The goal of the project is to identify high risk factors within NYC traffic. The final deliverable is a report that:

  1. studies the contributing factors and preconditions;
  2. examines vehicular collision data as a time series and builds a model to predict the number of crashes; and
  3. creates a map of risks, both spatial and temporal.
The motivating factors for creating such a map are:
  1. first responders can use it to navigate the city during motor vehicle crashes, natural disasters, and emergencies; and
  2. city planners can use it to improve traffic performance and make the roads safer.
In our time-series model, we focused on the Lincoln Tunnel and its surrounding areas, because the tunnel is one of the major arteries that feeds traffic from New Jersey. project link

State-Level Participation In ACT and SAT: Exploratory Data Analysis

participation in SAT and ACT in 2017
participation in SAT and ACT in 2018

Perform a detailed exploratory analysis of the landscape of college entrance testing in the United States. Use the results of the analysis to recommend what the College Board needs to do to increase participation in the SAT. project link

Residential Evictions in New York City: Exploratory Data Analysis

evictions in NYC by neighborhood

evictions in NYC by zipcode

New York City has one of the most competitive and expensive real estate markets in the country. As is to be expected, a large portion of its businesses and homes are leased properties. High population density, lack of space for horizontal growth, and extensive vertical growth make evictions an unfortunate reality for residents of the city. The goal is an exploratory analysis of residents' household income, their level of education, and the prevalence of certain types of crime in various neighborhoods, and of how the distribution of these factors relates to the distribution of evictions per capita.

This study focuses solely on residential properties and their renters. project link



  • Linux
  • Windows
  • IBM Cloud
  • Google Colaboratory
Programming Languages & Tools
  • Python | SQL | Pandas | Numpy | Matplotlib | Seaborn | Tableau
  • Scikit-learn | Keras | TensorFlow | Streamlit | Jupyter Lab
  • Microsoft Excel


Data Science Immersive Fellow (Full Time)

General Assembly

Completed a full-time intensive Data Science program. Gained hands-on experience analyzing large real-world datasets and modeling them using various machine learning algorithms. Also added to my experience in exploratory data analysis and in translating data into meaningful stories. Here are a couple of project highlights:

  • Information Retrieval From Customer Reviews on Amazon: Listening to customers is central to a successful enterprise. However, when we have multiple products/services in the market, or the volume of customer reviews is very large, say >50,000, it can be challenging to keep up with the information. Using natural language processing, I developed a tool to:
    • i) Cluster customer reviews based on their content. This helps us learn about customers.
    • ii) Extract high-level information at the topic level from the reviews in each cluster. Such information can help us improve and/or enhance products/services.
  • Classify Reddit Comment to Identify Originating Subreddit: Developed a model using natural language processing to classify any randomly selected comment as belonging to one of two very similar subreddits. The model classified comments accurately 96% of the time.
  • Aug 2020 - Nov 2020

    Corporate Metallurgist / R&D Engineer

    Gopher Resource

    Led corporate research and development, including creating mass balance models and furnace feeds to improve sustainability and enhance production models. Here are some highlights from my work at Gopher Resource:

  • Oversaw process development from lab scale through full production. Managed large plant-scale floor experiments and oversaw process and operational improvements. Facilitated root cause analysis. Supervised external process consultants.
  • Designed and implemented process improvements that delivered an 80% increase in selenium yield in lead and a 90% decrease in selenium in discharged water.
  • Developed and implemented a new process to render the solid waste ‘slag’ created in the smelting process non-hazardous, resulting in a >70% decrease in the volume of hazardous slag.
  • Improved lead production outputs with root cause analysis on low production in special circumstances.
  • Developed new procedures to collect and analyze accurate data from process design through implementation and maintenance.
  • Sept 2012 - Mar 2019

    Graduate Research Fellow

    University of Florida
  • Conducted research on the development of processes to synthesize TiO2-SiO2 nanofibermats for use in the lunar environment, including synthesis and characterization of nanofibers and filtration properties.
  • Developed processes to make more flexible fibermats in comparison to ceramic fibers currently in use.
  • Taught undergraduates in materials science and engineering.
  • Aug 2007 - Dec 2011


    University of Florida

    Doctorate (PhD)
    Materials Science & Engineering
    August 2007 - May 2012

    Indian Institute of Engineering Science & Technology

    Bachelor of Engineering
    Metallurgy & Materials Engineering
    August 2003 - May 2007