# STATSML 101: Learn Basic Python Online

## Scenario

Remember that the goal of connecting to Data Distiller from Python is to extract a table for analysis. This table is typically a sample that will be stored as a "DataFrame" (a table within Python), and subsequent operations will operate on this local DataFrame. This is no different from downloading the results of a SQL query in Data Distiller as a local file and then reading that file into Python. For our training, we will assume that you have extracted this table as a CSV file locally.

## Create an Account in Kaggle

To learn Python on the go, we will be using the notebook editor at [Kaggle](https://www.kaggle.com/). All you need is an email address to log in.

Create a New Notebook. You can also add a New Dataset

{% hint style="danger" %}
**Warning:** Do not upload any client or customer data to Kaggle, even if you are only sampling the data to prototype an algorithm. Kaggle is owned by Google and the data is kept in the cloud. Use Kaggle for learning Python with example data. If you want to prototype with customer data, your best option is a local installation of Python with JupyterLab as the frontend UI. Even then, make sure you know what the governance policies are in your organization or department.
{% endhint %}

## Upload Test Data

{% file src="" %}

To upload the data, click on the "+" on the Homepage and select "New Dataset". Upload the dataset from your local machine and name it.

## Create a New Notebook

Make sure you name the notebook and also add the dataset you uploaded.

Add Python101Data data source along with the CSV data to your notebook.

To find the path to this file, you will need to click through on the datasets and see the path:

Make sure you click on the data source and access the CSV file to get the full path.

{% hint style="warning" %}
**Warning:** As you go through this tutorial, you will need to copy and paste the code line by line into a notebook so that the code works correctly. I have intentionally made it this way so that you do not skip key learning points.
{% endhint %}

## What is Pandas

The word "pandas" originated from "panel data", a term used for describing multi-dimensional data that vary over time. Pandas is perhaps the most important library as far as we are concerned, as it allows for data manipulation and analysis in the Python programming language.

The key features that we will be using:

1. **Data Structures:** Pandas introduces two main data structures, the Series and the DataFrame. A Series is essentially a one-dimensional array with labeled indices, while a DataFrame is a two-dimensional tabular data structure, ***similar to a spreadsheet or a SQL table.***
2. **SQL-Like Operations:**
   1. Much like SQL engines, Pandas provides powerful tools for manipulating and transforming data. You can filter, sort, aggregate, and reshape data easily.
   2. It has functions to handle missing data, duplicate values, and data type conversions, making data-cleaning tasks more manageable.
   3. You can combine multiple data sources through operations like merging, joining, and concatenating DataFrames.

If you need to use Pandas locally, you'll need to install it first. You can install it using the following command in your Python environment:

```bash
pip install pandas
```

{% hint style="info" %}
**Tip:** SQLAlchemy overlaps with Pandas in some functionality but is geared more toward SQL users with object-centric thinking, where SQL constructs such as tables are first-class objects. Even if you love SQL, Pandas is important for you to learn.
{% endhint %}

## Create a DataFrame

Execute the following piece of code and ensure that the CSV file path is specified correctly, as mentioned earlier:

```python
import pandas as pd

# Create a variable by reading the CSV file
# (replace 'Python101Data' with the full path to your CSV file)
data = pd.read_csv('Python101Data')

# Create a DataFrame (read_csv already returns one, so this step is optional)
df = pd.DataFrame(data)

# Print the full contents of the DataFrame
print(df)
```

You should see the results as:

`print(df)` summarizes the results much like a `SELECT *` on a table.
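
To make the distinction between a Series and a DataFrame concrete without needing the CSV file, here is a minimal sketch that builds a DataFrame from a dictionary. The variable names (`people`, `people_df`, `ages`), the column names, and the values are all made up for illustration and are not taken from the tutorial dataset.

```python
import pandas as pd

# Made-up example data: dictionary keys become column names
people = {
    "Name": ["Ana", "Ben", "Chloe"],
    "Age": [34, 27, 45],
    "Gender": ["Female", "Male", "Female"],
}

# Build a two-dimensional DataFrame from the dictionary
people_df = pd.DataFrame(people)

# Selecting a single column returns a one-dimensional Series
ages = people_df["Age"]

print(type(people_df).__name__)  # class name of the table object
print(type(ages).__name__)       # class name of the single column
print(people_df.shape)           # (number of rows, number of columns)
```

Using separate names such as `people_df` keeps the `df` variable from the tutorial untouched if you run this in the same notebook.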

Note the following:

* `import pandas`: This part indicates that you want to use the functionality provided by the Pandas library in your program.
* `as pd`: This part gives an alias to the imported library. By using `pd`, you're saying that instead of typing out "pandas" every time you want to use a Pandas function, you can use the shorter alias "pd."
* `DataFrame`: As mentioned earlier, this is a class provided by the Pandas library that represents a tabular data structure, similar to a table in a database or a spreadsheet.
* `data`: This is a variable that holds the data you want to use to create the DataFrame. In the code above it is the result of `pd.read_csv`, which is already a DataFrame. More generally, the data is often in the form of a dictionary, where the keys represent the column names and the values represent the data for each column, for example keys such as Name, Age, and Gender.
* `print(df)` will display the entire DataFrame by default if it's not too large. However, if the DataFrame is large, Pandas will display a summarized view, showing only a subset of rows and columns with an ellipsis (`...`) to indicate that there's more data not shown.

## Show Statistics

Let us now execute:

```python
print(df.describe())
```

The only column that will have statistics is the `id` column. `df.describe()` generates statistics on the numerical columns of the DataFrame. This is very similar to the `ANALYZE TABLE` command for computing statistics in Data Distiller.

* **`count`:** The number of non-null values in each column.
* **`mean`:** The mean (average) value of each column.
* **`std`:** The standard deviation, which measures the dispersion of values around the mean.
* **`min`:** The minimum value in each column.
* **`25%`:** The 25th percentile (also known as the first quartile).
* **`50%`:** The 50th percentile (also known as the median).
* **`75%`:** The 75th percentile (also known as the third quartile).
* **`max`:** The maximum value in each column.

## Preview the DataFrame

Let us try to preview the first 10 rows by executing:

```python
print(df.head(10))
```

Preview the first 10 rows. We can change the parameter from 10 to a higher or lower number.
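
The statistics and preview steps can also be tried on a small hand-made table. The sketch below assumes a made-up DataFrame named `sample_df` with an `id` and a `gender` column; it is not the tutorial dataset. It shows that `describe()` skips non-numeric columns unless asked otherwise, and that `tail()` mirrors `head()`.

```python
import pandas as pd

# Made-up data: one numeric column and one text column
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Female", "Male", "Female", "Male", "Female"],
})

# describe() covers only numeric columns by default, so 'gender' is skipped
stats = sample_df.describe()
print(stats)

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
print(sample_df.describe(include="all"))

# head(n) returns the first n rows; tail(n) returns the last n rows
print(sample_df.head(2))
print(sample_df.tail(2))
```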

## Aggregation Functions

Let us count the number of each gender type in the population:

```python
grouped_gender = df.groupby('gender').count()
print(grouped_gender)
```

Remember that `grouped_gender` is a DataFrame. When you use the `groupby()` function and then apply an aggregation function like `count()`, it returns a DataFrame with the count of non-null values in each remaining column for each gender type. The above code is very similar to an aggregation `COUNT` with `GROUP BY` in SQL. The answer that you will get should look like this:

Various types of gender present in the dataset.
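
The grouping above can be tried on a small made-up table to see what `count()` actually counts. In this sketch, `sample_df` and its values are assumptions for illustration; one `id` is deliberately missing to show that `count()` skips nulls (like `COUNT(id)` in SQL), while `size()` counts rows (like `COUNT(*)`).

```python
import pandas as pd

# Made-up data; the None in 'id' shows how count() treats missing values
sample_df = pd.DataFrame({
    "id": [1, 2, None, 4, 5],
    "gender": ["Female", "Male", "Female", "Male", "Female"],
})

# count() returns a DataFrame with non-null counts for every other column
counts = sample_df.groupby("gender").count()
print(counts)

# size() returns a Series with the number of rows per group, nulls included
sizes = sample_df.groupby("gender").size()
print(sizes)
```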

Other functions that you could have used in place of `count()` are `sum()`, `mean()`, `std()`, `var()`, `min()`, `max()`, and `median()`.

## Define a Function

Let us create a function that computes the percentage of total for all the gender types:

```python
# Define the function
def percent_of_total(column):
    return 100 * column / column.sum()

# Apply the function to the grouped counts
percent_of_total_gender = percent_of_total(grouped_gender)
print(percent_of_total_gender)
```

Note the following:

1. The hash sign `#` is used to create comments.
2. Note that the `def` line ends with a colon (`:`), not a semicolon.
3. `return` should be indented properly.
4. The function `percent_of_total` is applied to each individual element in the column.
5. `percent_of_total_gender` is also a DataFrame as will be obvious from the answers below.

The answers you will get will look like this:

To just retrieve a single column, let us use:

```python
print(percent_of_total_gender['id'])
```

This gives:

Alternatively, we could have also created a Series object instead of a DataFrame for `percent_of_total_gender`:

```python
percent_of_total_gender = percent_of_total(grouped_gender['id'])
print(percent_of_total_gender)
```

And that would give us the exact same answer. Let us persist these results, which are a Series object, into a new DataFrame:

```python
percent_of_total_df = percent_of_total_gender.to_frame(name='Percentage')
print(percent_of_total_df)
```

Results are:

New DataFrame object created from the Series object.
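
The whole percent-of-total flow can be checked end to end on a table small enough to verify by hand. This is a sketch on made-up data: `sample_df`, `gender_counts`, `gender_percent`, and `gender_percent_df` are illustrative names, and with three of five rows labeled Female the percentages should come out to 60 and 40.

```python
import pandas as pd

# Made-up data: 3 Female rows and 2 Male rows
sample_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Female", "Male", "Female", "Male", "Female"],
})

def percent_of_total(column):
    # Division is vectorized: every element is divided by the column total
    return 100 * column / column.sum()

# Series of counts per gender (selecting 'id' before counting gives a Series)
gender_counts = sample_df.groupby("gender")["id"].count()

# Series of percentages, then converted to a one-column DataFrame
gender_percent = percent_of_total(gender_counts)
gender_percent_df = gender_percent.to_frame(name="Percentage")
print(gender_percent_df)
```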

## Generate a Randomized Yearly Purchase Column

We are going to emulate the random number generation as in the example here:

{% content-ref url="broken-reference" %}
[Broken link](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-8-data-distiller-statistics-and-machine-learning/broken-reference)
{% endcontent-ref %}

Also, let us take this new column and add it to the DataFrame. Let us execute this code:

```python
import random

# Function to generate random purchases
# (apply() calls it once per row; the value passed in is not used)
def generate_random_purchases(column):
    return random.randint(1000, 10000)  # Modify the range as needed

# Apply the function to generate purchases for each row
df['YearlyPurchases'] = df['id'].apply(generate_random_purchases)
print(df)
```

The results show that a new column was added:

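Because the purchases are random, every run of the code above produces different values. If you need the same numbers on each run, one option is to seed the generator first. The sketch below assumes a made-up `sample_df` and an arbitrary seed value of 23; the names `first_run` and `second_run` are illustrative only.

```python
import random
import pandas as pd

# Made-up data standing in for the tutorial DataFrame
sample_df = pd.DataFrame({"id": [1, 2, 3, 4, 5]})

def generate_random_purchases(_value):
    # The incoming value is ignored; it only triggers one call per row
    return random.randint(1000, 10000)

# Seeding before each run makes the generated column repeatable
random.seed(23)
first_run = sample_df["id"].apply(generate_random_purchases)

random.seed(23)
second_run = sample_df["id"].apply(generate_random_purchases)

print(first_run.equals(second_run))  # both runs used the same seed
```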
To learn more about the `random` library, read [this](#appendix-random-library).

## Visualize the Results

Let us make our first foray into visualizing the histogram:

```python
import matplotlib.pyplot as viz

viz.hist(df['gender'], bins=10, edgecolor='black')
viz.xlabel('Gender')
viz.ylabel('Frequency')
viz.title('Histogram of Gender')
viz.show()
```

`matplotlib.pyplot` is the plotting interface of Matplotlib, a visualization library in Python. It is unfortunate that the name sounds very similar to MATLAB, which also has plot commands. To plot a chart like the histogram, you can use this [site](https://matplotlib.org/stable/api/pyplot_summary.html#binned) as a reference. The code follows the same pattern we used for creating a DataFrame: you first initialize a handle on a library and then access the functions within that library. The function names should be self-explanatory as to what they do. The results look like this:

The histogram looks messy, so let us clean this up:

```python
import matplotlib.pyplot as viz

viz.hist(df['gender'], bins=10, edgecolor='black')
viz.xlabel('Gender')
viz.ylabel('Frequency')
viz.title('Histogram of Gender')
viz.tick_params(axis='x', rotation=45)
viz.tight_layout()
viz.show()
```

We added two extra functions:

* `viz.tick_params(axis='x', rotation=45)` rotates the x-labels by 45 degrees.
* `viz.tight_layout()` improves the spacing and layout of the plot elements to avoid overlapping.

## (Extra Credit) Advanced Visualizations

There is one last thing we want to do here. What if we wanted to plot a histogram and a bar graph together at the same time? If you have ever used MATLAB, the following code will seem familiar:

```python
# Create a 1x2 grid of plots
fig, axes = viz.subplots(1, 2, figsize=(12, 5))

# Histogram Plot
axes[0].hist(df['gender'], bins=10, edgecolor='black', color='skyblue')
axes[0].set_xlabel('Gender')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Histogram of Gender')
axes[0].tick_params(axis='x', rotation=45)

# Bar Plot
gender_counts = df['gender'].value_counts()
axes[1].bar(gender_counts.index, gender_counts.values, color='salmon')
axes[1].set_xlabel('Gender')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Bar Plot of Gender')
axes[1].tick_params(axis='x', rotation=45)

# Adjust layout for better spacing
fig.tight_layout()

# Display
viz.show()
```

The results will look like:

Plotting two different visualizations side by side.
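
If you want to try the side-by-side layout without the tutorial dataset, here is a self-contained sketch on made-up data. The `sample_df` values and the choice of a vertical and a horizontal bar chart are assumptions for illustration. It also prints the shape of `axes`, which is why the code indexes it as `axes[0]` and `axes[1]`.

```python
import matplotlib.pyplot as viz
import pandas as pd

# Made-up categorical data
sample_df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Non-binary"],
})
gender_counts = sample_df["gender"].value_counts()

# One row, two columns: 'axes' holds one Axes object per subplot
fig, axes = viz.subplots(1, 2, figsize=(12, 5))
print(axes.shape)

# Left: vertical bars; right: horizontal bars of the same counts
axes[0].bar(gender_counts.index, gender_counts.values, color="skyblue")
axes[0].set_title("Vertical bars")
axes[1].barh(gender_counts.index, gender_counts.values, color="salmon")
axes[1].set_title("Horizontal bars")

# Adjust spacing for the whole figure, then display it
fig.tight_layout()
viz.show()
```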

Note the following in the code:

1. The heart of this code is `fig, axes = viz.subplots(1, 2, figsize=(12, 5))`. Much like MATLAB, this function call creates a grid of subplots in a single figure.
   * `1`: The number of rows in the grid.
   * `2`: The number of columns in the grid.
   * `figsize=(12, 5)`: This specifies the size of the entire figure in inches. `(12, 5)` means the figure will be 12 inches wide and 5 inches tall.
   * The function returns two objects:
     * `fig`: The figure object, which represents the entire figure.
     * `axes`: An array of subplot axes. In this case, with one row and two columns, it is a one-dimensional array of two axes, which is why the code indexes it as `axes[0]` and `axes[1]`.
2. `fig.tight_layout()` is applied at the level of the entire figure rather than to individual charts. That is how this library has been designed.

## (Extra Credit) Exploring the `random` library

Generating random data is a good skill to acquire, especially in the world of data science. The `random` library in Python is a built-in module that provides functions for generating random numbers and performing various random operations. Here are some of the key functions provided by the `random` library:

1. **Generating Random Numbers:**
   * `random.random()`: Generates a random float between 0 (inclusive) and 1 (exclusive).
   * `random.randint(a, b)`: Generates a random integer between `a` and `b` (inclusive).
   * `random.uniform(a, b)`: Generates a random float between `a` and `b`.
2. **Generating Random Sequences:**
   * `random.choice(sequence)`: Returns a random element from the given sequence.
   * `random.sample(sequence, k)`: Returns a list of `k` unique random elements from the sequence.
   * `random.shuffle(sequence)`: Shuffles the elements in the sequence randomly, in place.
3. **Random Selection:**
   * `random.choices(population, weights=None, k=1)`: Returns a list of `k` elements randomly selected from the population, possibly with specified weights.
4. **Randomness Simulation:**
   * `random.seed(a=None)`: Initializes the random number generator with a seed.
     Providing the same seed will produce the same sequence of random numbers.
   * `random.random()`

These functions generate pseudo-random numbers, which appear random but are actually determined by an initial state (seed) of the random number generator.

Here is some example code to try out:

{% code overflow="wrap" %}
```python
import random

libraries = ["NumPy", "Pandas", "Matplotlib", "TensorFlow", "Scikit-learn", "PyTorch"]

# Choose a random library from the list
random_library_choice = random.choice(libraries)

# Choose 2 random libraries without replacement (no duplicates)
random_library_sequence = random.sample(libraries, 2)

# Shuffle the list of libraries in place
random.shuffle(libraries)
random_library_shuffle = libraries

print("Randomly selected library:", random_library_choice)
print("Randomly selected sequence of libraries:", random_library_sequence)
print("Shuffled list of libraries:", random_library_shuffle)

# Set the seed for reproducibility
seed_value = 23

# Initialize the random number generator
random.seed(seed_value)

# Generate 9 random floats between 0 and 1 (range(1, 10, 1) yields 1 through 9)
random_numbers = [random.random() for i in range(1, 10, 1)]

# Print the values
print("Random numbers generated with seed", seed_value, ":", random_numbers)
```
{% endcode %}

Remember the syntax for the `for` loop:

```python
iterable = ["a", "b", "c"]  # Any list, range, or other iterable works here

for element in iterable:
    # Code block to be executed for each element
    # Indentation is crucial in Python to define the block of code inside the loop
    print(element)
```

You can also use a `for` loop with the `range()` function to iterate over a sequence of numbers:

```python
for number in range(1, 10, 1):
    print(number)  # Code to be executed in each iteration

# The starting value of the sequence is 1 and it will be included
# The ending value of the sequence is 10 and it will be excluded
# The last argument is the step size between numbers. It's optional; the default step is 1.
```

## Appendix

### Other Important Python Libraries

1. **Scientific**
   1. **NumPy**: Similar to MATLAB. A fundamental package for scientific computing with Python. It provides support for arrays and matrices, along with mathematical functions to operate on these structures efficiently.
2. **Machine Learning**
   * **Scikit-learn**: A machine learning library that provides simple and efficient tools for data mining and data analysis. It includes a wide variety of machine-learning algorithms and tools for tasks like classification, regression, clustering, and more.
   * **TensorFlow**: An open-source machine learning framework developed by Google. It's widely used for building and training deep learning models, especially neural networks.
   * **PyTorch**: Another popular open-source machine learning framework, developed by Facebook's AI Research lab. It's known for its dynamic computation graph and ease of use in building neural networks. It is very popular in the research community.
3. **SQLAlchemy**: A SQL toolkit and Object-Relational Mapping (ORM) library for Python. It simplifies database interactions and allows you to work with databases in a more Pythonic way. This is required for Data Distiller.
4. **Requests**: A simple and elegant HTTP library for making HTTP requests to interact with web services and APIs. This is useful for working with Adobe Experience Platform APIs.

### Download Notebook Code

{% file src="" %}