# EXPLORE 101: Exploring Ingested Batches in a Dataset with Data Distiller

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FZU8dzfjIRjs2VrX1z1Mj%2FScreen%20Shot%202023-06-22%20at%203.14.56%20PM.png?alt=media&#x26;token=0c3a093f-5324-434e-8146-89e5ed61c185" alt=""><figcaption></figcaption></figure>

## Data Ingestion Primer

One of the key tasks you will need to perform at some point is verifying and validating the records within a batch that has been successfully ingested into Adobe Experience Platform.&#x20;

Remember that a "batch" is a data ingestion concept: a collection of records, whether arriving in a file or via streaming, is materialized as a single unit on the data lake. In essence, it is a materialization construct used by AEP.&#x20;

Records that are ingested have to pass through several checks before such materialization can take place. This is handled during the mapping part of the data ingestion process. There are several categories of issues that can arise, and you need to be aware of them. They will manifest themselves as error codes if you peek into a dataset.

1. Navigate to the **Datasets** pane and if you are unlucky, click on a batch that has failed:

   <figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FPsRJXJQ4hHqxopJugJSA%2FScreen%20Shot%202023-06-02%20at%208.54.27%20AM.png?alt=media&#x26;token=f3ceceec-2a71-4f61-b886-b47eda5bbf51" alt=""><figcaption></figcaption></figure>
2. You may see a set of errors that look like this:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2Fwz5Rd9WgAnHT6U3YiSUV%2FScreen%20Shot%202023-06-02%20at%208.51.10%20AM.png?alt=media&#x26;token=2ee03002-d94e-43f5-9186-db5aeb494bff" alt=""><figcaption></figcaption></figure>

Some bad things have happened to our data ingestion. Let us understand the error codes:

1. **ERROR:** These are the most egregious errors, where the data was corrupted or did not conform to the expected format. Such failures are serious, and the entire batch will fail.&#x20;
2. **DCVS:** Not seen in the above example, these are less serious than data corruption issues, such as a missing required field. All such rows are simply skipped. A separate dataset containing these records is NOT available on the data lake; they are kept in a separate location and accessible through the error diagnostics tools (UI or API). The practical reality is that if those skipped records are critical for your use case, you will need to surgically identify them in the source system and re-ingest the data.&#x20;
3. **MAPPER:** These appear to be the least harmful of the three, but you need to pay attention to them because these rows make it into the final dataset BUT the data may have been altered in the process. The mapping process tries to convert the string data at its input to the output data type. When it cannot do so because of a malformed string, it will write NULLs into the result. If you were not paying attention, you now have a field full of NULLs that you possibly could have rectified. Thus, batches with MAPPER warnings become good candidates for some data exploration to see what is going on.
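A quick way to hunt for mapper-introduced NULLs is to count rows where a typed field came out empty. This is a sketch against the `movie_data` dataset used later in this tutorial; `release_year` is a hypothetical numeric column standing in for whichever typed field your MAPPER warnings point at:

```sql
-- Count rows where a typed field is NULL after mapping.
-- 'release_year' is a hypothetical column; substitute the field
-- named in your MAPPER warning.
SELECT count(*) AS null_rows
FROM movie_data
WHERE release_year IS NULL;
```

If the count is non-zero, compare those rows against the source file to decide whether the original strings can be cleaned up and re-ingested.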

## Accessing Dataset Batch Metadata

In order to see what system fields are available in the dataset, set the following in a session:

```sql
set drop_system_columns=false;
```

```sql
select * from movie_data
```

By doing so, you will see two new columns appear to the far right: `_acp_system_metadata` and `_ACP_BATCHID`.

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2Fd3A4C0GuE6oRwxYgKFCz%2FScreen%20Shot%202023-06-26%20at%206.50.40%20PM.png?alt=media&#x26;token=34a5c901-8d77-4006-b568-34e78a2ca7cc" alt=""><figcaption><p>movie_data table has now metadata columns available.</p></figcaption></figure>

As data gets ingested into the platform, a logical partition is assigned to the data based on what is arriving at the input. `_acp_system_metadata.sourceBatchId` identifies the logical (input) partition, and `_ACP_BATCHID` identifies the physical partition created after the data has been mastered into the data lake in Adobe Experience Platform.
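You can inspect both identifiers side by side for a few records. A minimal sketch, assuming `drop_system_columns=false` has already been set in the session:

```sql
-- Compare the logical (input) and physical (mastered) batch IDs.
SELECT _acp_system_metadata.sourceBatchId,
       _ACP_BATCHID
FROM movie_data
LIMIT 10;
```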

Let us execute the following query:

```sql
select  _acp_system_metadata, count(distinct _ACP_BATCHID) from movie_data
group by _acp_system_metadata
```

The results are:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FD23mHGTHUp4RZn77Cpam%2FScreen%20Shot%202023-06-26%20at%206.59.08%20PM.png?alt=media&#x26;token=75e7c598-1c4d-4796-a6c4-cdad4919eb5b" alt=""><figcaption><p>A grouping by on source batches shows the number of output batches</p></figcaption></figure>

This means that the number of batches at the input need not correspond to the number of batches written. In fact, the system decides the most efficient way to batch and master the data onto the data lake. Let me explain this through an example below.

Let's run this on a different dataset below. If you are motivated to follow along, you will need to ingest this data using XDM mapping into Adobe Experience Platform.&#x20;

{% file src="<https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FRBgpXxZOp7A8RqJvM0ez%2FDrug_checkout_data.json?alt=media&token=45ebd328-daa4-4765-9408-d9c56ee493c4>" %}

This file contains a deeply nested set of 35,000 records that look like this:

```sql
select * from drug_orders
```

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FN6VdkbfsxzhJur5JP0Qz%2FScreen%20Shot%202023-06-26%20at%207.03.03%20PM.png?alt=media&#x26;token=d499ddbf-b8d3-4ebe-a389-fba664b05a8e" alt=""><figcaption><p>Preview of the first set of records in the JSON-based drug_orders dataset.</p></figcaption></figure>

Let us generate some batch-based statistics on this dataset:

{% code overflow="wrap" %}

```sql
select  _acp_system_metadata, count(distinct _ACP_BATCHID) as numoutputbatches, count( _ACP_BATCHID) as recordcount from drug_orders
group by _acp_system_metadata
```

{% endcode %}

The answers look like this:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FJOzphsPIe7vdQbgrRXvq%2FScreen%20Shot%202023-06-26%20at%207.02.34%20PM.png?alt=media&#x26;token=108e2035-16db-4335-afbc-db150926e59b" alt=""><figcaption><p>Distribution of how input batches were mastered at a time with record counts.</p></figcaption></figure>

The above shows that I created 3 input batches where I ingested 2000, 24000, and 9000 records each time. However, when they got mastered, there was only one unique batch each time.

{% hint style="info" %}
Remember that all records visible within a dataset are the ones that successfully got ingested. That does not mean that all the records that were sent at the source input are present. You will need to look at the data ingestion failures to find the batches/records that did not make it in.&#x20;
{% endhint %}

## **Querying a Batch in a Dataset**

1. If you want to simulate the creation of a batch, go to the Movie Genre Targeting Example and complete the [**section** ](https://data-distilller.gitbook.io/adobe-data-distiller-guide/unit-2-data-distiller-data-exploration/broken-reference)on ingesting CSV files.
2. If you open up the dataset pane, you will see this:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2Fmd688rvXsuQIpyRCvKkF%2FScreen%20Shot%202023-06-02%20at%2010.03.48%20AM.png?alt=media&#x26;token=7359ee6b-3247-41a0-a738-ca9f247ea9b6" alt=""><figcaption></figcaption></figure>

3. Copy the batch ID by going to the panel on the right:

<figure><img src="https://1899859430-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FEhcgqFIfGdE0GXJzi5yR%2Fuploads%2FfO7X3eVR5SoWXKfSofVY%2FScreen%20Shot%202023-06-02%20at%2010.05.09%20AM.png?alt=media&#x26;token=b6c7bcdc-5b9c-4a60-8fec-10c0905a22e3" alt=""><figcaption></figcaption></figure>
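As an alternative to copying the batch ID from the UI, you can list the physical batches present in the dataset directly with SQL (a sketch using the same `movie_data` dataset):

```sql
-- List the physical batches in the dataset with their record counts.
SELECT _ACP_BATCHID, count(*) AS recordcount
FROM movie_data
GROUP BY _ACP_BATCHID;
```

Pick the batch ID of interest from the result and use it in the filter below.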

4. Now use the following query to retrieve all the records that made it into the dataset as part of that batch:

```sql
select * from movie_data
where _ACP_BATCHID='01H00BKCTCADYRFACAAKJTVQ8P'
LIMIT 1;
```

**\_ACP\_BATCHID** is the keyword we will use to filter on the batch ID. The LIMIT clause is useful if you want to restrict the number of rows displayed; a filter condition is generally more desirable.&#x20;
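To gauge the size of a batch before pulling its rows, a simple count with the same filter works (the batch ID shown is the example value from the query above; substitute your own):

```sql
-- Count how many records landed in a specific batch.
SELECT count(*) AS recordcount
FROM movie_data
WHERE _ACP_BATCHID = '01H00BKCTCADYRFACAAKJTVQ8P';
```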

5. If you execute this query in the Query Editor in Adobe Experience Platform, the results will be truncated at 100 rows; the editor was designed as a quick preview tool. To retrieve up to 50,000 rows, you need to use a third-party tool such as DbVisualizer (my favorite) or DBeaver. Keep in mind that these editor tools are quite capable and mostly free.&#x20;

