In this blog I intend to explain the process that runs from a backup in Azure until the report of those backups in Power BI is generated.
For this to happen, the corresponding data of each one of the Backups must be treated by an ETL tool (Extract, Transform and Load) of Pentaho called Data Integration (PDI). This tool will collect the data from the backups of the different sources, transform and adapt them to later store them in their corresponding tables inside a common database that will feed the data shown in the Power BI reports. This Database is called SQLR.
Azure: It has two backup data sources.
- Data extracted through REST services from Azure’s own API.
- CSV files generated by a script created by Infrastructure partners with information on Virtual Machine Backups, Databases and Logs that can not be extracted from REST services. This information is extracted from the Azure Database.
These data sources will be collected by an ETL process created for each one and will be transformed to be saved in the SQLR database in their corresponding tables.
At the end of the data load in our SQLR database, a report will be generated in a PDF format. This PDF will consist of several tables that group the results of the backups by virtual machines, databases and LOGS. Each of them will be grouped, in turn, by each of the environments (PRE and PRO).
Finally, a report will be built for each data source in Power BI with the Power BI Desktop tool, which will access the data stored in the SQLR database through a Datasource. Once the report has been prepared, it will be published on the https://app.powerbi.com page in its corresponding group.
A periodic update of the report data can be scheduled from the Power BI administrator. To do this, a gateway must be configured on the SQLR machine so that Power BI has external access to the data of that database and thus the content of the reports can be updated.
Backups data processing
To obtain the detail of the backups made in Azure, we will create an ETL process with the Pentaho Data Integration tool.
The overall scheme of the process will be as follows:
This process will be composed of a set of steps. The first of them will define the variables with the data that will make up the header of the successive calls to the Azure REST API.
After defining this input data, it will proceed to log in to the Microsoft page invoking the REST service:
This service will provide us with an access token to Azure Rest services. This token will be used for the subsequent invocations to the other Rest services that will be needed during the process.
Each result of each rest call will be collected in Json format and transformed, during a later step, in variables to be able to be transformed and used in the other steps.
After obtaining the Azure access token, we will proceed to obtain all the Recovery Services from the client with whom we initially logarize. For this we will invoke the following Rest service:
From each Recovery Service obtained we have to extract two types of information.
These are the steps to obtain both data flows:
- On the one hand, you have to extract the Backup Items to know the number of restoration points of each one.
For this we will invoke two Rest services consecutively:
- First we obtain the Backup Protected Items with the following call:
- Click here
- It will be filtered so that the Virtual Machines type backups (AzureIaasVM) are obtained
- Subsequently, for each Backup Item, its detailed information will be extracted, among which is the number of Restoration Points.
- Click here
- On the other hand, it is necessary to extract the Backup Jobs of each Recovery Service with the objective of obtaining the rest of the data that we need to know about each Backup. Such as the state, the duration the start time …
- Click here
- This Rest call includes a filter of all the Backup Jobs started and finished during the 2 days prior to the execution of the ETL process in order to cover the Jobs that could not be completed during the same day.
This information obtained separately is subsequently unified from a field called Friendly Name that is used as an Alias for each Backup Item. In this way we will have all the information obtained from the Backup Jobs and the Backup Items in the same data flow.
With this unified information we will invoke a last Rest service that will serve us to obtain the data that we needed, the Backup size (Backup Size)
In the last phase of the etl process, the date of the last successful Backup will be obtained.
For this we will check if the Jobs backup that is being collected has been completed correctly. If so, the end date of this Backup will be the date of the last successful Backup.
If not, we will look in the Azure_Backup_Jobs table where we store the data collected from the Rest services the date of the last backup performed correctly in previous executions.
Now with all the information we have collected, it will be saved in the SQLR Database, inside the Azure_Backup_Jobs table, which will consume the report created in Power BI for Azure.
Azure csv (BDD origin)
There is some information that is not provided by the Rest of Azure services and that are needed to add to the Power BI report. This information remains stored in the database used by Azure and is extracted by SQL Server statements. The result of these statements is saved in csv files that are copied to a shared folder included in the SQLR machine.
The ETL process in charge of processing this information will have as data source those csv files located in the shared folder, which is a local folder for the ETL process.
The files will contain the following information distributed in the following header:
- Backup_set_id: Backup identifier. It is an internal code to uniquely identify the backup referenced in that row.
- AlmacenJob: Name of the Recovery Services store.
- EntityName: Name of the entity. It would be the alias of the Backup that I mentioned in the previous process.
- Operation: Operation carried out
- Status: Backup Status
- StartTime: Start date and time.
- Duration: End date and time.
- ProcessingRate: Processing rate.
- BackupSize: Backup size.
- RecoveryPoints: Number of restoration points.
- LastSuccessBackup: Date of last Backup made correctly.
These files will be of two types. One will contain the Backups of Virtual Machines and the other type of files will contain the Backups of the Logs.
The file name of the Virtual Machine Backups will have the following format:
- <NOMBRE SERVIDOR>-BCK-<FULL-DIFF>-<FECHA YYYYMMDD>.csv
The log backup files will be named with the following format:
- <NOMBRE SERVIDOR>-BCK-<LOGS>-<FECHA YYYYMMDD>.csv
In addition, the content of these files will consist of the backups made the day before and indicated in the name of the file.
The loading process will read the name of the files and extract the information of the server name, the type of backup and the date of extraction of that data. In addition, it will classify the content depending on the type of file it is.
With the information obtained from the file name, the process will remove from the table Azure_Backup_Jobs the records that match the server name and date indicated in the file name, since the values of the file will be inserted if it is being processed in a later step.
This is done in this way, in the case a file of a server and a specific date has been previously processed and reprocessed again with the same or different data.
After some transformation and adaptation performed on any of the collected values, the information is saved in the Azure_Backup_Jobs table.
At the end of the loading of the information about the Azure Backups in our Database, a Backup Summary report will be generated. This report will be sent by email to the client so that he can review the results of the backups of the previous day instantly.
This is done with the intervention of two elements. On the one hand, an ETL process that extends the information from our newly updated SQLR database and, on the other hand, a Java library that we have created that will be responsible for the generation of the report in PDF format.
This Java library will receive all the information extracted about the Azure Backups made the day before the generation of the report in JSON format. This JSON chain will be structured in a summary of totals and in an ordered list of the backups made.
The resulting JSON string will be sent as a parameter to our Java library.
As can be seen, in this step of the ETL process, the constructPdf method of our Java library will be invoked and the path where the PDF will be generated will be passed as parameters, the name it will have (since it depends on the date of execution) and the variable that will contain the JSON chain with the bakcups results. The library will be located in the libext folder of the Pentaho Data Integration installation path.
The JSON string will be parsed to Java objects and the PDF object will be generated, to which both a summary table and each of the sections that make up the report will be added:
- Virtual machines
As a result, a string with content in JSON format will be returned, which will indicate the final result of the generation of the report and the final route of the generated PDF report.
With this information, this report will be sent by email to the target customers that we have configured.
The final report will look like this.
With all the information centralized in the same SQLR database, there will be a report made in Power BI that consumes them and shows them in dashboards with filters and tables within their platform.
To create these reports and publish them in the Microsoft Power BI application (https://app.powerbi.com ) we will use the “Power BI Desktop” application.
As I mentioned before, each report will consume the data from the tables created in the sqlr database. For this to be possible, a data source must be created, which in this case will be SQL-Server. It will also be necessary to configure the connection data with said database.
We will select the tables that contain the data that we want to show in the report:
And it will show us a list of columns and contained data to be able to operate with them and in this way to be able to elaborate the report.
With this information you can start to configure the report. To do this, elements in it will be added, whether they are table, filters, graphics …
Each element that we incorporate into the report will show some values depending on the field of the table that we select in the data source.
To which you can apply some filters, add new columns, add measures calculated from the values of the existing fields …
In the Azure example you can see that there are 2 filters and a table, but you can add other types such as diagrams, graphs, etc …
Once the report has been created or modified, it must be published so that the client can access it from the Power BI web application. To do this we will click on the Publish icon located in the toolbar:
We will select in destination the work area in which the report will be located. And in the Power BI web application the report will have been added / updated.
For the data to be synchronized with the execution of the ETL processes that feed the database and tables, a periodic update of the report data must be scheduled.
This is achieved by accessing the Data Set options from the Power BI web application panel.
Prior to this, to make the update work properly, a gateway must be configured for the Power BI web application to access the report data sources.