Delete files from Azure Data Lake

Posted in Data & Business Intelligence

In today’s post we will describe how to delete old files from Azure Data Lake.

We use Azure Data Lake to store raw (unprocessed) data and as an archive for our data warehouse (processed data).

Once the raw data has been processed we no longer use it, so we decided to delete old files.

The directory structure is simple: a “processed directory” contains many sub-directories, each of which contains JSON files.

The script we used scans all the directories under the “processed directory” (see the parameters below) and, for each file, decides whether to delete it based on its last write time and a time-period parameter.

Parameters:

  • subscriptionName – If the script runs from a different subscription than the Azure Data Lake, set this parameter to the Data Lake’s subscription. Some of the commands we use have no option to specify the subscription, so we call Select-AzureRmSubscription -SubscriptionName to set the default Azure subscription.
  • Cred – We use Get-AutomationPSCredential, so you’ll need to change ‘Automation’ to the name of your own credential.
  • dataLakeAccountName – The name of the Data Lake store.
  • dls_path – The path inside the Data Lake you wish to delete files from.
  • limit – The time period: files whose last write was more than limit days ago will be deleted.
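
The flow described above can be sketched roughly as follows. This is a minimal sketch, not the original script: it assumes the AzureRM.DataLakeStore module is available, that the session is already logged in, that the path has no trailing slash, and that limit is given as a number of days. The function names Test-ExpiredFile and Remove-OldDataLakeFile are hypothetical and exist only for illustration.

```powershell
# Sketch only: assumes the AzureRM.DataLakeStore module is installed and the
# session is already logged in (for example via Add-AzureRmAccount).

# Hypothetical helper: $true when the item is a file last written before $Limit.
function Test-ExpiredFile {
    param(
        [Parameter(Mandatory)] $Item,
        [Parameter(Mandatory)] [datetime] $Limit
    )
    return ("$($Item.Type)" -ceq 'FILE') -and ($Item.LastWriteTime -lt $Limit)
}

# Hypothetical wrapper around the scan-and-delete loop described above.
function Remove-OldDataLakeFile {
    param(
        [Parameter(Mandatory)] [string] $SubscriptionName,
        [Parameter(Mandatory)] [string] $DataLakeAccountName,
        [Parameter(Mandatory)] [string] $DlsPath,   # assumed to have no trailing slash
        [int] $LimitDays = 90                       # assumed default, adjust as needed
    )

    # Point the session at the subscription that owns the Data Lake account.
    $null = Select-AzureRmSubscription -SubscriptionName $SubscriptionName

    $limit = (Get-Date).AddDays(-$LimitDays)

    # First level: the sub-directories of the processed directory.
    $folders = Get-AzureRmDataLakeStoreChildItem -Account $DataLakeAccountName -Path $DlsPath |
        Where-Object { "$($_.Type)" -ceq 'DIRECTORY' }

    foreach ($folder in $folders) {
        $folderPath = "$DlsPath/$($folder.Name)"

        # Second level: the files inside each sub-directory.
        $files = Get-AzureRmDataLakeStoreChildItem -Account $DataLakeAccountName -Path $folderPath

        foreach ($file in $files) {
            if (Test-ExpiredFile -Item $file -Limit $limit) {
                Write-Output "$($file.Name) was last written before $limit and will be deleted"
                Remove-AzureRmDataLakeStoreItem -Account $DataLakeAccountName -Paths "$folderPath/$($file.Name)" -Force
            }
        }
    }
}
```

Keeping the age check in its own small function makes it possible to try the filter on sample objects before pointing the loop at a real Data Lake account. Because Remove-AzureRmDataLakeStoreItem is called with -Force, it may be worth commenting that line out on a first run and reviewing the Write-Output messages.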

9 thoughts on “Delete files from Azure Data Lake”

  1. Thanks Neta Iser,
    I created a new runbook of type ‘PowerShell Workflow Runbook’. After making some changes to the script above, I tried to test it from the Test pane, but it shows the error below. Could you please help me understand what it means? (I am really new to PowerShell.)
    Error message:
    BadRequest: The Runbook definition is invalid. Case-insensitive switch statements are not supported in a Windows PowerShell Workflow. Supply the -CaseSensitive flag, and ensure that case clauses are written appropriately. To write a case-insensitive case statement, first convert the input to either uppercase or lowercase, and update the case clauses to match.

    1. Hi Shekar,

      The script is written as a regular Runbook and it appears you are trying to use it as a Workflow.
      There are some syntax adjustments that might be required to do this.
      This specific error you are experiencing requires adding the -CaseSensitive flag after the switch statement in line #25.
      If you want to share the modified script it will be easier to assist you with this.

      Regards,
      Dima

      1. Below is my script; please review and correct the errors.

        # Deletes Azure Data Lake old files
        #General configuration
        $subscriptionName = 'MySubscriptionName'
        #Login
        #$Cred = Get-AutomationPSCredential -Name 'RemoveFiles'
        #$null = Add-AzureRmAccount -Credential $Cred
        Select-AzureSubscription -SubscriptionName $subscriptionName | Out-Null

        #Data lake configuration
        $dataLakeAccountName = 'MyDataLakeName'
        $dls_path = "folderPath"
        $limit = (Get-Date).AddDays(-90)

        #Get all sub-directories folders
        #$done_folders = Get-AzureRmDataLakeStoreChildItem -Account $dataLakeAccountName -Path $dls_path (Get-AzureRmDataLakeStoreChildItem cmdlets is not supporting so i commented)

        $done_folders = Get-AzureStorageAccount -StorageAccountName $dataLakeAccountName -Path $dls_path (Error: Get-AzureStorageAccount : A parameter cannot be found that matches parameter name 'Path'.)

        foreach($done_folder in $done_folders)
        {
            switch($done_folder.Type)
            {
                "DIRECTORY"
                {
                    $folderName = $done_folder.Name

                    #get all files in each directory folder and check last modified time and delete old files
                    #$done_files = Get-AzureRmDataLakeStoreChildItem -Account $dataLakeStoreAccountName -Path($dls_path + "/" + $folderName)

                    $done_files = Get-AzureStorageAccount -StorageAccountName $dataLakeStoreAccountName -Path($dls_path + "/" + $folderName)
                    foreach($file in $done_files)
                    {
                        switch($file.Type)
                        {
                            "FILE"
                            {
                                $fileName = $file.Name
                                $fileLastModTime = $file.LastWriteTime
                                if($file.LastWriteTime -lt $limit)
                                {
                                    Write-Output "$fileName is before $limit and will be deleted"
                                    Remove-AzureRmDataLakeStoreItem -AccountName $dataLakeAccountName -Path($dls_path + "/" + $folderName + "/" + $fileName) -Force
                                }
                            }
                        }
                    }
                    Write-Output "Deleted all file older than $limit days"
                }
            }
        }

          1. Yes, the path is copied from the Data Lake store folder properties, but it still shows the error below.
            Error:
            Get-AzureRmDataLakeStoreChildItem : The term ‘Get-AzureRmDataLakeStoreChildItem’ is not recognized as the name of a
            cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was included, verify
            that the path is correct and try again.
            At line:17 char:17
            + $done_folders = Get-AzureRmDataLakeStoreChildItem -Account $dataLakeA …
            + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            + CategoryInfo : ObjectNotFound: (Get-AzureRmDataLakeStoreChildItem:String) [], CommandNotFoundException
            + FullyQualifiedErrorId : CommandNotFoundException

  2. Thanks Dima,
    The issue above was resolved when I ran this script in a PowerShell Runbook.
    But when I click the Start button to execute it, it shows the error below.

    Select-AzureSubscription : The subscription name doesn’t exist.
    Parameter name: name
    At line:7 char:1
    + Select-AzureSubscription -SubscriptionName $subscriptionName #| Out-N …
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : CloseError: (:) [Select-AzureSubscription], ArgumentException
    + FullyQualifiedErrorId : Microsoft.WindowsAzure.Commands.Profile.SelectAzureSubscriptionCommand

    Get-AzureStorageAccount : A parameter cannot be found that matches parameter name ‘Path’.
    At line:15 char:82
    + … StorageAccount -StorageAccountName $dataLakeAccountName -Path $dls_pa …
    + ~~~~~
    + CategoryInfo : InvalidArgument: (:) [Get-AzureStorageAccount], ParameterBindingException
    + FullyQualifiedErrorId :
    NamedParameterNotFound,Microsoft.WindowsAzure.Commands.ServiceManagement.StorageServices.GetAzureStorageAccountCommand

    Could you please help me with this?

    Thanks,
    Shekar
