Uploading bulk Content Notes/Attachments to Salesforce using Node.js and the Bulk API
Recently I was working on a data migration project where the task was to upload bulk Content Notes from SQL Server to Salesforce (about 390K records) and relate each Content Note to a custom object record using ContentDocumentLink. My first thought was to query each individual Content Note record from SQL Server, insert the Content Note into Salesforce using the REST API with the content in Base64 format, and update the Salesforce Id back on the associated SQL Server Content table record. My second thought was to generate the SQL Server Content Note content as .txt files, prepare a .csv file with links to those files, upload them to Salesforce using the Bulk API (binary attachments), and process the results to get the Salesforce Ids and update them back in SQL Server.
As I needed to upload to two Salesforce full sandbox instances at the same time, I applied one approach to each instance. The first approach is pretty much straightforward: loop through the SQL Server Content records ordered by primary key, insert each record into Salesforce using the REST API, and update the Salesforce Id back in the SQL Server table (this consumes the 24-hour API request allowance, which depends on the type of Salesforce instance). The second approach is trickier: the relationship between the SQL Server primary key and the Salesforce Id has to be maintained externally, since a custom field cannot be added to Content Note to store the primary key, and loading 390K records with the content as files runs into the limit of 1,000 files per batch. As there was no straightforward ETL tool for this, I built a Node.js script to implement the second approach. I used Node.js because it does not need a framework installed on the machine, and it was a chance to sharpen my JavaScript skills. The script orchestrates the process steps below in a synchronous manner.
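Before getting into those steps, here is what the per-record insert of the first (REST) approach boils down to. This is a minimal sketch, not the exact script: the API version, the access token, and how the returned Id gets written back to SQL Server are all placeholders.

```js
const axios = require('axios');

// Insert one Content Note via the REST API; ContentNote.Content expects
// Base64-encoded HTML. Returns the new Salesforce Id so the caller can
// write it back to the SQL Server Content table row.
async function insertContentNote(instanceUrl, accessToken, title, html) {
  const res = await axios.post(
    `${instanceUrl}/services/data/v47.0/sobjects/ContentNote`,   // API version assumed
    { Title: title, Content: Buffer.from(html, 'utf8').toString('base64') },
    { headers: { Authorization: `Bearer ${accessToken}`, 'Content-Type': 'application/json' } }
  );
  return res.data.id;
}
```

Each successful call burns one request out of the 24-hour API allowance, which is what makes this approach slow for 390K records.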
In the first step, the script prepares the zip files needed for the Bulk API. All content files and the request.txt CSV for loading the records are generated in the order of the SQL Server Content Note primary key, so the Salesforce Ids from the results can be mapped straight back to the SQL Server records. Each zip contains at most 1,000 files (content files plus the CSV) to stay within the limit of 1,000 files per batch.
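A sketch of that first step, assuming a ContentNote load with Title and Content columns (the column names are an assumption) and rows already ordered by the SQL Server primary key; the caller slices the rows so that each batch stays under the 1,000-file limit.

```js
const AdmZip = require('adm-zip');

// Build one batch zip: the content files plus the request.txt manifest
// (the 1,000-file batch limit includes the CSV). Numeric file names keep
// the entries well under the 512-byte name limit.
function buildBatchZip(batchRows, batchIndex, outDir) {
  const zip = new AdmZip();
  const csvLines = ['Title,Content'];

  batchRows.forEach((row, i) => {
    const fileName = `${i}.html`;
    zip.addFile(fileName, Buffer.from(row.content, 'utf8'));
    // "#<path inside the zip>" tells the Bulk API to read the field value from that file
    csvLines.push(`"${row.title.replace(/"/g, '""')}",#${fileName}`);
  });

  zip.addFile('request.txt', Buffer.from(csvLines.join('\n'), 'utf8'));
  zip.writeZip(`${outDir}/batch-${batchIndex}.zip`);
}
```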
In the second step, the script submits the Bulk API requests with the zip files prepared in the first step, retrieves the result files in the same order, and stores them on the local drive.
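A sketch of the second step against the classic (XML) Bulk API endpoints. The session id would come from a prior login call, and the API version is an assumption.

```js
const fs = require('fs');
const axios = require('axios');
const xml2js = require('xml2js');

const ASYNC_PATH = 'services/async/47.0';   // API version assumed

// Create a ZIP_CSV insert job for ContentNote and return its job id.
async function createJob(instanceUrl, sessionId) {
  const jobXml = '<?xml version="1.0" encoding="UTF-8"?>' +
    '<jobInfo xmlns="http://www.force.com/2009/06/asyncapi/dataload">' +
    '<operation>insert</operation>' +
    '<object>ContentNote</object>' +
    '<contentType>ZIP_CSV</contentType>' +
    '</jobInfo>';
  const res = await axios.post(`${instanceUrl}/${ASYNC_PATH}/job`, jobXml, {
    headers: { 'X-SFDC-Session': sessionId, 'Content-Type': 'application/xml' }
  });
  return (await xml2js.parseStringPromise(res.data)).jobInfo.id[0];
}

// Upload one prepared zip as a batch of the job and return the batch id.
async function addBatch(instanceUrl, sessionId, jobId, zipPath) {
  const res = await axios.post(
    `${instanceUrl}/${ASYNC_PATH}/job/${jobId}/batch`,
    fs.readFileSync(zipPath),
    {
      headers: { 'X-SFDC-Session': sessionId, 'Content-Type': 'zip/csv' },
      maxBodyLength: Infinity   // allow large request bodies for the zip payload
    }
  );
  return (await xml2js.parseStringPromise(res.data)).batchInfo.id[0];
}
```

Once all batches are added, the job is closed and each batch is polled until it completes; the batch result CSV is then downloaded from /job/{jobId}/batch/{batchId}/result and saved locally in batch order.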
The third step processes the result files and updates the SQL Server Content table with the Salesforce Ids. Steps 2 and 3 are decoupled so that any issue in processing a batch result file does not affect the SQL Server DB updates, and so the result files can be processed more cleanly.
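A sketch of the third step. The result CSV rows come back in request order, so row N of a batch maps to the Nth primary key of that batch; updateContentTable stands in for the actual SQL Server update call.

```js
const fs = require('fs');
const csv = require('csv-parser');

// Walk a saved batch result file (columns: Id, Success, Created, Error) and
// push the Salesforce Ids back to SQL Server in the same order they were sent.
function processResultFile(resultCsvPath, primaryKeysInOrder, updateContentTable) {
  let rowIndex = 0;
  fs.createReadStream(resultCsvPath)
    .pipe(csv())
    .on('data', (row) => {
      const pk = primaryKeysInOrder[rowIndex++];
      if (row.Success === 'true') {
        updateContentTable(pk, row.Id);
      } else {
        console.error(`Record ${pk} failed: ${row.Error}`);
      }
    })
    .on('end', () => console.log(`Processed ${rowIndex} result rows`));
}
```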
The time taken for processing the content notes with synchronous API calls is around 24-30 hours, while the Bulk API binary upload takes around 5-7 hours; both vary based on the size of the content note files.
The same pattern can be adopted for loading Attachments to Salesforce.
Technical gotchas
- Note that the maximum number of Content Notes/ContentVersions (a Content Note internally creates one ContentVersion record in Salesforce) that can be published in 24 hours is generally 200,000 for full sandbox and production instances; it can be increased up to 500,000, which is a hard limit.
- Make sure that \n characters are replaced with <BR>, as Content Notes use <br> for line breaks (see the helper sketch after this list).
- A zip can't be more than 10 MB. If it is, break the batch into smaller sizes; the best way is to group the Content Notes or Attachments by size and divide them accordingly into multiple folders, and into multiple tables in SQL Server, so that the order is maintained (gzip compression is supported).
- The maximum of 1,000 files per batch is inclusive of the request.txt CSV file.
- Even though ZIP_JSON is also allowed for binary attachment Bulk API loads, the content records got created but the content came through blank; I am not sure if it is an issue with Salesforce. ZIP_CSV worked well.
- Name the files in the batch zip with numbers (0 up to the max batch size, plus the extension) so that the file names stay short (there is a limit of 512 bytes for file names).
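A small helper for the line-break gotcha above (a sketch; depending on the note content, other HTML-special characters may also need escaping, which isn't covered here):

```js
// Convert newlines to <br>, since Content Notes render <br> (not \n) as a line break.
function toNoteHtml(rawText) {
  return rawText.replace(/\r?\n/g, '<br>');
}
```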
NPM packages and core modules used in the Node.js script:
- csv-parser (for parsing the DB records as CSV).
- fs (Node.js core module, for reading files from the drive).
- axios (for making API callouts).
- xml2js (for converting the XML responses of Bulk API requests to JSON).
- adm-zip (for preparing the zip files).