AWS Lambda to split a multipage PDF into separate pages

Hello kids! Today we want to walk you through a process we built for a specific task. Since the AWS platform is a common BlueGrid.io playground, that is where the work takes place. The task we recently had to solve was trivial yet interesting: we needed an AWS Lambda function to split a multipage PDF file into separate single-page files.

We started by solving this task locally with a Python script we wrote. The script below is the one we then tested on AWS Lambda to split a multipage PDF file:

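# Legacy PyPDF2 API (PdfFileReader/PdfFileWriter, releases before 3.0)
from PyPDF2 import PdfFileWriter, PdfFileReader

# "document.pdf" is a placeholder name for the multipage input file
inputpdf = PdfFileReader(open("document.pdf", "rb"))

# Write every page of the input into its own single-page PDF
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)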

The local test passed, so the next step was a test run on Lambda. There, however, an error appeared:

[ERROR] OSError: [Errno 30] Read-only file system

Lambda's working directory is /var/task, and because the code above uses a relative path ("document-page%s.pdf"), the open() function tries to create the files there. That directory is read-only, which causes the error above, so we need to update the open() call to create the files in the /tmp directory:

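# Legacy PyPDF2 API (PdfFileReader/PdfFileWriter, releases before 3.0)
from PyPDF2 import PdfFileWriter, PdfFileReader

# "document.pdf" is a placeholder name for the multipage input file
inputpdf = PdfFileReader(open("document.pdf", "rb"))

# Write every page into its own file under /tmp, which is writable on Lambda
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open("/tmp/document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)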
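
To keep the output location in one place, the path handling can be pulled into small helpers. The following is only a sketch using the standard library. The names page_output_path and is_writable are hypothetical and not part of the original script, and the "document-page" prefix simply mirrors the file names used above.

```python
import os
import tempfile


# Hypothetical helper: builds the per-page file name inside a writable
# directory (defaults to /tmp, the writable location on Lambda).
def page_output_path(page_index, base_dir="/tmp", prefix="document-page"):
    return os.path.join(base_dir, "%s%s.pdf" % (prefix, page_index))


# Hypothetical helper: returns True only if a file can really be created
# in `directory`. A read-only or missing directory raises OSError.
def is_writable(directory):
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

With helpers like these, the open() call in the loop could use page_output_path(i) instead of a hard-coded string, and is_writable(".") gives a quick way to confirm whether the current working directory accepts new files before any PDF work starts.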

The /tmp directory is available for the duration of a Lambda function's execution. Lambda reuses the execution environment when possible, and when it does, the contents of /tmp are preserved along with any running processes. However, Lambda doesn't guarantee that an environment will be reused, so the contents of /tmp could disappear at any time.
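
Because files in /tmp may or may not survive between invocations, one option is to give every invocation its own scratch directory and remove it at the end, so leftover pages from an earlier run never mix with new ones. This is a minimal sketch under that assumption. The function run_with_scratch_dir and its work callback are hypothetical names, and the callback stands in for whatever splits the PDF and ships the pages elsewhere.

```python
import tempfile


# Hypothetical helper: creates a fresh directory under `base_dir`, passes
# its path to `work`, and deletes the directory (and everything in it)
# once `work` returns or raises.
def run_with_scratch_dir(work, base_dir="/tmp"):
    with tempfile.TemporaryDirectory(prefix="split-", dir=base_dir) as scratch_dir:
        return work(scratch_dir)
```

Anything that must outlive the invocation has to be copied out of the scratch directory inside the callback, since the directory is gone once the function returns.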

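Another important note is that AWS Lambda limits the compute and storage resources available to a function. By default, /tmp storage is limited to 512 MB; the limit can be raised, up to 10,240 MB, through the function's ephemeral storage setting.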
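
Given that limit, it can help to check the free space before writing the pages. The sketch below uses shutil.disk_usage from the standard library. The helper name has_room is hypothetical, and the sketch assumes that the free space reported for /tmp reflects the storage limit configured for the function. How many bytes the split pages will need is something the caller has to estimate, for example from the size of the input file.

```python
import shutil


# Hypothetical helper: returns True if `directory` reports at least
# `required_bytes` of free space.
def has_room(required_bytes, directory="/tmp"):
    return shutil.disk_usage(directory).free >= required_bytes
```

A function could call has_room(estimated_size) before the split loop and fail early with a clear message instead of hitting an out-of-space error halfway through the document.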