Convert HTML to PDF in Python/Django on a Unix platform





I'm working on a feature that needs to convert a large HTML file (more than 1 MB) to PDF. I've tried the two open-source Python libraries below.
1. xhtml2pdf (Pisa)
2. WeasyPrint



Neither of them solves my problem: each takes around 4-5 minutes to generate a 1 MB PDF file (around 500 pages). That makes my app server's worker process (Gunicorn behind Nginx) go down, and the browser shows a 'GATEWAY TIMEOUT' error. CPU utilization also goes up to 100% while the PDF conversion is running.



Does anybody know which API/library would be best suited to large HTML files?





Possible duplicate of Render HTML to PDF in Django site
– Ajay Singh
Jun 29 at 10:49





I've already used pisa, but the conversion takes around 5 minutes.
– Sachin Chauhan
Jun 29 at 11:14




2 Answers



Generating a 500-page PDF will take time whatever technology you use, so the solution is to send the job to an async task queue (celery, huey, django-queue, ...), possibly with some polling to show a progress bar. Even if you manage to optimize the crap out of the generation process, it will STILL take too much time to fit in an HTTP request/response cycle (from the user's POV, even one minute is already way too long).



NB: having your CPU max out is nothing surprising either. Generating a huge PDF not only takes time, it's also a computation-heavy process, and one that easily eats your memory too. This by itself is another reason to use a distributed task queue, so you can run the process on a distinct node and avoid killing your front server.
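
To illustrate the "submit, then poll" flow described above, here is a minimal sketch using only the standard library. The in-process thread pool is just a stand-in so the example runs on its own: it does not move the load off the web server the way a real task queue on a separate node would. The names render_pdf, submit_job and job_status are hypothetical. With Celery, submit_job would roughly correspond to calling the task's delay() and job_status to inspecting an AsyncResult.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real task queue; assumption made only for this sketch.
executor = ThreadPoolExecutor(max_workers=1)
jobs = {}  # job id -> Future


def render_pdf(html):
    """Hypothetical placeholder for the slow conversion (xhtml2pdf, WeasyPrint, ...)."""
    time.sleep(0.1)  # stands in for minutes of real work
    return b"%PDF-placeholder " + str(len(html)).encode()


def submit_job(html):
    """What the 'start' view would do: queue the work and return immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(render_pdf, html)
    return job_id


def job_status(job_id):
    """What the polling view would return to the browser."""
    future = jobs.get(job_id)
    if future is None:
        return {"state": "unknown"}
    if not future.done():
        return {"state": "pending"}
    if future.exception() is not None:
        return {"state": "failed", "error": str(future.exception())}
    return {"state": "done", "size": len(future.result())}
```

The HTTP request that starts the job returns the job id right away, so Gunicorn and Nginx never wait for the conversion itself. The browser then polls the status view until the state is "done" and downloads the file.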





Thanks @Bruno. We can schedule an async task through Celery, but the problem is that wherever the task executes, it will eat up CPU and memory, so other processes on that server will be affected or, in the worst case, may stop working. Any comment on how to handle CPU utilization, please?
– Sachin Chauhan
Jun 29 at 11:50





Limiting the CPU usage of a process has to be dealt with at the OS level, so you'll have to ask your sysadmin for this part. As for memory, the simplest solution is to run this task on a dedicated (or mostly dedicated) node with enough RAM to handle your biggest expected PDF; RAM is not that expensive nowadays. Oh, and make sure your Celery workers run with max tasks per child set to 1 (the worker_max_tasks_per_child setting, CELERYD_MAX_TASKS_PER_CHILD in older Celery versions) so they release the RAM after each generation.
– bruno desthuilliers
Jun 29 at 12:00
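
As a rough illustration of the OS-level limits mentioned in the comment above, the sketch below starts the conversion as a child process with a lower scheduling priority and, optionally, an address-space cap. It is Unix only, and run_limited and limit_child are hypothetical helper names. Note that a higher niceness only makes the process yield to others when the CPU is contended; it is not a hard cap on CPU usage, which needs something like cgroups set up by the sysadmin.

```python
import os
import resource
import subprocess
import sys


def limit_child(nice_increment=10, max_bytes=None):
    """Build a function that runs in the child just before exec (Unix only)."""
    def apply_limits():
        # Higher niceness means lower CPU scheduling priority.
        os.nice(nice_increment)
        if max_bytes is not None:
            soft, hard = resource.getrlimit(resource.RLIMIT_AS)
            # Never request more than the existing hard limit allows.
            if hard == resource.RLIM_INFINITY:
                new_soft = max_bytes
            else:
                new_soft = min(max_bytes, hard)
            resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return apply_limits


def run_limited(cmd, nice_increment=10, max_bytes=None):
    """Run cmd as a low-priority child process and return the completed process."""
    return subprocess.run(
        cmd,
        preexec_fn=limit_child(nice_increment, max_bytes),
        capture_output=True,
        text=True,
        check=True,
    )
```

Inside a task-queue worker this could wrap whatever command performs the conversion, for example run_limited(["weasyprint", "in.html", "out.pdf"], max_bytes=2 * 1024**3), where the file names and the 2 GiB cap are placeholders.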






ok. Thank you Bruno.
– Sachin Chauhan
Jun 29 at 12:15



It's just a guess, as I've never used it, but I found this answer:
C++ Library to Convert HTML to PDF?
As far as I know, Cython can be used to combine C/C++ with Python.
That would probably speed things up.



Otherwise you would need to either break it into small pieces and merge them, or raise the timeout settings in the components responsible for the request, which has to be done on both sides, server and client. I guess you would need to calculate the timeout dynamically from the file size and the time needed, which doesn't sound like the best decision to me, but just in case...
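
The split-and-merge idea could be sketched as below. This assumes the document can be cut into independent HTML fragments (for example one per record or section), which may not hold if page numbering or CSS has to flow across the whole file. The render and merge callables are hypothetical hooks for whatever HTML-to-PDF library and PDF merger you use; nothing here is tied to a specific one.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def render_in_chunks(
    fragments: Sequence[str],
    size: int,
    render: Callable[[str], bytes],
    merge: Callable[[List[bytes]], bytes],
) -> bytes:
    """Render each batch of fragments separately, then merge the partial PDFs."""
    parts = []
    for batch in chunked(fragments, size):
        # Wrap each batch so it is a complete document on its own.
        html = "<html><body>" + "".join(batch) + "</body></html>"
        parts.append(render(html))
    return merge(parts)
```

Each call to render then handles only a small document, and the batches can also be handed to separate workers if the queue setup allows it.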





Thanks @Constantine, but it seems they target only the Windows platform, while I need it for Unix.
– Sachin Chauhan
Jun 29 at 11:20





