New book unveiled: Computer Vision with PyTorch

In a rapidly evolving technological landscape, computer vision stands out, pushing the boundaries of how machines interpret and understand the visual world.


At Fieldbox, we’re at the forefront of transforming industrial operations through the power of computer vision. Today, we’re delighted to share an exclusive interview with Yassine Alouini, an experienced data scientist and machine learning engineer with advanced expertise in computer vision applications. He shares details of his new book, “Computer Vision with PyTorch”. In this interview, we’ll look at the book’s background, the projects that have shaped his expertise, and his vision of the future of computer vision.

Can you introduce yourself?


I am an experienced data scientist and machine learning engineer with expertise in computer vision. I worked at Fieldbox for more than 4 years on various industrial projects, mainly computer vision applications. Before that, I worked for 5 years at a startup specializing in urban mobility, on various machine learning engineering topics.

Beyond my professional activities, I try to remain at the cutting edge of technology. I actively participate in Kaggle competitions to enhance my skills, and I enjoy contributing to open-source projects, which allows me to give back to the community and keep up to date with the latest advances in the field.


What is specific about computer vision in industrial settings?


Industrial computer vision demands high precision and robustness in challenging conditions, and often operates with limited datasets, leveraging transfer learning to improve outcomes. Integration with existing systems is crucial, with some applications requiring real-time processing for tasks such as defect detection, while others can afford slight delays.


Can you share examples of Fieldbox’s most challenging computer vision projects?


At Fieldbox, we have tackled a variety of complex computer vision challenges, tailored to the unique needs of our customers. Two notable projects illustrate our commitment to pushing the boundaries of what’s possible in this field, each with distinct challenges and objectives:

  • The first project involved deploying a robot with advanced depth estimation capabilities to navigate an environment too dangerous for humans. The challenge was to create a system capable of accurately estimating the 3D structure of an environment from RGB images. This required not only technical innovation in developing a monocular depth model, DPT (monocular in the sense that it uses only one image at a time), but also a practical approach to data collection, which involved capturing depth videos with specialized cameras. The stakes were high: success meant enabling safe remote inspection and operation in inaccessible locations, fundamentally changing the way our customer could conduct its operations.
  • The second project aimed to significantly improve safety by automating the detection of equipment defects. Using instance segmentation techniques, we set out not only to identify but also to classify the various objects in an image, pinpointing equipment with potential safety issues. This project involved a meticulous process of data collection and annotation, relying on pre-trained models to speed up the process, and culminated in the training of a robust Mask R-CNN model. The ultimate goal was to streamline and accelerate the detection process, reducing human error and increasing the efficiency and safety of our customer’s operations.

Both projects have produced very concrete and tangible benefits, showing how computer vision technology can effectively solve real-world problems.
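To make the second project more concrete, here is a minimal sketch of how the output of an instance segmentation model could be turned into a list of flagged equipment. It is not Fieldbox’s actual pipeline. It assumes a per-image prediction shaped like the dictionaries returned by detection models such as torchvision’s Mask R-CNN ("boxes", "labels", "scores"), written here with plain Python lists so it runs without any dependency. The class names, class ids, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical label map: these ids and names are illustrative only.
CLASS_NAMES = {1: "valve_ok", 2: "valve_corroded", 3: "pipe_ok", 4: "pipe_cracked"}
DEFECT_CLASSES = {"valve_corroded", "pipe_cracked"}


@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def flag_defects(prediction, score_threshold=0.5):
    """Keep confident detections and return those whose class is a defect.

    `prediction` mimics a per-image detection output with "boxes",
    "labels" and "scores" entries, stored as plain lists (an assumption).
    """
    flagged = []
    for box, label_id, score in zip(
        prediction["boxes"], prediction["labels"], prediction["scores"]
    ):
        if score < score_threshold:
            continue  # discard low-confidence instances
        name = CLASS_NAMES.get(label_id, "unknown")
        if name in DEFECT_CLASSES:
            flagged.append(Detection(name, score, tuple(box)))
    # Most confident defects first, so they can be reviewed in priority order.
    return sorted(flagged, key=lambda d: d.score, reverse=True)


# Toy prediction for one image (made-up values).
example = {
    "boxes": [[10, 20, 110, 220], [50, 60, 90, 120], [0, 0, 30, 30]],
    "labels": [2, 4, 1],
    "scores": [0.91, 0.35, 0.88],
}
defects = flag_defects(example)
print([d.label for d in defects])
```

With the default threshold, only the corroded valve is flagged: the cracked pipe falls below the confidence cut-off and the third object is not a defect class. In practice, the threshold would be tuned against the cost of a missed defect versus a false alarm.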


Why did you decide to write this book?


When I started working as a data scientist, it wasn’t clear to me how to build an end-to-end application. For instance, deploying a model or getting the necessary infrastructure are two subjects that are hard to master for a new data scientist. This book is a hands-on answer to most of the questions a data scientist might ask when working with machine learning applications in general and computer vision ones in particular.


How long did writing the book take, and what was the process like?


Fieldbox gave me the opportunity to split my time, which enabled me to work on writing the book alongside my other professional activities, for around 18 months. After the initial writing phase, several months were spent refining the final version, meticulously editing, and preparing the book for printing. At the end of 2023, we received the first batch of printed copies.

Writing a technical book is a difficult journey that requires a balance between originality and precision in the presentation of concepts. This project was my first experience as an author, and I discovered a great deal beyond the technical side: I improved my writing skills, learned about the subtleties of publishing, and navigated the nuances of print preparation.

For those interested in the behind-the-scenes, the book was crafted using a Python framework, Jupyter Book.

Finally, I would like to thank my Fieldbox colleagues for their valuable feedback while reviewing the different chapters, as well as my friends for their additional comments once the book was ready.


How do you see the future of computer vision?


I see two particularly interesting areas of development, each promising to have a significant impact on our technological capabilities:


  • Generative computer vision models, in particular diffusion models: this fast-evolving field offers fascinating prospects. Diffusion models are renowned for their ability to generate high-quality, realistic images. In industrial contexts, they offer a solution to the persistent challenge of data scarcity: they can generate synthetic datasets that mimic real-world conditions, standing in for data that would otherwise be difficult or costly to collect. This capability is invaluable for training machine learning models where real-world data is limited or sensitive, improving both the breadth and depth of applications without compromising privacy or running into logistical constraints.


  • Computer vision for robots: Although they have received less attention than generative models, advances in computer vision for robotics are just as transformative. The development of 3D vision models has dramatically improved a robot’s ability to accurately perceive and manipulate objects in three-dimensional space. This precision is crucial for tasks requiring delicate handling or complex maneuvers, opening up new frontiers for automation and robotics. What’s more, combining 3D vision models with few-shot object manipulation enables robots to adapt quickly to new tasks with a minimum of instruction, speeding up the learning process and enabling more flexible and efficient automation solutions.


These two areas highlight the transformative impact of computer vision technology, which promises to revolutionize approaches across the entire industry. They pave the way for safer, more efficient industrial operations and advanced robotics for complex tasks, making computer vision a key driver of future technological breakthroughs.
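For readers curious about how the diffusion models mentioned above work, the sketch below illustrates the forward (noising) half of a DDPM-style diffusion process: an image is progressively blended with Gaussian noise according to a schedule. The interview does not describe any specific formulation, so the linear schedule, its default values, and the function names are assumptions chosen for illustration. Generating images additionally requires a trained denoising network, which is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variances increasing linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in pixels]


rng = random.Random(0)  # seeded for reproducibility
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

image = [0.2, 0.5, 0.9]  # toy "image" of three pixel values
slightly_noisy = add_noise(image, 10, alpha_bars, rng)
almost_pure_noise = add_noise(image, 999, alpha_bars, rng)
print(slightly_noisy, almost_pure_noise)
```

At early steps the sample stays close to the original image, while at the last step almost nothing of the original signal remains. A generative model is trained to reverse this process step by step, which is what lets it produce new synthetic images from pure noise.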


Do you plan to write a new book?


Not in the short term, but maybe later on. The computer vision field is always evolving, and breakthroughs happen all the time.

A particularly interesting area for exploration is the synergy between computer vision and language models. The emergence of advanced multimodal models, which seamlessly integrate visual and linguistic data, is indicative of this potential. LLaVA, a model designed to answer image-based questions, is an excellent example of this type of innovation. It illustrates how the combination of visual understanding and natural language processing can create systems capable of interpreting images and answering related questions with unprecedented depth and nuance. As these technologies evolve, we can expect further advances that will bridge the gap between the way machines see the world and the way they understand and communicate about it, opening up new avenues for applications in a variety of fields.



Yassine’s new book on computer vision will be available online soon. We’re excited to share a few copies with our computer vision partners. If you’re working on an industrial project and need some computer vision expertise, feel free to get in touch. We’re here to help and look forward to working together to bring your project to life.

Article contributors
Yassine Alouini, Karine Marini